# Equation, p-value, \(R^2\), AIC and BIC of fitted polynomial

Source:`R/stat-poly-eq.R`

`stat_poly_eq.Rd`

`stat_poly_eq`

fits a polynomial by default with `stats::lm()`

but
alternatively using robust regression. From the fitted model it
generates several labels including the equation, p-value, F-value,
coefficient of determination (R^2), 'AIC', 'BIC', and number of observations.

## Usage

```
stat_poly_eq(
mapping = NULL,
data = NULL,
geom = "text_npc",
position = "identity",
...,
formula = NULL,
method = "lm",
method.args = list(),
n.min = 2L,
eq.with.lhs = TRUE,
eq.x.rhs = NULL,
small.r = FALSE,
small.p = FALSE,
CI.brackets = c("[", "]"),
rsquared.conf.level = 0.95,
coef.digits = 3,
coef.keep.zeros = TRUE,
rr.digits = 2,
f.digits = 3,
p.digits = 3,
label.x = "left",
label.y = "top",
label.x.npc = NULL,
label.y.npc = NULL,
hstep = 0,
vstep = NULL,
output.type = NULL,
na.rm = FALSE,
orientation = NA,
parse = NULL,
show.legend = FALSE,
inherit.aes = TRUE
)
```

## Arguments

- mapping
The aesthetic mapping, usually constructed with

`aes`

. Only needs to be set at the layer level if you are overriding the plot defaults.- data
A layer specific dataset, only needed if you want to override the plot defaults.

- geom
The geometric object to use display the data

- position
The position adjustment to use for overlapping points on this layer

- ...
other arguments passed on to

`layer`

. This can include aesthetics whose values you want to set, not map. See`layer`

for more details.- formula
a formula object. Using aesthetic names

`x`

and`y`

instead of original variable names.- method
function or character If character, "lm", "rlm" or the name of a model fit function are accepted, possibly followed by the fit function's

`method`

argument separated by a colon (e.g.`"rlm:M"`

). If a function different to`lm()`

, it must accept as a minimum a model formula through its first parameter, and have formal parameters named`data`

,`weights`

, and`method`

, and return a model fit object of class`lm`

.- method.args
named list with additional arguments.

- n.min
integer Minimum number of distinct values in the explanatory variable (on the rhs of formula) for fitting to the attempted.

- eq.with.lhs
If

`character`

the string is pasted to the front of the equation label before parsing or a`logical`

(see note).- eq.x.rhs
`character`

this string will be used as replacement for`"x"`

in the model equation when generating the label before parsing it.- small.r, small.p
logical Flags to switch use of lower case r and p for coefficient of determination and p-value.

- CI.brackets
character vector of length 2. The opening and closing brackets used for the CI label.

- rsquared.conf.level
numeric Confidence level for the returned confidence interval. Set to NA to skip CI computation.

- coef.digits, f.digits
integer Number of significant digits to use for the fitted coefficients and F-value.

- coef.keep.zeros
logical Keep or drop trailing zeros when formatting the fitted coefficients and F-value.

- rr.digits, p.digits
integer Number of digits after the decimal point to use for \(R^2\) and P-value in labels.

- label.x, label.y
`numeric`

with range 0..1 "normalized parent coordinates" (npc units) or character if using`geom_text_npc()`

or`geom_label_npc()`

. If using`geom_text()`

or`geom_label()`

numeric in native data units. If too short they will be recycled.- label.x.npc, label.y.npc
`numeric`

with range 0..1 (npc units) DEPRECATED, use label.x and label.y instead; together with a geom using npcx and npcy aesthetics.- hstep, vstep
numeric in npc units, the horizontal and vertical step used between labels for different groups.

- output.type
character One of "expression", "LaTeX", "text", "markdown" or "numeric".

- na.rm
a logical indicating whether NA values should be stripped before the computation proceeds.

- orientation
character Either "x" or "y" controlling the default for

`formula`

.- parse
logical Passed to the geom. If

`TRUE`

, the labels will be parsed into expressions and displayed as described in`?plotmath`

. Default is`TRUE`

if`output.type = "expression"`

and`FALSE`

otherwise.- show.legend
logical. Should this layer be included in the legends?

`NA`

, the default, includes if any aesthetics are mapped.`FALSE`

never includes, and`TRUE`

always includes.- inherit.aes
If

`FALSE`

, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g.`borders`

.

## Value

A data frame, with a single row and columns as described under

**Computed variables**. In cases when the number of observations is
less than `n.min`

a data frame with no rows or columns is returned
rendered as an empty/invisible plot layer.

## Details

This statistic can be used to automatically annotate a plot with
\(R^2\), adjusted \(R^2\) or the fitted model equation. It supports
linear regression, robust linear regression and median regression fitted
with functions `lm`

, or `rlm`

. The \(R^2\)
and adjusted \(R^2\) annotations can be used with any linear model
formula. The confidence interval for \(R^2\) is computed with function
`ci_rsquared`

from package 'confintr'. The fitted
equation label is correctly generated for polynomials or quasi-polynomials
through the origin. Model formulas can use `poly()`

or be defined
algebraically with terms of powers of increasing magnitude with no missing
intermediate terms, except possibly for the intercept indicated by "- 1" or
"-1" or `"+ 0"`

in the formula. The validity of the `formula`

is
not checked in the current implementation, and for this reason the default
aesthetics sets \(R^2\) as label for the annotation. This statistic
generates labels as R expressions by default but LaTeX (use TikZ device),
markdown (use package 'ggtext') and plain text are also supported, as well
as numeric values for user-generated text labels. The value of `parse`

is set automatically based on `output-type`

, but if you assemble
labels that need parsing from `numeric`

output, the default needs to
be overridden. This stat only generates annotation labels, the predicted
values/line need to be added to the plot as a separate layer using
`stat_poly_line`

(or `stat_smooth`

), if
the default formula is overriden with an argument, it is crucial to make
sure that the same model formula is used in all layers. In this case it is
best to save the formula as an object and supply this object as argument to
the different statistics.

A ggplot statistic receives as `data`

a data frame that is not the one
passed as argument by the user, but instead a data frame with the variables
mapped to aesthetics. `stat_poly_eq()`

mimics how `stat_smooth()`

works, except that only polynomials can be fitted. Similarly to these
statistics the model fits respect grouping, so the scales used for `x`

and `y`

should both be continuous scales rather than discrete.

With method `"lm"`

, singularity results in terms being dropped with a
message if more numerous than can be fitted with a singular (exact) fit.
In this case or if the model results in a perfect fit due to a low
number of observations, estimates for various parameters are `NaN`

or
`NA`

. When this is the case the corresponding labels are set to
`character(0L)`

and thus not visble in the plot.

With methods other than `"lm"`

, the model fit functions simply fail
in case of singularity, e.g., singular fits are not implemented in
`"rlm"`

.

In both cases the minimum number of observations with distinct values in
the explanatory variable can be set through parameter `n.min`

. The
default `n.min = 2L`

is the smallest suitable for method `"lm"`

but too small for method `"rlm"`

for which `n.min = 3L`

is
needed. Anyway, model fits with very few observations are of little
interest and using larger values of `n.min`

than the default is
usually wise.

## Note

For backward compatibility a logical is accepted as argument for
`eq.with.lhs`

. If `TRUE`

, the default is used, either
`"x"`

or `"y"`

, depending on the argument passed to `formula`

.
However, `"x"`

or `"y"`

can be substituted by providing a
suitable replacement character string through `eq.x.rhs`

.
Parameter `orientation`

is redundant as it only affects the default
for `formula`

but is included for consistency with
`ggplot2::stat_smooth()`

.

R option `OutDec`

is obeyed based on its value at the time the plot
is rendered, i.e., displayed or printed. Set `options(OutDec = ",")`

for languages like Spanish or French.

## Aesthetics

`stat_poly_eq()`

understands `x`

and `y`

,
to be referenced in the `formula`

and `weight`

passed as argument
to parameter `weights`

. All three must be mapped to `numeric`

variables. In addition, the aesthetics understood by the geom
(`"text"`

is the default) are understood and grouping respected.

*If the model formula includes a transformation of x, a
matching argument should be passed to parameter eq.x.rhs
as its default value "x" will not reflect the applied
transformation. In plots, transformation should never be applied to the
left hand side of the model formula, but instead in the mapping of the
variable within aes, as otherwise plotted observations and fitted
curve will not match. In this case it may be necessary to also pass
a matching argument to parameter eq.with.lhs.*

## Computed variables

If output.type different from `"numeric"`

the returned tibble contains
columns listed below. If the model fit function used does not return a value,
the label is set to `character(0L)`

.

- x,npcx
x position

- y,npcy
y position

- eq.label
equation for the fitted polynomial as a character string to be parsed

- rr.label
\(R^2\) of the fitted model as a character string to be parsed

- adj.rr.label
Adjusted \(R^2\) of the fitted model as a character string to be parsed

- rr.confint.label
Confidence interval for \(R^2\) of the fitted model as a character string to be parsed

- f.value.label
F value and degrees of freedom for the fitted model as a whole.

- p.value.label
P-value for the F-value above.

- AIC.label
AIC for the fitted model.

- BIC.label
BIC for the fitted model.

- n.label
Number of observations used in the fit.

- grp.label
Set according to mapping in

`aes`

.- method.label
Set according

`method`

used.- r.squared, adj.r.squared, p.value, n
numeric values, from the model fit object

If output.type is `"numeric"`

the returned tibble contains columns
listed below. If the model fit function used does not return a value,
the variable is set to `NA_real_`

.

- x,npcx
x position

- y,npcy
y position

- coef.ls
list containing the "coefficients" matrix from the summary of the fit object

- r.squared, rr.confint.level, rr.confint.low, rr.confint.high, adj.r.squared, f.value, f.df1, f.df2, p.value, AIC, BIC, n
numeric values, from the model fit object

- grp.label
Set according to mapping in

`aes`

.- b_0.constant
TRUE is polynomial is forced through the origin

- b_i
One or columns with the coefficient estimates

To explore the computed values returned for a given input we suggest the use
of `geom_debug`

as shown in the last examples below.

## Alternatives

`stat_regline_equation()`

in package 'ggpubr' is
a renamed but almost unchanged copy of `stat_poly_eq()`

taken from an
old version of this package (without acknowledgement of source and
authorship). `stat_regline_equation()`

lacks important functionality
and contains bugs that have been fixed in `stat_poly_eq()`

.

## References

Originally written as an answer to question 7549694 at Stackoverflow but enhanced based on suggestions from users and my own needs.

## See also

This statistics fits a model with function `lm`

,
function `rlm`

or a user supplied function returning an
object of class `"lm"`

. Consult the documentation of these functions
for the details and additional arguments that can be passed to them by name
through parameter `method.args`

.

This `stat_poly_eq`

statistic can return ready formatted labels
depending on the argument passed to `output.type`

. This is possible
because only polynomial and quasy-polynomial models are supported. For
quantile regression `stat_quant_eq`

should be used instead of
`stat_poly_eq`

while for model II or major axis regression
`stat_ma_eq`

should be used. For other types of models such as
non-linear models, statistics `stat_fit_glance`

and
`stat_fit_tidy`

should be used and the code for construction of
character strings from numeric values and their mapping to aesthetic
`label`

needs to be explicitly supplied by the user.

Other ggplot statistics for linear and polynomial regression:
`stat_poly_line()`

## Examples

```
# generate artificial data
set.seed(4321)
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
y <- y / max(y)
my.data <- data.frame(x = x, y = y,
group = c("A", "B"),
y2 = y * c(1, 2) + c(0, 0.1),
w = sqrt(x))
# give a name to a formula
formula <- y ~ poly(x, 3, raw = TRUE)
# using defaults
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line() +
stat_poly_eq()
# no weights
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula)
# other labels
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(use_label("eq"), formula = formula)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(use_label(c("eq", "R2")), formula = formula)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(use_label(c("R2", "R2.CI", "P", "method")), formula = formula)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(use_label(c("R2", "F", "P", "n"), sep = "*\"; \"*"),
formula = formula)
# grouping
ggplot(my.data, aes(x, y2, color = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula)
# rotation
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, angle = 90)
# label location
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, label.y = "bottom", label.x = "right")
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, label.y = 0.1, label.x = 0.9)
# modifying the explanatory variable within the model formula
# modifying the response variable within aes()
formula.trans <- y ~ I(x^2)
ggplot(my.data, aes(x, y + 1)) +
geom_point() +
stat_poly_line(formula = formula.trans) +
stat_poly_eq(use_label("eq"),
formula = formula.trans,
eq.x.rhs = "~x^2",
eq.with.lhs = "y + 1~~`=`~~")
# using weights
ggplot(my.data, aes(x, y, weight = w)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula)
# no weights, 4 digits for R square
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, rr.digits = 4)
# manually assemble and map a specific label using paste() and aes()
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = paste(after_stat(rr.label),
after_stat(n.label), sep = "*\", \"*")),
formula = formula)
# manually assemble and map a specific label using sprintf() and aes()
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = sprintf("%s*\" with \"*%s*\" and \"*%s",
after_stat(rr.label),
after_stat(f.value.label),
after_stat(p.value.label))),
formula = formula)
# x on y regression
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula, orientation = "y") +
stat_poly_eq(use_label(c("eq", "adj.R2")),
formula = x ~ poly(y, 3, raw = TRUE))
# conditional user specified label
ggplot(my.data, aes(x, y2, color = group)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = ifelse(after_stat(adj.r.squared) > 0.96,
paste(after_stat(adj.rr.label),
after_stat(eq.label),
sep = "*\", \"*"),
after_stat(adj.rr.label))),
rr.digits = 3,
formula = formula)
# geom = "text"
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(geom = "text", label.x = 100, label.y = 0, hjust = 1,
formula = formula)
# using numeric values
# Here we use columns b_0 ... b_3 for the coefficient estimates
my.format <-
"b[0]~`=`~%.3g*\", \"*b[1]~`=`~%.3g*\", \"*b[2]~`=`~%.3g*\", \"*b[3]~`=`~%.3g"
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula,
output.type = "numeric",
parse = TRUE,
mapping =
aes(label = sprintf(my.format,
after_stat(b_0), after_stat(b_1),
after_stat(b_2), after_stat(b_3))))
# Inspecting the returned data using geom_debug()
# This provides a quick way of finding out the names of the variables that
# are available for mapping to aesthetics with after_stat().
gginnards.installed <- requireNamespace("gginnards", quietly = TRUE)
if (gginnards.installed)
library(gginnards)
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug")
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> npcx npcy label
#> 1 NA NA italic(R)^2~`=`~"0.96"
#> eq.label
#> 1 italic(y)~`=`~-0.00450 + 0.00109*~italic(x) - 2.14%*% 10^{-05}*~italic(x)^2 + 1.06%*% 10^{-06}*~italic(x)^3
#> rr.label adj.rr.label rr.confint.label
#> 1 italic(R)^2~`=`~"0.96" italic(R)[adj]^2~`=`~"0.96" "95% CI [0.95, 0.97]"
#> AIC.label BIC.label f.value.label
#> 1 AIC~`=`~"-291.9" BIC~`=`~"-278.8" italic(F)[3*","*96]~`=`~"810."
#> p.value.label n.label grp.label method.label r.squared
#> 1 italic(P)~`<`~"0.001" italic(n)~`=`~"100" -1 "method: lm:qr" 0.9620171
#> adj.r.squared p.value n fm.method fm.class fm.formula
#> 1 0.9608301 5.110349e-68 100 lm:qr lm y ~ poly(x, 3, raw = TRUE)
#> fm.formula.chr x y PANEL group
#> 1 y ~ poly(x, 3, raw = TRUE) 1 1 1 -1
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug", output.type = "numeric")
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> npcx npcy label
#> 1 NA NA
#> coef.ls
#> 1 -4.495631e-03, 1.089644e-03, -2.139937e-05, 1.055387e-06, 2.268244e-02, 1.935252e-03, 4.440449e-05, 2.890821e-07, -1.981987e-01, 5.630500e-01, -4.819189e-01, 3.650820e+00, 8.433087e-01, 5.747135e-01, 6.309604e-01, 4.255076e-04
#> coefs r.squared
#> 1 -4.495631e-03, 1.089644e-03, -2.139937e-05, 1.055387e-06 0.9620171
#> rr.confint.level rr.confint.low rr.confint.high adj.r.squared f.value f.df1
#> 1 0.95 0.9464423 0.9696371 0.9608301 810.484 3
#> f.df2 p.value AIC BIC n rr.label b_0.constant b_0
#> 1 96 5.110349e-68 -291.8639 -278.838 100 FALSE -0.004495631
#> b_1 b_2 b_3 fm.method fm.class
#> 1 0.001089644 -2.139937e-05 1.055387e-06 lm:qr lm
#> fm.formula fm.formula.chr x y PANEL group
#> 1 y ~ poly(x, 3, raw = TRUE) y ~ poly(x, 3, raw = TRUE) 1 1 1 -1
# names of the variables
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug",
summary.fun = colnames)
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> [1] "npcx" "npcy" "label" "eq.label"
#> [5] "rr.label" "adj.rr.label" "rr.confint.label" "AIC.label"
#> [9] "BIC.label" "f.value.label" "p.value.label" "n.label"
#> [13] "grp.label" "method.label" "r.squared" "adj.r.squared"
#> [17] "p.value" "n" "fm.method" "fm.class"
#> [21] "fm.formula" "fm.formula.chr" "x" "y"
#> [25] "PANEL" "group"
# only data$eq.label
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug",
output.type = "expression",
summary.fun = function(x) {x[["eq.label"]]})
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> [1] "italic(y)~`=`~-0.00450 + 0.00109*~italic(x) - 2.14%*% 10^{-05}*~italic(x)^2 + 1.06%*% 10^{-06}*~italic(x)^3"
# only data$eq.label
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(aes(label = after_stat(eq.label)),
formula = formula, geom = "debug",
output.type = "markdown",
summary.fun = function(x) {x[["eq.label"]]})
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> [1] "_y_ = -0.00450+0.00109 _x_-2.14×10<sup>-5</sup> _x_<sup>2</sup>+1.06×10<sup>-6</sup> _x_<sup>3</sup>"
# only data$eq.label
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug",
output.type = "latex",
summary.fun = function(x) {x[["eq.label"]]})
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> [1] "y = -0.00450 + 0.00109 x - 2.14\\times{} 10^{-05} x^2 + 1.06\\times{} 10^{-06} x^3"
# only data$eq.label
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug",
output.type = "text",
summary.fun = function(x) {x[["eq.label"]]})
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> [1] "y = -0.00450 + 0.00109 x - 2.14 10^-05 x^2 + 1.06 10^-06 x^3"
# show the content of a list column
if (gginnards.installed)
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(formula = formula) +
stat_poly_eq(formula = formula, geom = "debug", output.type = "numeric",
summary.fun = function(x) {x[["coef.ls"]][[1]]})
#> [1] "Summary of input 'data' to 'draw_panel()':"
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -4.495631e-03 2.268244e-02 -0.1981987 0.8433087105
#> poly(x, 3, raw = TRUE)1 1.089644e-03 1.935252e-03 0.5630500 0.5747134900
#> poly(x, 3, raw = TRUE)2 -2.139937e-05 4.440449e-05 -0.4819189 0.6309603945
#> poly(x, 3, raw = TRUE)3 1.055387e-06 2.890821e-07 3.6508204 0.0004255076
```