
Custom Polynomial Models
‘ggpmisc’ 0.6.2.9001
Pedro J. Aphalo
2025-08-25
Source: vignettes/articles/poly-advanced.Rmd
Aims of ‘ggpmisc’ and caveats
Package ‘ggpmisc’ makes it easier to add annotations based on fitted models and other statistics to plots created using ‘ggplot2’. It does this by wrapping existing model fit and other functions. The same annotations can be produced by calling the model fit functions, extracting the desired estimates and adding them to plots. There are two advantages in wrapping these functions in an extension to package ‘ggplot2’: 1) we ensure the coupling of graphical elements and annotations by building all elements of the plot from the same data with a consistent grammar, and 2) we make it easier for casual users of R, already familiar with the grammar of graphics, to annotate plots.
To avoid confusion it is good to make clear what may seem obvious to some: if no plot is needed, then there is no reason to use this package. The values shown as annotations are not computed by ‘ggpmisc’ but instead by the usual model-fit and statistical functions from R and R packages. The same is true for model predictions, residuals, etc. that some of the functions in ‘ggpmisc’ display as lines, segments, or other graphical elements.
It is also important to remember that in most cases data analysis, including exploratory and other stages, should take place before annotated plots for publication are produced. Even though data analysis can benefit from combined numerical and graphical representation of the results, the use I envision for ‘ggpmisc’ is mainly the production of plots for publication or communication. In either case, whether used for analysis or communication, it is crucial that users cite and refer to both ‘ggpmisc’ and the underlying R and R packages when publishing plots created with functions and methods from ‘ggpmisc’.
Preliminaries
Other packages are loaded in the sections where they are used.
Attaching package ‘ggpmisc’ also attaches package ‘ggpp’ as it provides several of the geometries used by default in the statistics described below. Package ‘ggpp’ can be loaded and attached on its own, and has separate documentation.
As we will use text and labels on the plotting area we change the default theme to an uncluttered one.
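A sketch of the setup assumed by the examples below (the article's own setup chunk is not shown here; theme_bw() stands in for whichever uncluttered theme was used):
library(ggpmisc)       # also attaches 'ggpp' and 'ggplot2'
theme_set(theme_bw())  # one possible uncluttered default theme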
Fitted models in stat_poly_line() and stat_poly_eq()
Both functions are designed for flexibility while still attempting to keep their implementation and use rather straightforward. Function stat_poly_eq() requires that the model formula is a regular polynomial (with no missing intermediate power terms), with the exception that the intercept can be missing. Function stat_poly_line() has fewer requirements for the argument passed to formula.
Both statistics:

* Accept any R function or its name as argument for method, as long as the fitted model object belongs to class "lm", "rlm", "lqs", "lts" or "gls", as tested with inherits() (see the short illustration below), and additionally models of other classes as long as suitable methods are available.
* Tolerate some missing methods and fields in model fit objects and their summaries, returning NA_real_ or empty character strings depending on the case.
* Construct the equation from the formula extracted, whenever possible, from the fitted model object. Only exceptionally, when this approach fails, is the formula passed as argument for formula in the call used, with a warning.
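As a quick illustration of the class test (a sketch, not part of the original article), a fit with MASS::rlm() returns an object whose class vector includes "lm", so inherits() reports it as supported:
library(MASS)
fm <- rlm(hwy ~ displ, data = mpg)
class(fm)           # "rlm" "lm"
inherits(fm, "lm")  # TRUE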
This design makes it possible to use as method existing model fit functions that return an object of one of the classes specifically supported, wrapper functions that call such functions, or functions that in some other way return an object of a known class such as "lm", or of a class with suitable methods implemented.
To showcase this, I explore the use of contributed and user-defined model fit functions through examples. Several of the examples use facets or grouping to demonstrate how the fitted models depend on the data. The approach can also be useful in the absence of facets and/or groups, for example for dynamic reports or dashboards.
Two frequently used model fit functions that return model fit objects requiring special handling to extract values are supported in ‘ggpmisc’ by specific statistics: stat_ma_line() and stat_ma_eq() for major axis regression with package ‘lmodel2’, and stat_quant_line(), stat_quant_band() and stat_quant_equation() for quantile regression with package ‘quantreg’. Their use is documented in the User Guide and in their help pages.
Existing functions
We consider cases in which fitting linear models with ordinary least squares (OLS) as criterion is unsuitable.
In the case of outliers, many methods are available, and each fit function usually also supports several variations within a class of methods. In robust regression the weight given to extreme observations in the computation of the penalty function is decreased compared to OLS. The effect of this decrease can be described as weights ranging continuously between 0 and 1, either actually used in the fit algorithm or computed a posteriori by comparison to OLS. In resistant regression, some observations are ignored altogether, which is equivalent to discrete weights taking one of two values, 0 or 1. Even though these approaches work automatically, just as when discarding observations manually, it is important to investigate the origin of the outliers.
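As a concrete illustration (a sketch, not part of the article), the continuous robustness weights of a robust fit with MASS::rlm() are stored in component w of the fit and can be inspected directly:
library(MASS)
fm <- rlm(hwy ~ displ, data = mpg)
range(fm$w)        # final IWLS weights, between 0 and 1
which(fm$w < 0.5)  # strongly down-weighted observations, candidate outliers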
Generalised least squares (GLS) is an extension to OLS that makes it possible to describe how the variance of the response y varies. GLS supports description of heterogeneous variance by means of variance covariates, and correlation structures.
Errors in variables (EIV) methods remove the assumption that the values of the covariate x are measured/known without error. It is important here to stress that linear models (LM) do not require us to assume that there is no variation in x, only that we have observed/measured/known these values without error. To apply most of these methods, we need independently obtained information about the size of the measurement errors affecting the covariates.
Not exemplified here, major axis regression (MA) is useful when both x and y values are affected to comparable extents by measurement errors. This is typically the case when there is no causal direction between x and y, such as in allometric relationships between body parts of the same individual. See stat_ma_line() and stat_ma_eq().
Quantile regression can be thought of as similar to both a regression and a box plot. It makes it possible to fit curves that divide the observations in 2D space into quantiles, approximately preserving this split along a fitted curve of y on x. See stat_quant_line(), stat_quant_band() and stat_quant_eq().
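A minimal sketch (not part of the original article, and assuming package ‘quantreg’ is installed): a band between the 0.05 and 0.95 quantiles around the median regression, with the fitted median equation as annotation.
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  stat_quant_band(formula = y ~ x, quantiles = c(0.05, 0.5, 0.95)) +
  stat_quant_eq(formula = y ~ x, quantiles = 0.5)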
As all these methods use different penalty functions when fitting a model to observations, they yield different estimates of the parameter values for the same model formula. For several of these methods the approximate solution is obtained by numerical iteration. In these cases, it is important to check whether the solution is stable across different starting points for random number generation.
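Such a stability check can be as simple as the sketch below (not part of the original article): refit a resistant model, which relies on random resampling, with different RNG seeds and compare the coefficient estimates.
library(MASS)
set.seed(1)
fm1 <- lqs(hwy ~ displ, data = mpg)
set.seed(2)
fm2 <- lqs(hwy ~ displ, data = mpg)
rbind(seed_1 = coef(fm1), seed_2 = coef(fm2))  # similar rows suggest a stable solution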
Robust regression (‘MASS’)
print(citation(package = "MASS", auto = TRUE), bibtex = FALSE)
#> To cite package 'MASS' in publications use:
#>
#> Ripley B, Venables B (2025). _MASS: Support Functions and Datasets
#> for Venables and Ripley's MASS_. doi:10.32614/CRAN.package.MASS
#> <https://doi.org/10.32614/CRAN.package.MASS>, R package version
#> 7.3-65, <https://CRAN.R-project.org/package=MASS>.
Package ‘MASS’ provides function rlm() for “Robust Fitting of Linear Models”. The returned value is an "rlm" object. Class "rlm" is derived from "lm", so the stats work mostly as expected. However, we need to be aware that no R^2 or P estimates are available and the corresponding labels are empty strings.
library(MASS)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = MASS::rlm) +
stat_poly_eq(method = MASS::rlm, mapping = use_label("eq", "n"),
label.x = "right") +
facet_wrap(~class, ncol = 2)
When weights are available, either supplied by the user or computed as part of the fit, they are returned in data. Having the weights available allows encoding them using colour. We here use a robust regression fit with MASS::rlm().
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "n", "method"),
label.x = "right",
method = MASS::rlm) +
stat_poly_line(method = MASS::rlm, colour = "black") +
stat_fit_deviations(method = "rlm",
mapping = aes(colour = after_stat(robustness.weights)),
show.legend = TRUE,
geom = "point") +
scale_color_gradient(low = "red", high = "blue", limits = c(0, 1)) +
theme(legend.position = "top")
my.formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "n", "method"),
formula = my.formula,
label.x = "right",
method = MASS::rlm) +
stat_poly_line(method = MASS::rlm,
formula = my.formula,
colour = "black") +
stat_fit_deviations(mapping = aes(colour = after_stat(robustness.weights)),
method = "rlm",
formula = my.formula,
show.legend = TRUE,
geom = "point") +
scale_color_gradient(low = "red", high = "blue", limits = c(0, 1)) +
theme(legend.position = "top")
Robust regression (‘robustbase’)
print(citation(package = "robustbase", auto = TRUE), bibtex = FALSE)
#> To cite package 'robustbase' in publications use:
#>
#> Maechler M, Todorov V, Ruckstuhl A, Salibian-Barrera M, Koller M,
#> Conceicao EL (2024). _robustbase: Basic Robust Statistics_.
#> doi:10.32614/CRAN.package.robustbase
#> <https://doi.org/10.32614/CRAN.package.robustbase>, R package version
#> 0.99-4-1, <https://CRAN.R-project.org/package=robustbase>.
Package ‘robustbase’ provides function lmrob() for “MM-type Estimators for Linear Regression”. The returned value is an "lmrob" object. Class "lmrob" is derived from "lm", so the stats work mostly as expected. However, we need to be aware that not all estimates available for "lm" fits are computed, and when an estimate is missing the corresponding label is an empty character string.
library(robustbase)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = robustbase::lmrob) +
stat_poly_eq(method = robustbase::lmrob, mapping = use_label("eq", "R2"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
When weights are available, either supplied by the user or computed as part of the fit, they are returned in data. Having the weights available allows encoding them using colour. We here use a robust regression fit with robustbase::lmrob().
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "R2", "method"),
label.x = "right",
method = robustbase::lmrob) +
stat_poly_line(method = robustbase::lmrob, colour = "black") +
stat_fit_deviations(method = robustbase::lmrob,
mapping = aes(colour = after_stat(robustness.weights)),
show.legend = TRUE,
geom = "point") +
scale_color_gradient(name = "Robustness weights",
low = "red", high = "blue", limits = c(0, 1)) +
theme(legend.position = "top")
my.formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "R2", "method"),
formula = my.formula,
label.x = "right",
method = robustbase::lmrob) +
stat_poly_line(method = robustbase::lmrob,
formula = my.formula,
colour = "black") +
stat_fit_deviations(method = robustbase::lmrob,
mapping = aes(colour = after_stat(robustness.weights)),
formula = my.formula,
show.legend = TRUE,
geom = "point") +
scale_color_gradient(name = "Robustness weights",
low = "red", high = "blue", limits = c(0, 1)) +
theme(legend.position = "top")
Resistant regression (‘MASS’)
print(citation(package = "MASS", auto = TRUE), bibtex = FALSE)
#> To cite package 'MASS' in publications use:
#>
#> Ripley B, Venables B (2025). _MASS: Support Functions and Datasets
#> for Venables and Ripley's MASS_. doi:10.32614/CRAN.package.MASS
#> <https://doi.org/10.32614/CRAN.package.MASS>, R package version
#> 7.3-65, <https://CRAN.R-project.org/package=MASS>.
Package ‘MASS’ provides function lqs() for “Resistant Regression”. The returned value is an "lqs" object. Class "lqs" is not derived from "lm", but the structure is similar. However, we need to be aware that no R^2 or P estimates are available and the corresponding labels are empty strings.
library(MASS)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = MASS::lqs, se = FALSE) +
stat_poly_eq(method = MASS::lqs, mapping = use_label("eq", "n"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
Weights are currently not available for MASS::lqs() and NA vectors are returned instead. To highlight the LTS weights, robustbase::ltsReg() can be used as shown in the next section.
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "n", "method"),
label.x = "right",
method = MASS::lqs) +
stat_poly_line(method = "lqs:lms", colour = "black") +
stat_fit_deviations(method = MASS::lqs,
mapping =
aes(colour = factor(after_stat(robustness.weights))),
show.legend = TRUE,
geom = "point") +
scale_color_manual(name = "LTS weights",
values = c("0" = "red", "1" = "black")) +
theme(legend.position = "top")
my.formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "n", "method"),
formula = my.formula,
label.x = "right",
method = MASS::lqs) +
stat_poly_line(method = "lqs:lms",
formula = my.formula,
colour = "black") +
stat_fit_deviations(mapping =
aes(colour = factor(after_stat(robustness.weights))),
method = MASS::lqs,
formula = my.formula,
show.legend = TRUE,
geom = "point") +
scale_color_manual(name = "LTS weights",
values = c("0" = "red", "1" = "black")) +
theme(legend.position = "top")
Resistant regression (‘robustbase’)
print(citation(package = "robustbase", auto = TRUE), bibtex = FALSE)
#> To cite package 'robustbase' in publications use:
#>
#> Maechler M, Todorov V, Ruckstuhl A, Salibian-Barrera M, Koller M,
#> Conceicao EL (2024). _robustbase: Basic Robust Statistics_.
#> doi:10.32614/CRAN.package.robustbase
#> <https://doi.org/10.32614/CRAN.package.robustbase>, R package version
#> 0.99-4-1, <https://CRAN.R-project.org/package=robustbase>.
Package ‘robustbase’ provides function ltsReg() for “least trimmed squares” regression. The returned value is an "lts" object. The usual accessor functions are available for class "lts".
library(robustbase)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = robustbase::ltsReg, se = FALSE) +
stat_poly_eq(method = robustbase::ltsReg, mapping = use_label("eq", "n"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
#> Fitted line plotted
#> Fitted line plotted
#> Fitted line plotted
#> Fitted line plotted
#> Fitted line plotted
#> Fitted line plotted
#> Fitted line plotted
If no predict() method is found for a model fit object, stat_poly_line() searches for a fitted() method, which is nearly always available. This is how the "lts" objects returned by robustbase::ltsReg() are supported by default. For a linear regression, unless extrapolation is attempted, this makes little, if any, difference to the line, but no confidence band is displayed. With higher degree polynomials and sparse data this approach can result in a curve that is not smooth. It is possible for users to define the missing specialization of the predict() method. Predicting the fitted line using coefficients() and the known formula is straightforward, while estimating a prediction confidence band can be more difficult.
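As an illustration of what such a user-defined method could look like, the sketch below (an assumption, not code from ‘ggpmisc’ or ‘robustbase’, and not used in the examples that follow) predicts from the coefficients of an "lts" fit to a raw polynomial in a single explanatory variable named x, as used by stat_poly_line(); no confidence band is computed, so it is only meaningful with se = FALSE.
predict.lts <- function(object, newdata, ...) {
  cf <- coefficients(object)                     # intercept, x, x^2, ... (raw polynomial)
  X <- outer(newdata$x, seq_along(cf) - 1, `^`)  # design matrix with columns 1, x, x^2, ...
  drop(X %*% cf)
}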
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "R2", "method"),
label.x = "right",
method = robustbase::ltsReg) +
stat_poly_line(method = robustbase::ltsReg, colour = "black") +
stat_fit_deviations(mapping = aes(colour = factor(after_stat(robustness.weights))),
geom = "point",
method = robustbase::ltsReg,
show.legend = TRUE) +
scale_color_manual(name = "LTS weights",
values = c("0" = "red", "1" = "blue")) +
theme(legend.position = "top")
#> Fitted line plotted
my.formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(mpg, aes(displ, hwy)) +
stat_poly_eq(use_label("eq", "R2", "method"),
label.x = "right",
formula = my.formula,
method = robustbase::ltsReg) +
stat_poly_line(method = robustbase::ltsReg,
formula = my.formula,
colour = "black") +
stat_fit_deviations(mapping = aes(colour = factor(after_stat(robustness.weights))),
formula = my.formula,
geom = "point",
method = robustbase::ltsReg,
show.legend = TRUE) +
scale_color_manual(name = "LTS weights",
values = c("0" = "red", "1" = "blue")) +
theme(legend.position = "top")
#> Fitted line plotted
Generalized Least Squares (‘nlme’)
print(citation(package = "nlme", auto = TRUE), bibtex = FALSE)
#> To cite package 'nlme' in publications use:
#>
#> Pinheiro J, Bates D, R Core Team (2025). _nlme: Linear and Nonlinear
#> Mixed Effects Models_. doi:10.32614/CRAN.package.nlme
#> <https://doi.org/10.32614/CRAN.package.nlme>, R package version
#> 3.1-168, <https://CRAN.R-project.org/package=nlme>.
In ‘ggpmisc’ (>= 0.6.2) function gls() from package ‘nlme’ is supported natively by functions stat_poly_line() and stat_poly_eq(). In this example, using a variance covariate, the computations do not converge with the data in one of the plot panels.
library(nlme)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = gls,
method.args = list(weights = varPower()),
se = FALSE) +
stat_poly_eq(method = gls,
method.args = list(weights = varPower()),
mapping = use_label("eq"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
#> Warning: Computation failed in `stat_poly_line()`.
#> Caused by error:
#> ! maximum number of iterations reached without convergence
#> Warning: Computation failed in `stat_poly_eq()`.
#> Caused by error:
#> ! maximum number of iterations reached without convergence
Linear Splines (‘lspline’)
print(citation(package = "lspline", auto = TRUE), bibtex = FALSE)
#> To cite package 'lspline' in publications use:
#>
#> Bojanowski M (2017). _lspline: Linear Splines with Convenient
#> Parametrisations_. doi:10.32614/CRAN.package.lspline
#> <https://doi.org/10.32614/CRAN.package.lspline>, R package version
#> 1.0-0, <https://CRAN.R-project.org/package=lspline>.
Package ‘lspline’ provides functions that can be used to fit linear splines with function lm(). These functions are called from the model formula, and do not fit the change points where the knots are located. These functions, lspline(), qlspline() and elspline(), can be used directly in the model formula passed to stat_poly_line(). They can also be used with stat_poly_eq(), but the equation label returned is incorrect, with the terms for the different segments displayed as the terms for the powers of a polynomial. Thus, it is necessary to construct the equations for the different segments within the call to aes(). The other labels are correctly generated.
library(lspline) # attachment needed for methods to be available
cp <- 4.1 # a guess
my.formula <- y ~ lspline(x, knots = cp)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_eq(aes(label =
paste("\"L : \"*y~`=`~", round(after_stat(b_0), 2),
round(after_stat(b_1), 2), "~x",
"*\"; R: \"*y~`=`~",
round(after_stat(b_0) + cp * after_stat(b_1), 2),
round(after_stat(b_2), 2), "~x",
sep = "")),
parse = TRUE, output.type = "numeric",
label.x = "right",
formula = my.formula,
method = "lm") +
stat_poly_eq(use_label("R2", "P", "AIC"),
label.x = "right",
label.y = 0.87,
formula = my.formula,
method = "lm") +
stat_poly_line(method = "lm",
formula = my.formula)
It is possible to use rlm() instead of lm().
cp <- 4.1
my.formula <- y ~ lspline(x, knots = cp)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_eq(aes(label =
paste("\"L : \"*y~`=`~", round(after_stat(b_0), 2),
round(after_stat(b_1), 2), "~x",
"*\"; R: \"*y~`=`~",
round(after_stat(b_0) + cp * after_stat(b_1), 2),
round(after_stat(b_2), 2), "~x",
sep = "")),
parse = TRUE, output.type = "numeric",
label.x = "right",
formula = my.formula,
method = "rlm") +
stat_poly_eq(use_label("AIC"),
label.x = "right",
label.y = 0.87,
formula = my.formula,
method = "rlm") +
stat_poly_line(method = "rlm",
formula = my.formula)
Linear Splines (‘segmented’)
Package ‘segmented’ has a design that makes it easy to use. The fitted model used has a structure consistent with that returned by lm(). Any differences are smoothed over by query/extract functions that are consistent with those for "lm" objects. As segmented linear models are not polynomials, no eq.label is automatically generated. Another difference is that the location of the change points is estimated and returned. Starting from ‘ggpmisc’ (>= 0.6.3) this information is returned by stat_poly_eq() and can be used in user-assembled labels as shown in the examples below.
print(citation(package = "segmented", auto = TRUE), bibtex = FALSE)
#> To cite package 'segmented' in publications use:
#>
#> Muggeo VMR (2025). _segmented: Regression Models with Break-Points /
#> Change-Points Estimation (with Possibly Random Effects)_.
#> doi:10.32614/CRAN.package.segmented
#> <https://doi.org/10.32614/CRAN.package.segmented>, R package version
#> 2.1-4, <https://CRAN.R-project.org/package=segmented>.
Function segreg() makes it possible to fit LM and GLM models that are segmented. In contrast to ‘lspline’, the location of the change points or spline knots is estimated during model fitting, in addition to estimating the segment slopes and intercept. Like ‘lspline’, it uses a model formula to define the model, including control of the segmentation through seg(). Unlike lspline(), segreg() is a model fit function that returns a fitted model object with additional information on change points and slopes.

The model formula = y ~ seg(x, npsi = 1) requests a fit with a single change point, which, being the default, is equivalent to formula = y ~ seg(x).
library(segmented)
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_eq(aes(label =
paste("x[cp]~`=`~",
round(after_stat(knots)[[1]][1], 2),
" %+-% ",
round(after_stat(knots.se)[[1]][1], 2),
"*\"; L: \"*y~`=`~",
round(after_stat(b_0), 2), # "+",
round(after_stat(b_1), 2), "~x",
"*\"; R: \"*y~`=`~",
round(after_stat(b_0) +
after_stat(knots)[[1]][1] *
(after_stat(b_1) - after_stat(b_2)),
2),
"+",
round(after_stat(b_2), 2), "~x",
sep = "")),
parse = TRUE, output.type = "numeric",
label.x = "right",
formula = y ~ seg(x, npsi = 1),
method = segreg) +
stat_poly_eq(use_label("R2", "P", "AIC"),
label.x = "right",
label.y = 0.87,
formula = y ~ seg(x, npsi = 1),
method = segreg) +
stat_poly_line(method = segreg,
formula = y ~ seg(x, npsi = 1))
In the next example the slope of the second segment is fixed at zero to keep it horizontal, using formula = y ~ seg(x, npsi = 1, est = c(1, 0)), where 1 indicates a fitted slope and 0 a slope fixed at this value. Because of how the parameters are returned, the code to assemble the equation needs to be rewritten.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_eq(aes(label =
paste("x[cp]~`=`~",
round(after_stat(knots)[[1]][1], 2),
" %+-% ",
round(after_stat(knots.se)[[1]][1], 2),
"*\"; L: \"*y~`=`~",
round(after_stat(b_1), 2), # "+",
round(after_stat(b_2), 2), "~x",
"*\"; R: \"*y~`=`~",
round(after_stat(b_1) +
after_stat(knots)[[1]][1] *
after_stat(b_2),
2),
sep = "")),
parse = TRUE, output.type = "numeric",
label.x = "right",
formula = y ~ seg(x, npsi = 1, est = c(1, 0)),
method = segreg) +
stat_poly_eq(use_label("R2", "P", "AIC"),
label.x = "right",
label.y = 0.87,
formula = y ~ seg(x, npsi = 1, est = c(1, 0)),
method = segreg) +
stat_poly_line(method = segreg,
formula = y ~ seg(x, npsi = 1, est = c(1, 0)))
User defined functions
As mentioned above, layer functions stat_poly_line() and stat_poly_eq() support the use of several different model fit functions, as long as the estimates and predictions can be extracted. In the case of stat_poly_eq(), the model formula embedded in this object must be a polynomial. The variables mapped to the x and y aesthetics must be continuous numeric, times or dates. Models are fitted separately to each group and plot panel, similarly to ggplot2::stat_smooth(). ‘ggpmisc’ provides additional flexibility by delaying some decisions until after the models have been fitted to the observations.

stat_poly_line() and stat_poly_eq() adjust their behaviour based on the class of the model object returned by the method, not on the name of the method passed as argument in the call. Thus, the class of the model fit object returned can vary in response to the data, and can therefore differ between panels or groups in the same plot.

stat_poly_eq() builds the fitted model equation based on the model formula extracted from the model fit object, which can differ from that passed as argument to formula when calling the statistic. Thus, the model formula returned by the function passed as argument for method can vary by group and/or panel. As the examples below demonstrate, this approach can be very useful. In addition, NA can be used to signal to stat_poly_line() and stat_poly_eq() that no model has been fit for a group or panel.
Below I use lm() and "lm" model fit objects in most examples, but the same approaches can be applied to objects of other classes, including "rlm", "lmrob", "lqs", "lts" and "gls".
Fit or not
In the first example we fit a model only if a minimum number of distinct x values are present in data. We define a function that, instead of fitting the model, returns NA when the condition is not fulfilled. Here "x" is the aesthetic, and has to be replaced by "y" depending on orientation.
fit_or_not <- function(formula, data, ..., orientation = "x") {
  if (length(unique(data[[orientation]])) > 5) {
    lm(formula = formula, data = data, ...)
  } else {
    NA_real_
  }
}
Instead of using stats::lm() as method, we pass our wrapper function fit_or_not() as the method. As the function returns either an "lm" object from the wrapped call to lm() or NA, the call to the statistics can make use of all other features as needed.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = "fit_or_not") +
stat_poly_eq(method = fit_or_not,
mapping = use_label("eq", "R2"),
label.x = "right") +
stat_panel_counts(label.x = "left", label.y = "bottom") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
Degree of polynomial
Not fitting a model is a drastic solution. An alternative, which we implement next, is to plot the mean when the slope of the linear regression is not significantly different from zero. In the example below, instead of using stats::lm() as method, we define a wrapper function that tests for the significance of the slope of the linear regression and, if it is not significant, fits the mean instead. This works because the model formula is extracted from the fitted model rather than taken from the argument passed by the user in the call to the statistics.

Fitting different models to different panels is supported. User-defined method functions are required to return an object that inherits from class "lm". The function is applied per group and panel, and the model formula fitted can differ among them, as it is retrieved from this object to construct the equation label.
poly_or_mean <- function(formula, data, ...) {
  fm <- lm(formula = formula, data = data, ...)
  if (anova(fm)[["Pr(>F)"]][1] > 0.1) {
    lm(formula = y ~ 1, data = data, ...)
  } else {
    fm
  }
}
We then use our function poly_or_mean() as the method.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = "poly_or_mean") +
stat_poly_eq(method = poly_or_mean,
mapping = use_label("eq", "R2"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point() +
stat_poly_line(method = "poly_or_mean") +
stat_poly_eq(method = poly_or_mean,
mapping = use_label("eq"),
label.x = "right") +
theme(legend.position = "top")
A different approach to selecting the degree of a polynomial is to use stepwise selection based on AIC. In this case the formula passed as argument sets the “upper” degree of the polynomial. All polynomials of this and lower degrees are fitted and the one with the lowest AIC is used.
best_poly_lm <- function(formula, data, ...) {
  poly.term <- as.character(formula)[3]
  degree <- as.numeric(gsub("poly[(]x, |, raw = TRUE|[)]", "", poly.term))
  fms <- list()
  AICs <- numeric(degree)
  for (d in 1:degree) {
    # we need to define the formula with the value of degree replaced
    working.formula <- as.formula(bquote(y ~ poly(x, degree = .(d), raw = TRUE)))
    fms[[d]] <- lm(formula = working.formula, data = data, ...)
    AICs[d] <- AIC(fms[[d]])
  }
  fms[[which.min(AICs)[1]]] # if there is a tie, we take the simplest
}
We then use our function best_poly_lm() as the method.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3, raw = TRUE), method = "best_poly_lm") +
stat_poly_eq(formula = y ~ poly(x, 3, raw = TRUE), method = "best_poly_lm",
mapping = use_label("eq", "AIC"),
size = 2.8,
label.x = "right") +
expand_limits(y = c(0, 65)) +
facet_wrap(~class, ncol = 2)
ggplot(mpg, aes(displ, hwy, color = class)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3, raw = TRUE), method = "best_poly_lm") +
stat_poly_eq(formula = y ~ poly(x, 3, raw = TRUE), method = "best_poly_lm",
mapping = use_label("eq", "AIC"),
label.x = "right") +
expand_limits(y = c(0, 65)) +
theme(legend.position = "top")
Functions that re-fit a fitted model
The examples in this section are for package ‘segmented’; however, a similar approach can be useful with any model fitting that involves calls to more than one function in sequence, such as computing starting values for a numerical method, or modifying the model fit object.
Segmented (‘segmented’)
Package ‘segmented’ implements methods for fitting models with change points. The locations of the change points are themselves estimated, and their standard errors computed.
print(citation(package = "segmented", auto = TRUE), bibtex = FALSE)
#> To cite package 'segmented' in publications use:
#>
#> Muggeo VMR (2025). _segmented: Regression Models with Break-Points /
#> Change-Points Estimation (with Possibly Random Effects)_.
#> doi:10.32614/CRAN.package.segmented
#> <https://doi.org/10.32614/CRAN.package.segmented>, R package version
#> 2.1-4, <https://CRAN.R-project.org/package=segmented>.
The code here is equivalent to that shown above based on model-fit function segreg(), also from package ‘segmented’. The difference is that when using segmented() the fitting is done in two steps. In this case there is no need to use this approach, but it can be useful when segmentation is to be applied only to some plot panels, or applied conditionally in single-panel plots. Another use is when fitting segmented linear regressions based on model-fit functions other than lm() or glm(). However, in this second case, some of the necessary model-fit query functions can be unavailable and may need to be defined by the user. There are also model fit methods that do not compute or provide the estimates required by the segmentation computations.
Package ‘segmented’ provides several specializations of method segmented(). These methods are dispatched according to the class of the model fit object passed to parameter obj. As this package provides all the expected query functions for the objects that the segmented() methods return, their use is not too difficult. It is possible to write a wrapper function to pass as argument to stat_poly_line() and stat_poly_eq() and, in addition, if desired, to construct the fitted equation labels within a call to aes(). This function, which I named lm2segmented.lm, first calls lm() and then calls segmented() with the fitted model as argument. It accepts additional arguments that make it possible to control the number and initial values of the knots. As in the case of other model fit functions, models with multiple explanatory variables are rarely used with ‘ggplot2’ stats.
library(segmented)
lm2segmented.lm <-
  function(formula,
           data,
           seg.Z = ~ x, # explanatory variable used for segmentation
           psi = median(data$x), # starting value(s) for knot(s)
           npsi = 1, # number of knots if psi is not given
           fixed.psi = NULL, # breakpoint not to be modified by fitting
           control = segmented::seg.control(),
           ...) {
    obj <- lm(formula = formula, data = data, ...)
    segmented(obj = obj,
              seg.Z = seg.Z,
              psi = psi,
              npsi = npsi,
              fixed.psi = fixed.psi,
              control = control,
              model = TRUE,
              keep.class = FALSE)
  }
# using defaults, one breakpoint
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = lm2segmented.lm)
# requesting two breakpoints with starting values
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = lm2segmented.lm,
method.args = list(psi = c(3, 5)))
A more complex example follows, with labels assembled within the call to aes(). In addition to the intercept and slope for the leftmost segment, segmented() returns the slope for each segment to the right of the first one, but not their intercepts. It also fits and returns the positions of the knots, which are returned by stat_poly_eq() in ‘ggpmisc’ (>= 0.6.3). A label is added with the knot position and two regression equations, one for each segment of the linear spline.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_eq(aes(label =
paste("italic(x)[cp]~`=`~",
round(after_stat(knots)[[1]][1], 2),
"*\"; L: \"*y~`=`~",
round(after_stat(b_0), 2), #"+",
round(after_stat(b_1), 2), "~x",
"*\"; R: \"*y~`=`~",
round(after_stat(b_0) +
after_stat(knots)[[1]][1] *
(after_stat(b_1) - after_stat(b_2)),
2),
"+",
round(after_stat(b_2), 2), "~x",
sep = "")),
parse = TRUE, output.type = "numeric",
label.x = "right",
formula = y ~ x,
method = lm2segmented.lm) +
stat_poly_eq(use_label("R2", "P", "AIC"),
label.x = "right",
label.y = 0.87,
formula = y ~ x,
method = lm2segmented.lm,
method.args = list(psi = c(3, 5))) +
stat_poly_line(method = lm2segmented.lm,
formula = y ~ x)
Work-arounds
Some model-fit functions return objects for which no predict() method exists. In such cases, as users, we can define such a method. More difficult situations can also be faced.
Errors in variables (‘refitME’)
In this section I show a situation I faced when trying to use package ‘refitME’. The functions exported by this package fail when called from within a function, so they do not currently work if passed as an argument for method. Below, a temporary work-around based on a solution provided by the maintainer of ‘refitME’ is shown.
print(citation(package = "refitME", auto = TRUE), bibtex = FALSE)
#> To cite package 'refitME' in publications use:
#>
#> Stoklosa J, Hwang W, Warton D (2025). _refitME: Measurement Error
#> Modelling using MCEM_. doi:10.32614/CRAN.package.refitME
#> <https://doi.org/10.32614/CRAN.package.refitME>, R package version
#> 1.3.1, <https://CRAN.R-project.org/package=refitME>.
As ‘refitME’ functions return a model fit object of the same class as that received as input, a model fitted with lm() could, in principle, be refit in a user-defined wrapper function like those above, first calling lm() and then re-fitting the obtained model with refitME::refitME(), as shown in the code chunk below. However, at the moment the functions in package ‘refitME’ fail when called from within another function.
lm_EIV <- function(formula, data, ..., sigma.sq.u) {
  fm <- lm(formula = formula, data = data, ...)
  MCEMfit_glm(mod = fm, family = "gaussian", sigma.sq.u = sigma.sq.u)
}
After defining such a function, we could then use our function lm_EIV() as the method.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
stat_poly_line(method = "lm_EIV", method.args = c(sigma.sq.u = 0)) +
stat_poly_eq(method = "lm_EIV", method.args = c(sigma.sq.u = 0),
mapping = use_label("eq"),
label.x = "right") +
theme(legend.position = "bottom") +
facet_wrap(~class, ncol = 2)
This produces one warning like the one below for each call of the statistics:
Warning: Computation failed in `stat_poly_eq()`.
Caused by error in `names(new.dat)[c(1, which(names(mod$data) %in% names.w))] <- names(mod$data)[c(1, which(names(mod$data) %in% names.w))]`:
! replacement has length zero
I have reported the problem through a GitHub issue to Jakub Stoklosa, the maintainer of ‘refitME’, and he is looking for a permanent solution. Meanwhile, he has provided a temporary fix that I copy below with a few small edits to the ggplot code. Many thanks to Jakub Stoklosa for his very fast answer!
rm(list = ls(pattern = "*"))
library(refitME)

lm_EIV <- function(formula, data, ..., sigma.sq.u) {
  assign("my.data", value = data, envir = globalenv())
  fm <- lm(formula = formula, data = my.data, ...)
  fit <- refitME(fm, sigma.sq.u)
  fit$coefficients <- Re(fit$coefficients)
  fit$fitted.values <- Re(fit$fitted.values)
  fit$residuals <- Re(fit$residuals)
  fit$effects <- Re(fit$effects)
  fit$linear.predictors <- Re(fit$linear.predictors)
  fit$se <- Re(fit$se)
  if ("qr" %in% names(fit)) {
    fit$qr$qr <- Re(fit$qr$qr)
    fit$qr$qraux <- Re(fit$qr$qraux)
  }
  class(fit) <- c("lm_EIV", "lm")
  return(fit)
}
predict.lm_EIV <- function(object, newdata = NULL, ...) {
  preds <- NextMethod("predict", object, newdata = newdata, ...)
  if (is.numeric(preds)) {
    return(Re(preds))
  } else if (is.matrix(preds)) {
    return(Re(preds))
  } else {
    return(preds) # leave unchanged if not numeric/matrix
  }
}
Here the limitation of ‘ggplot2’ and ‘ggpmisc’ that the fit is recomputed in each layer causes a further difficulty: as with most algorithms that use random numbers, successive refits do not necessarily produce exactly the same results. In this case the equation and the line may not match, and the mismatch can be large for large values of sigma.sq.u and low replication, especially in the presence of outliers.
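Assuming the mismatch stems from the RNG use inside refitME(), one user-level mitigation is to reset the seed within the wrapper before each fit, so that the fits recomputed in different layers yield identical estimates. This is only a sketch (the wrapper name and the seed value are arbitrary, not part of ‘refitME’ or ‘ggpmisc’); it would be passed as the method in both layers, with sigma.sq.u supplied through method.args as before.
lm_EIV_seeded <- function(formula, data, ..., sigma.sq.u, seed = 123) {
  set.seed(seed)  # make successive refits reproducible across layers
  lm_EIV(formula = formula, data = data, ..., sigma.sq.u = sigma.sq.u)
}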
This is something that can most likely be fixed in ‘ggpmisc’ itself by setting the seed of the RNG. To check that the work-around is working as expected, we set sigma.sq.u = 0, effectively using an OLS approach and expecting values nearly identical to those from lm().
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
stat_poly_line(aes(y = hwy),
method = "lm_EIV",
method.args = list(sigma.sq.u = 0)) +
stat_poly_eq(aes(y = hwy,
label = paste(after_stat(eq.label), after_stat(rr.label),
sep = "~~")),
method = "lm_EIV",
method.args = list(sigma.sq.u = 0),
label.x = "right") +
facet_wrap(~class, ncol = 2)
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
#> [1] 2
We do not know the measurement error affecting displ, the displacement volume of the engines. For this example we make up a value, sigma.sq.u = 0.01, expecting estimates that differ from those above and from those obtained with lm().
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
stat_poly_line(aes(y = hwy),
method = "lm_EIV",
method.args = list(sigma.sq.u = 0.01)) +
stat_poly_eq(aes(y = hwy,
label = paste(after_stat(eq.label), after_stat(rr.label),
sep = "~~")),
method = "lm_EIV",
method.args = list(sigma.sq.u = 0.01),
label.x = "right") +
facet_wrap(~class, ncol = 2)
#> [1] 5
#> [1] 11
#> [1] 6
#> [1] 6
#> [1] 6
#> [1] 8
#> [1] 6
#> [1] 5
#> [1] 12
#> [1] 6
#> [1] 6
#> [1] 6
#> [1] 8
#> [1] 6
Transformations and eq.label
In ‘ggplot2’, setting transformations in the x and y scales results in the statistics receiving transformed data as input. The data “seen” by the statistics are the same as if the transformation had been applied within a call to aes(). However, the scales automatically adjust breaks and tick labels to display the values on the untransformed scale. When the transformation is applied in the model formula definition, the transformation is applied later, but still before the model is fitted. In all these cases, the model fit is based on transformed values and thus the estimates refer to them. This is why a linear fit remains linear in spite of the transformations.

Only eq.label is affected by transformations, in the sense that the x and y in the default equation refer to transformed values rather than to those shown in the tick labels of the axes, i.e., the default model equation label is incorrect. The correction to apply is to override the x and/or y symbols in the equation label with an expression that includes the transformation used. In these examples we fit polynomials by OLS using lm(), the default method.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
stat_poly_line() +
stat_poly_eq(mapping = use_label("eq", "R2", "n", "P"),
eq.with.lhs = "log[10]~hwy~`=`~",
eq.x.rhs = "~displ") +
scale_y_log10() +
expand_limits(y = 50)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
stat_poly_line(formula = y ~ poly(x, 3, raw = TRUE)) +
stat_poly_eq(mapping = use_label("eq"),
formula = y ~ poly(x, 3, raw = TRUE),
eq.with.lhs = "hwy~`=`~",
eq.x.rhs = "~(log[10]~displ)") +
scale_x_log10(name = "displ") +
expand_limits(y = 50)