Skip to contents

stat_summary_xy() and stat_centroid() are similar to ggplot2::stat_summary() but summarize both x and y values in the same plot layer. Differently to stat_summary() no grouping based on data values is done; the grouping respected is that already present based on mappings to aesthetics. This makes it possible to highlight the actual location of the centroid with geom_point(), geom_text(), and similar geometries. Instead, if we use geom_rug() they are only a convenience avoiding the need to add two separate layers and flipping one of them using orientation = "y".

Usage

stat_apply_group(
  mapping = NULL,
  data = NULL,
  geom = "line",
  .fun.x = NULL,
  .fun.x.args = list(),
  .fun.y = NULL,
  .fun.y.args = list(),
  position = "identity",
  na.rm = FALSE,
  show.legend = FALSE,
  inherit.aes = TRUE,
  ...
)

stat_summary_xy(
  mapping = NULL,
  data = NULL,
  geom = "point",
  .fun.x = NULL,
  .fun.x.args = list(),
  .fun.y = NULL,
  .fun.y.args = list(),
  position = "identity",
  na.rm = FALSE,
  show.legend = FALSE,
  inherit.aes = TRUE,
  ...
)

stat_centroid(
  mapping = NULL,
  data = NULL,
  geom = "point",
  .fun = NULL,
  .fun.args = list(),
  position = "identity",
  na.rm = FALSE,
  show.legend = FALSE,
  inherit.aes = TRUE,
  ...
)

Arguments

mapping

The aesthetic mapping, usually constructed with aes. Only needs to be set at the layer level if you are overriding the plot defaults.

data

A layer specific dataset - only needed if you want to override the plot defaults.

geom

The geometric object to use display the data

.fun.x, .fun.y, .fun

function to be applied or the name of the function to be applied as a character string.

.fun.x.args, .fun.y.args, .fun.args

additional arguments to be passed to the function as a named list.

position

The position adjustment to use for overlapping points on this layer

na.rm

a logical value indicating whether NA values should be stripped before the computation proceeds.

show.legend

logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped. FALSE never includes, and TRUE always includes.

inherit.aes

If FALSE, overrides the default aesthetics, rather than combining with them. This is most useful for helper functions that define both data and aesthetics and shouldn't inherit behaviour from the default plot specification, e.g. borders.

...

other arguments passed on to layer. This can include aesthetics whose values you want to set, not map. See layer for more details.

Value

A data frame with the same variables as the data input, with either a single or multiple rows, with the values of x and y variables replaced by the values returned by the applied functions, or possibly filled with NA if no function was supplied or available by default. If the applied function returns a named vector, the names are copied into columns x.names and/or y.names. If the summary function applied returns a one row data frame, it will be column bound keeping the column names, but overwritting columns x and/or y with y from the summary data frame. In the names returned by .fun.x the letter "y" is replaced by "x". These allows the use of the same functions as in

ggplot2::stat_summary().

x

x-value as returned by .fun.x, with names removed

y

y-value as returned by .fun.y, with names removed

x.names

if the x-value returned by .fun.x is named, these names

y.names

if the y-value returned by .fun.y is named, these names

xmin, xmax

values returned by .fun.x under these names, if present

ymin, ymax

values returned by .fun.y under these names, if present

<other>

additional values as returned by .fun.y under other names

Details

stat_apply_group applies functions to data. When possible it is preferable to use transformations through scales or summary functions such as ggplot2::stat_summary(), stat_summary_xy() or stat_centroid(). There are some computations that are not scale transformations but are not usual summaries either, as the number of data values does not decrease all the way to one row per group. A typical case for a summary is the computation of quantiles. For transformations are cumulative ones, e.g., using cumsum(), runmed() and similar functions. Obviously, it is always possible to apply such functions to the data before plotting and passing them to a single layer function. However, it can be useful to apply such functions on-the-fly to ensure that grouping is consistent between computations and aesthetics. One particularity of these statistics is that they can apply simultaneously different functions to x values and to y values when needed. In contrast to these statistics, geom_smooth applies a function that takes both x and y values as arguments.

These four statistics are similar. They differ on whether they return a single or multiple rows of data per group.

Note

The applied function(s) must accept as first argument a vector that matches the variables mapped to x or y aesthetics. For stat_summary_xy() and stat_centroid() the function(s) to be applied is(are) expected to return a vector of length 1 or a data frame with only one row, as mean_se(), mean_cl_normal() mean_cl_boot(), mean_sdl() and median_hilow() from 'ggplot2' do.

For stat_apply_group the vectors returned by the the functions applied to x and y must be of exactly the same length. When only one of .fun.x or .fun.y are passed a function as argument, the other variable in the returned data is filled with NA_real_. If other values are desired, they can be set by means of a user-defined function.

References

Answers to question "R ggplot on-the-fly calculation by grouping variable" at https://stackoverflow.com/questions/51412522.

Examples

set.seed(123456)
my.df <- data.frame(X = rep(1:20,2),
                    Y = runif(40),
                    category = rep(c("A","B"), each = 20))

# make sure rows are ordered for X as we will use functions that rely on this
my.df <- my.df[order(my.df[["X"]]), ]

# Centroid
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_centroid(shape = "cross", size = 6) +
  geom_point()


ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_centroid(geom = "rug", linewidth = 1.5, .fun = median) +
  geom_point()


ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_centroid(geom = "text", aes(label = category)) +
  geom_point()


ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_summary_xy(geom = "pointrange",
                  .fun.x = mean, .fun.y = mean_se) +
  geom_point()


# quantiles
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  geom_point() +
  stat_apply_group(geom = "rug", .fun.y = quantile, .fun.x = quantile)


ggplot(my.df, aes(x = X, y = Y)) +
  geom_point() +
  stat_apply_group(geom = "rug", sides = "lr", color = "darkred",
                   .fun.y = quantile) +
  stat_apply_group(geom = "text", hjust = "right", color = "darkred",
                   .fun.y = quantile,
                   .fun.x = function(x) {rep(22, 5)}, # set x to 22
                   mapping = aes(label = after_stat(y.names))) +
                   expand_limits(x = 21)


my.probs <- c(0.25, 0.5, 0.75)
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  geom_point() +
  stat_apply_group(geom = "hline",
                   aes(yintercept = after_stat(y)),
                   .fun.y = quantile,
                   .fun.y.args = list(probs = my.probs))


# cummulative summaries
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_apply_group(.fun.x = function(x) {x},
                   .fun.y = cummax)


ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_apply_group(.fun.x = cumsum, .fun.y = cumsum)


# diff returns a shorter vector by 1 for each group
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_apply_group(.fun.x = function(x) {x[-1L]},
                   .fun.y = diff, na.rm = TRUE)


# Running summaries
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  geom_point() +
  stat_apply_group(.fun.x = function(x) {x},
                   .fun.y = runmed, .fun.y.args = list(k = 5))


# Rescaling per group
ggplot(my.df, aes(x = X, y = Y, colour = category)) +
  stat_apply_group(.fun.x = function(x) {x},
                   .fun.y = function(x) {(x - min(x)) / (max(x) - min(x))})


# inspecting the returned data
if (requireNamespace("gginnards", quietly = TRUE)) {
  library(gginnards)

  ggplot(my.df, aes(x = X, y = Y, colour = category)) +
    stat_centroid(.fun = mean_se, geom = "debug")

  ggplot(my.df, aes(x = X, y = Y, colour = category)) +
    stat_summary_xy(.fun.y = mean_se, geom = "debug")

  ggplot(my.df, aes(x = X, y = Y, colour = category)) +
    stat_apply_group(.fun.y = cumsum, geom = "debug")

  ggplot(my.df, aes(x = X, y = Y, colour = category)) +
    geom_point() +
    stat_apply_group(geom = "debug",
                    .fun.x = quantile,
                    .fun.x.args = list(probs = my.probs),
                    .fun.y = quantile,
                   .fun.y.args = list(probs = my.probs))
}

#> [1] "PANEL 1; group(s) 1, 2; 'draw_function()' input 'data' (head):"
#>    colour     x         y PANEL group x.names y.names
#> 1 #F8766D  5.75 0.3399159     1     1     25%     25%
#> 2 #F8766D 10.50 0.6736796     1     1     50%     50%
#> 3 #F8766D 15.25 0.8791947     1     1     75%     75%
#> 4 #00BFC4  5.75 0.1643890     1     2     25%     25%
#> 5 #00BFC4 10.50 0.5641866     1     2     50%     50%
#> 6 #00BFC4 15.25 0.8576917     1     2     75%     75%