Split data frame into chunks — split_chunks • photobiologyFlecks

Split a time series stored in a data frame at breaks (long time steps), returning a list of data frames, one per data chunk.

Usage

split_chunks(
  data,
  time.name = "TIMESTAMP",
  qty.name = NULL,
  time.step = NULL,
  chunk.min.time,
  chunk.min.rows = 2,
  add.diffs = TRUE,
  verbose = FALSE,
  na.rm = TRUE
)

Arguments

data: data.frame Containing at least one column with time stamps and one column with a measured quantity.
time.name: character vector of length one Name of the variable containing time stamps for the observations.
qty.name: character vector Name(s) of variable(s) in data containing values of observed quantities. If qty.name = NULL, the default, all columns are retained.
time.step: numeric The duration in seconds of one time step within a chunk. If NULL, the actual time steps are used.
chunk.min.time: numeric or duration Minimum length of the time step (gap) that separates consecutive data chunks. If numeric, expressed in seconds.
chunk.min.rows: integer The minimum number of rows that a chunk must have to be retained; chunks with fewer rows are discarded.
add.diffs: logical Flag indicating if values returned by diff() are to be added to the returned data frame chunks.
verbose: logical Report chunk names and lengths at each iteration. Useful for debugging.
na.rm: logical Omit rows of data containing NA values after selecting variables.

Value

A list of data frames. The length of the list depends on the number of chunks found and can be zero. The members of the list are named based on the starting time of each chunk. The variables included in the member data frames are those named by time.name and qty.name and, optionally, their running differences.

Details

When time series data are acquired in bursts or chunks separated by longer time intervals, it can be useful to extract the chunks into separate data frames before further analysis. This implementation does not assume the same duration for all chunks or for the gaps between them. Instead, it searches for time intervals longer than a threshold duration and splits the data at these points. If the data contain no gaps, the whole data set is returned as a single chunk.
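
The gap-based splitting described above can be sketched in base R. This is only an illustration of the idea, not the code used by split_chunks(); the example data, the PAR column and the gap.threshold variable are assumptions made up for the sketch.

```r
# Hypothetical data: two bursts of 1 s readings separated by a 60 s gap.
t0 <- as.POSIXct("2024-06-01 12:00:00", tz = "UTC")
df <- data.frame(
  TIMESTAMP = t0 + c(0, 1, 2, 3, 63, 64, 65),
  PAR = c(210, 215, 640, 220, 205, 810, 212)
)

# Assumed threshold in seconds; it plays the role of chunk.min.time.
gap.threshold <- 30

# Time steps in seconds between consecutive rows.
steps <- diff(as.numeric(df$TIMESTAMP))

# A new chunk starts at the first row and after every step
# longer than the threshold.
chunk.id <- cumsum(c(TRUE, steps > gap.threshold))

chunks <- split(df, chunk.id)

# Name each chunk after its starting time.
names(chunks) <- vapply(
  chunks,
  function(x) format(x$TIMESTAMP[1], "%Y-%m-%d %H:%M:%S"),
  character(1),
  USE.NAMES = FALSE
)
```

With no step longer than the threshold, chunk.id is 1 for every row and split() returns a list with a single data frame, which matches the single-chunk case described above.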

When a minimum length for the individual chunks is set with an argument to chunk.min.rows, chunks with fewer rows are discarded silently, unless verbose = TRUE.
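
As a rough sketch of this filtering step, assuming the chunks are already held in a named list of data frames (the list, its names and min.rows are hypothetical and stand in for the function's internal state):

```r
# Hypothetical list of chunks; chunk "b" has a single row.
chunks <- list(
  a = data.frame(time = 0:3, qty = c(1, 2, 4, 3)),
  b = data.frame(time = 60, qty = 5),
  c = data.frame(time = 120:122, qty = c(2, 2, 6))
)

# Assumed minimum, equivalent in role to chunk.min.rows.
min.rows <- 2

# Count the rows in each chunk and keep only those that are long enough.
n.rows <- vapply(chunks, nrow, integer(1))
kept <- chunks[n.rows >= min.rows]
```

Here kept retains chunks "a" and "c", while the one-row chunk "b" is dropped.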

With add.diffs = TRUE, the running differences between values in the current row and the one above are added to the returned data frames. The value in the first row is NA for running differences, except for the time, in which case it is the time difference to the preceding value in data.
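
The following base R sketch illustrates these running differences for one chunk. The column names time.diff and qty.diff are assumptions chosen for the example, and the names used by split_chunks() may differ. The sketch covers only a chunk that has a preceding row in data.

```r
# Hypothetical data with a 60 s gap between rows 3 and 4.
df <- data.frame(
  time = c(0, 1, 2, 62, 63, 64),
  qty = c(10, 12, 11, 20, 26, 23)
)

# Rows belonging to the second chunk.
rows <- 4:6
chunk <- df[rows, ]

# Quantity: the first row has no predecessor within the chunk, hence NA.
chunk$qty.diff <- c(NA, diff(chunk$qty))

# Time: the first difference reaches back to the last row before the chunk,
# so it equals the length of the gap.
chunk$time.diff <- diff(df$time[(rows[1] - 1):rows[length(rows)]])
```

In this example the first row of the chunk has qty.diff equal to NA and time.diff equal to 60, the gap to the preceding observation.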

Method diff() must be available for the class of the variable named by the argument to time.name. The class of this column is in most cases numeric, date, or time. If add.diffs = TRUE this requirement also applies to the variable(s) named by the argument passed to qty.name.
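
A short check of how base R's diff() behaves for these common classes may help when preparing data. The values are made up for illustration. For Date and POSIXct input, diff() returns a difftime object whose units are chosen automatically, so converting explicitly to seconds avoids surprises.

```r
# Numeric time stamps: plain numeric differences.
d.num <- diff(c(0, 1.5, 4))

# Date: a difftime object, here in days.
d.date <- diff(as.Date(c("2024-06-01", "2024-06-03")))

# POSIXct: a difftime object with automatically selected units.
d.time <- diff(as.POSIXct(
  c("2024-06-01 12:00:00", "2024-06-01 12:00:05"),
  tz = "UTC"
))

# Convert to seconds explicitly, whatever units were selected.
secs.date <- as.numeric(d.date, units = "secs")
secs.time <- as.numeric(d.time, units = "secs")
```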

The number of chunks in the returned list of data frames and their lengths are reported in a message().