Find spikes in vector

Find spikes in a numeric vector using the algorithm of Whitaker and Hayes (2018). Spikes are values in spectra that are unusually high or low compared to their neighbours. They are usually individual values or very short runs of similar "unusual" values. Spikes caused by cosmic radiation are a frequent problem in Raman spectra. Another source of spikes is "hot pixels" in CCD and diode arrays. Other kinds of accidental "outliers" can also be detected.

Usage

find_spikes(
  x,
  x.is.delta = FALSE,
  height.threshold = 10,
  z.threshold = 5,
  k = 20,
  spike.direction = "both",
  na.rm = FALSE
)

Arguments

x: numeric vector containing the data.
x.is.delta: logical flag indicating whether x contains differences or original values.
height.threshold: numeric; the minimum height of spikes expressed relative to the median amplitude of the baseline local variation of x.
z.threshold: numeric; modified local \(Z\) values larger than z.threshold are detected as boundaries of spikes.
k: integer; width of the median window used for smoothing; must be odd.
spike.direction: character; one of "up", "down", "both" or "skip", indicating which spikes are to be returned, if any.
na.rm: logical indicating whether NA values should be stripped before searching for spikes.

Value

An integer vector of the same length as x. Values are 0, +1 or -1, corresponding to no spike, upward spike and downward spike in the data, respectively. Conversion to logical with as.logical() results in a vector with TRUE for spikes and FALSE otherwise.
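
As a small illustration of this coding, the sketch below uses a hand-written vector, here called spikes, standing in for a value returned by find_spikes(). The values and the variable names are invented for the example and were not computed by the function.

```r
# Hypothetical return value: 0 = no spike, +1 = upward spike, -1 = downward spike
spikes <- c(0L, 0L, 1L, 0L, -1L, -1L, 0L)

is.spike <- as.logical(spikes)   # TRUE wherever the code is non-zero
up.idx   <- which(spikes > 0L)   # positions of upward spikes
down.idx <- which(spikes < 0L)   # positions of downward spikes
```

The logical form is convenient for subsetting, while the signed integer form keeps the direction of each spike.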

Details

Spikes are detected based on a modified \(Z\) score calculated from the differenced spectrum. The \(Z\) threshold used should be adjusted to the characteristics of the input and the desired sensitivity. The lower the threshold, the more sensitive the test becomes, with shorter spikes also being detected.

The algorithm uses running differences to detect abrupt changes in value. Each difference is compared with an estimate of the baseline variation of the differences, using MAD to approximate the baseline spread for the \(Z\) score and the median of the differences as the baseline value. Currently, a single estimate of MAD is used, while running medians are used as baseline when possible. This comparison detects running differences that are unusually large, in most cases signalling a transition, in either direction, between values near the baseline and values far from it.
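
The comparison described above can be sketched in base R as follows. This is a simplified illustration and not the code used by find_spikes(): it assumes the usual definition of the modified \(Z\) score, with the constant 0.6745 and an unscaled MAD, uses a single global median of the differences rather than a running median, and works on simulated data.

```r
# Simulated data: flat baseline with small noise and one upward spike at x[6]
x <- c(10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1, 9.8, 10.2, 10.0)

d     <- diff(x)                        # running differences
d.med <- stats::median(d)               # baseline value of the differences
d.mad <- stats::mad(d, constant = 1)    # single, unscaled MAD estimate
z     <- 0.6745 * (d - d.med) / d.mad   # modified Z score (assumed definition)

z.threshold <- 5
flagged <- which(abs(z) > z.threshold)  # differences flagged as abrupt changes
```

Two consecutive differences are flagged here, one with a positive and one with a negative \(Z\), marking the transition into and out of the single spike.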

Transitions into and out of spikes are distinguished based on the median of the non-differenced values, used as a descriptor of the data baseline. As with the median of the differences, a running median is used when possible.

This function thus detects the start and end of each spike, and distinguishes upward and downward spikes.

k is the width, in number of observations, of the window used for running median smoothing to extract the baseline. In most cases it needs to be set manually to a value several times the width of the broadest spike, but narrow enough to track broader peaks.
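
The role of the window width can be illustrated with stats::runmed() from base R. This is only a sketch on made-up data, with an arbitrary odd k chosen for the example, and it does not claim to reproduce how find_spikes() computes its baseline.

```r
# Made-up data: linear trend with a one-observation spike at position 5
x <- c(1, 2, 3, 4, 50, 6, 7, 8, 9, 10)

k <- 5                                           # odd window, several times the spike width
baseline <- as.numeric(stats::runmed(x, k = k))  # running median follows the trend
residual <- x - baseline                         # what remains after removing the baseline

spike.pos <- which.max(abs(residual))            # position of the largest deviation
```

Because the window is wider than the spike, the running median ignores the spike and follows the trend, so the spike stands out in the residual. A window that is too wide would instead flatten genuine broad peaks.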

With na.rm = TRUE, NA values are omitted before searching for spikes and set to 0L in the returned vector.
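
The handling of NA values described above can be pictured with the following sketch. The detector flag() is a deliberately crude stand-in invented for this example, not part of the package; only the bookkeeping around it is of interest.

```r
x <- c(1, NA, 1, 9, 1, NA, 1)
keep <- !is.na(x)                # observations that take part in the search

# Stand-in detector, for illustration only: flags values far above the median
flag <- function(v) as.integer(v > stats::median(v) + 5)

result <- integer(length(x))     # initialised to 0L, so NA positions stay 0L
result[keep] <- flag(x[keep])    # write results back at the original positions
```

The returned vector keeps the length of x, with 0L at the positions that held NA.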

If all spikes are guaranteed to be one observation wide and to point in a single direction, either up or down from the baseline, they can be detected based purely on z.threshold by passing height.threshold = NA and either spike.direction = "up" or spike.direction = "down", which ensures very fast computation.

References

Whitaker, D. A.; Hayes, K. (2018) A simple algorithm for despiking Raman spectra. Chemometrics and Intelligent Laboratory Systems, 179, 82-84. doi:10.1016/j.chemolab.2018.06.009.
