Find spikes in a numeric vector using the algorithm of Whitaker and Hayes (2018). Spikes are values in spectra that are unusually high or low compared to their neighbours. They are usually individual values or very short runs of similar "unusual" values. Spikes caused by cosmic radiation are a frequent problem in Raman spectra. Another source of spikes is "hot pixels" in CCD and diode arrays. Other kinds of accidental "outliers" can also be detected.
Usage
find_spikes(
x,
x.is.delta = FALSE,
height.threshold = 10,
z.threshold = 5,
k = 20,
spike.direction = "both",
na.rm = FALSE
)
Arguments
- x
numeric vector containing the data.
- x.is.delta
logical Flag indicating whether x contains differences or original values.
- height.threshold
numeric The minimum height of spikes expressed relative to the median amplitude of the baseline local variation of x.
- z.threshold
numeric Modified local \(Z\) values larger than z.threshold are detected as boundaries of spikes.
- k
integer Width of the median window used for smoothing; must be odd.
- spike.direction
character One of "up", "down", "both" or "skip", indicating which spikes are to be returned, if any.
- na.rm
logical indicating whether NA values should be stripped before searching for spikes.
Value
An integer vector of the same length as x. Values are
0, +1 or -1, corresponding to no spike, upward spike
and downward spike in the data. Conversion to logical with
as.logical() results in a vector with TRUE for spikes and
FALSE otherwise.
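As a minimal sketch of how the returned codes can be used, the example below starts from a hand-written vector spikes that stands in for the value returned by find_spikes(). The data and the codes are made up for illustration and are not actual output of the function.

```r
# Made-up data and a hand-written stand-in for the value returned by
# find_spikes(); not actual output of the function.
x <- c(10, 11, 60, 11, 10, -40, 11)
spikes <- c(0L, 0L, 1L, 0L, 0L, -1L, 0L)

is.spike <- as.logical(spikes)  # TRUE for both +1 and -1
up.idx <- which(spikes > 0L)    # positions of upward spikes
down.idx <- which(spikes < 0L)  # positions of downward spikes

# Mask the spikes with NA, e.g. before further processing
x.clean <- replace(x, is.spike, NA)
```

Because both +1 and -1 convert to TRUE, the logical vector marks spikes irrespective of their direction, while the sign of the integer code keeps the two directions apart.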
Details
Spikes are detected based on a modified \(Z\) score calculated from the differenced spectrum. The \(Z\) threshold used should be adjusted to the characteristics of the input and the desired sensitivity. The lower the threshold, the more sensitive the test becomes, so that shorter spikes are also detected.
The algorithm uses running differences to detect abrupt changes in value, compared to an estimate of the baseline variation of the differences, approximating a baseline \(Z\) from MAD and a baseline value from the median differences. Currently, a single estimate of MAD is used, but running medians are used as baseline when possible. This comparison detects running differences that are unusually large, in most cases signalling a transition between values near the baseline and far from it, in both directions.
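The modified \(Z\) score of the differences described above can be sketched in base R as follows. This is a simplified illustration, not the code of find_spikes(): the helper modified_z() is hypothetical, it uses a single global median instead of running medians, and the constant 0.6745 is the usual scaling of the modified \(Z\) score, assumed here to match the reference.

```r
# Hypothetical helper, for illustration only; not the package's code.
# Uses global estimates of the median and MAD of the differences.
modified_z <- function(x) {
  d <- diff(x)                                   # running differences
  d.med <- median(d)                             # baseline difference
  d.mad <- mad(d, center = d.med, constant = 1)  # unscaled MAD
  # Assumes d.mad > 0; a zero MAD would need special handling
  0.6745 * (d - d.med) / d.mad
}

x <- c(10, 11, 10, 12, 11, 60, 11, 10, 12, 11, 10)  # made-up data
z <- modified_z(x)
z.threshold <- 5

# Indices into the differences: difference i is x[i + 1] - x[i]
flagged <- which(abs(z) > z.threshold)
```

With these data the two flagged differences are the jump into and the jump out of the single high value at position 6, which is the kind of transition the comparison is meant to detect.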
Transitions into and out of spikes are distinguished based on the median of the non-differenced values, as a descriptor of the data baseline. As with the median of the differences, a running median is used when possible.
This function thus detects the start and end of each spike, and distinguishes upward and downward spikes.
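The idea of using a running median of the original values as baseline to tell upward from downward spikes can be illustrated with a deliberately simplified sketch. It classifies each observation directly by its deviation from the baseline, which is an assumption made for brevity and not how find_spikes() pairs start and end transitions. The data, the window width and the multiplier 10 are made up.

```r
# Simplified illustration, not the package's algorithm.
x <- c(10, 12, 9, 13, 11, 60, 12, 9, -40, 13, 10)  # made-up data

# Running median of the original values as baseline (k must be odd);
# as.vector() drops the "k" attribute added by runmed()
baseline <- as.vector(runmed(x, k = 5))
deviation <- x - baseline

# Typical size of the deviations, as unscaled MAD
noise <- mad(deviation, constant = 1)

# +1 above the baseline, -1 below it, 0 when within 10 times the noise
spikes <- ifelse(abs(deviation) > 10 * noise,
                 as.integer(sign(deviation)), 0L)
```

Here the high value at position 6 is coded +1 and the low value at position 9 is coded -1, while all other observations are coded 0.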
k is the width, in number of observations, of the window used for
running median smoothing to extract the baseline. In most cases it needs
to be set manually to a value several times the width of the broadest
spike but narrow enough to track broader peaks.
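The effect of the window width can be seen with runmed() from base R, used here as a stand-in for the running median smoothing; whether find_spikes() calls runmed() internally is an assumption. The data are made up and contain a flat baseline with one spike three observations wide.

```r
# Made-up data: flat baseline at 1 with a spike 3 observations wide
x <- c(rep(1, 6), rep(9, 3), rep(1, 6))

# Window as wide as the spike: the running median follows the spike
narrow <- as.vector(runmed(x, k = 3))

# Window more than twice as wide as the spike: the spike is ignored
wide <- as.vector(runmed(x, k = 7))

max(narrow)  # 9, the spike leaks into the baseline
max(wide)    # 1, the baseline stays flat
```

A window that is too narrow makes the spike part of the baseline, so it can no longer be told apart from it, while a very wide window would also flatten genuine broad peaks.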
With na.rm = TRUE, NA values are omitted before searching for
spikes and set to 0L in the returned vector.
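The handling of NA values described above amounts to searching only the non-missing observations and then writing the result back into a vector of the original length. A minimal sketch, in which spikes.kept is a hand-written placeholder for the result of the search and not actual output of find_spikes():

```r
x <- c(10, NA, 11, 60, NA, 10)  # made-up data with missing values

keep <- !is.na(x)               # observations that are searched
x.kept <- x[keep]

# Hand-written placeholder for the codes found in x.kept
spikes.kept <- c(0L, 0L, 1L, 0L)

# Start from all 0L so that positions holding NA remain 0L
spikes <- integer(length(x))
spikes[keep] <- spikes.kept
spikes  # 0 0 0 1 0 0
```

The returned vector thus keeps the length and alignment of x, and a 0L at a position where x is NA does not mean that a value was tested and found not to be a spike.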
If all spikes are guaranteed to be one observation wide and to go either
up or down from the baseline, it is possible to detect them based purely on
the z.threshold by passing height.threshold = NA and either
spike.direction = "up" or spike.direction = "down", which
ensures very fast computation.
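Why this shortcut can work is shown by the following sketch for upward spikes: when a spike is a single observation, the large positive difference leading into it identifies its position directly, and no baseline of the original values is needed. The code is an illustration under that assumption, with made-up data and a global median and MAD; it is not the code path used by find_spikes().

```r
# Illustration only; assumes single-observation upward spikes.
x <- c(10, 11, 10, 12, 11, 60, 11, 10, 12, 11, 10)  # made-up data
z.threshold <- 5

d <- diff(x)
# Modified Z score of the differences (assumes a non-zero MAD)
z <- 0.6745 * (d - median(d)) / mad(d, constant = 1)

# A large positive z[i] is the jump from x[i] up to x[i + 1],
# so the spike itself sits at position i + 1
spikes <- integer(length(x))
spikes[which(z > z.threshold) + 1L] <- 1L
```

Only the sign of the test changes for downward spikes, where a large negative difference marks the observation that follows it.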
References
Whitaker, D. A.; Hayes, K. (2018) A simple algorithm for despiking Raman spectra. Chemometrics and Intelligent Laboratory Systems, 179, 82-84. doi:10.1016/j.chemolab.2018.06.009.
See also
Other peaks and valleys functions:
find_peaks()
