I am trying to implement a Chebyshev filter to smooth a time series, but unfortunately there are NAs in the data series.
For example,
t <- seq(0, 1, len = 100)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
I am using this Chebyshev filter (from the signal package): cf1 = cheby1(5, 3, 1/44, type = "low")
I want to filter the time series while excluding the NAs, without messing up the order/positions of the values. I have already tried na.rm = TRUE, but there seems to be no such argument.
Then
z <- filter(cf1, x) # apply filter
Thank you guys.
Try using x <- x[!is.na(x)] to remove the NAs, then run the filter.
You can remove the NAs beforehand using the complete.cases function. You also might consider imputing the missing data; check out the mtsdi or Amelia II packages.
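For example, a minimal sketch of that removal approach (note that it shortens the vector, so the filtered values no longer line up with the original positions):
x2 <- x[complete.cases(x)]  # for a plain vector this is the same as x[!is.na(x)]
z2 <- filter(cf1, x2)       # filter the gap-free series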
EDIT:
Here's a solution with Rcpp and the inline package. This might be helpful if speed is important:
require(inline)
require(Rcpp)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)),NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
NAs <- x                # keep a copy that still contains the NAs
x2 <- x[!is.na(x)]      # NA-free copy to operate on
# do something to x2 (e.g. filter it)
src <- '
  Rcpp::NumericVector vecX(vx);
  Rcpp::NumericVector vecNA(vNA);
  int j = 0; // counter for vx
  for (int i = 0; i < vecNA.size(); i++) {
    if (!(R_IsNA(vecNA[i]))) {
      // replace and update j
      vecNA[i] = vecX[j];
      j++;
    }
  }
  return Rcpp::wrap(vecNA);
'
fun <- cxxfunction(signature(vx = "numeric",
                             vNA = "numeric"),
                   src, plugin = "Rcpp")
if (identical(x, fun(x2, NAs)))
  print("worked")
# [1] "worked"
I don't know if ts objects can have missing values, but if you just want to re-insert the NA values, you can use ?insert from R.utils. There might be a better way to do this.
install.packages(c('R.utils', 'signal'))
require(R.utils)
require(signal)
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA)
cf1 = cheby1(5, 3, 1/44, type = "low")
xex <- na.omit(x)
z <- filter(cf1, xex) # apply
z <- as.numeric(z)
for (m in attributes(xex)$na.action) {
  z <- insert(z, ats = m, values = NA)
}
all.equal(is.na(z), is.na(x))
?insert
Here is a function you can use to filter a signal with NAs in it.
The NAs are ignored rather than replaced by zero.
You can then specify a maximum percentage of weight which the NAs may take at any point of the filtered signal. If there are too many NAs (and too few actual data) at a specific point, the filtered signal itself will be set to NA.
# This function applies a filter to a time series with potentially missing data
filter_with_NA <- function(x,
                           window_length = 12,  # will be applied centrally
                           myfilter = rep(1/window_length, window_length),  # a boxcar filter by default
                           max_percentage_NA = 25)  # which percentage of weight created by NAs should not be exceeded
{
  # make the signal longer at both sides
  signal <- c(rep(NA, window_length), x, rep(NA, window_length))
  # see where data are present and not NA
  present <- is.finite(signal)
  # replace the NA values by zero
  signal[!is.finite(signal)] <- 0
  # apply the filter; stats::filter is called explicitly because attaching the
  # signal package (for cheby1) masks it
  filtered_signal <- as.numeric(stats::filter(signal, myfilter, sides = 2))
  # find out which percentage of the filtered signal was created by non-NA values
  # - this is easy because the filter is linear
  original_weight <- as.numeric(stats::filter(present, myfilter, sides = 2))
  # where this is lower than one, the signal is now artificially smaller
  # because we added zeros - compensate for that
  filtered_signal <- filtered_signal / original_weight
  # but where too few values are present, discard the signal
  filtered_signal[100 * (1 - original_weight) > max_percentage_NA] <- NA
  # cut away the padding to left and right which we previously inserted
  filtered_signal <- filtered_signal[(window_length + 1):(window_length + length(x))]
  return(filtered_signal)
}
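A quick usage sketch with the question's data (using the default boxcar kernel; the Chebyshev filter from above cannot be plugged in directly, since this helper relies on stats::filter convolving a finite kernel):
t <- seq(0, 1, len = 100)
set.seed(7337)
x <- c(sin(2*pi*t*2.3) + 0.25*rnorm(length(t)), NA, cos(2*pi*t*2.3) + 0.25*rnorm(length(t)))
z <- filter_with_NA(x)   # same length as x, NA where too much weight was missing
plot(x)
lines(z, col = "red", lwd = 2)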
Related
I cannot seem to even create a reproducible example on this as it works fine when I go through the code one line at a time.
The error message I get is as follows:
"Error in testData[, colCheck][length(testData[, colCheck])] - testData[, :
non-numeric argument to binary operator "
Both colCheck and testData$linearcorrd15N are numeric and like I said, the calculation works fine when I run it at that line. The error comes only when I run the function from QTest(df, colCheck).
Here is an example of what some of the code looks like. It will not produce an error, but maybe you can see something that I don't.
QTest <- function(testData, colCheck)
#%#
# testData <- This is the entire data frame for the std/ref that has too high
#             of a SD, this way the data frame can be returned without the outlier
# colCheck <- The column name for values that were flagged for having too high of a SD
# This Q test info provided by: https://www.statisticshowto.com/dixons-q-test/
#%#
{
  # Get the mean of the highest and lowest values
  testData <- arrange(testData, desc(testData[, colCheck]))
  len <- length(testData[, colCheck]) - 1
  high <- sapply(1:len, function(i) testData[, colCheck][i])
  meanhigh <- mean(high)
  testData <- arrange(testData, (testData[, colCheck]))
  low <- sapply(1:len, function(i) testData[, colCheck][i])
  meanlow <- mean(low)
  # If the mean of the lowest numbers is lower than the mean of the highest numbers, do this
  if (meanlow < meanhigh) {
    QexpVal <- abs((testData[, colCheck][2] - testData[, colCheck][1]) /
                     (testData[, colCheck][length(testData[, colCheck])] - testData[, colCheck][1]))
    outlier <- testData[, colCheck][1]
    closest <- testData[, colCheck][2]
    # else if the mean of the lowest numbers is higher than the mean of the highest numbers, do this
  } else {
    QexpVal <- abs((testData[, colCheck][length(testData[, colCheck]) - 1] - (testData[, colCheck][length(testData[, colCheck])])) /
                     (testData[, colCheck][length(testData[, colCheck])]) - (testData[, colCheck][1]))
    outlier <- testData[, colCheck][length(testData[, colCheck])]
    closest <- testData[, colCheck][length(testData[, colCheck]) - 1]
  }
  return(QexpVal)
}
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTestCorrVals <- QTest(df, colCheck)
It seems you really overcomplicate this function by pushing the whole table into the function, looping over everything, and reading values back out of the whole table each time...
Just the part to get meanhigh and meanlow requires only this:
v <- df[, colCheck]
v <- v[order(v)]
n <- length(v)
meanhigh <- mean(v[2:n])
meanlow <- mean(v[1:(n - 1)])
Or, if you use decreasing ordering, this:
v <- df[, colCheck]
v <- v[order(v, decreasing = T)]
n <- length(v)
meanhigh <- mean(v[1:(n - 1)])
meanlow <- mean(v[2:n])
Full function
Here is the full code using this approach. I agree this is not the specific question you asked, but the way you coded it is highly inefficient and error prone: every time you take the whole data.frame, subset it, and recalculate lengths. Also, you only have to order once; if the lowest value is at the top, the highest is by definition at the bottom. Then play around with 1 for the first, 2 for the second, n for the last and n-1 for the second-to-last element.
df <- data.frame(Row = c(1, 2, 3, 4, 5),
                 Identifier.2 = "36-UWSIF-UT Glut1",
                 linearcorrd15N = c(-11.63433, -22.13869, -57.21795, -17.06438, -16.23358))
colCheck <- as.numeric(grep("linearcorrd15N", colnames(df)))
QTest <- function(v) {
  v <- v[order(v)]
  n <- length(v)
  meanhigh <- mean(v[2:n])
  meanlow <- mean(v[1:(n - 1)])
  if (meanlow < meanhigh) {
    QexpVal <- abs((v[2] - v[1]) / (v[n] - v[1]))
    outlier <- v[1]
    closest <- v[2]
  } else {
    QexpVal <- abs((v[n-1] - v[n]) / (v[n] - v[1]))
    outlier <- v[n]
    closest <- v[n-1]
  }
  return(QexpVal)
}
QTestCorrVals <- QTest(df[, colCheck])
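With the example data, the lowest value should be flagged as the outlier (numbers worked out by hand, so verify):
QTestCorrVals
# ~ 0.7696 -- (v[2] - v[1]) / (v[n] - v[1]) = 35.07926 / 45.58362, outlier -57.21795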
Side note
Using the column index number works slightly differently depending on whether your data is a data.frame or a data.table:
library(data.table)
class(df)
df[, colCheck]    # works for a data.frame
dt <- data.table(df)
class(dt)
dt[, ..colCheck]  # a data.table needs the .. prefix for a column index
dt[, colCheck]    # throws an error
I need to count how many times a variable inverts its growth pattern - from increasing values to decreasing values (as well as from decreasing values to increasing values). In the following example, I should be able to find 4 such inversions. How can I create a new dummy variable that shows such inversions?
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(x)
You're effectively asking "when is the second derivative of x not equal to zero?", so you could just do a double diff:
x <- c(1:20,19:5,6:15,12:9,10:11)
plot(seq_along(x), x)
changes <- c(0, diff(diff(x)), 0) != 0
To show it picks the right points, colour them red.
points(seq_along(x)[changes], x[changes], col = "red")
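To count the flagged points you can sum the logical vector. One caveat worth knowing: a double diff flags any change of slope, not only sign inversions, so the jump from 15 down to 12 in this series is flagged as well:
sum(changes)
# [1] 5  (the four true inversions plus the slope change at the 15 -> 12 drop)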
This function will return the indices at which the direction changed:
get_change_indices <- function(x){
  # return NULL if x contains only one (or, if x is NULL, no) unique element
  if (length(unique(x)) <= 1) return(NULL)
  # make x named, so we can recapture its indices later
  x <- setNames(x, paste0("a", seq_along(x)))
  # calculate diff between successive elements
  diff_x <- diff(x)
  # remove differences that are equal to zero (flat stretches)
  diff_x <- diff_x[!diff_x == 0]
  # identify indices of changepoints
  diff_x <- c(diff_x[1], diff_x)
  change_ind <- NULL
  for (i in 2:length(diff_x)) {
    if (sign(diff_x[i]) != sign(diff_x[i - 1])) {
      change_ind_curr <- as.numeric(gsub("a", "", names(diff_x[i]))) - 1
      change_ind <- c(change_ind, change_ind_curr)
    }
  }
  change_ind
}
The length of its output is the number of changes.
Note that it also works when the change in x is non-linear, e.g. if x <- c(1, 4, 9, 1).
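A quick check with the question's series and with the non-linear example (expected output worked out by hand, so verify):
x <- c(1:20, 19:5, 6:15, 12:9, 10:11)
get_change_indices(x)              # should give 20 35 45 49 -- the four turning points
length(get_change_indices(x))      # 4
get_change_indices(c(1, 4, 9, 1))  # 3 -- the single peak at index 3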
I am trying to build a function that takes a numeric vector of homework scores (of length n), and an optional logical argument drop, to compute a single homework value. If drop = TRUE, the lowest HW score must be dropped.
Step 1: a function to get the average
get_average <- function(x, na.rm = TRUE) {
  if (na.rm == TRUE) {
    x = remove_missing(x)
  }
  total <- 0
  for (n in 1:length(x)) {
    total = total + x[n]
  }
  return(total / length(x))
}
Then I put it all together:
score_homework <- function(x, drop = TRUE)
{
  if (drop == TRUE)
    x = drop_lowest(x)
  {get_average(x)}
}
However, I keep getting the error Error in score_homework() : argument "x" is missing, with no default.
I'm not sure this is what you want, but here goes.
First generate some dummy data:
# Set seed
set.seed(1234)
# Generate dummy homework data with <NA> values
homework <- c(rep(NA, 20), rnorm(n = 100, mean = 50, sd = 10))
# Have a quick look
hist(homework)
Then we write the function:
# Make function
homework_func <- function(data, drop = TRUE) {
  # Remove NA
  data <- data[!is.na(data)]
  # Calculate the average depending on whether 'drop' is TRUE or FALSE
  if (drop == TRUE) {
    # note: this drops every value tied for the minimum, not just one
    data <- data[data > min(data)]
    mean(data)
  } else {
    mean(data)
  }
}
# Use function with 'drop = TRUE'
homework_func(data = homework, drop = TRUE)
#> [1] 48.65349
# Use function with 'drop = FALSE'
homework_func(data = homework, drop = FALSE)
#> [1] 48.43238
Here is a function to eliminate the lowest score that's less complicated than the version in the original post. I sort the scores in descending order in case there is a tie for the lowest score; in that case, we should only remove one instance of the lowest score. Also, you're really better off using R's mean() function than writing your own.
scores <- c(78,93,61,NA,61,83,92,95,NA,100)
removeMinScore <- function(x) {
  x <- x[order(-x)]     # order descending (NAs are put last)
  x <- x[!is.na(x)]     # remove NAs
  x[1:(length(x) - 1)]  # return all but lowest score, removes only 1 tied value
}
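For the example scores this drops the NAs and only one of the tied 61s:
removeMinScore(scores)
# [1] 100  95  93  92  83  78  61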
That said, if you must write your own version of mean(), here is a simpler approach that takes advantage of existing R functions.
TIP: Since is.na() returns a vector of TRUE and FALSE values, you can sum its negation, !is.na(x), to count the number of non-missing values in a vector.
mymean <- function(x) {sum(x, na.rm=TRUE) / sum(!is.na(x))}
For the example scores above, the result looks like this:
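mymean(scores)
# [1] 82.875   (sum of the 8 non-missing scores, 663, divided by 8)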
The modified version of score_homework() would be:
score_homework <- function(x, drop = TRUE) {
  if (drop == TRUE) return(mean(removeMinScore(x), na.rm = TRUE))
  else mean(x, na.rm = TRUE)
}
Testing the function with the example scores gives results like these:
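score_homework(scores)                # drop = TRUE: mean of c(100,95,93,92,83,78,61)
# [1] 86
score_homework(scores, drop = FALSE)  # mean of all 8 non-missing scores
# [1] 82.875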
I have a function that uses matplot to plot some data. Data structure is like this:
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
matplot(test[,-1])
matlines(test[,1], test[,-1])
So far so good. However, if there are missing values in the data set, then there are gaps in the resulting plot, and I would like to avoid those by connecting the edges of the gaps.
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
In the real situation this is inside a function, the dimensions of the matrix are bigger, and the number of rows and columns and the positions of the non-overlapping missing values may change between different calls, so I'd like to find a solution that can handle this in a flexible way. I also need to use matlines.
I was thinking of filling in the gaps with interpolated data, but maybe there is a better solution.
I came across this exact situation today, but I didn't want to interpolate values - I just wanted the lines to "span the gaps", so to speak. I came up with a solution that, in my opinion, is more elegant than interpolating, so I thought I'd post it even though the question is rather old.
The problem causing the gaps is that there are NAs between consecutive values. So my solution is to 'shift' the column values so that there are no NA gaps. For example, a column consisting of c(1,2,NA,NA,5) would become c(1,2,5,NA,NA). I do this with a function called shift_vec_na() in an apply() loop. The x values also need to be adjusted, so we can make the x values into a matrix using the same principle, but using the columns of the y matrix to determine which values to shift.
Here's the code for the functions:
# x    -> vector
# bool -> boolean vector; must be same length as x. The values of x where bool
#         is TRUE will be 'shifted' to the front of the vector, and the back of
#         the vector will be all NA (i.e. the number of NAs in the resulting
#         vector is sum(!bool))
# returns the 'shifted' vector (will be the same length as x)
shift_vec_na <- function(x, bool){
  n <- sum(bool)
  if (n < length(x)) {
    x[1:n] <- x[bool]
    x[(n + 1):length(x)] <- NA
  }
  return(x)
}

# x -> vector
# y -> matrix, where nrow(y) == length(x)
# returns a list of two elements ('x' and 'y') that contain the 'adjusted'
# values that can be used with 'matplot()'
adj_data_matplot <- function(x, y){
  y2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(col_i, !is.na(col_i)))
  })
  x2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(x, !is.na(col_i)))
  })
  return(list(x = x2, y = y2))
}
Then, using the sample data:
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
lst <- adj_data_matplot(test[,1], test[,-1])
matplot(lst$x, lst$y, type = "b")
You could use the na.interpolation function from the imputeTS package (renamed na_interpolation in newer versions of the package):
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
library('imputeTS')
test <- na.interpolation(test, option = "linear")
matplot(test[,-1])
matlines(test[,1], test[,-1])
I also ran into the same issue today. In my context I was not permitted to interpolate. I am providing here a minimal, but sufficiently general, working example of what I did. I hope it helps someone:
mymatplot <- function(data, main = NULL, xlab = NULL, ylab = NULL, ...){
  # graphical set-up of the window
  plot.new()
  plot.window(xlim = c(1, ncol(data)), ylim = range(data, na.rm = TRUE))
  mtext(text = xlab, side = 1, line = 3)
  mtext(text = ylab, side = 2, line = 3)
  mtext(text = main, side = 3, line = 0)
  axis(1L)
  axis(2L)
  # plot the data, one row per series
  for (i in 1:nrow(data)) {
    nin.na <- !is.na(data[i, ])
    lines(x = which(nin.na), y = data[i, nin.na], col = i, ...)
  }
}
The core 'trick' is in x=which(nin.na). It aligns the data points of the line consistently with the indices of the x axis.
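A small sketch of that trick, plus a hedged usage example with the question's data (note the function expects one series per row, hence the transpose; converting to a matrix first avoids data.frame row-indexing quirks):
v <- c(1, NA, 3, NA, 5)
which(!is.na(v))   # [1] 1 3 5 -> x positions of the values that are present

test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
mymatplot(t(as.matrix(test[, -1])), xlab = "index", ylab = "value", lwd = 2)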
The lines
plot.new()
plot.window(xlim = c(1, ncol(data)), ylim = range(data, na.rm = TRUE))
mtext(text = xlab, side = 1, line = 3)
mtext(text = ylab, side = 2, line = 3)
mtext(text = main, side = 3, line = 0)
axis(1L)
axis(2L)
draw the graphical part of the window.
range(data, na.rm=TRUE) adapts the plot to a proper size being able to include all data points.
mtext(...) is used to label the axes and provides the main title. The axes themselves are drawn by the axis(...) command.
The following for-loop plots the data.
The function head of mymatplot provides the ... argument for optionally passing typical plot parameters such as lty, lwd, cex, etc. These are passed on to the lines.
A last word on the choice of colors: they are up to your taste.
In this example I'm trying to generate a random time series for 3 individuals at 4 time points (below x contains the 1st timepoints for each individual). I want the values to be randomly increasing rather than decreasing in time. Below is my current solution.
set.seed(0)
x <- rnorm(3)
x
[1] 1.2629543 -0.3262334 1.3297993
y <- c(x,
x*runif(1,.8,1.2),
x*runif(1,.9,1.3),
x*runif(1,1,1.4))
y
[1] 1.2629543 -0.3262334 1.3297993 1.4642135 -0.3782206 1.5417106 1.6138915 -0.4168839 1.6993107 1.5967772
[11] -0.4124631 1.6812906
This has some problems.
For each individual the same coefficient is used for calculating the values of the same timepoint, resulting in identical trends. How could I get a random coefficient for each multiplication? I could use lapply, but then the vector would be "grouped" by individuals, not by timepoints.
I don't wish to write the formulas for the later timepoints out separately and be so precise. Exact coefficients are not important; I just need the values to have a tendency to slightly increase, but occasional decreases should also be allowed. How could I extend the vector more "effectively"?
How can I make negative values also increase?
I managed to solve this thanks to Federico Manigrasso. The solution is below.
TimeSer <- function(num.id, years, init.val) {
  df <- data.frame(id = factor(rep(1:num.id, length(years))),
                   year = rep(years, each = num.id))
  yrs <- length(years) - 1
  minim <- seq(-.1, by = -.1, len = yrs)
  maxim <- seq(.4, by = .4, len = yrs)
  val <- list(init.val)
  for (i in 1:yrs) {
    val[[i + 1]] <- unlist(lapply(init.val, function(x) {
      x + (x * runif(1, minim[i], maxim[i]))
    }))
  }
  df$val <- unlist(val)
  df
}
df <- TimeSer(num.id = 3, years = 2006:2016, init.val = rnorm(3,1e5, 1e5))
Visual representation of the results:
num.id <- length(unique(df$id))
par(mfrow = c(1, num.id))
lapply(1:num.id, function(x) {
  plot(unique(df$year), df$val[df$id == x], type = 'l', col = x)
})
I suggest putting the output in a list; it is a lot less messy, and you can transform it into a vector later (using unlist).
This is how I would rewrite your code:
x <- rnorm(3)
time <- 3
output <- list(x)  # init output list with the initial data
par1 <- c(0.8, 0.9, 1)
par2 <- c(1.2, 1.3, 1.4)
for (i in 1:time) {
  a <- unlist(lapply(x, function(x) { x + runif(1, par1[i], par2[i]) }))
  output[[i + 1]] <- a
  x <- a
}
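The list can then be flattened into a single vector grouped by timepoint, like the y in the question:
y <- unlist(output)   # all individuals at t0, then all at t1, and so on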
Let me know if this solves all your problems.