Let's say I have a simple toy vector in R like:
x = seq(1:10);x
[1] 1 2 3 4 5 6 7 8 9 10
I want to use the rollapply function from the zoo package, but in a different way. rollapply applies a function to a vector x, using the width argument as a rolling window. Instead of a rolling window, I want an expanding one. There are similar questions here and here, but they don't help me with my problem.
For example, I want to calculate the sum of the first 5 observations of vector x and then expand the window in steps of 2.
To do so, I tried:
rollapplyr(x, seq_along(x) ,sum,by=2,partial = 5,fill=NA)
[1] NA NA NA NA 15 21 28 36 45 55
or replace the NA's
na.locf0(rollapplyr(x, 5 ,sum,by=2,partial = 5,fill=NA))
[1] NA NA NA NA 15 15 25 25 35 35
But what I ideally want as a result is:
[1] NA NA NA NA 15 15 28 28 45 45
Imagine that my dataset is huge (it contains 2500 time series observations) and that the function is some econometric or statistical model, not a simple one like the sum that I use here.
How can I do this?
x <- seq(10)
expandapply <- function(x, start, by, FUN){
  # set points to apply function up to
  checkpoints <- seq(start, length(x), by)
  # apply function to all windows
  vals <- sapply(checkpoints, function(i) FUN(x[seq(i)]))
  # fill in numeric vector at these points (assumes output is numeric)
  out <- replace(rep(NA_real_, length(x)), checkpoints, vals)
  # forward-fill the gaps
  zoo::na.locf(out, na.rm = FALSE)
}
expandapply(x, start = 5, by = 2, FUN = sum)
#> [1] NA NA NA NA 15 15 28 28 45 45
Created on 2022-03-13 by the reprex package (v2.0.1)
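For reference, the same expanding-window idea can be sketched in base R alone, so that the forward fill does not depend on zoo. The name expand_apply_base is hypothetical, and the sketch assumes FUN returns a single number per window.

```r
# Hypothetical base-R variant of expandapply(); assumes FUN returns one number.
expand_apply_base <- function(x, start, by, FUN) {
  n <- length(x)
  # positions at which the expanding window is re-evaluated
  checkpoints <- seq(start, n, by)
  # apply FUN to x[1..i] for every checkpoint i
  vals <- vapply(checkpoints, function(i) FUN(x[seq_len(i)]), numeric(1))
  # for each position, index of the latest checkpoint at or before it (0 = none yet)
  last_cp <- findInterval(seq_len(n), checkpoints)
  out <- rep(NA_real_, n)
  filled <- last_cp > 0
  out[filled] <- vals[last_cp[filled]]
  out
}

expand_apply_base(1:10, start = 5, by = 2, FUN = sum)
#> [1] NA NA NA NA 15 15 28 28 45 45
```

Because FUN is only called once per checkpoint, an expensive model fit runs length(checkpoints) times rather than once per observation.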
Define nonNA as the positions which should not be NA. You can change x and nonNA to whatever you need.
Then assign to w a vector of widths, using zero for those components which are to be NA. Finally, apply na.locf0.
(The two extreme cases are that if nonNA is seq_along(x) so that all elements are not to be NA'd out then this is the same as rollapplyr(x, seq_along(x), sum) and if nonNA is c() so that there are no non-NAs then it returns all NAs.)
library(zoo)
x <- 1:10
nonNA <- seq(5, length(x), 2)
w <- ifelse(seq_along(x) %in% nonNA, seq_along(x), 0)
na.locf0(rollapplyr(x, w, function(x) if (length(x)) sum(x) else NA, fill=NA))
## [1] NA NA NA NA 15 15 28 28 45 45
Another way is to use a list for the width= argument of rollapply whose components contain the offsets. x and nonNA are from above.
L <- lapply(seq_along(x), function(x) if (x %in% nonNA) -seq(x-1, 0))
na.locf0(rollapplyr(x, L, sum, fill = NA))
## [1] NA NA NA NA 15 15 28 28 45 45
Update
Simplified solution and added second approach.
Related
I am currently working on a data frame with raw numeric data in columns. Each column contains data for one parameter (for example, gene expression data for gene xyz), and each row contains one subject. Some of the columns are normally distributed, while others are far from it. I ran Shapiro-Wilk tests using apply with MARGIN = 2 for different transformations and then picked a suitable transformation for each column by comparing shapiro.test()$p.value. I stored my picks as strings in a vector, giving me a vector of NA, log10, sqrt with length ncol(DataFrame). I now wonder whether it is possible to apply this vector to the data frame via an apply function or, if necessary, a for loop. How do I do this, or is there a better way? I guess I could loop over if-else statements, but there has to be a more efficient way because my code is already slow.
Thanks all!
Update: I tried the code below but it is giving me "Error in file(filename, "r") : invalid 'description' argument"
TransformedExampleDF <- apply(exampleDF, 2 , function(x) eval(parse(paste(transformationVector , "(" , x , ")" , sep = "" ))))
exampleDF <- as.data.frame(matrix(c(1,2,3,4,1,10,100,1000,0.1,0.2,0.3,0.4), ncol=3, nrow = 4))
transformationVector <- c(NA, "log10", NA)
You could do something like this. In the example below, I've cooked up four arbitrary functions whose names I've then stored in the character vector my_func_list (note: the last function converts data to NA; that is intentional).
Then I created another function, func_to_df(), that accepts the data.frame and the vector of function names (func_list) as inputs and applies each function (looked up by name with get()) to the corresponding column of the data.frame. The output is returned and, in this example, stored in the data.frame my_df1.
tl;dr: just look at what func_to_df() does. It might also be worthwhile looking into the purrr package (although it hasn't been used here).
#---------------------
#Example function 1
myaddtwo <- function(x){
if(is.numeric(x)){
x = x+2
} else{
warning("Input must be numeric!")
}
return(x)
#Constraints such as the one shown above
#can be added elsewhere to prevent
#inappropriate action
}
#Example function 2
mymulttwo <- function(x){
return(x*2)
}
#Example function 3
mysqrt <- function(x){
return(sqrt(x))
}
#Example function 4
myna <- function(x){
return(NA)
}
#---------------------
#Dummy data
my_df <- data.frame(
matrix(sample(1:100, 40, replace = TRUE),
nrow = 10, ncol = 4),
stringsAsFactors = FALSE)
#User somehow ascertains that
#the following order of functions
#is the right one to be applied to the data.frame
my_func_list <- c("myaddtwo", "mymulttwo", "mysqrt", "myna")
#---------------------
#A function which applies
#the functions from func_list
#to the columns of df
func_to_df <- function(df, func_list){
  # seq_along() is safe even if func_list is empty (1:length() would give 1:0)
  for(i in seq_along(func_list)){
    df[, i] <- get(func_list[i])(df[, i])
    #Alternative to get()
    #df[, i] <- eval(as.name(func_list[i]))(df[, i])
  }
  return(df)
}
#---------------------
#Execution
my_df1 <- func_to_df(my_df, my_func_list)
#---------------------
#Output
my_df
# X1 X2 X3 X4
# 1 8 85 6 41
# 2 45 7 8 65
# 3 34 80 16 89
# 4 34 62 9 31
# 5 98 47 51 99
# 6 77 28 40 72
# 7 24 7 41 46
# 8 45 80 75 30
# 9 93 25 39 72
# 10 68 64 87 47
my_df1
# X1 X2 X3 X4
# 1 10 170 2.449490 NA
# 2 47 14 2.828427 NA
# 3 36 160 4.000000 NA
# 4 36 124 3.000000 NA
# 5 100 94 7.141428 NA
# 6 79 56 6.324555 NA
# 7 26 14 6.403124 NA
# 8 47 160 8.660254 NA
# 9 95 50 6.244998 NA
# 10 70 128 9.327379 NA
#---------------------
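The same idea can be applied directly to the exampleDF and transformationVector from the question. This is only a sketch: the helper name apply_named_transform is hypothetical, and it assumes that NA in the vector means "leave the column unchanged". As an aside, the first positional argument of parse() is file rather than text, which likely explains the error quoted in the question.

```r
exampleDF <- as.data.frame(matrix(c(1, 2, 3, 4, 1, 10, 100, 1000, 0.1, 0.2, 0.3, 0.4),
                                  ncol = 3, nrow = 4))
transformationVector <- c(NA, "log10", NA)

# Hypothetical helper: NA means "no transformation", anything else is a function name.
apply_named_transform <- function(column, fun_name) {
  if (is.na(fun_name)) column else match.fun(fun_name)(column)
}

# Map() walks the columns and the function names in parallel;
# the column names of exampleDF are carried over to the result.
TransformedExampleDF <- as.data.frame(
  Map(apply_named_transform, exampleDF, transformationVector)
)
TransformedExampleDF$V2
#> [1] 0 1 2 3
```

Looking functions up by name with match.fun() avoids building and evaluating code strings.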
I need to take an existing vector and create a new vector that contains the values;
$(x_1 + 2x_2 - x_3,\ x_2 + 2x_3 - x_4,\ \ldots,\ x_{n-2} + 2x_{n-1} - x_n)$
I've tried using xVec[n-2] + 2* xVec[n-1] - xVec[n] but this doesn't work!
Without zoo:
n <- 10
xVec <- seq(n)
idx <- seq(1, n-2)
xVec[idx] + 2* xVec[idx+1] - xVec[idx+2]
[1] 2 4 6 8 10 12 14 16
You need a rolling calculation, something that the zoo package provides:
vec <- 1:10
zoo::rollapply(vec, width = 3, FUN = function(z) z[1]+2*z[2]-z[3])
# [1] 2 4 6 8 10 12 14 16
Validation, using first three and last three:
1 + 2*2 - 3
# [1] 2
8 + 2*9 - 10
# [1] 16
Explanation: each time the function (passed to FUN=) is called, it is given a vector with width= elements in it. The first call is effectively z=1:3, the second call z=2:4, third z=3:5, etc.
You should know that by default it will return length(vec) - width + 1 elements in its return value. You can control this with fill= and align= arguments:
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA)
# [1] NA 2 4 6 8 10 12 14 16 NA
zoo::rollapply(1:10, width = 3, FUN = function(z) z[1]+2*z[2]-z[3], fill = NA, align = "right")
# [1] NA NA 2 4 6 8 10 12 14 16
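To see the windows described in the explanation without zoo, base R's embed() lays them out as rows of a matrix. A small sketch; note that embed() stores each window newest value first, so the weights are listed in reversed order.

```r
vec <- 1:10
# Each row of embed(vec, 3) is one window, newest value first:
# row i holds vec[i + 2], vec[i + 1], vec[i]
windows <- embed(vec, 3)
head(windows, 2)
#>      [,1] [,2] [,3]
#> [1,]    3    2    1
#> [2,]    4    3    2

# Weights in the same reversed order: -1 (newest), 2 (middle), 1 (oldest)
result <- as.vector(windows %*% c(-1, 2, 1))
result
#> [1]  2  4  6  8 10 12 14 16
```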
In a comment, B. Go has suggested to "reshape" the vector and wonders if this can be done in R as well.
In R, two packages provide functions to shift the elements of a vector: data.table and dplyr. (The lag() function that ships with R, in the stats package, is intended for time series objects.)
data.table
x <- 1:10
library(data.table)
shift(x, 2L) + 2 * shift(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
dplyr
x <- 1:10
library(dplyr)
lag(x, 2L) + 2 * lag(x) - x
[1] NA NA 2 4 6 8 10 12 14 16
By default, both functions pad the positions vacated by the shift with NA. This explains why the first two elements of the result vector are NA.
To get rid of the leading NAs, the tail() function can be used, e.g.,
tail(shift(x, 2L) + 2 * shift(x) - x, -2L)
[1] 2 4 6 8 10 12 14 16
If you are up for a bit of matrix math:
xVec <- 1:10
linear_combo <- c(1, 2, -1)
m <- matrix(0, length(xVec), length(xVec))
for (index in seq_along(linear_combo)) {
m[row(m) == col(m) - index + 1] <- linear_combo[index]
}
m %*% xVec
Note in this case the last two elements are incomplete and should probably be dropped or replaced by NA.
head(m %*% xVec, -(length(linear_combo) - 1))
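The banded matrix above amounts to a convolution of xVec with the weights, which stats::filter() can compute without building the matrix. A sketch, assuming the default convolution method; the coefficients run from the newest value to the oldest, so the result is aligned to the end of each window.

```r
xVec <- 1:10
# With sides = 1 the filter computes
#   y[i] = f[1] * x[i] + f[2] * x[i - 1] + f[3] * x[i - 2]
# so c(-1, 2, 1) yields x[i - 2] + 2 * x[i - 1] - x[i].
res <- as.numeric(stats::filter(xVec, filter = c(-1, 2, 1), sides = 1))
res
#> [1] NA NA  2  4  6  8 10 12 14 16
```

Here the incomplete windows appear as leading NAs rather than as partial sums at the end.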
I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column ("Iterate over a column ignoring but retaining NA values in R") and for multiple-column operations (e.g. "Summarizing multiple columns with dplyr?").
However, I haven't managed to combine the approaches suggested on SO (i.e. apply diff over multiple columns while ignoring the NAs).
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df) # convert to a data.table by reference; the := syntax below requires it
This is how it works for a single column:
diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
And this is how I apply a diff function over several columns with lapply:
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd very much appreciate any suggestion (base, data.table, dplyr, ...) on how to build a valid !is.na or similar condition into this second line of code.
Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
which_nna <- which(!is.na(x))
out <- rep(NA_integer_, length(x))
out[which_nna] <- x[which_nna] - shift(x[which_nna])
out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18
So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
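Since the question also asks for base solutions, here is a sketch of the same helper written with diff() instead of data.table's shift(). The name lag_diff_base is hypothetical.

```r
df <- data.frame(x = c(1:3, NA, NA, 9:7),
                 y = c(NA, 4:6, NA, 15:13),
                 z = c(6, 2, 7, 14, 20, NA, NA, 2))

# Hypothetical helper: difference between consecutive non-NA values, NAs kept in place.
lag_diff_base <- function(v) {
  keep <- !is.na(v)
  out <- rep(NA_real_, length(v))
  # the first non-NA value has no predecessor, hence the leading NA
  out[keep] <- c(NA, diff(v[keep]))
  out
}

# Add one lag_diff_* column per original column
df[paste0("lag_diff_", names(df))] <- lapply(df, lag_diff_base)
df$lag_diff_x
#> [1] NA  1  1 NA NA  6 -1 -1
```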
Can somebody please help me with a recode from SPSS into R?
SPSS code:
RECODE variable1
(1,2=1)
(3 THRU 8 =2)
(9, 10 =3)
(ELSE = SYSMIS)
INTO variable2
I can create new variables with the different values. However, I'd like it to be in the same variable, as SPSS does.
Many thanks.
x <- y<- 1:20
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y[x %in% (1:2)] <- 1
y[x %in% (3:8)] <- 2
y[x %in% (9:10)] <- 3
y[!(x %in% (1:10))] <- NA
y
[1] 1 1 2 2 2 2 2 2 3 3 NA NA NA NA NA NA NA NA NA NA
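When the codes are whole numbers, as in the SPSS example, the recode can also be sketched as a single cut() call that overwrites the variable in place. This is an assumption-laden shortcut: cut() works on intervals, so a non-integer value such as 2.5 would land in the second group, whereas the SPSS value list (1,2=1) would send it to ELSE.

```r
variable1 <- -1:11
# Intervals are (0, 2], (2, 8] and (8, 10]; labels = FALSE returns the interval number,
# and anything outside the breaks becomes NA (the ELSE = SYSMIS case).
variable1 <- cut(variable1, breaks = c(0, 2, 8, 10), labels = FALSE)
variable1
#> [1] NA NA  1  1  2  2  2  2  2  2  3  3 NA
```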
I wrote a function whose syntax is very similar to the SPSS RECODE command. See here
variable1 <- -1:11
recodeR(variable1, c(1, 2, 1), c(3:8, 2), c(9, 10, 3), else_do= "missing")
NA NA 1 1 2 2 2 2 2 2 3 3 NA
This function now works also for other examples. This is how the function is defined
recodeR <- function(vec_in, ..., else_do){
  l <- list(...)
  # extract the "from" values (all but the last element of each rule)
  from_vec <- unlist(lapply(l, function(x) x[-length(x)]))
  # extract the "to" values (the last element, repeated once per "from" value)
  to_vec <- unlist(lapply(l, function(x) rep(x[length(x)], length(x)-1)))
  # recode the variable; plyr must be installed for mapvalues
  vec_out <- plyr::mapvalues(vec_in, from_vec, to_vec)
  # if "missing" is given, every value not listed in a rule becomes NA
  # (like ELSE = SYSMIS, also for unlisted values inside the range such as 2.5).
  # Otherwise unlisted values stay the same
  if(else_do == "missing"){
    vec_out[!(vec_in %in% from_vec)] <- NA
  }
  # return resulting vector
  return(vec_out)
}
Beginner here.
I am trying to write a for loop to adjust a score.
The first part of the for loop creates a conditional output which should be passed to the second part of the for loop which has a lookup function.
In the first data.frame, a, there are 2 columns
score (participant score)
problem (problems with the participant score)
In the second data.frame, b, there are 5 columns
score (participant score)
q (adjusted score when problem = 1)
r (adjusted score when problem = 2)
s (adjusted score when problem = 0)
t (adjusted score when problem > 2)
In the first for loop, I find the index where
a$score = b$score
Then in the second loop, I pass the index to another loop.
Based on the value in a$problem, the loop returns the correct adjusted value in (q, r, s, t).
Here is data.frame a
id score problem
1 11 1
2 12 6
3 13 2
4 14 0
5 NA NA
Here is data.frame b
score q r s t
11 12 13 11 NA
12 14 15 12 NA
13 16 20 13 NA
14 18 22 14 NA
NA NA NA NA NA
I would like the output of the function to be a new column in a, a$adjusted
Here is the function I have been trying,
adjust <- function (y, z){
# y = problem
# z = score
for(j in z){
index <- sapply(j, function(x) b$score %in% x)
for (i in y){
ifelse(i > 2,
z(i) <- b[index, 5],
ifelse(i == 2,
z(i) <- b[index, 3],
ifelse(i == 1,
z(i) <- b[index, 2],
ifelse( i == 0,
z(i) <- b[index, 4],
z(i) <- b[index, 5]))))
print(z(i))
}
}
}
This is still new for me.
Not sure where I'm going wrong.
When I assign:
a$adjusted <- adjust(a$problem, a$score)
Nothing happens
Any and all help very much appreciated here.
To simplify the nested ifelse statements I used the case_when function from the dplyr package. I also used the match function to simplify the inner loop (i.e. sapply)
a<-read.table(header=TRUE, text="id score problem
1 11 1
2 12 6
3 13 2
4 14 0
5 NA NA")
b<-read.table(header=TRUE, text="score q r s t
11 12 13 11 NA
12 14 15 12 NA
13 16 20 13 NA
14 18 22 14 NA
NA NA NA NA NA")
library(dplyr)
#find column name associated with problem score if NA use the score column
colname<-case_when(
a$problem==0 ~ "s",
a$problem==1 ~ "q",
a$problem==2 ~ "r",
a$problem >2 ~ "t",
is.na(a$problem) ~"score"
)
# find the corresponding row in table b to match data frame a
rowscore<-match(a$score, b$score)
#column name and rowscore are the same length
#loop through the column name/row name to find the adjusted score
a$adjusted<-sapply(1:length(rowscore), function(i){b[[colname[i]]][rowscore[i]]} )
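The final lookup can also be sketched without an explicit loop by indexing b with a two-column matrix of (row, column) positions. This version uses only base R and the data from the question; the helper name col_for_problem is hypothetical, and it mirrors the case_when() mapping above.

```r
a <- data.frame(id = 1:5,
                score = c(11, 12, 13, 14, NA),
                problem = c(1, 6, 2, 0, NA))
b <- data.frame(score = c(11, 12, 13, 14, NA),
                q = c(12, 14, 16, 18, NA),
                r = c(13, 15, 20, 22, NA),
                s = c(11, 12, 13, 14, NA),
                t = NA_real_)

# Hypothetical helper: column of b to read for each problem code (NA -> "score")
col_for_problem <- function(p) {
  ifelse(is.na(p), "score",
         ifelse(p == 0, "s",
                ifelse(p == 1, "q",
                       ifelse(p == 2, "r", "t"))))
}

rows <- match(a$score, b$score)                       # row of b for each participant
cols <- match(col_for_problem(a$problem), names(b))   # column of b for each participant
# A two-column index matrix picks one cell per participant
a$adjusted <- as.matrix(b)[cbind(rows, cols)]
a$adjusted
#> [1] 12 NA 20 14 NA
```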