R ffdfdply reset cumsum using data.table

Sorry for asking a basic question. I am using the ff package and imported the data with read.csv.ffdf. I have more than 50 million rows in Excel, and I want to compute a cumulative sum on one of the columns and reset it whenever it finds a 0. The code below generates the cumulative series, but I don't know how to access the current row.
idx <- ffdforder(i[c("a","c","b")])
ordered_i <- i[idx, ]
ordered_i$key_a_c_d <- ikey(ordered_i[c("a", "c","d")])
cumsum_i <- ffdfdply(ordered_i, split=as.character(ordered_i$key_a_c_d), FUN= function(x) {
x <- as.data.table(x)
if(x[**current row**, d]==0)
{
result <- x[,cumsum_a_c_d :=0]
}
else
{
result <- x[, cumsum_a_c_d := cumsum(d), by = list(key_a_c_d)]
}
as.data.frame(result)
}, trace=T)
I am using the data.table package to compute the cumulative sum. How can I access the current row in a data.table so that I can compare it with 0 and reset the cumsum? The expected output is shown below: Result is the cumulative sum of column d, reset at every 0.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
Thanks
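A row-by-row test is usually not needed for this kind of reset. A minimal sketch, assuming the rows are already in the order shown in the expected output: build a run counter that increases at every zero, then take the cumulative sum within each run. The name run_id is a hypothetical helper, and base R's ave() is used here in place of data.table so the example runs on its own.

```r
# Example rows from the question, in the order of the expected output
x <- data.frame(a = 1, b = c(1, 4, 6, 2, 5, 3), c = 1,
                d = c(1, 0, 1, 1, 0, 1))

# run_id (hypothetical helper) increases by one at every d == 0,
# so each zero starts a new run
run_id <- cumsum(x$d == 0)

# Cumulative sum of d within each (a, c, run) combination
x$Result <- ave(x$d, x$a, x$c, run_id, FUN = cumsum)
x$Result
```

Inside the function passed to ffdfdply, the same grouping idea should carry over to data.table by adding the run counter to by, for example x[, cumsum_a_c_d := cumsum(d), by = .(key_a_c_d, cumsum(d == 0))].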

Related

Irregular Interval based representation of survival data in R

I have the following dataset:
df =
id Time A
1 3 0
1 5 1
1 6 1
2 8 0
2 9 0
2 12 1
I want to do two things: i) have a starting time of -1 across all ids, and ii) split the time into two columns; start and end while preserving the time at which the individual got the observation A (setting end as the reference point). The final result should look something like this:
df =
id start end A
1 -1 0 0
1 0 2 1
1 2 3 1
2 -1 0 0
2 0 1 0
2 1 4 1
This does the trick for this data set. I wasn't 100% sure what was being asked from the description, so I went by the expected output shown. For future reference, please paste the output of dput(df) as the input data :)
df <- data.frame(id=c(rep(1,3),rep(2,3)),
Time=c(3,5,6,8,9,12),
A=c(0,1,1,0,0,1))
library(data.table)
dt <- as.data.table(df)
# diff(Time) gives the interval between consecutive time points within an id
# cumsum then accumulates these intervals, so end is the time elapsed since
# the first observation of that id
dt[, end := cumsum(c(0, diff(Time))), by=id]
# start is then just a shifted version of end, with the initial start filled as -1
dt[, start := shift(end, n=1, fill=-1), by=id]
out <- as.data.frame(dt)
out

How to iterate over a column and assign value based on another column without a for loop?

I am trying to efficiently assign a value to a column, based on another column, but without a for loop as this takes too long.
I'm doing something like this: If the reference column value is greater than a certain random number, I assign 1 to the new column. Otherwise, assign 0. Can't figure out the best way to do this without a loop. I tried dplyr and case_when, but that wasn't iterating over each row.
Thanks!
for (i in 1:nrow(data)) {
if (data$value[i] > runif(1, 0, 1.7)) {
temp$newValue[i] <- 1
} else{
temp$newValue[i] <- 0
}
}
c0=data.frame(c(1,4,6,3,7,3),c(2,8,2,4,9,4))
names(c0)=c("A","B")
c0$C=ifelse(c0[,"A"]>runif(1,0,1.7),1,0)
c0
I'm not sure I understood you correctly. Please comment if I have misunderstood anything.
A B C
1 2 0
4 8 1
6 2 1
3 4 1
7 9 1
3 4 1
Here is how I use A to generate C
Does this solve your problem?
DATA:
set.seed(1)
df <- data.frame(
refcol = rnorm(10)
)
randvalue <- 0
SOLUTION:
df$newcol <- ifelse(df$refcol > randvalue, 1, 0)
RESULT:
df
refcol newcol
1 0.2352207 1
2 -0.3307359 0
3 -0.3116238 0
4 -2.3023457 0
5 -0.1708760 0
6 0.1402782 1
7 -1.4974267 0
8 -1.0101884 0
9 -0.9484756 0
10 -0.4939622 0
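Note that the loop in the question draws a new random number for every row, whereas runif(1, 0, 1.7) inside ifelse draws a single threshold that is shared by all rows. If a separate threshold per row is what is wanted, a minimal sketch is to draw all of them in one call. The seed and the example values below are assumptions for illustration only.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

data <- data.frame(value = c(1, 4, 6, 3, 7, 3))

# One random threshold per row, drawn in a single vectorized call
thresholds <- runif(nrow(data), min = 0, max = 1.7)

# Element-wise comparison: 1 where value exceeds its own threshold, else 0
data$newValue <- as.integer(data$value > thresholds)
data
```

The comparison is vectorized, so no loop, ifelse or case_when is needed; as.integer turns TRUE/FALSE into 1/0.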

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the lines above run, but neither actually removes the all-zero columns.
I am not sure why that is. Any tips are greatly appreciated.
The OP found a solution while exchanging comments with me, but I want to leave an explanation here. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the anonymous function does two things. First, it subsets the columns of each data frame in my.list, but the result is never assigned to anything. Second, it evaluates x, and because a function returns the value of its last expression, the original, unmodified data frame is returned. The OP probably assumed that x was updated by the subsetting, so on the surface nothing appeared to change. Following the same approach, the subset has to be stored and then returned:
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
That is, create a temporary object inside the anonymous function and return it. Alternatively, return the subset directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, colSums(x) != 0 gives one logical value per column. Columns where it is TRUE are kept and the others are removed. Hope this helps future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
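The question mentions that the real data has non-numeric leading columns, on which colSums would fail, and a plain x[, keep] returns a vector rather than a data frame when only one column survives. A minimal sketch that guards against both, assuming non-numeric columns should always be kept; drop_zero_cols and the small example data frame are hypothetical names and data used only for illustration.

```r
# Hypothetical helper: drop numeric columns whose sum is 0,
# keep every non-numeric column untouched
drop_zero_cols <- function(x) {
  is_num <- vapply(x, is.numeric, logical(1))
  keep <- !is_num  # non-numeric columns are always kept
  keep[is_num] <- colSums(x[, is_num, drop = FALSE]) != 0
  x[, keep, drop = FALSE]  # drop = FALSE keeps the result a data frame
}

# Small made-up example with one non-numeric column
df <- data.frame(id = c("u", "v", "w"), p = c(0, 0, 0), q = c(1, 0, 2))
res <- lapply(list(df), drop_zero_cols)
names(res[[1]])
```

As in the answer above, the subset is the last expression of the function, so it is what lapply collects.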

How to use a vector in lag operation in R

In R, how can I use a vector instead of a single value in the lag function? For example, in Lag(x,k=2) I want to replace the 2 with a vector, because I want to lag each row by a different amount: one row could have a lag of 3, another a lag of 0, and so on.
Example:
a #lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3
I think you may need to write your own function for this task. I wrote one that should do what you need, or at least point you in the right direction:
x1 <- c(75,98,65,45,78,94,123,54) #a fake data set for us to lag
y1 <- c(2,3,1,4,1,2,3,5) #vector of values to lag by
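#the function below takes the data, x, and lags each element by the matching value in y
dynlag <- function(x,y) {
idx <- seq_along(x) - y #position of the lagged value for each element
idx[idx < 1] <- NA #a lag that reaches before the start of x gives NA
return(x[idx])
}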
#test out the function
dynlag(x1,y1)
hope this helps. :)
Here is a solution using index arithmetic:
D <- read.table(header=TRUE, text=
'a lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3')
i <- seq(length(D$a))
erg <- D$a[i - D$lags]
all.equal(erg, D$d)

R: ddply for an updating function

I have a silly function that updates the value of S once for each element of the vector ACC, by either delta1 or delta2.
Sstart=0 #a starting value for S
ACC=c(1,1,0,1,1) #accuracy: 0 or 1
f=c(1,1,1,1,0) #feedback: 0 or 1
ID=rep(1,5) #ID of the participant
delta1=seq(1,5,1)
delta2=seq(1,5,1)
m<-as.matrix(expand.grid(delta1=delta1, delta2=delta2)) #all the possible combination of delta1 and delta2
The function is the following. When the feedback (f) is 1, it updates S by delta1; when the feedback is 0, it updates S by delta2. delta1 and delta2 each range from 1 to 5, and I increment them separately.
silly_function<-function(delta1, delta2,ACC,f,Sstart){
S = Sstart
for (i in 1:length(ACC)){
if (ACC[i]==1 & f[i]==1){
S[i+1]=S[i]+delta1
}
else if (ACC[i]==1 & f[i]==0){
S[i+1]=S[i]+delta2
}
else if (ACC[i]==0){
S[i+1]=S[i]
}
}
return(S)
}
I call the function
N=length(delta1)*length(delta2)
SMat<-matrix(data=NA, nrow=N, ncol=(length(ACC)+1)) #matrix for the data
for (i in 1:N){
SMat[i,] <- silly_function(m[i,1],m[i,2],ACC,f,Sstart)}
My problem:
The function works perfectly for one subject, but I cannot find a clever way to apply it to each of my subjects separately (the data from all participants are in a single data frame) and combine the results into one matrix or data frame. I wanted to use ddply from the plyr package, but I couldn't find an example similar enough to mine to adapt it to this case.
Thank you very much for your comments/hints in advance!
My input for two participants
ID Feedback ACC
1 1 1
1 1 1
1 1 0
1 1 1
1 0 1
2 1 1
2 1 1
2 0 1
2 1 0
2 1 1
Actual output
V1 V2 V3 V4 V5 V6
0 1 2 2 3 4 #row1
0 2 4 4 6 7
.
.
0 5 10 10 15 20 #row25
For subject 1 the output is:
25 * 6 (rows*columns) matrix: 25 rows because I update S by all possible combinations of delta1 and delta2. the first column is always 0 because Sstart is 0.
Desired output
Basically the same as for subject 1 but with all the subjects data
1-25 rows for subject 1
26-50 rows for subject 2...
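One possible way to get this layout is to split the data frame by ID, run all delta combinations for each subject, and stack the results. The sketch below uses base R split/lapply/rbind in place of ddply so that it runs without extra packages. run_subject and dat are hypothetical names, the update rule is restated from the question, and the last branch is written as a plain else on the assumption that ACC is always 0 or 1.

```r
# Update rule restated from the question so the block runs on its own
silly_function <- function(delta1, delta2, ACC, f, Sstart) {
  S <- Sstart
  for (i in seq_along(ACC)) {
    if (ACC[i] == 1 && f[i] == 1) {
      S[i + 1] <- S[i] + delta1
    } else if (ACC[i] == 1 && f[i] == 0) {
      S[i + 1] <- S[i] + delta2
    } else {
      S[i + 1] <- S[i]  # ACC is 0: S is carried forward unchanged
    }
  }
  S
}

# Input for the two participants shown in the question
dat <- data.frame(
  ID       = rep(1:2, each = 5),
  Feedback = c(1, 1, 1, 1, 0, 1, 1, 0, 1, 1),
  ACC      = c(1, 1, 0, 1, 1, 1, 1, 1, 0, 1)
)

# All 25 combinations of delta1 and delta2
m <- expand.grid(delta1 = 1:5, delta2 = 1:5)

# Hypothetical helper: one row per delta combination for a single subject
run_subject <- function(d) {
  S <- t(mapply(silly_function, m$delta1, m$delta2,
                MoreArgs = list(ACC = d$ACC, f = d$Feedback, Sstart = 0)))
  data.frame(ID = d$ID[1], m, S)
}

# Rows 1-25 belong to subject 1, rows 26-50 to subject 2
result <- do.call(rbind, lapply(split(dat, dat$ID), run_subject))
dim(result)
```

Because run_subject takes one subject's rows and returns a data frame, it should also be usable as the function argument of plyr's ddply with ID as the splitting variable, although that variant is not shown here.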
