How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem is a running count of consecutive absences by student that resets to zero at each attendance unless the count has already reached 5.
Example Data:
> df
row Student Absent Consecutive.Absences
  1       A      0                    0
  2       A      1                    1
  3       A      1                    2
  4       A      0                    0   <- resets to zero if under 5
  5       A      0                    0
  6       A      1                    1
  7       A      1                    2
  8       A      1                    3
  9       B      1                    1   <- starts over for new factor (Student)
 10       B      1                    2
 11       B      0                    0
 12       B      1                    1
 13       B      1                    2
 14       B      1                    3
 15       B      1                    4
 16       B      0                    0
 17       B      1                    1
 18       B      1                    2
 19       B      1                    3
 20       B      1                    4
 21       B      1                    5
 22       B      0                    5   <- gets locked at 5
 23       B      0                    5
 24       B      1                    6
 25       B      1                    7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions; half of them do nothing and the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set).
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., FUN = cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
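For concreteness, grouping on Student alone gives this on the example data (a sketch, assuming the frame above is stored as df):
with(df, ave(Absent, Student, FUN = cumsum))
# [1] 0 1 2 2 2 3 4 5 1 2 2 3 4 5 6 6 7 8 9 10 11 11 11 12 13
Every absence the student ever had is counted, gaps or not.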
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This Excel formula is shown without the 5-absence lock for readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element of a vector while making use of the previous element's result, you might want to check out Reduce, with the accumulate parameter set to TRUE.
Here's an example:
## define your function that takes two parameters:
## the 'previous' result and the 'current' element
runSum <- function(sum, x) {
  res <- 0
  if (x == 1) {
    res <- sum + 1
  } else if (x == 0 & sum < 5) {
    res <- 0
  } else {
    res <- sum
  }
  res
}
# let's look at the Absent values for student B
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(f = runSum, x = x, accumulate = TRUE)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
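To run this per student on the original data, one way (a sketch, assuming the example data frame is called df) is to combine it with ave:
# run the fold within each student, starting each student's total at 0;
# init = 0 supplies the starting value and [-1] drops it from the accumulated output
df$Consecutive.Absences <- ave(df$Absent, df$Student,
  FUN = function(v) Reduce(f = runSum, x = v, accumulate = TRUE, init = 0)[-1])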
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) {r <- rle(x); r$values[r$lengths >= 5]})
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) {r <- rle(x); 1 %in% r$values[r$lengths >= 5]})
A B
FALSE TRUE
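If you also want the length of each student's longest streak of absences, rather than just whether a run of 5 or more exists, a small variation on the same rle idea works (a sketch):
tapply(dfrm$Absent, dfrm$Student, function(x) {
  r <- rle(x)
  max(c(0, r$lengths[r$values == 1]))   # 0 for a student with no absences at all
})
# A B
# 3 5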
I also struggled through to a Reduce solution (but am second in priority to #kithpradhan):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y) if (x[1] >= 5) {
               y + x[1]
             } else {
               x[1] * y + y   # resets to 0 if y = 0
             },
             XX, accumulate = TRUE))
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  while (!identical(f(x), x) & i <= maxiter) {
    x <- f(x)
    i <- i + 1
  }
  x
}
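A hypothetical usage, just to show the fixed-point behavior (the function and starting values here are made up for illustration):
# repeatedly halve anything above 1 until the vector stops changing
ireduce(function(v) ifelse(v > 1, v / 2, v), c(8, 3, 0.5))
# [1] 1.00 0.75 0.50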
Related
The task that I want to complete is the following:
I have a dataset with hundreds of variables. I need to recode all of them following the same logic: if the GIVEN VARIABLE == 0 and the SPECIFIC VARIABLE == 1, the GIVEN VARIABLE must be set to -1. The SPECIFIC VARIABLE is the same for all of them.
What I have done is the following:
library(data.table)
set.seed(123)
data <- data.table(a = 0:10, b = 0:10, c = 0:10, d = 1:0)
Here "d" is the SPECIFIC VARIABLE and a:c are the GIVEN VARIABLEs
list_variables <- names(data)
list_variables_v2 <- list_variables[-c(4)]
I extracted the variable names from the dataset (minus d) and put them in a character vector, so they can be fed into the loop.
data_v1 <- copy(data)
for(i in (list_variables_v2)) {
data_v1[(i) == 0 & d == 1, (i) := -1]
}
Problematically, when I run the loop nothing happens: the variables that meet the condition (e.g. a == 0 & d == 1) are not recoded to -1. Several things could be going wrong, but I think I have narrowed it down to one. Potential problems:
a) The code, even outside the loop, does not work. But this is not true. The following code produces the expected result:
data_v1[a == 0 & d == 1, a := -1]
b) The loop is not working, hence the variable names are not being resolved and recognized. Nonetheless, if I exclude the (i) == 0 condition, the code does work, implying that the loop works on the assignment (right-hand) side:
for(i in (list_variables_v2)) {
data_v1[d == 1, (i) := -1]
}
I think the root of the problem is that R, on the row-filtering side, is not recognizing (i) == 0 as e.g. a == 0. This is quite odd given that, on the assignment side (columns), it does recognize (i) := -1 as e.g. a := -1. Any idea what might be causing this and, hopefully, how to solve it?
Again, many many thanks, and please let me know if something is unclear or repeated.
A simple correction would be to wrap i in get() on the filtering side (a set()-based variant is also sketched after the output below):
for(i in (list_variables_v2)) {
data_v1[get(i) == 0 & d == 1, (i) := -1]
}
Output:
> data_v1
a b c d
<int> <int> <int> <int>
1: -1 -1 -1 1
2: 1 1 1 0
3: 2 2 2 1
4: 3 3 3 0
5: 4 4 4 1
6: 5 5 5 0
7: 6 6 6 1
8: 7 7 7 0
9: 8 8 8 1
10: 9 9 9 0
11: 10 10 10 1
> data
a b c d
<int> <int> <int> <int>
1: 0 0 0 1
2: 1 1 1 0
3: 2 2 2 1
4: 3 3 3 0
5: 4 4 4 1
6: 5 5 5 0
7: 6 6 6 1
8: 7 7 7 0
9: 8 8 8 1
10: 9 9 9 0
11: 10 10 10 1
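If there are hundreds of columns and many rows, a variant of the same fix that uses set() instead of [.data.table inside the loop may be faster; a sketch with the same objects as above:
for (i in list_variables_v2) {
  rows <- which(data_v1[[i]] == 0 & data_v1$d == 1)   # rows to recode in column i
  set(data_v1, i = rows, j = i, value = -1L)          # update by reference
}
The result is the same data_v1 as shown above.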
I am working with a RasterBrick "a" with thousands of layers; a closer description is not necessary for my problem. I am using the following function to create a RasterLayer with the total number of runs of at least 5 days with values greater than 1 (one layer in the brick is one day):
library(raster)
indices <- rep(1:69, each = 90)
ff <- function(x, na.rm = TRUE){
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Y <- stackApply(a, indices, fun = ff)
This works great; I tested it. In a similar manner, I wrote a new function:
fff <- function(x, na.rm = TRUE){
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  mean(n[n >= 5])
}
X <- stackApply(a, indices, fun = fff)
Using this function, I wanted to create a RasterLayer of the average length of those runs of at least 5 days. It seems reasonable, but it does not work correctly. For example, when there is a run of 6 days (satisfying my criterion of value > 1), it counts two runs, one of 5 and one of 6, and thus the average is 5.5 instead of 6. I am not sure how to adjust my function fff. If there is a way to do it, that would be great; otherwise I would be grateful if anyone shares another way to calculate the mean length of those runs. Thanks!
In the future, please include a minimal, reproducible, self-contained example. Do not describe the behavior of your code; show it. Also, be very clear about the question. As-is, it is hard to see that your question is not about raster data at all, as you are looking for a function that works on any numeric vector (which you can then apply to raster data).
You are looking for a function that finds local maxima of at least 5 in the cumulative sum of neighbors that are > 1, and then averages these local maxima.
You have this function
ff <- function(x, na.rm = TRUE){
  y <- x > 1
  n <- ave(y, cumsum(y == 0), FUN = cumsum)
  sum(n == 5)
}
Example data
x <- c(-1:10, -1:3, -1:6)
x
# [1] -1 0 1 2 3 4 5 6 7 8 9 10 -1 0 1 2 3 -1 0 1 2 3 4 5 6
ff(x)
# [1] 2
(two runs whose cumulative count reaches at least 5)
To write the function you want we can start with what we have
y <- x > 1
n <- ave(y, cumsum(y == 0), FUN = cumsum)
n
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 0 0 0 1 2 0 0 0 1 2 3 4 5
In this case, you need to find the numbers 9 and 5. You can start with
n[n<5] <- 0
n
# [1] 0 0 0 0 0 0 0 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0 0 0 5
And now we can use diff to find the local maxima. These are the values that are followed by a smaller value, i.e. where the difference with the next value is negative. Note the zero appended to n so that the last element of the vector is also considered.
i <- which(diff(c(n, 0)) < 0)
i
# [1] 12 25
n[i]
# [1] 9 5
Such that we can put the above together in a function like this
f <- function(x) {
y <- x > 1
n <- ave(y,cumsum(y == 0), FUN = cumsum)
n[n<5] <- 0
i <- which(diff(c(n, 0)) < 0)
mean(n[i])
}
f(x)
# [1] 7
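As a quick sanity check on the case raised in the question, a single run of 6 days above 1 should now average to 6 rather than 5.5 (x6 is a made-up vector):
x6 <- c(0, 2, 2, 2, 2, 2, 2, 0)   # one run of length 6 with values > 1
f(x6)
# [1] 6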
If you have NAs you may do
f <- function(x) {
y <- x > 1
y[is.na(y)] <- FALSE
n <- ave(y,cumsum(y == 0), FUN = cumsum)
n[n<5] <- 0
i <- which(diff(c(n, 0)) < 0)
mean(n[i])
}
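The adjusted function can then be plugged into the same stackApply call as in the question:
X <- stackApply(a, indices, fun = f)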
I want to calculate the choice-switching probability by group (user in the code below) first. Then I will average the group-level probabilities to get an overall probability. I have tens of thousands of groups, so I need the code to be fast. My current code is a for loop, which takes more than 10 minutes to run. I did the same logic in Excel and it takes only a few seconds.
The switching from choice m to n for a particular user is defined as the share of observations whose choice is n at period t and m at period t-1.
My original code tags the first and last purchase with a for loop, then uses another for loop to build the switching matrix. I am only able to create the switching matrix for the whole data set, not by group. Even so, it is still very slow, and adding the user dimension would make it even slower.
t <- c(1, 2, 1, 1, 2, 3, 4, 5)
user <- c('A', 'A', 'B', 'C', 'C', 'C', 'C', 'C')
choice <- c(1, 1, 2, 1, 2, 1, 3, 3)
dt <- data.frame(t, user, choice)
t user choice
1 A 1
2 A 1
1 B 2
1 C 1
2 C 2
3 C 1
4 C 3
5 C 3
# **step one** create a second choice column for later construction of the switching matrix
#Label first purchase and last purchase is zero
for (i in 1:nrow(dt))
{ ifelse (dt$user[i+1]==dt$user[i],dt$newcol[i+1]<-0,dt$newcol[i+1]<-1) }
# **step two** create switching matrix
# switching.m is an empty matrix with the size of the total number of choices: 3x3 here
length(unique(dt$user))
total.choice<-3
switching.m<-matrix(0,nrow=total.choice,ncol=total.choice)
for (i in 1:total.choice)
{
for(j in 1:total.choice)
{
if(length(nrow(switching.m[switching.m[,1]==i& switching.m[,2]==j,])!=0))
{switching.m[i,j]=nrow(dt[dt[,1]==i&dt[,2]==j,])}
else {switching.m[i,j]<0}
}
}
The desired output for a particular user/group is like this. The output should have the same matrix size even if the user never makes a particular choice:
# take user C
#output for switching matrix
      second choice
first   1   2   3
    1   0   1   1
    2   1   0   0
    3   0   0   1
#output for switching probability
      second choice
first   1   2   3
    1   0 0.5 0.5
    2   1   0   0
    3   0   0   1
We could use table and prop.table after splitting by 'user'.
lst <- lapply(split(dt, dt$user), function(x)
table(factor(x$choice, levels= 1:3), factor(c(x$choice[-1], NA), levels=1:3)))
As mentioned by #nicola, it is more compact to split the 'choice' column by 'user'
lst <- lapply(split(dt$choice, dt$user), function(x)
table(factor(x, levels = 1:3), factor(c(x[-1], NA), levels = 1:3)))
lst$C
# 1 2 3
#1 0 1 1
#2 1 0 0
#3 0 0 1
prb <- lapply(lst, prop.table, 1)
prb$C
# 1 2 3
# 1 0.0 0.5 0.5
# 2 1.0 0.0 0.0
# 3 0.0 0.0 1.0
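For the final step from the question (averaging the group-level probabilities into one overall matrix), here is one possible sketch; the convention of skipping rows with no observed transitions (the NaN rows produced by prop.table) is my assumption, not something specified in the question:
arr <- simplify2array(prb)                # 3 x 3 x n-users array
apply(arr, c(1, 2), mean, na.rm = TRUE)   # element-wise mean, ignoring NaN entries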
I have the following df and I would like to extract all rows based on the following start and end signals.
Start signal : When status changes from 1 to 0
End signal : When status changes from 0 to -1.
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
time status
1 1 0
2 2 1
3 3 1
4 4 0
5 5 0
6 6 0
7 7 -1
8 8 0
9 9 1
10 10 0
11 11 0
12 12 0
13 13 -1
14 14 0
Desired output:
time status
4 4 0
5 5 0
6 6 0
10 10 0
11 11 0
12 12 0
Here's a possible solution using the data.table package. I'm basically first grouping by appearances of status == 1 and then checking, per group, whether there is also a status == -1; if so, I subset the group from the second row up to the row before the -1.
library(data.table)
setDT(df)[, indx := cumsum(status == 1)]
df[, if(any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
# indx time status
# 1: 2 4 0
# 2: 2 5 0
# 3: 2 6 0
# 4: 3 10 0
# 5: 3 11 0
# 6: 3 12 0
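If the helper column is not wanted in the result, the grouped query can be assigned and indx dropped afterwards (a small usage sketch):
res <- df[, if (any(status == -1)) .SD[2:(which(status == -1) - 1)], by = indx]
res[, indx := NULL]   # drop the grouping helper by reference
res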
We count start and end markers, then use the cumulative sum of (start - end) to filter rows. The (cumsum(start) - cumsum(end) > 1) is a slight fiddle: the first block opens with two consecutive 1s (rows 2-3) but is closed by only one -1, so the running difference never falls back to zero; with > 0 instead of > 1, rows 8 and 14 would unwantedly be included.
require(dplyr)
df %>% mutate(start=(status==1), end=(status==-1)) %>%
filter(!start & !end & (cumsum(start)-cumsum(end)>1) ) %>%
select(-start, -end)
# time status
# 1 4 0
# 2 5 0
# 3 6 0
# 4 10 0
# 5 11 0
# 6 12 0
A little ugly, but you can always just loop over the values and keep a flag for determining whether the element should be kept or not.
keepers <- rep(FALSE, nrow(df))
flag <- FALSE
for(i in 1:(nrow(df)-1)) {
if(df$status[i] == 1 && df$status[i+1] == 0) {
flag <- TRUE
next # keep signal index false
}
if(df$status[i] == -1 && df$status[i+1] == 0) {
flag <- FALSE
next # keep signal index false
}
keepers[i] <- flag
}
keepers[nrow(df)] <- flag # Set the last element to final flag value
newdf <- df[keepers, ] # subset based on the T/F values determined
Do you have some more data (or can you generate more data where you know the expected outcome) to check whether this/these generalize?
Two similar approaches:
library(stringr)
df <- data.frame(time = rep(1:14), status = c(0,1,1,0,0,0,-1,0,1,0,0,0,-1,0))
dfr <- rle(df$status)
# first approach
find_seq_str <- function() {
str_locate_all(paste(gsub("-1", "x", dfr$values), collapse=""), "10")[[1]][,2]
}
df[as.vector(sapply(find_seq_str(),
function(n) {
i <- sum(dfr$lengths[1:(n-1)])
tail(i:(i+dfr$lengths[n]), -1)
})),]
# second approach
find_seq_ts <- function() {
which(apply(embed(dfr$values, 2), 1, function(x) all(x == c(0, 1))))
}
df[as.vector(sapply(find_seq_ts(),
function(n) {
i <- sum(dfr$lengths[1:(n)])+1
head(i:(i+dfr$lengths[n+1]), -1)
})),]
Both approaches need a run length encoding of the status vector.
The first does a single-character replacement for -1 so we can build an unambiguous, contiguous string, uses str_locate_all to find the "10" pairs that mark where a target sequence starts, and then rebuilds the ranges of zeroes from the rle lengths.
If it needs to be base R I can try to whip up something with regexpr.
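(For the record, a rough base-R sketch of that idea with gregexpr, using the same dfr as above; the match ends are the match starts plus one for the two-character pattern:
s <- paste(gsub("-1", "x", dfr$values), collapse = "")
ends <- as.vector(gregexpr("10", s)[[1]]) + 1   # same positions as find_seq_str()
ends can then be fed to the same sapply indexing as in the first approach.)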
The second builds a paired matrix and compares for the same target sequence.
Caveats:
I did no benchmarking
Both create potentially big things if status is big.
I'm not completely positive it generalizes (hence my initial q).
David's is far more readable, maintainable & transferable code, but you get to deal with all the "goodness" that comes with using data.table ;-)
I wrapped the approaches in functions as they could potentially then be parameterized, but you could just as easily just assign the value to a variable or shove it into the sapply (ugh, tho).
I have a problem very similar to that described here:
subset of data.frame columns to maximize "complete" observations
I am trying to schedule a workshop that will meet five times. I have ten days from which to choose meeting dates, each day having three overlapping possible meeting times. Hence, I have 30 columns grouped into ten groups (days) of three columns (meeting times) each. I need to select 5 columns (or meeting date–time combinations) subject to the following criteria: only one meeting time is selected per day (one column per group); the number of respondents (rows) who can attend all 5 meetings is maximized. Ideally, I would also want to know how the optimal column selection changes if I relax the criterion that respondents must attend ALL 5 meetings, requiring only that they attend 4, or 3, etc.
For simple visualization, let's say I want to know which two columns I should choose—no more than one each from V1, V2, and V3—such that I maximize the number of rows that have no zeros (i.e. row sums to 2).
V1A V1B V1C V2A V2B V2C V3A V3B V3C
1 0 1 0 1 1 1 0 1
1 1 0 0 1 1 0 1 1
0 0 1 1 1 0 0 1 1
1 1 1 1 0 0 1 0 0
1 0 0 0 1 1 0 1 0
0 1 1 0 1 1 0 0 0
1 0 1 1 1 0 1 0 1
The actual data are here: https://drive.google.com/file/d/0B03dE9-8088aMklOUVhuV3gtRHc/view Groups are mon1* tue1* [...] mon2* tue2* [...] fri2*.
The code proposed in the link above would solve my problem if it were not the case that I needed to select columns from groups. Ideally, I would also be able to say which columns I should choose to maximize the number of rows under the weaker condition that a row could have one zero (i.e. row sums to 5 or 4 or 3, etc.).
Many thanks!
You could use rowSums to get the index of rows that have greater than or equal to two 1's. (The conditions are not very clear)
lapply(split(names(df),sub('.$', '', names(df))),
function(x) which(rowSums(df[x])>=2))
#$V1
#[1] 1 2 4 6 7
#$V2
#[1] 1 2 3 5 6 7
#$V3
#[1] 1 2 3 7
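Building on the same split of column names by group, here is a brute-force sketch of the actual selection problem on the toy data (df and the group prefixes are taken from the example above): enumerate every one-column-per-group combination and count the rows that are 1 in all chosen columns.
grp    <- split(names(df), sub('.$', '', names(df)))   # columns grouped by prefix
combos <- expand.grid(grp, stringsAsFactors = FALSE)   # every one-column-per-group choice
counts <- apply(combos, 1, function(cols)
  sum(rowSums(df[, cols]) == length(cols)))            # rows with a 1 in every chosen column
combos[which.max(counts), ]                            # best combination (first one on ties)
Relaxing "attend all" to "miss at most one" amounts to replacing == length(cols) with >= length(cols) - 1.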
This just finds the first column index with 1 (or very first if all zero) in each of three groups, returning a three column matrix, one column for each group.
f <- substring(colnames(df), 1L, nchar(colnames(df))-1L)
ans <- lapply(split(as.list(df), f),
function(x) max.col(do.call(cbind, x), ties.method="first"))
do.call(cbind, ans)
With your dataset this delivers, for each group, the rows where all three columns are 1:
> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1,
function(z) sum(z, na.rm=TRUE)==3) ) )
[[1]]
[1] 4
[[2]]
integer(0)
[[3]]
integer(0)
If you relax the requirement to row sums of at least 2, you get more candidates:
> lapply( 1:3, function(grp) which( apply( dat[, grep(grp, names(dat))] , 1, function(z) sum(z, na.rm=TRUE)>=2) ) )
[[1]]
[1] 1 2 4 6 7
[[2]]
[1] 1 2 3 5 6 7
[[3]]
[1] 1 2 3 7
Now, what exactly are the rules for this task?