Sum rows in a group, starting when a specific value occurs - r

I want to accumulate the values of a column until the end of the group, but start the addition only when a specific value occurs in another column. I am only interested in the first instance of that value within a group, so if it occurs again within the group, the running sum should simply continue. I know this sounds like a rather strange problem, so hopefully the example table makes sense.
The following data frame is what I have now:
> df = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0))
> df
group numToAdd occurs
1 1 1 0
2 1 1 0
3 1 3 1
4 1 2 0
5 2 4 0
6 2 2 1
7 2 1 0
8 2 3 0
9 2 2 0
10 3 1 0
11 3 2 1
12 3 1 1
13 4 2 0
14 4 3 0
15 4 2 0
Thus, whenever a 1 occurs within a group, I want a cumulative sum of the values from the column numToAdd, until a new group starts. This would look like the following:
> finalDF = data.frame(group = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),occurs = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0),added = c(0,0,3,5,0,2,3,6,8,0,2,3,0,0,0))
> finalDF
group numToAdd occurs added
1 1 1 0 0
2 1 1 0 0
3 1 3 1 3
4 1 2 0 5
5 2 4 0 0
6 2 2 1 2
7 2 1 0 3
8 2 3 0 6
9 2 2 0 8
10 3 1 0 0
11 3 2 1 2
12 3 1 1 3
13 4 2 0 0
14 4 3 0 0
15 4 2 0 0
Thus, the added column is 0 until a 1 occurs within the group, then accumulates the values from numToAdd until it moves to a new group, turning the added column back to 0. In group three, a value of 1 is found a second time, yet the cumulated sum continues. Additionally, in group 4, a value of 1 is never found, thus the value within the added column remains 0.
I've played around with dplyr, but can't get it to work. The following solution only outputs the total sum, and not the increasing cumulated number at each row.
library(dplyr)
df =
df %>%
mutate(added=ifelse(occurs == 1,cumsum(numToAdd),0)) %>%
group_by(group)

Try
df %>%
group_by(group) %>%
mutate(added= cumsum(numToAdd*cummax(occurs)))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0
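To see why this works, it can help to look at the intermediate vectors for a single group. The sketch below uses only base R and the values of group 2 from the question; the names switch_on and added are introduced here for illustration.

```r
# Values of group 2 from the question
numToAdd <- c(4, 2, 1, 3, 2)
occurs   <- c(0, 1, 0, 0, 0)

# cummax() stays at 0 until the first 1 appears, then stays at 1
switch_on <- cummax(occurs)

# Multiplying zeroes out everything before the first 1;
# cumsum() then accumulates the remaining values
added <- cumsum(numToAdd * switch_on)

print(switch_on)  # 0 1 1 1 1
print(added)      # 0 2 3 6 8
```

A second 1 later in the group changes nothing, because cummax() is already at 1 by then.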
Or using data.table
library(data.table)#v1.9.5+
i1 <-setDT(df)[, .I[(rleid(occurs) + (occurs>0))>1], group]$V1
df[, added:=0][i1, added:=cumsum(numToAdd), by = group]
Or a similar option as in dplyr
setDT(df)[,added := cumsum(numToAdd * cummax(occurs)) , by = group]

You can use split-apply-combine in base R with something like:
df$added <- unlist(lapply(split(df, df$group), function(x) {
y <- rep(0, nrow(x))
pos <- cumsum(x$occurs) > 0
y[pos] <- cumsum(x$numToAdd[pos])
y
}))
df
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0

To add another base R approach:
df$added <- unlist(lapply(split(df, df$group), function(x) {
c(x[,'occurs'][cumsum(x[,'occurs']) == 0L],
cumsum(x[,'numToAdd'][cumsum(x[,'occurs']) != 0L]))
}))
# group numToAdd occurs added
# 1 1 1 0 0
# 2 1 1 0 0
# 3 1 3 1 3
# 4 1 2 0 5
# 5 2 4 0 0
# 6 2 2 1 2
# 7 2 1 0 3
# 8 2 3 0 6
# 9 2 2 0 8
# 10 3 1 0 0
# 11 3 2 1 2
# 12 3 1 1 3
# 13 4 2 0 0
# 14 4 3 0 0
# 15 4 2 0 0

Another base R:
df$added <- unlist(lapply(split(df,df$group),function(x){
cumsum((cumsum(x$occurs) > 0) * x$numToAdd)
}))
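The split/lapply/unlist variants above write the results back in group order, so they appear to assume that the rows are already sorted by group. A possible alternative that keeps the original row order is ave(). This is a sketch using the question's data, not one of the posted answers; started is a name chosen here.

```r
df <- data.frame(
  group    = c(1,1,1,1,2,2,2,2,2,3,3,3,4,4,4),
  numToAdd = c(1,1,3,2,4,2,1,3,2,1,2,1,2,3,2),
  occurs   = c(0,0,1,0,0,1,0,0,0,0,1,1,0,0,0)
)

# 1 from the first occurrence onwards within each group, 0 before it
started <- ave(df$occurs, df$group, FUN = cummax)

# Running sum per group of the values that count
df$added <- ave(df$numToAdd * started, df$group, FUN = cumsum)

print(df)
```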


Pasting a string of variables into a function is not working

I was looking at this question: Find how many times duplicated rows repeat in R data frame, which provides the following code:
library(plyr)
ddply(df,.(a,b),nrow)
However, I have a dataset with many variables, so I can't type them out like a,b in this case. I've tried using names(data) with the paste function, but it doesn't seem to work. I tried this:
var_names=paste(names(data),collapse=",")
ddply(data,.(paste(a)),nrow)
It instead gives this output:
However, if I manually type them out, I get the proper output:
What do I need to do differently here?
Instead of paste and evaluating, make use of count from dplyr, which can take multiple columns with across and select-helpers - everything()
library(dplyr)
df %>%
count(across(everything()))
A reproducible example with mtcars dataset
data(mtcars)
df <- mtcars %>%
select(vs:carb)
count(df, across(everything()))
vs am gear carb n
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Also, in ddply, we can just pass a vector of column names i.e. no need to create a single string
library(plyr)
ddply(df, names(df), nrow)
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Or, if we are creating a single string from the names, paste the whole expression as well and then evaluate it (not recommended, as there are standard ways of dealing with this)
eval(parse(text = paste('ddply(df, .(', toString(names(df)), '), nrow)')))
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
You can use aggregate, grouping by all the columns and counting the length of each group.
aggregate(1:nrow(df)~., df, length)
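A small variation on the aggregate approach, sketched here with the same mtcars columns, adds a helper column of ones and sums it, which makes the name of the count column explicit. The column name n and the object counts are choices made for this example.

```r
# Same four columns as in the mtcars example above
df <- mtcars[, c("vs", "am", "gear", "carb")]

# Helper column: every row counts once
df$n <- 1

# Group by all remaining columns and add up the ones
counts <- aggregate(n ~ ., data = df, FUN = sum)

print(counts)
```

The row order of the result may differ from the count() and ddply() outputs, since aggregate sorts its groups in its own way.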

Recoding by an order in r

I have a data recoding puzzle. Here is what my sample data looks like:
df <- data.frame(
id = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
scores = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
cat = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1))
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 -1 6 0
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 -1 4 0
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
There are three ids in the dataset, and rows are ordered by the position variable. For each id, in the first row where scores is -1, scores needs to become 0 and cat needs to become 1. For example, for id=1 that row is at position 6, so in that row scores should be 0 and cat should be 1. Ids that have no scores=-1 are kept as they are.
The desired output should look like below:
id scores position cat
1 1 0 1 1
2 1 1 2 1
3 1 1 3 1
4 1 0 4 1
5 1 0 5 1
6 1 0 6 1
7 1 -1 7 0
8 2 0 1 1
9 2 0 2 1
10 2 1 3 1
11 2 0 4 1
12 2 -1 5 0
13 2 -1 6 0
14 3 0 1 1
15 3 1 2 1
16 3 0 3 1
17 3 1 4 1
18 3 1 5 1
19 3 0 6 1
20 3 1 7 1
Any recommendations??
Thanks
This may be what you are after
library(dplyr)
df %>%
group_by(id) %>%
mutate(i = which(scores == -1)[1]) %>% # within-group index of the first row with scores == -1 (NA if none)
mutate(scores = case_when(position == i & scores != 0 ~ 0, TRUE ~ scores), # set that row's score to 0
cat = ifelse(scores == -1, 0, 1)) %>% # then recompute cat from the updated scores
select(-i) # remove the helper column i
After trying a few things and getting ideas from @Ricky and @e.matt, I came up with a solution.
df %>%
filter(scores == -1) %>% # keep rows where scores is -1
distinct(id, .keep_all = TRUE) %>% # keep the first such row per id
mutate(first = 1) %>% # create first column
right_join(df, by=c("id","scores","position","cat")) %>% # join back original dataset
mutate(first = coalesce(first, 0)) %>% # replace NAs with 0
mutate(scores = case_when(
first == 1 ~ 0,
TRUE~scores)) %>%
mutate(cat = case_when(
first == 1 ~ 1,
TRUE~cat))
This provides my desired output.
id scores position cat first
1 1 0 1 1 0
2 1 1 2 1 0
3 1 1 3 1 0
4 1 0 4 1 0
5 1 0 5 1 0
6 1 0 6 1 1
7 1 -1 7 0 0
8 2 0 1 1 0
9 2 0 2 1 0
10 2 1 3 1 0
11 2 0 4 1 1
12 2 -1 5 0 0
13 2 -1 6 0 0
14 3 0 1 1 0
15 3 1 2 1 0
16 3 0 3 1 0
17 3 1 4 1 0
18 3 1 5 1 0
19 3 0 6 1 0
20 3 1 7 1 0
Here is a data.table one-liner:
library( data.table )
setDT(df)
df[ df[, .(cumsum( scores == -1 ) == 1), by = .(id)]$V1, `:=`( scores = 0, cat = 1) ]
# id scores position cat
# 1: 1 0 1 1
# 2: 1 1 2 1
# 3: 1 1 3 1
# 4: 1 0 4 1
# 5: 1 0 5 1
# 6: 1 0 6 1
# 7: 1 -1 7 0
# 8: 2 0 1 1
# 9: 2 0 2 1
# 10: 2 1 3 1
# 11: 2 0 4 1
# 12: 2 -1 5 0
# 13: 2 -1 6 0
# 14: 3 0 1 1
# 15: 3 1 2 1
# 16: 3 0 3 1
# 17: 3 1 4 1
# 18: 3 1 5 1
# 19: 3 0 6 1
# 20: 3 1 7 1
You could do something along these lines using the dplyr package:
library(dplyr)
df = mutate(df, cat = ifelse(scores == -1, 1, cat),
scores = ifelse(scores == -1, 0, scores))
Using the mutate() function, I am re-assigning the values for the scores and cat fields according to ifelse() conditional statements. For scores, if the score is -1, the value is replaced by 0, otherwise it keeps the score as is. For cat, it also checks if scores is equal to -1, but would assign a value of 1 when the condition is met, or the already existing value of cat when the condition is not met.
EDIT
After our discussion in the comments, I think something along these lines should be helpful (you may have to modify the logic since I don't exactly follow what the desired output is here):
for(i in 1:nrow(df)){
# Check if score is -1
if(df[i, 'scores'] == -1){
# Update values for the next row
df[i+1, 'scores'] <- 0
df[i+1, 'cat'] <- 1
}
}
Sorry that I don't really follow the desired output, hopefully this is helpful in getting you to your answer!
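For reference, the requirement as stated in the question (only the first -1 row within each id gets scores 0 and cat 1) can also be sketched in base R without a loop. This is an illustrative sketch rather than one of the posted answers; is_neg and first_neg are names introduced here.

```r
df <- data.frame(
  id       = c(1,1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
  scores   = c(0,1,1,0,0,-1,-1, 0,0,1,-1,-1,-1, 0,1,0,1,1,0,1),
  position = c(1,2,3,4,5,6,7, 1,2,3,4,5,6, 1,2,3,4,5,6,7),
  cat      = c(1,1,1,1,1,0,0, 1,1,1,0,0,0, 1,1,1,1,1,1,1)
)

# Rows where scores is -1
is_neg <- df$scores == -1

# Within each id, count the -1 rows seen so far; the first one has count 1
first_neg <- is_neg & ave(as.integer(is_neg), df$id, FUN = cumsum) == 1

# Recode only those rows; ids without any -1 are left untouched
df$scores[first_neg] <- 0
df$cat[first_neg]    <- 1

print(df)
```

Requiring is_neg as well as a count of 1 means that rows following the first -1 are not touched, even if they are not -1 themselves.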

Fill entire vector by a certain appearance of another vector

I have the following data:
players<-rep(1:3,each=3)
trial<-rep(1:3)
choice<-c(1,0,0,0,0,0,0,1,0)
gamematrix<-data.frame(cbind(players,trial,choice))
players trial choice
1 1 1 1
2 1 2 0
3 1 3 0
4 2 1 0
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
Now I want to create a new vector:
each participant who has at least one choice of "1" should get the value "3", and "0" otherwise:
players trial choice win
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
In the simple example above, player "1", had "1" in the first trial, while player 3 in the second trial, thus for all their choices the value is "3" in the new vector.
Any ideas how to do it? thanks!
A base R option using ave + ifelse
within(
gamematrix,
win <- ave(choice,players,FUN = function(x) ifelse(any(x==1),3,0))
)
giving
players trial choice win
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
Update
If your criterion depends on the first two values of choice, you can try
within(
gamematrix,
win <- ave(choice,players,FUN = function(x) ifelse(all(head(x,2)==1),3,0))
)
which gives
players trial choice win
1 1 1 1 0
2 1 2 0 0
3 1 3 0 0
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 0
8 3 2 1 0
9 3 3 0 0
Try this dplyr approach:
library(dplyr)
#Code
gamematrix <- gamematrix %>% group_by(players) %>%
mutate(win=ifelse(length(choice[choice==1])>=1,3,0))
Output:
# A tibble: 9 x 4
# Groups: players [3]
players trial choice win
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
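There is no reason for this data to be a data.frame. Keep it as a numeric matrix; then you can do it in one line using only vectorized functions. Note that rowSums() works row by row, so the lines below give a per-player result only if the matrix has one row per player and one column per trial, not the long layout shown in the question.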
cbind(gamematrix, win = (rowSums(gamematrix == 1) > 0) * 3)
for your second case:
I would like it to be only for those players who had "choice=1" in the first N (e.g., first 2 trials)
cbind(gamematrix, win = (rowSums(gamematrix[,c(1,2)] == 1) > 0) * 3)
Vectorized solutions are usually more performant than solutions incorporating a buried loop (e.g. ave).
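The rowSums() one-liners treat each row of the matrix as one unit. A sketch of a layout where that matches the question, with one row per player and one column per trial, is shown below. The names wide, win_by_player and win_first2 are assumptions made for this example.

```r
players <- rep(1:3, each = 3)
choice  <- c(1, 0, 0, 0, 0, 0, 0, 1, 0)

# One row per player, one column per trial
# (the data is ordered by player, then trial, so byrow = TRUE lines it up)
wide <- matrix(choice, nrow = 3, byrow = TRUE,
               dimnames = list(player = 1:3, trial = 1:3))

# 3 for players with at least one choice of 1, otherwise 0
win_by_player <- (rowSums(wide == 1) > 0) * 3

# Same rule restricted to the first two trials
win_first2 <- (rowSums(wide[, 1:2, drop = FALSE] == 1) > 0) * 3

# Map the per-player result back onto the long layout
win <- unname(win_by_player[as.character(players)])

print(win_by_player)
print(win)
```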
An option with rowsum from base R
gamematrix$win <- with(gamematrix, 3 * players %in%
names(which(rowsum(choice, players)[,1] > 0)))
gamematrix$win
#[1] 3 3 3 0 0 0 3 3 3

Converting data to longitudinal data

Hi, I am having difficulties converting my data into longitudinal data using the reshape package. I would be grateful if anyone could help me, thank you!
Data is as follows:
m <- matrix(sample(c(0, 0:3), 100, replace = TRUE), 10) # upper bound 3 inferred from the printed values below
ID<-c(1:10)
dim(ID)=c(10,1)
m<- cbind(ID,m)
d <- as.data.frame(m)
names(d)<-c('ID', 'litter1', 'litter2', 'litter3', 'litter4', 'litter5', 'litter6', 'litter7', 'litter8', 'litter9', 'litter10')
print(d)
ID litter1 litter2 litter3 litter4 litter5 litter6 litter7 litter8 litter9 litter10
1 0 0 0 3 1 0 2 0 0 3
2 0 2 1 2 0 0 0 2 0 0
3 1 0 1 2 0 3 3 3 2 0
4 2 1 2 3 0 2 3 3 1 0
5 0 1 2 0 0 0 3 3 1 0
6 2 1 2 0 3 3 0 0 0 0
7 0 1 0 3 0 0 1 2 2 0
8 0 1 3 3 2 1 3 2 3 0
9 0 2 0 2 2 3 2 0 0 3
10 2 2 2 2 1 3 0 3 0 0
I wish to convert the above data into longitudinal (long-format) data with the columns 'ID', 'littercategory', which gives the category of the litter (1-10), and 'litternumber', which gives the number of pieces for each litter category:
ID littercategory litternumber
1 4 3
1 5 1
1 7 2
1 10 3
2 2 2
2 3 1
2 4 2
2 8 2
and so on.
Would really appreciate your help thank you!
You could do that as follows:
library(reshape2)
d = melt(d, id.vars=c("ID"))
colnames(d) = c('ID','littercategory','litternumber')
# remove the text in the littercategory column, keep only the number.
d$littercategory = gsub('litter','',d$littercategory)
# keep only the rows with a nonzero count (note the comma: this subsets rows)
d = d[d$litternumber != 0, ]
Output:
ID littercategory litternumber
1 1 4
2 1 8
3 1 6
4 1 4
7 1 6
8 1 5
10 1 10
1 2 6
2 2 9
As you can see, only the ordering differs from the output you requested, but I'm sure you can fix that yourself. (If not, there are plenty of resources on how to do that.)
Hope this helps!
To get the desired output, you have to melt your data and keep only the nonzero values.
library(data.table)
result <- melt(setDT(d), id.vars = "ID")[value != 0][order(ID)] # convert to data.table first so data.table's melt is used
# To get exact structure modify result
result[, .(ID,
littercategory = sub("litter", "", variable),
litternumber = value)]
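If neither reshape2 nor data.table is available, the same wide-to-long step can be sketched in base R. The small data frame below is made up for illustration (it is not the asker's random data), and vals and long are names chosen here.

```r
# Hypothetical miniature of the question's layout: 3 IDs, 3 litter categories
d <- data.frame(ID = 1:3,
                litter1 = c(0, 2, 0),
                litter2 = c(3, 0, 0),
                litter3 = c(1, 1, 0))

# Counts as a matrix; as.vector() below reads it column by column
vals <- as.matrix(d[-1])

long <- data.frame(
  ID             = rep(d$ID, times = ncol(vals)),
  littercategory = as.integer(sub("litter", "", rep(colnames(vals), each = nrow(vals)))),
  litternumber   = as.vector(vals)
)

# Drop the zero counts and sort by ID, then category
long <- long[long$litternumber != 0, ]
long <- long[order(long$ID, long$littercategory), ]
rownames(long) <- NULL

print(long)
```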

Resetting TIME column when AMT > 0

I have a data frame that looks like this:
ID TIME AMT
1 0 50
1 1 0
1 2 0
1 3 0
1 4 0
1 4 50
1 5 0
1 7 0
1 9 0
1 10 0
1 10 50
The TIME column in the above data frame is continuous. I want to add another time column that resets time from zero when AMT>0. So, my output data frame should look like this:
ID TIME AMT TIME2
1 0 50 0
1 1 0 1
1 2 0 2
1 3 0 3
1 4 0 4
1 4 50 0
1 5 0 1
1 7 0 3
1 9 0 5
1 10 0 6
1 10 50 0
This is basically achieved by subtracting a "fixed" reference TIME, the TIME at which AMT>0, from each later TIME. For example, the reference time for the second AMT>0 is 4, so TIME2 is calculated as 5-4=1, 7-4=3, 9-4=5, etc. How can I do this automatically in R?
A data.table solution :
library(data.table)
setDT(DT)[,TIME2 := TIME-TIME[1],cumsum(AMT>0)]
# ID TIME AMT TIME2
# 1: 1 0 50 0
# 2: 1 1 0 1
# 3: 1 2 0 2
# 4: 1 3 0 3
# 5: 1 4 0 4
# 6: 1 4 50 0
# 7: 1 5 0 1
# 8: 1 7 0 3
# 9: 1 9 0 5
# 10: 1 10 0 6
# 11: 1 10 50 0
I was originally going to post the same answer as @agstudy, so here is a possible base R solution instead
with(df, ave(TIME, cumsum(AMT > 0L), ID, FUN = function(x) x - x[1L]))
## [1] 0 1 2 3 4 0 1 3 5 6 0
Or
library(dplyr)
df %>%
group_by(cumsum(AMT > 0), ID) %>%
mutate(TIME2 = TIME - first(TIME))
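The answers above refer to an existing df (or DT) that the question only shows as a printed table. A self-contained base R sketch that rebuilds the data and makes the intermediate step visible might look as follows; dose is a name introduced here for the running count of rows with AMT > 0.

```r
# Data as printed in the question
df <- data.frame(
  ID   = 1,
  TIME = c(0, 1, 2, 3, 4, 4, 5, 7, 9, 10, 10),
  AMT  = c(50, 0, 0, 0, 0, 50, 0, 0, 0, 0, 50)
)

# Running count of doses within each ID: increases by 1 at every AMT > 0
dose <- ave(as.integer(df$AMT > 0), df$ID, FUN = cumsum)

# Within each ID and dose interval, subtract the first TIME of that interval
df$TIME2 <- ave(df$TIME, df$ID, dose, FUN = function(x) x - x[1L])

print(dose)      # 1 1 1 1 1 2 2 2 2 2 3
print(df$TIME2)  # 0 1 2 3 4 0 1 3 5 6 0
```

Counting doses per ID keeps the intervals of different subjects apart once the data contains more than one ID.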
