Looking to add a column based on the values of two columns, but over more than one row.
Example Dataset Code:
A = c(1,1,1,2,2,2,3,3,3,4,4)
B = c(1,2,3,1,2,3,1,2,3,1,2)
C = c(0,0,0,1,0,0,1,1,1,0,1)
data <- data.frame(A,B,C)
Dataset:
A B C
1 1 1 0
2 1 2 0
3 1 3 0
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 1
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 1
Ifelse statements:
What I am trying to achieve is "Create column D. If column C == 1 in any row where column A == x, column D = 1 for all rows where column A == x. Else column D = 0."
Desired Output:
A B C D
1 1 1 0 0
2 1 2 0 0
3 1 3 0 0
4 2 1 1 1
5 2 2 0 1
6 2 3 0 1
7 3 1 1 1
8 3 2 1 1
9 3 3 1 1
10 4 1 0 1
11 4 2 1 1
What I've done:
I've thought about it today but can't come up with a logical answer. I've tried looking at the data in long and wide formats, but nothing's jumped out.
Note:
In the actual application, the number of times each value x appears in column A is not equal (some values appear only once in the dataset, others appear 20 times).
# use any() within each group of A to check whether at least one row has C == 1
library(dplyr)
data %>% group_by(A) %>% mutate(D = as.numeric(any(C==1)))
library(data.table)
setDT(data)  # convert the data.frame to a data.table by reference so := works
data[, D:=as.numeric(any(C==1)), by = .(A)]
# A B C D
#1 1 1 0 0
#2 1 2 0 0
#3 1 3 0 0
#4 2 1 1 1
#5 2 2 0 1
#6 2 3 0 1
#7 3 1 1 1
#8 3 2 1 1
#9 3 3 1 1
#10 4 1 0 1
#11 4 2 1 1
Easy with data.table
library(data.table)
data <- data.table(data)
x=2
data[,D:=ifelse(!A==x,ifelse(C==1,1,0),0)]
data
We can use ave from base R
data$D <- with(data, as.integer(ave(C==1, A, FUN=any)))
data
# A B C D
#1 1 1 0 0
#2 1 2 0 0
#3 1 3 0 0
#4 2 1 1 1
#5 2 2 0 1
#6 2 3 0 1
#7 3 1 1 1
#8 3 2 1 1
#9 3 3 1 1
#10 4 1 0 1
#11 4 2 1 1
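The question notes that group sizes differ in the real data. The following is a minimal sketch, on made-up data (the values below are an assumption, not the asker's dataset), suggesting that the ave approach does not depend on every value of A appearing the same number of times:

```r
# Hypothetical data: A = 1 appears once, A = 2 twice, A = 3 five times
data <- data.frame(
  A = c(1, 2, 2, 3, 3, 3, 3, 3),
  C = c(0, 0, 1, 0, 0, 0, 0, 0)
)

# any() returns one value per group; ave() recycles it over the group's rows
data$D <- with(data, as.integer(ave(C == 1, A, FUN = any)))
data$D
# 0 1 1 0 0 0 0 0
```

Only the group with A == 2 contains a row with C == 1, so only its rows get D = 1.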
Here is the data:
marker <- c(0,0,0,0,3,3,0,0,5,5,5,0,0,0,
1,1,2,2,2,2,0,0,1,1,1,3,3,3,
1,1,2,2,2,0,0,1,1,1,5,5,5,5)
Those markers show what the participant was doing during an eye tracking study, such that 0 = no trial, 1 = trial onset, 2, 3, 5 = different types of tasks. The data before the first 1 is eye tracker test and can be discarded.
What I need to do (preferably with dplyr):
Delete data before the first 1
Calculate the length of each sequence of repeating numbers (n_samples)
Assign ID numbers to trials and 0's to no trial and trial onset (trial_number)
Desired output:
marker n_samples trial_number
1 2 0
1 2 0
2 4 1
2 4 1
2 4 1
2 4 1
0 2 0
0 2 0
1 3 0
1 3 0
1 3 0
3 3 2
3 3 2
3 3 2
1 2 0
1 2 0
2 3 3
2 3 3
2 3 3
0 2 0
0 2 0
1 3 0
1 3 0
1 3 0
5 4 4
5 4 4
5 4 4
5 4 4
I found this answer, but wasn't able to modify the code to fit my task.
Thank you!
Using dplyr and data.table's rleid function.
library(dplyr)
tibble(marker) %>%
  #Drop rows before first 1
  filter(row_number() >= match(1, marker)) %>%
  #Count samples in each group
  add_count(grp = data.table::rleid(marker), name = 'n_samples') %>%
  #Create trial number
  mutate(trial_number = with(rle(!marker %in% c(1, 0)),
                             rep(cumsum(values) * values, lengths))) %>%
  select(-grp)
This returns -
# marker n_samples trial_number
#1 1 2 0
#2 1 2 0
#3 2 4 1
#4 2 4 1
#5 2 4 1
#6 2 4 1
#7 0 2 0
#8 0 2 0
#9 1 3 0
#10 1 3 0
#11 1 3 0
#12 3 3 2
#13 3 3 2
#14 3 3 2
#15 1 2 0
#16 1 2 0
#17 2 3 3
#18 2 3 3
#19 2 3 3
#20 0 2 0
#21 0 2 0
#22 1 3 0
#23 1 3 0
#24 1 3 0
#25 5 4 4
#26 5 4 4
#27 5 4 4
#28 5 4 4
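The trial numbering above rests on a small rle/cumsum trick. Here is a minimal base R sketch of it on a short made-up vector (the vector and the intermediate names is_task, runs and run_number are assumptions for illustration only):

```r
# Hypothetical short marker vector, already starting at the first 1
marker <- c(1, 1, 2, 2, 0, 1, 3, 3, 3)

# TRUE wherever a task (anything other than 0 or 1) is running
is_task <- !marker %in% c(0, 1)

# Collapse into runs: FALSE x2, TRUE x2, FALSE x2, TRUE x3
runs <- rle(is_task)

# cumsum counts the TRUE runs; multiplying by values zeroes the FALSE runs
run_number <- cumsum(runs$values) * runs$values   # 0 1 0 2

# Expand back to one value per sample
trial_number <- rep(run_number, runs$lengths)
trial_number
# 0 0 1 1 0 0 2 2 2

# Length of each run of identical markers, repeated for every sample in the run
n_samples <- with(rle(marker), rep(lengths, lengths))
n_samples
# 2 2 2 2 1 1 3 3 3
```

One caveat of this sketch: two different tasks that follow each other directly (say 2 then 3, with no 0 or 1 in between) would be counted as a single trial, because the run is built on "task vs. no task" rather than on the marker value itself. In the data shown, every task is preceded by a trial onset, so this does not arise.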
Base R solution
marker <- c(0,0,0,0,3,3,0,0,5,5,5,0,0,0,
1,1,2,2,2,2,0,0,1,1,1,3,3,3,
1,1,2,2,2,0,0,1,1,1,5,5,5,5)
# keep everything from the first 1 onwards
tmp=marker[which(marker==1)[1]:length(marker)]
abc=rle(tmp)
df=data.frame(
  "marker"=tmp,
  "n_samples"=rep(abc$lengths,abc$lengths)
)
# recode runs: 0 for no trial / trial onset, then number the task runs 1, 2, 3, ...
abc$values[abc$values<=1]=0
abc$values[abc$values>1]=1
abc$values[abc$values==1]=cumsum(abc$values[abc$values==1])
df$trial_number=rep(abc$values,abc$lengths)
which results in
marker n_samples trial_number
1 1 2 0
2 1 2 0
3 2 4 1
4 2 4 1
5 2 4 1
6 2 4 1
7 0 2 0
8 0 2 0
9 1 3 0
10 1 3 0
11 1 3 0
12 3 3 2
13 3 3 2
14 3 3 2
15 1 2 0
...
I have the following data:
players<-rep(1:3,each=3)
trial<-rep(1:3)
choice<-c(1,0,0,0,0,0,0,1,0)
gamematrix<-data.frame(cbind(players,trial,choice))
players trial choice
1 1 1 1
2 1 2 0
3 1 3 0
4 2 1 0
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
Now I want to create a new vector:
each participant who has at least one choice of "1" gets the value "3", and "0" otherwise:
players trial choice win
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
In the simple example above, player 1 chose "1" in the first trial and player 3 chose "1" in the second trial, so all of their rows get the value "3" in the new vector.
Any ideas how to do it? thanks!
A base R option using ave + ifelse
within(
gamematrix,
win <- ave(choice,players,FUN = function(x) ifelse(any(x==1),3,0))
)
giving
players trial choice win
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
Update
If your criterion depends on the first two values of choice, you can try
within(
gamematrix,
win <- ave(choice,players,FUN = function(x) ifelse(all(head(x,2)==1),3,0))
)
which gives
players trial choice win
1 1 1 1 0
2 1 2 0 0
3 1 3 0 0
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 0
8 3 2 1 0
9 3 3 0 0
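The update above requires a "1" in both of the first two trials, which is why every player gets 0. If the intent is instead "at least one 1 among the first N trials", a possible variant swaps all() for any(). This is only a sketch: n_first is a hypothetical name for N, and the rows are assumed to be sorted by trial within each player.

```r
gamematrix <- data.frame(
  players = rep(1:3, each = 3),
  trial   = rep(1:3, times = 3),
  choice  = c(1, 0, 0, 0, 0, 0, 0, 1, 0)
)

n_first <- 2  # hypothetical N: how many leading trials to inspect

# For each player, look only at the first n_first choices
gamematrix$win <- ave(
  gamematrix$choice, gamematrix$players,
  FUN = function(x) if (any(head(x, n_first) == 1)) 3 else 0
)
gamematrix$win
# 3 3 3 0 0 0 3 3 3
```

With n_first <- 1, player 3 would drop to 0, since that player's only "1" occurs in the second trial.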
Try this dplyr approach:
library(dplyr)
#Code
gamematrix <- gamematrix %>% group_by(players) %>%
mutate(win=ifelse(length(choice[choice==1])>=1,3,0))
Output:
# A tibble: 9 x 4
# Groups: players [3]
players trial choice win
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 3
2 1 2 0 3
3 1 3 0 3
4 2 1 0 0
5 2 2 0 0
6 2 3 0 0
7 3 1 0 3
8 3 2 1 3
9 3 3 0 3
There is no reason for this data to be a data.frame. Keep the choices as a numeric matrix with one row per player and one column per trial. If you do so you can do it in one line using only vectorized functions.
choices <- matrix(choice, ncol = 3, byrow = TRUE)  # rows = players, columns = trials
cbind(choices, win = (rowSums(choices == 1) > 0) * 3)
for your second case:
I would like it to be only for those players who had "choice=1" in the first N (e.g., first 2 trials)
cbind(choices, win = (rowSums(choices[,c(1,2)] == 1) > 0) * 3)
Vectorized solutions are usually more performant than solutions incorporating a buried loop (e.g. ave).
An option with rowsum from base R
gamematrix$win <- with(gamematrix, 3 * players %in%
names(which(rowsum(choice, players)[,1] > 0)))
gamematrix$win
#[1] 3 3 3 0 0 0 3 3 3
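To make the rowsum one-liner easier to follow, here is the same idea unrolled step by step. This is a sketch; the intermediate names totals and winners are assumptions for illustration.

```r
players <- rep(1:3, each = 3)
choice  <- c(1, 0, 0, 0, 0, 0, 0, 1, 0)

# rowsum() adds up choice within each player; the result is a one-column
# matrix whose row names are the player ids
totals <- rowsum(choice, players)

# Players with at least one choice of 1 have a total above 0
winners <- rownames(totals)[totals[, 1] > 0]

# Map back to the long format: 3 for rows of winning players, 0 otherwise
win <- 3 * (as.character(players) %in% winners)
win
# 3 3 3 0 0 0 3 3 3
```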
I have long data looking like this for example:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0
4 1 0
4 2 1
4 3 NA
4 4 NA
I want to keep only the rows up to and including the first time the condition is met, so I want:
ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
3 1 1
4 1 0
4 2 1
I tried to loop but a) it said looping is not good coding style in R and b) it won't work.
Sidenote: just if you are wondering, it does make sense that IDs have condition and then lose it again in my example, but I am only interested in when they first had it.
Thank you.
Here's an easy way with dplyr:
library(dplyr)
df %>% group_by(ID) %>%
filter(row_number() <= which.max(condition) | sum(condition) == 0)
# # A tibble: 7 x 3
# # Groups: ID [3]
# ID time condition
# <int> <int> <int>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 0
# 4 1 4 1
# 5 2 1 0
# 6 2 2 1
# 7 3 1 1
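It relies on which.max, which returns the index of the first maximum value in a vector. The | sum(condition) == 0 clause keeps censored cases (where condition is always 0); without it, only the first row of such a group would be kept, because which.max returns 1 when all values are equal.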
Using this data:
df <- read.table(header = TRUE, text = 'ID time condition
1 1 0
1 2 0
1 3 0
1 4 1
2 1 0
2 2 1
2 3 1
2 4 0
3 1 1
3 2 1
3 3 0
3 4 0')
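The question's data also contains an ID with NA values after the condition is met, which the sample data above leaves out. A base R sketch that handles this case explicitly could look like the following; keep_until_first is a hypothetical helper name, and the rows are assumed to be ordered by time within each ID.

```r
df <- data.frame(
  ID        = rep(1:4, each = 4),
  time      = rep(1:4, times = 4),
  condition = c(0, 0, 0, 1,  0, 1, 1, 0,  1, 1, 0, 0,  0, 1, NA, NA)
)

# Hypothetical helper: TRUE for rows up to and including the first 1;
# if the condition is never met, keep every row of the group
keep_until_first <- function(cond) {
  hit <- which(cond == 1)  # which() ignores NA
  if (length(hit) == 0) return(rep(TRUE, length(cond)))
  seq_along(cond) <= hit[1]
}

# ave() applies the helper per ID and returns 1/0, converted back to logical
keep <- as.logical(ave(df$condition, df$ID, FUN = keep_until_first))
result <- df[keep, ]
result
```

On this data the result has nine rows, matching the desired output in the question, including the two rows for ID 4.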
Data:
set.seed (112098)
op <- data.frame(id=1:100,cluster=rbinom(100,1,0.5))
id cluster
1 1
2 1
3 1
4 0
5 1
6 1
7 0
8 0
9 1
Intended:
id cluster groups
1 1 1
2 1 1
3 1 1
4 0 0
5 1 2
6 1 2
7 0 0
8 0 0
9 1 3
Essentially, every consecutive 1 series forms a group. How could I add the group column in R?
Here is one option using rleid from data.table
library(data.table)
setDT(op)[, groups := rleid(cluster)*(cluster)
][groups!=0, groups := as.integer(factor(groups))]
head(op, 9)
# id cluster groups
#1: 1 1 1
#2: 2 1 1
#3: 3 1 1
#4: 4 0 0
#5: 5 0 0
#6: 6 1 2
#7: 7 1 2
#8: 8 0 0
#9: 9 1 3
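For readers without data.table, the same grouping can be sketched in base R with rle. The vector below is the nine values shown in the question rather than the seeded sample, and the names r and run_id are assumptions for illustration.

```r
# The nine cluster values displayed in the question
cluster <- c(1, 1, 1, 0, 1, 1, 0, 0, 1)

# Collapse into runs of identical values
r <- rle(cluster)

# Number the runs of 1s consecutively and force runs of 0s to 0
run_id <- cumsum(r$values == 1) * (r$values == 1)

# Expand back to one value per row
groups <- rep(run_id, r$lengths)
groups
# 1 1 1 0 2 2 0 0 3
```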
I have a question I hope some of you might help me with. I am doing a thesis on pharmaceuticals and the effect of parallel imports. I am working on this in R with a panel dataset.
I need a variable that counts, for a given original product, how many parallel importers there are in a given time period.
Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3
Ideally, what I want here is a new column giving the number of PI products (PI=1) for an original (PI=0) at time t. So the output would look like:
Product_ID PI t nPIcomp
1 0 1 2
1 1 1
1 1 1
1 0 2 4
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1 1
2 1 1
2 0 2 1
2 1 2
2 0 3 3
2 1 3
2 1 3
2 1 3
I hope I have made my issue clear :)
Thanks in advance,
Henrik
Something like this?
x <- read.table(text = "Product_ID PI t
1 0 1
1 1 1
1 1 1
1 0 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 1 1
2 0 2
2 1 2
2 0 3
2 1 3
2 1 3
2 1 3", header = TRUE)
find.count <- rle(x$PI)
count <- find.count$lengths[find.count$values == 1]
x[x$PI == 0, "nPIcomp"] <- count
Product_ID PI t nPIcomp
1 1 0 1 2
2 1 1 1 NA
3 1 1 1 NA
4 1 0 2 4
5 1 1 2 NA
6 1 1 2 NA
7 1 1 2 NA
8 1 1 2 NA
9 2 0 1 1
10 2 1 1 NA
11 2 0 2 1
12 2 1 2 NA
13 2 0 3 3
14 2 1 3 NA
15 2 1 3 NA
16 2 1 3 NA
I would use ave and your two columns Product_ID and t as grouping variables. Then, within each group, apply a function that returns the sum of PI followed by the appropriate number of NAs:
dat <- transform(dat, nPIcomp = ave(PI, Product_ID, t,
                                    FUN = function(z) {
                                      n <- sum(z)
                                      c(n, rep(NA, n))
                                    }))
The same idea can be used with the data.table package if your data is large and speed is a concern.
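The ave call above assumes each Product_ID/t group holds exactly one original (PI = 0) and that it comes first. A possible variant that does not depend on row order within a group is sketched below; the data frame is typed in from the question, and n_pi is a hypothetical intermediate name.

```r
x <- data.frame(
  Product_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  PI         = c(0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1),
  t          = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3)
)

# Number of parallel imports in each Product_ID/t group, repeated on every row
n_pi <- ave(x$PI, x$Product_ID, x$t, FUN = sum)

# Keep the count only on the rows of the original product
x$nPIcomp <- ifelse(x$PI == 0, n_pi, NA)
x$nPIcomp[x$PI == 0]
# 2 4 1 1 3
```

If a group contained several originals, each of them would receive the same count under this sketch.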
Roman's answer gives exactly what you want. In case you want to summarise the data instead, this would be handy, using the plyr package (df is what I have called your data.frame)...
library(plyr)
ddply( df , .(Product_ID , t ) , summarise , nPIcomp = sum(PI) )
# Product_ID t nPIcomp
#1 1 1 2
#2 1 2 4
#3 2 1 1
#4 2 2 1
#5 2 3 3