How do you duplicate rows n times by group and change one specific column value in R?

I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group, and in the new rows I want to fix the value of one column, Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 row for ID = 1, 3 for ID = 2 and 4 for ID = 3 (I would like to calculate these counts dynamically for each subgroup). For the new rows I generate I want Attended = 0 and t to take on the value of a missing index, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create a different number of duplicates per participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
  group_by(ID) %>%
  rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
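A reproducible version of the example data, reconstructed from the table above (not part of the original question):
DF <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 3),
                 Demo = c(3, 3, 3, 3, 5, 5, 7),
                 Attended = c(1, 1, 0, 1, 1, 1, 0),
                 t = c(1, 3, 4, 5, 2, 4, 1))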

One tidyverse possibility could be:
library(dplyr)
library(tidyr)

DF %>%
  complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%
  arrange(ID)
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
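Here complete() crosses every observed value of t (1 through 5, since each value appears somewhere in the data) with each existing (ID, Demo) pair, and fills Attended with 0 in the rows it creates; nesting() keeps ID and Demo paired rather than crossing them too. If some t value never appeared anywhere in the data, you could supply the range explicitly, e.g. complete(t = 1:5, nesting(ID, Demo), fill = list(Attended = 0)).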

Related

R left_join() replacing joined values rather than adding in new columns

I have the following dataframes:
A <- data.frame(AgentNo = c(1, 2, 3, 4, 5, 6),
                N = c(2, 5, 6, 1, 9, 0),
                Rarity = c(1, 2, 1, 1, 2, 2))
AgentNo N Rarity
1 1 2 1
2 2 5 2
3 3 6 1
4 4 1 1
5 5 9 2
6 6 0 2
B <- data.frame(Rank = c(1, 5),
                AgentNo.x = c(2, 5),
                AgentNo.y = c(1, 4),
                N = c(3, 1),
                Rarity = c(1, 2))
Rank AgentNo.x AgentNo.y N Rarity
1 1 2 1 3 1
2 5 5 4 1 2
I would like to left join B onto A by columns "AgentNo" = "AgentNo.y" and "N" = "N", but rather than adding new columns from B to A, I want to keep the same columns as A, with the joined values updated from B.
For any joined rows, I want A.AgentNo to become B.AgentNo.x, A.N to become B.N and A.Rarity to become B.Rarity. I would like to drop B.Rank and B.AgentNo.y completely.
The result should be:
Result <- data.frame(AgentNo = c(2, 2, 3, 5, 5, 6),
                     N = c(3, 5, 6, 1, 9, 0),
                     Rarity = c(1, 2, 1, 2, 2, 2))
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
After some data wrangling, you can use rows_update to update the rows of A with the values of B:
library(dplyr)

A <- A %>%
  mutate(AgentNo.y = AgentNo)

B <- select(B, AgentNo = AgentNo.x, AgentNo.y, N, Rarity)

rows_update(A, B, by = "AgentNo.y") %>%
  select(-AgentNo.y)
Output:
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
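Note that rows_update() needs dplyr 1.0.0 or newer. If it is unavailable, here is a base R sketch of the same update, starting from the original A and B and matching A's AgentNo against B's AgentNo.y, as the answer above effectively does:
idx <- match(B$AgentNo.y, A$AgentNo)  # positions in A of the rows to overwrite
A$AgentNo[idx] <- B$AgentNo.x
A$N[idx] <- B$N
A$Rarity[idx] <- B$Rarity
A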

Add a column that divides another column into n chunks, R

There's no easy way to describe my question; that's probably why I was not able to find an answer through search.
I have a data frame with 3 columns: one is Subject number, the other two are Correctness and Block. There are 2 participants, each exposed to 2 blocks with 3 stimuli per block.
subj corr block
1 1 1 1
2 1 0 1
3 1 1 1
4 1 1 2
5 1 1 2
6 1 1 2
7 2 0 1
8 2 1 1
9 2 1 1
10 2 0 2
11 2 1 2
12 2 1 2
So what I want to do is create another column that looks at a specific subj number and divides that subject's rows into 3 even chunks (the original df has 2 chunks). In general, I want to know how to divide the stimuli each subj is exposed to into N chunks and write the chunk number into another column.
subj corr block newblock
1 1 1 1 1
2 1 0 1 1
3 1 1 1 2
4 1 1 2 2
5 1 1 2 3
6 1 1 2 3
7 2 0 1 1
8 2 1 1 1
9 2 1 1 2
10 2 0 2 2
11 2 1 2 3
12 2 1 2 3
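A reconstruction of the example data, assumed from the table above:
df <- data.frame(subj = rep(1:2, each = 6),
                 corr = c(1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1),
                 block = rep(rep(1:2, each = 3), 2))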
Something like this:
library(dplyr)

n_chunks <- 3

df %>%
  group_by(subj) %>%
  mutate(newblock = rep(1:n_chunks, each = ceiling(n() / n_chunks))[1:n()])
How much of this is necessary depends on your use case. If you can guarantee that n_chunks evenly divides the number of observations for each subject you can simplify to:
df %>%
  group_by(subj) %>%
  mutate(newblock = rep(1:n_chunks, each = n() / n_chunks))
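If you prefer a built-in, dplyr's ntile() is a possible alternative; note it balances chunk sizes differently when n() is not divisible by n_chunks (for 7 rows and 3 chunks it gives sizes 3, 2, 2, whereas the rep() approach above gives 3, 3, 1):
df %>%
  group_by(subj) %>%
  mutate(newblock = ntile(row_number(), n_chunks))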

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing my first big problem.
Let's take the following panel dataset as an example:
N <- 5
T <- 3
time <- rep(1:T, times = N)
id <- rep(1:N, each = T)
dummy <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0)
df <- as.data.frame(cbind(id, time, dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable, for each cross-section, to take the value 1 in all rows after the first 1 appears for that cross-section. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new <- df %>%
  group_by(id) %>%
  ???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax, which takes the running maximum within each group, so every value after the first 1 becomes 1:
df %>%
  group_by(id) %>%
  mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do:
transform(df, dummy = ave(dummy, id, FUN = cummax))
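The same idea with data.table, if you are already using it (a sketch):
library(data.table)
setDT(df)[, dummy := cummax(dummy), by = id]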

Extracting event rows from a data frame

I have this data frame df:
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract the rows where a new event appears in the value column. For example, for ID = 1, var = 3 has a value of 2 at TIME = 0. This value stays the same at TIME = 2, so I would keep the first row at TIME = 0 only and discard the second row. However, in the third row the value for var = 3 has changed to zero, so I also have to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set and I want to extract the rows at which a new event occurs in the value of every var. I have five variables in the data frame, numbered 3, 4, 5, 6 and 7. The above is an example for two variables (variable numbers 3 and 4).
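For anyone testing the answers below, the data can be reconstructed like this (an assumed construction based on the printed table):
df <- data.frame(ID = rep(1:2, each = 7),
                 var = rep(c(3, 3, 3, 4, 4, 4, 4), 2),
                 TIME = rep(c(0, 2, 3, 0, 2, 4, 6), 2),
                 value = rep(c(2, 2, 0, 10, 10, 5, 5), 2),
                 method = 1)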
This does it using dplyr:
library(dplyr)

df %>%
  group_by(ID, var) %>%
  mutate(tf = ifelse(value == lag(value), 1, 0)) %>%
  filter(is.na(tf) | tf == 0) %>%
  select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
Basically, I created an extra variable that returns a 1 when the value is the same as in the preceding row within groups of unique ID/var combinations. We then drop this variable before returning the output.
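A more compact variant of the same idea, keeping each group's first row plus any row where value changes (a sketch using the same grouping):
df %>%
  group_by(ID, var) %>%
  filter(row_number() == 1 | value != lag(value))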
Base solution:
df[with(df, abs(ave(value, ID, FUN = function(x) c(1, diff(x))))) > 0, ]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
From the expected results, you may also try rleid from data.table:
library(data.table) # data.table_1.9.5
setDT(df)[df[, .I[1L], list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach to @thelatemail's:
setDT(df)[df[, .I[abs(c(1, diff(value))) > 0], ID]$V1]
Or:
unique(setDT(df)[, id := rleid(value)], by = c('ID', 'var', 'id'))

In R, how can I make a running count of runs?

Suppose I have an R dataframe that looks like this, where end.group signifies the end of a unique group of observations:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
I want to return the following, where group.count is a running count of the number of observations in a group, and group is a unique identifier for each group, in number order. Can anyone help me with a piece of R code to do this?
end.group group.count group
0 1 1
0 2 1
1 3 1
0 1 2
0 2 2
1 3 2
1 1 3
0 1 4
0 2 4
0 3 4
1 4 4
1 1 5
1 1 6
0 1 7
1 2 7
You can create group by using cumsum and rev. You need rev because end.group marks the end points of the groups, not the starts.
x <- data.frame(end.group = c(0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1))

# create groups
x$group <- rev(cumsum(rev(x$end.group)))
# re-number groups from smallest to largest
x$group <- abs(x$group - max(x$group) - 1)
Now you can use ave to create group.count.
x$group.count <- ave(x$end.group, x$group, FUN = seq_along)
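For comparison, the same logic as a single dplyr pipeline (a sketch; lag(cumsum(end.group), default = 0) + 1 numbers the groups in increasing order directly, so no reversal is needed):
library(dplyr)
x %>%
  mutate(group = lag(cumsum(end.group), default = 0) + 1) %>%
  group_by(group) %>%
  mutate(group.count = row_number()) %>%
  ungroup()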
Another base approach, working from the positions of the group ends:
x <- data.frame(end.group = c(0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1))
ends <- which(as.logical(x$end.group))
ends2 <- c(ends[1], diff(ends))
transform(x,
          group.count = unlist(sapply(ends2, seq)),
          group = rep(seq(length(ends)), times = ends2))
end.group group.count group
1 0 1 1
2 0 2 1
3 1 3 1
4 0 1 2
5 0 2 2
6 1 3 2
7 1 1 3
8 0 1 4
9 0 2 4
10 0 3 4
11 1 4 4
12 1 1 5
13 1 1 6
14 0 1 7
15 1 2 7
