Extracting event rows from a data frame - r

I have this data frame:
df <-
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract the rows where a new event occurs in the value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=2, so I would keep only the first row (TIME=0) and discard the second. In the third row, however, the value for var=3 has changed to zero, so I also have to extract this row. The same applies to the rest of the variables, and this has to be done for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set and I want to extract the rows at which a new event has occurred in the value of every var. The variables in the data frame are numbered 3, 4, 5, 6, and 7; the above is an example with 2 of them (variables 3 and 4).

This does it using dplyr
library(dplyr)
df %>%
group_by(ID, var) %>%
mutate(tf = ifelse(value==lag(value), 1, 0)) %>%
filter(is.na(tf) | tf==0) %>%
select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
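Basically, I created an extra variable that returns 1 when the value is the same as in the preceding row within each unique ID/var combination (and NA for the first row of each group, which has no preceding row). The filter keeps the rows where it is NA or 0, and we then drop this variable before returning the output.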

Base solution:
# group by both ID and var so that a change is only detected within the same variable
df[with(df, ave(value, ID, var, FUN=function(x) c(1, diff(x))) != 0), ]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1

From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach to @thelatemail's
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , list(ID, var)]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))
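For readers unfamiliar with rleid, the underlying idea can be sketched in base R: a run ID increases by one each time a value differs from its predecessor. The helper name run_id is hypothetical and introduced only for illustration; it is not part of data.table, and this is a minimal sketch rather than a replacement for rleid.

```r
# Hypothetical helper (not part of data.table): number consecutive runs of equal values
run_id <- function(x) {
  if (length(x) == 0L) return(integer(0))
  # TRUE wherever an element differs from its predecessor; the first element starts run 1
  cumsum(c(TRUE, x[-1L] != x[-length(x)]))
}

value <- c(2, 2, 0, 10, 10, 5, 5)
ids <- run_id(value)

# Keep only the first element of each run
first_of_run <- !duplicated(ids)
events <- value[first_of_run]
```

Applied per ID/var group, the first row of each run is exactly the "new event" row the question asks for.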

Related

How to create two columns based on some criteria in R

The data I have is almost similar to the data below.
A=01-03
B=04-06
C=07-09
D=10-11
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
I want to create 2 columns right after the Class column, i.e. the Bin and Zero columns based on A, B, C and D and IDs.
That is, A goes to the first set of IDs 1, 2, and 3, B goes to the next set of IDs 1, 2, and 3, C to the set after that, and so on. The Zero column contains only zeros. So the outcome would be:
ID Class Bin Zero Time1 Time2 Time3
1 1 01-03 0 1 3 3
2 1 01-03 0 4 3 2
3 1 01-03 0 2 2 2
1 2 04-06 0 1 4 1
2 3 04-06 0 2 1 1
3 2 04-06 0 3 2 3
1 3 07-09 0 1 1 2
2 2 07-09 0 4 3 1
3 3 07-09 0 3 2 1
1 1 10-11 0 4 3 2
2 1 10-11 0 2 2 2
3 2 10-11 0 1 4 1
Please try the code below:
library(tidyverse)
#use character vector with quotes
A='01-03'
B='04-06'
C='07-09'
D='10-11'
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
#create a separate dataframe with bin column
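data2 <- data.frame(Bin=c(rep(A,3),rep(B,3),rep(C,3),rep(D,3)))
#bind the columns, add Zero, and move both new columns right after Class
data3 <- bind_cols(data, data2) %>% mutate(Zero=0) %>% relocate(Bin, Zero, .after=Class)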
If you are open to a dplyr based solution, you could use
library(dplyr)
data %>%
group_by(ID) %>%
mutate(Bin = c(A, B, C, D),
Zero = 0,
.after = 2) %>%
ungroup()
This returns
# A tibble: 12 × 7
ID Class Bin Zero Time1 Time2 Time3
<int> <int> <chr> <dbl> <int> <int> <int>
1 1 1 01-03 0 1 3 3
2 2 1 01-03 0 4 3 2
3 3 1 01-03 0 2 2 2
4 1 2 04-06 0 1 4 1
5 2 3 04-06 0 2 1 1
6 3 2 04-06 0 3 2 3
7 1 3 07-09 0 1 1 2
8 2 2 07-09 0 4 3 1
9 3 3 07-09 0 3 2 1
10 1 1 10-11 0 4 3 2
11 2 1 10-11 0 2 2 2
12 3 2 10-11 0 1 4 1
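A base R variant of the same idea can be sketched as follows: the k-th time an ID appears, it receives the k-th bin label. The names bins, occurrence, and out are assumptions made for this illustration, and the sketch assumes no ID occurs more often than there are labels.

```r
data <- read.table(text = "ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header = TRUE)

# Bin labels in the order they should be handed out
bins <- c("01-03", "04-06", "07-09", "10-11")

# For every row: how many times has this ID appeared so far (1, 2, 3, ...)?
occurrence <- ave(seq_along(data$ID), data$ID, FUN = seq_along)

# Insert Bin and Zero between the first two columns and the remaining ones
out <- data.frame(data[1:2], Bin = bins[occurrence], Zero = 0, data[-(1:2)])
```

Because the label is looked up from the occurrence count, this does not depend on each block having exactly three rows.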

How do you duplicate rows n times by group and change one specific column value in R?

I am trying to create duplicate rows by group. The number of duplicate rows I want to create varies by group and I want to fix the value of one column Attended = 0.
A minimal working example of the data set DF I am working with is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
2 5 1 2
2 5 1 4
3 7 0 1
For the example above, suppose I want every person (ID) to have 5 rows, with Demo the same across all rows for each individual. Thus, I have to create 1 row for ID = 1, 3 for ID = 2 and 4 for ID = 3 (I would like to calculate these dynamically for each subgroup). For the new rows I generate, I want Attended = 0 and t to take on the value of a missing index, so that the final output is:
ID Demo Attended t
1 3 1 1
1 3 1 3
1 3 0 4
1 3 1 5
1 3 0 2
2 5 1 2
2 5 1 4
2 5 0 1
2 5 0 3
2 5 0 5
3 7 0 1
3 7 0 2
3 7 0 3
3 7 0 4
3 7 0 5
I have been able to create duplicate rows by group, but haven't been able to figure out how to create a different number of duplicates per participant and correctly fill in the index column t.
Here is what I have working:
DF %>%
group_by(ID) %>%
rbind(., mutate(., t = row_number()))
I have been trying to create the right number of duplicates using slice() and trying to get the t value to be exactly what I want but to no avail.
Any help would be appreciated!
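One tidyverse possibility could be:
library(dplyr)
library(tidyr)
# nesting(ID, Demo) keeps Demo constant within each ID for the newly created rows
DF %>%
complete(t, nesting(ID, Demo), fill = list(Attended = 0)) %>%
arrange(ID)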
t ID Demo Attended
<int> <int> <int> <dbl>
1 1 1 3 1
2 2 1 3 0
3 3 1 3 1
4 4 1 3 0
5 5 1 3 1
6 1 2 5 0
7 2 2 5 1
8 3 2 5 0
9 4 2 5 1
10 5 2 5 0
11 1 3 7 0
12 2 3 7 0
13 3 3 7 0
14 4 3 7 0
15 5 3 7 0
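The same completion can be sketched without tidyr by building the full ID-by-t grid and left-joining the observed rows onto it. This is a minimal sketch assuming the target index is t = 1..5, as in the question; the names ids, grid, and full are introduced here for illustration.

```r
DF <- data.frame(
  ID       = c(1, 1, 1, 1, 2, 2, 3),
  Demo     = c(3, 3, 3, 3, 5, 5, 7),
  Attended = c(1, 1, 0, 1, 1, 1, 0),
  t        = c(1, 3, 4, 5, 2, 4, 1)
)

# One row per (ID, Demo) pair, crossed with every time index 1..5
ids  <- unique(DF[c("ID", "Demo")])
grid <- merge(ids, data.frame(t = c(1, 2, 3, 4, 5)))  # no shared columns: Cartesian product

# Left join the observed rows; combinations that were not observed get Attended = NA
full <- merge(grid, DF, by = c("ID", "Demo", "t"), all.x = TRUE)
full$Attended[is.na(full$Attended)] <- 0

# Restore the original column order and sort by person and time
full <- full[order(full$ID, full$t), c("ID", "Demo", "Attended", "t")]
```

Unlike the desired output in the question, this sorts the new rows by t within each ID instead of appending them after the observed rows.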

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))
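Why cummax does the job can be seen on a plain vector: the running maximum of a 0/1 indicator stays at 1 once the first 1 has been seen. The short sketch below uses made-up values (not the data from the question) and also shows an equivalent formulation based on cumsum.

```r
dummy <- c(0, 0, 1, 0, 1, 0)
id    <- c(1, 1, 1, 2, 2, 2)

# Running maximum over the whole vector: stays at 1 after the first 1
overall <- cummax(dummy)

# The same idea, restarted within each id
by_id <- ave(dummy, id, FUN = cummax)

# Equivalent formulation for a 0/1 indicator: has at least one 1 occurred so far?
by_id_alt <- ave(dummy, id, FUN = function(d) as.numeric(cumsum(d) > 0))
```

Both grouped versions restart at the first row of each id, which is why the grouping step matters.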

Indexing by multiple criteria between 2 data frames in R

I have the following data frame:
id day event
1 1 1
1 3 1
2 1 0
2 4 0
2 9 0
2 15 0
3 2 0
3 5 0
4 1 1
4 8 1
4 11 1
What I want is this: when an id has events with the value zero, all of its event values should become one except for the last one (by day). So the output should be the following:
id day event
1 1 1
1 3 1
2 1 1
2 4 1
2 9 1
2 15 0
3 2 1
3 5 0
4 1 1
4 8 1
4 11 1
Any help?
We could use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and group by 'id'. If any 'event' is 0 (!event) for a particular 'id', we replicate 1 for the length of that group minus one (.N-1) and concatenate a 0; otherwise we return the 'event' values unchanged. The result is assigned (:=) to update the 'event' column. This assumes the rows are ordered by 'day' within each 'id'.
library(data.table)
setDT(df1)[, event :=if(any(!event)) c(rep(1L, .N-1),0L) else event, by = id]
df1
# id day event
# 1: 1 1 1
# 2: 1 3 1
# 3: 2 1 1
# 4: 2 4 1
# 5: 2 9 1
# 6: 2 15 0
# 7: 3 2 1
# 8: 3 5 0
# 9: 4 1 1
#10: 4 8 1
#11: 4 11 1
Or using dplyr, we group by 'id' and rebuild the 'event' column: any(!event) is replicated over the group and shifted with lead (so the last row gets the default 0), and all(event) is added so that groups consisting only of 1s keep their 1s.
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(event= lead(rep(any(!event), n()), default=0) + all(event))
# id day event
# (int) (int) (dbl)
#1 1 1 1
#2 1 3 1
#3 2 1 1
#4 2 4 1
#5 2 9 1
#6 2 15 0
#7 3 2 1
#8 3 5 0
#9 4 1 1
#10 4 8 1
#11 4 11 1
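For comparison, the same rule can be sketched in base R with ave. This is a minimal sketch that assumes the rows are already sorted by day within each id, and it rebuilds the example data from the question.

```r
df1 <- data.frame(
  id    = c(1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4),
  day   = c(1, 3, 1, 4, 9, 15, 2, 5, 1, 8, 11),
  event = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1)
)

# Per id: if any event is 0, set all but the last row to 1 and the last row to 0;
# otherwise leave the events untouched
df1$event <- ave(df1$event, df1$id, FUN = function(e) {
  if (any(e == 0)) c(rep(1, length(e) - 1), 0) else e
})
```

If the data are not sorted, ordering by id and day first (for example with order()) would be needed for "last" to mean "latest day".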

In R, how can I make a running count of runs?

Suppose I have an R dataframe that looks like this, where end.group signifies the end of a unique group of observations:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
I want to return the following, where group.count is a running count of the number of observations in a group, and group is a unique identifier for each group, in number order. Can anyone help me with a piece of R code to do this?
end.group group.count group
0 1 1
0 2 1
1 3 1
0 1 2
0 2 2
1 3 2
1 1 3
0 1 4
0 2 4
0 3 4
1 4 4
1 1 5
1 1 6
0 1 7
1 2 7
You can create group by using cumsum and rev. You need rev because you have the end points of the groups.
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
# create groups
x$group <- rev(cumsum(rev(x$end.group)))
# re-number groups from smallest to largest
x$group <- abs(x$group-max(x$group)-1)
Now you can use ave to create group.count.
x$group.count <- ave(x$end.group, x$group, FUN=seq_along)
An alternative is to take the positions of the group ends and derive the run lengths from them:
x <- data.frame(end.group=c(0,0,1,0,0,1,1,0,0,0,1,1,1,0,1))
ends <- which(as.logical(x$end.group))
ends2 <- c(ends[1],diff(ends))
transform(x, group.count=unlist(sapply(ends2,seq)), group=rep(seq(length(ends)),times=ends2))
end.group group.count group
1 0 1 1
2 0 2 1
3 1 3 1
4 0 1 2
5 0 2 2
6 1 3 2
7 1 1 3
8 0 1 4
9 0 2 4
10 0 3 4
11 1 4 4
12 1 1 5
13 1 1 6
14 0 1 7
15 1 2 7
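A forward-looking variant avoids rev altogether: a new group starts on the first row and on every row that follows an end marker, so a cumulative sum over those start flags yields the group number directly. This is a sketch of that idea; the name starts is introduced here for illustration.

```r
x <- data.frame(end.group = c(0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1))

# 1 on the first row and on each row directly after an end marker, 0 elsewhere
starts <- c(1, head(x$end.group, -1))

# Running count of group starts = group identifier
x$group <- cumsum(starts)

# Position of each row within its group
x$group.count <- ave(x$end.group, x$group, FUN = seq_along)
```

A trailing run without an end marker would simply form one more (unfinished) group under this scheme.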

Resources