Indexing by multiple criteria between 2 data frames in R - r

I have the following data frame:
id day event
1 1 1
1 3 1
2 1 0
2 4 0
2 9 0
2 15 0
3 2 0
3 5 0
4 1 1
4 8 1
4 11 1
What i want is when an event has a value zero then all the event values become one except from the last one(by date). So the output should be the following:
id day event
1 1 1
1 3 1
2 1 1
2 4 1
2 9 1
2 15 0
3 2 1
3 5 0
4 1 1
4 8 1
4 11 1
Any help?

We could use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id', if any of the 'event' is 0 (!event) for that particular 'id', we replicate 1 for the length of that group -1 (.N-1) and concatenate with 0 or else to return the 'event' value, assign (:=) to update the 'event' column.
library(data.table)
setDT(df1)[, event :=if(any(!event)) c(rep(1L, .N-1),0L) else event, by = id]
df1
# id day event
# 1: 1 1 1
# 2: 1 3 1
# 3: 2 1 1
# 4: 2 4 1
# 5: 2 9 1
# 6: 2 15 0
# 7: 3 2 1
# 8: 3 5 0
# 9: 4 1 1
#10: 4 8 1
#11: 4 11 1
Or using dplyr, we group by 'id' and change the 'event' column by taking the lead of the logical vector that is replicated and add with another logical vector (all(event)).
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(event= lead(rep(any(!event), n()), default=0) + all(event))
# id day event
# (int) (int) (dbl)
#1 1 1 1
#2 1 3 1
#3 2 1 1
#4 2 4 1
#5 2 9 1
#6 2 15 0
#7 3 2 1
#8 3 5 0
#9 4 1 1
#10 4 8 1
#11 4 11 1

Related

Create variable that flags an ID if it has existed in any previous month

I am unsure of how to create a variable that flags an ID in the current month if the ID has existed in any previous month.
Example data:
ID<-c(1,2,3,2,3,4,1,5)
Month<-c(1,1,1,2,2,2,3,3)
Flag<-c(0,0,0,1,1,0,1,0)
have<-cbind(ID,Month)
> have
ID Month
1 1
2 1
3 1
2 2
3 2
4 2
1 3
5 3
want:
> want
ID Month Flag
1 1 0
2 1 0
3 1 0
2 2 1
3 2 1
4 2 0
1 3 1
5 3 0
a data.table approach
library(data.table)
# set to data.table format
DT <- as.data.table(have)
# initialise Signal column
DT[, Signal := 0]
# flag duplicates with a 1
DT[duplicated(ID), Signal := 1, by = Month][]
ID Month Signal
1: 1 1 0
2: 2 1 0
3: 3 1 0
4: 2 2 1
5: 3 2 1
6: 4 2 0
7: 1 3 1
8: 5 3 0
The idea is presented from akrun in the comments. Here is the dplyr application:
First use as_tibble to bring matrix in tibble format
then use an ifelse statement with duplicated as #akrun already suggests.
library(tibble)
library(dplyr)
have %>%
as_tibble() %>%
mutate(flag = ifelse(duplicated(ID),1,0))
ID Month flag
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 0
4 2 2 1
5 3 2 1
6 4 2 0
7 1 3 1
8 5 3 0

Count number of values which are less than current value

I'd like to count the rows in the column input if the values are smaller than the current row (Please see the results wanted below). The issue to me is that the condition is based on current row value, so it is very different from general case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect to get are like this. For example, for observations 5 and 6 (with value 2), there are 4 observations with value 1 less than their value 2. Hence count is given value 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data with dplyr, the ultimate results I wish to get is like below, that is, I am wishing the conditions could be dynamic within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
input = c(1,1,1,1,2,2,3,5,5,5,5,6),
count=c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with tidyverse
library(tidyverse)
data %>%
mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add the group by 'id' in the above code
data %>%
group_by(id) %>%
mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and for each input count how many values are greater than itself.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr one way would be using rowwise function and following the same logic.
library(dplyr)
data %>%
rowwise() %>%
mutate(count = sum(input > data$input))
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
tt <- cumsum(table(data$input))
v <- setNames(c(0, head(tt, -1)), c(head(names(tt), -1), tail(names(tt), 1)))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
Perhaps more efficient with a non-equi join in data.table. Count number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4

If a value appears in the row, all subsequent rows should take this value (with dplyr)

I'm just starting to learn R and I'm already facing the first bigger problem.
Let's take the following panel dataset as an example:
N=5
T=3
time<-rep(1:T, times=N)
id<- rep(1:N,each=T)
dummy<- c(0,0,1,1,0,0,0,1,0,0,0,1,0,1,0)
df<-as.data.frame(cbind(id, time,dummy))
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 0
6 2 3 0
7 3 1 0
8 3 2 1
9 3 3 0
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 0
I now want the dummy variable for all rows of a cross section to take the value 1 after the 1 for this cross section appears for the first time. So, what I want is:
id time dummy
1 1 1 0
2 1 2 0
3 1 3 1
4 2 1 1
5 2 2 1
6 2 3 1
7 3 1 0
8 3 2 1
9 3 3 1
10 4 1 0
11 4 2 0
12 4 3 1
13 5 1 0
14 5 2 1
15 5 3 1
So I guess I need something like:
df_new<-df %>%
group_by(id) %>%
???
I already tried to set all zeros to NA and use the na.locf function, but it didn't really work.
Anybody got an idea?
Thanks!
Use cummax
df %>%
group_by(id) %>%
mutate(dummy = cummax(dummy))
# A tibble: 15 x 3
# Groups: id [5]
# id time dummy
# <dbl> <dbl> <dbl>
# 1 1 1 0
# 2 1 2 0
# 3 1 3 1
# 4 2 1 1
# 5 2 2 1
# 6 2 3 1
# 7 3 1 0
# 8 3 2 1
# 9 3 3 1
#10 4 1 0
#11 4 2 0
#12 4 3 1
#13 5 1 0
#14 5 2 1
#15 5 3 1
Without additional packages you could do
transform(df, dummy = ave(dummy, id, FUN = cummax))

Restructuing and formatting data frame columns

dfin <-
ID SEQ GRP C1 C2 C3 T1 T2 T3
1 1 1 0 5 8 0 1 2
1 2 1 5 10 15 5 6 7
2 1 2 20 25 30 0 1 2
C1 is the concentration (CONC) at T1 (TIME) and so on. This is what I want as an output:
dfout <-
ID SEQ GRP CONC TIME
1 1 1 0 0
1 1 1 5 1
1 1 1 8 2
1 2 1 5 5
1 2 1 10 6
1 2 1 15 7
2 1 2 20 0
2 1 2 25 1
2 1 2 30 2
The dfin has much more columns for Cx and Tx where x is the number of concentration readings.
You can do this with data.table::melt, with its capability of melting the table into multiple columns based on the columns pattern:
library(data.table)
melt(
setDT(df),
id.vars=c("ID", "SEQ", "GRP"),
# columns starts with C and T should be melted into two separate columns
measure.vars=patterns("^C", "^T"),
value.name=c('CONC', 'TIME')
)[order(ID, SEQ)][, variable := NULL][]
# ID SEQ GRP CONC TIME
#1: 1 1 1 0 0
#2: 1 1 1 5 1
#3: 1 1 1 8 2
#4: 1 2 1 5 5
#5: 1 2 1 10 6
#6: 1 2 1 15 7
#7: 2 1 2 20 0
#8: 2 1 2 25 1
#9: 2 1 2 30 2
Or if the value column names follow the pattern [CT][0-9], you can use reshape from base R by specifying the sep="" which will split the value columns name by the letter/digit separation due to this default setting (from ?reshape):
split = if (sep == "") {
list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
list(regexp = sep, include = FALSE, fixed = TRUE)}
reshape(df, varying=-(1:3), idvar=c("ID", "SEQ", "GRP"),
dir="long", sep="", v.names=c("CONC", "TIME"))
# ID SEQ GRP time CONC TIME
#1: 1 1 1 1 0 5
#2: 1 2 1 1 5 10
#3: 2 1 2 1 20 25
#4: 1 1 1 2 8 0
#5: 1 2 1 2 15 5
#6: 2 1 2 2 30 0
#7: 1 1 1 3 1 2
#8: 1 2 1 3 6 7
#9: 2 1 2 3 1 2

Extracting event rows from a data frame

I have this data frame:
df <-
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract rows that has a new eventin value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=1, so I would take the first row at TIME=0 only and discard the second row. However, the third row, the value for var=3 has changed into zero, so I have also to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could any one help me doing this in R? I have a huge data set and I want to extract the information at which a new event has occurred for the value of every var. I have 4 variables in the data frame numbered (3, 4,5,6, and 7). The above is an example for 2 variables (variable number: 3 and 4).
This does it using dplyr
library(dplyr)
df %>%
group_by(ID, var) %>%
mutate(tf = ifelse(value==lag(value), 1, 0)) %>%
filter(is.na(tf) | tf==0) %>%
select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
basically, I created an extra variable that returns a '1' when the value is the same as the preceding row within groups of unique ID/var combinations. We then get rid of this variable before returning the output.
Base solution:
df[with(df, abs(ave(value,ID,FUN=function(x) c(1,diff(x)) ))) > 0,]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach as #thelatemail
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , ID]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))

Resources