Count by group with condition (in R data.table)

Consider a dataset that consists of ID and Val; the Counter column shows the desired result.
# dataset
ID Val Counter
1 2 1
1 4 2
1 NA 2
1 13 3
1 12 4
2 NA 0
2 33 1
2 5 2
2 5 3
A counter per subgroup can be added with dt[, normal_counter := 1:.N, by=ID]. I am looking for a counter that is not incremented when there is an NA value (see Counter in the example above).

This is a cumulative sum of non-NA values by group, so:
dat[, cntr := cumsum(!is.na(Val)), by=ID]
dat
# ID Val Counter cntr
#1: 1 2 1 1
#2: 1 4 2 2
#3: 1 NA 2 2
#4: 1 13 3 3
#5: 1 12 4 4
#6: 2 NA 0 0
#7: 2 33 1 1
#8: 2 5 2 2
#9: 2 5 3 3
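For a self-contained run, the example data can be rebuilt with fread first (a sketch; the question's Counter column is omitted since cntr reproduces it):
library(data.table)
# rebuild the example data from the question
dat <- fread("
ID Val
1 2
1 4
1 NA
1 13
1 12
2 NA
2 33
2 5
2 5")
# running count of non-NA values within each ID
dat[, cntr := cumsum(!is.na(Val)), by = ID]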

Related

Subsetting panel observations

I have a data.table with firm information.
library(data.table)
DT <- fread("
iso Firm GDP year
A 1 1 1
A 2 1 1
A 3 1 1
A 4 1 1
A 5 3 2
A 6 3 2
A 7 3 2
A 8 3 2
B 9 2 1
B 10 2 1
B 11 2 1
B 12 2 1
B 13 4 1
B 14 4 1
B 15 4 1
B 16 4 1",
header = TRUE)
I want to calculate GDP growth per country from one year to the next, ((new - old)/old), and add it to the dataset as GDPgrowth. However, if I do:
DT <- DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
the outcome is mostly zero because consecutive firm observations within the same year are subtracted from each other.
How can I make sure it calculates for the whole group of firms belonging to the country together?
Desired output:
library(data.table)
DT <- fread("
iso Firm GDP GDPgrowth year
A 1 1 NA 1
A 2 1 NA 1
A 3 1 NA 1
A 4 1 NA 1
A 5 3 2 2
A 6 3 2 2
A 7 3 2 2
A 8 3 2 2
B 9 2 NA 1
B 10 2 NA 1
B 11 2 NA 1
B 12 2 NA 1
B 13 4 1 1
B 14 4 1 1
B 15 4 1 1
B 16 4 1 1",
header = TRUE)
Here is one way, continuing from your current approach:
library(data.table)
DT[,GDPgrowth :=((GDP- shift(GDP))/shift(GDP)), by=iso]
DT[GDPgrowth == 0, GDPgrowth := NA]
DT[, GDPgrowth := zoo::na.locf(GDPgrowth, na.rm = FALSE), .(iso, year)]
DT
# iso Firm GDP year GDPgrowth
# 1: A 1 1 1 NA
# 2: A 2 1 1 NA
# 3: A 3 1 1 NA
# 4: A 4 1 1 NA
# 5: A 5 3 2 2
# 6: A 6 3 2 2
# 7: A 7 3 2 2
# 8: A 8 3 2 2
# 9: B 9 2 1 NA
#10: B 10 2 1 NA
#11: B 11 2 1 NA
#12: B 12 2 1 NA
#13: B 13 4 1 1
#14: B 14 4 1 1
#15: B 15 4 1 1
#16: B 16 4 1 1
Using dplyr and tidyr::fill, it can be done as
library(dplyr)
DT %>%
  group_by(iso) %>%
  mutate(GDPgrowth = (GDP - lag(GDP)) / lag(GDP),
         GDPgrowth = replace(GDPgrowth, GDPgrowth == 0, NA)) %>%
  group_by(iso, year) %>%
  tidyr::fill(GDPgrowth)
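A sketch of an alternative (not from the answers above): compute the growth once per country-year on the deduplicated data, then join it back to the firm rows. Unlike the zero-to-NA trick, this also survives a genuine growth rate of zero. The column name GDPgrowth2 is hypothetical.
library(data.table)
yearly <- unique(DT[, .(iso, year, GDP)])                 # one row per country-year
yearly[, growth := (GDP - shift(GDP)) / shift(GDP), by = iso]
DT[yearly, GDPgrowth2 := i.growth, on = .(iso, year)]     # join back to the firm rows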

Restructuring and formatting data frame columns

dfin <-
ID SEQ GRP C1 C2 C3 T1 T2 T3
1 1 1 0 5 8 0 1 2
1 2 1 5 10 15 5 6 7
2 1 2 20 25 30 0 1 2
C1 is the concentration (CONC) at T1 (TIME) and so on. This is what I want as an output:
dfout <-
ID SEQ GRP CONC TIME
1 1 1 0 0
1 1 1 5 1
1 1 1 8 2
1 2 1 5 5
1 2 1 10 6
1 2 1 15 7
2 1 2 20 0
2 1 2 25 1
2 1 2 30 2
dfin has many more Cx and Tx columns, where x indexes the concentration readings.
You can do this with data.table::melt, which can melt the table into multiple value columns based on column-name patterns:
library(data.table)
melt(
  setDT(df),
  id.vars = c("ID", "SEQ", "GRP"),
  # columns starting with C and T are melted into two separate value columns
  measure.vars = patterns("^C", "^T"),
  value.name = c('CONC', 'TIME')
)[order(ID, SEQ)][, variable := NULL][]
# ID SEQ GRP CONC TIME
#1: 1 1 1 0 0
#2: 1 1 1 5 1
#3: 1 1 1 8 2
#4: 1 2 1 5 5
#5: 1 2 1 10 6
#6: 1 2 1 15 7
#7: 2 1 2 20 0
#8: 2 1 2 25 1
#9: 2 1 2 30 2
Or, if the value column names follow the pattern [CT][0-9], you can use reshape from base R with sep="", which splits the value column names at the letter/digit boundary thanks to this default setting (from ?reshape):
split = if (sep == "") {
  list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
  list(regexp = sep, include = FALSE, fixed = TRUE)
}
reshape(df, varying=-(1:3), idvar=c("ID", "SEQ", "GRP"),
dir="long", sep="", v.names=c("CONC", "TIME"))
# ID SEQ GRP time CONC TIME
#1: 1 1 1 1 0 5
#2: 1 2 1 1 5 10
#3: 2 1 2 1 20 25
#4: 1 1 1 2 8 0
#5: 1 2 1 2 15 5
#6: 2 1 2 2 30 0
#7: 1 1 1 3 1 2
#8: 1 2 1 3 6 7
#9: 2 1 2 3 1 2
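If you prefer the tidyverse, tidyr (>= 1.0) can do the same multi-column reshape with pivot_longer; a sketch, where the helper column name "reading" is hypothetical:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = -c(ID, SEQ, GRP),
               names_to = c(".value", "reading"),  # .value takes the C/T prefix
               names_pattern = "([CT])([0-9]+)") %>%
  rename(CONC = C, TIME = T) %>%
  select(-reading)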

Count how many times a value has changed in a column using R

I want to count how many times the value has changed in a column within each group, and how many unique values there were in each group. I am sort of getting what I want, but there is an NA observation that I do not want counted.
df <- data.frame(x=c("a",'a', "a", "b",'b', "b", "c",'c', "d")
,y=c(1,2,NA,3,3,3,2,1,5))
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(y), by=x][]
setDT(df)[, count := uniqueN(y), by=x][]
x y wanted count
1: a 1 1 3
2: a 2 2 3
3: a NA 3 3
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
Desired results:
x y wanted count
1: a 1 1 2
2: a 2 2 2
3: a NA 2 2
4: b 3 1 1
5: b 3 1 1
6: b 3 1 1
7: c 2 1 2
8: c 1 2 2
9: d 5 1 1
I tried rleid(!is.na(y)), but it does not seem to work as I expected. Thank you.
We can replace the NA elements with the previous non-NA element (na.locf), take the rleid of that to get 'wanted', and count the unique non-NA elements to get 'count':
library(zoo)
setDT(df)[, c('wanted', 'count') := list(rleid(na.locf(y)), uniqueN(y, na.rm = TRUE)), x]
df
# x y wanted count
#1: a 1 1 2
#2: a 2 2 2
#3: a NA 2 2
#4: b 3 1 1
#5: b 3 1 1
#6: b 3 1 1
#7: c 2 1 2
#8: c 1 2 2
#9: d 5 1 1
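One caveat beyond the original answer: if a group's first value is NA, na.locf(y) drops it and the assignment fails with a length error. A defensive sketch, passing na.rm = FALSE so leading NAs are kept (rleid then treats them as their own run):
library(data.table)
library(zoo)
setDT(df)[, c('wanted', 'count') := list(
  rleid(na.locf(y, na.rm = FALSE)),   # keep leading NAs so lengths match
  uniqueN(y, na.rm = TRUE)            # unique non-NA values per group
), by = x]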

How to create a count variable by group for specific values in the variable of interest?

At the moment I have to deal with paradata (long format) generated by software during the data collection phase of a cohort study.
How can I create a variable containing the number of occurrences of a certain value within a group variable (like by id: gen _n if VAR1==2 in Stata)?
Basically the data looks like this:
ID: VAR1:
1 2
1 1
1 2
2 2
2 3
2 2
3 1
3 2
3 2
I can create a variable count.1 using
data$count.1 <- ave(data$VAR1, data$ID, FUN = seq_along)
ID: VAR1: count.1:
1 2 1
1 1 2
1 2 3
2 2 1
2 3 2
2 2 3
3 1 1
3 2 2
3 2 3
How can I create a variable count.2 that counts, by ID, the occurrences of the value 2 in VAR1?
ID: VAR1: count.1: count.2:
1 2 1 1
1 1 2 NA
1 2 3 2
2 2 1 1
2 3 2 NA
2 2 3 2
3 1 1 NA
3 2 2 1
3 2 3 2
The Data:
ID=c(1,1,1,2,2,2,3,3,3)
VAR1=c(2,1,2,2,3,2,1,2,2)
data <- as.data.frame(cbind(ID, VAR1))
Thanks in advance!!!
Try
data$count.2 <- with(data, ave(VAR1 == 2, ID,
                     FUN = function(x) ifelse(x, cumsum(x), NA)))
data$count.2
#[1] 1 NA 2 1 NA 2 NA 1 2
Or using data.table
library(data.table)
setDT(data)[VAR1==2, count.2:=1:.N, by=ID][]
# ID VAR1 count.2
#1: 1 2 1
#2: 1 1 NA
#3: 1 2 2
#4: 2 2 1
#5: 2 3 NA
#6: 2 2 2
#7: 3 1 NA
#8: 3 2 1
#9: 3 2 2
Or using dplyr
library(dplyr)
data %>%
  group_by(ID) %>%
  mutate(count.2 = ifelse(VAR1 == 2, cumsum(VAR1 == 2), NA))
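For completeness, the same cumsum logic also ports directly to data.table; a small sketch, equivalent to the answers above:
library(data.table)
# NA for non-matches, running count of 2s otherwise
setDT(data)[, count.2 := ifelse(VAR1 == 2, cumsum(VAR1 == 2), NA), by = ID][]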

Create a block column based on id and the value of another column in R

Given the following first two columns (id and time_diff), I want to generate the 'block' column:
test
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
The data is already sorted by id and time. time_diff is the difference between the current row's time and the previous row's time within the same id. I want to create a block id: an auto-incrementing value that increases when a new id is encountered, or when time_diff > 10 within the same id.
How can I achieve this in R?
Importing your data as a data frame with something like:
df = read.table(text='
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5')
You can get the block increments satisfying your two conditions in one expression:
> new_col = as.vector(cumsum(
na.exclude(
c(F,diff(as.numeric(as.factor(df$id)))) | # change of id OR
df$time_diff > 10 # time_diff greater than 10
)
))
> new_col
[1] 0 0 0 0 0 1 2 2 2 2 3 3 4 4 4
And finally append this new column to your data frame with cbind:
> cbind(df, block = c(0,new_col))
id time_diff block block
1 a NA 1 0
2 a 1 1 0
3 a 1 1 0
4 a 1 1 0
5 a 3 1 0
6 a 3 1 0
7 b NA 2 1
8 b 11 3 2
9 b 1 3 2
10 b 1 3 2
11 b 1 3 2
12 b 12 4 3
13 b 1 4 3
14 c NA 5 4
15 c 4 5 4
16 c 7 5 4
You will notice an offset between your desired block variable and mine: correcting it is easy and can be done at several different steps; I will leave it to you :)
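For completeness, the offset correction mentioned above amounts to adding one; a sketch, where block_fixed is a hypothetical name:
# shift the whole sequence so blocks start at 1, matching the wanted output
cbind(df, block_fixed = c(0, new_col) + 1)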
Another variation of @Jealie's method would be:
with(test, cumsum(c(TRUE,id[-1]!=id[-nrow(test)])|time_diff>10))
#[1] 1 1 1 1 1 1 2 3 3 3 3 4 4 5 5 5
After learning from Jealie and akrun, I came up with this idea.
mydf %>%
  mutate(group = cumsum(time_diff > 10 | !duplicated(id)))
# id time_diff block group
#1 a NA 1 1
#2 a 1 1 1
#3 a 1 1 1
#4 a 1 1 1
#5 a 3 1 1
#6 a 3 1 1
#7 b NA 2 2
#8 b 11 3 3
#9 b 1 3 3
#10 b 1 3 3
#11 b 1 3 3
#12 b 12 4 4
#13 b 1 4 4
#14 c NA 5 5
#15 c 4 5 5
#16 c 7 5 5
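A data.table analogue of the same idea (a sketch, not from the original answers; block2 is a hypothetical column name):
library(data.table)
# a new block starts at the first row of each id (the data is sorted by id),
# or whenever time_diff exceeds 10; is.na() guards the NA rows
setDT(mydf)[, block2 := cumsum(!duplicated(id) | (time_diff > 10 & !is.na(time_diff)))]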
Here is an approach using dplyr:
require(dplyr)
set.seed(999)
test <- data.frame(
id = rep(letters[1:4], each = 3),
time_diff = sample(4:15)
)
test %>%
  mutate(
    b = as.integer(id) - lag(as.integer(id)),
    more10 = time_diff > 10,
    increment = pmax(b, more10, na.rm = TRUE),
    increment = ifelse(row_number() == 1, 1, increment),
    block = cumsum(increment)
  ) %>%
  select(id, time_diff, block)
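One caution about this answer: since R 4.0, data.frame() no longer converts strings to factors, so as.integer(id) yields NAs (with a warning) on a character column and breaks the block computation. A hedged tweak is to convert explicitly first:
# make id a factor so as.integer() yields group codes
test$id <- factor(test$id)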
Try:
> df
id time_diff
1 a NA
2 a 1
3 a 1
4 a 1
5 a 3
6 a 3
7 b NA
8 b 11
9 b 1
10 b 1
11 b 1
12 b 12
13 b 1
14 c NA
15 c 4
16 c 7
block <- 1
for (i in 2:nrow(df)) {
  block[i] <- ifelse(df$time_diff[i] > 10 || df$id[i] != df$id[i-1],
                     block[i-1] + 1,
                     block[i-1])
}
df$block <- block
df
id time_diff block
1 a NA 1
2 a 1 1
3 a 1 1
4 a 1 1
5 a 3 1
6 a 3 1
7 b NA 2
8 b 11 3
9 b 1 3
10 b 1 3
11 b 1 3
12 b 12 4
13 b 1 4
14 c NA 5
15 c 4 5
16 c 7 5
