Relative reference to rows in a large data set in R

I have a very large data set (millions of rows) in which I need to set rows to NA whenever var1 equals "Z". However, I also need to set to NA the row immediately preceding any row where var1 == "Z".
E.g.:
id var1
1 A
1 B
1 Z
1 S
1 A
1 B
2 A
2 B
3 A
3 B
3 A
3 B
4 A
4 B
4 A
4 B
In this case, the second and third rows for id == 1 should become NA.
I have tried a loop, but it is impractically slow on a data set this large:
for (i in 1:length(df$var1)) {
  if (df$var1[i] == "Z") {
    df[i, ] <- NA
    if (i > 1) df[i - 1, ] <- NA  # also blank the preceding row
  }
}
I have also tried the data.table package, unsuccessfully. Do you have any idea how I could do this, or what the right term is to search for?

Maybe do it like this using data.table:
df <- as.data.table(read.table(header=T, file='clipboard'))
df$var1 <- as.character(df$var1)
#find where var1 == Z
index <- df[, which(var1 == 'Z')]
#add the previous lines too
index <- c(index, index-1)
#convert to NA
df[index, var1 := NA ]
Or in one call:
df[c(which(var1 == 'Z'), which(var1 == 'Z') - 1), var1 := NA ]
Output:
> df
id var1
1: 1 A
2: 1 NA
3: 1 NA
4: 1 S
5: 1 A
6: 1 B
7: 2 A
8: 2 B
9: 3 A
10: 3 B
11: 3 A
12: 3 B
13: 4 A
14: 4 B
15: 4 A
16: 4 B

If you want to take the preceding indices into account only when they come from the same id, I would suggest using the .I and by combination, which makes sure that you are not taking indices from a previous id:
setDT(df)[, var1 := as.character(var1)]
indx <- df[, {indx <- which(var1 == "Z") ; .I[c(indx - 1L, indx)]}, by = id]$V1
df[indx, var1 := NA_character_]
df
# id var1
# 1: 1 A
# 2: 1 NA
# 3: 1 NA
# 4: 1 S
# 5: 1 A
# 6: 1 B
# 7: 2 A
# 8: 2 B
# 9: 3 A
# 10: 3 B
# 11: 3 A
# 12: 3 B
# 13: 4 A
# 14: 4 B
# 15: 4 A
# 16: 4 B

You can also take a base R approach:
x <- df$var1 == 'Z'
df[x | c(x[-1], FALSE), 'var1'] <- NA
# id var1
#1 1 A
#2 1 <NA>
#3 1 <NA>
#4 1 S
#5 1 A
#6 1 B
#7 2 A
#8 2 B
#9 3 A
#10 3 B
#11 3 A
#12 3 B
#13 4 A
#14 4 B
#15 4 A
#16 4 B
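The same id-aware logic can also be sketched in dplyr with lead() inside a grouped mutate. This is a minimal sketch, assuming var1 is a character column; the toy data below mirrors the question:

```r
library(dplyr)

# toy data mirroring the question
df <- data.frame(id   = c(1, 1, 1, 1, 1, 1, 2, 2),
                 var1 = c("A", "B", "Z", "S", "A", "B", "A", "B"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(id) %>%
  # a row becomes NA if it is "Z" or if the next row within the same id is "Z";
  # lead(default = "") avoids an NA comparison on the last row of each group
  mutate(var1 = ifelse(var1 == "Z" | lead(var1, default = "") == "Z",
                       NA_character_, var1)) %>%
  ungroup()
# rows 2 and 3 (the "Z" and its predecessor within id 1) are now NA
```

Because lead() is evaluated per group, the row before a "Z" that belongs to a different id is left untouched.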

How to calculate yearly retention rate by group in R?

I have a large data set with individuals located in counties over a period of multiple years. Each year, some individuals move to a different county or leave the data set and new individuals join.
I would like to count the number of individuals that stayed in the same county from year to year and from year 1. Here is the question I found that comes closest to this task (without the additional grouping by counties): Month-over-month Customer Retention Rate in R
Here is a simplified version of the data set:
dt <- setDT(data.frame(ID = rep(c('a', 'b', 'c', 'd', 'a', 'c', 'd', 'e', 'c', 'e', 'f'), 2),
                       CTY = rep(c(1, 2), each = 11),
                       YEAR = rep(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3), 2)))
My solution so far relies on a loop:
x <- matrix(NA, 2, 3)
y <- matrix(NA, 2, 3)
for (i in 1:2) {
  for (j in 1:3) {
    x[i, j] <- ifelse(j == 1, NA, sum(dt[CTY == i & YEAR == j, ID] %in% dt[CTY == i & YEAR == j - 1, ID]))
    y[i, j] <- ifelse(j == 1, NA, sum(dt[CTY == i & YEAR == 1, ID] %in% dt[CTY == i & YEAR == j, ID]))
  }
}
which gives, after reshaping and joining:
colnames(x) <- unique(dt$YEAR)
rownames(x) <- unique(dt$CTY)
x <- reshape2::melt(x)
names(x) <- c("CTY", "YEAR", "stayed")
x <- x[order(x$CTY), ]
colnames(y) <- unique(dt$YEAR)
rownames(y) <- unique(dt$CTY)
y <- reshape2::melt(y)
names(y) <- c("CTY", "YEAR", "stayed2")
y <- y[order(y$CTY), ]
dt <- dt[x, on = c("CTY", "YEAR")]
dt <- dt[y, on = c("CTY", "YEAR")]
dt
# ID CTY YEAR stayed stayed2
# 1: a 1 1 NA NA
# 2: b 1 1 NA NA
# 3: c 1 1 NA NA
# 4: d 1 1 NA NA
# 5: a 1 2 3 3
# 6: c 1 2 3 3
# 7: d 1 2 3 3
# 8: e 1 2 3 3
# 9: c 1 3 2 1
# 10: e 1 3 2 1
# 11: f 1 3 2 1
# 12: a 2 1 NA NA
# 13: b 2 1 NA NA
# 14: c 2 1 NA NA
# 15: d 2 1 NA NA
# 16: a 2 2 3 3
# 17: c 2 2 3 3
# 18: d 2 2 3 3
# 19: e 2 2 3 3
# 20: c 2 3 2 1
# 21: e 2 3 2 1
# 22: f 2 3 2 1
This produces the correct final table, but it requires post-processing of the loop output that seems unnecessary; in short, it works, but it is clunky and slow.
I have experimented with data.table and dplyr solutions but can't seem to make them work.
Try the sapply function like this:
fx <- function(x) ifelse(x$YEAR == 1, NA, sum(dt[CTY == x$CTY & YEAR == x$YEAR, ID] %in% dt[CTY == x$CTY & YEAR == x$YEAR - 1, ID]))
fy <- function(y) ifelse(y$YEAR == 1, NA, sum(dt[CTY == y$CTY & YEAR == 1, ID] %in% dt[CTY == y$CTY & YEAR == y$YEAR, ID]))
x <- merge(data.frame(CTY = 1:2), data.frame(YEAR = 1:3))
s <- data.frame(x, stayed = sapply(split(x, 1:nrow(x)), fx))
s <- data.frame(s, stayed2 = sapply(split(x, 1:nrow(x)), fy))
merge(dt,s)
# CTY YEAR ID stayed stayed2
# 1: 1 1 a NA NA
# 2: 1 1 b NA NA
# 3: 1 1 c NA NA
# 4: 1 1 d NA NA
# 5: 1 2 a 3 3
# 6: 1 2 c 3 3
# 7: 1 2 d 3 3
# 8: 1 2 e 3 3
# 9: 1 3 c 2 1
# 10: 1 3 e 2 1
# 11: 1 3 f 2 1
# 12: 2 1 a NA NA
# 13: 2 1 b NA NA
# 14: 2 1 c NA NA
# 15: 2 1 d NA NA
# 16: 2 2 a 3 3
# 17: 2 2 c 3 3
# 18: 2 2 d 3 3
# 19: 2 2 e 3 3
# 20: 2 3 c 2 1
# 21: 2 3 e 2 1
# 22: 2 3 f 2 1
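To stay entirely within data.table and avoid the loop, here is a hedged sketch that collapses each CTY/YEAR cell into its vector of IDs and then compares adjacent years. The helper columns (ids) and the result name (out) are my own additions, not part of the answer above:

```r
library(data.table)

dt <- data.table(ID   = rep(c('a','b','c','d','a','c','d','e','c','e','f'), 2),
                 CTY  = rep(c(1, 2), each = 11),
                 YEAR = rep(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3), 2))

# one row per CTY/YEAR, holding that cell's vector of IDs as a list column
res <- dt[, .(ids = list(unique(ID))), by = .(CTY, YEAR)]
setorder(res, CTY, YEAR)

# year-over-year retention: overlap with the previous year's ID vector
res[, stayed := {
  prev <- c(list(NULL), head(ids, -1))   # lag the list column within the group
  mapply(function(cur, p) if (is.null(p)) NA_integer_ else sum(cur %in% p),
         ids, prev)
}, by = CTY]

# retention from year 1: overlap with the first year's ID vector
res[, stayed2 := sapply(ids, function(cur) sum(cur %in% ids[[1]])), by = CTY]
res[YEAR == min(YEAR), stayed2 := NA_integer_]

# merge the counts back onto the row-level data
out <- res[, !"ids"][dt, on = .(CTY, YEAR)]
```

For the example data this reproduces the stayed/stayed2 values shown above (NA, 3, 2 and NA, 3, 1 per county).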

Get the group index in a data.table

I have the following data.table:
library(data.table)
DT <- data.table(a = c(1,2,3,4,5,6,7,8,9,10), b = c('A','A','A','B','B', 'C', 'C', 'C', 'D', 'D'), c = c(1,1,1,1,1,2,2,2,2,2))
> DT
a b c
1: 1 A 1
2: 2 A 1
3: 3 A 1
4: 4 B 1
5: 5 B 1
6: 6 C 2
7: 7 C 2
8: 8 C 2
9: 9 D 2
10: 10 D 2
I want to add a column that shows the index grouped by c (restarting from 1 for each group in column c), but that only changes when the value of b changes. The desired result is the col column shown in the outputs below.
Here are two ways to do this :
Using rleid :
library(data.table)
DT[, col := rleid(b), c]
With match + unique :
DT[, col := match(b, unique(b)), c]
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
We can use factor with the levels specified and coerce it to integer:
library(data.table)
DT[, col := as.integer(factor(b, levels = unique(b))), c]
Output:
DT
# a b c col
# 1: 1 A 1 1
# 2: 2 A 1 1
# 3: 3 A 1 1
# 4: 4 B 1 2
# 5: 5 B 1 2
# 6: 6 C 2 1
# 7: 7 C 2 1
# 8: 8 C 2 1
# 9: 9 D 2 2
#10: 10 D 2 2
Or using base R with rle
with(DT, as.integer(ave(b, c, FUN = function(x)
with(rle(x), rep(seq_along(values), lengths)))))
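For comparison, the same run index can be sketched in dplyr with lag() and cumsum(), assuming (as in the example) that rows are already ordered within each c group:

```r
library(dplyr)

DT <- data.frame(a = 1:10,
                 b = c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D', 'D'),
                 c = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

res <- DT %>%
  group_by(c) %>%
  # increment the counter each time b differs from the previous row's b;
  # lag(default = first(b)) makes the first row of each group compare equal
  mutate(col = cumsum(b != lag(b, default = first(b))) + 1) %>%
  ungroup()
# col is 1 1 1 2 2 within c == 1 and 1 1 1 2 2 within c == 2
```

Unlike match(b, unique(b)), this cumsum-of-changes version increments on every change of b, so a value of b that reappears later in the same group would start a new index, matching rleid() semantics.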

R find intervals in data.table

I want to add a new column with intervals or breakpoints by group. As an example:
This is my data.table:
x <- data.table(a = c(1:8,1:8), b = c(rep("A",8),rep("B",8)))
I already have the breakpoints (row indices):
pos <- data.table(b = c("A","A","B","B"), bp = c(3,5,2,4))
Here I can find the intervals for group "A" with:
findInterval(1:nrow(x[b=="A"]), pos[b=="A"]$bp)
How can I do this for each group, in this case "A" and "B"?
An option is to split both data sets by the 'b' column, use Map to loop over the corresponding list elements, and apply findInterval:
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
    split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
Another option is to group by 'b' in 'x', then apply findInterval, subsetting 'bp' from 'pos' with a logical condition built from .BY:
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
Another option uses a rolling join in data.table:
pos[, ri := rowid(b)]
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
We can nest the pos data into a list column by b, join with x, and use findInterval to get the corresponding groups:
library(dplyr)
pos %>%
  tidyr::nest(data = bp) %>%
  right_join(x, by = 'b') %>%
  group_by(b) %>%
  mutate(interval = findInterval(a, data[[1]][[1]])) %>%
  select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2

Index the first and the last rows with NA in a dataframe

I have a large data set that contains many NAs. I want to find the rows where the first NA and the last NA appear. For example, for column A, I want the output to be the second row (the last NA before a number) and the fifth row (the first NA after a number). My code, shown below, does not work:
nonnaindex <- which(!is.na(df))
firstnonna <- apply(nonnaindex, 2, min)
Data:
ID A B C
1 NA NA 3
2 NA 2 2
3 3 3 1
4 4 5 NA
5 NA 6 NA
I believe this function might be what you are looking for:
first_and_last_na_row <- function(DT, col) {
  library(data.table)
  data.table(DT)[, grp := rleid(is.na(get(col)))][
    , rbind(last(.SD[is.na(get(col)) & grp == min(grp)]),
            first(.SD[is.na(get(col)) & grp == max(grp)]))][
    !is.na(ID)][, grp := NULL][]
}
which returns
first_and_last_na_row(DT, "A")
ID A B C
1: 2 NA 2 2
2: 5 NA 6 NA
first_and_last_na_row(DT, "B")
ID A B C
1: 1 NA NA 3
first_and_last_na_row(DT, "C")
ID A B C
1: 4 4 5 NA
first_and_last_na_row(DT, "D")
Empty data.table (0 rows) of 4 cols: ID,A,B,C
in case of
DT
ID A B C
1: 1 NA NA 3
2: 2 NA 2 2
3: 3 3 3 1
4: 4 4 5 NA
5: 5 NA 6 NA
or
first_and_last_na_row(DT2, "D")
ID A B C D
1: 1 NA NA 3 NA
in case of Akrun's (simplified) example
DT2
ID A B C D
1: 1 NA NA 3 NA
2: 2 NA 2 2 2
3: 3 3 3 1 NA
4: 4 4 5 NA NA
5: 5 NA 6 NA 4
Edit: Faster version using melt()
The OP has commented that the production data set consists of 4000 columns and 192 rows, and that the indices are needed to clean another data set. A for loop across all columns proved very slow.
Therefore, I suggest reshaping the data set from wide to long format and using data.table's efficient grouping mechanism:
# reshape from wide to long format
long <- setDT(DT2)[, melt(.SD, id = "ID")][
  # add a grouping variable to distinguish continuous streaks of NA/non-NA
  # values within each variable
  , grp := rleid(variable, is.na(value))][
  # set sort order just for convenience, not essential
  , setorder(.SD, variable, ID)]
long
ID variable value grp
1: 1 A NA 1
2: 2 A NA 1
3: 3 A 3 2
4: 4 A 4 2
5: 5 A NA 3
6: 1 B NA 4
7: 2 B 2 5
8: 3 B 3 5
9: 4 B 5 5
10: 5 B 6 5
11: 1 C 3 6
12: 2 C 2 6
13: 3 C 1 6
14: 4 C NA 7
15: 5 C NA 7
16: 1 D NA 8
17: 2 D 2 9
18: 3 D NA 10
19: 4 D NA 10
20: 5 D 4 11
Now we get the indices of the starting and ending NA sequences for each variable (if any):
# starting NA sequence
long[, .(ID = which(is.na(value) & grp == min(grp))), by = variable]
variable ID
1: A 1
2: A 2
3: B 1
4: D 1
# ending NA sequence
long[, .(ID = which(is.na(value) & grp == max(grp))), by = variable]
variable ID
1: A 5
2: C 4
3: C 5
Note that this returns all indices of the starting and ending NA sequences, which may be more convenient for subsequent cleaning of another data set. If only the last and first indices are required, this can be achieved with:
long[long[, is.na(value) & grp == min(grp), by = variable]$V1, .(ID = max(ID)), by = variable]
variable ID
1: A 2
2: B 1
3: D 1
long[long[, is.na(value) & grp == max(grp), by = variable]$V1, .(ID = min(ID)), by = variable]
variable ID
1: A 5
2: C 4
I have tested this approach using a dummy data set of 192 rows times 4000 columns. The whole operation needed less than one second.
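For completeness, here is a small hand-rolled base R helper (my own sketch, not part of the answer above) that returns, for one column, the last row of a leading NA run and the first row of a trailing NA run, with NA_integer_ when the column does not start or end with NA:

```r
na_run_bounds <- function(v) {
  n <- length(v)
  # last index of the leading NA run: positions before the first non-NA value
  last_leading <- if (is.na(v[1])) max(which(cumsum(!is.na(v)) == 0)) else NA_integer_
  # first index of the trailing NA run: mirror the same logic on the reversed vector
  first_trailing <- if (is.na(v[n])) n - max(which(cumsum(!is.na(rev(v))) == 0)) + 1 else NA_integer_
  c(last_leading = last_leading, first_trailing = first_trailing)
}

na_run_bounds(c(NA, NA, 3, 4, NA))  # column A from the question
# -> last_leading = 2, first_trailing = 5
```

Applied column-wise on a plain data frame, sapply(df[, -1], na_run_bounds) gives the per-column indices in one matrix (on a data.table, note that DT[-1] would drop a row, not a column).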

r - data.table join and then add all columns from one table to another

My question is essentially the same as this question: data.table join then add columns to existing data.frame without re-copy.
Basically I have a template with keys and I want to assign columns from other data.tables to the template by the same keys.
> template
id1 id2
1: a 1
2: a 2
3: a 3
4: a 4
5: a 5
6: b 1
7: b 2
8: b 3
9: b 4
10: b 5
> x
id1 id2 value
1: a 2 0.01649728
2: a 3 -0.27918482
3: b 3 0.86933718
> y
id1 id2 value
1: a 4 -1.163439
2: b 4 2.267872
3: b 5 1.083258
> template[x, value := i.value]
> template[y, value := i.value]
> template
id1 id2 value
1: a 1 NA
2: a 2 0.01649728
3: a 3 -0.27918482
4: a 4 -1.16343917
5: a 5 NA
6: b 1 NA
7: b 2 NA
8: b 3 0.86933718
9: b 4 2.26787248
10: b 5 1.08325793
>
But if x and y have, say, 100 columns, then it is not feasible to write out the value := i.value syntax for every column. Is there a way to do the same thing for all the columns in x and y?
EDIT:
If I do y[x[template]], it creates separate value columns, which is not what I intend:
> y[x[template]]
id1 id2 value value.1
1: a 1 NA NA
2: a 2 NA 0.01649728
3: a 3 NA -0.27918482
4: a 4 -1.163439 NA
5: a 5 NA NA
6: b 1 NA NA
7: b 2 NA NA
8: b 3 NA 0.86933718
9: b 4 2.267872 NA
10: b 5 1.083258 NA
>
Just create a function that takes the column names as an argument and constructs the expression for you, then eval it each time, passing the names from each data.table you need. Here's an illustration:
get_expr <- function(x) {
  # 'x' is the vector of column names
  expr <- paste0("i.", x)
  expr <- lapply(expr, as.name)
  setattr(expr, 'names', x)
  as.call(c(quote(`:=`), expr))
}
> get_expr('value') ## generates the required expression
# `:=`(value = i.value)
template[x, eval(get_expr("value"))]
template[y, eval(get_expr("value"))]
# id1 id2 value
# 1: a 1 NA
# 2: a 2 0.01649728
# 3: a 3 -0.27918482
# 4: a 4 -1.16343900
# 5: a 5 NA
# 6: b 1 NA
# 7: b 2 NA
# 8: b 3 0.86933718
# 9: b 4 2.26787200
# 10: b 5 1.08325800
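A sketch of an alternative that skips building the call by hand: compute the non-key column names and assign them all in one := via mget() of the prefixed i. columns. This assumes, as in the example, that the join keys are id1 and id2; the toy values below are illustrative:

```r
library(data.table)

template <- data.table(id1 = rep(c("a", "b"), each = 5), id2 = rep(1:5, 2))
x <- data.table(id1 = c("a", "a", "b"), id2 = c(2L, 3L, 3L),
                value = c(0.016, -0.279, 0.869))

# every column of x except the join keys -- works for 1 or 100 value columns
cols <- setdiff(names(x), c("id1", "id2"))
template[x, (cols) := mget(paste0("i.", cols)), on = .(id1, id2)]
```

The (cols) on the left of := makes data.table treat cols as a vector of target names, and mget() fetches the matching i.-prefixed columns from the joined table.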
