Find previous occurrence of a value and get value in opposite column - r

I have data like this:
   stim1 stim2
1:     2     3
2:     1     3
3:     2     1
4:     1     2
5:     3     1
structure(list(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L)),
row.names = c(NA, -5L), class = c("data.table", "data.frame"))
My objective is to add two columns: one for 'stim1' and one for 'stim2'. For each row of both columns, I want to find the previous occurrence of its value, in either column, and then grab the value in the opposite column.
For example, on row 3 'stim1' is 2. The previous occurrence of 2 is in 'stim1' on row 1. The value in the other column of that row is 3. So Prev1[3] is 3.
Another example: On row 4 'stim1' is 1. The previous occurrence of 1 is in 'stim2' on row 3. The value in the other column on that row is 2. So Prev1[4] is 2.
Desired output:
   stim1 stim2 Prev1 Prev2
1:     2     3    NA    NA
2:     1     3    NA     2
3:     2     1     3     3
4:     1     2     2     1
5:     3     1     1     2

The tricky thing is that the OP wants to find the previous occurrence of a value in either column.
Therefore, the idea is to reshape the data into long format and to find the matching rows by aggregating in a non-equi self join.
library(data.table)
long <- melt(DT, measure.vars = patterns("^stim"), value.name = "stim")[
, rn := rowid(variable)][
, opposite := rev(stim), keyby = rn][]
long[, prev := long[long, on = c("stim", "rn < rn"),
.(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI]$V2][]
dcast(long, rn ~ rowid(rn), value.var = c("stim", "prev"))
   rn stim_1 stim_2 prev_1 prev_2
1:  1      2      3     NA     NA
2:  2      1      3     NA      2
3:  3      2      1      3      3
4:  4      1      2      2      1
5:  5      3      1      1      2
Explanation
Reshape DT to long format.
Create an additional column rn which identifies the row numbers in the original dataset DT using rowid(variable).
Create an additional column opposite which contains the values of the opposite column. In long format this means to reverse the order of values within each rn group.
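After these two steps, long contains one row per (original row, column) pair. Ordered by rn, it should look like this for the sample data:
    variable stim rn opposite
 1:    stim1    2  1        3
 2:    stim2    3  1        2
 3:    stim1    1  2        3
 4:    stim2    3  2        1
 5:    stim1    2  3        1
 6:    stim2    1  3        2
 7:    stim1    1  4        2
 8:    stim2    2  4        1
 9:    stim1    3  5        1
10:    stim2    1  5        3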
Now, join long with itself. The non-equi join condition is looking for all occurrences of the current stim value in rows before the current row. As there might be more than one match, aggregating by max(rn) within the .EACHI groups picks the row number of the previous occurrence of the value as well as the corresponding opposite value. So,
long[long, on = c("stim", "rn < rn"), .(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI]
returns
    stim rn V1 V2
 1:    2  1 NA NA
 2:    3  1 NA NA
 3:    1  2 NA NA
 4:    3  2  1  2
 5:    2  3  1  3
 6:    1  3  2  3
 7:    1  4  3  2
 8:    2  4  3  1
 9:    3  5  2  1
10:    1  5  4  2
Create an additional column prev in long which contains the previous opposite value V2.
Finally, reshape long back to wide format, using both measure columns stim and prev.
Edit: Alternative solution
In case DT contains more columns than just stim1 and stim2, DT can alternatively be updated by reference:
long <- melt(DT, measure.vars = patterns("^stim"), value.name = "stim")[
, rn := rowid(variable)][
, opposite := rev(stim), keyby = rn][]
DT[, c("prev1", "prev2") := dcast(
long[long, on = c("stim", "rn < rn"),
.(max(x.rn), x.opposite[which.max(x.rn)]), by = .EACHI],
rn ~ rowid(rn), value.var = "V2")[, rn := NULL]][]
   stim1 stim2 prev1 prev2
1:     2     3    NA    NA
2:     1     3    NA     2
3:     2     1     3     3
4:     1     2     2     1
5:     3     1     1     2
Data
library(data.table)
DT <- data.table(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L))

A quick helper function to iterate through the data:
func <- function(mtx) {
na <- mtx[1][NA]
c(NA, sapply(seq_len(nrow(mtx))[-1], function(ind) {
v <- mtx[ind,1] ; s <- seq_len(ind-1)
m <- cbind(v == mtx[s,1], v == mtx[s,2])
if (any(m)) {
m <- which(m, arr.ind = TRUE)
row <- which.max(m[,1])
mtx[m[row,1], m[row,2] %% 2 + 1]
} else na
}))
}
Demonstration (dat is the sample data.table, i.e. DT from above):
dat[, Prev1 := func(cbind(stim1, stim2)) ][, Prev2 := func(cbind(stim2, stim1)) ]
#    stim1 stim2 Prev1 Prev2
#    <int> <int> <int> <int>
# 1:     2     3    NA    NA
# 2:     1     3    NA     2
# 3:     2     1     3     3
# 4:     1     2     2     1
# 5:     3     1     1     2
Alternatively, using zoo::rollapplyr:
func2 <- function(mtx) {
na <- mtx[1][NA]
if (!is.matrix(mtx)) return(na) # we're on the first row
v <- mtx[nrow(mtx),1] ; s <- seq_len(nrow(mtx)-1)
m <- cbind(v == mtx[s,1], v == mtx[s,2])
if (any(m)) {
m <- which(m, arr.ind = TRUE)
row <- which.max(m[,1])
mtx[m[row,1], m[row,2] %% 2 + 1]
} else na
}
dat[, Prev1 := zoo::rollapplyr(.SD, .N, FUN = func2, by.column = FALSE, partial = TRUE),
.SDcols = c("stim1", "stim2")
][, Prev2 := zoo::rollapplyr(.SD, .N, FUN = func2, by.column = FALSE, partial = TRUE),
.SDcols = c("stim2", "stim1") ]
It's not shorter, and in fact is slower (with a 5-row dataset), but if you prefer to think of this in a rolling fashion, this produces the same results. (It's possible the newer slider package might be clearer, faster, or neither compared with this.)
Note:
I assign na as a class-specific NA (there are at least six types of NA). I do this defensively: if there is at least one match, then the remainder of the NA values will be coerced into the correct class; however, if there are no matches, then the class returned by func will be logical which may not be the same as the original data, and data.table will complain.
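For illustration (a quick check of the trick, not from the original answer): subscripting a length-one vector with NA returns an NA of that vector's type, which is exactly what mtx[1][NA] exploits:
typeof(c(1L, 2L)[1][NA])   # "integer"   -> NA_integer_
typeof(c(1.5, 2)[1][NA])   # "double"    -> NA_real_
typeof(letters[1][NA])     # "character" -> NA_character_
typeof(TRUE[NA])           # "logical"   -> plain NA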

Just in case there was interest in a tidyverse approach (probably not as fast!). You could add row numbers to your data.frame, and put it into long form. Then, grouping by each value, get the previous row number and stim to reference in Prev. With a left_join you can obtain the appropriate value for Prev.
library(tidyverse)
df <- mutate(as.data.frame(df), rn = row_number())
df_long <- pivot_longer(df,
cols = -rn,
names_to = "stim",
names_pattern = "stim(\\d+)",
names_transform = list(stim = as.numeric))
df_long %>%
group_by(value) %>%
mutate(match_rn = lag(rn), match_stim = 3 - lag(stim)) %>%
left_join(df_long, by = c("match_rn" = "rn", "match_stim" = "stim")) %>%
pivot_wider(id_cols = rn,
names_from = stim,
values_from = value.y,
names_prefix = "Prev") %>%
right_join(df) %>%
arrange(rn)
Output
     rn Prev1 Prev2 stim1 stim2
  <int> <int> <int> <int> <int>
1     1    NA    NA     2     3
2     2    NA     2     1     3
3     3     3     3     2     1
4     4     2     1     1     2
5     5     1     2     3     1

Here is another option:
setDT(DT)[, rn := .I]
dt1 <- DT[DT, on=.(stim1, rn<rn), mult="last", .(x.rn, v=x.stim2)]
dt2 <- DT[DT, on=.(stim2=stim1, rn<rn), mult="last", .(x.rn, v=x.stim1)]
DT[, Prev1 := fcoalesce(fifelse(dt1$x.rn > dt2$x.rn, dt1$v, dt2$v), dt1$v, dt2$v)]
# not everything has been flipped symmetrically; this seems to work for this minimal example -- please let me know if there are cases where Prev2 is wrong
dt1 <- DT[DT, on=.(stim2, rn<rn), mult="last", .(x.rn, v=x.stim1)]
dt2 <- DT[DT, on=.(stim1=stim2, rn<rn), mult="last", .(x.rn, v=x.stim2)]
DT[, Prev2 := fcoalesce(fifelse(dt1$x.rn > dt2$x.rn, dt1$v, dt2$v), dt1$v, dt2$v)]
output:
   stim1 stim2 rn Prev1 Prev2
1:     2     3  1    NA    NA
2:     1     3  2    NA     2
3:     2     1  3     3     3
4:     1     2  4     2     1
5:     3     1  5     1     2
data:
library(data.table)
DT <- structure(list(stim1 = c(2L, 1L, 2L, 1L, 3L),
stim2 = c(3L, 3L, 1L, 2L, 1L)),
row.names = c(NA, -5L), class = c("data.table", "data.frame"))

Related

Grouped recurrence by periods over a data.table

I have a dataset with names, dates, and several categorical columns. Let's say
data <- data.table(name = c('Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Cal', 'Anne', 'Ben', 'Ben', 'Ben', 'Cal'),
period = c(1,1,1,1,1,1,2,2,2,3,3),
category = c("A","A","A","B","B","B","A","B","A","A","B"))
Which looks like this:
name period category
Anne 1 A
Ben 1 A
Cal 1 A
Anne 1 B
Ben 1 B
Cal 1 B
Anne 2 A
Ben 2 B
Ben 2 A
Ben 3 A
Cal 3 B
I want to compute, for each period, how many names were present in the past period, for every group of my categorical variables. The output should be as follows:
period category recurrence_count
2 A 2 # due to Anne and Ben being on A, period 1
2 B 1 # due to Ben being on B, period 1
3 A 1 # due to Ben being on A, period 2
3 B 0 # no match from B, period 2
I am aware of the .I and .GRP operators in data.table, but I have no idea how to write the notion of 'next group' in the j entry of my statement. I imagine something like this might be a reasonable path, but I can't figure out the correct syntax:
data[, .(recurrence_count = length(intersect(name, name[last(.GRP)]))), by = .(category, period)]
You can first summarize your data by category and period.
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
Join your summary with your original data.
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]
Now count how many of the names appear in the summarized previous-period names:
data[, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
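Putting the three steps together on the sample data (a self-contained sketch; the join carries category along with the shifted period, and period-1 rows, where recurrence is undefined, are filtered out):
library(data.table)
data <- data.table(
  name = c('Anne','Ben','Cal','Anne','Ben','Cal','Anne','Ben','Ben','Ben','Cal'),
  period = c(1,1,1,1,1,1,2,2,2,3,3),
  category = c("A","A","A","B","B","B","A","B","A","A","B"))
previous_period_names <- data[, .(names = list(name)), .(category, period)]
previous_period_names[, next_period := period + 1]
data[previous_period_names, names := i.names, on = c('category', 'period==next_period')]
data[period > 1, .(recurrence_count = sum(name %in% unlist(names))), by = .(period, category)]
#    period category recurrence_count
# 1:      2        A                2
# 2:      2        B                1
# 3:      3        A                1
# 4:      3        B                0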
Another data.table alternative. For rows that can have a previous period (period != 1), create such a variable (prev_period := period - 1).
Join the original data with the subset that has values for 'prev_period' (data[data[!is.na(prev_period)]]). Join on 'category', 'period = prev_period' and 'name'.
In the resulting data set, for each 'period' and 'category' (by = .(period = i.period, category)), count the number of names from the original data (x.name) that had a match with the previous period (length(na.omit(x.name))).
data[period != 1, prev_period := period - 1]
data[data[!is.na(prev_period)], on = c("category", period = "prev_period", "name"),
.(category, i.period, x.name)][
, .(n = length(na.omit(x.name))), by = .(period = i.period, category)]
#    period category n
# 1:      2        A 2
# 2:      2        B 1
# 3:      3        A 1
# 4:      3        B 0
One option in base R is to split 'data' by 'category', then loop over the list (lapply) and apply Reduce with intersect on 'name' split by 'period', with accumulate = TRUE. Take the lengths of the accumulated intersections, create a data.frame with the unique elements of 'period', use Map to add the 'category' from the names of the list output, and rbind the list of data.frames into a single dataset:
library(data.table)
lst1 <- lapply(split(data, data$category), function(x)
data.frame(period = unique(x$period)[-1],
recurrence_count = lengths(Reduce(intersect,
split(x$name, x$period), accumulate = TRUE)[-1])))
rbindlist(Map(cbind, category = names(lst1), lst1))[
order(period), .(period, category, recurrence_count)]
# period category recurrence_count
#1: 2 A 2
#2: 2 B 1
#3: 3 A 1
#4: 3 B 0
Or, using the same logic within data.table: grouped by 'category', split 'name' by 'period' and apply Reduce with intersect
setDT(data)[, .(period = unique(period),
recurrence_count = lengths(Reduce(intersect,
split(name, period), accumulate = TRUE))), .(category)][duplicated(category)]
# category period recurrence_count
#1: A 2 2
#2: A 3 1
#3: B 2 1
#4: B 3 0
Or similar option in tidyverse
library(dplyr)
library(purrr)
data %>%
group_by(category) %>%
summarise(recurrence_count = lengths(accumulate(split(name, period),
intersect)), period = unique(period), .groups = 'drop' ) %>%
filter(duplicated(category))
# A tibble: 4 x 3
# category recurrence_count period
# <chr> <int> <int>
#1 A 2 2
#2 A 1 3
#3 B 1 2
#4 B 0 3
data
data <- structure(list(name = c("Anne", "Ben", "Cal", "Anne", "Ben",
"Cal", "Anne", "Ben", "Ben", "Ben", "Cal"), period = c(1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), category = c("A", "A", "A",
"B", "B", "B", "A", "B", "A", "A", "B")), class = "data.frame",
row.names = c(NA,
-11L))
A data.table option
setDT(df)[
,
{
u <- split(name, period)
data.table(
period = unique(period)[-1],
recurrence_count = lengths(
Map(
intersect,
head(u, -1),
tail(u, -1)
)
)
)
},
category
]
gives
category period recurrence_count
1: A 2 2
2: A 3 1
3: B 2 1
4: B 3 0

apply two functions by row condition data.table

I have the following data.table
df <- data.table(
id = c(rep(1,6),rep(2,6),rep(3,6)),
grp = c(rep("x",6),rep("y",6),rep("z",6)),
val1 = 1:18,
val2 = 13:30
)
I want to apply two different functions by row condition,
for example:
cols <- paste0("val",1:2)
df[id == 1,lapply(.SD, function (x) tail(x,2)),.SDcols = cols,by = list(id,grp)]
df[id != 1,lapply(.SD, function (x) tail(x,3)),.SDcols = cols,by = list(id,grp)]
I'm quite new to working with data.table, so maybe there is a more efficient way than carrying out separate calculations and then joining the two tables above.
If the conditions are disjunct, i.e., id == 1 and id != 1, and if id is also one of the grouping variables (in the by = clause), two different functions can be applied by
df[, lapply(.SD, function (x) if (first(id) == 1) tail(x, 2) else tail(x, 3)),
.SDcols = cols, by = .(id, grp)]
id grp val1 val2
1: 1 x 5 17
2: 1 x 6 18
3: 2 y 10 22
4: 2 y 11 23
5: 2 y 12 24
6: 3 z 16 28
7: 3 z 17 29
8: 3 z 18 30
So, subsetting is not by row but by grouping variable, and it has been moved into the anonymous function definition within lapply(). This avoids having to rbind() the subsets afterwards.
For the sake of completeness, in the particular case of the tail() function being called with different parameters we can write more concisely
df[, lapply(.SD, tail, n = fifelse(first(id) == 1, 2, 3)),
.SDcols = cols, by = .(id, grp)]
Here is another option:
df[.N:1L, ri := rowid(id, grp)]
rbindlist(list(
df[id == 1L & ri <= 2L], #for the first, df[id == 1L, tail(.SD, 2L), .(id, grp), .SDcols = cols]
df[id != 1L & ri <= 3L] #and for df[id != 1, tail(.SD, 3L), .(id, grp), .SDcols = cols]
))
output:
id grp val1 val2 ri
1: 1 x 5 17 2
2: 1 x 6 18 1
3: 2 y 10 22 3
4: 2 y 11 23 2
5: 2 y 12 24 1
6: 3 z 16 28 3
7: 3 z 17 29 2
8: 3 z 18 30 1
Would be interested to know the size of your dataset and the speedup.
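For a rough idea, one could scale the example up and time both approaches; a sketch (assuming larger but similarly shaped data; timings will vary by machine):
library(data.table)
# build a larger version of the example: 10,000 (id, grp) groups of 6 rows each
n_id <- 1e4
big <- data.table(id   = rep(seq_len(n_id), each = 6),
                  grp  = rep(sprintf("g%05d", seq_len(n_id)), each = 6),
                  val1 = seq_len(6 * n_id),
                  val2 = seq_len(6 * n_id) + 12L)
cols <- paste0("val", 1:2)
# grouped if/else inside j
system.time(big[, lapply(.SD, tail, n = fifelse(first(id) == 1, 2, 3)),
                .SDcols = cols, by = .(id, grp)])
# reverse rowid, filter, rbind
system.time({
  big[.N:1L, ri := rowid(id, grp)]
  rbindlist(list(big[id == 1L & ri <= 2L], big[id != 1L & ri <= 3L]))
})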

Selecting the row subgroup with the max value of a variable [duplicate]

This question already has answers here:
Select the row with the maximum value in each group
(19 answers)
Closed 2 years ago.
so I have a data frame:
ID YearMon Var Count
 1  012007   H     1
 1  012007   D     2
 1  022007        NA
 1  032007   H     1
 2  012007   H     1
 2  022007        NA
 2  022007   D     1
 2  032007        NA
How would I go about getting just the max value for each unique ID for a certain YearMon? Ideally it would return:
1 012007 D 2
1 022007 NA
1 032007 H 1
2 012007 H 1
2 022007 D 1
2 032007 NA
Using plyr, this is easily achieved: group by ID and YearMon, and return the max Count along with the corresponding Var in a data frame.
library(plyr)
ddply(dat1, .(ID, YearMon), function(x) {
  Count <- max(x$Count)
  data.frame(Count = Count, Var = x[x$Count == Count, "Var"])
})
In order to return all columns:
df[is.na(df$Count), "Count"] <- -9999
df2 <- ddply(df, .(ID, YearMon), function(x) {
  Count <- max(x$Count)
  index <- which(x$Count == Count)
  data.frame(x[index, ])
})
df2[df2$Count == -9999, "Count"] <- NA
This will return your indexing values back to NA as well.
Using data.table, if you have a data table called dt, you can first calculate the max of Count by group, and then just keep the rows where Count is equal to the max for that group. Using na.rm = TRUE stops a single NA from turning a group's maximum into NA (groups where Count is entirely NA still drop out here; the if/else option further below keeps them):
newdt <- dt[, max.count := max(Count, na.rm = TRUE), by = .(ID, YearMon)][
  Count == max.count, .(ID, YearMon, Var, Count)]
library(dplyr)
dt %>%
group_by(ID, YearMon) %>%
slice(Count %>% which.max)
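Equivalently, and a little more direct (a variant, not from the original answer; note that, like the max.count approach above, this drops groups whose Count is all NA, since which.max() returns a zero-length index there):
dt %>%
  group_by(ID, YearMon) %>%
  slice(which.max(Count))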
Let's not forget about aggregate!
##### Clean up the data: the grouping variables need to be factors and the data numeric #####
dat1$Var.[dat1$Var. == 1] <- ""
dat1$Count. <- as.numeric(dat1$Count.)
dat1$ID. <- as.factor(dat1$ID.)
dat1$YearMon. <- as.factor(dat1$YearMon.)
dat1 <- dat1[, -3] ### Let's get rid of the Var column as you're not using it.
aggregate(. ~ ID. + YearMon., data = dat1, FUN = max) #### Use aggregate. Simple and short code
ID. YearMon. Count.
1 1 12007 2
2 2 12007 1
3 2 22007 1
4 1 32007 1
Another data.table option using if/else. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID' and 'YearMon', if all of the 'Count' values are NA we return the subset of the data.table (.SD); otherwise we get the index of the maximum value of 'Count' and subset the data.table with it (.SD[which.max(Count)]).
library(data.table)
setDT(df1)[, if(all(is.na(Count))) .SD else .SD[which.max(Count)],.(ID, YearMon)]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
Or another option would be to concatenate the index from which.max with the rows that are all NA for 'Count', grouped by the same variables; then get the row index (.I) and use that to subset the 'data.table'.
setDT(df1)[df1[, .I[c(which.max(Count), all(is.na(Count)))], .(ID, YearMon)]$V1]
# ID YearMon Var Count
#1: 1 12007 D 2
#2: 1 22007 NA
#3: 1 32007 H 1
#4: 2 12007 H 1
#5: 2 22007 D 1
#6: 2 32007 NA
Or we replace the NA by a very small number, use which.max and subset
setDT(df1)[, .SD[which.max(replace(Count, is.na(Count),-Inf ))], .(ID, YearMon)]
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
YearMon = c(12007L,
12007L, 22007L, 32007L, 12007L, 22007L, 22007L, 32007L), Var = c("H",
"D", "", "H", "H", "", "D", ""), Count = c(1L, 2L, NA, 1L, 1L,
NA, 1L, NA)), .Names = c("ID", "YearMon", "Var", "Count"),
class = "data.frame", row.names = c(NA, -8L))

Rearranging longitudinal data

I have a dataset that is roughly structured like this:
case Year 2001 2002 2003 2004
1 2003 0 0 0 3
2 2002 0 5 3 2
3 2001 3 3 2 2
I am trying to restructure it so that every column represents the first, second (etc.) year counting from the "Year" variable, i.e.:
case Year yr1 yr2 yr3 yr4
1 2003 0 3 0 0
2 2002 5 3 2 0
3 2001 3 3 2 2
This code downloads the dataset and tries the solution suggested by @akrun, but it fails.
library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")
df1[-(1:2)] <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  x <- df1[i, ]
  x1 <- unlist(x[-(1:2)])
  indx <- which(!is.na(x1))[1]
  i <- as.numeric(names(indx)) - x[, 2] + 1
  x2 <- x1[!is.na(x1)]
  x3 <- rep(NA, length(x1))
  x3[i:(i + length(x2) - 1)] <- x2
  x3
}))
This generates:
Error in i:(i + length(x2) - 1) : NA/NaN argument
In addition: Warning message:
In FUN(1:234[[1L]], ...) : NAs introduced by coercion
How can I transform the data so that every column represents the first, second (etc.) year counting from the value in the "Year" variable for each row?
Here's a possibility:
library(dplyr)
library(reshape2)
df %>%
melt(id.vars = c("case", "Year")) %>%
mutate(variable = as.numeric(as.character(variable)),
yr = variable - Year + 1) %>%
filter(variable >= Year) %>%
dcast(case + Year ~ yr, fill = 0)
# case Year 1 2 3 4
# 1 1 2003 0 3 0 0
# 2 2 2002 5 3 2 0
# 3 3 2001 3 3 2 2
Data:
df <- structure(list(case = 1:3, Year = c(2003L, 2002L, 2001L), `2001` = c(0L,
0L, 3L), `2002` = c(0L, 5L, 3L), `2003` = c(0L, 3L, 2L), `2004` = c(3L,
2L, 2L)), .Names = c("case", "Year", "2001", "2002", "2003",
"2004"), class = "data.frame", row.names = c(NA, -3L))
This should achieve the manipulation you are looking for.
library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")
temp <- df1[[1]]
library(dplyr); library(tidyr); library(stringi)
temp <- temp %>%
gather(new.Years, X, -Year) %>% # gather the year columns into long form
mutate(Year.temp=paste0(rownames(temp), "-", Year)) %>% # concatenate the Year with the row number to make it unique
mutate(new.Years = as.numeric(gsub("X", "", new.Years)), diff = new.Years-Year+1) %>% # calculate the difference to get yr1, yr2 and so on
mutate(diff=paste0("yr", stri_sub(paste0("0", (ifelse(diff>0, diff, 0))), -2, -1))) %>% # convert the differences into yr01, yr02, ...
select(-new.Years) %>% filter(diff != "yr00") %>% # drop the new.Years column and pre-Year rows
spread(diff, X) %>% # convert column to rows
select(-Year.temp) # Drop Year.temp column
temp[is.na(temp)] <- 0 # replace NA with 0
temp %>% View
Notice that this will work for up to 99 years.
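That limit comes from the zero-padding step, which keeps only the last two characters of the padded offset; for example:
library(stringi)
stri_sub(paste0("0", 7), -2, -1)    # "07"
stri_sub(paste0("0", 123), -2, -1)  # "23" -- a three-digit offset would be truncated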
Here's a data.table solution:
require(data.table)
require(reshape2)
dt.m = melt(dt, id = 1:2, variable.factor = FALSE)
dt.m[, variable := as.integer(variable)-Year+1L]
dcast.data.table(dt.m, case + Year ~ variable, fill=0L,
value.var = "value", subset = (variable > 0L))
# case Year 1 2 3 4
# 1: 1 2003 0 3 0 0
# 2: 2 2002 5 3 2 0
# 3: 3 2001 3 3 2 2
library("devtools")
df1 <- source_gist("b4c44aa67bfbcd6b72b9")$value
The column names contain an X, which I remove:
colnames(df1) <- gsub("X", "", colnames(df1))
I have got a solution without any additional packages:
startYear <- as.numeric(colnames(df1)[2])
shifts <- df1$Year - startYear
n <- ncol(df1)
df2 <- df1
colnames(df2)[-1] <- 1:(n-1)
df2[,2:n] <- NA
for(row in 1:nrow(df1)){
if(shifts[row]>=0){
df2[row,2:(n-shifts[row])] <- df1[row, (shifts[row]+2):n]
#df2[row,2:(n-shifts[row])] <- colnames(df1)[(shifts[row]+2):n]
}else{
df2[row, (-shifts[row]+2):n] <- df1[row, 2:(n+shifts[row])]
#df2[row, (-shifts[row]+2):n] <- colnames(df1)[2:(n+shifts[row])]
}
}
You can prefill df2 with 0 instead of NA, of course. To validate the permutation, uncomment the second line and comment out the first line in each branch of the if/else.
Hope it does what you wanted.

drop levels of factor for which there is one missing value for one column r

I would like to drop any occurrence of a factor level for which one row contains a missing value
Example:
ID var1 var2
1 1 2
1 NA 3
2 1 2
2 2 4
So, in this hypothetical, what would be left would be:
ID var1 var2
2 1 2
2 2 4
Here's a possible data.table solution (sorry @rawr):
library(data.table)
setDT(df)[, if (all(!is.na(.SD))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
If you only want to check var1 then
df[, if (all(!is.na(var1))) .SD, ID]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
Assuming that NAs would occur in both var columns,
df[with(df, !ave(!!rowSums(is.na(df[,-1])), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or if it is only specific to var1
df[with(df, !ave(is.na(var1), ID, FUN=any)),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(!is.na(var1)))
# ID var1 var2
#1 2 1 2
#2 2 2 4
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L), var1 = c(1L, NA, 1L, 2L
), var2 = c(2L, 3L, 2L, 4L)), .Names = c("ID", "var1", "var2"
), class = "data.frame", row.names = c(NA, -4L))
Here's one more option in base R. It will check all columns for NAs.
df[!df$ID %in% df$ID[rowSums(is.na(df)) > 0],]
# ID var1 var2
#3 2 1 2
#4 2 2 4
If you only want to check in column "var1" you can do:
df[!with(df, ID %in% ID[is.na(var1)]),]
# ID var1 var2
#3 2 1 2
#4 2 2 4
In the current development version of data.table, there's a new implementation of na.omit for data.tables, which takes cols = and invert = arguments.
The cols = argument allows specifying the columns in which to look for NAs, and invert = TRUE returns the NA rows instead of omitting them.
You can install the devel version by following these instructions. Or you can wait for 1.9.6 on CRAN at some point. Using that, we can do:
require(data.table) ## 1.9.5+
setkey(setDT(df), ID)
df[!na.omit(df, invert = TRUE)]
# ID var1 var2
# 1: 2 1 2
# 2: 2 2 4
How this works:
setDT converts data.frame to data.table by reference.
setkey sorts the data.table by the columns provided and marks those columns as key columns so that we can perform a join.
na.omit(df, invert = TRUE) gives just those rows that have NA anywhere.
X[!Y] does an anti-join by joining on the key column ID, and returns all the rows that don't match ID = 1 (from Y). Check this post to read in detail about data.table's joins.
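For the sample data, the inverted na.omit() call should return just the NA-containing row, and the anti-join then removes every row sharing its ID:
na.omit(df, invert = TRUE)
#    ID var1 var2
# 1:  1   NA    3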
HTH
