Removing rows in an R data.table with NAs in specific columns

I have a data.table with a large number of features. I would like to remove the rows where the values are NAs only for certain features.
Currently I am using the following to handle this:
data.joined.sample <- data.joined.sample %>%
  filter(!is.na(lat)) %>%
  filter(!is.na(long)) %>%
  filter(!is.na(temp)) %>%
  filter(!is.na(year)) %>%
  filter(!is.na(month)) %>%
  filter(!is.na(day)) %>%
  filter(!is.na(hour)) %>%
  .......
Is there a more concise way to achieve this?
str(data.joined.sample)
Classes ‘data.table’ and 'data.frame': 336776 obs. of 50 variables:

We can select those columns, build a logical vector of complete rows from them with complete.cases, and use it to drop the rows that contain NA in any of those columns:
data.joined.sample[complete.cases(data.joined.sample[colsofinterest]),]
where
colsofinterest <- c("lat", "long", "temp", "year", "month", "day", "hour")
Update
Based on the OP's comments, if it is a data.table, then subset the colsofinterest and use complete.cases
data.joined.sample[complete.cases(data.joined.sample[, colsofinterest, with = FALSE])]
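A more concise alternative, assuming a data.table version recent enough that its na.omit method supports the cols argument, is to let na.omit do the column-restricted filtering:

# drops only rows with NA in one of the columns of interest
data.joined.sample <- na.omit(data.joined.sample, cols = colsofinterest)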

data.table objects, if that is in fact what you're working with, have a somewhat different syntax for the "[" function. Look through this console session:
> DT = data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
> DT[x=="a"&y==1]
x y v
1: a 1 4
> is.na(DT[x=="a"&y==1]$v) <- TRUE # make one item NA
> DT[x=="a"&y==1]
x y v
1: a 1 NA
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT[complete.cases(DT)] # note no comma
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
> DT # But that didn't remove the NA from DT; it only returned a filtered copy
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 1 NA
5: a 3 5
6: a 6 6
7: c 1 7
8: c 3 8
9: c 6 9
> DT <- DT[complete.cases(DT)] # do this assignment to make permanent
> DT
x y v
1: b 1 1
2: b 3 2
3: b 6 3
4: a 3 5
5: a 6 6
6: c 1 7
7: c 3 8
8: c 6 9
Probably not the true "data.table way".
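For the record, a more idiomatic data.table spelling of this kind of filter, when the columns to check are known, would be along the lines of:

DT <- DT[!is.na(v)]   # keep rows where v is not NA
DT <- na.omit(DT)     # or: drop rows with NA in any column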

Related

R data.table group by continuous values

I need some help with grouping data by consecutive equal values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for every run of consecutive equal values in column a. For each group I need the first (also the minimum possible) value of column b and the last (also the maximum possible) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I cannot get this solved on my own.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
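For reference, rleid() generates a new group id every time the value changes, which is what makes the run-wise grouping above work:

> data.table::rleid(dt$a)
 [1] 1 1 1 2 2 2 2 3 3 4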
An option with dplyr
library(dplyr)
dt %>%
  group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
  summarise(across(a:b, first), c = last(c)) %>%
  select(-grp)
-output
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11

R find intervals in data.table

I want to add a new column with intervals or breakpoints by group. As an example:
This is my data.table:
x <- data.table(a = c(1:8,1:8), b = c(rep("A",8),rep("B",8)))
I already have the breakpoints (row indices):
pos <- data.table(b = c("A","A","B","B"), bp = c(3,5,2,4))
Here I can find the intervals for group "A" with:
findInterval(1:nrow(x[b=="A"]), pos[b=="A"]$bp)
How can I do this for each group, in this case "A" and "B"?
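As a quick reminder of the semantics, findInterval() returns, for each value, the number of breakpoints at or below it, so for group "A" above:

> findInterval(1:8, c(3, 5))
[1] 0 0 1 1 2 2 2 2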
An option is to split both datasets by the 'b' column, use Map to loop over the corresponding lists, and apply findInterval
Map(function(u, v) findInterval(seq_len(nrow(u)), v$bp),
    split(x, x$b), split(pos, pos$b))
#$A
#[1] 0 0 1 1 2 2 2 2
#$B
#[1] 0 1 1 2 2 2 2 2
or another option is to group by 'b' from 'x', then use findInterval by subsetting the 'bp' from 'pos' by filtering with a logical condition created based on .BY
x[, findInterval(seq_len(.N), pos$bp[pos$b==.BY]), b]
# b V1
# 1: A 0
# 2: A 0
# 3: A 1
# 4: A 1
# 5: A 2
# 6: A 2
# 7: A 2
# 8: A 2
# 9: B 0
#10: B 1
#11: B 1
#12: B 2
#13: B 2
#14: B 2
#15: B 2
#16: B 2
Another option is a rolling join in data.table: each row of x picks up the within-group index (ri) of the last breakpoint at or before its a value, and fcoalesce fills the rows before the first breakpoint with 0:
pos[, ri := rowid(b)]
x[, intvl := fcoalesce(pos[x, on=.(b, bp=a), roll=Inf, ri], 0L)]
output:
a b intvl
1: 1 A 0
2: 2 A 0
3: 3 A 1
4: 4 A 1
5: 5 A 2
6: 6 A 2
7: 7 A 2
8: 8 A 2
9: 1 B 0
10: 2 B 1
11: 3 B 1
12: 4 B 2
13: 5 B 2
14: 6 B 2
15: 7 B 2
16: 8 B 2
We can nest the pos data into a list column by b, join it with x, and use findInterval to get the corresponding groups.
library(dplyr)
pos %>%
  tidyr::nest(data = bp) %>%
  right_join(x, by = 'b') %>%
  group_by(b) %>%
  mutate(interval = findInterval(a, data[[1]][[1]])) %>%
  select(-data)
# b a interval
# <chr> <int> <int>
# 1 A 1 0
# 2 A 2 0
# 3 A 3 1
# 4 A 4 1
# 5 A 5 2
# 6 A 6 2
# 7 A 7 2
# 8 A 8 2
# 9 B 1 0
#10 B 2 1
#11 B 3 1
#12 B 4 2
#13 B 5 2
#14 B 6 2
#15 B 7 2
#16 B 8 2

Mixing by and .SDcols in data.table

I am trying to mix by and .SDcols in data.table CRAN 1.9.6 (also tested on the dev version from GitHub, so it is likely a misunderstanding on my part):
f = function(x){
  print(x)
  res = data.table(X = x, Y = x*x)
  return(res)
}
DT = data.table(x=1:4, y=rep(c('a','b'),2))
DT[,c('A','B'):=lapply(.SD,FUN=f),.SDcols='x',by=y]
I get:
[1] 1 3
Error in `[.data.table`(DT, , `:=`(c("A", "B"), lapply(.SD, FUN = f)), :
All items in j=list(...) should be atomic vectors or lists. If you are trying something like j=list(.SD,newcol=mean(colA)) then use := by group instead (much quicker), or cbind or merge afterwards.
I would expect
x y A B
1: 1 a 1 1
2: 2 b 2 4
3: 3 a 3 9
4: 4 b 4 16
I would have expected the by operation to take place and .SD to be replaced by the 'x' column. Could someone explain why I am wrong here?
All of the following work. As @Frank pinpointed, the problem was the extra level of list nesting introduced by lapply:
DT[,f(.SD[[1]]),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,lapply(.SD, f)[[1]],.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,rbindlist(lapply(.SD, f)),.SDcols='x',by=y]
y X Y
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,sapply(.SD, f),.SDcols='x',by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
DT[,mapply(FUN=f, mget('x')),by=y]
y V1 V2
1: a 1 1
2: a 3 9
3: b 2 4
4: b 4 16
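As a footnote, if the goal is the originally expected output with A and B added by reference, a sketch that sidesteps lapply entirely is to call f directly on the x column, so that its two-column result maps onto the two target columns:

# f returns a two-column data.table per group, which := assigns
# to A and B (f's print(x) side effect still fires once per group)
DT[, c('A','B') := f(x), by=y]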

Assigning a unique identification variable across repeated values

I will create a simple example of some dummy data:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
object <- c(1,1,2,1,1,1,1,2,3,1,1,1,2)
df1 <- data.frame(case, object)
Now for each unique case and object value, I want to create a corresponding unique numerical value (an identifier)
df1$UNIQ_ID <- ........
The end result should take the following values, c(1,1,2,3,3,4,4,5,6,7,7,8,9), with one id per unique case/object combination, as seen from:
unique(df1$object[df1$case=='a'])
unique(df1$object[df1$case=='b'])
I have thought of using dplyr and group_by(case).
We can use the .GRP from data.table after grouping by 'case' and 'object' on a data.table object (setDT(df1)).
library(data.table)
setDT(df1)[,UNIQ_ID:= .GRP ,.(case, object)]
df1
# case object UNIQ_ID
# 1: a 1 1
# 2: a 1 1
# 3: a 2 2
# 4: b 1 3
# 5: b 1 3
# 6: c 1 4
# 7: c 1 4
# 8: c 2 5
# 9: c 3 6
#10: d 1 7
#11: d 1 7
#12: e 1 8
#13: e 2 9
A base R option would be
grp <- interaction(df1)
as.numeric(factor(grp, levels= unique(grp)))
#[1] 1 1 2 3 3 4 4 5 6 7 7 8 9
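Since the OP mentioned dplyr: a rough equivalent there is cur_group_id(), though note it numbers groups in sorted group order rather than order of first appearance (the two happen to coincide for this data):

library(dplyr)
df1 %>%
  group_by(case, object) %>%
  mutate(UNIQ_ID = cur_group_id()) %>%
  ungroup()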

How to drop factors that have fewer than n members

Is there a way to drop factors that have fewer than N rows, like N = 5, from a data table?
Data:
DT = data.table(x=rep(c("a","b","c"),each=6), y=c(1,3,6), v=1:9,
                id=c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,4))
Goal: remove the rows of any id group with fewer than 5 rows. The variable "id" is the grouping variable: we need to determine which groups in DT have fewer than 5 members (groups "1" and "4") and then remove their rows. Expected output:
    x y v id
 1: a 3 5  2
 2: a 6 6  2
 3: b 1 7  2
 4: b 3 8  2
 5: b 6 9  2
 6: b 1 1  3
 7: b 3 2  3
 8: b 6 3  3
 9: c 1 4  3
10: c 3 5  3
11: c 6 6  3
Here's an approach....
Get the length of the factors, and the factors to keep
nFactors<-tapply(DT$id,DT$id,length)
keepFactors <- nFactors >= 5
Then identify the ids to keep, and keep those rows. This generates the desired results, but is there a better way?
idsToKeep <- as.numeric(names(keepFactors[which(keepFactors)]))
DT[DT$id %in% idsToKeep,]
Since you begin with a data.table, this first part uses data.table syntax.
EDIT: Thanks to Arun (comment) for helping me improve this data.table answer.
DT[DT[, .(I=.I[.N>=5L]), by=id]$I]
# x y v id
# 1: a 3 5 2
# 2: a 6 6 2
# 3: b 1 7 2
# 4: b 3 8 2
# 5: b 6 9 2
# 6: b 1 1 3
# 7: b 3 2 3
# 8: b 6 3 3
# 9: c 1 4 3
# 10: c 3 5 3
# 11: c 6 6 3
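Another common data.table idiom for the same filtering, at the cost of id being moved to the front (by columns print first), is:

DT[, if (.N >= 5) .SD, by = id]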
In base R you could use
df <- data.frame(DT)
tab <- table(df$id)
df[df$id %in% names(tab[tab >= 5]), ]
# x y v id
# 5 a 3 5 2
# 6 a 6 6 2
# 7 b 1 7 2
# 8 b 3 8 2
# 9 b 6 9 2
# 10 b 1 1 3
# 11 b 3 2 3
# 12 b 6 3 3
# 13 c 1 4 3
# 14 c 3 5 3
# 15 c 6 6 3
If using a data.table is not necessary, you can use dplyr:
library(dplyr)
data.frame(DT) %>%
  group_by(id) %>%
  filter(n() >= 5)
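For completeness, one more base R variant, using ave to compute group sizes row-wise:

df[ave(df$id, df$id, FUN = length) >= 5, ]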
