Select data.table columns based on condition, within by

I want to extract data.table columns if their contents fulfill a criterion, and I need a method that will work with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcols and whatever else I could think of.
Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.
dt1 <- data.table(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(1, 2, 1, 2, 1, 2),
  meas = c(452, 23, 555, 33, 322, 32),
  age  = c(30, 30, 54, 54, 20, 20),
  bw   = c(75, 75, 81, 81, 69, 70)
)
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)? The function call should return:
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious how to do this generally: say, columns that contain more than two values > 1000 within any combination of id and time (by=.(id,time)), or whatever the criterion might be.
Thanks!

How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?
Maybe something like:
f <- function(DT, byChar) {
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) == 1])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f(dt1, "id")
output:
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)?
Similarly. Note that this keeps only the columns that vary within every id, so bw (constant for ids 1 and 2, varying only for id 3) is dropped along with age:
f2 <- function(DT, byChar, k) {
  cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN) > k])), byChar]$V1)
  unique(DT[, c(byChar, cols), with = FALSE])
}
f2(dt1, "id", 1)
output:
id time meas
1: 1 1 452
2: 1 2 23
3: 2 1 555
4: 2 2 33
5: 3 1 322
6: 3 2 32
data:
library(data.table)
dt1 <- data.table(
  id   = c(1, 1, 2, 2, 3, 3),
  time = c(1, 2, 1, 2, 1, 2),
  meas = c(452, 23, 555, 33, 322, 32),
  age  = c(30, 30, 54, 54, 20, 20),
  bw   = c(75, 75, 81, 81, 69, 70)
)

This might also be an option:
1. Count unique values per column, by id (using data.table::uniqueN).
2. Check for which columns the sum of unique values (by group) equals the number of unique ids (using colSums).
3. Keep (or drop) only the wanted columns.
library(data.table)
ids <- uniqueN(dt1$id)
#no variation
dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE]
id age
1: 1 30
2: 1 30
3: 2 54
4: 2 54
5: 3 20
6: 3 20
#variation
dt1[, c(TRUE, !(colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids)), with = FALSE]
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
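A small follow-up (my note, not part of the original answer): the "no variation" result above still has one row per observation; wrapping it in unique() gives the one-row-per-id form the question asked for:
unique(dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE])
#    id age
# 1:  1  30
# 2:  2  54
# 3:  3  20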

Based on chinsoon12's suggestion, I managed to put something together. It takes four steps, and I'm not sure how efficient it is, but at least it does the job. To recap, this is the dataset:
dt1
id time meas age bw
1: 1 1 452 30 75
2: 1 2 23 30 75
3: 2 1 555 54 81
4: 2 2 33 54 81
5: 3 1 322 20 69
6: 3 2 32 20 70
I put this together to get the columns that are constant within "id" (only age):
cols.id <- "id"
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) == 1)], by = cols.id]
ifkeep <- dt2[, sapply(.SD, all), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id age
1: 1 30
2: 2 54
3: 3 20
And to get the columns that vary within any value of "id" (age is dropped):
cols.id <- "id"
## different from above: == 1 -> > 1
dt2 <- dt1[, .SD[, lapply(.SD, function(x) uniqueN(x) > 1)], by = cols.id]
## different from above: all -> any
ifkeep <- dt2[, sapply(.SD, any), .SDcols = !(cols.id)]
keep <- c(cols.id, setdiff(colnames(dt2), cols.id)[ifkeep])
unique(dt1[, keep, with = FALSE])
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
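The same pattern generalizes to any per-column predicate evaluated within groups. As a sketch (the helper selectByPred and its arguments are my own naming, not from the answers above): keep the grouping columns plus every column for which the predicate holds in at least one group.
library(data.table)

## Sketch: `pred` takes a column vector and returns a single logical.
## A column is kept if `pred` is TRUE for at least one group.
selectByPred <- function(DT, byCols, pred) {
  flags <- DT[, lapply(.SD, pred), by = byCols]          # one logical per column per group
  ifkeep <- flags[, sapply(.SD, any), .SDcols = !byCols]
  keep <- c(byCols, setdiff(names(flags), byCols)[ifkeep])
  DT[, ..keep]
}

## columns that vary within any id (keeps bw, unlike f2 above):
selectByPred(dt1, "id", function(x) uniqueN(x) > 1)

## the hypothetical criterion from the question -- more than two
## values > 1000 within some (id, time) combination:
selectByPred(dt1, c("id", "time"), function(x) sum(x > 1000) > 2)
For the "constant within every group" case, swap any for all and wrap the result in unique(), as in f above.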

Related

How to replace NAs with values from another column in data.table (Example given)?

DT is a data.table and I want to replace the NAs in the count column with the values from the visits column; Expected_DT is the desired result.
DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(NA,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 NA 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
This is what I want
Expected_DT<-data.table(name=c("x","x","x","x"),hour=1:4,count=c(14,45,56,78),visits=c(14,45,56,78))
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
A few options:
1) using fcoalesce (count first, so existing non-NA counts are kept):
DT[, count := fcoalesce(count, visits)]
2) using is.na:
DT[is.na(count), count := visits]
3) using fifelse:
DT[, count := fifelse(is.na(count), visits, count)]
4) using set and, per sindri_baldur's comment, [[ for faster indexing:
ix <- DT[is.na(count), which = TRUE]      # row indices of the NAs
set(DT, ix, "count", DT[["visits"]][ix])  # assign by reference
Solution using data.table:
DT[is.na(count), count:=visits]
DT
Returns:
name hour count visits
1: x 1 14 14
2: x 2 45 45
3: x 3 56 56
4: x 4 78 78
Some base R solutions:
using ifelse
DT <- within(DT, count <- ifelse(is.na(count), visits, count))
using rowSums
DT <- within(DT, count <- rowSums(cbind(is.na(count) * visits, count), na.rm = TRUE))
And here is a dplyr version, for completeness:
library(dplyr)
DT %>%
  mutate(count = if_else(is.na(count), visits, count))
name hour count visits
1 x 1 14 14
2 x 2 45 45
3 x 3 56 56
4 x 4 78 78

Subset specific row and last row from data frame

I have a data frame which contains scores from different events; there can be a number of scoring events for one game. What I would like to do is subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows, depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5; I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
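An equivalent base idiom for the last-row step (my note, not from the original answer): duplicated() with fromLast = TRUE flags every row whose ID appears again later, so negating it keeps each ID's final row.
# last row per ID without rle()
Data2 <- Data[!duplicated(Data$ID, fromLast = TRUE), ]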
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
   FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
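One side note (mine, not the answer's): top_n() is superseded in current dplyr; slice_max() is the modern equivalent for the last-row step:
# assumes dplyr >= 1.0.0; equivalent to top_n(1, Time)
lastrows <- Data %>% group_by(ID) %>% slice_max(Time, n = 1)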

How to select the minimum p-value and the next closest to the minimum in data.table

Here is an example about what I want:
set.seed(123)
data <- data.frame(X = rep(letters[1:3], each = 4), Y = sample(1:12, 12), Z = sample(1:100, 12))
setDT(data)
What I would like to do is to select, for each unique value of X, the row with the minimum Y and the row with the next closest value to the minimum.
Desired output
>data
a 4 68
a 5 11
b 1 4
b 10 89
c 2 64
c 3 82
The minimum value is already answered in this post: How to select rows by group with the minimum value and containing NAs in R
data[, .SD[which.min(Y)], by=X]
But how to do it with the minimum and the next closest?
For the ungrouped case, for a data.table you can do:
data[rank(Y) %in% 1:2, ]
For the grouped case, you can do:
data[, .SD[rank(Y) %in% 1:2], by = X]
X Y Z
1: a 4 68
2: a 5 11
3: b 1 4
4: b 10 89
5: c 3 82
6: c 2 64
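One caveat (my addition, not from the original answer): rank() gives averaged ranks for ties, so with tied Y values rank(Y) %in% 1:2 can miss rows (two tied minima get rank 1.5 each, matching neither 1 nor 2). data.table's frank() with ties.method = "first" avoids this:
# guarantees exactly two rows per group even when Y has ties
data[, .SD[frank(Y, ties.method = "first") <= 2], by = X]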

How to use apply function once for each unique factor value

I'm trying out some commands on the built-in R dataset ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each value of the "Chick" column, the difference between the chick's weight at Time 0 and at Time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value per group (assuming that 'Chick' and 'Diet' are the grouping factor columns):
library(data.table)
setDT(df1)[, list(Diff = abs(weight[Time == 21] - weight[Time == 0])), .(Chick, Diet)]
and if we need to create a column:
setDT(df1)[, Diff := abs(weight[Time == 21] - weight[Time == 0]), .(Chick, Diet)]
I noticed that in the example Time == 21 is not present for Chick 2; in that case, we may want to use whichever of the two values is available:
setDT(df1)[, {
  tmp <- Time %in% c(0, 21)
  list(Diff = if (sum(tmp) > 1) abs(diff(weight[tmp])) else weight[tmp])
}, by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group:
setDT(df1)[, list(Diff = weight[which.max(Time)] -
                         weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered:
setDT(df1)[, list(Diff = abs(diff(weight[c(1L, .N)]))), by = .(Chick, Diet)]
Using by from base R:
by(df1[1:2], df1[3:4], FUN = function(x) with(x,
   abs(weight[which.max(Time)] - weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
library(dplyr)
ChickWeight %>%
  group_by(Chick = as.numeric(as.character(Chick))) %>%
  summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first() and last() as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor, so without coercing it to numeric the final ordering looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])

cut with varied intervals?

I have a dataset with two variables: one is a grouping variable and the other is a value. The data is sorted by value within each group. I want to cut the value variable into a factor within each group, so that each level spans less than an interval of 10; that is, whenever a value is 10 or more above the current level's base, a new level is created. Below is a demo dataset, where newgrp is the new variable I want. Maybe filter() is what I need here, but I have been in a daze with it for quite a while. Any thoughts?
grp val newgrp
a 101 1
a 101 1
a 102 1
a 110 1
a 111 2 <-- a new level is created since 111 - 101 > 9
a 112 2
a 148 3 <-- a new level is created since 148 - 111 > 9
a 157 3
a 158 4 <-- a new level is created since 158 - 148>9
b 8 1 <-- levels start over for group b
b 9 1
b 12 1
b 17 1
b 18 2
I don't think there's any way to avoid defining a function first that will loop through each vector, since two numbers (the "base" and "new group") need to be reset every time a large enough difference is encountered.
NewGroup <- function(x) {
  base <- x[1]
  new <- 1
  newgrp <- c()
  for (i in seq_along(x)) {
    if (x[i] - base > 9) {
      base <- x[i]
      new <- new + 1
    }
    newgrp[i] <- new
  }
  return(newgrp)
}
dt[, newgrp := NewGroup(val), by = grp]
grp val newgrp
1: a 101 1
2: a 101 1
3: a 102 1
4: a 110 1
5: a 111 2
6: a 112 2
7: a 148 3
8: a 157 3
9: a 158 4
10: b 8 1
11: b 9 1
12: b 12 1
13: b 17 1
14: b 18 2
You can use this (replace yourdf with your data frame):
do.call(rbind, by(yourdf, yourdf$grp, function(df) within(df, newgrp <- cumsum(c(1, diff(val)) > 9))))
Note, though, that this splits on consecutive differences rather than on distance from the running base, so it does not reproduce the newgrp column above: in group a, no consecutive gap among 101..112 exceeds 9, so 111 and 112 stay in the first level even though 111 - 101 > 9 (and its levels start at 0 rather than 1).
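For completeness, a loop-free sketch of the base-reset logic (my own variant, not from the original answers): carry the running base along with Reduce(..., accumulate = TRUE), then number the stretches that share a base with data.table::rleid().
library(data.table)

## the running "base" resets whenever a value exceeds it by more than 9;
## rleid() then numbers each stretch that shares the same base
NewGroup2 <- function(x) {
  bases <- Reduce(function(b, v) if (v - b > 9) v else b, x, accumulate = TRUE)
  rleid(bases)
}
dt[, newgrp := NewGroup2(val), by = grp]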
