Cut with varied intervals in R?

I have a dataset with two variables: one is a grouping variable and the other is a value. The data is sorted by value within each group. I want to cut the value variable into a factor within each group, so that each level spans less than 10 units: whenever a value is 10 or more above the first value of the current level, a new level is created. Below is some demo data, where newgrp is the new variable I want. Maybe filter() is what I need here, but I have been in a daze with it for quite a while. Any thoughts?
grp val newgrp
a 101 1
a 101 1
a 102 1
a 110 1
a 111 2 <-- a new level is created since 111 - 101 > 9
a 112 2
a 148 3 <-- a new level is created since 148 - 111 > 9
a 157 3
a 158 4 <-- a new level is created since 158 - 148>9
b 8 1 <-- levels start over for group b
b 9 1
b 12 1
b 17 1
b 18 2
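For reference, the demo data can be constructed as a data.table like this (a minimal sketch; the object name dt matches its use in the answer below):
library(data.table)
dt <- data.table(
  grp = c(rep("a", 9), rep("b", 5)),
  val = c(101, 101, 102, 110, 111, 112, 148, 157, 158, 8, 9, 12, 17, 18)
)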

Edit
I don't think there's any way to avoid defining a function first that will loop through each vector, since two numbers (the "base" and "new group") need to be reset every time a large enough difference is encountered.
NewGroup <- function(x)
{
  base <- x[1]                  # first value of the current level
  new <- 1                      # current level number
  newgrp <- integer(length(x))
  for (i in seq_along(x))
  {
    if (x[i] - base > 9)
    {
      # gap from the level's first value has reached 10: start a new level
      base <- x[i]
      new <- new + 1
    }
    newgrp[i] <- new
  }
  return(newgrp)
}
dt[, newgrp := NewGroup(val), by = grp]
grp val newgrp
1: a 101 1
2: a 101 1
3: a 102 1
4: a 110 1
5: a 111 2
6: a 112 2
7: a 148 3
8: a 157 3
9: a 158 4
10: b 8 1
11: b 9 1
12: b 12 1
13: b 17 1
14: b 18 2
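If a dplyr workflow is preferred (filter() was mentioned in the question), the same NewGroup() helper can be applied per group with mutate(); this is only a sketch, assuming the data sits in dt as above:
library(dplyr)
dt %>%
  group_by(grp) %>%
  mutate(newgrp = NewGroup(val)) %>%  # same base-resetting logic as the data.table call
  ungroup()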

You can use this:
do.call(rbind, by(yourdf, yourdf$grp, function(df) within(df, newgrp <- cumsum(c(1,diff(val))>9))))
Replace yourdf with your dataframe.
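For completeness, a rough dplyr translation of this cumsum idea might look like the sketch below. Note that it compares each value to the immediately preceding row rather than to the running base used by NewGroup() above, so the two approaches will not always agree (and the group numbers here start at 0):
library(dplyr)
yourdf %>%
  group_by(grp) %>%
  # TRUE wherever the jump from the previous row exceeds 9; cumsum numbers the runs
  mutate(newgrp = cumsum(c(1, diff(val)) > 9)) %>%
  ungroup()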


How to apply a custom recursive function with data.table and loop over each index group-wise?

Since I can't find an answer in the questions below:
Apply a recursive function over groups and rows without explicit for loop
How do I mimic the drag functionality for new rows such as in Excel but for R?
I'll try asking a new question related to the above. That is, I want to apply a custom function recursively by group, using the output for previous rows as input for the current row.
Example with a data frame and a for loop:
for(i in 2:nrow(df1)) df1$z[i] <- df1$z[i-1] + df1$x[i-1] - df1$y[i-1]
Example with a data frame and a for loop with a custom function:
for(i in 2:nrow(df1)) df1$z[i] <- ifelse(df1$z[i-1] == df1$z[i],
df1$z[i-1] + df1$x[i-1] - df1$y[i-1],
df1$z[i-1] - df1$x[i-1] - df1$y[i-1])
However, with more than 1 million rows, data frames and for loops are not optimal.
Is there any way to do above with data.table or dtplyr and optimized but also group-wise?
EDIT:
See visualization of question.
It should start from the 2nd row, as in for(i in 2:nrow(df)), and it should apply the custom function if and only if group[i] == group[i-1].
Does this use of Reduce do the trick?
tmp <- data.table(
  grp = c(rep(0, 6), rep(1, 6)),
  x = c(10, 20, 30, 40, 50, 60, 1, 2, 3, 4, 5, 6),
  y = c(1, 2, 3, 4, 5, 6, 10, 20, 30, 40, 50, 60)
)
tmp[, z := Reduce(f = function(z, i) z + x[i - 1] - y[i - 1],
                  x = (1:.N)[-1],
                  init = 0,
                  accumulate = TRUE),
    by = grp]
Output:
grp x y z
1: 0 10 1 0
2: 0 20 2 9
3: 0 30 3 27
4: 0 40 4 54
5: 0 50 5 90
6: 0 60 6 135
7: 1 1 10 0
8: 1 2 20 -9
9: 1 3 30 -27
10: 1 4 40 -54
11: 1 5 50 -90
12: 1 6 60 -135
Take for example, row 4. The value in the z column is 54, which is equal to the prior row's z-value + prior row's x-value, minus prior row's y-value.
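As an aside, for this particular linear recursion Reduce is not strictly needed: z is just the cumulative sum of the lagged x - y differences, so a sketch like the following should reproduce the same column (shift() is data.table's lag; z2 is my own name for the check column):
tmp[, z2 := cumsum(shift(x - y, fill = 0)), by = grp]
Reduce becomes genuinely necessary once the update rule depends on the previous z itself, as in the ifelse example below.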
The function f within Reduce can take any complicated form, including ifelse statements. Here is an example where I've made a function called func, which is a wrapper around Reduce. Within the Reduce statement, f is a function taking prev (thanks to a suggestion by @r2evans). It first calculates the previous row's s value minus the previous row's t value and stores it as k (this is akin to your x[i-1] - y[i-1]). Then comes an ifelse statement: if k is greater than 10, the new value in this row is the previous z value minus the quantity (20 - 4k), i.e. prev - (20 - 4*k); otherwise it is the previous z value plus k, which is equal to your original formulation z[i-1] + x[i-1] - y[i-1].
func <- function(s, t) {
  Reduce(
    f = function(prev, i) {
      k <- s[i - 1] - t[i - 1]
      ifelse(k > 10, prev - (20 - 4 * k), prev + k)
    },
    x = 2:length(s),
    init = 0,
    accumulate = TRUE
  )
}
You can then assign the value of func(x, y) to z, like this:
tmp[, z:=func(x,y), by=.(grp)][]
Output:
grp x y z
1: 0 10 1 0
2: 0 20 2 9
3: 0 30 3 61
4: 0 40 4 149
5: 0 50 5 273
6: 0 60 6 433
7: 1 1 10 0
8: 1 2 20 -9
9: 1 3 30 -27
10: 1 4 40 -54
11: 1 5 50 -90
12: 1 6 60 -135

Select data.table columns based on condition, within by

I want to extract data.table columns if their contents fulfill a criterion. And I need a method that will work with by (or in some other way within combinations of columns). I am not very experienced with data.table and have tried my best with .SDcols and whatever else I could think of.
Example: I often have datasets with observations at multiple time points for multiple subjects. They also contain covariates which do not vary within subjects.
dt1 <- data.table(
id=c(1,1,2,2,3,3),
time=c(1,2,1,2,1,2),
meas=c(452,23,555,33,322,32),
age=c(30,30,54,54,20,20),
bw=c(75,75,81,81,69,70)
)
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)? I'd like a function call that would return
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)? The function call should return:
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
Of course, I am interested if you know of a function that addresses the specific example above, but I am even more curious about how to do this generally: say, columns that contain more than two values > 1000 within any combination of id and time in by=.(id,time), or whatever...
Thanks!
How do I (efficiently) select the columns that do not vary within id (in this case, id and age)?
Maybe something like:
f <- function(DT, byChar) {
cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)==1])), byChar]$V1)
unique(DT[, c(byChar, cols), with=FALSE])
}
f(dt1, "id")
output:
id age
1: 1 30
2: 2 54
3: 3 20
And how do I select the columns that do vary within ID (so drop age)?
Similarly,
f2 <- function(DT, byChar, k) {
cols <- Reduce(intersect, DT[, .(.(names(.SD)[sapply(.SD, uniqueN)>k])), byChar]$V1)
unique(DT[, c(byChar, cols), with=FALSE])
}
f2(dt1, "id", 1)
output:
id time meas
1: 1 1 452
2: 1 2 23
3: 2 1 555
4: 2 2 33
5: 3 1 322
6: 3 2 32
data:
library(data.table)
dt1 <- data.table(
id=c(1,1,2,2,3,3),
time=c(1,2,1,2,1,2),
meas=c(452,23,555,33,322,32),
age=c(30,30,54,54,20,20),
bw=c(75,75,81,81,69,70)
)
This might also be an option:
Count unique values per column, by ID (using data.table::uniqueN)
Check in which columns the sum of unique values (by group) equals the number of unique IDs (using colSums)
Only keep (or drop) the wanted columns
library(data.table)
ids <- uniqueN(dt1$id)
#no variation
dt1[, c( TRUE, colSums( dt1[, lapply( .SD, uniqueN ), by = id ][,-1]) == ids ), with = FALSE]
id age
1: 1 30
2: 1 30
3: 2 54
4: 2 54
5: 3 20
6: 3 20
#variation
dt1[, c( TRUE, !colSums( dt1[, lapply( .SD, uniqueN ), by = id ][,-1]) == ids ), with = FALSE]
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
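If the duplicated rows in the first ("no variation") result are unwanted, the whole expression can be wrapped in unique() to match the expected three-row output (a small addition of mine):
unique(dt1[, c(TRUE, colSums(dt1[, lapply(.SD, uniqueN), by = id][, -1]) == ids), with = FALSE])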
Based on chinsoon12's suggestion, I managed to put something together. I need four steps, and I'm not sure how efficient it is, but at least it does the job. To recap, this is the dataset:
dt1
id time meas age bw
1: 1 1 452 30 75
2: 1 2 23 30 75
3: 2 1 555 54 81
4: 2 2 33 54 81
5: 3 1 322 20 69
6: 3 2 32 20 70
I put this together to get the columns that are constant within "id" (only age):
cols.id <- "id"
dt2 <- dt1[, .SD[, lapply(.SD, function(x)uniqueN(x)==1)], by=cols.id]
ifkeep <- dt2[,sapply(.SD,all),.SDcols=!(cols.id)]
keep <- c(cols.id,setdiff(colnames(dt2),cols.id)[ifkeep])
unique(dt1[,keep,with=F])
id age
1: 1 30
2: 2 54
3: 3 20
And to get the columns that vary within any value of "id" (age is dropped):
cols.id <- "id"
## different from above: ==1 -> >1
dt2 <- dt1[, .SD[, lapply(.SD, function(x)uniqueN(x)>1)], by=cols.id]
## difference from above: all -> any
ifkeep <- dt2[,sapply(.SD,any),.SDcols=!(cols.id)]
keep <- c(cols.id,setdiff(colnames(dt2),cols.id)[ifkeep])
unique(dt1[,keep,with=F])
id time meas bw
1: 1 1 452 75
2: 1 2 23 75
3: 2 1 555 81
4: 2 2 33 81
5: 3 1 322 69
6: 3 2 32 70
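For reuse, the four steps above could be folded into a small helper; the sketch below is my own (the name varying_cols and its arguments are not from the original post) and switches between the "constant within group" and "varying within group" selections with one flag:
varying_cols <- function(dt, by.col, varying = FALSE) {
  # per group: is each column constant (a single unique value)?
  const <- dt[, lapply(.SD, function(x) uniqueN(x) == 1L), by = by.col]
  # a column counts as constant overall only if it is constant in every group
  flag <- const[, sapply(.SD, all), .SDcols = !by.col]
  keep <- if (varying) names(flag)[!flag] else names(flag)[flag]
  unique(dt[, c(by.col, keep), with = FALSE])
}
# varying_cols(dt1, "id")                  # id, age
# varying_cols(dt1, "id", varying = TRUE)  # id, time, meas, bw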

Subsetting data.frame to return first 200 rows for specific condition in r

I have a data.frame with 3.3 million rows and 9 columns. Below is an example with the 3 relevant columns.
StimulusName Subject PupilMeans
1 1 101 3.270000
2 1 101 3.145000
3 1 101 3.265000
4 2 101 3.015000
5 2 101 3.100000
6 2 101 3.051250
7 1 102 3.035000
8 1 102 3.075000
9 1 102 3.050000
10 2 102 3.056667
11 2 102 3.059167
12 2 102 3.060000
13 1 103 3.085000
14 1 103 3.125000
15 1 103 3.115000
I want to subset the data based on stimulus name and subject and then take either the first few or the last few rows of that subset. So, for example, returning rows 10 and 11 by getting the first 2 rows where df$StimulusName == 2 & df$Subject == 102.
The actual data frame contains thousands of observations per Stimulus and Subject. I want to use it to plot the first and last 200 observations of the stimulus separately.
Have not tested this out, but should work.
First 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[1:200,]
Then plot df_filtered.
Last 200
df_filtered <- subset(df, StimulusName == 2 & Subject == 102)
df_filtered <- df_filtered[(nrow(df_filtered)-199):nrow(df_filtered),]
Then plot df_filtered.
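If the first or last 200 rows are needed for every StimulusName/Subject combination at once (e.g. for plotting each stimulus separately), a data.table sketch along these lines may help (assuming df has been converted with setDT; 200 can be any n):
library(data.table)
setDT(df)
first200 <- df[, head(.SD, 200), by = .(StimulusName, Subject)]
last200  <- df[, tail(.SD, 200), by = .(StimulusName, Subject)]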
Perhaps you want something like this:
subCond <- function(x, r, c) {
m <- x[x[, 1] == r & x[, 2] == c,]
return(m)
}
Yields e.g.:
> subCond(df, 1, 102)
StimulusName Subject PupilMeans
7 1 102 3.035
8 1 102 3.075
9 1 102 3.050
or
> subCond(df, 2, 101)
StimulusName Subject PupilMeans
4 2 101 3.01500
5 2 101 3.10000
6 2 101 3.05125
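To return only the first or last n rows of such a subset, subCond could be combined with head()/tail(); the wrapper below is a hypothetical extension of mine, not part of the original answer:
subCondN <- function(x, r, c, n = 200, from = c("first", "last")) {
  from <- match.arg(from)
  m <- x[x[, 1] == r & x[, 2] == c, ]
  if (from == "first") head(m, n) else tail(m, n)
}
# subCondN(df, 2, 102, n = 2)           # rows 10 and 11
# subCondN(df, 2, 102, n = 2, "last")   # rows 11 and 12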

How to use apply function once for each unique factor value

I'm trying out some commands on the built-in R dataset ChickWeight. The data looks as follows.
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
7 106 12 1 1
8 125 14 1 1
9 149 16 1 1
10 171 18 1 1
11 199 20 1 1
12 205 21 1 1
13 40 0 2 1
14 49 2 2 1
15 58 4 2 1
Now what I would like to do is simply output, for each value of the "Chick" column, the difference between the weight at Time 0 and at Time 21 (the last time value), i.e. the weight the chick has put on.
I've been trying tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1]). But this of course applies the value to all rows.
How do I make it so that it applies only once for each unique Chick-value?
If we need a single value for each combination of the 'factor' columns (assuming that 'Chick' and 'Diet' are the factor columns)
library(data.table)
setDT(df1)[, list(Diff= abs(weight[Time==21]-weight[Time==0])) ,.(Chick, Diet)]
And if we need to create a column
setDT(df1)[, Diff:= abs(weight[Time==21]-weight[Time==0]) ,.(Chick, Diet)]
I noticed that in the example Time = 21 is not found for Chick no. 2; maybe in that case we need just the one value that is present
setDT(df1)[, {tmp <- Time %in% c(0,21)
list(Diff= if(sum(tmp)>1) abs(diff(weight[tmp])) else weight[tmp]) } ,
by = .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 40
If we are taking the difference of 'weight' based on the max and min 'Time' for each group
setDT(df1)[, list(Diff=weight[which.max(Time)]-
weight[which.min(Time)]), .(Chick, Diet)]
# Chick Diet Diff
#1: 1 1 163
#2: 2 1 18
Also, if the 'Time' is ordered
setDT(df1)[, list(Diff= abs(diff(weight[c(1L,.N)]))), by =.(Chick, Diet)]
Using by from base R
by(df1[1:2], df1[3:4], FUN= function(x) with(x,
abs(weight[which.max(Time)]-weight[which.min(Time)])))
#Chick: 1
#Diet: 1
#[1] 163
#------------------------------------------------------------
#Chick: 2
#Diet: 1
#[1] 18
Here's a solution using dplyr:
ChickWeight %>%
group_by(Chick = as.numeric(as.character(Chick))) %>%
summarise(weight_gain = last(weight) - first(weight), final_time = last(Time))
(first() and last() as suggested by @ulfelder.)
Note that ChickWeight$Chick is an ordered factor so without coercing it into numeric the final order looks odd.
Using base R:
ChickWeight$Chick <- as.numeric(as.character(ChickWeight$Chick))
tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
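If a data frame of gains is preferred over the named vector that tapply returns, it can be reshaped afterwards (a small addition of mine):
gains <- tapply(ChickWeight$weight, ChickWeight$Chick, function(x) x[length(x)] - x[1])
gain_df <- data.frame(Chick = as.numeric(names(gains)), gain = unname(gains))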

R data.table with rollapply

Is there an existing idiom for computing rolling statistics using data.table grouping?
For example, given the following code:
DT = data.table(x=rep(c("a","b","c"),each=2), y=c(1,3), v=1:6)
setkey(DT, y)
stat.ror <- DT[,rollapply(v, width=1, by=1, mean, na.rm=TRUE), by=y];
If there isn't one yet, what would be the best way to do it?
In fact I am trying to solve this very problem right now. Here is a partial solution which will work for grouping by a single column:
Edit: got it with RcppRoll, I think:
windowed.average <- function(input.table,
                             window.width = 2,
                             id.cols = names(input.table)[3],
                             index.col = names(input.table)[1],
                             val.col = names(input.table)[2]) {
  require(RcppRoll)
  # rolling mean of the value column, computed within each group
  avg.with.group <-
    input.table[, roll_mean(get(val.col), n = window.width), by = c(id.cols)]
  # rolling mean of the index column (e.g. time), used as the new index
  avg.index <-
    input.table[, roll_mean(get(index.col), n = window.width), by = c(id.cols)]$V1
  output.table <- data.table(
    Group = avg.with.group,
    Index = avg.index)
  # rename columns to (sensibly) match inputs
  setnames(output.table, old = colnames(output.table),
           new = c(id.cols, val.col, index.col))
  return(output.table)
}
A (badly written) unit test that the above will pass:
require(testthat)
require(zoo)
test.datatable <- data.table(Time = rep(seq_len(10), times=2),
Voltage = runif(20),
Channel= rep(seq_len(2),each=10))
test.width <- 8
# first test: single id column
test.avgtable <- data.table(
test.datatable[,rollapply(Voltage, width = test.width, mean, na.rm=TRUE),
by=c("Channel")],
Time = test.datatable[,rollapply(Time, width = test.width, mean, na.rm=TRUE),
by=c("Channel")]$V1)
setnames(test.avgtable,old=names(test.avgtable),
new=c("Channel","Voltage","Time"))
expect_that(test.avgtable,
is_identical_to(windowed.average(test.datatable,test.width)))
How it looks:
> test.datatable
Time Voltage Channel Class
1: 1 0.310935570 1 1
2: 2 0.565257533 1 2
3: 3 0.577278573 1 1
4: 4 0.152315111 1 2
5: 5 0.836052122 1 1
6: 6 0.655417230 1 2
7: 7 0.034859642 1 1
8: 8 0.572040136 1 2
9: 9 0.268105436 1 1
10: 10 0.126484340 1 2
11: 1 0.139711248 2 1
12: 2 0.336316520 2 2
13: 3 0.413086486 2 1
14: 4 0.304146029 2 2
15: 5 0.399344631 2 1
16: 6 0.581641210 2 2
17: 7 0.183586025 2 1
18: 8 0.009775488 2 2
19: 9 0.449576242 2 1
20: 10 0.938517952 2 2
> test.avgtable
Channel Voltage Time
1: 1 0.4630195 4.5
2: 1 0.4576657 5.5
3: 1 0.4028191 6.5
4: 2 0.2959510 4.5
5: 2 0.3346841 5.5
6: 2 0.4099593 6.5
Originally I hadn't managed to make this work with multiple groupings, but the RcppRoll version above looks okay for multiple column groups:
# second test: multiple id columns
# Depends on the first test passing to be meaningful.
test.width <- 4
test.datatable[,Class:= rep(seq_len(2),times=ceiling(nrow(test.datatable)/2))]
# windowed.average(test.datatable,test.width,id.cols=c("Channel","Class"))
test.avgtable <- rbind(windowed.average(test.datatable[Class==1,],test.width),
windowed.average(test.datatable[Class==2,],test.width))
# somewhat artificially attaching expected class labels
test.avgtable[,Class:= rep(seq_len(2),times=nrow(test.avgtable)/4,each=2)]
setkey(test.avgtable,Channel)
setcolorder(test.avgtable,c("Channel","Class","Voltage","Time"))
expect_that(test.avgtable,
is_equivalent_to(windowed.average(test.datatable,test.width,
id.cols=c("Channel","Class"))))
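As a closing note, newer versions of data.table (1.12.0 and later) ship rolling functions of their own, so a grouped rolling mean can now be written directly without rollapply or RcppRoll. A minimal sketch with the DT from the question, grouping by x here:
library(data.table)
DT <- data.table(x = rep(c("a", "b", "c"), each = 2), y = c(1, 3), v = 1:6)
DT[, roll_v := frollmean(v, n = 2), by = x]  # right-aligned window of width 2; first value in each group is NA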
