Keep Empty Factor Levels During Aggregation [duplicate]

This question already has answers here:
Empty factors in "by" data.table
(2 answers)
Closed 8 years ago.
I'm using data.table to aggregate values, but I'm finding that when the "by" variable has a level not present in the aggregation, it is omitted, even if it is specified in the factor levels.
The code below generates a data.table with 6 rows, the last two of which only have one of the two possible levels for f2 nested within f1. During aggregation, the {3,1} combination is dropped.
set.seed(1987)
dt <- data.table(f1 = factor(rep(1:3, each = 2)),
                 f2 = factor(sample(1:2, 6, replace = TRUE)),
                 val = runif(6))
str(dt)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ f1 : Factor w/ 3 levels "1","2","3": 1 1 2 2 3 3
$ f2 : Factor w/ 2 levels "1","2": 1 2 2 1 2 2
$ val: num 0.383 0.233 0.597 0.346 0.606 ...
- attr(*, ".internal.selfref")=<externalptr>
dt
f1 f2 val
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 0.6058819
6: 3 2 0.7437177
dt[, sum(val), by = list(f1, f2)] # output is missing a row
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 1.3495996
# this is the output I'm looking for:
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 1 0.0000000 # <- the missing row from above
6: 3 2 1.3495996
Is there a way to achieve this behavior?

Why do you expect data.table to compute sums for all combinations of f1 and f2?
If this is what you want, you should add the missing rows to the original data before computing the grouped sum. For example:
setkey(dt, f1, f2)
# omit "by = .EACHI" in data.table <= 1.9.2
dt[CJ(levels(f1), levels(f2)), sum(val, na.rm = TRUE),
   allow.cartesian = TRUE, by = .EACHI]
## f1 f2 V1
## 1: 1 1 0.3829077
## 2: 1 2 0.2327311
## 3: 2 1 0.3456710
## 4: 2 2 0.5965087
## 5: 3 1 0.0000000 ## <- your "missing row" :)
## 6: 3 2 1.3495996
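A slightly different sketch of the same idea (not from the original answer; it assumes a data.table version with on= joins, i.e. 1.9.6 or later): build the full grid of level combinations explicitly and join on it, so no key needs to be set. The character join columns are matched against the factor levels.
# every (f1, f2) combination, built from the factor levels
full_grid <- dt[, CJ(f1 = levels(f1), f2 = levels(f2))]
dt[full_grid, on = c("f1", "f2"),
   .(V1 = sum(val, na.rm = TRUE)), by = .EACHI]  # the missing combination sums to 0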

Related

Select Random Consecutive Rows Per Group

I have data which is grouped by 'student_id':
my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                     exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                     result = rnorm(15,60,10))
my_data
student_id exam_no result
1 1 1 56.60374
2 1 2 55.76655
3 1 3 53.81728
4 1 4 74.82202
5 1 5 34.91834
6 2 1 58.32422
7 2 2 60.38213
8 2 3 49.40390
9 2 4 63.85426
10 2 5 40.32912
11 3 1 69.54969
12 3 2 43.36639
13 3 3 37.97265
14 3 4 52.36436
15 3 5 61.62080
My Question:
For each student, I want to select a set of consecutive rows, with random start and end rows.
For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.
I thought of the following way to do this:
Create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)
library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())
# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)
From here, I was going to write a loop that selects the rows between the "min" and "max" indices for each student (e.g. my_data[min:max]), but the previous code gives me warnings and illogical results:
Warning message:
In 1:counts$counts :
numerical expression has 3 elements: only the first used
Warning messages:
1: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
numerical expression has 3 elements: only the first used
# A tibble: 3 x 4
student_id counts min max
<dbl> <int> <int> <int>
1 1 5 4 5
2 2 5 4 5
3 3 5 4 5
I am not sure how to proceed from here; can someone please show me how to continue?
Thanks!
A base R option: use cumsum to label the rows lying between two randomly sampled positions within each group:
subset(
  my_data,
  ave(
    exam_no,
    student_id,
    FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
  ) == 1
)
which gives, for example
student_id exam_no result
2 1 2 61.83643
3 1 3 51.64371
4 1 4 75.95281
6 2 1 51.79532
7 2 2 64.87429
8 2 3 67.38325
11 3 1 75.11781
12 3 2 63.89843
13 3 3 53.78759
A more compact data.table version based on the same idea:
library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.
library(data.table)
setDT(my_data)
set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]
# student_id exam_no result
# <num> <num> <num>
# 1: 1 5 74.05672
# 2: 1 4 49.37525
# 3: 1 3 67.41662
# 4: 1 2 67.64935
# 5: 2 4 55.15337
# 6: 2 3 58.95694
# 7: 3 4 50.79859
# 8: 3 3 53.66886
# 9: 3 2 47.01089
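For completeness, here is a dplyr sketch of the same idea (not from the answers above): sample two row positions per student and slice out everything between them.
library(dplyr)
set.seed(3)
my_data %>%
  group_by(student_id) %>%
  # pick two row positions per student and keep every row between them
  slice({ix <- sort(sample(n(), 2)); ix[1]:ix[2]}) %>%
  ungroup()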

recode/replace multiple values in a shared data column to a single value across data frames

I hope I haven't missed it, but I haven't been able to find a working solution to this problem.
I have a set of data frames with a shared column. These columns contain multiple and varying transcription errors, some of which are shared, others not, for multiple values.
I would like replace/recode the transcription errors (bad_values) with the correct values (good_values) across all data frames.
I have tried nesting the map*() family of functions across lists of data frames, bad_values, and good_values to do this, among other things. Here is an example:
df1 = data.frame(grp = c("a1","a.","a.",rep("b",7)), measure = rnorm(10))
df2 = data.frame(grp = c(rep("as", 3), "b2",rep("a",22)), measure = rnorm(26))
df3 = data.frame(grp = c(rep("b-",3),rep("bq",2),"a", rep("a.", 3)), measure = 1:9)
df_list = list(df1, df2, df3)
bad_values = list(c("a1","a.","as"), c("b2","b-","bq"))
good_values = list("a", "b")
dfs = map(df_list, function(x) {
  x %>% mutate(grp = plyr::mapvalues(grp, bad_values, rep(good_values, length(bad_values))))
})
I didn't necessarily expect this to work beyond a single good/bad value pair. However, I thought nesting another call to map*() within it might work:
dfs = map(df_list, function(x) {
x %>% mutate(grp = map2(bad_values, good_values, function(x,y) {
recode(grp, bad_values = good_values)})
})
I have tried a number of other approaches, none of which have worked.
Ultimately, I would like to go from a set of data frames with errors, as here:
[[1]]
grp measure
1 a1 0.5582253
2 a. 0.3400904
3 a. -0.2200824
4 b -0.7287385
5 b -0.2128275
6 b 1.9030766
[[2]]
grp measure
1 as 1.6148772
2 as 0.1090853
3 as -1.3714180
4 b2 -0.1606979
5 a 1.1726395
6 a -0.3201150
[[3]]
grp measure
1 b- 1
2 b- 2
3 b- 3
4 bq 4
5 bq 5
6 a 6
To a list of 'fixed' data frames, as such:
[[1]]
grp measure
1 a -0.7671052
2 a 0.1781247
3 a -0.7565773
4 b -0.3606900
5 b 1.9264804
6 b 0.9506608
[[2]]
grp measure
1 a 1.45036125
2 a -2.16715639
3 a 0.80105611
4 b 0.24216723
5 a 1.33089426
6 a -0.08388404
[[3]]
grp measure
1 b 1
2 b 2
3 b 3
4 b 4
5 b 5
6 a 6
Any help would be very much appreciated.
Here is an option using the tidyverse with recode_factor. When there are multiple elements to be changed, create a named list of key/value pairs and use recode_factor to match and change the values to the new levels.
library(tidyverse)
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
out <- map(df_list, ~ .x %>%
mutate(grp = recode_factor(grp, !!! keyval)))
Output:
out
#[[1]]
# grp measure
#1 a -1.63295876
#2 a 0.03859976
#3 a -0.46541610
#4 b -0.72356671
#5 b -1.11552841
#6 b 0.99352861
#....
#[[2]]
# grp measure
#1 a 1.26536789
#2 a -0.48189740
#3 a 0.23041056
#4 b -1.01324689
#5 a -1.41586086
#6 a 0.59026463
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#6 a 6
#....
NOTE: This doesn't change the class of the initial dataset column
str(out)
#List of 3
# $ :'data.frame': 10 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 2
# ..$ measure: num [1:10] -1.633 0.0386 -0.4654 -0.7236 -1.1155 ...
# $ :'data.frame': 26 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 1 1 1 1 1 1 ...
# ..$ measure: num [1:26] 1.265 -0.482 0.23 -1.013 -1.416 ...
# $ :'data.frame': 9 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 2 2 2 2 2 1 1 1 1
# ..$ measure: int [1:9] 1 2 3 4 5 6 7 8 9
Once we have the keyval pair list, it can also be used with base R functions:
out1 <- lapply(df_list, transform, grp = unlist(keyval[grp]))
Any reason mapping a case_when statement wouldn't work?
library(tidyverse)
df_list %>%
  map(~ mutate_if(.x, is.factor, as.character)) %>% # convert factor to character
  map(~ mutate(.x, grp = case_when(grp %in% bad_values[[1]] ~ good_values[[1]],
                                   grp %in% bad_values[[2]] ~ good_values[[2]],
                                   TRUE ~ grp)))
I could see it working for your reprex, but possibly not for the larger problem.
A base R option if you have a lot of good_values and bad_values and it is not feasible to check each one individually:
lapply(df_list, function(x) {
  vec = x[['grp']]
  mapply(function(p, q) vec[vec %in% p] <<- q, bad_values, good_values)
  transform(x, grp = vec)
})
#[[1]]
# grp measure
#1 a -0.648146527
#2 a -0.004722549
#3 a -0.943451194
#4 b -0.709509396
#5 b -0.719434286
#....
#[[2]]
# grp measure
#1 a 1.03131291
#2 a -0.85558910
#3 a -0.05933911
#4 b 0.67812934
#5 a 3.23854093
#6 a 1.31688645
#7 a 1.87464048
#8 a 0.90100179
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#....
Here, for every list element we extract its grp column, replace any bad_values found with the corresponding good_values, and return the corrected data frame.
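A hedged variant of the same lookup idea (not taken from the answers above) that avoids the <<- side effect: build one named character vector mapping bad values to good values, then do a single vectorised replacement per data frame.
# names are the bad values, elements the corresponding good values
lookup <- setNames(rep(unlist(good_values), lengths(bad_values)),
                   unlist(bad_values))
fixed <- lapply(df_list, function(df) {
  g   <- as.character(df$grp)
  hit <- g %in% names(lookup)   # rows holding a known bad value
  g[hit] <- lookup[g[hit]]      # swap them for the good value
  transform(df, grp = g)
})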

Reduce number of levels for each factor dplyr approach

I am trying to reduce the number of levels in each factor variable in my data. I want to reduce the number of levels by doing 2 operations:
If the number of levels is larger than a cut-off, replace the least frequent levels with a new level until the number of levels reaches the cut-off
Replace levels in a factor that have too few observations with a new level
I wrote a function which works fine, but I don't like the code. It does not matter if the level REMAIN itself ends up with too few observations. I would prefer a dplyr approach.
ReplaceFactor <- function(data, max_levels, min_values_factor){
  # First make sure that not too many levels are in a factor
  for(i in colnames(data)){
    if(class(data[[i]]) == "factor"){
      if(length(levels(data[[i]])) > max_levels){
        levels_keep <- names(sort(table(data[[i]]), decreasing = T))[1 : (max_levels - 1)]
        data[!get(i) %in% levels_keep, (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  # Now make sure that each level has enough observations
  for(i in colnames(data)){
    if(class(data[[i]]) == "factor"){
      if(min(table(data[[i]])) < min_values_factor){
        levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
        data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
        data[[i]] <- as.factor(as.character(data[[i]]))
      }
    }
  }
  return(data)
}
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
                 B = 1:9,
                 C = c("A","A","B","B","C","C","C","D","D"),
                 D = c("A","B","E", "E", "E","E","E", "E", "E"))
str(df)
'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
dt2 <- ReplaceFactor(data = data.table(df),
max_levels = 3,
min_values_factor = 2)
str(dt2)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
- attr(*, ".internal.selfref")=<externalptr>
dt2
A B C D
1: A 1 A REMAIN
2: A 2 A REMAIN
3: B 3 REMAIN E
4: B 4 REMAIN E
5: C 5 C E
6: C 6 C E
7: C 7 C E
8: C 8 REMAIN E
9: C 9 REMAIN E
Using forcats:
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
  mutate_if(is.factor, fct_lump, n = max_levels,
            other_level = "REMAIN", ties.method = "first") %>%
  mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.),
            other_level = "REMAIN")
# A B C D
# 1 A 1 A REMAIN
# 2 A 2 A REMAIN
# 3 B 3 B E
# 4 B 4 B E
# 5 C 5 C E
# 6 C 6 C E
# 7 C 7 C E
# 8 C 8 REMAIN E
# 9 C 9 REMAIN E
(Oh, and I wasn't able to replicate the exact behavior of your function, but you might get what you want by tweaking ties.method and subtracting 1 from max_levels.)
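On a newer forcats (assuming >= 0.5.0, where the lumping helpers were split into explicit verbs), the same two steps can be sketched with fct_lump_n() and fct_lump_min(); this also assumes dplyr >= 1.0.0 for across(), and is not tested against every edge case of the original function.
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
  # keep the (max_levels - 1) most frequent levels, pool the rest into REMAIN
  mutate(across(where(is.factor),
                ~ fct_lump_n(.x, n = max_levels - 1,
                             other_level = "REMAIN", ties.method = "first"))) %>%
  # then pool any level with fewer than min_values_factor observations
  mutate(across(where(is.factor),
                ~ fct_lump_min(.x, min = min_values_factor, other_level = "REMAIN")))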

Reshaping data.table with cumulative sum

I want to reshape a data.table, and include the historic (cumulative summed) information for each variable. The No variable indicates the chronological order of measurements for object ID. At each measurement additional information is found. I want to aggregate the known information at each timestamp No for object ID.
Let me demonstrate with an example:
For the following data.table:
df <- data.table(ID = c(1,1,1,2,2,2,2),
                 No = c(1,2,3,1,2,3,4),
                 Variable = c('a','b', 'a', 'c', 'a', 'a', 'b'),
                 Value = c(2,1,3,3,2,1,5))
df
ID No Variable Value
1: 1 1 a 2
2: 1 2 b 1
3: 1 3 a 3
4: 2 1 c 3
5: 2 2 a 2
6: 2 3 a 1
7: 2 4 b 5
I want to reshape it to this:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
So the summed values of Value, per Variable by (ID, No), cumulative over No.
I can get the result without the cumulative part by doing
dcast(df, ID+No~Variable, value.var="Value")
which results in the non-cumulative variant:
ID No a b c
1: 1 1 2 NA NA
2: 1 2 NA 1 NA
3: 1 3 3 NA NA
4: 2 1 NA NA 3
5: 2 2 2 NA NA
6: 2 3 1 NA NA
7: 2 4 NA 5 NA
Any ideas how to make this cumulative? The original data.table has over 250,000 rows, so efficiency matters.
EDIT: I just used a, b, c as an example; the original file has about 40 different levels. Furthermore, the NAs are important; there are also Value values of 0, which mean something different from NA.
POSSIBLE SOLUTION
Okay, so I've found a working solution. It is far from efficient, since it enlarges the original table.
The idea is to duplicate each row TotalNo - No times, where TotalNo is the maximum No per ID. Then the original dcast function can be used to extract the dataframe. So in code:
df[,TotalNo := .N, by=ID]
df2 <- df[rep(seq(nrow(df)), (df$TotalNo - df$No + 1))] #create duplicates
df3 <- df2[order(ID, No)]#, No:= seq_len(.N), by=.(ID, No)]
df3[,No:= seq(from=No[1], to=TotalNo[1], by=1), by=.(ID, No)]
df4 <- dcast(df3,
             formula = ID + No ~ Variable,
             value.var = "Value", fill = NA, fun.aggregate = sum)
It is not really nice, because the creation of duplicates uses more memory. I think it can be further optimized, but so far it works for my purposes. In the sample code it goes from 7 rows to 16 rows, and in the original file from 241,670 rows to a whopping 978,331. That's larger by more than a factor of 4.
SOLUTION
Eddi's answer improves on my solution's computing time on the full dataset (2.08 seconds for Eddi's versus 4.36 seconds for mine). Those are numbers I can work with! Thanks everybody!
Your solution is good, but you're adding too many rows; they are unnecessary if you compute the cumsum beforehand:
# add useful columns
df[, TotalNo := .N, by = ID][, CumValue := cumsum(Value), by = .(ID, Variable)]
# do a rolling join to extend the missing values, and then dcast
dcast(df[df[, .(No = seq(No[1], TotalNo[1])), by = .(ID, Variable)],
         on = c('ID', 'Variable', 'No'), roll = TRUE],
      ID + No ~ Variable, value.var = 'CumValue')
# ID No a b c
#1: 1 1 2 NA NA
#2: 1 2 2 1 NA
#3: 1 3 5 1 NA
#4: 2 1 NA NA 3
#5: 2 2 2 NA 3
#6: 2 3 3 NA 3
#7: 2 4 3 5 3
Here's a standard way:
library(zoo)
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var="cv")
lvls = sort(unique(df$Variable))
DT[, (lvls) := lapply(.SD, na.locf, na.rm = FALSE), by=ID, .SDcols=lvls]
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
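If you would rather avoid the zoo dependency, the same forward fill can be sketched with data.table's own nafill() (assuming data.table >= 1.12.4, where type = "locf" is available):
df[, cv := cumsum(Value), by = .(ID, Variable)]
DT = dcast(df, ID + No ~ Variable, value.var = "cv")
lvls = sort(unique(df$Variable))
# carry the last observed cumulative sum forward within each ID
DT[, (lvls) := lapply(.SD, nafill, type = "locf"), by = ID, .SDcols = lvls]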
One alternative way to do it is to use a custom-built cumulative sum function. This is exactly the method in @David Arenburg's comment, but it substitutes in a custom cumulative summary function.
EDIT: Using @eddi's much more efficient custom cumulative sum function.
cumsum.na <- function(z){
  Reduce(function(x, y) if (is.na(x) && is.na(y)) NA else sum(x, y, na.rm = T),
         z, accumulate = T)
}
cols <- sort(unique(df$Variable))
res <- dcast(df, ID + No ~ Variable, value.var = "Value")[, (cols) := lapply(.SD, cumsum.na), .SDcols = cols, by = ID]
res
ID No a b c
1: 1 1 2 NA NA
2: 1 2 2 1 NA
3: 1 3 5 1 NA
4: 2 1 NA NA 3
5: 2 2 2 NA 3
6: 2 3 3 NA 3
7: 2 4 3 5 3
This definitely isn't the most efficient, but it gets the job done and gives you an admittedly very slow cumulative summary function that handles NAs the way you want.

Keep only 'by' variables when collapsing data.table

I have a very large data.table:
DT <- data.table(a=c(1,1,1,1,2,2,2,2,3,3,3,3),b=c(1,1,2,2),c=1:12)
And I need to collapse it by several variables, e.g. list(a,b). Easy:
DT[,sum(c),by=list(a,b)]
a b V1
1: 1 1 3
2: 1 2 7
3: 2 1 11
4: 2 2 15
5: 3 1 19
6: 3 2 23
However, I don't want to take any operation on c, I just want to drop it:
DT[,,by=list(a,b)] # includes a,b,c, thus does not collapse
DT[,list(),by=list(a,b)] # zero rows
DT[,a,by=list(a,b)] # what I want but adds extraneous column a after 'by' columns
How can I specify X below to get the indicated result?
DT[,X,by=list(a,b)]
a b
1: 1 1
2: 1 2
3: 2 1
4: 2 2
5: 3 1
6: 3 2
unique.data.table has a by argument; you can then subset the result to get the columns you want, e.g.:
unique(DT, by = c('a', 'b'))[, c('a','b')]
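An equivalent way (not in the original answer) is to select just the grouping columns first and then deduplicate:
unique(DT[, .(a, b)])
##    a b
## 1: 1 1
## 2: 1 2
## 3: 2 1
## 4: 2 2
## 5: 3 1
## 6: 3 2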
