I am trying to reduce the number of levels in each factor variable in my data, using two operations:
If the number of levels exceeds a cut-off, replace the least frequent levels with a new level until the number of levels reaches the cut-off.
Replace levels that have too few observations with a new level.
I wrote a function that works fine, but I don't like the code. It does not matter if the REMAIN level itself has too few observations. I would prefer a dplyr approach.
ReplaceFactor <- function(data, max_levels, min_values_factor){
# First make sure that there are not too many levels in a factor
for(i in colnames(data)){
if(class(data[[i]]) == "factor"){
if(length(levels(data[[i]])) > max_levels){
levels_keep <- names(sort(table(data[[i]]), decreasing = T))[1 : (max_levels - 1)]
data[!get(i) %in% levels_keep, (i) := "REMAIN"]
data[[i]] <- as.factor(as.character(data[[i]]))
}
}
}
# Now make sure that each level has enough observations
for(i in colnames(data)){
if(class(data[[i]]) == "factor"){
if(min(table(data[[i]])) < min_values_factor){
levels_replace <- table(data[[i]])[table(data[[i]]) < min_values_factor]
data[get(i) %in% names(levels_replace), (i) := "REMAIN"]
data[[i]] <- as.factor(as.character(data[[i]]))
}
}
}
return(data)
}
df <- data.frame(A = c("A","A","B","B","C","C","C","C","C"),
B = 1:9,
C = c("A","A","B","B","C","C","C","D","D"),
D = c("A","B","E", "E", "E","E","E", "E", "E"))
str(df)
'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 4 levels "A","B","C","D": 1 1 2 2 3 3 3 4 4
$ D: Factor w/ 3 levels "A","B","E": 1 2 3 3 3 3 3 3 3
dt2 <- ReplaceFactor(data = data.table(df),
max_levels = 3,
min_values_factor = 2)
str(dt2)
Classes ‘data.table’ and 'data.frame': 9 obs. of 4 variables:
$ A: Factor w/ 3 levels "A","B","C": 1 1 2 2 3 3 3 3 3
$ B: int 1 2 3 4 5 6 7 8 9
$ C: Factor w/ 3 levels "A","C","REMAIN": 1 1 3 3 2 2 2 3 3
$ D: Factor w/ 2 levels "E","REMAIN": 2 2 1 1 1 1 1 1 1
- attr(*, ".internal.selfref")=<externalptr>
dt2
A B C D
1: A 1 A REMAIN
2: A 2 A REMAIN
3: B 3 REMAIN E
4: B 4 REMAIN E
5: C 5 C E
6: C 6 C E
7: C 7 C E
8: C 8 REMAIN E
9: C 9 REMAIN E
Using forcats:
library(dplyr)
library(forcats)
max_levels <- 3
min_values_factor <- 2
df %>%
mutate_if(is.factor, fct_lump, n = max_levels,
other_level = "REMAIN", ties.method = "first") %>%
mutate_if(is.factor, fct_lump, prop = (min_values_factor - 1) / nrow(.),
other_level = "REMAIN")
# A B C D
# 1 A 1 A REMAIN
# 2 A 2 A REMAIN
# 3 B 3 B E
# 4 B 4 B E
# 5 C 5 C E
# 6 C 6 C E
# 7 C 7 C E
# 8 C 8 REMAIN E
# 9 C 9 REMAIN E
(Oh, and I wasn't able to replicate the exact behavior of your function, but you might get what you want by tweaking ties.method and subtracting 1 from max_levels.)
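For comparison, the two operations from the question can be sketched in base R without data.table. This is only an illustration: `lump_factor` and its arguments `min_count` and `other` are hypothetical names, not part of any package, and ties between equally frequent levels are kept in the order that `sort()` returns them.

```r
# Hypothetical helper (not from the original posts): lump rare levels of one factor.
lump_factor <- function(f, max_levels, min_count, other = "REMAIN") {
  f <- as.character(f)
  counts <- sort(table(f), decreasing = TRUE)
  # Step 1: if there are too many levels, keep the (max_levels - 1) most frequent
  if (length(counts) > max_levels) {
    keep <- names(counts)[seq_len(max_levels - 1)]
    f[!f %in% keep] <- other
  }
  # Step 2: lump every level (except `other`) with fewer than min_count rows
  counts <- table(f)
  rare <- setdiff(names(counts)[counts < min_count], other)
  f[f %in% rare] <- other
  factor(f)
}

df <- data.frame(
  A = c("A", "A", "B", "B", "C", "C", "C", "C", "C"),
  B = 1:9,
  C = c("A", "A", "B", "B", "C", "C", "C", "D", "D"),
  D = c("A", "B", "E", "E", "E", "E", "E", "E", "E"),
  stringsAsFactors = TRUE
)

# Apply the helper to the factor columns only
is_fct <- vapply(df, is.factor, logical(1))
df[is_fct] <- lapply(df[is_fct], lump_factor, max_levels = 3, min_count = 2)
str(df)
```

On the question's sample data this should give the same levels as the original `ReplaceFactor` function, since both keep the `max_levels - 1` most frequent levels and then lump the sparse ones.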
I hope I haven't missed it, but I haven't been able to find a working solution to this problem.
I have a set of data frames with a shared column. These columns contain multiple and varying transcription errors, some of which are shared, others not, for multiple values.
I would like to replace/recode the transcription errors (bad_values) with the correct values (good_values) across all data frames.
I have tried nesting the map*() family of functions across lists of data frames, bad_values, and good_values to do this, among other things. Here is an example:
df1 = data.frame(grp = c("a1","a.","a.",rep("b",7)), measure = rnorm(10))
df2 = data.frame(grp = c(rep("as", 3), "b2",rep("a",22)), measure = rnorm(26))
df3 = data.frame(grp = c(rep("b-",3),rep("bq",2),"a", rep("a.", 3)), measure = 1:9)
df_list = list(df1, df2, df3)
bad_values = list(c("a1","a.","as"), c("b2","b-","bq"))
good_values = list("a", "b")
dfs = map(df_list, function(x) {
x %>% mutate(grp = plyr::mapvalues(grp, bad_values, rep(good_values,length(bad_values))))
})
Which I didn't necessarily expect to work beyond a single good-bad value pair. However, I thought nesting another call to map*() within this might work:
dfs = map(df_list, function(x) {
x %>% mutate(grp = map2(bad_values, good_values, function(x,y) {
recode(grp, bad_values = good_values)})
})
I have tried a number of other approaches, none of which have worked.
Ultimately, I would like to go from a set of data frames with errors, as here:
[[1]]
grp measure
1 a1 0.5582253
2 a. 0.3400904
3 a. -0.2200824
4 b -0.7287385
5 b -0.2128275
6 b 1.9030766
[[2]]
grp measure
1 as 1.6148772
2 as 0.1090853
3 as -1.3714180
4 b2 -0.1606979
5 a 1.1726395
6 a -0.3201150
[[3]]
grp measure
1 b- 1
2 b- 2
3 b- 3
4 bq 4
5 bq 5
6 a 6
To a list of 'fixed' data frames, as such:
[[1]]
grp measure
1 a -0.7671052
2 a 0.1781247
3 a -0.7565773
4 b -0.3606900
5 b 1.9264804
6 b 0.9506608
[[2]]
grp measure
1 a 1.45036125
2 a -2.16715639
3 a 0.80105611
4 b 0.24216723
5 a 1.33089426
6 a -0.08388404
[[3]]
grp measure
1 b 1
2 b 2
3 b 3
4 b 4
5 b 5
6 a 6
Any help would be very much appreciated
Here is an option using tidyverse with recode_factor. When there are multiple elements to be changed, create a list of key/val elements and use recode_factor to match and change the values to new levels
library(tidyverse)
keyval <- setNames(rep(good_values, lengths(bad_values)), unlist(bad_values))
out <- map(df_list, ~ .x %>%
mutate(grp = recode_factor(grp, !!! keyval)))
-output
out
#[[1]]
# grp measure
#1 a -1.63295876
#2 a 0.03859976
#3 a -0.46541610
#4 b -0.72356671
#5 b -1.11552841
#6 b 0.99352861
#....
#[[2]]
# grp measure
#1 a 1.26536789
#2 a -0.48189740
#3 a 0.23041056
#4 b -1.01324689
#5 a -1.41586086
#6 a 0.59026463
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#6 a 6
#....
NOTE: This doesn't change the class of the initial dataset column
str(out)
#List of 3
# $ :'data.frame': 10 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 2 2 2 2 2 2
# ..$ measure: num [1:10] -1.633 0.0386 -0.4654 -0.7236 -1.1155 ...
# $ :'data.frame': 26 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 1 1 1 2 1 1 1 1 1 1 ...
# ..$ measure: num [1:26] 1.265 -0.482 0.23 -1.013 -1.416 ...
# $ :'data.frame': 9 obs. of 2 variables:
# ..$ grp : Factor w/ 2 levels "a","b": 2 2 2 2 2 1 1 1 1
# ..$ measure: int [1:9] 1 2 3 4 5 6 7 8 9
Once we have a keyval pair list, it can also be used with base R functions
out1 <- lapply(df_list, transform, grp = unlist(keyval[grp]))
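A variation on the lookup idea, sketched under a few assumptions: the column is converted to character first, the lookup is a plain named character vector rather than a list, and values without a match are left untouched. The helper name `recode_grp` and the small two-frame `df_list` are made up for illustration.

```r
bad_values  <- list(c("a1", "a.", "as"), c("b2", "b-", "bq"))
good_values <- list("a", "b")

# Named vector: names are the bad values, elements are their replacements
lookup <- setNames(rep(unlist(good_values), lengths(bad_values)),
                   unlist(bad_values))

# Hypothetical helper: replace only the values that appear in the lookup
recode_grp <- function(x, lookup) {
  x <- as.character(x)               # avoid indexing by factor codes
  hit <- x %in% names(lookup)
  x[hit] <- unname(lookup[x[hit]])
  x
}

# Small illustrative inputs (not the data from the question)
df_list <- list(
  data.frame(grp = c("a1", "a.", "b"), measure = 1:3),
  data.frame(grp = c("as", "b2", "a"), measure = 4:6)
)

fixed <- lapply(df_list, function(d) {
  d$grp <- recode_grp(d$grp, lookup)
  d
})
fixed
```

Converting to character before indexing matters because a factor used as an index selects by its integer codes, not by its labels.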
Any reason mapping a case_when statement wouldn't work?
library(tidyverse)
df_list %>%
map(~ mutate_if(.x, is.factor, as.character)) %>% # convert factor to character
map(~ mutate(.x, grp = case_when(grp %in% bad_values[[1]] ~ good_values[[1]],
grp %in% bad_values[[2]] ~ good_values[[2]],
TRUE ~ grp)))
I could see it working for your reprex but possibly not the greater problem.
A base R option if you have a lot of good_values and bad_values and it is not possible to check each one individually.
lapply(df_list, function(x) {
vec = x[['grp']]
mapply(function(p, q) vec[vec %in% p] <<- q ,bad_values, good_values)
transform(x, grp = vec)
})
#[[1]]
# grp measure
#1 a -0.648146527
#2 a -0.004722549
#3 a -0.943451194
#4 b -0.709509396
#5 b -0.719434286
#....
#[[2]]
# grp measure
#1 a 1.03131291
#2 a -0.85558910
#3 a -0.05933911
#4 b 0.67812934
#5 a 3.23854093
#6 a 1.31688645
#7 a 1.87464048
#8 a 0.90100179
#....
#[[3]]
# grp measure
#1 b 1
#2 b 2
#3 b 3
#4 b 4
#5 b 5
#....
Here, for every list element, we extract its grp column, replace any bad_values found with the corresponding good_values, and return the corrected data frame.
Imagine a data set like this:
# creating data for test
set.seed(1839)
id <- as.character(1:10)
frequency <- sample(c("n", "r", "s", "o", "a"), 10, TRUE)
frequency_value <- sapply(
frequency, switch, "n" = -2, "r" = -1, "s" = 0, "o" = 1, "a" = 2
)
(test <- data.frame(id, frequency, frequency_value))
Which looks like:
id frequency frequency_value
1 1 a 2
2 2 o 1
3 3 r -1
4 4 o 1
5 5 o 1
6 6 s 0
7 7 n -2
8 8 n -2
9 9 r -1
10 10 n -2
The variable frequency has the response I'm interested in. It goes from never to rarely to sometimes to often to always. The labels are just the first letter of each of those words. The order is presented in frequency_value.
What I would like to do is make frequency a factor with levels in the order n, r, s, o, a. However, I want to make this dependent on the values in frequency_value. They should follow the order that is preserved in frequency_value and not be simply hard-coded (like one would do with factor(frequency, levels = c("n", "r", "s", "o", "a"))).
I have thought about using this, a tidyverse solution:
levels <- test[, c("frequency", "frequency_value")] %>%
unique() %>%
arrange(as.numeric(frequency_value)) %>%
pull(frequency) %>%
as.character()
test$frequency <- factor(test$frequency, levels)
But that seems computationally inefficient on big data sets where I want to convert more than one variable to a factor. Is there a more efficient solution?
Use order for unique combinations (what you were using) within with:
test$frequency <- factor(test$frequency,
with(unique(test[, -1]), frequency[order(frequency_value)]))
[1] a o r o o s n n r n
Levels: n r s o a
One option is to use just dplyr:
library(dplyr)
test <- test %>% arrange(frequency_value) %>%
mutate(frequency = factor(frequency, levels = unique(frequency)))
test
# id frequency frequency_value
# 1 7 n -2
# 2 8 n -2
# 3 10 n -2
# 4 3 r -1
# 5 9 r -1
# 6 6 s 0
# 7 2 o 1
# 8 4 o 1
# 9 5 o 1
# 10 1 a 2
str(test)
#'data.frame': 10 obs. of 3 variables:
# $ id : Factor w/ 10 levels "1","10","2","3",..: 8 9 2 4 10 7 3 5 6 1
# $ frequency : Factor w/ 5 levels "n","r","s","o",..: 1 1 1 2 2 3 4 4 4 5
# $ frequency_value: num -2 -2 -2 -1 -1 0 1 1 1 2
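If the row order of the data should stay as it is, the levels can be derived from the ordering column without sorting the data frame itself. The sketch below is base R only; the data is typed in by hand to mirror the printed example, and the variable name `lev` is just illustrative.

```r
test <- data.frame(
  id = as.character(1:10),
  frequency = c("a", "o", "r", "o", "o", "s", "n", "n", "r", "n"),
  frequency_value = c(2, 1, -1, 1, 1, 0, -2, -2, -1, -2),
  stringsAsFactors = FALSE
)

# Order the labels by their numeric value, then drop duplicates
lev <- unique(test$frequency[order(test$frequency_value)])

# Rows keep their original order; only the level order changes
test$frequency <- factor(test$frequency, levels = lev)
levels(test$frequency)
```

For users of forcats, `fct_reorder(frequency, frequency_value)` is a related option; it orders levels by a summary (the median by default) of the second variable.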
I have a very basic question, but I'm new to R so would appreciate any help.
I have a column (among others) in one dataset whose rows contain numeric codes.
In another dataset, I have two columns: one holds the same numeric codes, and the one next to it holds the corresponding names.
Is there a way in R to replace the numeric codes in the first dataset with the names, using the second dataset as a reference?
Many thanks for your help
Some sample data:
set.seed(42) # because I use `sample`, o/w unnecessary
df1 <- data.frame(n = sample(5, size = 10, replace = TRUE))
str(df1)
# 'data.frame': 10 obs. of 1 variable:
# $ n: int 5 5 2 5 4 3 4 1 4 4
df2 <- data.frame(n = 1:5, txt = LETTERS[5:9], stringsAsFactors = FALSE)
str(df2)
# 'data.frame': 5 obs. of 2 variables:
# $ n : int 1 2 3 4 5
# $ txt: chr "E" "F" "G" "H" ...
Base R use of merge:
merge(df1, df2, by = "n")
# n txt
# 1 1 E
# 2 2 F
# 3 3 G
# 4 4 H
# 5 4 H
# 6 4 H
# 7 4 H
# 8 5 I
# 9 5 I
# 10 5 I
Notice that the order of df1 is not preserved. We can use merge(..., sort = FALSE), but the order is "unspecified" (?merge).
Using dplyr::left_join:
library(dplyr)
df1 %>%
left_join(df2, by = "n")
# n txt
# 1 5 I
# 2 5 I
# 3 2 F
# 4 5 I
# 5 4 H
# 6 3 G
# 7 4 H
# 8 1 E
# 9 4 H
# 10 4 H
(Order is preserved.)
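Another base R possibility, sketched here as an alternative to a join, is a direct lookup with `match()`. It keeps the row order of `df1` and returns `NA` for codes that do not appear in `df2`. The values of `df1$n` are typed in from the `str()` output above rather than sampled.

```r
df1 <- data.frame(n = c(5L, 5L, 2L, 5L, 4L, 3L, 4L, 1L, 4L, 4L))
df2 <- data.frame(n = 1:5, txt = LETTERS[5:9], stringsAsFactors = FALSE)

# match() gives, for each code in df1, the row of df2 holding that code
df1$txt <- df2$txt[match(df1$n, df2$n)]
df1
```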
I have a large database, and I wanted to print out the number of elements in each factor level with this code:
summary(factor(Bond$Cf$ISIN, levels = Bond$ISIN), maxsum = 1000)
I got the output below. My question is: how does this order the values? At first glance I thought it was ordered by the number of values in each factor level, but then I saw NL0009054907. It has 7 elements, yet it appears among the levels with 4 elements.
DE000A1X3LN0 XS1002977103 XS1014670233 DE000CZ40JH0 XS1016720853 XS0973623514 DE000HSH4QN3 XS0997333223 DE000A13SWD8 DE000HSH4XP4
3 3 3 3 3 3 3 3 3 3
XS1033018158 XS1369254310 XS1041793123 XS1196748757 XS1043150462 DE000CZ40K31 XS0187339600 XS0413584672 XS1046237431 DE000CB83CE3
3 3 3 3 3 3 3 3 3 4
XS1050665386 XS1385935769 FR0010743070 FR0011233337 XS0418053152 XS0914402887 NL0009054907 XS0984200617 XS0993272862 XS0423989267
4 4 4 4 4 4 7 4 4 4
XS0296306078 DE000CZ40KW7 XS1070100257 XS0996755350 ES03136793B0 XS0432092137 XS0429192767 DE000HV2AKV6 XS1077629225 XS1078760813
4 4 4 4 4 4 4 4 4 4
DE000HSH28Z5 DE000HSH2893 XS1080952960 DE000DB7UQ89 XS1084838496 XS1236611684 DE000HSH41B9 DE000CZ291M8 DE000HSH3AM1 DE000HSH28J9
4 4 4 4 4 4 4 4 4 4
Compare:
x <- factor(rep(letters[1:3], 4:6))
summary(x)
# a b c
# 4 5 6
x <- factor(x, levels = letters[3:1])
summary(x)
# c b a
# 6 5 4
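If x is a factor in a data.frame and has more levels than maxsum (7 by default in summary.data.frame), summary.factor sorts by frequency and pools the remaining levels into (Other):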
x <- factor(rep(letters[1:10], 1:10), levels = letters[1:10])
summary(data.frame(x = x))
# x
# j :10
# i : 9
# h : 8
# g : 7
# f : 6
# e : 5
# (Other):10
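To make the point concrete: `summary()` on a factor reports counts in level order, so with `levels = Bond$ISIN` the output would follow the order of `Bond$ISIN`. If counts sorted by frequency are wanted instead, one option is to sort a `table()`. The small factor below is made up for illustration.

```r
# Levels are deliberately given in a non-alphabetical order
x <- factor(c("b", "a", "c", "a", "c", "c"), levels = c("c", "b", "a"))

# Counts follow the level order: c, b, a
summary(x)

# Counts sorted from most to least frequent
by_freq <- sort(table(x), decreasing = TRUE)
by_freq
```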
I'm using data.table to aggregate values, but I'm finding that when the "by" variable has a level not present in the aggregation, it is omitted, even if it is specified in the factor levels.
The code below generates a data.table with 6 rows, the last two of which only have one of the two possible levels for f2 nested within f1. During aggregation, the {3,1} combination is dropped.
set.seed(1987)
dt <- data.table(f1 = factor(rep(1:3, each = 2)),
f2 = factor(sample(1:2, 6, replace = TRUE)),
val = runif(6))
str(dt)
Classes ‘data.table’ and 'data.frame': 6 obs. of 3 variables:
$ f1 : Factor w/ 3 levels "1","2","3": 1 1 2 2 3 3
$ f2 : Factor w/ 2 levels "1","2": 1 2 2 1 2 2
$ val: num 0.383 0.233 0.597 0.346 0.606 ...
- attr(*, ".internal.selfref")=<externalptr>
dt
f1 f2 val
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 0.6058819
6: 3 2 0.7437177
dt[, sum(val), by = list(f1, f2)] # output is missing a row
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 2 1.3495996
# this is the output I'm looking for:
f1 f2 V1
1: 1 1 0.3829077
2: 1 2 0.2327311
3: 2 2 0.5965087
4: 2 1 0.3456710
5: 3 1 0.0000000 # <- the missing row from above
6: 3 2 1.3495996
Is there a way to achieve this behavior?
Why do you expect that data.table will compute sums for all combinations of f1 and f2?
If this is what you want, you should add the missing rows to the original data before computing the grouped sum. For example:
setkey(dt, f1, f2)
# omit "by = .EACHI" in data.table <= 1.9.2
dt[CJ(levels(f1), levels(f2)), sum(val, na.rm=T),
allow.cartesian = T, by = .EACHI]
## f1 f2 V1
## 1: 1 1 0.3829077
## 2: 1 2 0.2327311
## 3: 2 1 0.3456710
## 4: 2 2 0.5965087
## 5: 3 1 0.0000000 ## <- your "missing row" :)
## 6: 3 2 1.3495996
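Outside data.table, a similar result can be sketched in base R with `xtabs()`, which sums over every combination of factor levels and fills unobserved combinations with 0. The data below is a made-up stand-in with the same structure as `dt`, and the rows come out ordered by `f2` first rather than as in the output above.

```r
d <- data.frame(
  f1  = factor(c(1, 1, 2, 2, 3, 3)),
  f2  = factor(c(1, 2, 2, 1, 2, 2)),
  val = c(0.5, 0.25, 1, 2, 0.75, 0.25)
)

# Sum val over all f1 x f2 level combinations, including empty ones
res <- as.data.frame(xtabs(val ~ f1 + f2, data = d), responseName = "V1")
res
```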