I have a simple data table, apple, with numerous values abbreviated as 40.08B, 40.08M, 400.08K, etc. I need to remove these letters and replace them with the appropriate number of zeros (e.g. 400.08K becomes 400080), so I wrote the following code:
apple2 <- dplyr::case_when(
  stringr::str_detect(apple[,-1], 'B') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e9,
  stringr::str_detect(apple[,-1], 'M') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e6,
  stringr::str_detect(apple[,-1], 'K') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e3,
  TRUE ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA"), trim_ws = TRUE)
)
The code works as expected in finding and converting the strings into appropriate numbers, but it only runs on the first row of the data table. In addition, it removes the headers. The warning message, repeated once per case_when branch, is the following:
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing
I've tried figuring this out for hours but to no avail - what am I doing wrong here? Thank you!
You are using case_when in a somewhat unorthodox way:
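The root problem: each case_when() condition must be an atomic vector, but apple[,-1] is a data frame, so str_detect() coerces it (once per branch, hence the repeated warning) and the coerced result no longer corresponds to your rows. A minimal demonstration with hypothetical data (exact behaviour may vary by stringr version):
## a data frame is not an atomic vector, so it gets coerced:
stringr::str_detect(data.frame(x = c("40B", "2M"), y = c("1K", "3B")), 'B')
## Warning: argument is not an atomic vector; coercing
The fix is to apply the recoding one column at a time, e.g. with mutate(across(...)):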
## some data:
d <- cbind.data.frame(
  id = LETTERS,
  matrix(
    paste0(
      ceiling(runif(26 * 5, max = 999)),
      sample(c("", "B", "K", "M"), size = 26 * 5, replace = TRUE)
    ),
    nrow = 26
  )
)
library(dplyr)
library(stringr)
library(readr)
d %>% mutate(across(
  -1,
  ~ case_when(
    str_detect(., 'B') ~ parse_number(as.character(.), na = c("", "NA")) * 1e9,
    str_detect(., 'M') ~ parse_number(as.character(.), na = c("", "NA")) * 1e6,
    str_detect(., 'K') ~ parse_number(as.character(.), na = c("", "NA")) * 1e3,
    TRUE ~ parse_number(as.character(.), na = c("", "NA"), trim_ws = TRUE)
  )
))
Input data:
id 1 2 3 4 5
1 A 834 27B 250 881B 988
2 B 313M 506B 309 413 141K
3 C 197 77 824 161B 43K
4 D 845K 172K 745B 922M 145M
5 E 168M 959M 990B 250K 893
6 F 430 687K 368M 10M 824M
7 G 940B 403B 655M 818 777K
8 H 281 833K 86B 849B 16K
9 I 485B 508B 349M 643M 926M
10 J 235B 10B 206M 505K 347M
11 K 897B 727M 405K 987B 674M
12 L 588B 40M 860M 58 934B
13 M 727K 375 188M 728K 201B
14 N 280K 442M 43K 400 445
15 O 988B 388M 530B 702M 240B
16 P 177M 782 410K 254K 758K
17 Q 706K 262 520B 104K 34
18 R 390B 99K 677K 965 635M
19 S 819 115M 920M 580M 295K
20 T 573M 901K 360 7K 88B
21 U 333B 593M 504B 992 241B
22 V 674 192M 841B 644B 659
23 W 524M 581M 692M 41 133
24 X 626K 686M 712K 756M 136B
25 Y 295 468 932M 486B 35K
26 Z 526K 798K 229K 958B 700B
Output:
id 1 2 3 4 5
1 A 8.34e+02 2.70e+10 2.50e+02 8.81e+11 9.88e+02
2 B 3.13e+08 5.06e+11 3.09e+02 4.13e+02 1.41e+05
3 C 1.97e+02 7.70e+01 8.24e+02 1.61e+11 4.30e+04
4 D 8.45e+05 1.72e+05 7.45e+11 9.22e+08 1.45e+08
5 E 1.68e+08 9.59e+08 9.90e+11 2.50e+05 8.93e+02
6 F 4.30e+02 6.87e+05 3.68e+08 1.00e+07 8.24e+08
7 G 9.40e+11 4.03e+11 6.55e+08 8.18e+02 7.77e+05
8 H 2.81e+02 8.33e+05 8.60e+10 8.49e+11 1.60e+04
9 I 4.85e+11 5.08e+11 3.49e+08 6.43e+08 9.26e+08
10 J 2.35e+11 1.00e+10 2.06e+08 5.05e+05 3.47e+08
11 K 8.97e+11 7.27e+08 4.05e+05 9.87e+11 6.74e+08
12 L 5.88e+11 4.00e+07 8.60e+08 5.80e+01 9.34e+11
13 M 7.27e+05 3.75e+02 1.88e+08 7.28e+05 2.01e+11
14 N 2.80e+05 4.42e+08 4.30e+04 4.00e+02 4.45e+02
15 O 9.88e+11 3.88e+08 5.30e+11 7.02e+08 2.40e+11
16 P 1.77e+08 7.82e+02 4.10e+05 2.54e+05 7.58e+05
17 Q 7.06e+05 2.62e+02 5.20e+11 1.04e+05 3.40e+01
18 R 3.90e+11 9.90e+04 6.77e+05 9.65e+02 6.35e+08
19 S 8.19e+02 1.15e+08 9.20e+08 5.80e+08 2.95e+05
20 T 5.73e+08 9.01e+05 3.60e+02 7.00e+03 8.80e+10
21 U 3.33e+11 5.93e+08 5.04e+11 9.92e+02 2.41e+11
22 V 6.74e+02 1.92e+08 8.41e+11 6.44e+11 6.59e+02
23 W 5.24e+08 5.81e+08 6.92e+08 4.10e+01 1.33e+02
24 X 6.26e+05 6.86e+08 7.12e+05 7.56e+08 1.36e+11
25 Y 2.95e+02 4.68e+02 9.32e+08 4.86e+11 3.50e+04
26 Z 5.26e+05 7.98e+05 2.29e+05 9.58e+11 7.00e+11
See also other ways to convert such human-readable abbreviated numbers to numeric values, e.g. in the related questions on this topic.
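For instance, a small base-R lookup sketch (my own illustration, assuming only B/M/K suffixes as in the data above):
## a minimal base-R sketch, assuming only B/M/K suffixes
human_to_num <- function(x) {
  mult <- c(B = 1e9, M = 1e6, K = 1e3)
  suffix <- sub("^[0-9.]+", "", x)            # "" when there is no suffix
  num <- as.numeric(sub("[BMK]$", "", x))     # numeric part
  num * ifelse(suffix == "", 1, mult[suffix]) # scale by the suffix factor
}
human_to_num(c("40.08B", "400.08K", "988"))
#[1] 4.0080e+10 4.0008e+05 9.8800e+02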
We could make use of str_replace_all instead of multiple str_detect calls: match and replace the 'B', 'M', 'K' substrings in the column with a named vector in str_replace_all, then separate the column, and do the multiplication based on the separated columns.
library(stringr)
library(dplyr)
library(tidyr)
apple %>%
mutate(col1 = str_replace_all(col1, setNames(c(' 1e9', ' 1e6', ' 1e3'),
c('B', 'M', 'K')))) %>%
separate(col1, into = c('col1', 'col2'), convert = TRUE) %>%
transmute(col1 = col1 * col2)
-output
# col1
#1 4.0e+10
#2 2.0e+08
#3 2.0e+06
#4 4.0e+05
#5 3.6e+10
data
apple <- structure(list(col1 = c("40B", "200M", "2M", "400K", "36B")),
                   class = "data.frame", row.names = c(NA, -5L))
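One caveat worth noting (my addition, not part of the original answer): separate()'s default separator splits on any non-alphanumeric character, so values with decimal points such as "400.08K" would also be split at the dot. An explicit sep avoids that; a sketch assuming the same named-vector trick:
## hedged variant handling decimals and suffix-less values
apple %>%
  mutate(col1 = str_replace_all(col1, c(B = ' 1e9', M = ' 1e6', K = ' 1e3'))) %>%
  separate(col1, into = c('num', 'mult'), sep = ' ', fill = 'right', convert = TRUE) %>%
  transmute(col1 = num * coalesce(mult, 1))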
Related
I am trying to produce multiple frequency tables that are stratified by multiple independent variables. I can get this to work for one variable and one stratification variable, but my for-loop is broken.
library(tidyverse)
# Create example dataframe of survey data
df <- data.frame(
var1 = sample(1:7, 1000, replace = TRUE),
var2 = sample(1:7, 1000, replace = TRUE),
var3 = sample(1:7, 1000, replace = TRUE),
var4 = sample(1:7, 1000, replace = TRUE),
var5 = sample(1:7, 1000, replace = TRUE),
var6 = sample(1:7, 1000, replace = TRUE),
strat1 = sample(c("A", "B", "C"), 1000, replace = TRUE),
strat2 = sample(c("X", "Y"), 1000, replace = TRUE),
strat3 = sample(c("True", "False"), 1000, replace = TRUE)
)
Example that works for one variable and one stratification variable. I want to convert this code into a for loop:
temp_df <- df %>% count(var1)
temp_df$percent <- temp_df$n / sum(temp_df$n) * 100
strat_df <- temp_df %>%
left_join((df %>% group_by(var1, strat1) %>% count(var1) %>% pivot_wider(names_from = strat1, values_from = n)), by = "var1")
for(k in c("A","B","C")){
strat_df[paste0(k, "_pct")] <- (strat_df[[k]] / temp_df$n) * 100
}
I want this same sort of output, but with added columns for count and _pct of the other two stratification variables.
I've tried using the following for loop, but it only gives me one row per variable, and it only produces two columns for each strat variable. The output I'm looking for would instead have a raw-count column and a column-percentage column for each category within a stratification variable. Since there are 3 strat vars, two having two categories and one having three, my desired output would have 17 columns, including the columns for "v#", "n", and "percent".
# Create a list of the variables of interest
variables <- c("var1", "var2", "var3", "var4", "var5", "var6")
# Create a list of the stratification variables
strats <- c("strat1", "strat2", "strat3")
# Create a loop that runs through each variable
for(i in variables){
  # Create a frequency table for the current variable
  temp_df <- df %>% count(!!i)
  # Add a column for the percent of responses within each response category
  temp_df$percent <- temp_df$n / sum(temp_df$n) * 100
  # Add a column for the raw count for each category of the stratification variables
  for(j in strats){
    temp_df <- temp_df %>% group_by(!!i) %>% mutate(!!j := n())
  }
  # Add a column for the percent of the stratification variable category within the response category
  for(j in strats){
    temp_df[paste0(j, "_pct")] <- (temp_df[[j]] / temp_df$n) * 100
  }
  assign(paste0(i, "_df"), temp_df)
}
This is what I would like my output to look like (in the shape of the var1_df table shown in the output further below):
UPDATE:
Came up with a solution that outputs what I need:
for(i in variables){
  j = sym(i)
  temp_df <- df %>% count(!!j)
  temp_df$percent <- temp_df$n / sum(temp_df$n) * 100
  strat_df <- temp_df %>%
    left_join((df %>% group_by(!!j, strat1) %>% count(!!j) %>% pivot_wider(names_from = strat1, values_from = n)), by = i) %>%
    left_join((df %>% group_by(!!j, strat2) %>% count(!!j) %>% pivot_wider(names_from = strat2, values_from = n)), by = i) %>%
    left_join((df %>% group_by(!!j, strat3) %>% count(!!j) %>% pivot_wider(names_from = strat3, values_from = n)), by = i)
  for(k in c("A", "B", "C", "X", "Y", "True", "False")){
    strat_df[paste0(k, "_pct")] <- (strat_df[[k]] / temp_df$n) * 100
  }
  assign(paste0(i, "_df"), strat_df)
}
Either convert to symbol and evaluate (!!), or use across, as the variables looped over are strings.
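A quick illustration of the failure mode (my own demo): unquoting a string injects a string constant, not a column reference, so count() groups by one literal value and collapses everything into a single row.
## i is a string here, so !!i inserts the literal "var1":
## df %>% count(!!"var1")   # one row with n = 1000, grouped by the constant
With across(all_of(i)) (or count(!!sym(i))), the string is resolved to the actual column: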
for(i in variables){
  # Create a frequency table for the current variable
  temp_df <- df %>% count(across(all_of(i)))
  # Add a column for the percent of responses within each response category
  temp_df$percent <- temp_df$n / sum(temp_df$n) * 100
  # Add a column for the raw count for each category of the stratification variables
  strat_df <- temp_df %>%
    left_join((df %>% group_by(across(all_of(c(i, "strat1")))) %>%
                 count(across(all_of(i))) %>%
                 pivot_wider(names_from = strat1, values_from = n)), by = i) %>%
    left_join((df %>% group_by(across(all_of(c(i, "strat2")))) %>%
                 count(across(all_of(i))) %>%
                 pivot_wider(names_from = strat2, values_from = n)), by = i) %>%
    left_join((df %>% group_by(across(all_of(c(i, "strat3")))) %>%
                 count(across(all_of(i))) %>%
                 pivot_wider(names_from = strat3, values_from = n)), by = i)
  # Add a column for the percent of the stratification variable category within the response category
  for(j in c("A", "B", "C", "X", "Y", "True", "False")){
    strat_df[paste0(j, "_pct")] <- (strat_df[[j]] / temp_df$n) * 100
  }
  assign(paste0(i, "_df"), strat_df)
}
-output
> var1_df
var1 n percent A B C X Y False True A_pct B_pct C_pct X_pct Y_pct True_pct False_pct
1 1 121 12.1 36 42 43 59 62 63 58 29.75207 34.71074 35.53719 48.76033 51.23967 47.93388 52.06612
2 2 144 14.4 51 42 51 84 60 69 75 35.41667 29.16667 35.41667 58.33333 41.66667 52.08333 47.91667
3 3 147 14.7 41 39 67 60 87 73 74 27.89116 26.53061 45.57823 40.81633 59.18367 50.34014 49.65986
4 4 146 14.6 52 45 49 74 72 79 67 35.61644 30.82192 33.56164 50.68493 49.31507 45.89041 54.10959
5 5 165 16.5 51 57 57 86 79 76 89 30.90909 34.54545 34.54545 52.12121 47.87879 53.93939 46.06061
6 6 133 13.3 48 51 34 64 69 68 65 36.09023 38.34586 25.56391 48.12030 51.87970 48.87218 51.12782
7 7 144 14.4 53 44 47 67 77 73 71 36.80556 30.55556 32.63889 46.52778 53.47222 49.30556 50.69444
> var2_df
var2 n percent A B C X Y False True A_pct B_pct C_pct X_pct Y_pct True_pct False_pct
1 1 152 15.2 51 53 48 79 73 70 82 33.55263 34.86842 31.57895 51.97368 48.02632 53.94737 46.05263
2 2 147 14.7 49 46 52 73 74 55 92 33.33333 31.29252 35.37415 49.65986 50.34014 62.58503 37.41497
3 3 142 14.2 46 45 51 72 70 79 63 32.39437 31.69014 35.91549 50.70423 49.29577 44.36620 55.63380
4 4 147 14.7 50 48 49 74 73 72 75 34.01361 32.65306 33.33333 50.34014 49.65986 51.02041 48.97959
5 5 128 12.8 45 43 40 59 69 72 56 35.15625 33.59375 31.25000 46.09375 53.90625 43.75000 56.25000
6 6 152 15.2 37 52 63 74 78 83 69 24.34211 34.21053 41.44737 48.68421 51.31579 45.39474 54.60526
7 7 132 13.2 54 33 45 63 69 70 62 40.90909 25.00000 34.09091 47.72727 52.27273 46.96970 53.03030
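As a side note (my addition, not part of the original answers): collecting the per-variable tables in a named list avoids assign() and keeps the results easy to iterate over later.
## a sketch: build the frequency tables into a named list instead of
## creating var1_df, var2_df, ... with assign()
result_list <- lapply(setNames(variables, variables), function(i) {
  temp_df <- df %>% count(across(all_of(i)))
  temp_df$percent <- temp_df$n / sum(temp_df$n) * 100
  temp_df
})
result_list$var1   # instead of var1_df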
I created a dummy data table called DT, and I am trying to calculate the sum of Capacity (numerical) and count the frequency of Code and State (categorical) within each ID. For the end result, I want to display the sum of Capacity, the frequency of A, B, C, ..., and of the different States within each unique ID. Therefore, the column names will be ID, total.Cap, A, B, C, ..., AZ, CA, ...
DT <- data.table(ID = rep(1:500,100),
Capacity = sample(1:1000, size = 50000, replace =T),
Code = sample(LETTERS[1:26], 50000, replace = T),
State = rep(c("AZ","CA","PA","NY","WA","SD"), 50000))
The format of the result will look like the table below:
ID total.Cap A B C ... AZ CA ...
1 28123 10 25 70 ... 29 ...
2 32182 20 42 50 ... 30 ...
3
I have tried to use ddply, melt, and dcast, but the result does not come out as I expected. Could anyone give me some hints about how to structure a table like this? Thank you!
You can do this by constructing the totals, state counts, and code counts with three separate data.table statements then joining them. On states and codes, you can use dcast to turn it into one column per state/code with the counts within each.
library(data.table)
totals <- DT[, list(total.Cap = sum(Capacity)), by = "ID"]
# with no value.var, dcast falls back to counting rows (fun.aggregate = length)
states <- dcast(DT, ID ~ State)
codes <- dcast(DT, ID ~ Code)
You can then join the three tables together:
result <- setkey(totals, "ID")[states, ][codes, ]
This results in a table something like:
ID total.Cap AZ CA NY PA SD WA A B C D E F G H I J K L M N O P Q R S T U
1: 1 287526 200 0 0 200 0 200 12 18 24 42 12 30 30 18 6 36 24 6 18 24 30 24 6 24 36 18 30
2: 2 293838 0 200 200 0 200 0 18 24 42 30 30 12 24 6 24 12 48 42 18 18 42 24 24 24 12 18 24
3: 3 279450 200 0 0 200 0 200 24 18 24 6 12 12 18 12 12 30 24 18 54 30 6 42 18 30 24 24 18
4: 4 298200 0 200 200 0 200 0 30 30 36 30 36 24 24 18 24 18 30 30 30 24 6 30 18 6 18 18 18
5: 5 294084 200 0 0 200 0 200 18 6 24 12 42 12 18 42 18 18 18 18 24 24 30 18 30 24 6 30 24
Note that if you have many columns like State and Code, you can do all of them at once by melting them first:
# replace State and Code with the categorical variables you want
melted <- melt(DT, measure.vars = c("State", "Code"))
state_codes <- dcast(melted, ID ~ value)
setkey(totals, "ID")[state_codes, ]
Note you still need to join with the totals, and that this will not preserve the order of columns like "states then codes" or vice versa.
This creates the total.Cap, Code, and State summary elements in three separate data tables then merges them by ID:
# Storing intermediate pieces
total_cap <- DT[, j = list(total.Cap = sum(Capacity)), by = ID]
code <- dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0)
state <- dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0)
mytable <- merge(total_cap, code, by = "ID")
mytable <- merge(mytable, state, by = "ID")
mytable
# As a one-liner
mytable <- merge(
merge(DT[, j = list(total.Cap = sum(Capacity)), by = ID],
dcast(DT[, .N, by = c("ID", "Code")], ID ~ Code, fill = 0),
by = "ID"),
dcast(DT[, .N, by = c("ID", "State")], ID ~ State, fill = 0),
by = "ID")
mytable
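A more compact variant (my own sketch, assuming data.table >= 1.9.6 for on= joins): count both categorical columns with one melt/dcast, then attach the capacity totals by reference.
## one melt/dcast for both Code and State, totals joined by reference
counts <- dcast(melt(DT, id.vars = "ID", measure.vars = c("Code", "State")),
                ID ~ value, fun.aggregate = length, value.var = "value")
counts[DT[, .(total.Cap = sum(Capacity)), by = ID], total.Cap := i.total.Cap, on = "ID"]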
How can I split a column containing multiple delimiters into separate columns in a data frame?
df1 <- read.table(text = " Chr Nm1 Nm2 Nm3
chr10_100064111-100064134+Nfif 20 20 20
chr10_100064115-100064138-Kitl 30 19 40
chr10_100076865-100076888+Tert 60 440 18
chr10_100079974-100079997-Itg 50 11 23
chr10_100466221-100466244+Tmtc3 55 24 53", header = TRUE)
The desired output:
Chr gene Nm1 Nm2 Nm3
chr10_100064111-100064134 Nfif 20 20 20
chr10_100064115-100064138 Kitl 30 19 40
chr10_100076865-100076888 Tert 60 440 18
chr10_100079974-100079997 Itg 50 11 23
chr10_100466221-100466244 Tmtc3 55 24 53
I used:
library(stringr)
df2 <- str_split_fixed(df1$name, "\\+", 2)
I would like to know how to include both the + and - delimiters.
If you're trying to split one column into multiple, tidyr::separate is handy:
library(tidyr)
df1 %>% separate(Chr, into = paste0('Chr', 1:3), sep = '[+-]')
# Chr1 Chr2 Chr3 Nm1 Nm2 Nm3
# 1 chr10_100064111 100064134 Nfif 20 20 20
# 2 chr10_100064115 100064138 Kitl 30 19 40
# 3 chr10_100076865 100076888 Tert 60 440 18
# 4 chr10_100079974 100079997 Itg 50 11 23
# 5 chr10_100466221 100466244 Tmtc3 55 24 53
This should work:
str_split_fixed(df1$Chr, "[-+]", 3)
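A quick check (my addition, reusing df1 from above); the first two pieces can be pasted back together if the position range should stay intact:
pieces <- str_split_fixed(df1$Chr, "[-+]", 3)
data.frame(Chr = paste(pieces[, 1], pieces[, 2], sep = "-"),
           gene = pieces[, 3],
           df1[, -1])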
Here is a way to do this in base R with strsplit:
# split Chr into a list
tempList <- strsplit(as.character(df1$Chr), split = "[+-]")
# replace Chr with desired values
df1$Chr <- sapply(tempList, function(i) paste(i[[1]], i[[2]], sep = "-"))
# get gene variable
df1$gene <- sapply(tempList, "[[", 3)
I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row of the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
  age = c(exp$age1[i], exp$age2[i])
  h = c(exp$h[i], exp$h2[i])
  model = lm(age ~ h)
  exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
It works well but takes some time with very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try rowwise() and do(). Inside the do(), we concatenate (c) 'age1' and 'age2' to create 'age'; likewise we create 'h', apply lm, and extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table, i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table) # v1.9.5+
modelh <- melt(setDT(exp, keep.rownames = TRUE),
               measure = patterns('^age', '^h'),
               value.name = c('age', 'h'))[, {
                 model <- lm(age ~ h)
                 coef(model)[1] + 100 * coef(model)[2]
               }, rn]$V1
exp[, modelh := modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis, as you mentioned "it's an example of a bigger problem": if you are really interested in building models rowwise, then you'll create more and more columns as your age and h observations increase. If you get N observations, you'll have to use 2×N columns for those two variables alone.
I'd suggest using a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create the formulas, and lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply call, the expression as.formula(x) is what converts the formula strings created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
big.exp$modelh <- unlist(lst)},
akdplyr = {big.exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )}
,times=5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its 5-digit rounding may be an issue. Improvements may be possible; perhaps using apply after converting to a matrix would increase speed and efficiency.
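A sketch of that idea (my own addition): since each row contributes exactly two points, the fit is exact, so the coefficients can be computed fully vectorised without calling lm() at all.
## a vectorised sketch, assuming exactly two observations per row:
## the line through two points is exact, so slope and intercept
## follow directly from the differences
slope <- (exp$age2 - exp$age1) / (exp$h2 - exp$h)
intercept <- exp$age1 - slope * exp$h
exp$modelh_vec <- intercept + 100 * slope
all.equal(exp$modelh, exp$modelh_vec)
#[1] TRUE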
In previous versions of R I could combine factor levels that didn't have a "significant" threshold of volume using the following little function:
whittle = function(data, cutoff_val){
  # convert to a data frame
  tab = as.data.frame.table(table(data))
  # returns vector of indices where value is below cutoff_val
  idx = which(tab$Freq < cutoff_val)
  levels(data)[idx] = "Other"
  return(data)
}
This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:
> sort(table(data$State))
05 27 35 40 54 84 9 AP AU BE BI DI G GP GU GZ HN HR JA JM KE KU L LD LI MH NA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
OU P PL RM SR TB TP TW U VD VI VS WS X ZH 47 BL BS DL M MB NB RP TU 11 DU KA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
BW ND NS WY AK SD 13 QC 01 BC MT AB HE ID J NO LN NM ON NE VT UT IA MS AO AR ME
4 4 4 4 5 5 6 6 7 7 7 8 8 8 9 10 11 17 23 26 26 30 31 31 38 40 44
OR KS HI NV WI OK KY IN WV AL CO WA MN NH MO SC LA TN AZ IL NC MI GA OH ** CT DE
45 47 48 57 57 64 106 108 112 113 120 125 131 131 135 138 198 200 233 492 511 579 645 646 840 873 1432
RI DC TX MA FL VA MD CA NJ PA NY
1782 2513 6992 7027 10527 11016 11836 12221 15485 16359 34045
Now when I use whittle it returns me the following message:
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other", :
duplicated levels in factors are deprecated
How can I modify my function so that it has the same effect but doesn't create these "deprecated" duplicated factor levels? By converting to character, tabling, and then converting to the character "Other"?
I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table approach and using replace to swap in the low-frequency levels:
whittle <- function(data, cutoff_val) {
  tab = as.data.frame.table(table(data))
  factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
Testing on some sample data:
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD MD MD Other Other
# Levels: MD Other
I think this version should work. The levels<- function allows you to collapse by assigning a list (see ?levels).
whittle <- function(data, cutoff_val){
  tab <- table(data)
  shouldmerge <- tab < cutoff_val
  tokeep <- names(tab)[!shouldmerge]
  tomerge <- names(tab)[shouldmerge]
  nv <- c(as.list(setNames(tokeep, tokeep)), list("Other" = tomerge))
  levels(data) <- nv
  return(data)
}
And we test it with
set.seed(15)
x<-factor(c(sample(letters[1:10], 100, replace=T), sample(letters[11:13], 10, replace=T)))
table(x)
# x
# a b c d e f g h i j k l m
# 5 11 8 8 7 5 13 14 14 15 2 3 5
y <- whittle(x, 9)
table(y)
# y
# b g h i j Other
# 11 13 14 14 15 43
It's worth adding to this answer that the new forcats package contains the fct_lump() function which is dedicated to this.
Using @MrFlick's data:
x <- factor(c(sample(letters[1:10], 100, replace=T),
sample(letters[11:13], 10, replace=T)))
library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table
# b g h i j Other
#11 13 14 14 15 43
The n argument specifies the number of most common values to preserve.
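If you want to match the original count cutoff exactly rather than keep a fixed number of levels, fct_lump_min() (added in newer forcats versions, 0.4.0 if I recall correctly) lumps levels appearing fewer than min times:
## assuming a forcats version that provides fct_lump_min()
fct_lump_min(x, min = 9) %>% table
# b  g  h  i  j Other
#11 13 14 14 15    43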
Here's another way of doing it by replacing all the items below the threshold with the first and then renaming that level to Other.
whittle <- function(x, thresh) {
  belowThresh <- names(which(table(x) < thresh))
  x[x %in% belowThresh] <- belowThresh[1]
  levels(x)[levels(x) == belowThresh[1]] <- "Other"
  factor(x)
}
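A quick check with the same sample data as the earlier answers (my addition) gives matching counts, though the level order may differ:
table(whittle(x, 9))
# same counts as above: b = 11, g = 13, h = 14, i = 14, j = 15, Other = 43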