Paste name of new columns when summarizing data.table [duplicate] - r

How can I summarize a data.table while creating a new column whose name comes from a character string?
Reproducible example:
library(data.table)
dt <- data.table(x = rep(c("a", "b"), 20), y = factor(sample(letters, 40, replace = TRUE)), z = rep(1:20, 2))
i <- 15
new_var <- paste0("new_",i)
# my attempt
dt[, .( eval(new_var) = sum( z[which( z <= i)] )), by= x]
# expected result
dt[, .( new_15 = sum( z[which( z <= i)] )), by= x]
> x new_15
> 1: a 128
> 2: b 112
This approach using eval() works fine for creating a new column with := (see this SO question), but I don't know why it does not work when summarizing a data.table.

One option is setNames
dt[, setNames(.(sum( z[which( z <= i)])), new_var) , by= x]
# x new_15
#1: a 128
#2: b 112
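A minimal sketch of why the attempt fails and two ways around it. Inside .(), the left-hand side of = is an argument name, so R rejects eval(new_var) = ... when parsing the call. The name cutoff is illustrative (it stands in for i from the question), and the unused y column is dropped.

```r
library(data.table)

dt <- data.table(x = rep(c("a", "b"), 20), z = rep(1:20, 2))
cutoff <- 15                          # illustrative name, plays the role of `i`
new_var <- paste0("new_", cutoff)     # "new_15"

# Option 1: return a list from j and name it with setNames()
res1 <- dt[, setNames(list(sum(z[z <= cutoff])), new_var), by = x]

# Option 2: aggregate under a placeholder name, then rename by reference
res2 <- dt[, .(tmp = sum(z[z <= cutoff])), by = x]
setnames(res2, "tmp", new_var)

print(res1)
```

Both results have the columns x and new_15. Option 2 may read more easily when several computed columns need dynamic names.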

Related

How to split a column in multiple columns using data.table [duplicate]

I have a quite simple question regarding data.table dt, but as always, this package brings me to the brink of despair :D
I have a column called name with contents such as "bin1_position1_ID1", and I want to split this information into separate columns:
name                      bins    positions    IDs
--------------------      ------------------------
"bin1_position1_ID1"  ->  "bin1"  "position1"  "ID1"
"bin2_position2_ID2"      "bin2"  "position2"  "ID2"
I tried it with
dt <- dt[, bins := lapply(.SD, function(x) strsplit(x, "_")[[1]][1]), .SDcols="name"]
(and for the other new columns with [[1]][2] [[1]][3])
However, I end up with a new column bins (so far, so good), but it contains the value from row 1 in every row (i.e. bin1 everywhere) instead of the value from each row's own name.
Some entries also contain more pieces than I want to turn into columns, e.g. "bin5_position5_ID5_another5_more5".
Code for testing (see Maël's solution):
library(data.table)
name <- c("bin1_position1_ID1",
"bin2_position2_ID2",
"bin3_position3_ID3",
"bin4_position4_ID4",
"bin5_position5_ID5_another5_more5")
dt <- data.table(name)
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
Use tstrsplit with keep = 1:3 to keep only the first three pieces of each string:
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
name bin position ID
1: bin1_position1_ID1 bin1 position1 ID1
2: bin2_position2_ID2 bin2 position2 ID2
3: bin3_position3_ID3 bin3 position3 ID3
4: bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
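A short sketch of why the original attempt repeated bin1 in every row, under the assumption that the data look like the test code above. strsplit() returns one list element per row, so [[1]][1] picks the first piece of the first row only, and := then recycles that single value.

```r
library(data.table)

dt <- data.table(name = c("bin1_position1_ID1",
                          "bin2_position2_ID2",
                          "bin5_position5_ID5_another5_more5"))

parts <- strsplit(dt$name, "_", fixed = TRUE)   # list with one element per row

first_row_only <- parts[[1]][1]                 # first piece of row 1 only
per_row <- vapply(parts, function(p) p[1], character(1))  # first piece of every row

# tstrsplit() transposes the list, so each piece becomes a column-length vector
dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
```

Here first_row_only is the single string "bin1", while per_row has one entry per row, which is what tstrsplit() produces for each kept position.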

Why do I have to use `[[1]]` with data.table everywhere? [duplicate]

After indexing a DT column with a variable name, the result is returned as a data.table (class data.table, data.frame), so the column is not directly usable as a vector and I have to unlist it first. Am I doing everything as intended?
Consider this example:
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
col.name <- 'a'
diff(DT[, col.name]) #column name not found error
diff(DT[, col.name, with=FALSE]) #null data table
diff(DT[, col.name, with=FALSE][[1]]) #works
The second example is what the question is about.
You have many options to retrieve single columns. In my opinion, the most readable option is .SD, though it is not the fastest. It is also often desirable that single-column data.tables are not converted to vectors.
require(data.table)
DT <- data.table(a=seq(1.001, 10.999, length=100), b=factor(c(rep('a', 55), rep('b', 45))))
col.name <- 'a'
DT[ , get(col.name) ] # vector
DT[[ col.name ]] # vector
DT[ , col.name, with = FALSE ] # data.table
DT[ , .SD, .SDcols = col.name ] # data.table
Two straightforward ways to get just the vector and not the data.table, given your inputs:
DT[, get(col.name)]
DT[[col.name]] # as per comment by ismirsehregal
And some that are just as convoluted as DT[, col.name, with=FALSE][[1]]:
with(DT, eval(parse(text=col.name)))
DT[, ..col.name][[1]] # as per comment by ismirsehregal
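A small sketch that checks what each form returns, using the same DT and col.name as in the question. The variable names as_vector, via_get, as_table and via_sd are illustrative only.

```r
library(data.table)

DT <- data.table(a = seq(1.001, 10.999, length.out = 100),
                 b = factor(c(rep("a", 55), rep("b", 45))))
col.name <- "a"

# These return a plain numeric vector
as_vector <- DT[[col.name]]
via_get   <- DT[, get(col.name)]

# These return a one-column data.table
as_table <- DT[, col.name, with = FALSE]
via_sd   <- DT[, .SD, .SDcols = col.name]

# diff() needs the vector form
steps <- diff(as_vector)
```

If a function such as diff() expects a vector, DT[[col.name]] avoids the extra [[1]] step altogether.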

Combining rows if text is same before certain character [duplicate]

I have a data frame similar to this. I want to sum up values for rows if the text in column "Name" is the same before the - sign.
Remove everything after "-" using sub(), then sum by group (see "How to sum a variable by group"):
df$Name <- sub("-.*", "",df$Name)
aggregate(cbind(val1, val2)~Name, df, sum)
Below is a data.table solution.
Data:
library(data.table)
df = data.table(
Name = c('IRON - A', 'IRON - B', 'SABBATH - A', 'SABBATH - B'),
val1 = c(1,2,3,4),
val2 = c(5,6,7,8)
)
Code:
df[, Name := sub("-.*", "", Name)]
mat = df[, .(sum(val1), sum(val2)), by = Name]
> mat
Name V1 V2
1: IRON 3 11
2: SABBATH 7 15
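A hedged variant of the same idea that keeps the original column names instead of V1 and V2. It assumes the same example data; the pattern "\\s*-.*" is a small change of mine that also strips the space before the dash, so the group labels carry no trailing whitespace.

```r
library(data.table)

df <- data.table(
  Name = c("IRON - A", "IRON - B", "SABBATH - A", "SABBATH - B"),
  val1 = c(1, 2, 3, 4),
  val2 = c(5, 6, 7, 8)
)

# Drop optional whitespace, the dash, and everything after it
df[, Name := sub("\\s*-.*", "", Name)]

# Sum every value column by group; .SD keeps the column names
res <- df[, lapply(.SD, sum), by = Name, .SDcols = c("val1", "val2")]
print(res)
```

Using lapply(.SD, sum) scales to many value columns without listing each sum() by hand.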
You can rbind your 2 tables (top and bottom) into one data frame and then use dplyr or data.table. data.table would be much faster for large tables.
data_frame$Name <- sub("-.*", "", data_frame$Name)
library(dplyr)
data_frame %>%
  group_by(Name) %>%
  summarise_all(sum)
library(data.table)
data_table <- as.data.table(data_frame)
data_table[, lapply(.SD, sum, na.rm=TRUE), by=Name ]

Rename multiple aggregated columns using data.table in R [duplicate]

I have a data frame with many columns, some of which are measure variables. I would like to extract a bunch of summary statistics from the latter using data.table. My problem is the following: how to rename the aggregated columns according to the function that was used?
I want to have an aggregated data.table with column names like: c("measure1_mean", "measure1_sd", "measure2_mean", "measure2_sd", ...)
My code looks like this:
library(data.table)
library(stringr)
dt <- data.table(meas1=1:10,
meas2=seq(5,25, length.out = 10),
meas3=rnorm(10),
groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
lapply(.SD, sd)),
by=groupvar, .SDcols = measure_cols]
# Does not work because of duplicates in rep(measure_cols, 3)
agg_names <- c(measure_cols, paste(rep(c("mean", "sd"), each=length(measure_cols)), measure_cols, sep="_"))
setnames(dt_agg, rep(measure_cols,3), agg_names)
This chunk effectively extracts the statistics but returns columns with identical names. Therefore I cannot use something like setnames(dt, old, new) because duplicates exist in my 'old' vector.
I came across this post: Rename aggregated columns using data.table in R. But I do not like the accepted solution because it relies on column index, and not names, to rename the columns.
library(data.table)
library(stringr)
dt <- data.table(meas1=1:10,
meas2=seq(5,25, length.out = 10),
meas3=rnorm(10),
groupvar=rep(LETTERS[1:5], each=2))
measure_cols <- colnames(dt)[str_detect(colnames(dt), "^meas")]
dt_agg <- dt[, c(lapply(.SD, mean),
lapply(.SD, sd)),
by=groupvar, .SDcols = measure_cols]
You can build the vector of new names with paste0(): use the each argument of rep() to repeat every function name once per measure column, so the names line up with the order of the aggregated columns.
function.names <- c("mean", "sd")
column.names <- paste0( measure_cols, "_", rep( function.names, each = length( measure_cols ) ) )
setnames( dt_agg, c("groupvar", column.names ))
# groupvar meas1_mean meas2_mean meas3_mean meas1_sd meas2_sd meas3_sd
# 1: A 1.5 6.111111 0.2346044 0.7071068 1.571348 1.6733804
# 2: B 3.5 10.555556 0.5144621 0.7071068 1.571348 0.0894364
# 3: C 5.5 15.000000 -0.5469839 0.7071068 1.571348 2.1689620
# 4: D 7.5 19.444444 -0.3898213 0.7071068 1.571348 1.0007027
# 5: E 9.5 23.888889 0.5569743 0.7071068 1.571348 1.4499413
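Since the question asks for renaming by name rather than by position, here is one possible sketch: aggregate once per function, rename each result with setnames(old, new), then merge on the grouping column. The helper agg_one and the fixed seed are assumptions added for illustration, and base grep() replaces str_detect() to avoid the stringr dependency.

```r
library(data.table)

set.seed(1)  # only so that meas3 is reproducible
dt <- data.table(meas1 = 1:10,
                 meas2 = seq(5, 25, length.out = 10),
                 meas3 = rnorm(10),
                 groupvar = rep(LETTERS[1:5], each = 2))
measure_cols <- grep("^meas", names(dt), value = TRUE)

# Hypothetical helper: aggregate with one function and suffix the column names
agg_one <- function(fun, suffix) {
  out <- dt[, lapply(.SD, fun), by = groupvar, .SDcols = measure_cols]
  setnames(out, measure_cols, paste(measure_cols, suffix, sep = "_"))
  out
}

# No duplicate names ever exist, so renaming by name is safe
dt_agg <- merge(agg_one(mean, "mean"), agg_one(sd, "sd"), by = "groupvar")
```

This costs one pass over the data per function, which is usually acceptable for a handful of summary statistics.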

Convert *some* column classes in data.table

I want to convert a subset of data.table cols to a new class. There's a popular question here (Convert column classes in data.table) but the answer creates a new object, rather than operating on the starter object.
Take this example:
library(data.table)
dat <- data.table(ID=c(rep("A", 5), rep("B",5)), Quarter=c(1:5, 1:5), value=rnorm(10))
cols <- c('ID', 'Quarter')
How best to convert just the cols columns to (e.g.) a factor? In a normal data.frame you could do this:
dat[, cols] <- lapply(dat[, cols], factor)
but that doesn't work for a data.table, and neither does this
dat[, .SD := lapply(.SD, factor), .SDcols = cols]
A comment in the linked question from Matt Dowle (from Dec 2013) suggests the following, which works fine, but seems a bit less elegant.
for (j in cols) set(dat, j = j, value = factor(dat[[j]]))
Is there currently a better data.table answer (i.e. shorter + doesn't generate a counter variable), or should I just use the above + rm(j)?
Besides using the option as suggested by Matt Dowle, another way of changing the column classes is as follows:
dat[, (cols) := lapply(.SD, factor), .SDcols = cols]
By using the := operator you update the data.table by reference. A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
As suggested by @MattDowle in the comments, you can also use a combination of for(...) set(...) as follows:
for (col in cols) set(dat, j = col, value = factor(dat[[col]]))
which will give the same result. A third alternative is:
for (col in cols) dat[, (col) := factor(dat[[col]])]
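A minimal sketch comparing the three alternatives on copies of the example data. The helper make_dat and the use of address() to confirm that the table object is not copied are additions for illustration, and value is a fixed sequence rather than rnorm() so the copies are identical.

```r
library(data.table)

# Hypothetical helper that rebuilds the example table each time
make_dat <- function() {
  data.table(ID = rep(c("A", "B"), each = 5),
             Quarter = rep(1:5, 2),
             value = as.numeric(1:10))
}
cols <- c("ID", "Quarter")

# 1. lapply over .SD with := (update by reference)
d1 <- make_dat()
addr_before <- address(d1)
d1[, (cols) := lapply(.SD, factor), .SDcols = cols]
same_object <- identical(address(d1), addr_before)

# 2. for loop with set()
d2 <- make_dat()
for (col in cols) set(d2, j = col, value = factor(d2[[col]]))

# 3. for loop with :=
d3 <- make_dat()
for (col in cols) d3[, (col) := factor(d3[[col]])]

classes <- sapply(d1, class)
```

All three leave ID and Quarter as factors and value as numeric, and same_object confirms that the first variant modified the table in place.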
On smaller datasets, the for(...) set(...) option is about three times faster than the lapply option (but that doesn't really matter, because the dataset is small). On larger datasets (e.g. 2 million rows), each of these approaches takes about the same amount of time. For testing on a larger dataset, I used:
dat <- data.table(ID=c(rep("A", 1e6), rep("B",1e6)),
Quarter=c(1:1e6, 1:1e6),
value=rnorm(2e6))
Sometimes, you will have to do it a bit differently (for example when numeric values are stored as a factor). Then you have to use something like this:
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
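A short sketch of why the detour through as.character() is needed when numbers are stored as a factor: as.integer() on a factor returns the internal level codes, not the printed values. The example table is made up for illustration.

```r
library(data.table)

dat <- data.table(ID = c("A", "B", "C"),
                  Quarter = factor(c("10", "20", "10")))
cols <- "Quarter"

# Direct conversion yields the level codes of the factor
codes <- as.integer(dat$Quarter)

# Converting via character recovers the values that were printed
dat[, (cols) := lapply(.SD, function(x) as.integer(as.character(x))), .SDcols = cols]
values <- dat$Quarter
```

Here codes holds the positions of the levels, while values holds the integers 10, 20 and 10.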
WARNING: The following explanation is not the data.table way of doing things. The data.table is not updated by reference because a copy is made and stored in memory (as pointed out by @Frank), which increases memory usage. It is included mainly to explain how with = FALSE works.
When you want to change the column classes the same way as you would with a data.frame, you have to add with = FALSE as follows:
dat[, cols] <- lapply(dat[, cols, with = FALSE], factor)
A check whether this worked:
> sapply(dat,class)
ID Quarter value
"factor" "factor" "numeric"
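If you don't add with = FALSE, data.table evaluates cols in dat[, cols] as an expression rather than as a set of column names. In the data.table version used for this answer that returned the character vector itself; newer versions raise a "column name not found" error instead. Compare the output of dat[, cols] and dat[, cols, with = FALSE]: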
> dat[, cols]
[1] "ID" "Quarter"
> dat[, cols, with = FALSE]
ID Quarter
1: A 1
2: A 2
3: A 3
4: A 4
5: A 5
6: B 1
7: B 2
8: B 3
9: B 4
10: B 5
You can use .SDcols:
dat[, cols] <- dat[, lapply(.SD, factor), .SDcols=cols]
