This question already has answers here:
paste two data.table columns
(4 answers)
Closed 6 years ago.
For example there is the following data.table:
dt <- data.table(x = list(1:2, 3:5, 6:9), y = c(1,2,3))
# x y
# 1: 1,2 1
# 2: 3,4,5 2
# 3: 6,7,8,9 3
I need to create a new data.table, where values of the y column will be appended to lists stored in the x column:
# z
# 1: 1,2,1
# 2: 3,4,5,2
# 3: 6,7,8,9,3
I've tried lapply, cbind, list, c functions. But I can't get the table I need.
UPDATE:
The question is different from paste two data.table columns because a trivial solution with paste function or something like this doesn't work.
This will do it
# Merge two lists
dt[, z := mapply(c, x, y, SIMPLIFY=FALSE)]
print(dt)
x y z
1: 1,2 1 1,2,1
2: 3,4,5 2 3,4,5,2
3: 6,7,8,9 3 6,7,8,9,3
And deleting the original x and y columns
dt[, c("x", "y") := NULL]
print(dt)
z
1: 1,2,1
2: 3,4,5,2
3: 6,7,8,9,3
I would like to suggest a general approach for this kind of task in case you have multiple columns that you would like to combine into a single column
An example data with multiple columns
dt <- data.table(x = list(1:2, 3:5, 6:9), y = 1:3, z = list(4:6, NULL, 5:8))
Solution
res <- melt(dt, measure.vars = names(dt))[, .(.(unlist(value))), by = rowid(variable)]
res$V1
# [[1]]
# [1] 1 2 1 4 5 6
#
# [[2]]
# [1] 3 4 5 2
#
# [[3]]
# [1] 6 7 8 9 3 5 6 7 8
The idea here is to convert to long format and then unlist/list by group
(You will receive an warning due to different classes in the resulting value column)
Related
Given data
library(data.table)
dt = data.table(x = 1:5, y = 6:10)
x y
1: 1 6
2: 2 7
3: 3 8
4: 4 9
5: 5 10
Lets say I want two new columns with the mean of x and y. I can easily do this by providing a character vector of column names on left hand side (LHS) of :=:
dt[, c("mean_x", "mean_y") := lapply(.SD, mean)]
But if I assign the column names to a vector beforehand, and use the name of the vector as LHS, I get an error:
new_names = c("mean_x", "mean_y")
dt[, new_names := lapply(.SD, mean)]
# Error in `[.data.table`(dt, , `:=`(new_names , lapply(.SD, mean))) :
# Supplied 2 items to be assigned to 10 items of column 'cols'.
# If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
However, if I put parenthesis around the name of the vector it works fine
dt[, (new_names) := lapply(.SD, mean)]
x y mean_x mean_y
1: 1 6 3 8
2: 2 7 3 8
3: 3 8 3 8
4: 4 9 3 8
5: 5 10 3 8
What does the parenthesis tell data.table internally so that it treats new_names as new columns and not something else?
This question already has answers here:
How to select columns programmatically in a data.table?
(2 answers)
Closed 4 years ago.
I have list of variable names in a vector of strings v and a data table my.dt that contains all these variables.
> v
[1] "var1" "var2" "var3"
I want to use those variables whose names are in the vector v, such that i create new variable that is cbind of these 3, or any number of names that appear in v, like:
new <- cbind(my.dt[,"var1"],my.dt[,"var2"],my.dt[,"var3"])
new1 <- rowSums(new, na.rm=TRUE) * ifelse(rowSums(is.na(new)) == ncol(new), NA, 1)
How can i get this, having in mind that number of variables is not fixed, so i dont want to refer to each element like v[1], v[2] etc.
Based on your comment, you are working on a data.table. You will need to add with = FALSE to the code as follows.
library(data.table)
my.dt <- data.table( ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18 )
v <- c("a", "ID")
my.dt[, v, with = FALSE]
# a ID
# 1: 1 b
# 2: 2 b
# 3: 3 b
# 4: 4 a
# 5: 5 a
# 6: 6 c
Notice that if you are working on a data frame, you don't need with = FALSE.
my.dt <- data.frame( ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18 )
v <- c("a", "ID")
my.dt[, v]
# a ID
# 1 1 b
# 2 2 b
# 3 3 b
# 4 4 a
# 5 5 a
# 6 6 c
I have a dataset which looks like this:
set.seed(43)
dt <- data.table(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10),
e = sample(c("x","y"),10,replace = T),
f=sample(c("t","s"),10,replace = T)
)
i need (for example) a count of negative values in columns 1:4 for each value of e, f. The result would have to look like this:
e neg_a_count neg_b_count neg_c_count neg_d_count
1: x 6 3 5 3
2: y 2 1 3 NA
1: s 4 2 3 1
2: t 4 2 5 2
Here's my code:
for (k in 5:6) { #these are the *by* columns
for (i in 1:4) {#these are the columns whose negative values i'm counting
n=paste("neg",names(dt[,i,with=F]),"count","by",names(dt[,k,with=F]),sep="_")
dt[dt[[i]]<0, (n):=.N, by=names(dt[,k,with=F])]
}
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
which obviously produces two tables, not one:
e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x 6 3 5 3
2: y 2 1 3 NA
f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s 4 2 3 1
2: t 4 2 5 2
and need to be rbind to produce one table.
This approach modifies dt by adding eight additional columns (4 data columns x 2 by columns), and the counts related to the levels of e and f get recycled (as expected). I was wondering if there is a cleaner way to achieve the result, one which does not modify dt. Also, casting after melting seems inefficient, there should be a better way, especially since my dataset has several e and f-like columns.
If there is only two grouping columns, we could do an rbindlist after grouping by them separately
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping through the grouping column names
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
You can melt before aggregating as follows:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:
[df1] [df2]
X Y X Y
2 3 4 1
5 2 2 4
These tables are stored in a list (where the list can contain more than two elements if necessary), and what I want to do is create an average of each cell in the tables across the list (or for a generalised version, apply any function I choose to the cells i.e. mad, sd, etc)
[df1] [df2] [dfMeans]
X Y X Y X Y
2 3 4 1 mean(2,4) mean(3,1)
5 2 2 4 mean(5,2) mean(2,4)
I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:
df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)
dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))
for (rowIndex in 1:nrow(df1)) {
for (colIndex in 1:ncol(df1)) {
valuesAtCell <- c()
for (tableIndex in 1:length(dfList)) {
valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex,colIndex])
}
dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
}
}
print(dfMeans)
Here is a data.table solution where the mean is applied row-wise across the data frames:
library(data.table)
dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
.id X Y
1: 1 2 3
2: 1 3 2
3: 1 4 1
4: 2 5 4
5: 2 1 1
6: 2 3 4
7: 3 2 1
8: 3 7 7
9: 3 4 6
dtList[, rn := 1:.N, by = .id][][, .(X = mean(X), Y = mean(Y)), by = rn]
rn X Y
1: 1 3.000000 2.666667
2: 2 3.666667 3.333333
3: 3 3.666667 3.666667
You can replace the mean by another aggregation function, eg, median. The .id column numbers the original data frames each row was sourced from.
Edit
The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):
cn <- colnames(df1)
cn
[1] "X" "Y"
dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
X Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667
The column names are taken from one of the original data frames which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, [] ensures the result ist printed.
You could simply sum all data.frame's in your list using Reduce(), and divide by the length of dfList, which is equal to the number of df's it contains.
Reduce(`+`, dfList) / length(dfList)
# X Y
#1 3.000000 2.666667
#2 3.666667 3.333333
#3 3.666667 3.666667
I'd like to order a data.table by a variable holding the name of a column:
I've tried every combination of + eval, getandc` without success:
I have colVar = "someColumnName"
I'd like to apply this to: DT[order(colVar)]
data.table has special functions for that matter which will modify your data set by reference instead of copying it to a new object.
You can either use setkey or (in versions >= 1.9.4) setorder which is capable of ordering in decreasing order too.
Note the difference between setkey vs. setkeyv and setorder vs. setorderv. v notes that you can pass either a quoted variable name or a variable containing one.
Using #andrewzm data set
dtbl
# x y
# 1: 1 5
# 2: 2 4
# 3: 3 3
# 4: 4 2
# 5: 5 1
setorderv(dtbl, colVar)[] # or `sekeyv(dtbl, colVar)[]` or `setorderv(dtbl, "y")[]`
# x y
# 1: 5 1
# 2: 4 2
# 3: 3 3
# 4: 2 4
# 5: 1 5
You can use double brackets for data tables:
library(data.table)
dtbl <- data.table(x = 1:5, y = 5:1)
colVar = "y"
dtbl_sorted <- dtbl[order(dtbl[[colVar]])]
dtbl_sorted