I have a data table in R that looks like this
DT = data.table(a = c(1,2,3,4,5), a_mean = c(1,1,2,2,2), b = c(6,7,8,9,10), b_mean = c(3,2,1,1,2))
I want to create two more columns, a_final and b_final, defined as a_final = (a - a_mean) and b_final = (b - b_mean). In my real-life use case, there can be a large number of such column pairs, and I want a scalable solution in the spirit of R's data.table.
I tried something along the lines of
DT[,paste0(c('a','b'),'_final') := lapply(.SD, function(x) ((x-get(paste0(colnames(.SD),'_mean'))))), .SDcols = c('a','b')]
but this doesn't quite work. Any idea of how I can access the column name of the column being processed within the lapply statement?
We can create a character vector of column names, use it to subset the original data.table, subtract the corresponding "_mean" columns, and add the results as new columns.
library(data.table)
cols <- unique(sub('_.*', '', names(DT))) # thanks to @Sotos
#OR just
#cols <- c('a', 'b')
DT[,paste0(cols, '_final')] <- DT[,cols, with = FALSE] -
DT[,paste0(cols, "_mean"), with = FALSE]
DT
# a a_mean b b_mean a_final b_final
#1: 1 1 6 3 0 3
#2: 2 1 7 2 1 5
#3: 3 2 8 1 1 7
#4: 4 2 9 1 2 8
#5: 5 2 10 2 3 8
Another option is using mget with Map:
cols <- c('a', 'b')
DT[, paste0(cols,'_final') := Map(`-`, mget(cols), mget(paste0(cols,"_mean")))]
Relying on the .SD construct you could do something along the lines of:
cols <- c('a', 'b')
DT[, paste0(cols, "_final") :=
DT[, .SD, .SDcols = cols] -
DT[, .SD, .SDcols = paste0(cols, "_mean")]]
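If a plain loop is acceptable, the same pairing can also be written with data.table's set(), which assigns one column at a time by reference. This is only a minimal sketch on the example data from the question, assuming every name in cols has a matching "_mean" column; it is not taken from the answers above.

```r
library(data.table)

# Example data from the question
DT <- data.table(a = c(1, 2, 3, 4, 5), a_mean = c(1, 1, 2, 2, 2),
                 b = c(6, 7, 8, 9, 10), b_mean = c(3, 2, 1, 1, 2))

cols <- c("a", "b")  # assumed: each has a matching "<name>_mean" column

for (col in cols) {
  # set() adds or overwrites column j by reference
  set(DT, j = paste0(col, "_final"),
      value = DT[[col]] - DT[[paste0(col, "_mean")]])
}
```

Inside the loop the current column name is simply col, which sidesteps the problem of finding the name from within lapply.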
I want to merge variables with the same name so that values from the y dataset overwrite those in the x dataset.
This code should produce a replica of b because a$V2 should be overwritten by b$V2.
Instead I get V2.x and V2.y
a = data.frame(c("A","B","C","D"), c("1","2"))
names (a) = c("V1","V2")
b = data.frame(c("A","B","C","D"), c("3","4"))
names (b) = c("V1","V2")
merge.data.frame(a, b, by.x = "V1", by.y = "V1", all.y = TRUE)
It may be easier with rows_update
library(dplyr)
rows_update(a, b, by = 'V1')
Or assign by reference (:=) in a data.table join, which updates column 'V2' in 'a' with the matching 'V2' values from 'b' (referred to as i.V2):
library(data.table)
setDT(a)[b, V2 := i.V2, on = .(V1)]
Do you mean this one:
a$V2 <- b$V2
V1 V2
1 A 3
2 B 4
3 C 3
4 D 4
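The direct assignment only gives the right answer when the rows of a and b are in the same order. As a hedged base R sketch (the reversed row order of b below is an assumption made to illustrate the point), match can look up each key first:

```r
a <- data.frame(V1 = c("A", "B", "C", "D"), V2 = c("1", "2", "1", "2"),
                stringsAsFactors = FALSE)
# Same values as in the question, but with the rows reversed (assumption)
b <- data.frame(V1 = c("D", "C", "B", "A"), V2 = c("4", "3", "4", "3"),
                stringsAsFactors = FALSE)

idx <- match(a$V1, b$V1)   # row of b holding each key of a (NA if absent)
hit <- !is.na(idx)         # only overwrite where b has the key
a$V2[hit] <- b$V2[idx[hit]]
```

Keys of a that are missing from b keep their original V2 value.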
You can use ifelse to override the values you want after merge
df <- merge.data.frame(a,b, by = "V1")
df$V2 <- ifelse(df$V2.x == df$V2.y , df$V2.x , df$V2.y)
df |> subset(select = c(V1 , V2))
Output
V1 V2
1 A 3
2 B 4
3 C 3
4 D 4
I want to do something very simple but so far I have failed to do it in one command. I want to create a new data table by applying a function to some columns of an existing one, giving the results new names and dropping the rest.
Let's see a minimal example:
library(data.table)
dt = data.table(A = c('a', 'a', 'a', 'b', 'b'),
B = c(1 , 2 , 3 , 4 , 5 ),
C = c(10 , 20 , 30 , 40 , 50))
dt
A B C
a 1 10
a 2 20
a 3 30
b 4 40
b 5 50
For a single column, we can do:
dt1 = dt[, .(totalB = sum(B)), by=A]
dt1
A totalB
a 6
b 9
For more than one column, we can do:
dt2 = dt[, .(totalB = sum(B), totalC = sum(C)), by=A]
dt2
A totalB totalC
a 6 60
b 9 90
But if there are many columns, that's not practical. So I guess we should use lapply like this:
dt3 = dt[, lapply(.SD, sum), by = A]
dt3
A B C
a 6 60
b 9 90
That creates the table but without the names. So we can add them:
new_names = c("totalB", "totalC")
dt4 = dt[, (new_names) := lapply(.SD, sum), by = A]
dt4
A B C totalB totalC
a 1 10 6 60
a 2 20 6 60
a 3 30 6 60
b 4 40 9 90
b 5 50 9 90
But now the original columns remain. How can we prevent that? Also note that in my actual problem I use a subset of the columns, via .SDcols, which I didn't include here for simplicity.
EDIT: My desired output is the same as dt2 but I don't want to write down all columns.
Do you mean something like below?
dt[, setNames(lapply(.SD, sum), paste0("total", names(.SD))), A]
Output
A totalB totalC
1: a 6 60
2: b 9 90
Another option is setnames. Create a vector ('nm1') of the columns to summarise, i.e. every column except the grouping variable 'A', take the sum grouped by 'A', and then call setnames with the old and new names:
nm1 <- setdiff(names(dt), "A")
setnames(dt[, lapply(.SD, sum), A], nm1, paste0('total', nm1))[]
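Since the question mentions .SDcols, here is a sketch of the setNames approach restricted to a subset of columns. The extra column D and the vector name sum_cols are made up for illustration.

```r
library(data.table)

dt <- data.table(A = c("a", "a", "a", "b", "b"),
                 B = c(1, 2, 3, 4, 5),
                 C = c(10, 20, 30, 40, 50),
                 D = c("u", "v", "w", "x", "y"))  # hypothetical column to be dropped

sum_cols <- c("B", "C")  # columns to aggregate

# Only sum_cols reach .SD; the result holds A plus the renamed sums
res <- dt[, setNames(lapply(.SD, sum), paste0("total", sum_cols)),
          by = A, .SDcols = sum_cols]
```

Because no := is involved, dt itself is left unchanged and res contains only A, totalB and totalC.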
I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
# If missing look in d2
missing <- is.na(d1[[col]])
d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
# If column missing then add
d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
Likely this question has been asked before, but I lack the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns of d2 to match those in d (merge sorts d by id)
mask <- d2[match(d$id, d2$id), names(d)]
#replace NAs in d with the corresponding values in mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question as a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which does not seem to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
rn <- match(ans$id, elt$id)
ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
}
ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
lapply(list(d1, d2), function(x) {
x[, setdiff(allcols, names(x))] <- NA
x
}))
edit:
A possible data.table solution using coalesce1a, by Martin Morgan, from "How to implement coalesce efficiently in R":
coalesce1a <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
i <- which(is.na(ans))
ans[i] <- elt[i]
}
ans
}
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[
d1, .SD, on=.(id)]
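To see what coalesce1a does on its own, it can be applied to plain vectors: each later argument only fills positions that are still NA. The sketch below repeats the function so that it runs standalone; the vectors are made-up values.

```r
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    i <- which(is.na(ans))   # positions still missing
    ans[i] <- elt[i]         # fill them from the next vector
  }
  ans
}

x <- c(1, NA, 4, NA)
y <- c(15, 14, NA, NA)
z <- c(0, 0, 0, 7)

res <- coalesce1a(x, y, z)
```

Earlier arguments take priority, so the 15 in y never replaces the 1 in x.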
Here is a possibility using dplyr::left_join:
library(dplyr)
left_join(d1, d2, by = "id") %>%
mutate(
x = ifelse(!is.na(x.x), x.x, x.y),
y = ifelse(!is.na(y.x), y.x, y.y)) %>%
select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1', excluding 'id') and that appear only in the second ('nm2'). Convert the first dataset to a data.table (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced pairs of columns (the second dataset's copies carry the i. prefix) together with the new columns, which updates the first dataset by reference:
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
I'm trying to generate row sums of a variable and its lag(s). Say I have:
library(data.table)
data <- data.table(id = rep(c("AT","DE"), each = 3),
time = rep(2001:2003, 2), var1 = c(1:6), var2 = c(NA, 1:3, NA, 8))
And I want to create a variable which adds 'var1' and the first lag of 'var2' by 'id'. If I create the lag first and then the sum, I know how to do it:
data[ , lag := shift(var2, 1), by = id]
data[ , goalmessy := sum(var1, lag, na.rm = TRUE), by = 1:NROW(data)]
But is there a way to use shift inside sum or something similar (like applying sum)? The intuitive problem I have is that, as far as I know, the by clause is evaluated first, so we are in a single row, which makes shifting infeasible. Any hints?
I think this will do what you want in one line:
data[, myVals := rowSums(cbind(var1, shift(var2)), na.rm=TRUE), by=id]
data
id time var1 var2 myVals
1: AT 2001 1 NA 1
2: AT 2002 2 1 2
3: AT 2003 3 2 4
4: DE 2001 4 3 4
5: DE 2002 5 NA 8
6: DE 2003 6 8 6
The two variables of interest are put into cbind which is used to feed rowSums and NAs are dropped as in your code.
We can use rowSums
data[, goalmessy := rowSums(setDT(.(var1, shift(var2))), na.rm = TRUE), by = id]
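A variation that avoids building a two-column table is to replace the missing lag with 0 before adding. This is a sketch that assumes var1 itself has no NA values (true for the example data); the column name goal is made up.

```r
library(data.table)

data <- data.table(id = rep(c("AT", "DE"), each = 3),
                   time = rep(2001:2003, 2),
                   var1 = c(1:6), var2 = c(NA, 1:3, NA, 8))

data[, goal := {
  lagged <- shift(var2)                      # first lag within each id
  var1 + replace(lagged, is.na(lagged), 0)   # treat a missing lag as 0
}, by = id]
```

Because shift runs inside j with by = id, the lag never crosses from one id into the next.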
Given the data.table, DF below, I would like to select all except the first rows of the groups numbered 6 and 8. I was told that I should use paste0( ). I have a solution that gives the expected result but without paste0( ).
DF <- data.table(grp=c(6,6,8,8,8), Q1=c(2,2,3,5,2), Q2=c(5,5,4,4,1), Q3=c(2,1,4,2,4), H1=c(3,4,5,2,4), H2=c(5,2,4,1,2) )
Desired result:
desired_result <- data.table(grp=c(6,8,8), Q1=c(2,5,2), Q2=c(5,4,1), Q3=c(1,2,4) )
One method that achieves this result:
DF[ , .SD[-1], .SDcols = c("Q1", "Q2", "Q3"), by = grp]
How can I use paste0( ) rather than c( )? Is there any advantage to one of these or an example where only one would work?
Selecting the columns with paste0 seems to work:
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 2
2: 6 2 5 1
3: 8 3 4 4
4: 8 5 4 2
5: 8 2 1 4
Comparing one method to another.
all.equal(DF[ , .SD, .SDcols = c("Q1", "Q2", "Q3"), by = grp],
DF[ , .SD, .SDcols = paste0("Q", 1:3), by = grp])
[1] TRUE
Note that .SDcols selects columns and has nothing to do with dropping the first rows of each group. .SDcols can take a character vector, and paste0 produces character vectors, so selecting the columns can work either way.
One way to drop the first row of each group is tail(.SD, -1). The version below also uses paste0, although paste0 plays no part in the row selection:
DF[ , tail(.SD, -1), .SDcols = paste0("Q", 1:3), by = grp]
grp Q1 Q2 Q3
1: 6 2 5 1
2: 8 5 4 2
3: 8 2 1 4
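Because .SDcols only needs a character vector, the names can also be computed from the table itself. A sketch using grep, which is an alternative not mentioned in the question, combined with .SD[-1] for the row selection:

```r
library(data.table)

DF <- data.table(grp = c(6, 6, 8, 8, 8), Q1 = c(2, 2, 3, 5, 2), Q2 = c(5, 5, 4, 4, 1),
                 Q3 = c(2, 1, 4, 2, 4), H1 = c(3, 4, 5, 2, 4), H2 = c(5, 2, 4, 1, 2))

q_cols <- grep("^Q", names(DF), value = TRUE)  # same vector as paste0("Q", 1:3)

# .SD[-1] drops the first row of each group; .SDcols picks the Q columns
res <- DF[, .SD[-1], by = grp, .SDcols = q_cols]
```

Generating the names with paste0 or grep mainly pays off when there are too many columns to type out with c().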