R: reshaping a melted data.table with a list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table into wide form (rolling the variables up). The problem is that the data.table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2, because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list(), list(), list(), list(), list(), list()),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l },
  unique(dt$variable),
  list())
# Expression set up
q <- substitute({
  l <- lapply(
    list_template,
    function(col) .SD[variable == as.character(col)]$value)
  l$data = .SD[1, ]$data
  l
}, list(list_template = list_template))
# Roll up
dt[, eval(q), by = list(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>

This old question piqued my curiosity, as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4,
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
Reshape the non-list columns from long to wide format
Aggregate the list column data grouped by x and y
Join the two partial results on x and y
but uses the features of the actual data.table syntax, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
The list column data is aggregated by taking the first element of each group. This is in line with the OP's code line
l$data = .SD[1,]$data
which also picks the first element.
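If, instead, all list entries of a group should be preserved, a small variation of the same join might work (a sketch; each data cell would then hold the group's complete list):
dcast(dt, x + y ~ variable)[
  dt[, .(data = .(data)), by = .(x, y)], on = .(x, y)]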

I have a somewhat cheating method that might do the trick. Importantly, I assume that each x, y, list combination is unique! If not, please disregard.
I'm going to create two separate data.tables: the first is dcast-ed without the data list objects, and the second has only the unique data list objects and a key. Then I just merge them together to get the desired result.
require(data.table)
require(stringr)
require(reshape2)
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list("a", "b"), list("c", "d")),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist, x + y ~ variable, value.var = "value"),
                       key = c("x", "y"))
# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]
# Rows are duplicated so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there's list objects in the data.table.
# Instead, I cheat by applying a value to each row within an x,y "group"
# that is unique within EACH group, but present within EVERY group.
# Then just simply subselect based on that unique value.
# I've chosen rank(), but no doubt there's other options
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]
# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))
# drop the rank since we no longer need it
dt_list[,rank:=NULL]
# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
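For what it's worth, recent data.table versions make the rank() trick unnecessary: unique() accepts a by argument, so duplicates can be dropped on the key columns alone and the list column is never compared. A minimal sketch, assuming a reasonably recent data.table (the by argument of unique() appeared around 1.9.x):
# deduplicate on the key columns only; the list column is carried along untouched
dt_list <- unique(dt[, list(x, y, data)], by = c("x", "y"))
setkeyv(dt_list, c("x", "y"))
dt_final <- merge(dt_dcast, dt_list)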

Related

Sum numeric sub-dataframe within a list

Here I have an R list of data frames. All data frames have the same format and the same dimensions: the first 2 columns are strings (IDs and names, identical across all data frames), and the rest are numeric values. I want to sum the numeric parts of all the data frames in a matrix-like way, i.e. the output at index (1,3) is the sum of the values at index (1,3) of all the data frames.
E.g. given a list L consisting of data frames x and y, I want to get output like z:
x <- data.frame(ID = c("aa","bb"), name = c("cc","dd"), v1 = c(1,2), v2 = c(3,4))
y <- data.frame(ID = c("aa","bb"), name = c("cc","dd"), v1 = c(5,6), v2 = c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))
I know how to do this using a for loop, but I want to learn to do it in a more R-like way; by that I mean using some vectorized functions, like the apply family.
Currently my idea is to create a new data frame with only the ID and name columns, then use a global data frame variable to sum the numeric parts, and at last cbind these 2 parts:
output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L,function(a){num_sum <<- a[3:length(a)]+num_sum})
cbind(output,num_sum)
but this approach has some problems I prefer to avoid:
I need to manually set up the 2 parts of the output and then manually join the two parts.
lapply() will return a list in which each element is an intermediate num_sum returned by an iteration, which requires much more memory.
Here I'm using the global variable num_sum to keep track of the progress, but num_sum is not needed later and I have to remove it manually afterwards.
If the order of the first two columns is always the same, you can do:
#Get all numeric columns
num <- sapply(L[[1]], is.numeric)
#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))
#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
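The Reduce line works because + on data frames of identical shape adds them cell-wise, and Reduce folds that operation over the whole list. A minimal standalone sketch:
Reduce(`+`, list(data.frame(a = 1:2), data.frame(a = 3:4)))
#   a
# 1 4
# 2 6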
If they are different, you can use powerjoin to do an inner join on the selected columns and sum the rest with the conflict argument:
library(powerjoin)
sum_inner_join <- function(x, y) {
  power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
}
Reduce(sum_inner_join, L)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:
library(purrr)
library(dplyr)
result <- reduce(L, function(x, y) {
  xVals <- x |> select(-ID, -name)
  yVals <- y |> select(-ID, -name)
  totalVals <- xVals |> map2(yVals, function(x, y) {
    rowSums(cbind(x, y))
  })
  return(x |> select(ID, name) |> cbind(totalVals))
})
Similar logic to Maël's answer, but squishing it all into a Map call:
data.frame(do.call(Map,
c(\(...) if(is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12
If the first chunk ..1 of a column is numeric, sum the values across all the lists; otherwise return the first chunk ..1.
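A tiny standalone sketch of the ..1 semantics: inside the function passed to Map, ..1 is the element coming from the first list of each slice.
Map(function(...) ..1, list(1, "a"), list(2, "b"))
# [[1]]
# [1] 1
#
# [[2]]
# [1] "a"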
You could also do it via an aggregation if all the rows are unique:
tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12

Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns), but I have given code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row, starting at the top, and whenever the value of the 'Value' column changes from X to Y or from Y to X, I need to extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract the entire rows 2 and 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I use a 'for' loop to step from row to row and extract the rows when the values don't match. But it is very slow and inefficient. The code snippet is below:
count <- nrow(df) - 1          # stop one short of the last row
dfextract <- df[0, ]           # empty accumulator with the same columns
for (i in 1:count) {
  if (df[[i + 1, 2]] != df[[i, 2]]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i + 1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of the column Name.
diffstr <- function(x) x[-1] == x[-length(x)]
res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; negative indices remove that element from the vector. The same goes for x[-length(x)]: the negative index removes the last element of x.
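A quick sketch of what it returns:
diffstr(c("X", "X", "Y"))
# [1]  TRUE FALSE   (TRUE = value equals its successor, FALSE = a change point)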
split(df, df$Name) splits the data frame into subsets each one of its own Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the return value because there is no change at the first position.
The next line is a tricky one. Use lapply on the index of change points inx, and for each one get a two-row segment of the data frame x. Then use do.call to rbind those two-row data frames back together.
Now res is a list with one sub-data frame for each Name (the result of the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy-up. The whole process messed up the data frame's row names. Setting them to NULL is just a well-known trick that forces R to renumber the rows.
That's it. If you need more explanations, just say so.
We can use dplyr. lag shifts a column by one row, so Value != lag(Value) checks whether a value differs from the previous one, and which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also get the row numbers of the preceding rows. Finally, slice subsets the data frame based on the row numbers provided.
library(dplyr)
df2 <- df %>%
slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
If the code is too long to read, you can also calculate the indices before using the slice function, as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
Using base R, I would probably build an index of the change points with diff:
df <- data.frame(colA = c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
                 colB = 1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
colA colB
3 1 3
4 2 4
7 1 7
10 3 10
5 1 5
8 3 8
11 1 11
There is probably a faster option though.
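One small tweak, if the original row order matters: sort the combined index before subsetting (a sketch using the keep defined above):
df[sort(unique(c(keep, keep + 1))), ]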
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 Value = c("X", "X", "Y", "Y", "X"))
# look if Value changes to the next instance
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# filter the rows where the index changes and the next value
# and deselect the variable idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#> Name Value
#> 1: Name1 Y
#> 2: Name1 Y
#> 3: Name1 X
Why does it give an odd number of rows and not an even number?
Because in your example data the value changes in the last row, so that row is selected, but there is no following row left to select with it.

Order a data frame programmatically using a character vector of column names

I am trying to order a data frame on multiple columns. And the column names are passed through variable, i.e. a character vector.
df <- data.frame(var1 = c("b","a","b","a"), var2 = c("l","l","k","k"),
var3 = c("t","w","x","t"))
var1 var2 var3
1 b l t
2 a l w
3 b k x
4 a k t
Sorting on one column using a variable
sortvar <- "var1"
df[order(df[ , sortvar]),]
var1 var2 var3
2 a l w
4 a k t
1 b l t
3 b k x
Now, if I want to order by two columns, the above solution does not work.
sortvar <- c("var1", "var2")
df[order(df[, sortvar]), ] #does not work
I can manually order with column names:
df[with(df, order(var1, var2)),]
var1 var2 var3
4 a k t
2 a l w
3 b k x
1 b l t
But, how do I order the data frame dynamically on multiple columns using a variable with column names? I am aware of the plyr and dplyr arrange function, but I want to use base R here.
order expects multiple ordering variables as separate arguments, which is unfortunate in your case but suggests a direct solution: use do.call:
df[do.call(order, df[, sortvar]), ]
In case you’re unfamiliar with do.call: it constructs and executes a call programmatically. The following two statements are equivalent:
fun(arg1, arg2, …)
do.call(fun, list(arg1, arg2, …))
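For instance, with the df and sortvar from the question, these two calls are equivalent, since a data frame is a list of its columns and do.call spreads that list into separate arguments (a quick sketch):
order(df$var1, df$var2)
do.call(order, df[, sortvar])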
It's a bit awkward, but you can use do.call() to pass each of the columns to order as a different argument
dat[do.call("order", dat[,cols, drop=FALSE]), ]
I added drop=FALSE just in case length(cols)==1, where indexing a data.frame would otherwise return a vector instead of a list. You can wrap it in a function to make it a bit easier to use:
order_by_cols <- function(data, cols = 1) {
  data[do.call("order", data[, cols, drop = FALSE]), ]
}
order_by_cols(dat, cols)
It's a bit easier with dplyr, if that's something you might consider:
library(dplyr)
dat %>% arrange(across(all_of(cols)))
dat %>% arrange_at(cols) # though this method has been superseded by the above line

Mapply to Add Column to Each Dataframe in a List

Implemented some code from previous question:
Lapply to Add Columns to Each Dataframe in a List
Using the method above, I receive corrupt data. While I cannot provide the actual data, I am wondering if additional arguments need to be supplied to prevent shuffling.
Basically, this:
library(data.table)
df1 <- data.frame(x = runif(3), y = runif(3))
df2 <- data.frame(x = runif(3), y = runif(3))
dfs <- list(df1, df2)
years <- list(2013, 2014)
a <- Map(cbind, dfs, year = years)
final <- rbindlist(a)
But applied to a list of thousands of data frames, this gives incorrect results. Assume that some data frames, say a df1.5 somewhere between the two data frames above, are empty. Would that affect the order in which Map binds the years to the dfs? Essentially, I have an output in which some data belong to different years than the ones Map attached to them. I tested the length and order of the years list and compared it to the output year in final; they are identical. Any thoughts?
We create a logical index based on the length of each element of 'dfs', use that to subset both 'dfs' and 'years', and then do the cbind with Map:
i1 <- sapply(dfs, length)>1
Or to make it more stringent
i1 <- sapply(dfs, function(x) is.data.frame(x) & !is.null(x) & length(x) >0 )
a <- Map(cbind, dfs[i1], year = years[i1])
and then do the rbindlist with fill = TRUE, in case the number of columns is not the same in all the data.frames in the list:
rbindlist(a, fill = TRUE)
data
dfs[[3]] <- list(NULL)
dfs[[4]] <- data.frame()
years <- 2013:2016
Use the idcol argument to rbindlist and add the year column afterwards:
res = rbindlist(dfs, idcol=TRUE)
res[.(.id = 1:2, year = 2013:2014), on=".id", year := i.year]
X[i, on=cols, z := i.z] merges X with i on cols and then copies z from i to X.
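As a self-contained illustration of that update-join idiom (a sketch with hypothetical tables X and i):
library(data.table)
X <- data.table(id = 1:3, val = c(10, 20, 30))
i <- data.table(id = c(1, 3), z = c("a", "b"))
X[i, on = "id", z := i.z]   # rows of X matching i get z copied over; the rest stay NA
X
#    id val    z
# 1:  1  10    a
# 2:  2  20 <NA>
# 3:  3  30    b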
