Sum numeric sub-dataframe within a list - r

Here I have an R list of dataframes. All dataframes are in the same format and have the same dimensionality: the first 2 columns are strings, like IDs and names (identical for all dataframes), and the rest are numeric values. I want to sum the numeric parts of all the dataframes in a matrix way, i.e. the output at index (1,3) is the sum of the values at index (1,3) across all the dataframes.
e.g. given a list L consisting of dataframes x and y, I want to get an output like z:
x <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1,2),v2=c(3,4))
y <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(5,6),v2=c(7,8))
L <- list(x,y)
z <- data.frame(ID=c("aa","bb"),name=c("cc","dd"),v1=c(1+5,2+6),v2=c(3+7,4+8))
I know how to do this using a for loop, but I want to learn to do it in a more R-like way; by that I mean using some vectorized functions, like the apply family.
Currently my idea is to create a new dataframe with only the ID and name columns, then use a global dataframe variable to sum the numeric parts, and finally cbind these 2 parts:
output <- x[,1:2]
num_sum <- matrix(0,nrow=nrow(L[[1]]),ncol=ncol(L[[1]][,-c(1,2)]))
lapply(L,function(a){num_sum <<- a[3:length(a)]+num_sum})
cbind(output,num_sum)
but this approach has some problems I'd prefer to avoid:
I need to manually set the 2 parts of the output and then manually join them
lapply() will return a list in which each element is an intermediate num_sum returned by an iteration, which requires much more memory
Here I'm using the global variable num_sum to keep track of the progress, but num_sum is not needed afterwards and I have to manually remove it

If the order of the first two columns is always the same, you can do:
#Get all numeric columns
num <- sapply(L[[1]], is.numeric)
#Sum them across elements of the list
df_num <- Reduce(`+`, lapply(L, `[`, num))
#Get the non-numeric columns and bind them with sum of numeric columns
cbind(L[[1]][!num], df_num)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
If they are different, you can use powerjoin to do an inner join on the selected columns and sum the rest with the conflict argument:
library(powerjoin)
sum_inner_join <-
function(x, y) power_inner_join(x, y, by = c("ID", "name"), conflict = ~ .x + .y)
Reduce(sum_inner_join, L)
output
ID name v1 v2
1 aa cc 6 10
2 bb dd 8 12
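A quick way to convince yourself that the join really is insensitive to row order (reusing x, y and sum_inner_join from above; y_shuffled is just an illustrative name):
# reverse the row order of y; the inner join still matches rows on ID and name
y_shuffled <- y[2:1, ]
Reduce(sum_inner_join, list(x, y_shuffled))
# same sums as z above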

Using dplyr and purrr (which has somewhat nicer map functions), you could do something like this:
library(purrr)
library(dplyr)
result <- reduce(L, function(x, y) {
  xVals <- x |> select(-ID, -name)
  yVals <- y |> select(-ID, -name)
  totalVals <- xVals |> map2(yVals, function(x, y) {
    rowSums(cbind(x, y))
  })
  return(x |> select(ID, name) |> cbind(totalVals))
})

Similar logic to Maël's answer, but squishing it all into a Map call:
data.frame(do.call(Map,
  c(\(...) if (is.numeric(..1)) Reduce(`+`, list(...)) else ..1, L)
))
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12
If the first chunk of the column (..1) is numeric, sum (+) all the values across all the list elements, otherwise return the first chunk ..1.
You could also do it via an aggregation if all the rows are unique:
tmp <- do.call(rbind, L)
nums <- sapply(tmp, is.numeric)
aggregate(tmp[nums], tmp[!nums], FUN=sum)
# ID name v1 v2
#1 aa cc 6 10
#2 bb dd 8 12

Related

Extract and append data to new datasets in a for loop

I have (what I think is) a really simple question, but I can't figure out how to do it. I'm fairly new to lists, loops, etc.
I have a small dataset:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df
I need to loop through this dataset and create a list of datasets, such that this is the outcome:
[[1]]
one
[[2]]
one
two
[[3]]
one
two
three
This is more or less as far as I've gotten:
blah <- list()
for(i in 1:3){
blah[[i]]<- i
}
The length will be variable when I use this in the future, so I need to automate it in a loop. Otherwise, I would just do
one <- df[1,]
two <- df[2,]
list(one, rbind(one, two))
Any ideas?
You can try using lapply:
result <- lapply(seq(nrow(df)), function(x) df[seq_len(x), , drop = FALSE])
result
#[[1]]
# df
#1 one
# [[2]]
# df
#1 one
#2 two
#[[3]]
# df
#1 one
#2 two
#3 three
#[[4]]
# df
#1 one
#2 two
#3 three
#4 four
seq(nrow(df)) creates a sequence from 1 to the number of rows in your data (which is 4 in this case). The function(x) part is what's called an anonymous function, to which each value from 1 to 4 is passed one by one. seq_len(x) creates a sequence from 1 to x, i.e. 1 to 1 in the first iteration, 1 to 2 in the second, and so on. We use this sequence to subset the rows from the dataframe (df[seq_len(x), ]). Since the dataframe has only 1 column, when we subset it, it gets simplified to a vector. To avoid that we add drop = FALSE.
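To see the drop = FALSE point in isolation, a minimal illustration with the same one-column df:
df[1:2, ]                # simplifies to a plain vector
df[1:2, , drop = FALSE]  # stays a one-column data frame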
Base R solution:
# Coerce df vector of data.frame to character, store as new data.frame: str_df => data.frame
str_df <- transform(df, df = as.character(df))
# Allocate some memory in order to split data into a list: df_list => empty list
df_list <- vector("list", nrow(str_df))
# Split the string version of the data.frame into a list as required:
# df_list => list of character vectors
df_list <- lapply(seq_len(nrow(str_df)), function(i){
  str_df[if(i == 1){1}else{1:i}, grep("df", names(str_df))]
})
Data:
df <- c("one","two","three","four")
df <- as.data.frame(df)
df

R: Replacing values of one list element with values of a second list element

I want to replace the values of one element of a list with the values of a second element of a list. Specifically,
I have a list containing multiple data sets.
Each data set has 2 variables
The variables are factors
The n'th element of the second variable of each data set needs to be replaced with the n'th element of the first variable in each data set
Also, the replaced value should be called "replaced"
dat1 <- data.frame(names1 =c("a", "b", "c", "f", "x"),values= c("val1_1", "val2_1", "val3_1", "val4_1", "val5_1"))
dat1$values <- as.factor(dat1$values)
dat2 <- data.frame(names1 =c("a", "b", "f2", "s5", "h"),values= c("val1_2", "val2_2", "val3_2", "val4_2", "val5_2"))
dat2$values <- as.factor(dat2$values)
list1 <- list(dat1, dat2)
The result should be the same list, but just with the 5th value replaced.
[[1]]
names1 values
1 a val1_1
2 b val2_1
3 c val3_1
4 f val4_1
5 replaced x
[[2]]
names1 values
1 a val1_2
2 b val2_2
3 f2 val3_2
4 s5 val4_2
5 replaced h
A base R approach using lapply. Since both the columns are factors, we need to add new levels first before replacing them with new values, otherwise those values would turn into NAs.
n <- 5
lapply(list1, function(x) {
  levels(x$values) <- c(levels(x$values), as.character(x$names1[n]))
  x$values[n] <- x$names1[n]
  levels(x$names1) <- c(levels(x$names1), "replaced")
  x$names1[n] <- "replaced"
  x
})
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
There is also another approach where we convert both the columns to characters, replace the values at the required position and convert them back to factors (see the sketch below). But since every dataframe in the list can be huge, we do not want to convert all the values to characters and then back to factors just to change one value, as that could be computationally very expensive.
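For reference, a minimal sketch of that character-conversion route (same n as above; shown only to make the trade-off concrete, not as a recommendation):
n <- 5
lapply(list1, function(x) {
  # convert both factor columns to character, replace, then refactor
  x$values <- as.character(x$values)
  x$names1 <- as.character(x$names1)
  x$values[n] <- x$names1[n]
  x$names1[n] <- "replaced"
  x$values <- factor(x$values)
  x$names1 <- factor(x$names1)
  x
})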
Here is one option with the tidyverse. Loop through the list with map, slice the row of interest (in this case it is the last row, so n() can be used), mutate the column value and bind with the original data without the last row:
library(tidyverse)
map(list1, ~ .x %>%
  slice(n()) %>%
  mutate(values = names1, names1 = 'replaced') %>%
  bind_rows(.x %>% slice(-n()), .))
#[[1]]
# names1 values
#1 a val1_1
#2 b val2_1
#3 c val3_1
#4 f val4_1
#5 replaced x
#[[2]]
# names1 values
#1 a val1_2
#2 b val2_2
#3 f2 val3_2
#4 s5 val4_2
#5 replaced h
Or it can be made more compact with fct_c from forcats. Different factor levels can be combined together with fct_c for the 'values' and 'names1' columns:
library(forcats)
map(list1, ~ .x %>%
  mutate(values = fct_c(values[-n()], names1[n()]),
         names1 = fct_c(names1[-n()], factor('replaced'))))
Or use a similar approach with base R, where we loop through the list with lapply, convert the data.frame to a matrix, rbind the subset of the matrix (i.e. with the last row removed) with the values of interest, and convert back to a data.frame (under R < 4.0.0, stringsAsFactors = TRUE by default, so the columns get converted back to factors):
lapply(list1, function(x) as.data.frame(rbind(as.matrix(x)[-5, ],
    c('replaced', as.character(x$names1[5])))))
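Note that since R 4.0.0, as.data.frame() defaults to stringsAsFactors = FALSE, so on a current R you would have to ask for factors explicitly:
lapply(list1, function(x) as.data.frame(rbind(as.matrix(x)[-5, ],
    c('replaced', as.character(x$names1[5]))), stringsAsFactors = TRUE))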

Using apply in R to extract rows from a dataframe

Using R, I have to extract specific rows from a data frame depending on certain conditions. The data frame is large (5.5 million rows by 251 columns), but I have given the code below to create a sample data frame.
df <- data.frame("Name" = c("Name1", "Name1", "Name1", "Name1","Name1" ), "Value"=c("X", "X", "Y", "Y", "X"))
I need to step through the entire data frame row by row starting at the top, and when the value of the 'Value' column changes from X to Y or Y to X, I need to extract that row and the next row and append them to another data frame. For example, in the data frame above, the Value column of row 2 is X and that of row 3 is Y; since the value has changed from X to Y, I need to extract the entire row 2 and row 3 and add them to another data frame.
The result of the operations can be seen by running the code below
dfextract <- data.frame("Name" = c("Name1", "Name1"), "Value"=c("X", "Y"))
Currently I have used a 'for' loop to step from row to row and extract the rows when the values don't match. But it is very slow and inefficient. The code snippet is below:
for (i in 1:count) {
  if (df[[i+1, 2]] != df[i, 2]) {
    dfextract <- rbind(dfextract, df[i, ])
    dfextract <- rbind(dfextract, df[i+1, ])
  }
}
I am looking for a better and faster solution to the above situation. Perhaps using the functions belonging to the family of 'apply()' or using 'by()'. Any help would be greatly appreciated.
Thanks in advance.
Maybe the following does it. Note that there are two lapply-based loops, in order to account for changes in the values of column Name.
diffstr <- function(x) x[-1] == x[-length(x)]
res <- lapply(split(df, df$Name), function(x) {
  inx <- which(c(FALSE, !diffstr(x$Value)))
  do.call(rbind, lapply(inx, function(i) x[(i - 1):i, ]))
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
How it works.
First, I define a helper function diffstr. It compares all values of x but the first with all values of x but the last. Note that x[-1] is the vector x[2], x[3], ..., x[length(x)]; negative indices remove that element from the vector. The same goes for x[-length(x)]: the negative index removes the last element of x.
split(df, df$Name) splits the data frame into subsets, each with its own Name.
I then lapply an unnamed function to these subsets. This function's argument x will be each of the sub-data frames mentioned above.
That function starts by determining where in x$Value the changes are. This is done with the call to the helper function diffstr. I have to prepend a FALSE to the return value because the first row has no previous row to change from.
The next line is a tricky one. Use lapply on the indices of the change points inx, and for each one get a two-row segment of the data frame x. Then use do.call to rbind those two-row data frames back together.
Now res is a list, with one sub-data frame for each Name (done with the split). So it needs to be put back together with another call to do.call(rbind, ...).
Final tidy up. The whole process messed up the data frame's row names. Setting them to NULL is just a well-known trick that forces R to renumber the rows.
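If it helps, here is diffstr in isolation on a small vector (just an illustration):
diffstr(c("X", "X", "Y", "Y", "X"))
# [1]  TRUE FALSE  TRUE FALSE
# so !diffstr(...) is TRUE exactly where a value differs from the one before it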
That's it. If you need more explanations, just say so.
We can use dplyr. lag can shift the rows by 1, so we can use Value != lag(Value) to check whether a value is different from the previous one. which(Value != lag(Value)) converts the result to row numbers. After that, sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))) makes sure we also get the row numbers of the preceding rows. Finally, slice can subset the data frame based on the row numbers provided.
library(dplyr)
df2 <- df %>%
  slice(sort(unique(unlist(lapply(which(Value != lag(Value)), function(x) c(x, x - 1))))))
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
If the code is too long to read, you can also calculate the index before using the slice function as follows.
library(dplyr)
ind <- which(df$Value != lag(df$Value))
ind2 <- sort(unique(c(ind, ind - 1)))
df2 <- df %>% slice(ind2)
df2
# A tibble: 4 x 2
Name Value
<fctr> <fctr>
1 Name1 X
2 Name1 Y
3 Name1 Y
4 Name1 X
Using base R, I would probably use row indices together with diff:
df <- data.frame(colA = c(1, 1, 1, 2, 1, 1, 1, 3, 3, 3, 1, 1),
                 colB = 1:12)
keep <- which(diff(df$colA) != 0)
df[unique(c(keep, keep+1)), ]
colA colB
3 1 3
4 2 4
7 1 7
10 3 10
5 1 5
8 3 8
11 1 11
There is probably a faster option though.
When you have a large dataset, speed might be the bottleneck. In this case data.table might be the best option for you.
Using the data.table-library, I would solve it like so:
library(data.table)
dt <- data.table(Name  = c("Name1", "Name1", "Name1", "Name1", "Name1"),
                 Value = c("X", "X", "Y", "Y", "X"))
# look if Value changes to the next instance
dt[, idx := Value != shift(Value, 1, fill = dt$Value[1])]
# filter the rows where the index changes and the next value
# and deselect the variable idx
dt[idx | shift(idx, 1)][, .(Name, Value)]
#> Name Value
#> 1: Name1 Y
#> 2: Name1 Y
#> 3: Name1 X
Why does it give an odd number of rows and not an even number?
Well, that is because in your data example the value changes in the last row, so that row is selected, but there is no next row left to select along with it.

How to subset a large data frame through FOR loops and print the desired result?

I have a data frame that looks something like this:
x y
1 a
1 b
1 c
1 NA
1 NA
2 d
2 e
2 NA
2 NA
And my desired output should be a data frame that displays the count of complete cases of Y (that is, the non-NA values) with the corresponding X. So if, say, Y has 2500 complete observations for X = 1, and 557 observations for X = 2, I should get this simple data frame:
x y(c.cases)
1 2500
2 557
Currently my function performs well, but only for a single X. When I give X as a range (for ex. 30:25), I get the sum over all the specified Xs combined instead of individual counts of complete observations for each X. This is an outline of my function:
complete <- function(){
  files <- file.list()
  dat <- c()    # Creates an empty vector
  Y <- c()      # Empty vector that will list down the Ys
  result <- c()
  for(i in c(X)){
    dat <- rbind(dat, read.csv(files[i]))
  }
  dat_subset_Y <- dat[which(dat[, 'X'] %in% x), ]
  Y <- c(Y, sum(complete.cases(dat)))
  result <- cbind(X, Y)
  print(result)
}
There are no errors or warning messages, just wrong results when X is a range.
We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'x', get the count of all non-NA elements (sum(!is.na(y))).
library(data.table)
setDT(df1)[, list(y=sum(!is.na(y))), by = x]
Or another option is table
with(df1, table(x, !is.na(y)))
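The table call counts both the FALSE and the TRUE cases per x; to get exactly the two-column data frame from the question, one small shaping step on top of it could be (a sketch, assuming the data is in df1 as above):
tab <- with(df1, table(x, !is.na(y)))
data.frame(x = rownames(tab), y = tab[, "TRUE"])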
No need for that loop:
library(dplyr)
df %>%
  filter(complete.cases(.)) %>%
  group_by(x) %>%
  summarise(sumy = length(y))
Or
df %>%
  group_by(x) %>%
  summarise(sumy = sum(!is.na(y)))

R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2 because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list(), list(), list(), list(), list(), list()),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# Column template set up
list_template <- Reduce(
  function(l, col) { l[[col]] <- col; l },
  unique(dt$variable),
  list())
# Expression set up
q <- substitute({
  l <- lapply(
    list_template,
    function(col) .SD[variable == as.character(col)]$value)
  l$data = .SD[1, ]$data
  l
}, list(list_template = list_template))
# Roll up
dt[, eval(q), by=list(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
This old question piqued my curiosity, as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
Reshape the non-list columns from long to wide format
Aggregate the list column data grouped by x and y
Join the two partial results on x and y
but uses the features of the actual data.table syntax, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
  dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
The list column data is aggregated by taking the first element. This is in line with OP's code line
l$data = .SD[1,]$data
which also picks the first element.
I have a somewhat cheating method that might do the trick - importantly, I assume that each x, y, list combination is unique! If not, please disregard.
I'm going to create two separate datatables, the first which is dcasted without the data list objects, and the second which has only the unique data list objects and a key. Then just merge them together to get the desired result.
require(data.table)
require(stringr)
require(reshape2)
x <- LETTERS[1:3]
dt <- data.table(
  x = rep(x, each = 2),
  y = 'd',
  data = list(list("a","b"), list("c","d")),
  variable = rep(c('var.1', 'var.2'), 3),
  value = seq(1, 6)
)
# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist, x + y ~ variable, value.var = "value"),
                       key = c("x", "y"))
# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]
# Rows are duplicated so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there's list objects in the data.table.
# Instead so I cheat by applying a value to each row within an x,y "group"
# that is unique within EACH group, but present within EVERY group.
# Then just simply subselect based on that unique value.
# I've chosen rank(), but no doubt there's other options
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]
# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))
# drop the rank since we no longer need it
dt_list[,rank:=NULL]
# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
