R: efficiently merge 1000+ variables

I have 1000+ datasets with the exact same dimensions and the same column a that I need to load from the web (using jsonlite) and then merge. I can choose the data.frame names but not change the data itself. I could do it all manually but there might be a more efficient way to do this. Let me show what I mean with this example of three datasets.
cola <- c(1, 2, 3, 4)
x0001 <- c(10, 11, 12, 13)
x0002 <- c(20, 22, 25, 29)
x0003 <- c(30, 31, 33, 38)
df0001 <- data.frame(cola, x0001)
colnames(df0001) <- c("A","B")
df0002 <- data.frame(cola, x0002)
colnames(df0002) <- c("A","B")
df0003 <- data.frame(cola, x0003)
colnames(df0003) <- c("A","B")
# data.frame names do not matter to me
alldata <- Reduce(function(x,y) merge(x=x, y=y, by="A"), list(df0001, df0002, df0003))
colnames(alldata) <- c("A", "df0001", "df0002", "df0003")
The merge into alldata and the colnames() call above would be very long if I wrote them manually by listing all 1000+ variables. Maybe there is a better way, perhaps with a loop?

If the objects are all loaded in memory, you can load all the objects into a list with the mget and ls(pattern = ...) functions.
dfs <- mget(ls(pattern = "df[0-9]+"))
dfs
#$df0001
# A B
#1 1 10
#2 2 11
#3 3 12
#4 4 13
#
#...
#
#$df0003
# A B
#1 1 30
#2 2 31
#3 3 33
#4 4 38
If the data.frames always have the same columns, in the same order, you can use do.call:
cbind(dfs[[1]],do.call(cbind,lapply(dfs[-1],`[`,,-1)))
# A B df0002 df0003
#1 1 10 20 30
#2 2 11 22 31
#3 3 12 25 33
#4 4 13 29 38
Otherwise, you can use Reduce:
Reduce(function(x,y) merge(x,y,by = "A"), dfs)
# A B.x B.y B
#1 1 10 20 30
#2 2 11 22 31
#3 3 12 25 33
#4 4 13 29 38
The drawback of Reduce is that every merge step builds a new intermediate data.frame, which results in significant memory allocation.
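If the data.frames do not exist yet, the list can also be built directly while loading, which avoids creating 1000+ separate objects and the mget step. A minimal sketch, assuming a hypothetical loader make_df() that stands in for the real jsonlite::fromJSON() call; it renames the value column before merging, so no separate colnames() call is needed afterwards:

```r
# Names for the data sets; in practice this could be 1:1000 or a vector of URLs
ids <- sprintf("df%04d", 1:3)

# Hypothetical stand-in for the real download, e.g. jsonlite::fromJSON(url)
make_df <- function(i) data.frame(A = 1:4, B = i * 10 + 0:3)

# Load everything into one named list instead of separate objects
dfs <- setNames(lapply(seq_along(ids), make_df), ids)

# Rename the value column to the list name, so merge() adds no .x/.y suffixes
dfs <- Map(function(d, nm) setNames(d, c("A", nm)), dfs, names(dfs))

alldata <- Reduce(function(x, y) merge(x, y, by = "A"), dfs)
names(alldata)
# [1] "A"      "df0001" "df0002" "df0003"
```

The same pattern should scale to the full set by changing ids and the body of make_df; the memory caveat about Reduce still applies.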

Related

How to get the sum of rows using a vector and then put the result in a column

I have a data frame, and for every row I want to sum the variables listed in a vector and store the sum in a new variable. The name of the new variable should be built from the names of the variables in the vector.
for example
data
Name A_12 B_12 C_12 D_12 E_12
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 16 9 6
r4 7 8 0 7 18
Let's say I have two vectors:
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
The result I want is:
New_data >
Name A_12 B_12 C_12 ABC_12 D_12 E_12 BCDE_12
r1 1 5 12 18 21 15 54
r2 2 4 7 13 10 9 32
r3 5 15 16 36 9 6 45
r4 7 8 0 15 7 18 40
I wrote a for loop to get the row sums for each vector, but I didn't get the correct result.
Please tell me if you need any more information or clarification.
Thank you
You can use rowSums and simple column-subsetting:
dat$ABC_12 <- rowSums(dat[,vector_1])
dat$BCDE_12 <- rowSums(dat[,vector_2])
dat
# Name A_12 B_12 C_12 D_12 E_12 ABC_12 BCDE_12
# 1 r1 1 5 12 21 15 18 53
# 2 r2 2 4 7 10 9 13 30
# 3 r3 5 15 16 9 6 36 46
# 4 r4 7 8 0 7 18 15 33
Note that if your frames inherit from data.table, then you'll need to use either subset(dat, select=vector_1) or dat[,..vector_1] instead of simply dat[,vector_1]; if you aren't already using data.table, then you can safely ignore this paragraph.
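For reference, a self-contained version of the rowSums approach, with the example data typed in as a plain data.frame (the name dat is assumed, as in the code above). Note that the BCDE_12 values it produces differ from the ones in the question's expected output, which do not add up (for r1, 5 + 12 + 21 + 15 is 53, not 54):

```r
dat <- data.frame(
  Name = c("r1", "r2", "r3", "r4"),
  A_12 = c(1, 2, 5, 7),
  B_12 = c(5, 4, 15, 8),
  C_12 = c(12, 7, 16, 0),
  D_12 = c(21, 10, 9, 7),
  E_12 = c(15, 9, 6, 18)
)
vector_1 <- c("A_12", "B_12", "C_12")
vector_2 <- c("B_12", "C_12", "D_12", "E_12")

# Select the columns by name and sum across each row
dat$ABC_12  <- rowSums(dat[, vector_1])
dat$BCDE_12 <- rowSums(dat[, vector_2])
dat
```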
Like this (using dplyr/tidyverse)
df %>%
rowwise() %>%
mutate(
ABC_12 = sum(c_across(all_of(vector_1))),
BCDE_12 = sum(c_across(all_of(vector_2)))
)
Though I'm not sure the sums are correct in your example
-=-=-=EDIT-=-=-=-
Here's a function to help with the naming.
ex_fun <- function(vec, n_len){
paste0(paste(substr(vec,1,n_len), collapse = ""), substr(vec[1],n_len+1,nchar(vec[1])))
}
Which can then be implemented like so.
df %>%
rowwise() %>%
mutate(
!!ex_fun(vector_1, 1) := sum(c_across(all_of(vector_1))),
!!ex_fun(vector_2, 1) := sum(c_across(all_of(vector_2)))
)
-=-= Extra note -=--=
If you put your vectors in a list, you could then combine this with r2evans's answer and wrap it in a loop if you prefer.
vectors = list(vector_1, vector_2)
for (v in vectors){
df[ex_fun(v, 1)] <- rowSums(df[,v])
}
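To see which names the helper produces, it can be called on its own. The sketch below repeats ex_fun from above so that it runs standalone and applies the loop to a small example (the data.frame name df is taken from the answer; only the columns needed are included):

```r
# Same helper as above: first n_len characters of every name, pasted together,
# followed by the remainder of the first name
ex_fun <- function(vec, n_len) {
  paste0(paste(substr(vec, 1, n_len), collapse = ""),
         substr(vec[1], n_len + 1, nchar(vec[1])))
}

vector_1 <- c("A_12", "B_12", "C_12")
vector_2 <- c("B_12", "C_12", "D_12", "E_12")

ex_fun(vector_1, 1)  # "ABC_12"
ex_fun(vector_2, 1)  # "BCDE_12"

df <- data.frame(A_12 = c(1, 2), B_12 = c(5, 4), C_12 = c(12, 7),
                 D_12 = c(21, 10), E_12 = c(15, 9))
for (v in list(vector_1, vector_2)) {
  df[ex_fun(v, 1)] <- rowSums(df[, v])
}
names(df)
```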
I believe this might work, so long as the column names differ only in their first character:
library("tidyverse")
#Input dataframe.
data <- data.frame(Name =c("r1", "r2", "r3", "r4"), A_12 = c(1, 2, 5, 7), B_12 = c(5, 4, 15, 8),
C_12 = c(12, 7, 16, 0), D_12 = c(21, 10, 9, 7), E_12 = c(15, 9, 6, 18))
#add all vectors to the "vectors" list. I have added vector_1 and vector_2, but
#there can be as many vectors as needed, they just need to be put in the list.
vector_1 <- c("A_12","B_12","C_12")
vector_2 <- c("B_12","C_12","D_12","E_12")
vector_list<-list(vector_1, vector_2)
vector_sum <- function(data, vector_list){
output <- data |>
dplyr::select(1, all_of(vector_list[[1]]))
for (i in vector_list) {
name1 <- substring(as.character(i), 1,1) |> paste(collapse = '')
name2 <- substring(as.character(i[1]), 2)
input_temp <- dplyr::select(data, all_of(i))
input_temp <- mutate(input_temp, temp=rowSums(input_temp))
names(input_temp)[names(input_temp) == "temp"] <- paste0(name1, name2) # paste() would insert a space: "ABC _12"
output = cbind(output, input_temp)
}
output[, !duplicated(colnames(output))]
}
vector_sum(data, vector_list)

Use apply on two data.frame's

If I had a data.frame X and wanted to apply a function foo to each of its rows, I would just run apply(X, 1, foo). This is all well-known and simple.
Now imagine I have another data.frame Y and the following function:
mean_of_sum <- function(x,y) {
return(mean(x+y))
}
Is there a way to write an "apply equivalent" to the following loop:
my_loop_fun <- function(X, Y) {
results <- numeric(nrow(X))
for(i in seq_along(results)) {
results[i] <- mean_of_sum(X[i,], Y[i,])
}
return(results)
}
If such an "apply syntax" exists, would it be more efficient than my "good" old loop?
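This should work:
sapply(seq_len(nrow(X)), function(i) mean_of_sum(X[i,], Y[i,]))
You apply the function over the sequence 1, 2, ..., n (where n is the number of rows), and in each iteration you evaluate mean_of_sum for the i-th row. Note that if X and Y are data.frames rather than matrices, mean_of_sum has to convert its arguments with as.numeric(), otherwise mean() returns NA with a warning.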
We can split every row of X and Y into a list and use mapply to apply the function, changing mean_of_sum slightly so that it converts each one-row data frame to numeric:
mean_of_sum <- function(x,y) {
return(mean(as.numeric(x) + as.numeric(y)))
}
Consider an example,
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)
mapply(mean_of_sum, split(X, seq_len(nrow(X))), split(Y, seq_len(nrow(Y))))
# 1 2 3 4 5
#17 19 21 23 25
where X and Y are
X
# a b
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Y
# c d
#1 11 16
#2 12 17
#3 13 18
#4 14 19
#5 15 20
So the first value, 17, is computed as
mean(c(1 + 11, 6 + 16))
#[1] 17
and so on for next values.
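For this particular function no row-wise apply is needed at all, because the mean of the element-wise sum can be computed for all rows at once. A sketch using the same X and Y; it only applies when the function can be expressed with vectorized operations such as rowMeans, otherwise sapply or mapply as above remain the general tools:

```r
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)

# Add the two tables element-wise (as matrices), then average each row
res <- rowMeans(as.matrix(X) + as.matrix(Y))
res
```

On large inputs this kind of whole-table arithmetic is typically much faster than calling a function once per row.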

Calculate mean of specific row pattern

I have a dataframe like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
I want to calculate means of the value column over specific sets of rows.
The pattern of the rows is pretty complicated:
Rows of MeanA1: 1, 5, 9
Rows of MeanA2: 2, 6, 10
Rows of MeanA3: 3, 7, 11
Rows of MeanA4: 4, 8, 12
Rows of MeanB1: 13, 17, 21
Rows of MeanB2: 14, 18, 22
Rows of MeanB3: 15, 19, 23
Rows of MeanB4: 16, 20, 24
Rows of MeanC1: 25, 29, 33
Rows of MeanC2: 26, 30, 34
Rows of MeanC3: 27, 31, 35
Rows of MeanC4: 28, 32, 36
Rows of MeanD1: 37, 41, 45
Rows of MeanD2: 38, 42, 46
Rows of MeanD3: 39, 43, 47
Rows of MeanD4: 40, 44, 48
As you can see, it starts at 4 different points (1, 13, 25, 37), then always steps +4, and for the following 4 means it just moves 1 row further down.
I would like to have an output of all these means in one list.
Any ideas? NOTE: In this example the mean is of course always the middle number, but my real df is different.
Not quite sure about the output format you require, but the following code calculates what you want.
calc_mean1 <- function(x) mean(test$value[seq(x, by = 4, length.out = 3)])
calc_mean2 <- function(x){sapply(x:(x+3), calc_mean1)}
output <- lapply(seq(1, 37, 12), calc_mean2)
names(output) <- paste0('Mean', LETTERS[seq_along(output)]) # remove this line if more than 26 groups.
output
## $MeanA
## [1] 5 6 7 8
## $MeanB
## [1] 17 18 19 20
## $MeanC
## [1] 29 30 31 32
## $MeanD
## [1] 41 42 43 44
An idea via base R is to create a grouping variable for every 4 rows, split the data every 12 rows (nrow(test) / 4) and aggregate to find the mean, i.e.
test$new = rep(1:4, nrow(test)%/%4)
lapply(split(test, rep(1:4, each = nrow(test) %/% 4)), function(i)
aggregate(value ~ new, i, mean))
# $`1`
# new value
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
# $`2`
# new value
# 1 1 17
# 2 2 18
# 3 3 19
# 4 4 20
# $`3`
# new value
# 1 1 29
# 2 2 30
# 3 3 31
# 4 4 32
# $`4`
# new value
# 1 1 41
# 2 2 42
# 3 3 43
# 4 4 44
And yet another way.
fun <- function(DF, col, step = 4){
run <- nrow(DF)/step^2
res <- lapply(seq_len(step), function(inc){
inx <- seq_len(run*step) + (inc - 1)*run*step
dftmp <- DF[inx, ]
tapply(dftmp[[col]], rep(seq_len(step), run), mean, na.rm = TRUE)
})
names(res) <- sprintf("Mean%s", LETTERS[seq_len(step)])
res
}
fun(test, 2, 4)
#$MeanA
#1 2 3 4
#5 6 7 8
#
#$MeanB
# 1 2 3 4
#17 18 19 20
#
#$MeanC
# 1 2 3 4
#29 30 31 32
#
#$MeanD
# 1 2 3 4
#41 42 43 44
Since you said you wanted a long list of the means, I assumed it could also be a vector where you just have all these values. You would get that like this:
V1 = paste0("AB", seq(1:48))
V2 = seq(1:48)
test = data.frame(name = V1, value = V2)
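# First row of each mean: 1-4, 13-16, 25-28, 37-40 (looping over every row would also average unwanted combinations)
starts <- c(outer(0:3, seq(1, nrow(test) - 11, by = 12), "+"))
meanVector <- NULL
for (i in starts) {
x <- c(test$value[i], test$value[i+4], test$value[i+8])
m <- mean(x)
meanVector <- c(meanVector, m)
}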
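Because the pattern is regular (an offset within a group of 4, three rows that are 4 apart, four blocks of 12 rows), the value column can also be reshaped into a 4 x 3 x 4 array and averaged over the middle dimension. This is a sketch that assumes the rows are already in the order shown in the question; the Mean names are built to match the question's labels:

```r
test <- data.frame(name = paste0("AB", 1:48), value = 1:48)

# dim 1: offset within the group of 4
# dim 2: the three rows that are 4 apart
# dim 3: the four blocks of 12 rows (A, B, C, D)
arr <- array(test$value, dim = c(4, 3, 4))

# Average over dim 2; the result is a 4 x 4 matrix (offset x block)
m <- apply(arr, c(1, 3), mean)

# Flatten column by column: A1..A4, B1..B4, C1..C4, D1..D4
means <- setNames(as.vector(m),
                  paste0("Mean", rep(LETTERS[1:4], each = 4), 1:4))
means[c("MeanA1", "MeanB2", "MeanD4")]
```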

Creating a vector from data.table row without using apply

Let's say I want to create a column in a data.table, in which the value in each row is equal to the standard deviation of the values in three other cells in the same row. E.g., if I make
DT <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
DT
a b c d
1: 1 5 13 25
2: 2 7 16 29
3: 3 9 19 33
4: 4 11 22 37
and I'd like to add a column that contains the standard deviation of a, b, and d for each row, like this:
a b c d abdSD
1: 1 5 13 25 12.86
2: 2 7 16 29 14.36
3: 3 9 19 33 15.87
4: 4 11 22 37 17.39
I could of course write a for-loop or use an apply function to calculate this. Unfortunately, what I actually want to do needs to be applied to millions of rows, isn't as simple a function as calculating a standard deviation, and needs to finish within a fraction of a second, so I really need a vectorized solution. I want to write something like
DT[, abdSD := sd(c(a, b, d))]
but unfortunately that doesn't give the right answer. Is there any data.table syntax that can create a vector out of different values within the same row, and make that vector accessible to a function populating a new cell within that row? Any help would be greatly appreciated. @Arun
Depending on the size of your data, you might want to convert the data into a long format, then calculate the result as follows:
complexFunc <- function(x) sd(x)
cols <- c("a", "b", "d")
rowres <- melt(DT[, rn:=.I], id.vars="rn", variable.factor=FALSE)[,
list(abdRes=complexFunc(value[variable %chin% cols])), by=.(rn)]
DT[rowres, on=.(rn)]
or if your complex function has 3 arguments, you can do something like
DT[, abdSD := mapply(complexFunc, a, b, d)]
As @Frank mentioned, I could avoid adding a column by doing by=1:nrow(DT)
DT[, abdSD:=sd(c(a,b,d)),by=1:nrow(DT)]
output:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
If you add a row id column, it is very easy:
DT$row_id<-row.names(DT)
Then simply grouping with by=row_id gets you the result you want:
DT[, abdSD:=sd(c(a,b,d)),by=row_id]
Result would have:
a b c d row_id abdSD
1: 1 5 13 25 1 12.85820
2: 2 7 16 29 2 14.36431
3: 3 9 19 33 3 15.87451
4: 4 11 22 37 4 17.38774
If you want row_id removed, simply add [,row_id:=NULL]:
DT[, abdSD:=sd(c(a,b,d)),by=row_id][,row_id:=NULL]
This one line gets everything you want:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
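You just have to do it by row.
Without by, data.table evaluates the expression once over the whole columns, so sd(c(a, b, d)) yields a single value for the entire table; grouping by a unique row id forces one evaluation per row. It's a bit tricky.
Hope this helps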
I think you should try the matrixStats package:
library(data.table)
library(matrixStats)
#sample data
dt <- data.table(a = 1:4, b = c(5, 7, 9, 11), c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
dt
Output is:
a b c d abdSD
1: 1 5 13 25 12.85820
2: 2 7 16 29 14.36431
3: 3 9 19 33 15.87451
4: 4 11 22 37 17.38774
Not an answer, but just trying to show the difference between using apply and the solution provided by Prem above:
I have blown up the sample data to 40,000 rows to show solid time differences:
library(data.table)
library(matrixStats)
#sample data
dt <- data.table(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
df <- data.frame(a = 1:40000, b = rep(c(5, 7, 9, 11),10000), c = rep(c(13, 16, 19, 22),10000), d = rep(c(25, 29, 33, 37),10000))
t0 = Sys.time()
dt[, `:=`(abdSD = rowSds(as.matrix(.SD), na.rm=T)), .SDcols=c('a','b','d')]
print(paste("Time taken for data table operation = ",Sys.time() - t0))
# [1] "Time taken for data table operation = 0.117115020751953"
t0 = Sys.time()
df$abdSD <- apply(df[,c("a","b","d")],1, function(x){sd(x)})
print(paste("Time taken for apply operation = ",Sys.time() - t0))
# [1] "Time taken for apply operation = 2.93488311767578"
Using DT and matrixStats clearly wins the race
It's not hard to vectorize the sd for this situation:
vecSD = function(x) {
n = ncol(x)
sqrt((n/(n-1)) * (Reduce(`+`, x*x)/n - (Reduce(`+`, x)/n)^2))
}
DT[, vecSD(.SD), .SDcols = c('a', 'b', 'd')]
#[1] 12.85820 14.36431 15.87451 17.38774
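The function above rests on the identity var(x) = n/(n-1) * (mean(x^2) - mean(x)^2), applied to whole columns so that every row is handled at once. A quick sanity check against a row-wise apply, written as a sketch on a plain data.frame so that it needs no packages (the name df is only for illustration). For data whose mean is very large relative to its spread, this one-pass formula can lose precision, so it is worth comparing against sd() on a sample of the real data:

```r
vecSD <- function(x) {
  n <- ncol(x)
  # sample variance from the column-wise sums of x^2 and x
  sqrt((n / (n - 1)) * (Reduce(`+`, x * x) / n - (Reduce(`+`, x) / n)^2))
}

df <- data.frame(a = 1:4, b = c(5, 7, 9, 11),
                 c = c(13, 16, 19, 22), d = c(25, 29, 33, 37))
cols <- c("a", "b", "d")

fast <- vecSD(df[cols])          # vectorized over all rows
slow <- apply(df[cols], 1, sd)   # one sd() call per row

all.equal(fast, unname(slow))
```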

How to subtract a column by row?

I want to do a simple subtraction in R, but I don't know how to solve it. I would like to know if I have to write a loop or if there is a function for it.
I have a column of numeric values, and I would like to subtract the value in row n-1 from the value in row n.
Time_Day Diff
10 10
15 5
45 30
60 15
Thus, I would like to compute the variable "Diff".
You can also try with the dplyr package:
library(dplyr)
mutate(df, dif=Time_Day-lag(Time_Day))
# Time_Day Diff dif
# 1 10 10 NA
# 2 15 5 5
# 3 45 30 30
# 4 60 15 15
Does this do what you need?
Here we save the column as a variable:
c <- c(10, 15, 45, 60)
Now we add a 0 to the beginning and then cut off the last element:
cm1 <- c(0, c)[1:length(c)]
Now we subtract the two:
dif <- c - cm1
If we print that out, we get what you're looking for:
dif # 10 5 30 15
With diff :
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df$Diff <- c(df$Time_Day[1], diff(df$Time_Day))
df
## Time_Day Diff
##1 10 10
##2 15 5
##3 45 30
##4 60 15
It works fine in dplyr too :
library("dplyr")
df <- data.frame(Time_Day = c(10, 15, 45, 60))
df %>% mutate(Diff = c(Time_Day[1], diff(Time_Day)))
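The shift-and-subtract and diff() approaches give the same result (the lag() version differs only in returning NA for the first row). A small base R check, assuming, as in the question's expected output, that the first row should keep its own value:

```r
x <- c(10, 15, 45, 60)

# Shift the column down by one row, padding with 0 at the top
lagged <- c(0, head(x, -1))

d1 <- x - lagged          # manual shift and subtract
d2 <- c(x[1], diff(x))    # diff() plus the first value

identical(d1, d2)

# cumsum() undoes the differencing and recovers the original column
cumsum(d2)
```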
