Using Reduce function to merge recursively [duplicate]


This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 6 years ago.
I have a list of lists, where each inner list contains a set of data frames. I want to merge the data frames within each inner list, but I don't want to merge all the lists together. For example:
list<- list(list(df1_2010,df2_2010,df3_2010), list(df1_2011,df2_2011,df3_2011), list(df1_2012,df2_2012,df3_2012))
I want to merge all the 2010 data frames together by, let's say, the column ID. Likewise, I want to merge the 2011 data frames together by the corresponding ID column, and all the 2012 data frames together by their corresponding ID column.
I want to output a list of merged dataframes by year:
list(df2010, df2011, df2012)
Here's a schematic of how I want to use the Reduce function:
f<-function(b) merge(...,by="ID",all.x=T)
list<- Reduce(f, list)
But I think this will merge all three lists together instead of each list separately. Let me know your suggestions.

Here's a simple reproducible example that I think maps onto your structure:
n <- 5
set.seed(n)
l <- list( list( data.frame(ID = 1:5, a = rnorm(n)),
data.frame(ID = 1:5, b = rnorm(n)),
data.frame(ID = 1:5, c = rnorm(n)),
data.frame(ID = 1:5, d = rnorm(n)) ),
list( data.frame(ID = 1:5, a = rnorm(n)),
data.frame(ID = 1:5, b = rnorm(n)),
data.frame(ID = 1:5, c = rnorm(n)),
data.frame(ID = 1:5, d = rnorm(n)) ),
list( data.frame(ID = 1:5, a = rnorm(n)),
data.frame(ID = 1:5, b = rnorm(n)),
data.frame(ID = 1:5, c = rnorm(n)),
data.frame(ID = 1:5, d = rnorm(n)) ))
You can write an lapply-based function that uses Reduce on each element of the list:
out <-
lapply(l, function(x) Reduce(function(...) merge(..., by = "ID", all.x = TRUE), x))
And you should get a list of merged dataframes:
str(out)
List of 3
$ :'data.frame': 5 obs. of 5 variables:
..$ ID: int [1:5] 1 2 3 4 5
..$ a : num [1:5] -0.8409 1.3844 -1.2555 0.0701 1.7114
..$ b : num [1:5] -0.603 -0.472 -0.635 -0.286 0.138
..$ c : num [1:5] 1.228 -0.802 -1.08 -0.158 -1.072
..$ d : num [1:5] -0.139 -0.597 -2.184 0.241 -0.259
$ :'data.frame': 5 obs. of 5 variables:
..$ ID: int [1:5] 1 2 3 4 5
..$ a : num [1:5] 0.901 0.942 1.468 0.707 0.819
..$ b : num [1:5] -0.293 1.419 1.499 -0.657 -0.853
..$ c : num [1:5] 0.316 1.11 2.215 1.217 1.479
..$ d : num [1:5] 0.952 -1.01 -2 -1.762 -0.143
$ :'data.frame': 5 obs. of 5 variables:
..$ ID: int [1:5] 1 2 3 4 5
..$ a : num [1:5] 1.5501 -0.8024 -0.0746 1.8957 -0.4566
..$ b : num [1:5] 0.5622 -0.887 -0.4602 -0.7243 -0.0692
..$ c : num [1:5] 1.463 0.188 1.022 -0.592 -0.112
..$ d : num [1:5] -0.925 0.7533 -0.1126 -0.0641 0.2333
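In the example above every data frame holds the same IDs, so the choice of all.x does not change the result. The following is a minimal sketch, on hypothetical toy data where the IDs only partly overlap, of how the same lapply/Reduce pattern behaves with all.x = TRUE versus all = TRUE. The names by_year and merge_group are invented for illustration.

```r
# Hypothetical toy data: two "years", three small data frames per year.
by_year <- list(
  y2010 = list(data.frame(ID = 1:3, a = c(10, 20, 30)),
               data.frame(ID = 2:4, b = c(1, 2, 3)),
               data.frame(ID = 1:2, c = c(5, 6))),
  y2011 = list(data.frame(ID = 1:2, a = c(7, 8)),
               data.frame(ID = 1:2, b = c(9, 10)),
               data.frame(ID = 1:2, c = c(11, 12)))
)

# Merge all data frames of one inner list pairwise, left to right.
# Extra arguments (all.x, all, ...) are forwarded to merge().
merge_group <- function(dfs, ...) {
  Reduce(function(x, y) merge(x, y, by = "ID", ...), dfs)
}

# Keep only the IDs of the first data frame in each group.
left <- lapply(by_year, merge_group, all.x = TRUE)

# Keep every ID that appears in any data frame of the group.
full <- lapply(by_year, merge_group, all = TRUE)

sapply(left, nrow)
sapply(full, nrow)
```

Because the outer list is named, the names carry over to the result, so each merged data frame can be retrieved by year, for example left$y2010.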

Another way to perform the recursive merge is to use join_all from the plyr package:
library(plyr)
out1 <- lapply(l, join_all, by = "ID") # using the example dataset from Thomas's answer
identical(out, out1)
# [1] TRUE

Related

Convert a list of data frames into a single data frame with list name [duplicate]

This question already has answers here:
Simultaneously merge multiple data.frames in a list
(9 answers)
Closed 3 years ago.
I am hoping to determine an efficient way to convert a list of data frames into a single data frame. Below is my reproducible MWE:
set.seed(1)
ABAge = runif(100)
ABPoints = rnorm(100)
ACAge = runif(100)
ACPoints = rnorm(100)
BCAge = runif(100)
BCPoints = rnorm(100)
A_B <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ABAge, Points = ABPoints)
A_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = ACAge, Points = ACPoints)
B_C <- data.frame(ID = as.character(paste0("ID", 1:100)), Age = BCAge, Points = BCPoints)
A_B$ID <- as.character(A_B$ID)
A_C$ID <- as.character(A_C$ID)
B_C$ID <- as.character(B_C$ID)
listFormat <- list("A_B" = A_B, "A_C" = A_C, "B_C" = B_C)
dfFormat <- data.frame(ID = as.character(paste0("ID", 1:100)), A_B.Age = ABAge, A_B.Points = ABPoints, A_C.Age = ACAge, A_C.Points = ACPoints, B_C.Age = BCAge, B_C.Points = BCPoints)
dfFormat$ID = as.character(dfFormat$ID)
This results in a data frame format (dfFormat) that looks like this:
'data.frame': 100 obs. of 7 variables:
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ A_B.Age : num 0.266 0.372 0.573 0.908 0.202 ...
$ A_B.Points: num 0.398 -0.612 0.341 -1.129 1.433 ...
$ A_C.Age : num 0.6737 0.0949 0.4926 0.4616 0.3752 ...
$ A_C.Points: num 0.409 1.689 1.587 -0.331 -2.285 ...
$ B_C.Age : num 0.814 0.929 0.147 0.75 0.976 ...
$ B_C.Points: num 1.474 0.677 0.38 -0.193 1.578 ...
and a list of data frames listFormat that looks like this:
List of 3
$ A_B:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.266 0.372 0.573 0.908 0.202 ...
..$ Points: num [1:100] 0.398 -0.612 0.341 -1.129 1.433 ...
$ A_C:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.6737 0.0949 0.4926 0.4616 0.3752 ...
..$ Points: num [1:100] 0.409 1.689 1.587 -0.331 -2.285 ...
$ B_C:'data.frame': 100 obs. of 3 variables:
..$ ID : chr [1:100] "ID1" "ID2" "ID3" "ID4" ...
..$ Age : num [1:100] 0.814 0.929 0.147 0.75 0.976 ...
..$ Points: num [1:100] 1.474 0.677 0.38 -0.193 1.578 ...
I am hoping to come up with an automated way to convert listFormat to dfFormat. As can be seen in the above objects, there are two main conditions:
1) If there is a common column (same name and contents) in each sublist of listFormat (in this example, ID), then it is not repeated in the output dfFormat (in this example, there is one final ID column).
2) The remaining columns in the sublists of listFormat become columns in dfFormat, named with their sublist name (e.g. "A_B") followed by a dot and their original column name (e.g. Age), giving for example "A_B.Age" in dfFormat.
I have tried various unlist() and sapply codes but have been unsuccessful thus far. What is an efficient way to accomplish this?

You're looking for dplyr::bind_rows:
library(dplyr)
bind_rows(listFormat, .id = "name")
Output:
name ID Age Points
1 A_B ID1 0.2655087 0.3981059
2 A_B ID2 0.3721239 -0.6120264
3 A_B ID3 0.5728534 0.3411197
4 A_B ID4 0.9082078 -1.1293631
5 A_B ID5 0.2016819 1.4330237
6 A_B ID6 0.8983897 1.9803999

Copy listFormat to L in case we need to preserve the input, listFormat. Remove the ID column from each component of L except the first, cbind what we have left together and then fix up the name of the first column. No packages are used.
L <- listFormat
L[-1] <- lapply(L[-1], transform, ID = NULL)
DF <- do.call(cbind, L)
names(DF)[1] <- "ID"
giving:
> str(DF)
'data.frame': 100 obs. of 7 variables:
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ A_B.Age : num 0.9932 0.1451 0.6166 0.0372 0.9039 ...
$ A_B.Points: num 0.4752 0.0288 1.0548 0.6113 0.0651 ...
$ A_C.Age : num 0.912 0.761 0.618 0.895 0.507 ...
$ A_C.Points: num -0.515 -0.945 0.398 0.502 -1.021 ...
$ B_C.Age : num 0.7935 0.2747 0.0487 0.6307 0.3499 ...
$ B_C.Points: num -0.963 -1.772 1.716 -0.819 0.577 ...
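The cbind approach relies on the rows being in the same order in every data frame. As a hedged alternative, here is a base R sketch that prefixes the non-ID columns with the list name and then merges on ID, so row order does not matter. The list lst is a small hypothetical stand-in for listFormat, and prefixed and wide are names made up for this illustration.

```r
# Hypothetical stand-in for listFormat; note the rows of A_C are in a
# different order than the rows of A_B.
lst <- list(
  A_B = data.frame(ID = c("ID1", "ID2"), Age = c(0.1, 0.2), Points = c(1, 2)),
  A_C = data.frame(ID = c("ID2", "ID1"), Age = c(0.4, 0.3), Points = c(4, 3))
)

# Prefix every column except ID with the name of its list element.
prefixed <- Map(function(d, nm) {
  keep <- names(d) != "ID"
  names(d)[keep] <- paste(nm, names(d)[keep], sep = ".")
  d
}, lst, names(lst))

# Merge the renamed data frames pairwise on the shared ID column.
wide <- Reduce(function(x, y) merge(x, y, by = "ID"), prefixed)
names(wide)
```

This is slower than cbind on large inputs, since merge has to match the keys, but it does not silently misalign rows when the IDs are ordered differently.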

One purrr and dplyr option could be:
library(purrr)
library(dplyr)
imap(listFormat, ~ setNames(.x, paste(rep(.y, length(.x)), names(.x), sep = ".")) %>%
rename_at(vars(ends_with("ID")), ~ "ID")) %>%
reduce(full_join, by = "ID")
ID A_B.Age A_B.Points A_C.Age A_C.Points B_C.Age B_C.Points
1 ID1 0.26550866 0.398105880 0.67371223 0.409401840 0.81425175 1.473881181
2 ID2 0.37212390 -0.612026393 0.09485786 1.688873286 0.92877723 0.677268492
3 ID3 0.57285336 0.341119691 0.49259612 1.586588433 0.14748105 0.379962687
4 ID4 0.90820779 -1.129363096 0.46155184 -0.330907801 0.74982166 -0.192798426
5 ID5 0.20168193 1.433023702 0.37521653 -2.285235535 0.97565735 1.577891795
6 ID6 0.89838968 1.980399899 0.99109922 2.497661590 0.97479246 0.596234109
7 ID7 0.94467527 -0.367221476 0.17635071 0.667066167 0.35062557 -1.173576941
8 ID8 0.66079779 -1.044134626 0.81343521 0.541327336 0.39394906 -0.155642535
9 ID9 0.62911404 0.569719627 0.06844664 -0.013399523 0.95095101 -1.918909820
10 ID10 0.06178627 -0.135054604 0.40044975 0.510108423 0.10664832 -0.195258846

Given that each data.frame has identical ID columns, in base R it's pretty easy.
as.data.frame(listFormat)
# A_B.ID A_B.Age A_B.Points A_C.ID A_C.Age A_C.Points B_C.ID B_C.Age B_C.Points
# 1 ID1 0.2655087 0.3981059 ID1 0.67371223 0.4094018 ID1 0.8142518 1.4738812
# 2 ID2 0.3721239 -0.6120264 ID2 0.09485786 1.6888733 ID2 0.9287772 0.6772685
# 3 ID3 0.5728534 0.3411197 ID3 0.49259612 1.5865884 ID3 0.1474810 0.3799627
# 4 ID4 0.9082078 -1.1293631 ID4 0.46155184 -0.3309078 ID4 0.7498217 -0.1927984
# 5 ID5 0.2016819 1.4330237 ID5 0.37521653 -2.2852355 ID5 0.9756573 1.5778918
# 6 ID6 0.8983897 1.9803999 ID6 0.99109922 2.4976616 ID6 0.9747925 0.5962341
You get an ID column for each data.frame, but this can then be easily tidied up.
If you need a more general solution for situations where the ID columns of the data.frames differ, you can do the following using the data.table package:
library(data.table)
DTFormat = rbindlist(listFormat, idcol = TRUE)
dcast(DTFormat, ID ~ .id, value.var = c('Age', 'Points'))
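The tidy-up step after as.data.frame is not spelled out in the answer. One possible way to do it, sketched here on a small hypothetical list (lst), is to check that the repeated ID columns really are identical, rename the first one to ID, and drop the rest.

```r
# Hypothetical stand-in for listFormat with identical ID columns.
lst <- list(
  A_B = data.frame(ID = c("ID1", "ID2"), Age = c(0.1, 0.2)),
  A_C = data.frame(ID = c("ID1", "ID2"), Age = c(0.3, 0.4))
)

wide <- as.data.frame(lst)

# Positions of the repeated ID columns (A_B.ID, A_C.ID, ...).
id_cols <- grep("\\.ID$", names(wide))

# Safety check: this shortcut is only valid if all ID columns agree.
same_ids <- vapply(wide[id_cols],
                   function(col) all(col == wide[[id_cols[1]]]),
                   logical(1))
stopifnot(all(same_ids))

# Keep the first ID column under the plain name "ID", drop the others.
names(wide)[id_cols[1]] <- "ID"
wide <- wide[setdiff(seq_along(wide), id_cols[-1])]
names(wide)
```

The setdiff call is used instead of negative indexing so that the code still works when there is only one ID column and nothing needs to be dropped.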

How to retrieve name of element in list (data frame) to use it as a title of the plot?

So briefly and without further ado - is it possible to retrieve only a name of element in list and use it as a main title of plot?
Let me explain - example:
Let's create a random df:
a <- c(1,2,3,4)
b <- runif(4)
c <- runif(4)
d <- runif(4)
e <- runif(4)
f <- runif(4)
df <- data.frame(a,b,c,d,e,f)
head(df)
a b c d e f
1 1 0.9694204 0.9869154 0.5386678 0.39331278 0.15054698
2 2 0.8949330 0.9910894 0.1009689 0.03632476 0.15523628
3 3 0.4930752 0.7179144 0.6957262 0.36579883 0.32006026
4 4 0.4850141 0.5539939 0.3196953 0.14348259 0.05292068
Then I want to create a list of data frames (based on the one above) with specific columns to make a plot. In other words, I'd like to make plots where the first column of df (a) is the x axis and columns b, c, d, e and f supply the values on the y axis. Yes, there will be 5 plots, that's the point!
So my idea was to write a simple function that creates a list of data frames based on the one created above:
my_fun <- function(x){
a <- df[1]
b <- x
aname <- "x_label"
bname <- "y_label"
df <- data.frame(a,b)
names(df) <- c(aname,bname)
return(df)
}
Run it for all (specified) columns:
df_s <- apply(df[,2:6], 2, function(x) my_fun(x))
So I have now:
class(df_s)
[1] "list"
str(df_s)
List of 5
$ b:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.969 0.895 0.493 0.485
$ c:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.987 0.991 0.718 0.554
$ d:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.539 0.101 0.696 0.32
$ e:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.3933 0.0363 0.3658 0.1435
$ f:'data.frame': 4 obs. of 2 variables:
..$ x_label: num [1:4] 1 2 3 4
..$ y_label: num [1:4] 0.1505 0.1552 0.3201 0.0529
That is what I wanted, but here's the question. I'd like to create a plot for every data frame in my list. As a result I want 5 plots with main titles b, c, d, e and f respectively. The axis labels are the same in every plot; the plot titles are not. So I tried:
lapply(df_s, function(x) plot(x[2] ~ x[1], data = x, main = ???))
What should go in place of the question marks? I tried main = names(df_s)[x], but it didn't work...

I think the following works. However, I think it might be best to use ggplot2 instead of the plot function (unless you are saving the plots inside lapply).
lapply(seq_along(df_s), function(x)
plot(df_s[[x]][, 2] ~ df_s[[x]][, 1],
xlab = names(df_s[[x]])[1],
ylab = names(df_s[[x]])[2],
main = names(df_s)[x]))
With ggplot2:
library(ggplot2)
plot_lst <- lapply(seq_along(df_s), function(i) {
ggplot(df_s[[i]], aes(x = x_label, y = y_label)) +
geom_point() +
theme(plot.title = element_text(hjust = 0.5)) +
ggtitle(names(df_s)[i]) })
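Instead of iterating over indices, the data frames and their names can also be walked in parallel with Map. The sketch below uses a small hypothetical df_s with the same column names as in the question; plot_one is a helper invented for this example, and pdf(NULL) is only there so that the code runs without opening a plot window.

```r
# Hypothetical stand-in for df_s with the same structure as in the question.
df_s <- list(
  b = data.frame(x_label = 1:4, y_label = c(0.9, 0.8, 0.5, 0.4)),
  c = data.frame(x_label = 1:4, y_label = c(0.2, 0.3, 0.7, 0.6))
)

# Draw one scatter plot and return the title that was used.
plot_one <- function(d, title) {
  plot(y_label ~ x_label, data = d, main = title)
  title
}

pdf(NULL)  # null graphics device: draw without writing a file or opening a window
titles <- Map(plot_one, df_s, names(df_s))
invisible(dev.off())

unlist(titles)
```

Map passes the first data frame together with the first name, the second with the second, and so on, which is why the name is available inside the function here but not inside a plain lapply(df_s, ...).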

dplyr : how-to programmatically full_join dataframes contained in a list of lists?

Context and data structure
I'll share with you a simplified version of my huge dataset. This simplified version fully respects the structure of my original dataset but contains less list elements, dataframes, variables and observations than the original one.
According to the most upvoted answer to the question : How to make a great R reproducible example ?, I share my dataset using the output of dput(query1) to give you something that can be immediately used in R by copy/paste the following code block in the R console :
structure(list(plu = structure(list(year = structure(list(id = 1:3,
station = 100:102, pluMean = c(0.509068994778059, 1.92866478959912,
1.09517453602154), pluMax = c(0.0146962179957886, 0.802984389130343,
2.48170762478472)), .Names = c("id", "station", "pluMean",
"pluMax"), row.names = c(NA, -3L), class = "data.frame"), month = structure(list(
id = 1:3, station = 100:102, pluMean = c(0.66493845927034,
-1.3559338786041, 0.195600637750077), pluMax = c(0.503424623872161,
0.234402501255681, -0.440264545434053)), .Names = c("id",
"station", "pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame"),
week = structure(list(id = 1:3, station = 100:102, pluMean = c(-0.608295829330578,
-1.10256919591373, 1.74984007126193), pluMax = c(0.969668266601551,
0.924426323739882, 3.47460867665884)), .Names = c("id", "station",
"pluMean", "pluMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week")), tsa = structure(list(year = structure(list(
id = 1:3, station = 100:102, tsaMean = c(-1.49060721773042,
-0.684735418997484, 0.0586655881113975), tsaMax = c(0.25739838787582,
0.957634817758648, 1.37198023881125)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
month = structure(list(id = 1:3, station = 100:102, tsaMean = c(-0.684668662999479,
-1.28087846387974, -0.600175481941456), tsaMax = c(0.962916941685075,
0.530773351897188, -0.217143593955998)), .Names = c("id",
"station", "tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame"),
week = structure(list(id = 1:3, station = 100:102, tsaMean = c(0.376481732842365,
0.370435880636005, -0.105354927593471), tsaMax = c(1.93833635147645,
0.81176751708868, 0.744932493064975)), .Names = c("id", "station",
"tsaMean", "tsaMax"), row.names = c(NA, -3L), class = "data.frame")), .Names = c("year",
"month", "week"))), .Names = c("plu", "tsa"))
After executing this, if you execute str(query1), you'll get the structure of my example dataset as :
> str(query1)
List of 2
$ plu:List of 3
..$ year :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] 0.509 1.929 1.095
.. ..$ pluMax : num [1:3] 0.0147 0.803 2.4817
..$ month:'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] 0.665 -1.356 0.196
.. ..$ pluMax : num [1:3] 0.503 0.234 -0.44
..$ week :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ pluMean: num [1:3] -0.608 -1.103 1.75
.. ..$ pluMax : num [1:3] 0.97 0.924 3.475
$ tsa:List of 3
..$ year :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] -1.4906 -0.6847 0.0587
.. ..$ tsaMax : num [1:3] 0.257 0.958 1.372
..$ month:'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] -0.685 -1.281 -0.6
.. ..$ tsaMax : num [1:3] 0.963 0.531 -0.217
..$ week :'data.frame': 3 obs. of 4 variables:
.. ..$ id : int [1:3] 1 2 3
.. ..$ station: int [1:3] 100 101 102
.. ..$ tsaMean: num [1:3] 0.376 0.37 -0.105
.. ..$ tsaMax : num [1:3] 1.938 0.812 0.745
So how does it read? I have a big list (query1) made of 2 parameter elements (plu & tsa). Each of these 2 parameter elements is a list made of 3 elements (year, month, week), and each of these 3 elements is a timeInterval dataframe made of the same 4 kinds of columns (id, station, mean, max) and exactly the same number of observations (3).
What I want to achieve
I want to programmatically full_join by id & station all the timeInterval dataframes with the same name (year, month, week). This means that I should end up with a new list (query1Changed) containing 3 dataframes (year, month, week), each of them containing 6 columns (id, station, pluMean, pluMax, tsaMean, tsaMax) and 3 observations. Schematically, I need to arrange the data as follows:
do a full_join by station and id of:
df query1$plu$year with df query1$tsa$year
df query1$plu$month with df query1$tsa$month
df query1$plu$week with df query1$tsa$week
Or expressed with another representation:
df query1[[1]][[1]] with df query1[[2]][[1]]
df query1[[1]][[2]] with df query1[[2]][[2]]
df query1[[1]][[3]] with df query1[[2]][[3]]
And expressed programmatically (n being the total number of elements of the big list):
df query1[[i]][[1]] with df query1[[i+1]][[1]]... with df query1[[n]][[1]]
df query1[[i]][[2]] with df query1[[i+1]][[2]]... with df query1[[n]][[2]]
df query1[[i]][[3]] with df query1[[i+1]][[3]]... with df query1[[n]][[3]]
I need to achieve this programmatically because in my real project I could encounter another big list with more than 2 parameter elements and more than 4 columns in each of their timeInterval dataframes.
In my analysis, what will always remain the same is that all the parameter elements of any big list will have the same number of timeInterval dataframes with the same names, and each of these timeInterval dataframes will have the same number of observations and share 2 columns with exactly the same names and values (id & station).
What I have achieved so far
Executing the following piece of code :
> query1Changed <- do.call(function(...) mapply(bind_cols, ..., SIMPLIFY=F), args = query1)
arranges the data as expected. However this is not a neat solution since we end up with repeated column names (id & station) :
> str(query1Changed)
List of 3
$ year :'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] 0.509 1.929 1.095
..$ pluMax : num [1:3] 0.0147 0.803 2.4817
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] -1.4906 -0.6847 0.0587
..$ tsaMax : num [1:3] 0.257 0.958 1.372
$ month:'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] 0.665 -1.356 0.196
..$ pluMax : num [1:3] 0.503 0.234 -0.44
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] -0.685 -1.281 -0.6
..$ tsaMax : num [1:3] 0.963 0.531 -0.217
$ week :'data.frame': 3 obs. of 8 variables:
..$ id : int [1:3] 1 2 3
..$ station : int [1:3] 100 101 102
..$ pluMean : num [1:3] -0.608 -1.103 1.75
..$ pluMax : num [1:3] 0.97 0.924 3.475
..$ id1 : int [1:3] 1 2 3
..$ station1: int [1:3] 100 101 102
..$ tsaMean : num [1:3] 0.376 0.37 -0.105
..$ tsaMax : num [1:3] 1.938 0.812 0.745
We could add a second process to "clean" the data but this would not be the most efficient solution. So I don't want to use this workaround.
Next, I've tried doing the same using dplyr full_join but with no success. Executing the following code :
> query1Changed <- do.call(function(...) mapply(full_join(..., by = c("station", "id")), ..., SIMPLIFY=F), args = query1)
returns the following error :
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "list"
So, how should I write my full_join expression to make it run on the dataframes ?
or is there another way to perform my data transformation efficiently ?
What I've found on the web that could help ?
I've found the related questions but I still can't figure out how to adapt their solutions to my problem.
On stackoverflow :
- Merging a data frame from a list of data frames [duplicate]
- Simultaneously merge multiple data.frames in a list
- Joining list of data.frames from map() call
- Combining elements of list of lists by index
On blogs :
- Joining a List of Data Frames with purrr::reduce()
Any help would be greatly appreciated. I hope I've made the description of my problem clear.
I've started programming with R only 2 months ago so please be indulgent if the solution is obvious ;)

First of all, thanks for posting a really great description of what your problem is and which requirements you need for your solution.
First, I'd use purrr::map2 to create a function that takes two lists of data frames and joins them in parallel. That is, it joins the first data frame of plu with the first of tsa ... the last of plu with the last of tsa, and returns the results as a list.
> join_each = function(x, y) map2(x, y, full_join)
> join_each(query1$plu, query1$tsa)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.509069 0.01469622 -1.49060722 0.2573984
2 2 101 1.928665 0.80298439 -0.68473542 0.9576348
3 3 102 1.095175 2.48170762 0.05866559 1.3719802
$month
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.6649385 0.5034246 -0.6846687 0.9629169
2 2 101 -1.3559339 0.2344025 -1.2808785 0.5307734
3 3 102 0.1956006 -0.4402645 -0.6001755 -0.2171436
$week
id station pluMean pluMax tsaMean tsaMax
1 1 100 -0.6082958 0.9696683 0.3764817 1.9383364
2 2 101 -1.1025692 0.9244263 0.3704359 0.8117675
3 3 102 1.7498401 3.4746087 -0.1053549 0.7449325
Well, this works when there are only two of them, but you want it to work when there are n lists of data.frames. Now you are going to need purrr::reduce:
> reduce(query1, join_each)
Joining, by = c("id", "station")
Joining, by = c("id", "station")
Joining, by = c("id", "station")
$year
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.509069 0.01469622 -1.49060722 0.2573984
2 2 101 1.928665 0.80298439 -0.68473542 0.9576348
3 3 102 1.095175 2.48170762 0.05866559 1.3719802
$month
id station pluMean pluMax tsaMean tsaMax
1 1 100 0.6649385 0.5034246 -0.6846687 0.9629169
2 2 101 -1.3559339 0.2344025 -1.2808785 0.5307734
3 3 102 0.1956006 -0.4402645 -0.6001755 -0.2171436
$week
id station pluMean pluMax tsaMean tsaMax
1 1 100 -0.6082958 0.9696683 0.3764817 1.9383364
2 2 101 -1.1025692 0.9244263 0.3704359 0.8117675
3 3 102 1.7498401 3.4746087 -0.1053549 0.7449325
It computes join_each(query1[[1]], query1[[2]]) %>% join_each(query1[[3]]) ... %>% join_each(query1[[n]]).
Update: The following one-liner does the same: reduce(query1, map2, full_join). It isn't as readable, though.
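For readers who prefer to avoid extra packages, the same idea can be sketched in base R, with Map standing in for map2, Reduce for reduce, and merge(..., all = TRUE) for full_join. This is an illustration under assumptions rather than the answer's method: the list q below is a reduced hypothetical version of query1 with a made-up third parameter (hum) to show that it extends to n lists.

```r
# Reduced, hypothetical version of query1 with three parameter lists.
q <- list(
  plu = list(year  = data.frame(id = 1:3, station = 100:102, pluMean = c(1, 2, 3)),
             month = data.frame(id = 1:3, station = 100:102, pluMean = c(4, 5, 6))),
  tsa = list(year  = data.frame(id = 1:3, station = 100:102, tsaMean = c(7, 8, 9)),
             month = data.frame(id = 1:3, station = 100:102, tsaMean = c(10, 11, 12))),
  hum = list(year  = data.frame(id = 1:3, station = 100:102, humMean = c(13, 14, 15)),
             month = data.frame(id = 1:3, station = 100:102, humMean = c(16, 17, 18)))
)

# Join two lists of data frames element by element (year with year, ...).
join_each <- function(x, y) {
  Map(function(a, b) merge(a, b, by = c("id", "station"), all = TRUE), x, y)
}

# Fold join_each over all parameter lists.
res <- Reduce(join_each, q)
str(res)
```

Unlike the full_join call in the answer, the join keys are stated explicitly here, so no "Joining, by" messages are printed and an unexpected shared column cannot silently become a key.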

Change the coltypes based on a substring in colnames

I have a very large data frame of sales data (df8). When loading in, some of the variables that I want to be numeric loaded as chr. I want to change every column where the colname contains the word "Order" from chr to numeric. How can I do this?

I would use the function grepl to find the column names containing "order", then go through each matching column and convert it to numeric. Note that the variables here are actually characters; this won't work if your data is a factor (that would need as.numeric(as.character(x))).
# create data.frame with characters
xy <- data.frame(a = runif(5), b.order = runif(5), cOrder = runif(5))
xy[, c(2, 3)] <- sapply(xy[, c(2, 3)], FUN = as.character)
str(xy)
'data.frame': 5 obs. of 3 variables:
$ a : num 0.914 0.468 0.106 0.624 0.841
$ b.order: chr "0.363523897947744" "0.56488766730763" "0.42081760126166" "0.560672372812405" ...
$ cOrder : chr "0.949268750846386" "0.596737345447764" "0.368769273394719" "0.717566329054534" ...
with.order <- grepl("order", names(xy), ignore.case = TRUE)
xy[, with.order] <- sapply(xy[, with.order], FUN = as.numeric)
str(xy)
'data.frame': 5 obs. of 3 variables:
$ a : num 0.914 0.468 0.106 0.624 0.841
$ b.order: num 0.364 0.565 0.421 0.561 0.768
$ cOrder : num 0.949 0.597 0.369 0.718 0.417
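To cover the factor case mentioned above, the conversion can go through as.character first, which is harmless for columns that are already character. A small sketch on hypothetical data (one character column and one factor column), using lapply so that the result stays a list of columns:

```r
# Hypothetical data: b.order is character, cOrder is a factor.
xy <- data.frame(a = c(1.5, 2.5),
                 b.order = c("0.1", "0.2"),
                 cOrder = factor(c("3", "10")),
                 stringsAsFactors = FALSE)

# Logical index of the columns whose name contains "order" (any case).
is_order <- grepl("order", names(xy), ignore.case = TRUE)

# Convert via character so that factors yield their labels, not their codes.
xy[is_order] <- lapply(xy[is_order], function(col) as.numeric(as.character(col)))
str(xy)
```

Calling as.numeric directly on the factor column would return the internal level codes rather than the values 3 and 10, which is the pitfall the answer warns about.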

Calculations on data frames in a list

I have a list of data frames:
str(Test)
List of 3
$ A:'data.frame': 32400 obs. of 4 variables:
..$ X : num [1:32400] -0.0152 -0.0302 -0.0453 -0.0604 -0.0755 ...
..$ Y : num [1:32400] 0.00875 0.01745 0.02615 0.0349 0.0436 ...
..$ Z : num [1:32400] -1 -0.999 -0.999 -0.998 -0.996 ...
..$ Ts: num [1:32400] 0.000427 0.001696 0.003805 0.006765 0.010537 ...
$ B:'data.frame': 32400 obs. of 4 variables:
..$ X : num [1:32400] -0.0153 -0.0305 -0.0457 -0.061 -0.0763 ...
..$ Y : num [1:32400] 0.00848 0.01692 0.02536 0.03384 0.04228 ...
..$ Z : num [1:32400] -1 -0.999 -0.999 -0.998 -0.996 ...
..$ Ts: num [1:32400] 0.000427 0.001696 0.003805 0.006765 0.010537 ...
$ C:'data.frame': 32400 obs. of 4 variables:
..$ X : num [1:32400] -0.0155 -0.0308 -0.0462 -0.0616 -0.077 ...
..$ Y : num [1:32400] 0.00822 0.01638 0.02455 0.03277 0.04094 ...
..$ Z : num [1:32400] -1 -0.999 -0.999 -0.998 -0.996 ...
..$ Ts: num [1:32400] 0.000427 0.001696 0.003805 0.006765 0.010537 ...
I want to create two new columns in each dataframe. The new values are based on X, Y, Z of each dataframe:
new_x = sqrt(2/(1-Z)) * X
new_y = sqrt(2/(1-Z)) * Y
I have tried a few things (and read a lot) and this is what I think should work:
t=function(x){
new_x = sqrt(2/(1-x[,3])) * x[,1]
new_y = sqrt(2/(1-x[,3])) * x[,2] }
New_Test=lapply(Test, within, t)
However, this only creates a new list that is exactly like the old list.
I have tried to use mapply and looked into the plyr package but could not find a solution. I am fairly new to R so be kind ;-)
Edit: Both solutions posted below work! Thanks for your help :-)

Here is a full implementation of the suggestion I made in the comments.
First we simulate some data:
listOfDataframes<- list(
df1 = data.frame(X = runif(100), Y = runif(100), Z = runif(100)),
df2 = data.frame(X = runif(100), Y = runif(100), Z = runif(100)),
df3 = data.frame(X = runif(100), Y = runif(100), Z = runif(100))
)
Then we write a function to perform the desired operation. Note that the use of return is unnecessary and only included for clarity.
yourFun <- function(df) {
df$new_x <- sqrt(2/(1-df$Z)) * df$X
df$new_y <- sqrt(2/(1-df$Z)) * df$Y
return(df) # just "df" would produce the same result
}
Then we apply the function to your list of data.frames and assign the result to a new list:
newList <- lapply(listOfDataframes, yourFun)
Finally, we display the first few entries of every dataframe to verify our results.
lapply(newList, head)
$df1
X Y Z new_x new_y
1 0.7122989 0.85574735 0.26176397 1.1724104 1.4085198
2 0.8373206 0.18083472 0.19733040 1.3217167 0.2854489
3 0.6780758 0.76722834 0.48987088 1.3426203 1.5191462
4 0.3694669 0.42579811 0.10287797 0.5516515 0.6357597
5 0.1466816 0.69924651 0.08006688 0.2162781 1.0310202
6 0.3280546 0.06574292 0.22372561 0.5265669 0.1055252
$df2
X Y Z new_x new_y
1 0.9385518 0.50570095 0.3062779 1.593604 0.85864969
2 0.6672409 0.66002494 0.3721208 1.190857 1.17797831
3 0.7559528 0.73025591 0.4063969 1.387591 1.34042338
4 0.1960170 0.01639017 0.9700715 1.602382 0.13398487
5 0.9336734 0.76437690 0.3301318 1.613301 1.32077206
6 0.7320958 0.03640788 0.1761000 1.140632 0.05672482
$df3
X Y Z new_x new_y
1 0.45050818 0.9843507 0.8956288 1.9720924 4.3089794
2 0.32775145 0.1385610 0.9713440 2.7381165 1.1575725
3 0.07208382 0.8635344 0.5244027 0.1478200 1.7708221
4 0.50439997 0.1328935 0.5827728 1.1043424 0.2909594
5 0.46265459 0.3394566 0.4912585 0.9173252 0.6730551
6 0.15894944 0.4517309 0.3610197 0.2812097 0.7991919
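The attempt in the question most likely changed nothing because within() evaluates an expression inside the data frame rather than calling a function, so passing the function t as its second argument has no effect. A hedged sketch of how the same calculation could be written with within(), on a tiny hypothetical Test list (add_cols is a name invented here):

```r
# Tiny hypothetical version of the Test list from the question.
Test <- list(
  A = data.frame(X = c(0.1, 0.2), Y = c(0.3, 0.4), Z = c(-1, 0.5)),
  B = data.frame(X = c(0.5, 0.6), Y = c(0.7, 0.8), Z = c(0.5, -1))
)

# within() evaluates the braced expression with the columns in scope
# and returns the data frame with the newly assigned columns added.
add_cols <- function(d) {
  within(d, {
    new_x <- sqrt(2 / (1 - Z)) * X
    new_y <- sqrt(2 / (1 - Z)) * Y
  })
}

New_Test <- lapply(Test, add_cols)
lapply(New_Test, head)
```

The columns are referred to by name (X, Y, Z) rather than by position, which keeps the function valid even if the column order differs between data frames.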
