combine multiple dataframes based on sequence of names - r

Say I have 30 dataframes all named with a date from 01/01/2000 to 30/01/2000 in the format of ddmmyy (code below) :
Season <- seq(as.Date("2000-01-01"),as.Date("2000-01-30"),1)
Season <- format(Season,"%d%m%y")
for (s in Season) {
df <- data.frame(X=1:10, Y=1:10)
aa <- paste(s,"tests",s ,sep = "_")
assign(aa,df)
}
Each name, as you can see, has the word tests added to it. I want to combine (rbind?) the data.frames based on the date. In this case, combine the data.frames that contain the dates from 01-01-00 to 10-01-00.
I have the below code to combine all dataframes but what if I only want to select the ones shown above?
All_dfs <- do.call(rbind, eapply(.GlobalEnv,function(x) if(is.data.frame(x)) x))
Is it better to create a list first?

We can use mget to return the data.frames as a list and then rbind that list. Each object name is the date, followed by "_tests_", followed by the date again, so we rebuild the names of the first ten objects with paste0 from 'Season' and pass them to mget.
res <- do.call(rbind, mget( paste0(Season[1:10], "_tests_", Season[1:10])))
dim(res)
#[1] 100 2
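On the "is it better to create a list first?" part: a minimal sketch of that route, assuming the data frames can be built directly into a named list instead of being created with assign. The names df_names, df_list and res_list are made up for this illustration.

```r
# Same dates as in the question, formatted as ddmmyy
Season <- format(seq(as.Date("2000-01-01"), as.Date("2000-01-30"), by = 1), "%d%m%y")

# Object names in the question's style: <date>_tests_<date>
df_names <- paste(Season, "tests", Season, sep = "_")

# Store the data frames in one named list rather than in the global environment
df_list <- setNames(
  lapply(Season, function(s) data.frame(X = 1:10, Y = 1:10)),
  df_names
)

# Pick the first ten dates by name and stack them
res_list <- do.call(rbind, df_list[df_names[1:10]])
dim(res_list)
#[1] 100 2
```

With a list, selecting a date range is ordinary subsetting by name, and nothing has to be looked up in the global environment.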

Related

Using Dataframe to Automatically create a list of values based off Subproduct

df <- data.frame("date"=
1:4,"product"=c("B","B","A","A"),"subproduct"=c("1","2","x","y"),"actuals"=1:4)
#creates df1,df2,dfx,dfy
for(i in unique(df$subproduct)) {
nam <- paste("df", i, sep = ".")
assign(nam, df[df$subproduct==i,])
}
# CREATES LIST OF DATAFRAMES
# How do I make this so i don't have to manually type list(df.,df.,df.)
list_df <- list(df.1,df.2,df.x,df.y) %>%
lapply( function(x) x[(names(x) %in% c("date", "actuals"))])
# creates df1,df2,df3,df4 only dates and actuals, removes the other column names
for (i in 1:length(list_df)) {
assign(paste0("df", i), as.data.frame(list_df[[i]]))
}
The first for loop creates a df object for each unique subproduct. For the list() call, I don't want to have to type df.1, df.2, etc., so that with 100 unique subproducts in my data I wouldn't need to type df.1, df.2,df.x,df.y,df.z,df.zzz,df. over and over again. How would I best do this? (1st question)
The last for loop creates separate dataframe objects with only date and actuals, which will be used to create a time series for each. How can I put the values of these objects into a single dataframe or a list of dfs? (2nd question)
We can use mget to return the values of the objects whose names are selected from ls. The pattern matches object names that start with 'df' followed by a '.' and one or more alphanumeric characters
mget(ls(pattern = '^df\\.[[:alnum:]]+$'))
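To see what that pattern accepts, here is a small check with grepl. The names in candidates are made up for the illustration and are not objects from the question.

```r
# Hypothetical object names to test against the pattern
candidates <- c("df.1", "df.x", "df.zzz", "df", "df1", "list_df")

# TRUE only for names of the form "df." plus one or more alphanumeric characters
matches <- grepl("^df\\.[[:alnum:]]+$", candidates)
matches
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

So the original df and the list_df object are not picked up by accident.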
If the OP wanted to create those objects in a different env
new_env <- new.env()
list2env(mget(ls(pattern = '^df\\.[[:alnum:]]+$')), envir = new_env)
If we want to create new objects from scratch, do a group_split on the 'subproduct' column, set the names accordingly, and create multiple objects (list2env - not recommended though)
library(dplyr)
library(stringr)
df %>%
group_split(subproduct) %>%
setNames(str_c('df.', c(1, 2, 'x', 'y'))) %>%
list2env(.GlobalEnv)
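For the second question (a list of dfs, or one data frame, holding only date and actuals), a base R sketch that skips the intermediate objects entirely. It assumes the df from the question; list_df and combined are just illustrative names.

```r
df <- data.frame(date = 1:4,
                 product = c("B", "B", "A", "A"),
                 subproduct = c("1", "2", "x", "y"),
                 actuals = 1:4)

# One list element per subproduct, keeping only the two columns of interest
list_df <- lapply(split(df, df$subproduct), function(x) x[c("date", "actuals")])
names(list_df) <- paste0("df.", names(list_df))

# If a single data frame is wanted instead, stack the pieces back together
combined <- do.call(rbind, list_df)
```

split names the pieces after the subproduct values, so this scales to 100 subproducts without typing any names.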

Fully programmatically rename columns in R with dplyr

I have sensor data, for several different sensor types, in many dataframes. I need to perform inner_joins on the dataframes so that I end up with one dataframe. The column names of the dataframes for a given sensor type are identical, e.g.
> z501h001
timeBgn soilTempMean soilTempVar
1 01:00:00 100 4
2 01:30:00 112 6
3 02:00:00 111 6
> z501h002
timeBgn soilTempMean soilTempVar
1 01:00:00 120 4
2 01:30:00 122 6
3 02:00:00 121 5
except there are way more columns. The column names are different for different types of sensors (they all have timeBgn in common).
I need (in R) a flexible way to rename the columns (so I can tell which column corresponds to which sensor) based on adding a suffix to the existing column names for all columns except timeBgn (which is the common column by which the inner_join will be done).
Here is the Python / Pandas equivalent of what I am trying to do:
def rename_cols_by_sensor(df, sensor_name):
cols = df.columns
new_cols = [f'{c}_{sensor_name}' if c!='timeBgn' else c for c in cols]
df.columns = new_cols
I found most of a solution here:
programmatically rename columns in dplyr
The problem is that I cannot figure out how to make the cnames vector programmatically. I do not want to hard-code all of the myriad column names. As an example for z501h001 it would need to look like
cnames <- c('soilTempMean' = 'soilTempMean_z501h001', 'soilTempVar' = 'soilTempVar_z501h001')
the suffix (in the example: _z501h001) can be passed to the function so there is no need to discuss obtaining it here. The original names are easily obtained using names(df). All I need to know is how to put them together in this c("character" = "other_character") format.
I have tried:
rename_by_loc <- function(df, loc) {
old_names <- names(df)
new_names <- c()
loc = z501h001
for (name in old_names) {
if (name != "timeBgn") {
new_names <- c(new_names, paste(name, paste(name, loc, sep="_"), sep = " = ") )
}
}
return(new_names)
}
but that gives me names like "soilTempMean = soilTempMean_z501h001"
I need the = to be outside of the character strings. I have tried a few other things. None have been successful.
This problem is trivial using Pandas which makes me think I am missing something about column renaming in R.
Thanks.
We can use mget to return, as a list, all objects whose names start with 'z' followed by 3 digits, 'h', and then 3 more digits. Then loop over the list with imap and rename every column except 'timeBgn' by concatenating (str_c) the original column name with the object name
library(dplyr)
library(purrr)
library(stringr)
out <- mget(ls(pattern = "^z\\d{3}h\\d{3}$")) %>%
imap(~ {
nm1 <- .y
.x %>%
rename_with(~ str_c(., "_", nm1), -timeBgn)
})
The output will be a list. If we need to change the column name in the original object (not recommended), use list2env
list2env(out, .GlobalEnv)
Or using base R
v1 <- ls(pattern = "^z\\d{3}h\\d{3}$")
for(v in v1) {
tmp <- get(v)
i1 <- names(tmp) != 'timeBgn'
names(tmp)[i1] <- paste0(names(tmp)[i1], '_', v)
assign(v, tmp)
}
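If the named vector in the c("character" = "other_character") format is still wanted, setNames builds it directly: the values are the new names and the vector's names are the old ones. A minimal sketch; make_cnames and its key argument are hypothetical helpers, and the data frame is a cut-down copy of the question's example.

```r
# Build c(old = "old_suffix", ...) for every column except the join key
make_cnames <- function(df, suffix, key = "timeBgn") {
  old <- setdiff(names(df), key)
  setNames(paste0(old, "_", suffix), old)
}

z501h001 <- data.frame(timeBgn = c("01:00:00", "01:30:00"),
                       soilTempMean = c(100, 112),
                       soilTempVar = c(4, 6))

cnames <- make_cnames(z501h001, "z501h001")
cnames
#            soilTempMean             soilTempVar
# "soilTempMean_z501h001"  "soilTempVar_z501h001"
```

The "=" in c("a" = "b") is only syntax for giving an element a name, which is why pasting " = " into the strings does not work.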

Add different suffix to column names on multiple data frames in R

I'm trying to add different suffixes to my data frames so that I can distinguish them after I've merged them. I have my data frames in a list and created a vector for the suffixes but so far I have not been successful.
data2016 is the list containing my 7 data frames
new_names <- c("june2016", "july2016", "aug2016", "sep2016", "oct2016", "nov2016", "dec2016")
data2016v2 <- lapply(data2016, paste(colnames(data2016)), new_names)
Your query is not quite clear. Therefore two solutions.
The beginning is the same for either solution. Suppose you have these four dataframes:
df1x <- data.frame(v1 = rnorm(50),
v2 = runif(50))
df2x <- data.frame(v3 = rnorm(60),
v4 = runif(60))
df3x <- data.frame(v1 = rnorm(50),
v2 = runif(50))
df4x <- data.frame(v3 = rnorm(60),
v4 = runif(60))
Suppose further you assemble them in a list, something akin to your data2016, using mget and ls and describing a pattern to match them:
my_list <- mget(ls(pattern = "^df\\d+x$"))
The names of the dataframes in this list are the following:
names(my_list)
[1] "df1x" "df2x" "df3x" "df4x"
Solution 1:
Suppose you want to change the names of the dataframes thus:
new_names <- c("june2016", "july2016","aug2016", "sep2016")
Then you can simply assign new_names to names(my_list):
names(my_list) <- new_names
And the result is:
names(my_list)
[1] "june2016" "july2016" "aug2016" "sep2016"
Solution 2:
You want to add the new_names literally as suffixes to the 'old' names, in which case you would use paste or paste0 thus:
names(my_list) <- paste0(names(my_list), "_", new_names)
And the result is:
names(my_list)
[1] "df1x_june2016" "df2x_july2016" "df3x_aug2016" "df4x_sep2016"
You could use an index number within lapply to reference both the list and your vector of suffixes. Because there are a couple steps, I'll wrap the process in a function(). (Called an anonymous function because we aren't assigning a name to it.)
data2016v2 <- lapply(1:7, function(i) {
this_data <- data2016[[i]] # Double brackets for a list
names(this_data) <- paste0(names(this_data), new_names[i]) # Single bracket for vector
this_data # The renamed data frame to be placed into data2016v2
})
Notice in the paste0() line we are recycling the term in new_names[i], so for example if new_names[i] is "june2016" and your first data.frame has columns "A", "B", and "C" then it would give you this:
> paste0(c("A", "B", "C"), "june2016")
[1] "Ajune2016" "Bjune2016" "Cjune2016"
(You may want to add an underscore in there?)
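The same idea can also be written with Map, which walks the list and the suffix vector in parallel so no index is needed. This is a sketch with a two-element stand-in for data2016 (the real list was not posted), and it includes the underscore.

```r
# Hypothetical stand-in for the asker's list of data frames
data2016 <- list(data.frame(A = 1:2, B = 3:4),
                 data.frame(A = 5:6, B = 7:8))
new_names <- c("june2016", "july2016")

# Pair each data frame with its suffix and rename its columns
data2016v2 <- Map(function(d, suffix) {
  names(d) <- paste0(names(d), "_", suffix)
  d
}, data2016, new_names)

names(data2016v2[[1]])
# [1] "A_june2016" "B_june2016"
```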
As an aside, it sounds like you might be better served by adding the "june2016" as a column in your data (like say a variable named month with "june2016" as the value in each row) and combining your data using something like bind_rows() from the dplyr package, running it "long" instead of "wide".
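A rough sketch of that "long" layout, using base R's rbind in place of bind_rows so it runs without dplyr. The two small data frames stand in for data2016, and month is the hypothetical column name suggested above.

```r
# Hypothetical stand-in for the asker's list of data frames
data2016 <- list(data.frame(A = 1:2, B = 3:4),
                 data.frame(A = 5:6, B = 7:8))
new_names <- c("june2016", "july2016")

# Tag every row with its month, then stack the data frames on top of each other
long <- do.call(rbind, Map(function(d, m) cbind(d, month = m), data2016, new_names))
```

The column names stay the same across months, and the month column carries the information the suffixes would have encoded.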

turning lists of lists of lists into a dataframe

I have a set of lists stored in the all_lists.
all_list=c("LIST1","LIST2")
From these, I would like to create a data frame such that
LISTn$findings${Coli}$character is entered into the n'th column with rowname from LISTn$rowname.
DATA
LIST1=list()
LIST1[["findings"]]=list(s1a=list(character="a1",number=1,string="a1type",exp="great"),
s1b=list(number=2,string="b1type"),
in2a=list(character="c1",number=3,string="c1type"),
del3b=list(character="d1",number=4,string="d1type"))
LIST1[["rowname"]]="Row1"
LIST2=list()
LIST2[["findings"]]=list(s1a=list(character="a2",number=5,string="a2type",exp="great"),
s1b=list(character="b2",number=6,string="b2type"),
in2a=list(character="c2",number=7,string="c2type"),
del3b=list(character="d2",number=8,string="d2type"))
LIST2[["rowname"]]="Row2"
Please note that some characters are missing for which NA would suffice.
Desired output is this data frame:
     s1a s1b in2a del3b
Row1 a1  NA  c1   d1
Row2 a2  b2  c2   d2
There are about 1000 of these lists, so speed is a factor. Each list is about 50 MB after I load it through rjson::fromJSON(file=x)
The row and column names don't follow a particular pattern. They're names and attributes
We can use a couple of lapply/sapply combinations to loop over the nested list and extract the elements that have "Row" as the name
do.call(rbind, lapply(mget(all_list), function(x)
sapply(lapply(x$findings[grep("^Row\\d+", names(x$findings))], `[[`,
"character"), function(x) replace(x, is.null(x), NA))))
Or it can be also done by changing the names to a single value and then extract all those
do.call(rbind, lapply(mget(all_list), function(x) {
x1 <- setNames(x$findings, rep("Row", length(x$findings)) )
sapply(x1[names(x1)== "Row"], function(y)
pmin(NA, y$character[1], na.rm = TRUE)[1])}))
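purrr has a strong function called map_chr which is built for these tasks.
library(purrr)
# the pipe has to end a line, not start the next one; NA_character_ keeps the default a string
sapply(mget(all_list), function(x) purrr::map_chr(x$findings, "character", .default = NA_character_)) %>%
t %>%
data.frame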
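A base R variant that also takes the row names from each list's rowname element, as the question asks. This is only a sketch: extract_char, all_lists and cols are made-up names, and the lists are trimmed copies of the question's data held in a list rather than fetched by name.

```r
LIST1 <- list(findings = list(s1a = list(character = "a1", number = 1),
                              s1b = list(number = 2),
                              in2a = list(character = "c1", number = 3),
                              del3b = list(character = "d1", number = 4)),
              rowname = "Row1")
LIST2 <- list(findings = list(s1a = list(character = "a2", number = 5),
                              s1b = list(character = "b2", number = 6),
                              in2a = list(character = "c2", number = 7),
                              del3b = list(character = "d2", number = 8)),
              rowname = "Row2")
all_lists <- list(LIST1, LIST2)

# Every finding name that occurs in any list becomes a column
cols <- unique(unlist(lapply(all_lists, function(l) names(l$findings))))

# Pull the "character" entry for each column, using NA where it is missing
extract_char <- function(lst, cols) {
  vapply(cols, function(nm) {
    val <- lst$findings[[nm]]$character
    if (is.null(val)) NA_character_ else as.character(val)
  }, character(1))
}

mat <- do.call(rbind, lapply(all_lists, extract_char, cols = cols))
rownames(mat) <- vapply(all_lists, function(l) l$rowname, character(1))
res <- as.data.frame(mat, stringsAsFactors = FALSE)
```

Whether this is fast enough for about 1000 large lists would need to be measured on the real data.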

rbinding a list of lists of dataframes based on nested order

I have a dataframe, df and a function process that returns a list of two dataframes, a and b. I use dlply to split up the df on an id column, and then return a list of lists of dataframes. Here's sample data/code that approximates the actual data and methods:
df <- data.frame(id1=rep(c(1,2,3,4), each=2))
process <- function(df) {
a <- data.frame(d1=rnorm(1), d2=rnorm(1))
b <- data.frame(id1=df$id1, a=rnorm(nrow(df)), b=runif(nrow(df)))
list(a=a, b=b)
}
require(plyr)
output <- dlply(df, .(id1), process)
output is a list of lists of dataframes, the nested list will always have two dataframes, named a and b. In this case the outer list has a length 4.
What I am looking to generate is a dataframe with all the a dataframes, along with an id column indicating their respective value (I believe this is left in the list as the split_labels attribute, see str(output)). Then similarly for the b dataframes.
So far I have in part used this question to come up with this code:
list <- unlist(output, recursive = FALSE)
list.a <- lapply(1:4, function(x) {
list[[(2*x)-1]]
})
all.a <- rbind.fill(list.a)
Which gives me the final a dataframe (and likewise for b with a different subscript into list), however it doesn't have the id column I need and I'm pretty sure there's got to be a more straightforward or elegant solution. Ideally something clean using plyr.
Not very clean but you can try something like this (assuming the same data generation process).
list.aID <- lapply(1:4, function(x) {
cbind(list[[(2*x) - 1]], list[[2*x]][1, 1, drop = FALSE])
})
all.aID <- rbind.fill(list.aID)
all.aID
d1 d2 id1
1 0.68103 -0.74023 1
2 -0.50684 1.23713 2
3 0.33795 -0.37277 3
4 0.37827 0.56892 4
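Another possibility is to work from the names of the outer list rather than from positions in the unlisted result. The sketch below uses lapply(split(...)) as a stand-in for the dlply call so that it runs without plyr; that substitution is an assumption, and it relies on the outer list being named by the id1 values.

```r
df <- data.frame(id1 = rep(c(1, 2, 3, 4), each = 2))
process <- function(df) {
  a <- data.frame(d1 = rnorm(1), d2 = rnorm(1))
  b <- data.frame(id1 = df$id1, a = rnorm(nrow(df)), b = runif(nrow(df)))
  list(a = a, b = b)
}

# Stand-in for dlply(df, .(id1), process): a list named by id1
output <- lapply(split(df, df$id1), process)

# Prepend the id (taken from the list names, so it is a character string) to each "a"
all.a <- do.call(rbind, Map(function(res, id) data.frame(id1 = id, res$a),
                            output, names(output)))

# The "b" data frames already carry id1, so they can be stacked as they are
all.b <- do.call(rbind, lapply(output, `[[`, "b"))
```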
