Combination of match and lapply in R - r

Here is my problem.
I have 8 * 3 data frames: 8 years (2005 to 2012), and for each year three data frames corresponding to ecology, flowerdistrib and location. The names of the csv files follow the same pattern (flowerdistrib_2005.csv, ecology_2005.csv, ...).
I would like to build, for each year, a data frame which contains all the columns of the "flowerdistrib" file and some of the columns of the "ecology" and "location" ones.
I imported all of them with this script:
listflower = list.files(path = "C:/Directory/.../", pattern = "flowerdistrib_")
for (i in listflower) {
filepath1 <- file.path("C:/Directory/.../",paste(i))
assign(i,read.csv(filepath1, sep=";", dec=",", header=TRUE))
}
Same for ecology and location.
Then I want to do a vlookup for each year with the three files with some specific columns.
In each year, the csv files ecology, location and flowerdistrib have a column named "idp" in common.
I know how to do it for one year. I use the following script:
2005 example, extraction of the column named "xl93" present in the file location_2005.csv:
flowerdistrib_2005[, "xl93"] = location_2005$"xl93"[match(flowerdistrib_2005$"idp", location_2005$"idp")]
But I don't know how to do this for all the years at once. I was thinking of using a for loop combined with the lapply function, but I don't handle it very well as I am an R beginner.
I would appreciate any and all help.
Thanks a lot.
PS: I am not a native English speaker; apologies for any misunderstandings and language mistakes.

This is a bit of a reorganization of your read.csv procedure, but you could use something like the script below. It creates a list, data, which contains one data frame per year. You can also combine all those data frames into one if the input tables all have the very same structure.
Hope this helps. I have not tested the code below, so it may not work as-is after you copy it and update the paths, but something very similar should work for you.
# Prepare empty list
data <- list()
# Loop through all years
for(year in 2005:2012){
# Load data for this year
flowers <- read.csv(paste('C:/Directory/.../a/flowerdistrib_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
ecology <- read.csv(paste('C:/Directory/.../a/ecology_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
location <- read.csv(paste('C:/Directory/.../a/location_', year, '.csv', sep=''), sep=";", dec=",", header=TRUE)
# Merge data for this specific year, using idp as identifier
all <- merge(flowers, ecology, by = "idp", all = TRUE)
all <- merge(all, location, by = "idp", all = TRUE)
# Add a year column with constant year value to data
all$year <- year
# Drop unused columns
dropnames = c('column_x', 'column_y')
all <- all[,!(names(all) %in% dropnames)]
# Or alternatively, only keep wanted columns
keepnames = c('idp', 'year', 'column_z', 'column_v')
all <- all[keepnames]
# Append data to list
data[[as.character(year)]] <- all
}
# At this point, data should be a list of dataframes with all data for each year
# so this should print the summary of the data for 2007
summary(data[['2007']])
# If all years have the very same column structure,
# you can use rbind to combine all years into one big dataframe
data <- do.call(rbind, data)
# This would summarize the data frame with all data combined
summary(data)

Here is a shorter version using some functional programming concepts. First, we write a function read_and_merge that accepts a year as an argument, lists the files for that year, and reads them into data_, a list of three data frames. The final trick is to use the Reduce function, which merges the three data frames one after another. I am assuming that the only common column is idp.
read_and_merge <- function(year, mydir = "C:/Directory/.../a/"){
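# pattern is a regular expression; paste0 avoids the spaces that paste would insert
files_ = list.files(mydir, pattern = paste0("_", year, "\\.csv$"), full.names = TRUE)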
data_ = lapply(files_, read.csv, sep = ";", dec = ",", header = TRUE)
Reduce('merge', data_)
}
The second step is to create a list of the years and use lapply to create datasets for each year.
mydata = lapply(2005:2012, read_and_merge)
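For the match-based lookup in the original question, the same idea can be applied year by year once the data frames are kept in named lists instead of separate variables. The sketch below is only an illustration: it builds small made-up tables in memory rather than reading the real CSV files, and the helper name add_column and all values are hypothetical.

```r
# Toy stand-ins for flowerdistrib_<year> and location_<year>;
# in practice these lists would be filled by read.csv() calls.
years <- c("2005", "2006")
flowerdistrib <- list(
  "2005" = data.frame(idp = c(3, 1, 2), species = c("a", "b", "c")),
  "2006" = data.frame(idp = c(2, 4), species = c("d", "e"))
)
location <- list(
  "2005" = data.frame(idp = c(1, 2, 3), xl93 = c(10, 20, 30)),
  "2006" = data.frame(idp = c(2, 5), xl93 = c(200, 500))
)

# Hypothetical helper: copy one column from `lookup` into `target`,
# matching rows on the shared key column "idp".
add_column <- function(target, lookup, column, key = "idp") {
  target[[column]] <- lookup[[column]][match(target[[key]], lookup[[key]])]
  target
}

# Apply the lookup to every year; the result is a named list of data frames.
result <- lapply(setNames(years, years), function(y) {
  add_column(flowerdistrib[[y]], location[[y]], "xl93")
})

result[["2005"]]$xl93  # 30 10 20
result[["2006"]]$xl93  # 200 NA (idp 4 has no match in location)
```

Unlike merge, this keeps the row order of the flowerdistrib table and adds only the requested column, which is closer to a spreadsheet vlookup.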

Related

merge data nested dataframes in R

I have several DFs. Each of them is the result csv file of one participant from my experiment. Some of the csv files have 48 variables. Others have, in addition to these identical variables, 6 more variables (53 variables). However, if I try to merge them like this:
flist <- list.files(path="my path", pattern = ".csv", full.names = TRUE)
Merge<-plyr::ldply(flist, read_csv) #Merge all files
the merging is done by column order and not by variable name. Therefore, in one column of my big combined DF I get data from different variables.
So I tried a different strategy: loading my files as separate DFs:
data_files <- list.files("my_path") # Identify file names
data_files
for(i in 1:length(data_files)) { # Head of for-loop
assign(paste0("data", i), # Read and store data frames
read_csv(paste0("my_path/",
data_files[i])))
}
Then I tried to merge them by this script:
listDF <- names(which(unlist(eapply(.GlobalEnv,is.data.frame)))) #list of my DFs
listDF
library(plyr)
MergeDF<-do.call('rbind.fill', listDF)
But I'm still stuck.
We may use map_dfr
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
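Both map_dfr and plyr::rbind.fill match columns by name and fill the missing ones with NA. For readers without those packages, a rough base R equivalent is sketched below; the helper name bind_by_name and the toy data are made up for illustration.

```r
# Two toy data frames: the second has one extra column and a different order.
df1 <- data.frame(id = 1:2, score = c(0.5, 0.7), stringsAsFactors = FALSE)
df2 <- data.frame(extra = c("x", "y"), score = c(0.9, 0.1), id = 3:4,
                  stringsAsFactors = FALSE)

# Hypothetical helper: row-bind a list of data frames by column name,
# filling columns that are absent from a data frame with NA.
bind_by_name <- function(dfs) {
  all_cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    for (col in setdiff(all_cols, names(d))) {
      d[[col]] <- NA               # add each absent column
    }
    d[all_cols]                    # put columns in a common order
  })
  do.call(rbind, padded)
}

combined <- bind_by_name(list(df1, df2))
names(combined)  # "id" "score" "extra"
nrow(combined)   # 4
```

With the objects from the question, the input list could be collected with mget(listDF), since listDF holds the names of the data frames rather than the data frames themselves.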

Merging thousands of csv files into a single dataframe in R

I have 2500 csv files, all with the same columns and a variable number of observations.
Each file is approximately 3mb (~10000 obs per file).
Ideally, I would like to read all of these in to a single dataframe.
Each file represents a generation and contains info in regard to traits, phenotypes and allele frequencies.
While reading in this data, I am also trying to add an extra column to each read indicating the generation.
I have written the following code:
read_data <- function(ex_files,ex){
df <- NULL
ex <- as.character(ex)
for(n in 1:length(ex_files)){
temp <- read.csv(paste("Experiment ",ex,"/all e",ex," gen",as.character(n),".csv",sep=""))
temp$generation <- n
df <- rbind(df,temp)
}
return(df)
}
ex_files refers to list.length, while ex refers to the experiment number as it was performed in replicate (ie. I have multiple experiments each with 2500 csv files).
I am currently running it (I hope it's written correctly!), however it is taking quite a while (as expected). I'm wondering if there is a quicker way of doing this at all?
It is inefficient to grow objects in a loop. List all the files that you want to read using list.files and with purrr::map_df combine them into one dataframe with an additional column called generation which will give a unique number to each file.
filenames <- list.files(pattern = '\\.csv', full.names = TRUE)
df <- purrr::map_df(filenames, read.csv, .id = 'generation')
head(df)
Try the plyr package:
filenames = list.files(pattern = '\\.csv', full.names = TRUE)
df = plyr::ldply(filenames, read.csv)
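One caveat when the position in filenames is used as the generation: list.files returns names in alphabetical order, so the file for generation 10 sorts before the one for generation 2. Below is a minimal base R sketch that reads the generation from the file name instead, assuming the names end in "gen<number>.csv" as in the question. The temporary directory, file names and toy data are invented so that the example is self-contained.

```r
# Create a few small CSV files in a temporary directory (toy data).
tmp <- file.path(tempdir(), "gen_demo")
dir.create(tmp, showWarnings = FALSE)
for (g in c(1, 2, 10)) {
  write.csv(data.frame(trait = g * c(1, 2)),
            file.path(tmp, paste0("all e1 gen", g, ".csv")),
            row.names = FALSE)
}

filenames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, tagging each with the generation
# parsed from its file name rather than its position in the list.
pieces <- lapply(filenames, function(f) {
  d <- read.csv(f)
  d$generation <- as.integer(sub(".*gen([0-9]+)\\.csv$", "\\1", basename(f)))
  d
})

# Bind once at the end instead of growing a data frame inside the loop.
df <- do.call(rbind, pieces)
df <- df[order(df$generation), ]
```

Binding once with do.call(rbind, ...) avoids copying the accumulated data frame on every iteration, which is what makes the loop in the question slow.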

How can I dynamically combine data frames with different column names in R?

I have an analytics script that processes batches of data with similar structure, but different column names. I need to preserve the column names for later ETL scripts, but we want to do some processing, e.g.:
results <- data.frame();
for (name in names(data[[1]])) {
# Start by combining each column into a single matrix
working <- lapply(data, function(item)item[[name]]);
working <- matrix(unlist(working), ncol = 50, byrow = TRUE);
# Dump the data for the archive
write.csv(working, file = paste(PATH, prefix, name, '.csv', sep = ''), row.names = FALSE);
# Calculate the mean and SD for each year, bind to the results
df <- data.frame(colMeans(working), colSds(working));
names(df) <- c(paste(name, '.mean', sep = ''), paste(name, '.sd', sep = ''));
# Combine the working df with the processing one
}
Per the last comment in the example, how can I combine the data frames? I've tried rbind and rbind.fill but neither works, and there may be tens to hundreds of different column names in the data files.
This might have been more of an issue with searching for the right keyword, but the cbind method was actually the way to go along with a matrix,
# Allocate for the number of rows needed
results <- matrix(nrow = rows)
for (name in names(data[[1]])) {
# Data processing
# Append the results to the working data
results <- cbind(results, df)
}
# Drop the first placeholder column created upon allocation
results <- results[, -1];
Obviously the catch is that the columns need to have the same number of rows, but otherwise it is just a matter of appending the columns to the matrix.
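The same column-wise result can also be built without the placeholder column, by collecting each per-variable data frame in a list and binding once at the end. This is a sketch on made-up data: it uses apply(..., 2, sd) in place of colSds (which is not part of base R), and the variable names and values are hypothetical.

```r
# Toy input: a list of batches, each holding the same named numeric vectors.
data <- list(
  list(height = c(1, 2, 3), weight = c(10, 20, 30)),
  list(height = c(3, 4, 5), weight = c(30, 40, 50))
)
n_cols <- 3  # length of each vector (50 in the question)

summaries <- lapply(names(data[[1]]), function(name) {
  # One row per batch, one column per position
  working <- matrix(unlist(lapply(data, function(item) item[[name]])),
                    ncol = n_cols, byrow = TRUE)
  df <- data.frame(colMeans(working), apply(working, 2, sd))
  names(df) <- paste0(name, c(".mean", ".sd"))
  df
})

# Bind all summary columns side by side in one step.
results <- do.call(cbind, summaries)
names(results)  # "height.mean" "height.sd" "weight.mean" "weight.sd"
```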

R: Loop for importing multiple xls as df, rename column of one df and then merge all df's

The below is driving me a little crazy and I’m sure there’s an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
df <- read.xlsx(file, 1, startRow=2, header=TRUE)
# ... perform calcs ...
universe_list[[count]] <- df
count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be of fixed length apart (like one year etc)
I can’t seem to figure out the code for this. From a process perspective, I’m thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can’t work out the code for steps 1-4!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is reminiscent of for loops in other languages, but R has many vectorized approaches to iterating over lists. The code below assumes that the 15 and 16 file lists have the same length, with corresponding months in the same order, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
# "[15]" would be a character class matching a single 1 or 5, so match the literal suffix
files15list <- list.files(path, pattern = "-15\\.xls", full.names = TRUE)
files16list <- list.files(path, pattern = "-16\\.xls", full.names = TRUE)
dfprocess <- function(x, y){
df1 <- read.xlsx(x, 1, startRow=2, header=TRUE)
names(df1) <- paste0(names(df1), "1") # SUFFIX COLS WITH 1
df2 <- read.xlsx(y, 1, startRow=2, header=TRUE)
names(df2) <- paste0(names(df2), "2") # SUFFIX COLS WITH 2
df <- cbind(df1, df2) # CBIND DFs
# ... perform calcs ...
return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
function(i) wide_list[,i]) # ALTERNATE OUTPUT
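If a plain list of combined data frames is all that is needed, Map (or mapply with SIMPLIFY = FALSE) returns one directly and avoids the step of pulling columns out of wide_list. The sketch below uses small in-memory data frames in place of the Excel files; the month names, the helper name pair_up and the values are invented.

```r
# Toy stand-ins for the data frames read from the -15 and -16 files.
dfs15 <- list(Jan = data.frame(id = 1:2, price = c(10, 20)),
              Feb = data.frame(id = 1:2, price = c(11, 21)))
dfs16 <- list(Jan = data.frame(id = 1:2, price = c(12, 24)),
              Feb = data.frame(id = 1:2, price = c(13, 25)))

pair_up <- function(df1, df2) {
  names(df1) <- paste0(names(df1), "1")  # suffix first year's columns
  names(df2) <- paste0(names(df2), "2")  # suffix second year's columns
  df <- cbind(df1, df2)
  df$change <- df$price2 - df$price1     # example calculation
  df
}

# Map walks both lists in parallel and always returns a list.
pairs <- Map(pair_up, dfs15, dfs16)
pairs$Jan$change  # 2 4
```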
First sort your filelist so that the two files on which you want to do your calculations are next to each other. After that, try this:
for (count in seq(1, length(filelist), 2)) {
  df <- read.xlsx(filelist[count], 1, startRow=2, header=TRUE)
  df1 <- read.xlsx(filelist[count+1], 1, startRow=2, header=TRUE)
  # change column names and apply merge or append depending on requirement
  # perform calcs
  # save
}

repeat the assigning of data frame in R [duplicate]

I am new to R and stackoverflow so this will probably have a very simple solution.
I have a set of data from 20 different subjects. In the future I will have to perform a lot of different actions on this data and repeat each action for all individual sets, analyzing them separately and recombining them.
My question is how can I automate this process:
P4 <- read.delim("P4Rtest.txt")
P7 <- read.delim("P7Rtest.txt")
P13 <- read.delim("P13Rtest.txt")
etc etc etc.
I have tried looping with a for loop but seem to get stuck with creating a new data.frame with a unique name every time.
Thank you for your help
The R way to do this would be to keep all the data sets together in a named list. For that you can use the following, where n is the number of files.
nm <- paste0("P", 1:n) ## create the names P1, P2, ..., Pn
dfList <- setNames(lapply(paste0(nm, "Rtest.txt"), read.delim), nm)
Now dfList will contain all the data sets. You can access them individually with dfList$P1 for P1, dfList$P2 for P2, and so on.
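Once the data sets live in a named list, analyzing each subject and recombining the results are both short steps. A small sketch with invented data (the column rt and its values are hypothetical):

```r
# Toy named list standing in for dfList.
dfList <- list(P4 = data.frame(rt = c(300, 320)),
               P7 = data.frame(rt = c(410, 390, 400)))

# Run the same analysis on every subject.
mean_rt <- sapply(dfList, function(d) mean(d$rt))

# Recombine into one data frame, keeping track of the subject.
combined <- do.call(rbind, Map(function(d, id) cbind(subject = id, d),
                               dfList, names(dfList)))
```

Here mean_rt is a named vector with one value per subject, and combined has one row per observation plus a subject column.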
There are a bunch of different ways of doing stuff like this. You could combine all the data into one data frame using rbind. The first answer here has a good way of doing that: Replace rbind in for-loop with lapply? (2nd circle of hell)
If you combine everything into one data frame, you'll need to add a column that identifies the participant. So instead of
P4 <- read.delim("P4Rtest.txt")
...
You would have something like
my.list <- vector("list", number.of.subjects)
for (participant.number in 1:number.of.subjects) {
  # load individual participant data
  participant.filename <- paste("P", participant.number, "Rtest.txt", sep = "")
  participant.df <- read.delim(participant.filename)
  # add a column:
  participant.df$participant.number <- participant.number
  my.list[[participant.number]] <- participant.df
}
# bind all participants into one data frame
solution <- do.call(rbind, my.list)
If you want to keep them separate data frames for some reason, you can keep them in a list (leave off the last rbind line) and use lapply(my.list, function(participant.df) { stuff you want to do }) whenever you want to do stuff to the data frames.
You can use assign. Assuming all your files have a similar format as you have shown, this will work for you:
# Define how many files there are (with the numbers).
numFiles <- 10
# Run through that sequence.
for (i in 1:numFiles) {
fileName <- paste0("P", i, "Rtest.txt") # Creating the name to pull from.
file <- read.delim(fileName) # Reading in the file.
dName <- paste0("P", i) # Creating the name to assign the file to in R.
assign(dName, file) # Creating the file in R.
}
There are other methods that are faster and more compact, but I find this to be more readable, especially for someone who is new to R.
Additionally, if your numbers aren't a complete sequence like I've used here, you can just define a vector of what numbers are used like:
numFiles <- c(1, 4, 10, 25)
