Rbinding Multiple dfs from Excel Files - r

I am currently working on combining my data from multiple excel files into one df. Problem is, the number of columns differs across the files (due to different experiment versions), so I need to bind only certain columns/variables from each file (they have the same names).
I tried doing this "manually" at first, using:
library(openxlsx)
PWI <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI$Subject, PWI$Block, PWI$Category, PWI$Trial, PWI$prompt1.RT)
#read in and pull out variables of interest for one subject
mergedFullData = merge(mergedDataA, mergedDataB)
#add two together, then add the third to the merged file, add 4th to that merged file, etc
Obviously, it seems like there's a simpler way to combine the files. I've been working on using:
library(openxlsx)
path <- "/Users/myname/Desktop/PrelimPWI"
merge_file_name <- "/Users/myname/Desktop/PrelimPWI/merge_file_name.xlsx"
filenames_list <- list.files(path= path, full.names=TRUE)
All <- lapply(filenames_list, function(merge_file_name){
  print(paste("Merging", merge_file_name, sep = " "))
  read.xlsx(merge_file_name, colNames = TRUE, startRow = 2)
})
PWI <- do.call(rbind.data.frame, All)
write.xlsx(PWI,merge_file_name)
However, I keep getting the error that the number of columns doesn't match, but I'm not sure where to pull out the specific variables I need (the ones listed in the earlier code). Any other tweaks I've tried have resulted in only the first file being written into the xlsx, or a completely blank df. Any help would be greatly appreciated!

library(tidyverse)
df1 <- tibble(
  a = c(1, 2, 3),
  x = c(4, 5, 6)
)
df2 <- tibble(
  x = c(7, 8, 9),
  y = c("d", "e", "f")
)
bind_rows(df1, df2)
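which returns one long data frame with all three columns, filling the gaps with NA:
# A tibble: 6 x 3
      a     x y
  <dbl> <dbl> <chr>
1     1     4 NA
2     2     5 NA
3     3     6 NA
4    NA     7 d
5    NA     8 e
6    NA     9 f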
The bind functions from dplyr should be able to help you. They can bind data frames together by row or column, and they are flexible about mismatched column names: bind_rows fills any column missing from one data frame with NA.
You can then select the actual columns you want to keep.
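For example, with the toy tibbles above, keeping only the shared x column:
bind_rows(df1, df2) %>%
  select(x)
Columns you drop (a and y here) simply disappear, NAs and all.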

Related

Binding Rows in df

I'm having a small issue binding some dfs together.
I used the following procedure to create simplified dfs:
PWI_1 <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI_1$ExperimentName, PWI_1$Subject, PWI_1$error, PWI_1$Block, PWI_1$Category, PWI_1$Trial, PWI_1$prompt1.RT)
...etc for the following files.
I then used this code to try and bind the dfs:
merged <- bind_rows(list(Slim_1,Slim_10,Slim_11...))
However, the dfs are concatenated side by side instead of appended end to end into one long-format df.
*Note: PWI_V1x is the name of the experiment version, which needs to line up across files.
I think the error is caused by the variable trimming process (i.e. creating a 'slim' df), but unfortunately the untrimmed files have different numbers of columns, so I get an error when trying to bind the original dfs. Any advice would be appreciated!
bind_rows matches columns by name, so the column names have to agree across data frames. Building each slim df with data.frame(PWI_1$Subject, ...) produces names like PWI_1.Subject that differ from file to file, which is why the frames end up side by side. Instead of "slimming" your data frame that way, use dplyr::select so you pick out the same column names every time.
Slim_PWI_1 <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2) %>%
  select(ExperimentName, Subject, error, Block, Category, Trial, prompt1.RT)
then this should work:
merged <- bind_rows(Slim_PWI_1, ...)
Edit:
If you have multiple files as you indicate, you can read and slim them all together like this:
Slim_PWI_list <- dir(path = "/Users/myname/Desktop/PrelimPWI/", pattern = "PWI.*xlsx", full.names = TRUE) %>%
  map(~read.xlsx(., colNames = TRUE, startRow = 2)) %>%
  map(~select(., ExperimentName, Subject, error, Block, Category, Trial, prompt1.RT))
merged <- bind_rows(Slim_PWI_list)
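As a one-step variant (a sketch assuming the same openxlsx/tidyverse setup as above), purrr's map_dfr reads, slims, and binds in a single pipeline:
merged <- dir(path = "/Users/myname/Desktop/PrelimPWI/", pattern = "PWI.*xlsx", full.names = TRUE) %>%
  map_dfr(~read.xlsx(.x, colNames = TRUE, startRow = 2) %>%
            select(ExperimentName, Subject, error, Block, Category, Trial, prompt1.RT))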

How to clean multiple Excel files at one time in R?

I have more than one hundred Excel files that need cleaning, all with the same data structure. The code listed below is what I use to clean a single Excel file. The file names all follow a pattern like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
  return(df[start:end, ])
}
dividers <- which(df$Product_Models %in% 'SKU')
df <- lapply(1:(length(dividers) - 1), function(x) in_between(df, start = dividers[x], end = dividers[x + 1]))
df <- do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I get a well-structured data frame.
Can you help me write code that cleans all the Excel files at one time, so I don't need to run this code a hundred times, and then aggregates all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  # ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
                       X__1 = "New.1",
                       X__5 = "New.5",
                       X__9 = "New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
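A minimal sketch of that two-read approach (the B1 cell and the two skipped rows are assumptions about your sheet layout):
library(readxl)
# read only the cell that holds the project name (range = 'B1' is an assumption)
project_name <- read_excel('abc.xlsx', sheet = 'EQuote', range = 'B1', col_names = FALSE)[[1]]
# skip the title rows so the real header row supplies the column names (skip = 2 is an assumption)
df <- read_excel('abc.xlsx', sheet = 'EQuote', skip = 2)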

Merge in R only shows Header

I have 3 large excel databases converted to csv. I wish to combine these into one by using R.
I have tagged the 3 files as dat1, dat2, dat3 respectively. I tried to merge dat1 and dat2 as myfulldata, and then merge myfulldata with dat3, saved as myfulldata2.
When I did this, only the headers remained in the combination; essentially none of the contents of the databases were visible. The number of obs. for each myfulldata object shows as 0, despite the obs. counts for the individual components being very large. Can anyone advise how to resolve this?
Code:
dat1 <- read.csv("PS 2014.csv", header=T)
dat2 <- read.csv("PS 2015.csv", header=T)
dat3 <- read.csv("PS 2016.csv", header=T)
myfulldata = merge(dat1, dat2)
myfulldata2 = merge(myfulldata, dat3)
save(myfulldata2, file = "Palisis.RData")
Doing a merge in R is analogous to doing a join between two tables in a database: by default, merge() joins on every column name the data frames share, so if no rows agree on all of those columns the result is empty, which is why you see only headers. I suspect what you actually want is to aggregate your three CSV files row-wise (i.e. union them). In that case, try using rbind instead:
myfulldata <- rbind(dat1, dat2)
myfulldata <- rbind(myfulldata, dat3)
save(myfulldata, file = "Palisis.RData")
Note that this assumes the number (and ideally the types) of the columns in each data frame is the same (cf. doing a UNION in SQL).
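If the column sets do differ, a minimal sketch that keeps only the columns shared by all three files before binding:
common <- Reduce(intersect, list(names(dat1), names(dat2), names(dat3)))
myfulldata <- rbind(dat1[common], dat2[common], dat3[common])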

R: Loop for importing multiple xls as df, rename column of one df and then merge all df's

The below is driving me a little crazy and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow = 2, header = TRUE)
  # *perform calcs*
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform involve data from two or more Excel files. So, for example, I would need to import the Jan-16 and the Jan-15 Excel files, perform whatever needs to be done, and then move on to the next pair of files (Feb-16 and Feb-15). The files will always be a fixed interval apart (e.g. one year).
I can't seem to figure out the code for this. From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for these steps!
Many thanks for helping out
Consider mapply() to handle the data frame pairs together. Your current loop is reminiscent of for loop operations in other languages, but R has many vectorized approaches to iterate over lists. The code below assumes the 15 and 16 file lists are the same length, with corresponding months in both, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "-15\\.xls$", full.names = TRUE)
files16list <- list.files(path, pattern = "-16\\.xls$", full.names = TRUE)
dfprocess <- function(x, y){
  df1 <- read.xlsx(x, 1, startRow = 2, header = TRUE)
  names(df1) <- paste0(names(df1), "1")  # SUFFIX COLS WITH 1
  df2 <- read.xlsx(y, 1, startRow = 2, header = TRUE)
  names(df2) <- paste0(names(df2), "2")  # SUFFIX COLS WITH 2
  df <- cbind(df1, df2)                  # CBIND DFs
  # ... perform calcs ...
  return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list), function(i) wide_list[, i])  # ALTERNATE OUTPUT
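If you would rather get a plain list back directly and skip that reshaping step, mapply() accepts SIMPLIFY = FALSE:
long_list <- mapply(dfprocess, files15list, files16list, SIMPLIFY = FALSE)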
First sort your filelist so that the two files you want to use in each calculation are consecutive. After that try this:
for (count in seq(1, length(filelist), 2)) {
  df <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  # change column names and apply merge or append depending on requirement
  # perform calcs
  # save
}
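To flesh out the placeholder steps, the loop body could continue something like this (the _prev suffix and "your_key" are illustrative names, not from the original post):
names(df1) <- paste0(names(df1), "_prev")  # distinguish the second file's columns
combined <- cbind(df, df1)                 # or merge(df, df1, by = "your_key") to join on a key
# perform calcs on combined, then save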

Combine some csv files into one - different number of columns

I have already loaded 20 csv files with:
tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))
or
list_of_data = lapply(tbl, read.csv)
This is how it looks:
> head(tbl)
[1] "F1.csv" "F10_noS3.csv" "F11.csv" "F12.csv" "F12_noS7_S8.csv"
[6] "F13.csv"
I have to combine all of those files into one. Let's call it a master file, but let's start by making one table with all of the names.
In all of those csv files there is a column called "Accession". I would like to make a table of all the "names" from all of those csv files. Of course many of the accessions are repeated in different csv files. I would like to keep all of the data corresponding to each accession.
Some problems:
Some of those "names" are the same and I don't want to duplicate them
Some of those "names" are ALMOST the same. The difference is that after the name there is a dot and a number.
The number of columns can be different in those csv files.
Here's a screenshot showing what the data look like:
http://imageshack.com/a/img811/7103/29hg.jpg
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. These should be treated as one, so just ignore the dot and the number after it.
Is this possible to do?
I couldn't include a dput(head(...)) because the data set is too big.
I tried to use this code:
all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) :
The number of columns is not correct.
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
I have been trying for almost 2 weeks and haven't been able to do it, so please help!
Your question seems to contain multiple subquestions. I encourage you to separate them.
The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package:
library(plyr)
all_data = do.call(rbind.fill, list_of_data)
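The near-duplicate accession names can then be handled on the combined frame with the cleaning code from your own question (str_extract comes from the stringr package):
library(stringr)
all_data$CleanedAccession <- str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data <- subset(all_data, !duplicated(CleanedAccession))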
Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:
library(tidyverse)
# specify the target directory
dir_path <- '~/test_dir/'
# specify the naming format of the files.
# in this case, csv files that begin with 'test' and a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'
# create sample data with some missing columns
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)
# custom function that takes the target directory and file name pattern as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>%
    mutate(file_name = file_name) %>%  # add the file name as a column
    select(file_name, everything())    # reorder the columns so file name is first
  return(x)
}
# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>%
  map_df(~ read_dir(dir_path, .))
# files with missing columns are filled with NAs.
