Combine some csv files into one - different number of columns - r

I already loaded 20 csv files with this code:
tbl = list.files(pattern="*.csv")
for (i in 1:length(tbl)) assign(tbl[i], read.csv(tbl[i]))
or
list_of_data = lapply(tbl, read.csv)
This is how it looks:
> head(tbl)
[1] "F1.csv" "F10_noS3.csv" "F11.csv" "F12.csv" "F12_noS7_S8.csv"
[6] "F13.csv"
I have to combine all of those files into one. Let's call it a master file, but let's start by making one table with all of the names.
Each of those csv files has a column called "Accession". I would like to make a table of all the "names" from all of those csv files. Of course, many of the accessions are repeated across different csv files. I would like to keep all of the data corresponding to each accession.
Some problems:
Some of those "names" are the same and I don't want to duplicate them.
Some of those "names" are ALMOST the same. The difference is that after the name there is a dot followed by a number.
The number of columns can differ between those csv files.
Here is a screenshot showing what the data looks like:
http://imageshack.com/a/img811/7103/29hg.jpg
Let me show you how it looks:
AT3G26450.1 <--
AT5G44520.2
AT4G24770.1
AT2G37220.2
AT3G02520.1
AT5G05270.1
AT1G32060.1
AT3G52380.1
AT2G43910.2
AT2G19760.1
AT3G26450.2 <--
<-- = Same sample, different names. Should be treated as one, so just ignore the dot and the number after it.
Is it possible to do?
I couldn't post a dput(head()) because the data set is too big even for that.
I tried to use such code:
all_data = do.call(rbind, list_of_data)
Error in rbind(deparse.level, ...) :
The number of columns is not correct.
all_data$CleanedAccession = str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data = subset(all_data, !duplicated(CleanedAccession))
I have been trying to do this for almost 2 weeks and I am not able to. So please help me.

Your question seems to contain multiple subquestions. I encourage you to separate them.
The first thing you apparently need is to combine data frames with different columns. You can use rbind.fill from the plyr package:
library(plyr)
all_data = do.call(rbind.fill, list_of_data)
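Your str_extract approach can then run on the combined frame to handle the deduplication. A minimal end-to-end sketch with toy data (the columns score and extra are made up for illustration, not from your real files):

```r
library(plyr)
library(stringr)

# two toy frames standing in for list_of_data; their column sets differ
df1 <- data.frame(Accession = c("AT3G26450.1", "AT5G44520.2"), score = c(1, 2))
df2 <- data.frame(Accession = c("AT3G26450.2", "AT4G24770.1"), extra = c("a", "b"))

# rbind.fill pads the missing columns with NA instead of erroring
all_data <- rbind.fill(df1, df2)

# strip the ".<number>" suffix so isoforms of the same accession collapse to one
all_data$CleanedAccession <- str_extract(all_data$Accession, "^[[:alnum:]]+")
all_data <- subset(all_data, !duplicated(CleanedAccession))
```

After this, AT3G26450.1 and AT3G26450.2 are treated as one accession and only the first occurrence is kept.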

Here's an example using some tidyverse functions and a custom function that can combine multiple csv files with missing columns into one data frame:
library(tidyverse)
# specify the target directory
dir_path <- '~/test_dir/'
# specify the naming format of the files.
# in this case, csv files that begin with 'test' and a single digit, but it could be as simple as 'csv'
re_file <- '^test[0-9]\\.csv'
# create sample data with some missing columns
df_mtcars <- mtcars %>% rownames_to_column('car_name')
write.csv(df_mtcars %>% select(-am), paste0(dir_path, 'test1.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-wt, -gear), paste0(dir_path, 'test2.csv'), row.names = FALSE)
write.csv(df_mtcars %>% select(-cyl), paste0(dir_path, 'test3.csv'), row.names = FALSE)
# custom function that takes the target directory and file name pattern as arguments
read_dir <- function(dir_path, file_name){
  x <- read_csv(paste0(dir_path, file_name)) %>%
    mutate(file_name = file_name) %>%  # add the file name as a column
    select(file_name, everything())    # reorder the columns so file name is first
  return(x)
}
# read the files from the target directory that match the naming format and combine into one data frame
df_panel <-
  list.files(dir_path, pattern = re_file) %>%
  map_df(~ read_dir(dir_path, .))
# files with missing columns are filled with NAs.

Related

merge nested dataframes in R

I have several DFs. Each of them is the result csv file of one participant from my experiment. Some of the csv files have 48 variables. Others have, in addition to these identical variables, 6 more variables (53 variables). However, if I try to merge them like this:
flist <- list.files(path="my path", pattern = ".csv", full.names = TRUE)
Merge<-plyr::ldply(flist, read_csv) #Merge all files
the merging is done by column order and not by variable name. Therefore, in one column of my big combined DF I get data from different variables.
So I tried a different strategy: loading my files as separate DFs:
data_files <- list.files("my_path") # Identify file names
data_files
for(i in 1:length(data_files)) {      # Head of for-loop
  assign(paste0("data", i),           # Read and store data frames
         read_csv(paste0("my_path/", data_files[i])))
}
Then I tried to merge them by this script:
listDF <- names(which(unlist(eapply(.GlobalEnv,is.data.frame)))) #list of my DFs
listDF
library(plyr)
MergeDF<-do.call('rbind.fill', listDF)
But I'm still stuck.
We may use map_dfr
library(readr)
library(purrr)
map_dfr(setNames(flist, flist), read_csv, .id = "id")
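A self-contained sketch of why this fixes the column-order problem (the file names and columns below are made up): map_dfr matches columns by name, fills the variables missing from a file with NA, and setNames makes the .id column carry the file paths.

```r
library(readr)
library(purrr)

# write two toy participant files into a scratch directory;
# the second file has an extra column the first one lacks
dir <- file.path(tempdir(), "exp_results")
dir.create(dir, showWarnings = FALSE)
write_csv(data.frame(subj = 1, rt = 300),            file.path(dir, "p1.csv"))
write_csv(data.frame(subj = 2, rt = 310, bonus = 5), file.path(dir, "p2.csv"))

flist <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# setNames(flist, flist) makes .id record the source file instead of 1, 2, ...
merged <- map_dfr(setNames(flist, flist), read_csv, .id = "id")
```

The rows from p1.csv get NA in the bonus column instead of silently inheriting data from the wrong variable.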

for loop with dplyr

I have a bunch of files I read in manually as such:
# gel above replicates
A_gel <-read.delim("XL1_3_S35_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
B_gel <-read.delim("XL2_3_S37_L004_R1_001_w_XL2_3_S37_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
C_gel <- read.delim("XL2_3_S37_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
D_gel <- read.delim("XL1_3_S35_L004_R1_001_w_XL1_3_S35_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
# gel below replicates
A_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
B_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL2_3b_S38_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
C_below_gel <- read.delim("XL2_3b_S38_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
D_below_gel <- read.delim("XL1_3b_S36_L004_R1_001_w_XL1_3b_S36_L004_R1_001_01.basedon.peaks.l2inputnormnew.bed.compressed.bed")
I would like to change all the columns of these files and arrange by the start column with something like this:
colnames(A_gel) <- c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")
A_gel <- A_gel %>%
arrange(A_gel$Start)
Instead, I would like to use a for loop for all files using R.
Never create multiple variables following the same pattern. The properly supported solution for this general problem is the use of lists (i.e. instead of having variables A_gel, B_gel, …, you have one variable gel, which is a list that contains your individual data.frames; you can also assign names to these individual items, though in your case that doesn’t seem necessary).
Then you can use e.g. lapply to run over your file paths and read the data of the different files into that list:
gel = lapply(gel_filenames, read.delim)
below_gel = lapply(below_gel_filenames, read.delim)
… and likewise you can put your arrangement code into a function and apply that, changing the above to:
library(dplyr)

read_bed = function (filename) {
  read.delim(filename) %>%
    setNames(c("Chromosome", "Start", "End", "LogPVal", "LogFC", "Strand")) %>%
    arrange(Start)
}

# …

gel = lapply(gel_filenames, read_bed)
Better yet, use purrr::map_dfr to read all data into a single combined table:
gel = gel_filenames %>%
  setNames(., .) %>%
  map_dfr(read_bed, .id = 'Filename')
(The setNames(., .) step is necessary since map_dfr assigns the names of the input vector to the added ID column.)
This will create one master table for the "gel" data, which has an added ID column with the original filename (you'll probably want to extract just some ID from that, using tidyr::extract).
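For that last step, a hedged sketch of pulling a sample ID out of the Filename column with tidyr::extract (the regex is an assumption about the naming scheme, matching the leading "XL<n>_<n>" part of names like those in the question):

```r
library(tidyr)

# toy combined table with the Filename column that map_dfr adds
gel <- data.frame(
  Filename = c("XL1_3_S35_L004_R1_001.bed", "XL2_3_S37_L004_R1_001.bed"),
  Start    = c(100, 200)
)

# capture the leading "XL<n>_<n>" portion into a new SampleID column;
# remove = FALSE keeps the original Filename column as well
gel <- extract(gel, Filename, into = "SampleID",
               regex = "^(XL[0-9]+_[0-9]+)", remove = FALSE)
```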

R - Combine multiple data frames according to the pattern in their name

I would like to combine data frames in the global environment according to the pattern in their name, and simultaneously add the name of the file they are originally from.
My problem is that I have a zip file with over 20 text files in the main folder and sub-folders, which mainly follow two different scenarios: "test" and "train". Hence, I decided to first read ALL of the txt files into R, create two lists of df names matching either the "test" or the "train" pattern, and use those lists to merge the dataframes into two main dataframes. Now I need to combine those dataframes according to the names in the list, but rbind just creates another list of their names. How do I make rbind treat the inputs as the objects from the name list, not as strings?
Moreover, rbind would combine the dfs without an opportunity to add their names as a variable. Maybe there is a solution that lets me simultaneously combine the dfs and add each df name as a column?
What I did so far:
#loading the necessary libraries
library(dplyr)
library(readr)
library(easycsv)
#setting url and directory of the data file
url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip"
destination <- "accelerometer_data.zip"
#downloading the file and storing it into computer memory
download.file(url, destfile = destination)
#read all txt files into R
test_folder <- easycsv::fread_zip(file = destination, extension = "TXT")
#create a list of "test" data frames
list_test <- as.list(
  do.call(cbind, ls(
    grep(pattern = "^UCI+(.*)test",
         x = ls(),
         value = TRUE)
  ))
)
#bind dfs as named in list_test
test_df <- lapply(list_test, FUN = function(x) {
  rbind(
    eval(
      parse(text = x)
    )
  )
})
You can use mget to get all the objects whose names match a specific pattern in a list, then use dplyr::bind_rows to combine them into one dataframe, and use the .id parameter to include the source name as a separate column.
library(dplyr)
test_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)test", x = ls(),
value = TRUE)), .id = 'filename')
train_data <- bind_rows(mget(grep(pattern = "^UCI+(.*)train", x = ls(),
value = TRUE)), .id = 'filename')
However, the 'test' and 'train' files have dataframes with different numbers of columns, hence you get certain columns with only NAs for some files. Maybe you need to update the pattern and make it more strict?
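A self-contained sketch of the mget() + bind_rows() pattern (the object names are toys that match the "^UCI+(.*)test" pattern; the columns are made up):

```r
library(dplyr)

# toy data frames standing in for the loaded text files;
# their names match the grep pattern, their column sets differ
UCI_HAR_test_x <- data.frame(v1 = 1:2)
UCI_HAR_test_y <- data.frame(v1 = 3, v2 = "z")

# mget() maps the matching names to the objects themselves (not strings),
# and .id records which object each row came from
test_data <- bind_rows(
  mget(grep(pattern = "^UCI+(.*)test", x = ls(), value = TRUE)),
  .id = "filename"
)
```

This is exactly what the eval(parse(...)) attempt was reaching for: mget resolves names to objects in one step.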

RBinding Multiple dfs from Excel Files

I am currently working on combining my data from multiple excel files into one df. The problem is, the number of columns differs across the files (due to different experiment versions), so I need to bind only certain columns/variables from each file (they have the same names).
I tried doing this "manually" at first, using:
library(openxlsx)
PWI <- read.xlsx("/Users/myname/Desktop/PrelimPWI/PWI_1_V1A.xlsx", colNames = TRUE, startRow = 2)
Slim_1 <- data.frame(PWI$Subject, PWI$Block, PWI$Category, PWI$Trial,PWI$prompt1.RT)
#read in and pull out variables of interest for one subject
mergedFullData = merge(mergedDataA, mergedDataB)
#add two together, then add the third to the merged file, add 4th to that merged file, etc
Obviously, it seems like there's a simpler way to combine the files. I've been working on using:
library(openxlsx)
path <- "/Users/myname/Desktop/PrelimPWI"
merge_file_name <- "/Users/myname/Desktop/PrelimPWI/merge_file_name.xlsx"
filenames_list <- list.files(path= path, full.names=TRUE)
All <- lapply(filenames_list,function(merge_file_name$Subject){
print(paste("Merging",merge_file_name,sep = " "))
read.xlsx(merge_file_name, colNames=TRUE, startRow = 2)
})
PWI <- do.call(rbind.data.frame, All)
write.xlsx(PWI,merge_file_name)
However, I keep getting the error that the number of columns doesn't match, and I'm not sure where to pull out the specific variables I need (the ones listed in the earlier code). Any other tweaks I've tried have resulted in only the first file being written into the xlsx, or a completely blank df. Any help would be greatly appreciated!
library(tidyverse)
df1 <- tibble(
  a = c(1, 2, 3),
  x = c(4, 5, 6)
)
df2 <- tibble(
  x = c(7, 8, 9),
  y = c("d", "e", "f")
)
bind_rows(df1, df2)
The bind functions from dplyr should be able to help you. They can bind dataframes together by row or column, and are flexible about different column names.
You can then select the actual columns you want to keep.
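A minimal sketch of that two-step flow (Subject, Trial and prompt1.RT come from the question's earlier code; the extra columns are made up to stand in for the version differences):

```r
library(dplyr)

# two "experiment versions" whose extra columns differ
v1 <- tibble(Subject = 1, Trial = 1:2, prompt1.RT = c(400, 410), extra_v1 = "x")
v2 <- tibble(Subject = 2, Trial = 1:2, prompt1.RT = c(395, 405), other_col = 9)

# bind_rows() matches columns by name and fills gaps with NA;
# select() then keeps only the shared variables of interest
PWI <- bind_rows(v1, v2) %>%
  select(Subject, Trial, prompt1.RT)
```

Unlike do.call(rbind, ...), this never errors on a column-count mismatch, and the select() afterwards drops the version-specific leftovers.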

How to clean multiple excel files in one time in R?

I have more than one hundred excel files that need cleaning, all with the same data structure. The code listed below is what I use to clean a single excel file. The file names all follow a structure like 'abcdefg.xlsx'.
library('readxl')
df <- read_excel('abc.xlsx', sheet = 'EQuote')
# get the project name
project_name <- df[1,2]
project_name <- gsub(".*:","",project_name)
project_name <- gsub(".* ","",project_name)
# select the needed columns
df <- df[,c(3,4,5,8,16,17,18,19)]
# rename columns
colnames(df)[colnames(df) == 'X__2'] <- 'Product_Models'
colnames(df)[colnames(df) == 'X__3'] <- 'Qty'
colnames(df)[colnames(df) == 'X__4'] <- 'List_Price'
colnames(df)[colnames(df) == 'X__7'] <- 'Net_Price'
colnames(df)[colnames(df) == 'X__15'] <- 'Product_Code'
colnames(df)[colnames(df) == 'X__16'] <- 'Product_Series'
colnames(df)[colnames(df) == 'X__17'] <- 'Product_Group'
colnames(df)[colnames(df) == 'X__18'] <- 'Cat'
# add new column named 'Project_Name', and set value to it
df$project_name <- project_name
# extract rows between two specific characters
begin <- which(df$Product_Models == 'SKU')
end <- which(df$Product_Models == 'Sub Total:')
## set the loop
in_between <- function(df, start, end){
return(df[start:end,])
}
dividers = which(df$Product_Models %in% 'SKU' == TRUE)
df <- lapply(1:(length(dividers) - 1), function(x)
  in_between(df, start = dividers[x], end = dividers[x + 1]))
df <-do.call(rbind, df)
# remove the rows
df <- df[!(df$Product_Models %in% c("SKU","Sub Total:")), ]
# remove rows with NA
df <- df[complete.cases(df),]
# remove part of string after '.'
NeededString <- df$Product_Models
NeededString <- gsub("\\..*", "", NeededString)
df$Product_Models <- NeededString
Then I can get a well-structured dataframe.
Can you guys help me write code that can clean all the excel files at once, so I do not need to run this code a hundred times, and then aggregate all the files into one big csv file?
You can use lapply (base R) or map (purrr package) to read and process all of the files with a single set of commands. lapply and map iterate over a vector or list (in this case a list or vector of file names), applying the same code to each element of the vector or list.
For example, in the code below, which uses map (map_df actually, which returns a single data frame, rather than a list of separate data frames), file_names is a vector of file names (or file paths + names, if the files aren't in the working directory). ...all processing steps... is all of the code in your question to process df into the form you desire:
library(tidyverse) # Loads several tidyverse packages, including purrr and dplyr
library(readxl)
single_data_frame = map_df(file_names, function(file) {
  df = read_excel(file, sheet = "EQuote")
  # ... all processing steps ...
  df
})
Now you have a single large data frame, generated from all of your Excel files. You can now save it as a csv file with, for example, write_csv(single_data_frame, "One_large_data_frame.csv").
There are probably other things you can do to simplify your code. For example, to rename the columns of df, you can use the recode function (from dplyr). We demonstrate this below by first changing the names of the built-in mtcars data frame to be similar to the names in your data. Then we use recode to change a few of the names:
# Rename mtcars data frame
set.seed(2)
names(mtcars) = paste0("X__", sample(1:11))
# Look at data frame
head(mtcars)
# Recode three of the column names
names(mtcars) = recode(names(mtcars),
                       X__1 = "New.1",
                       X__5 = "New.5",
                       X__9 = "New.9")
Or, if the order of the names is always the same, you can do (using your data structure):
names(df) = c('Product_Models','Qty','List_Price','Net_Price','Product_Code','Product_Series','Product_Group','Cat')
Alternatively, if your Excel files have column names, you can use the skip argument of read_excel to skip to the header row before reading in the data. That way, you'll get the correct column names directly from the Excel file. Since it looks like you also need to get the project name from the first few rows, you can read just those rows first with a separate call to read_excel and use the range argument, and/or the n_max argument to get only the relevant rows or cells for the project name.
