Reading multiple Files with different columns and different row lengths - r

I have a number of .tsv files. Unfortunately, they differ in two ways: they have a different number of rows (I want to rbind to deal with this) and some have an extra column (which I want to exclude on import). I also want to remove "_raw" from the file names and insert the result into a column.
My starting point has been:
filenames <- dir_ls("Data/", regexp = "raw")
names <- filenames %>%
path_file() %>%
path_ext_remove()
data_raw <- map(filenames, read_tsv) %>%
set_names(names)

library(dplyr)
library(readr) # provides read_tsv()
# Empty list to hold results
ll <- list()
# Target source files (the pattern is a regular expression, so escape the dot)
fs <- list.files(pattern = "\\.tsv$")
for(f in fs){
# Read from file
temp <- read_tsv(f)
# Save modified filename as field value
temp$file <- sub(pattern = "_raw.tsv", replacement = "", x = f, fixed = TRUE)
# Save within a list
ll[[f]] <- temp
}
# Compile list elements into a two dimensional object (matrix or dataframe)
# Using bind_rows, you don't need to worry about all columns
# matching among input datasets.
do.call(dplyr::bind_rows, ll)
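The question also asks to drop the extra column on import, which the loop above does not do. Here is a minimal sketch of one way to handle it, assuming the "extra" column is simply any column that is not shared by every file. It uses base R's read.delim() on throwaway files in a temporary directory; the file names and columns are made up for illustration.

```r
# Hypothetical sample data: two TSV files, the second with an extra column
tmp <- file.path(tempdir(), "tsv_demo")
dir.create(tmp, showWarnings = FALSE)
write.table(data.frame(id = 1:2, value = c(10, 20)),
            file.path(tmp, "a_raw.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(id = 3:5, value = c(30, 40, 50), extra = c("x", "y", "z")),
            file.path(tmp, "b_raw.tsv"),
            sep = "\t", row.names = FALSE, quote = FALSE)

files <- list.files(tmp, pattern = "_raw\\.tsv$", full.names = TRUE)
tables <- lapply(files, read.delim, stringsAsFactors = FALSE)

# Columns present in every file; anything else is treated as "extra"
shared <- Reduce(intersect, lapply(tables, names))

# Keep the shared columns and record the cleaned file name in a new column
tables <- Map(function(df, f) {
  df <- df[, shared, drop = FALSE]
  df$file <- sub("_raw\\.tsv$", "", basename(f))
  df
}, tables, files)

# Stack the tables; row counts may differ freely
combined <- do.call(rbind, tables)
rownames(combined) <- NULL
combined
```

If the unwanted column has a known name, selecting columns by name after reading would be a simpler alternative to the intersect step.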

Related

Cannot combine files in list of files when opening multiple .dta files [duplicate]

I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts : 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the parts corresponding to the two letters and the four digits change across the .dta file.
I have written code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
for(i in c(list.name)){
obj_name <- paste(i, '2015y', sep='')
file_name <- file.path('E:/Folder',paste(obj_name,'dta', sep ='.'))
input <- read.dta13(file_name)
assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind (at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, ee2015y, es2015y, fi2015y,
fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
nl2015y, no2015y, pl2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column the values of which are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
create a list of file names
file.list <- list.files(path="E:/Folder", pattern="\\.dta$", full.names = TRUE)
determine which files to read (NOTE: check that these substr positions are correct for your path; they are an estimate on my side)
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
read the files
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
remove the path from the list names
names(df.list) <- basename(names(df.list))
bind the dataframes together in one data.frame/data.table and create an id-column as well
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id-column that identifies the different original files.
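The substr() positions used above depend on the length of the folder path. A small sketch of a path-independent alternative, assuming the "two letters, four digits, y.dta" naming described in the question. The file paths below are invented and nothing is read from disk.

```r
# Hypothetical file paths following the country/year naming scheme
paths <- c("E:/Folder/de2014y.dta", "E:/Folder/de2015y.dta",
           "E:/Folder/fr2015y.dta", "E:/Folder/fr2008y.dta")

# Work on the bare file name so the folder depth does not matter
base <- basename(paths)
name_pattern <- "^([a-z]{2})([0-9]{4})y\\.dta$"
country <- sub(name_pattern, "\\1", base)
year <- as.integer(sub(name_pattern, "\\2", base))

# Keep only the files from the most recent year, named by country code
is_latest <- year == max(year)
latest <- setNames(paths[is_latest], country[is_latest])
latest
```

The named vector could then be handed to lapply() or map_df() together with a reader such as read.dta or read_dta, so that the names end up in the id column.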
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="\\.dta$")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
#(data-frames (vectors) need to have the same row counts (lengths) for this last step to work)
mydatafrm = data.frame(files)

Read in CSV files and Add a Column with File name

Assume you have 2 files as follows.
file_1_october.csv
file_2_november.csv
The files have identical columns. So I want to read both files in R which I can easily do with map. I also want to include in each read file a column month with the name of the file. For instance, for file_1_october.csv, I want a column called “month” that contains the words “file_1_october.csv”.
For reproducibility, assume file_1_october.csv is
name,age,gender
james,24,male
Sue,21,female
While file_2_november.csv is
name,age,gender
Grey,24,male
Juliet,21,female
I want to read both files but in each file include a month column that corresponds to the file name so that we have;
name,age,gender,month
james,24,male, file_1_october.csv
Sue,21,female, file_1_october.csv
AND
name,age,gender,month,
Grey,24,male, file_2_november.csv,
Juliet,21,female, file_2_november.csv
Maybe something like this?
library(dplyr) # for %>% and mutate()
csvlist <- c("file_1_october.csv", "file_2_november.csv")
df_list <- lapply(csvlist, function(x) read.csv(x) %>% mutate(month = x))
for (i in seq_along(df_list)) {
assign(paste0("df", i), df_list[[i]])
}
The two dataframes will be saved in df1 and df2.
Here's a (mostly) tidyverse alternative that avoids looping:
library(tidyverse)
csv_names <- list.files(path = "path/", # set the path to your folder with csv files
pattern = "\\.csv$", # select all csv files in the folder (regular expression)
full.names = T) # output full file names (with path)
# csv_names <- c("file_1_october.csv", "file_2_november.csv")
csv_names2 <- data.frame(month = csv_names,
id = as.character(1:length(csv_names))) # id for joining
data <- csv_names %>%
lapply(read_csv) %>% # read all the files at once
bind_rows(.id = "id") %>% # bind all tables into one object, and give id for each
left_join(csv_names2, by = "id") # join month column created earlier
This gives a single data object with data from all the CSVs together. In case you need them separately, you can omit the bind_rows() step, giving you a list of multiple tables ("tibbles"). These can then be split using list2env() or some split() function.
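As a rough illustration of that last point, here is one way a combined table could be split back into one data frame per file with split() and list2env(). The data frame is built inline from the example rows in the question, and the resulting object names (october, november) are my own choice.

```r
# Combined data as it might look after binding, with the file name in `month`
combined <- data.frame(
  name = c("james", "Sue", "Grey", "Juliet"),
  age = c(24, 21, 24, 21),
  gender = c("male", "female", "male", "female"),
  month = rep(c("file_1_october.csv", "file_2_november.csv"), each = 2),
  stringsAsFactors = FALSE
)

# One data frame per source file
by_file <- split(combined, combined$month)

# Turn "file_1_october.csv" into a syntactic object name such as "october"
names(by_file) <- sub("^file_[0-9]+_(.*)\\.csv$", "\\1", names(by_file))

# Create one object per list element in a dedicated environment
env <- new.env()
list2env(by_file, envir = env)
ls(env)
```

Using a separate environment keeps the generated objects out of the global workspace; passing envir = globalenv() instead would create them there.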

How should I approach merging (full joining) multiple (>100) CSV files with a common key but inconsistent number of rows?

Before I dive into the question, here is a similar problem asked but there is not yet a solution.
So, I am working in R, and there is a folder in my working directory called columns that contains 198 similar .csv files with the name format of a 6-digit integer (e.g. 100000) that increases inconsistently (since the name of those files are actually names for each variable).
Now, I would like to full join them, but first I have to import all of those files into R. Naturally, I thought about using a list to contain those files and then using a loop to join them. This is the code I tried to use:
#These are the first 3 columns containing identifiers
matrix_starter <- read_csv("files/matrix_starter.csv")
## import_multiple_csv_files_to_R
# Purpose: Import multiple csv files to the Global Environment in R
# set working directory
setwd("columns")
# list all csv files from the current directory
list.files(pattern=".csv$") # use the pattern argument to define a common pattern for import files with regex. Here: .csv
# create a list from these files
list.filenames <- list.files(pattern=".csv$")
#list.filenames
# create an empty list that will serve as a container to receive the incoming files
list.data <- list()
# create a loop to read in your data
for (i in 1:length(list.filenames))
{
list.data[[i]] <- read.csv(list.filenames[i])
list.data[[i]] <- list.data[[i]] %>%
select(`Occupation.Title`,`X2018.Employment`) %>%
rename(`Occupation title` = `Occupation.Title`) #%>%
#rename(list.filenames[i] = `X2018.Employment`)
}
# add the names of your data to the list
names(list.data) <- list.filenames
# now you can index one of your tables like this
list.data$`113300.csv`
# or this
list.data[1]
# source: https://www.edureka.co/community/1902/how-can-i-import-multiple-csv-files-into-r
The chunk above solves the importing part. Now I have a list of .csv files. Next, I would like to join them:
for (i in 1:length(list.filenames)){
matrix_starter <- matrix_starter %>% full_join(list.data[[i]], by = "Occupation title")
}
However, this does not work nicely. I end up with somewhere around 47,000 rows, to which I only expect around 1700 rows. Please let me know your opinion.
Reading the files into R as a list and including the file name as a column can be done like this:
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
function(x) {
data <- read_xls(
x,
sheet = 1
)
data$File_name <- x
data
})
I am assuming now that all your excel files have the same structure: the same columns and column types.
If that is the case you can use dplyr::bind_rows to create one combined data frame.
You could of course loop through the list and left_join the list elements, e.g. by using Reduce and merge.
Update based on mihndang's comment. Is this what you are after when you say: Is there a way to use the file name to name the column and also not include the columns of file names?
library(dplyr)
library(stringr)
path <- "./files"
files <- list.files(path = path,
full.names = TRUE,
all.files = FALSE)
files <- files[!file.info(files)$isdir]
data <- lapply(files,
function(x) {
read.csv(x, stringsAsFactors = FALSE)
})
col1 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Values")
col2 <- paste0(str_sub(basename(files[1]), start = 1, end = -5), ": Character")
df1 <- data[[1]] %>%
rename(!!col1 := Value,
!!col2 := Character)
I created two simple .csv files in ./files: file1.csv and file2.csv. I read them into a list. I extract the first list element (the DF) and work out column names in a variable. I then rename the columns in the DF by passing the two variables to them. The column name includes the file name.
Result:
> View(df1)
> df1
file1: Values file1: Character
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
I guess you are looking for:
result <- Reduce(function(x, y) merge(x, y, by = "Occupation title", all = TRUE), list.data)
which can be done using purrr's reduce as well:
result <- purrr::reduce(list.data, dplyr::full_join, by = "Occupation title")
A full join keeps every row from both tables, and when a key value is duplicated it produces every combination of the matching rows, which is why the row count can grow far beyond what you expect. If you are looking for unique records, you might want to use a left join instead: keep the data frame whose rows you want to use as the reference on the left, and the file you want to join on the right.
Hope this helps.
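A likely reason for ending up with far more rows than expected is duplicated key values, since each duplicate on one side is paired with every match on the other. A toy sketch with made-up data (the key column name mirrors the question, the values do not):

```r
# Two small tables; "Nurses" appears twice in each
a <- data.frame(`Occupation title` = c("Nurses", "Nurses", "Pilots"),
                x = 1:3, check.names = FALSE)
b <- data.frame(`Occupation title` = c("Nurses", "Nurses", "Chefs"),
                y = 4:6, check.names = FALSE)

# Full join: the duplicated "Nurses" rows are combined 2 x 2
joined <- merge(a, b, by = "Occupation title", all = TRUE)
nrow(joined)

# Dropping duplicated keys first keeps one row per occupation
dedup <- function(df) {
  df[!duplicated(df[["Occupation title"]]), , drop = FALSE]
}
tables <- list(a, b)
clean <- Reduce(
  function(x, y) merge(x, y, by = "Occupation title", all = TRUE),
  lapply(tables, dedup)
)
nrow(clean)
```

Whether dropping duplicates is appropriate depends on the data; if the repeated titles are meaningful, a second key column would be needed in the join instead.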

Select CSV files and read in pairs

I am comparing pairs of CSV files, one pair at a time. Each file name ends with a number, e.g. cars_file2.csv, Lorries_file3.csv, computers_file4.csv, phones_file5.csv. I have about 70 files per folder, and I compare cars_file2.csv with Lorries_file3.csv, then Lorries_file3.csv with computers_file4.csv, so the pattern is 2,3 then 3,4 then 4,5 and so on. Is there a smart way to handle this instead of manually editing the file names each time, as I do in the code below? Could I use the trailing number on each CSV to read them in order? NOTE: all the files share the same suffix pattern _file:
library(daff)
setwd("path")
# Load csvs to compare into data frames
x_original <- read.csv("cars_file2.csv", strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv("Lorries_file3.csv", strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
My intention is to compare each pair of CSV files and record field additions, deletions and modifications.
You may want to load all files at once and do your comparison with a full list of files.
This may help:
# your path
path <- "insert your path"
# get folders in this path
dir_data <- as.list(list.dirs(path))
# get all filenames
dir_data <- lapply(dir_data,function(x){
# list of folders
files <- list.files(x)
files <- paste(x,files,sep="/")
# only .csv files
files <- files[substring(files,nchar(files)-3,nchar(files)) %in% ".csv"]
# remove possible errors
files <- files[!is.na(files)]
# save if there are files
if(length(files) >= 1){
return(files)
}
})
# delete NULL-values
dir_data <- purrr::compact(dir_data)
# make it a named vector
dir_data <- unique(unlist(dir_data))
names(dir_data) <- sub(pattern = "(.*)\\..*$", replacement = "\\1", basename(dir_data))
names(dir_data) <- as.numeric(substring(names(dir_data),nchar(names(dir_data)),nchar(names(dir_data))))
# remove possible NULL-values
dir_data <- dir_data[!is.na(names(dir_data))]
# make it a list again
dir_data <- as.list(dir_data)
# load data
data_upload <- lapply(dir_data,function(x){
if(file.exists(x)){
data <- read.csv(x,header=T,sep=";")
}else{
data <- "file not found"
}
return(data)
})
# setup for comparison
diffs <- lapply(as.character(sort(as.numeric(names(data_upload)))),function(x){
# check if the second dataset exists
if(as.character(as.numeric(x)+1) %in% names(data_upload)){
# first dataset
print(data_upload[[x]])
# second dataset
print(data_upload[[as.character(as.numeric(x)+1)]])
# do your operations here
comparison <- render(diff_data(data_upload[[x]],
data_upload[[as.character(as.numeric(x)+1)]],
ignore_whitespace=T,count_like_a_spreadsheet = F))
numbers <- c(x, as.numeric(x)+1)
# save both the comparison data and the numbers of the datasets
return(list(comparison,numbers))
}
})
# you can find the differences here
diffs
This script loads all CSV files in a folder and its sub-folders and puts them into a list named by their trailing numbers. It works as long as no two files share the same number. If there are duplicates, you will have to adjust the part where the vector is named so that you can index the files by their full names afterwards.
A simple for loop using paste0 will read in the pairs:
for (i in 1:70) { # assuming the last pair is cars_file70.csv and Lorries_file71.csv
x_original <- read.csv(paste0("cars_file",i,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
x_changed <- read.csv(paste0("Lorries_file",i+1,".csv"), strip.white=TRUE, stringsAsFactors = FALSE)
render(diff_data(x_original,x_changed ,ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE))
}
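Since the prefixes in the question differ from file to file, another option is to ignore them and pair the files by their trailing number. A minimal sketch, assuming every name ends in _file<number>.csv. The names are the ones from the question and nothing is read from disk.

```r
# File names from the question, deliberately out of order
files <- c("computers_file4.csv", "cars_file2.csv",
           "phones_file5.csv", "Lorries_file3.csv")

# Extract the trailing number and sort the files by it
num <- as.integer(sub("^.*_file([0-9]+)\\.csv$", "\\1", files))
files <- files[order(num)]

# Pair each file with the next one in numeric order
pairs <- data.frame(original = head(files, -1),
                    changed = tail(files, -1),
                    stringsAsFactors = FALSE)
pairs
```

Each row of pairs could then be fed to read.csv() and diff_data() in a loop, in place of the hard-coded file names.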
For simplicity I used 2 .csv files.
csv_1
1,2,4
csv_2
1,8,10
Load all the .csv files from folder,
files <- dir("Your folder path", pattern = '\\.csv', full.names = TRUE)
tables <- lapply(files, read.csv)
#create empty list to store comparison output
diff <- list()
Loop through all loaded files and compare,
for (pos in 1:length(tables)) {
if (pos != length(tables)) { #ignore last one
#save comparison output
diff[[pos]] <- diff_data(as.data.frame(tables[pos]), as.data.frame(tables[pos + 1]), ignore_whitespace=TRUE,count_like_a_spreadsheet = FALSE)
}
}
Compared output by diff
[[1]]
Daff Comparison: ‘as.data.frame(tables[pos])’ vs. ‘as.data.frame(tables[pos + 1])’
+++ +++ --- ---
## X1 X8 X10 X2 X4

How to apply a function to every possible pairwise combination of files stored in a common directory

I have a directory containing a large number of csv files. I would like to load the data into R and apply a function to every possible pair combination of csv files in the directory, then write the output to file.
The function that I would like to apply is matchpt() from the Biobase package, which compares locations between two data frames.
Here is an example of what I would like to do (although I have many more files than this):
Three files in directory: A, B and C
Perform matchpt on each pairwise combination:
nn1 = matchpt(A,B)
nn2 = matchpt(A,C)
nn3 = matchpt(B,C)
Write nn1, nn2 and nn3 to csv file.
I have not been able to find any solutions for this yet and would appreciate any suggestions. I am really not sure where to go from here but I am assuming that some sort of nested for loop is required to somehow cycle sequentially through all pairwise combinations of files. Below is a beginning at something but this only compares the first file with all the others in the directory so does not work!
library("Biobase")
# create two lists of identical filenames stored in the directory:
filenames1 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
filenames2 = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
for(i in 1:length(filenames2)){
# load the first data frame in list 1
df1 <- lapply(filenames1[1], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 <- data.frame(df1)
# load a second data frame from list 2
df2 <- lapply(filenames2[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 <- data.frame(df2)
# isolate the relevant columns from within the two data frames
dat1 <- as.matrix(df1[, c("lat", "long")])
dat2 <- as.matrix(df2[, c("lat", "long")])
# run the matchpt function on the two data frames
nn <- matchpt(dat1, dat2)
#Extract the unique id code in the two filenames (for naming the output file)
file1 = filenames1[1]
code1 = strsplit(file1,"_")[[1]][1]
file2 = filenames2[i]
code2 = strsplit(file2,"_")[[1]][1]
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_nn.csv", sep="")
write.csv(nn, file=outfile, row.names=FALSE)
}
Any suggestions on how to solve this problem would be greatly appreciated. Many thanks!
You could do something like:
out <- combn( list.files(), 2, FUN=matchpt )
write.table( do.call( rbind, out ), file='output.csv', sep=',' )
This assumes that matchpt is expecting 2 strings with the names of the files and that the result is the same structure each time so that the rbinding makes sense.
You could also write your own function to pass to combn that takes the 2 file names, runs matchpt and then appends the results to the csv file. Remember that if you pass an open filehandle to write.table then it will append to the file instead of overwriting what is there.
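One detail worth keeping in mind when writing such a function: combn() hands each pair to FUN as a single character vector of length 2, not as two separate arguments. Below is a minimal sketch of a wrapper along those lines. The sample files, the output folder and the compare() stand-in are all hypothetical; in practice compare() would be replaced by the real comparison, such as matchpt().

```r
# Hypothetical sample files in a temporary directory
tmp <- file.path(tempdir(), "pairs_demo")
out_dir <- file.path(tmp, "out")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
for (id in c("A", "B", "C")) {
  write.csv(data.frame(lat = runif(3), long = runif(3)),
            file.path(tmp, paste0(id, "_points.csv")), row.names = FALSE)
}
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Stand-in for the real comparison function; returns a small data frame
compare <- function(m1, m2) {
  data.frame(n1 = nrow(m1), n2 = nrow(m2))
}

# Wrapper for combn(): `pair` is ONE character vector holding two file paths
run_pair <- function(pair) {
  m1 <- as.matrix(read.csv(pair[1])[, c("lat", "long")])
  m2 <- as.matrix(read.csv(pair[2])[, c("lat", "long")])
  out <- compare(m1, m2)
  # Name the output after the id codes that precede the first underscore
  codes <- sub("_.*$", "", basename(pair))
  outfile <- file.path(out_dir, paste0(codes[1], "_", codes[2], "_nn.csv"))
  write.csv(out, outfile, row.names = FALSE)
  outfile
}

# Each unordered pair is visited once; simplify = FALSE returns a list
written <- combn(files, 2, FUN = run_pair, simplify = FALSE)
```

Writing the results to a separate folder keeps them from being picked up as inputs if the script is run again.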
Try this example:
#dummy filenames
filenames <- paste0("file_",1:5,".txt")
#loop through unique combination
for(i in 1:(length(filenames)-1))
for(j in (i+1):length(filenames))
{
flush.console()
print(paste("i=",i,"j=",j,"|","file1=",filenames[i],"file2=",filenames[j]))
}
In response to my question I seem to have found a solution. The below uses a for loop to perform every pairwise combination of files in a common directory (this seems to work and gives EVERY combination of files i.e. A & B and B & A):
# create a list of filenames
filenames = list.files(path=dir, pattern="csv$", full.names=FALSE, recursive=FALSE)
# For loop to compare the files
for(i in 1:length(filenames)){
# load the first data frame in the list
df1 = lapply(filenames[i], read.csv, header=TRUE, stringsAsFactors=FALSE)
df1 = data.frame(df1)
file1 = filenames[i]
code1 = strsplit(file1,"_")[[1]][1] # extract unique id code of file (in case where the id comes before an underscore)
# isolate the columns of interest within the first data frame
d1 <- as.matrix(df1[, c("lat_UTM", "long_UTM")])
# load the comparison file
for (j in 1:length(filenames)){
# load the second data frame in the list
df2 = lapply(filenames[j], read.csv, header=TRUE, stringsAsFactors=FALSE)
df2 = data.frame(df2)
file2 = filenames[j]
code2 = strsplit(file2,"_")[[1]][1] # extract unique id code of file 2
# isolate the columns of interest within the second data frame
d2 <- as.matrix(df2[, c("lat_UTM", "long_UTM")])
# run the comparison function on the two data frames (in this case matchpt)
out <- matchpt(d1, d2)
# Merge the unique id code in the two filenames (for naming the output file)
outname = paste(code1, code2, sep="_")
outfile = paste(outname, "_out.csv", sep="")
# write the result to file
write.csv(out, file=outfile, row.names=FALSE)
}
}
