Read in CSV files and Add a Column with File name - r

Assume you have 2 files as follows.
file_1_october.csv
file_2_november.csv
The files have identical columns. So I want to read both files in R which I can easily do with map. I also want to include in each read file a column month with the name of the file. For instance, for file_1_october.csv, I want a column called “month” that contains the words “file_1_october.csv”.
For reproducibility, assume file_1_october.csv is
name,age,gender
james,24,male
Sue,21,female
While file_2_november.csv is
name,age,gender
Grey,24,male
Juliet,21,female
I want to read both files but in each file include a month column that corresponds to the file name, so that we have:
name,age,gender,month
james,24,male,file_1_october.csv
Sue,21,female,file_1_october.csv
and
name,age,gender,month
Grey,24,male,file_2_november.csv
Juliet,21,female,file_2_november.csv

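Maybe something like this?
library(dplyr) # provides %>% and mutate()
csvlist <- c("file_1_october.csv", "file_2_november.csv")
# read each file and store its name in a new column called month
df_list <- lapply(csvlist, function(x) read.csv(x) %>% mutate(month = x))
# copy each list element into its own variable: df1, df2, ...
for (i in seq_along(df_list)) {
  assign(paste0("df", i), df_list[[i]])
}
The two data frames will be saved in df1 and df2.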
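To check the idea against the sample data from the question, here is a minimal base R sketch. It assumes the two files are written to a temporary folder first; `tmp_dir` and the helper `read_with_month()` are illustrative names, not part of the answer above.

```r
# Write the two sample files from the question to a temporary folder
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)
writeLines(c("name,age,gender", "james,24,male", "Sue,21,female"),
           file.path(tmp_dir, "file_1_october.csv"))
writeLines(c("name,age,gender", "Grey,24,male", "Juliet,21,female"),
           file.path(tmp_dir, "file_2_november.csv"))

# Hypothetical helper: read one CSV and record its file name in `month`
read_with_month <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  df$month <- basename(path)
  df
}

csv_paths <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(csv_paths, read_with_month)
names(df_list) <- basename(csv_paths)

# Look at one of the results
df_list[["file_1_october.csv"]]
```

Keeping the data frames in a named list, rather than assigning df1, df2, ..., makes it easy to loop over them later.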

Here's a (mostly) tidyverse alternative that avoids an explicit loop:
library(tidyverse)
csv_names <- list.files(path = "path/",      # set the path to your folder with csv files
                        pattern = "\\.csv$", # regular expression: file names ending in .csv
                        full.names = TRUE)   # output full file names (with path)
# csv_names <- c("file_1_october.csv", "file_2_november.csv")
csv_names2 <- data.frame(month = csv_names,
                         id = as.character(seq_along(csv_names))) # id for joining
data <- csv_names %>%
  lapply(read_csv) %>%             # read all the files at once
  bind_rows(.id = "id") %>%        # bind all tables into one object, and give id for each
  left_join(csv_names2, by = "id") # join month column created earlier
This gives a single data object with data from all the CSVs together. If you need them separately, omit the bind_rows() step, which leaves you with a list of tables ("tibbles"). The list elements can then be assigned to separate objects with list2env() (which needs a named list), or the combined table can be divided again with split().
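As a rough sketch of the "separate tables" route, the snippet below names each list element after its file and copies the elements into their own variables with list2env(). The small data frames stand in for the tables read from disk, and the environment `separate` is an assumption made to keep the example from cluttering the workspace.

```r
# Illustrative stand-in for the list of tables you get without bind_rows()
tables <- list(
  data.frame(name = c("james", "Sue"), age = c(24, 21)),
  data.frame(name = c("Grey", "Juliet"), age = c(24, 21))
)
csv_names <- c("path/file_1_october.csv", "path/file_2_november.csv")

# Name each element after its file (without folder or extension)
names(tables) <- tools::file_path_sans_ext(basename(csv_names))

# Add the file name as a column to each table
tables <- Map(function(df, nm) {
  df$month <- nm
  df
}, tables, names(tables))

# Copy each element into its own variable in a dedicated environment
# (use envir = globalenv() to create them in the workspace instead)
separate <- new.env()
list2env(tables, envir = separate)
ls(separate)
```

list2env() uses the list names as variable names, which is why the names are set before the call.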

Related

R: how to read multiple csv files with column name in row n and select certain columns from the file and add file name to the file as a new column?

I have 100 csv files in the same folder, let's say the path="D:\Data".
For each file I want to:
Step 1. read the file from row 12 since the column names are at row 12;
Step 2. select certain columns from the file, let's say the colname I want to keep
are "Date","Time","Value";
Step 3. add the file name to the file as a new column, for example, I want to
save file1 of which name is "example 1.csv" as file1$Name="example 1.csv",
and similarly, save file2 of which name is "example 2.csv" as
file2$Name="example 2.csv", etc...
So far we got 100 new files with 4 columns "Date","Time","Value","Name". Then finally rbind all the 100 new files together.
I have no idea how to code these steps all together in R. So anyone can help? Thanks very much for your time.
Update
Because of the complicated structure of my data, the sample code in the answers always returned errors. The ideas behind the code were correct, but I could only solve the problem with the code below. I believe there is a more elegant way to write this without a loop.
# set up working directory
setwd("D:/Data")
library(data.table)
files <- list.files(path = "D:/Data", pattern = "\\.csv$")
# read each file and keep the results as a list of data frames in temp
temp <- lapply(files, read.csv, header = TRUE, skip = 11, sep = "\t", fileEncoding = "utf-16")
length(temp) # the number of files is 112
## select columns "Date","Time","Value" from each file,
## attach the file name as a new column,
## and finally row bind all the files together
temp2 <- NULL
for (i in seq_along(temp)) {
  dd <- cbind(File = files[i], temp[[i]][, c("Date", "Time", "Value")])
  temp2 <- rbind(temp2, dd)
}
You can do this very neatly with vroom. It can take a list of files as an argument rather than having to do each separately, and add the filename column itself:
library(vroom)
vroom(files, skip = 11, id = 'filename', col_select = c(Date, Time, Value, filename))
You could try something like this:
library(dplyr)
library(purrr)
list_of_files <- list.files(path = "D:/Data/", pattern = "\\.csv$", full.names = TRUE)
list_of_files %>%
  set_names() %>%
  map_dfr(~ readr::read_csv(.x,
                            skip = 11, # the column names are on row 12
                            col_names = TRUE) %>%
            select(Date, Time, Value) %>%
            mutate(Date = as.character(Date)) %>%
            # Alternatively you could use the .id argument in map_dfr for the filename
            mutate(filename = basename(.x)))
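The three steps from the question (start at the header row, keep selected columns, add the file name) can also be sketched in base R without a growing loop. The example below creates its own small comma-separated files with 11 preamble lines so that it runs anywhere; the real files in the question are tab-separated UTF-16, so `sep` and `fileEncoding` would need adjusting. `read_one()` is a hypothetical helper name.

```r
# Create two small example files whose header sits on row 12
# (11 lines of preamble first); real files will differ
demo_dir <- tempfile("skip_demo_")
dir.create(demo_dir)
for (nm in c("example 1.csv", "example 2.csv")) {
  writeLines(c(paste("preamble line", 1:11),
               "Date,Time,Value,Extra",
               "2020-01-01,10:00,1.5,x",
               "2020-01-02,11:00,2.5,y"),
             file.path(demo_dir, nm))
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Hypothetical helper: skip 11 lines, keep three columns, tag with file name
read_one <- function(path) {
  df <- read.csv(path, skip = 11, stringsAsFactors = FALSE)
  df <- df[, c("Date", "Time", "Value")]
  df$Name <- basename(path)
  df
}

# Read every file, then bind all rows in one step
combined <- do.call(rbind, lapply(files, read_one))
```

Binding once at the end avoids repeatedly copying a growing data frame, which matters more as the number of files increases.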

Cannot combine files in list of files when opening multiple .dta files [duplicate]

I have a folder with more than 500 .dta files. I would like to load some of these files into a single R object.
My .dta files have a generic name composed of four parts: 'two letters/four digits/y/.dta'. For instance, a name can be 'de2015y.dta' or 'fr2008y.dta'. Only the two letters and the four digits change across the .dta files.
I have written code that works, but I am not satisfied with it. I would like to avoid using a loop and to shorten it.
My code is:
# Select the .dta files I want to load
#.....................................
name <- list.files(path="E:/Folder") # names of the .dta files in the folder
db <- as.data.frame(name)
db$year <- substr(db$name, 3, 6)
db <- subset (db, year == max(db$year)) # keep last year available
db$country <- substr(db$name, 1, 2)
list.name <- as.list(db$country)
# Loading all the .dta files in the Global environment
#..................................................
library(readstata13) # provides read.dta13()
for (i in list.name) {
  obj_name <- paste(i, '2015y', sep = '')
  file_name <- file.path('E:/Folder', paste(obj_name, 'dta', sep = '.'))
  input <- read.dta13(file_name)
  assign(obj_name, value = input)
}
# Merge the files into a single object
#..................................................
df2015 <- rbind (at2015y, be2015y, bg2015y, ch2015y, cy2015y, cz2015y, dk2015y, ee2015y, ee2015y, es2015y, fi2015y,
fr2015y, gr2015y, hr2015y, hu2015y, ie2015y, is2015y, it2015y, lt2015y, lu2015y, lv2015y, mt2015y,
nl2015y, no2015y, pl2015y, pl2015y, pt2015y, ro2015y, se2015y, si2015y, sk2015y, uk2015y)
Does anyone know how I can avoid using a loop and shorten my code?
You can also use purrr for your task.
First create a named vector of all files you want to load (as I understand your question, you simply need all files from 2015). The setNames() part is only necessary in case you'd like an ID variable in your data frame and it is not already included in the .dta files.
After that, simply use map_df() to read all files and return a data frame. Specifying .id is optional and results in an ID column the values of which are based on the names of in_files.
library(purrr)
library(haven)
in_files <- list.files(path="E:/Folder", pattern = "2015y", full.names = TRUE)
in_files <- setNames(in_files, in_files)
df2015 <- map_df(in_files, read_dta, .id = "id")
The following steps should give you what you want:
Load the foreign package:
library(foreign) # or alternatively: library(haven)
Create a list of file names:
file.list <- list.files(path="E:/Folder", pattern='\\.dta$', full.names = TRUE)
Determine which files to read (NOTE: check that these are the correct positions in substr; 13 to 16 assume that the full name looks like "E:/Folder/de2015y.dta"):
vec <- as.integer(substr(file.list,13,16))
file.list2 <- file.list[vec==max(vec)]
Read the files:
df.list <- sapply(file.list2, read.dta, simplify=FALSE)
Remove the path from the list names:
names(df.list) <- basename(names(df.list))
Bind the data frames together in one data.frame/data.table and create an id column as well:
library(data.table)
df <- rbindlist(df.list, idcol = "id")
# or with 'dplyr'
library(dplyr)
df <- bind_rows(df.list, .id = "id")
Now you have a data.frame with an id column that identifies the different original files.
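Since the character positions passed to substr() depend on the length of the folder path, one way to make that step less fragile is to work on the bare file name. A minimal sketch, assuming the naming pattern from the question; the file names below are made up for illustration:

```r
# Illustrative file names following the pattern 'two letters, four digits, y.dta'
file.list <- c("E:/Folder/de2014y.dta", "E:/Folder/de2015y.dta",
               "E:/Folder/fr2008y.dta", "E:/Folder/fr2015y.dta")

# Work on the bare file name so the positions do not depend on the folder path
base <- basename(file.list)
country <- substr(base, 1, 2)
year <- as.integer(substr(base, 3, 6))

# Keep only the files from the most recent year
latest <- file.list[year == max(year)]
latest
```

The same `country` vector could also serve as the names of the list of data frames, so that the id column shows the country code instead of the full file name.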
I would change the working directory for this task...
Then does this do what you are asking for?
setwd("C:/.../yourfiles")
# get file names where year equals "2015"
name=list.files(pattern="\\.dta$")
name=name[substr(name,3,6)=="2015"]
# read in the files in a list
files=lapply(name,foreign::read.dta)
# remove ".dta" from file names and
# give the file contents in the list their name
names(files)=lapply(name,function(x) substr(x, 1, nchar(x)-4))
#or alternatively
names(files)=as.list(substr(name,1,nchar(name)-4))
# optional: put all file contents into one data-frame
#(data-frames (vectors) need to have the same row counts (lengths) for this last step to work)
mydatafrm = data.frame(files)

Reading multiple Files with different columns and different row lengths

I have a number of .tsv files. Unfortunately, they differ in two ways: they have different numbers of rows (I want to rbind them to deal with this), and some have an extra column (which I want to exclude on import). I also want to remove "_raw" from the file names and insert the result into a column.
My starting point has been:
filenames <- dir_ls("Data/", regexp = "raw")
names <- filenames %>%
path_file() %>%
path_ext_remove()
data_raw <- map(filenames, read_tsv) %>%
set_names(names)
library(dplyr)
library(readr) # provides read_tsv()
# Empty list to hold results
ll <- list()
# Target source files
fs <- list.files(pattern = "\\.tsv$")
for (f in fs) {
  # Read from file
  temp <- read_tsv(f)
  # Save modified filename as field value
  temp$file <- sub(pattern = "_raw\\.tsv$", replacement = "", x = f)
  # Save within a list
  ll[[f]] <- temp
}
# Compile list elements into a single data frame.
# bind_rows matches columns by name and fills missing ones with NA,
# so the columns do not need to match among input datasets.
bind_rows(ll)
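If the extra column should be dropped rather than kept and padded with NA, one option is to keep only the columns that every file shares before binding. This is a base R sketch under that assumption; the two small data frames stand in for the imported files, and their names and contents are made up.

```r
# Illustrative stand-ins for two imported files; the second has an extra column
ll <- list(
  a_raw = data.frame(id = 1:2, value = c(0.1, 0.2)),
  b_raw = data.frame(id = 1:3, value = c(0.3, 0.4, 0.5), extra = c("x", "y", "z"))
)

# Columns present in every table
common <- Reduce(intersect, lapply(ll, names))

# Keep only the shared columns and record the source, minus the "_raw" suffix
trimmed <- Map(function(df, nm) {
  df <- df[, common, drop = FALSE]
  df$file <- sub("_raw$", "", nm)
  df
}, ll, names(ll))

# Stack the trimmed tables; unname() keeps the row names simple
combined <- do.call(rbind, unname(trimmed))
```

This removes any column that is missing from at least one file, so it only fits when the shared columns are the ones you want.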

How can I read multiple csv files into R at once and know which file the data is from? [duplicate]

I want to read multiple csv files into R and combine them into one large table. However, I need a column that identifies which file each row came from.
Basically, every row has a unique identifying number within a file but those numbers are repeated across files. So if I bind all files into a table without knowing which file every row is from I won't have a unique identifier anymore which makes my planned analysis impossible.
What I have so far is this but this doesn't give me what file the data came from.
list_file <- list.files(pattern="*.csv") %>% lapply(read.csv,stringsAsFactors=F)
combo_data <- list.rbind(list_file)
I have about 100 files to read in so I'd really appreciate any help so I don't have to do them all individually.
One way would be to use map_df from purrr to bind all the CSVs into one data frame with a column that identifies the source file.
library(dplyr) # for %>% and mutate()
filenames <- list.files(pattern = "\\.csv$")
# .id stores the position of each file; convert it to a number to look up the name
purrr::map_df(filenames, read.csv, stringsAsFactors = FALSE, .id = 'filename') %>%
  mutate(filename = filenames[as.integer(filename)]) -> combo_data
Also:
combo_data <- purrr::map_df(filenames,
  ~read.csv(.x, stringsAsFactors = FALSE) %>% mutate(filename = .x))
In base R :
combo_data <- do.call(rbind, lapply(filenames, function(x)
cbind(read.csv(x, stringsAsFactors = FALSE), filename = x)))
If you want to stay in base R, you can use
file.names <- list.files(pattern = "\\.csv$")
df.list <- lapply(file.names, function(file.name) {
  df <- read.csv(file.name)
  df$file.name <- file.name
  return(df)
})
# do.call(rbind, ...) is the base R equivalent of rlist::list.rbind()
df <- do.call(rbind, df.list)
As another answer suggested, the tidyverse now makes this easier:
library(readr)
library(purrr)
library(dplyr)
library(stringr)
df <- fs::dir_ls(regexp = "\\.csv$") %>%
  map_dfr(read_csv, id = 'path') %>% # id is passed on to read_csv (readr 2.0.0 or later), which adds the path column
  mutate(thename = str_replace(path, "\\.csv$", "")) %>% # file name without the extension
  select(-path)
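Once the file name is stored as a column, it can be combined with the within-file identifier to get a key that is unique across the whole table again. A small sketch with made-up data; the column names `id` and `row_key` are assumptions for illustration:

```r
# Illustrative combined table: ids repeat across files but not within a file
combo_data <- data.frame(
  id = c(1, 2, 1, 2),
  value = c(10, 20, 30, 40),
  filename = c("a.csv", "a.csv", "b.csv", "b.csv")
)

# The id alone is no longer unique after binding
id_has_duplicates <- anyDuplicated(combo_data$id) > 0

# Together with the file name it identifies each row again
combo_data$row_key <- paste(combo_data$filename, combo_data$id, sep = ":")
key_is_unique <- anyDuplicated(combo_data$row_key) == 0
```

Checking `key_is_unique` after reading the real files is a cheap way to confirm that the file column was filled in correctly.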

Merging multiple html saved as xls in a table in R

I have a folder with multiple html files with .xls extension.
data sample
I need to combine them into a single table.
I have started by reading the files in a folder:
library(rvest)
library(tibble)
file_list <- list.files(pattern = '*.xls')
html_df <- lapply(file_list,function(x)read_html(x))
I do not know how to proceed from here to pull the tables from each file and combine them.
This should work if all the files have the same format as the sample you've uploaded:
library(rvest)
file_list <- list.files(pattern = '*.xls')
data <-
  purrr::map_dfr( # use map_dfr() to combine data frames
    file_list,
    function(x) {
      read_html(x) %>%
        html_node("table") %>% # read the first 'table' node (which is the only one in the sample)
        html_table(fill = TRUE) %>% # fill it because the table is not neat yet
        setNames(.[1, ]) %>% # use the first row to set column names
        .[-c(1, nrow(.)), ] # chop the first row which is the repeated column names and the last row which is the total row
    }
  )

Resources