Merging multiple HTML files saved as .xls into one table in R

I have a folder with multiple html files with .xls extension.
data sample
I need to combine them in a single table:
I have started with reading files in a folder:
library(rvest)
library(tibble)
file_list <- list.files(pattern = '\\.xls$') # pattern is a regular expression, not a glob
html_df <- lapply(file_list,function(x)read_html(x))
I do not know how to proceed from here to pull the table from each file and combine them.

This should work if all the files have the same format as the sample you've uploaded:
library(rvest)
file_list <- list.files(pattern = '\\.xls$') # pattern is a regex, so escape the dot
data <-
  purrr::map_dfr( # use map_dfr() to combine data frames
    file_list,
    function(x) {
      read_html(x) %>%
        html_node("table") %>% # read the first 'table' node (which is the only one in the sample)
        html_table(fill = TRUE) %>% # fill it because the table is not neat yet
        setNames(as.character(unlist(.[1, ]))) %>% # use the first row to set column names
        .[-c(1, nrow(.)), ] # chop the first row which is the repeated column names and the last row which is the total row
    }
  )
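The last two steps of that pipeline (promoting the first row to column names, then dropping the first and last rows) can be tried without any HTML files. Here is a minimal base R sketch. `raw` is made-up data standing in for what html_table() returns, and `promote_header()` is a hypothetical helper name, not part of rvest:

```r
# Made-up stand-in for the data frame html_table() would return:
# row 1 repeats the column names, the last row is a total row.
raw <- data.frame(
  X1 = c("Item", "apple", "pear", "Total"),
  X2 = c("Qty", "3", "5", "8"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use row 1 as the header, then drop row 1 and the last row.
promote_header <- function(df) {
  names(df) <- as.character(unlist(df[1, ]))
  out <- df[-c(1, nrow(df)), , drop = FALSE]
  rownames(out) <- NULL # renumber the remaining rows from 1
  out
}

clean <- promote_header(raw)
print(clean)
```

All columns are still character at this point, because the header row was read as data. Something like type.convert(clean, as.is = TRUE) can restore numeric columns afterwards.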

Related

R: how to read multiple csv files with column name in row n and select certain columns from the file and add file name to the file as a new column?

I have 100 csv files in the same folder, let's say the path="D:\Data".
For each file I want to:
Step 1. read the file from row 12 since the column names are at row 12;
Step 2. select certain columns from the file, let's say the colname I want to keep
are "Date","Time","Value";
Step 3. add the file name to the file as a new column, for example, I want to
save file1 of which name is "example 1.csv" as file1$Name="example 1.csv",
and similarly, save file2 of which name is "example 2.csv" as
file2$Name="example 2.csv", etc...
So far we got 100 new files with 4 columns "Date","Time","Value","Name". Then finally rbind all the 100 new files together.
I have no idea how to code these steps all together in R. So anyone can help? Thanks very much for your time.
update
Due to the complicated structure of my data, the sample code in the answers always returned errors. The ideas behind the code were correct, but I could only solve the problem with the code below. I believe there is a more elegant way to do this than using a loop.
# set up working directory
setwd("D:/Data")
library(data.table)
files <- list.files(path = "D:/Data", pattern = "\\.csv$")
# read and save each file as a list of data frame in temp
temp <- lapply(files, read.csv, header = TRUE, skip = 11, sep = "\t", fileEncoding = "utf-16")
length(temp) # the number of files is 112
## select columns "Date","Time","Value" as a new file,
## and attach the file name as a new column to each new file,
## and finally row bind all the files together
temp2 <- NULL
for (i in seq_along(temp)) {
  dd <- cbind(File = files[i], temp[[i]][, c("Date", "Time", "Value")])
  temp2 <- rbind(temp2, dd)
}
You can do this very neatly with vroom. It can take a list of files as an argument rather than having to do each separately, and add the filename column itself:
library(vroom)
vroom(files, skip = 11, id = 'filename', col_select = c(Date, Time, Value, filename))
You could try something like this
list_of_files <- list.files(path = "D:/Data/", pattern = "\\.csv$", full.names = TRUE)
library(dplyr)
library(purrr)
list_of_files %>%
  set_names() %>%
  map_dfr(~ .x %>%
    readr::read_csv(
      skip = 11, # the column names are on row 12
      col_names = TRUE
    ) %>%
    select(Date, Time, Value) %>%
    mutate(Date = as.character(Date)) %>%
    # Alternatively you could use the .id argument in map_dfr for the filename
    mutate(filename = basename(.x)))
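The three steps from the question (skip to the header on row 12, keep three columns, record the file name) can also be done without a loop in base R. This is a sketch only: `read_one()` is a hypothetical helper, and the demo files are made up and comma-separated. For tab-delimited UTF-16 files like those in the question's update, the `sep` and `fileEncoding` arguments would need to be passed to read.csv() as well.

```r
# Hypothetical helper: read one file whose header sits on line 12,
# keep three columns and record the source file name.
read_one <- function(path, keep = c("Date", "Time", "Value")) {
  df <- read.csv(path, skip = 11, header = TRUE, stringsAsFactors = FALSE)
  cbind(Name = basename(path), df[, keep, drop = FALSE])
}

# Demo on two small temporary files (11 lines of metadata, then the header).
demo_dir <- tempfile("demo")
dir.create(demo_dir)
for (i in 1:2) {
  lines <- c(rep("metadata", 11),
             "Date,Time,Value,Other",
             sprintf("2020-01-0%d,10:00,%d,x", i, i))
  writeLines(lines, file.path(demo_dir, sprintf("example %d.csv", i)))
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read_one))
print(combined)
```

Building the list first and calling rbind once avoids growing a data frame inside a loop, which gets slow as the number of files increases.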

Read in CSV files and Add a Column with File name

Assume you have 2 files as follows.
file_1_october.csv
file_2_november.csv
The files have identical columns. So I want to read both files in R which I can easily do with map. I also want to include in each read file a column month with the name of the file. For instance, for file_1_october.csv, I want a column called “month” that contains the words “file_1_october.csv”.
For reproducibility, assume file_1_october.csv is
name,age,gender
james,24,male
Sue,21,female
While file_2_november.csv is
name,age,gender
Grey,24,male
Juliet,21,female
I want to read both files but in each file include a month column that corresponds to the file name so that we have;
name,age,gender,month
james,24,male, file_1_october.csv
Sue,21,female, file_1_october.csv
AND
name,age,gender,month
Grey,24,male, file_2_november.csv
Juliet,21,female, file_2_november.csv
Maybe something like this?
library(dplyr) # provides %>% and mutate()
csvlist <- c("file_1_october.csv", "file_2_november.csv")
df_list <- lapply(csvlist, function(x) read.csv(x) %>% mutate(month = x))
for (i in seq_along(df_list)) {
assign(paste0("df", i), df_list[[i]])
}
The two dataframes will be saved in df1 and df2.
Here's a (mostly) tidyverse alternative that avoids looping:
library(tidyverse)
csv_names <- list.files(path = "path/", # set the path to your folder with csv files
                        pattern = "\\.csv$", # select all csv files in the folder (regular expression)
                        full.names = TRUE) # output full file names (with path)
# csv_names <- c("file_1_october.csv", "file_2_november.csv")
csv_names2 <- data.frame(month = csv_names,
                         id = as.character(seq_along(csv_names))) # id for joining
data <- csv_names %>%
  lapply(read_csv) %>% # read all the files at once
  bind_rows(.id = "id") %>% # bind all tables into one object, and give id for each
  left_join(csv_names2, by = "id") # join month column created earlier
This gives a single data object with data from all the CSVs together. In case you need them separately, you can omit the bind_rows() step, giving you a list of multiple tables ("tibbles"). These can then be split using list2env() or some split() function.
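If the combined table has already been built, it can also be split back into one table per file. This base R sketch shows the split() and list2env() route. `combined` is made-up data standing in for `data`, and the `tables` environment is an illustrative choice so the global workspace is not cluttered:

```r
# Made-up combined table standing in for `data` above.
combined <- data.frame(
  name  = c("james", "Sue", "Grey", "Juliet"),
  age   = c(24, 21, 24, 21),
  month = c("file_1_october.csv", "file_1_october.csv",
            "file_2_november.csv", "file_2_november.csv"),
  stringsAsFactors = FALSE
)

# One data frame per source file, named after the file.
by_file <- split(combined, combined$month)

# Turn file names into syntactically valid object names, then
# create one variable per table in a dedicated environment.
names(by_file) <- make.names(sub("\\.csv$", "", names(by_file)))
tables <- new.env()
list2env(by_file, envir = tables)

print(ls(tables))
print(tables$file_1_october)
```

Keeping the tables in a named list (`by_file`) is usually easier to work with than separate variables, since the list can be passed straight to lapply().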

Loop code through multiple .csv files in R

I'm trying to run a chunk of code through 50+ csv files, but I can't figure out how to do it. All the csv files have the same columns, but different number of rows.
So far I have the following:
filelist<- list.files(pattern = ".csv") #made a file list with all the .csv files in the directory
samples <- lapply(filelist, function(x) read.table(x, header=T)[,c(1,2,3,5)]) #read all the csv files (only the columns I'm interested in)
output <- samples %>%
rename (orf = protein) %>%
filter (!grepl("sp", orf)) %>%
write.csv (paste0("new_", filename))
#I want to rename a column and remove all rows containing "sp" in that column, then export the dataframe as new_originalfilename.csv
Any help would be greatly appreciated!
You may do this in the same lapply loop.
library(dplyr)
lapply(filelist, function(x) {
  read.table(x, header = TRUE) %>%
    select(1, 2, 3, 5) %>%
    rename(orf = protein) %>%
    filter(!grepl("sp", orf)) %>%
    write.csv(paste0("new_", x), row.names = FALSE)
})
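The same per-file steps can be wrapped in a function and checked on a small made-up file before running it over all 50+ files. This is a base R sketch under some assumptions: `process_file()` is a hypothetical helper, the demo file is comma-separated (so it uses read.csv() rather than read.table()), and the protein identifiers are invented.

```r
# Hypothetical helper: keep columns 1, 2, 3 and 5, rename `protein` to `orf`,
# drop rows whose orf contains "sp", and write new_<original name>.
process_file <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)[, c(1, 2, 3, 5)]
  names(df)[names(df) == "protein"] <- "orf"
  df <- df[!grepl("sp", df$orf, fixed = TRUE), , drop = FALSE]
  out <- file.path(dirname(path), paste0("new_", basename(path)))
  write.csv(df, out, row.names = FALSE)
  out # return the output path
}

# Demo on one made-up file with five columns.
demo <- file.path(tempdir(), "sample1.csv")
write.csv(data.frame(protein = c("sp|P12345", "YAL001C", "YBR002W"),
                     a = 1:3, b = 4:6, skip_me = 7:9, d = 10:12),
          demo, row.names = FALSE)

result <- read.csv(process_file(demo), stringsAsFactors = FALSE)
print(result)
```

Building the output name from basename() keeps the `new_` prefix on the file name itself, which matters if the file list was created with full.names = TRUE. Note that grepl("sp", ...) matches "sp" anywhere in the value, not only at the start.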

How can I read multiple csv files into R at once and know which file the data is from? [duplicate]

I want to read multiple csv files into R and combine them into one large table. However, I need a column that identifies which file each row came from.
Basically, every row has a unique identifying number within a file but those numbers are repeated across files. So if I bind all files into a table without knowing which file every row is from I won't have a unique identifier anymore which makes my planned analysis impossible.
What I have so far is this but this doesn't give me what file the data came from.
list_file <- list.files(pattern="*.csv") %>% lapply(read.csv,stringsAsFactors=F)
combo_data <- list.rbind(list_file)
I have about 100 files to read in so I'd really appreciate any help so I don't have to do them all individually.
One way would be to use map_df from purrr to bind all the csv's into one with a unique column identifier.
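library(dplyr) # for %>% and mutate()
filenames <- list.files(pattern = "\\.csv$")
purrr::map_df(filenames, read.csv, stringsAsFactors = FALSE, .id = 'filename') %>%
  dplyr::mutate(filename = filenames[as.integer(filename)]) -> combo_data # .id holds the list position, so convert it before indexing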
Also :
combo_data <- purrr::map_df(filenames,
~read.csv(.x, stringsAsFactors = FALSE) %>% mutate(filename = .x))
In base R :
combo_data <- do.call(rbind, lapply(filenames, function(x)
cbind(read.csv(x, stringsAsFactors = FALSE), filename = x)))
If you want to stay in base R you can use
file.names <- list.files(pattern = "\\.csv$")
df.list <- lapply(file.names, function(file.name)
{
  df <- read.csv(file.name)
  df$file.name <- file.name
  return(df)
})
df <- do.call(rbind, df.list) # list.rbind() comes from the rlist package, not base R
As another answer suggested, the tidyverse now makes this easier:
library(readr)
library(purrr)
library(dplyr)
library(stringr)
df <- fs::dir_ls(regexp = "\\.csv$") %>%
  map_dfr(read_csv, id = 'path') %>% # the id argument needs readr 2.0 or later
  mutate(thename = str_replace(path, "\\.csv$", "")) %>% # strip the extension
  select(-path)
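Once every row carries its file name, the identifier that repeats across files can be combined with it to get a key that is unique in the whole table. A minimal base R sketch follows. The column name `id` is an assumption (the question does not name the identifier column), and the data is made up:

```r
# Made-up combined data: `id` (an assumed column name) repeats across files.
combo_data <- data.frame(
  id       = c(1, 2, 1, 2),
  value    = c(10, 20, 30, 40),
  filename = c("a.csv", "a.csv", "b.csv", "b.csv"),
  stringsAsFactors = FALSE
)

# Build a key that is unique across files by pairing file name and id.
combo_data$uid <- paste(combo_data$filename, combo_data$id, sep = ":")

# Check that the key really is unique before relying on it.
stopifnot(!anyDuplicated(combo_data$uid))
print(combo_data)
```

The check fails loudly if two rows end up with the same key, for example when the same file was read twice.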

count rows split by type for many csv files R

I have many csv files and I need to count the number of rows split by type. An example csv format is
Type,speed
Turtle,10
Lion,50
Cheetah,100
Turtle,12
Lion,70
Cheetah,110
Cheetah,170
So the example output would be:
Type count
turtle 2
lion 2
cheetah 3
I can do this for an individual file using the below R code:
library(dplyr)
##
a1 <- read.csv("data1.csv")
a1 %>%
  group_by(Type) %>%
  summarise(count = n())
Can someone help me loop this across all csv files? I have data1.csv to data100.csv.
As mentioned in the comments you can use list.files to get a list of the files in your directory:
file_list <- list.files(directory, full.names = TRUE) # pattern omitted since they're the only files; full paths so read.csv can find them
Then read all files into a list:
files <- lapply(file_list, read.csv, header = TRUE)
names(files) <- sub("\\.csv$", "", basename(file_list))
Now, you could do:
res <- lapply(files, function(dat) dplyr::count(dat, Type))
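The per-file counts can also be stacked into a single table that records which file each count came from. This base R sketch uses a made-up two-element list in place of the named list `files` built above:

```r
# Made-up stand-in for the named list `files` built above.
files <- list(
  data1 = data.frame(Type = c("Turtle", "Lion", "Cheetah", "Turtle"),
                     speed = c(10, 50, 100, 12)),
  data2 = data.frame(Type = c("Lion", "Cheetah", "Cheetah"),
                     speed = c(70, 110, 170))
)

# Count rows per Type in each file, tagging each count with its file name.
counts <- lapply(names(files), function(nm) {
  tab <- as.data.frame(table(Type = files[[nm]]$Type),
                       responseName = "count", stringsAsFactors = FALSE)
  cbind(file = nm, tab)
})
all_counts <- do.call(rbind, counts)
print(all_counts)
```

If totals across all files are wanted instead, something like aggregate(count ~ Type, data = all_counts, FUN = sum) collapses the table by Type.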
