Ignore files with parsing errors in import (read_csv)

Ignore files with parsing errors in import (read_csv) - r

I have incredibly raw data in the format of a .zip with a .txt file inside. For the most part, it cleanly reads in using read_csv, but there are some lines where the data is logging something else and completely skews the column structure. This data has no chance of being fixed.
When using read_csv, it shows up as a parsing problem. I want to set up my code where if this problem appears in the data, the whole file is ignored. It'd be great if there was a log of which files were ignored/thrown out. I looked into possibly(), but since it's not a full error with the file, only the lines, it doesn't skip the file.
This is my code at the moment.
library(dplyr)
library(readr)
library(purrr)
read_log <- function(path) {
read_csv(path, col_types = cols(.default = col_character())) %>%
mutate(filename = basename(path))
}
test_files <- file.path("example.txt") #would normally be list.files, simplified for this reprex
raw_data <- map_dfr(test_files, read_log)
#> Warning: 6 parsing failures.
#> row col expected actual file
#> 3 -- 17 columns 4 columns 'example.txt'
#> 4 -- 17 columns 23 columns 'example.txt'
#> 5 -- 17 columns 23 columns 'example.txt'
#> 6 -- 17 columns 23 columns 'example.txt'
#> 7 -- 17 columns 23 columns 'example.txt'
#> ... ... .......... .......... .............
#> See problems(...) for more details.

You can return NULL if a warning is returned. Try using this function.
library(reader)
library(purrr)
library(dplyr)
read_log <- function(path) {
data <- tryCatch(read_csv(path,col_types = cols(.default = col_character())),
warning = function(e) return(NULL))
if(!is.null(data))
data <- data %>% mutate(filename = basename(path))
return(data)
}
Read the data with map instead of map_dfr :
all_data <- map(test_files, read_log)
Files which were not read
not_read_files <- test_files[sapply(all_data, is.null)]
Combine the data
total_data <- bind_rows(all_data)

Related

Read all csv files in a directory and add the name of each file in a new column [duplicate]

This question already has answers here:
Importing multiple .csv files into R and adding a new column with file name
(2 answers)
How can I read multiple csv files into R at once and know which file the data is from? [duplicate]
(3 answers)
Closed 13 days ago.
I have this code that reads all CSV files in a directory.
nm <- list.files()
df <- do.call(rbind, lapply(nm, function(x) read_delim(x,';',col_names = T)))
I want to modify it in a way that appends the filename to the data. The result would be a single data frame that has all the CSV files, and inside the data frame, there is a column that specifies from which file the data came. How to do it?

Instead of do.call(rbind, lapply(...)), you can use purrr::map_dfr() with the .id argument:
library(readr)
library(purrr)
df <- list.files() |>
set_names() |>
map_dfr(read_delim, .id = "file")
df
# A tibble: 9 × 3
file col1 col2
<chr> <dbl> <dbl>
1 f1.csv 1 4
2 f1.csv 2 5
3 f1.csv 3 6
4 f2.csv 1 4
5 f2.csv 2 5
6 f2.csv 3 6
7 f3.csv 1 4
8 f3.csv 2 5
9 f3.csv 3 6
Example data:
for (f in c("f1.csv", "f2.csv", "f3.csv")) {
readr::write_delim(data.frame(col1 = 1:3, col2 = 4:6), f, ";")
}

readr::read_csv() can accept a vector of file names. The id parameter is "the name of a column in which to store the file path. This is useful when reading multiple input files and there is data in the file paths, such as the data collection date."
nm |>
readr::read_csv(
id = "file_path"
)
I see other answers use file name without the directory. If that's desired, consider using functions built for file manipulation, instead of regexes, unless you're sure the file names & paths are always well-behaved.
nm |>
readr::read_csv(
id = "file_path"
) |>
dplyr::mutate(
file_name_1 = basename(file_path), # If you want the extension
file_name_2 = tools::file_path_sans_ext(file_name_1), # If you don't
)

Here is another solution using purrr, which removes the file extention from the value in the column filename.
library(tidyverse)
nm <- list.files(pattern = "\\.csv$")
df <- map_dfr(
.x = nm,
~ read.csv(.x) %>%
mutate(
filename = stringr::str_replace(
.x,
"\\.csv$",
""
)
)
)
View(df)
EDIT
Actually you can still removes the file extention from the column for the file names when you apply #zephryl's approach by adding a mutate() process as follows:
df <- nm %>%
set_names() %>%
map_dfr(read_delim, .id = "file") %>%
mutate(
file = stringr::str_replace(
file,
"\\.csv$",
""
)
)

You can use bind_rows() from dplyr and supply the argument .id that creates a new column of identifiers to link each row to its original data frame.
df <- dplyr::bind_rows(
lapply(setNames(nm, basename(nm)), read_csv2),
.id = 'src'
)
The use of basename() removes the directory paths prepended to the file names.

For conventional scenarios, I prefer for readr to loop through the csvs by itself. But there some scenarios where it helps to process files individually before stacking them together.
A few weeks ago, purrr 1.0's map_dfr() function was "superseded in favour of using the
appropriate map function along with list_rbind()".
#zephryl's snippet is slightly modified to become
list.files() |>
rlang::set_names() |>
purrr::map(readr::read_delim) |>
# { possibly process files here before stacking/binding } |>
purrr::list_rbind(names_to = "file")
The functions were superseded in purrr 1.0.0 because their names suggest they work like _lgl(), _int(), etc which require length 1 outputs, but actually they return results of any size because the results are combined without any size checks. Additionally, they use dplyr::bind_rows() and dplyr::bind_cols() which require dplyr to be installed and have confusing semantics with edge cases. Superseded functions will not go away, but will only receive critical bug fixes.
Instead, we recommend using map(), map2(), etc with list_rbind() and list_cbind(). These use vctrs::vec_rbind() and vctrs::vec_cbind() under the hood, and have names that more clearly reflect their semantics.
Source: https://purrr.tidyverse.org/reference/map_dfr.html

How to merge files in a directory with r?

Good afternoon,
have a folder with 231 .csv files and I would like to merge them in R. Each file is one spectrum with 2 columns (Wavenumber and Reflectance), but as they come from the spectrometer they don't have colnames. So they look like this when I import them:
C_Sycamore = read.csv("#C_SC_1_10 average.CSV", header = FALSE)
head(C_Sycamore)
V1 V2
1 399.1989 7.750676e+001
2 401.1274 7.779499e+001
3 403.0559 7.813432e+001
4 404.9844 7.837078e+001
5 406.9129 7.837600e+001
6 408.8414 7.822227e+001
The first column (Wavenumber) is identical in all 231 files and all spectra contain exactly 1869 rows. Therefore, it should be possible to merge the whole folder in one big dataframe, right? At least this would very practical for me.
So what I tried is this. I set the working directory to the according folder. Define an empty variable d. Store all the file names in file.list. And the loop through the names in the file.list. First, I want to change the colnames of every file to "Wavenumber" and "the according file name itself", so I use deparse(substitute(i)). Then, I want to read in the file and merge it with the others. And then I could probably do merge(d, read.csv(i, header = FALSE, by = "Wavenumber"), but I don't even get this far.
d = NULL
file.list = list.files()
for(i in file.list){
colnames(i) = c("Wavenumber", deparse(substitute(i)))
d = merge(d, read.csv(i, header = FALSE))
}
When I run this I get the error code
"Error in colnames<-(*tmp*, value = c("Wavenumber", deparse(substitute(i)))) :
So I tried running it without the "colnames()" line, which does not produce an error code, but doesn't work either. Instead of my desired dataframe I get am empty dataframe with only two columns and the message:
"reread"#S_BE_1_10 average.CSV" "#S_P_1_10 average.CSV""
This kind of programming is new to me. So I am thankful for all useful suggestions. Also I am happy to share more data if it helps.
Thanks in advance.

Solution
library(tidyr)
library(purrr)
path <- "your/path/to/folder"
# in one pipeline:
C_Sycamore <- path %>%
# get csvs full paths. (?i) is for case insentitive
list.files(pattern = "(?i)\\.csv$", full.names = TRUE) %>%
# create a named vector: you need it to assign ids in the next step.
# and remove file extection to get clean colnames
set_names(tools::file_path_sans_ext(basename(.))) %>%
# read file one by one, bind them in one df and create id column
map_dfr(read.csv, col.names = c("Wavenumber", "V2"), .id = "colname") %>%
# pivot to create one column for each .id
pivot_wider(names_from = colname, values_from = V2)
Explanation
I would suggest not to change the working directory.
I think it's better if you read from that folder instead.
You can read each CSV file in a loop and bind them together by row. You can use map_dfr to loop over each item and then bind every dataframe by row (that's what the _dfr stands for).
Note that I've used .id = to create a new column called colname. It gets populated out of the names of the vector you're looping over. (That's why we added the names with set_names)
Then, to have one row for each Wavenumber, you need to reshape your data. You can use pivot_wider.
You will have at the end a dataframe with as many rows as Wavenumber and as many columns as the number of CSV plus 1 (the wavenumber column).
Reproducible example
To double check my results, you can use this reproducible example:
path <- tempdir()
csv <- "399.1989,7.750676e+001
401.1274,7.779499e+001
403.0559,7.813432e+001
404.9844,7.837078e+001
406.9129,7.837600e+001
408.8414,7.822227e+001"
write(csv, file.path(path, "file1.csv"))
write(csv, file.path(path, "file2.csv"))
You should expect this output:
C_Sycamore
#> # A tibble: 5 x 3
#> Wavenumber file1 file2
#> <dbl> <dbl> <dbl>
#> 1 401. 77.8 77.8
#> 2 403. 78.1 78.1
#> 3 405. 78.4 78.4
#> 4 407. 78.4 78.4
#> 5 409. 78.2 78.2
Thanks a lot to #Konrad Rudolph for the suggestions!!

no need for a loop here simply use lapply.
first set your working directory to file location###
library(dplyr)
files_to_upload<-list.files(, pattern = "*.csv")
theData_list<-lapply(files_to_upload, read.csv)
C_Sycamore <-bind_rows(theData_list)

r - read in multiple files and select max value from column by group

I am looking for the most elegant way loop through and read in multiple files organized by date and select the most recent value if anything changed based on multiple keys.
Sadly, the reason I need to read in all the files and not just the last files is because there could be an instance in the file that disappears that I would like to capture.
Here is an example of what the files looks like (I'm posting comma separated even though it's fixed width)
file_20200101.txt
key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101
file_20200102.txt
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
and an example of the desired output:
desired_df
key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102
100,foo,15,20200101
In addition, here is some code I know that works to read in multiple files and then get my ideal output, but I need it to be inside of the loop. The dataframe would be way too big to import and bind all the files:
files <- list.files(path, pattern = ".txt")
df <- files %>%
map(function(f) {
print(f)
df <- fread(f)
df <- df %>% mutate(date_as_numb = f)
return(df)
}) %>% bind_rows()
df <- df %>%
mutate(file_date = as.numeric(str_remove_all(date_as_numb, ".*_"))) %>%
group_by(key_1, key_2) %>%
filter(date_as_numb == max(date_as_numb))
Thanks in advance!

Don't know what is "way too big" for you. Data table can (allegedly) handle really big data. So if bind_rows of the list is not ok, maybe use data table.
(in my own experience, dplyr::group_by can be really slow with many groups (say 10^5 groups) in large-ish data (say around 10^6 rows). I don't have much experience with data.table, but all the threads mention its superiority for large data).
I've used this answer for merging a list of data tables
library(data.table)
dt1 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,100,20200101
456,def,200,20200101
789,xyz,100,20200101
100,foo,15,20200101")
dt2 <- fread(text = "key_1,key_2,value,date_as_numb
123,abc,50,20200102
456,def,500,20200102
789,xyz,300,20200102")
ls_files <- list(dt1, dt2)
# you would have created this list by calling fread with lapply, like
# ls_files <- lapply(files, fread)
# Now here a two-liner with data table.
alldata <- Reduce(function(...) merge(..., all = TRUE), ls_files)
alldata[alldata[, .I[which.max(date_as_numb)], by = .(key_1, key_2)]$V1]
#> key_1 key_2 value date_as_numb
#> 1: 100 foo 15 20200101
#> 2: 123 abc 50 20200102
#> 3: 456 def 500 20200102
#> 4: 789 xyz 300 20200102
Created on 2021-01-28 by the reprex package (v0.3.0)

R Plyr Write CSV

I am trying to split a data frame and write it to a csv file in r using the unique values in one variable. I am new to r and I'm not entirely sure I know what I'm doing.
## trying to subset data
library(dplyr)
library(plyr)
#set the working directory
setwd("S:/some stuff")
## load the datafile into an object called data.
data <- read.csv("S:/some stuff/Area.csv",
header = TRUE, sep = ",")
#Create subsets of data by LA
LA<-subset(data,AREA == "LA")
My data frame has 2,500 observations and 20 variables.
My dataframe is called LA
The variable I'd like to split it by is called Disease
I found this How to create multiple ,csv files in R?
And reapproriated it accordingly
from
plyr::d_ply(iris, .(Species), function(x) write.csv(x,
file = paste(x$Species, ".csv", sep = "")))
to
plyr::d_ply(LA, .(Disease), function(x) write.csv(x,
file = paste(LA$Disease, ".csv", )))
However....
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
In addition: Warning message:
In if (file == "") file <- stdout() else if (is.character(file)) { :
Show Traceback
Rerun with Debug
Error in file(file, ifelse(append, "a", "w")) :
invalid 'description' argument
There are two things I'd like to solve.
1) subsetting a dataframe
2) writing to a path
Ideally I'd like to loop through it from the import of data (the Area.csv file).
This has areas and disease. There are 12 areas and 20 diseases.
I would like to create csv files of each disease by area.
In this example Area = LA and then disease.
How can I step through using a loop to create the 20 different files for each area?
I thought this:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
#Subset the data by MP
tmp=subset(mpExpenses2012,MP.s.Name==name)
#Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
#Save the CSV file containing separate expenses data for each MP
write.csv(tmp,fn,row.names=FALSE)
}
might be helpful, but it's writing to a path that's getting me down.
EDIT
library(tidyr)
library(purrr)
temp_dir <- tempfile()
dir.create(temp_dir)
LA %>%
nest(-FinalDiseaseForMonthlyAnalysis) %>%
pwalk(function(FinalDiseaseForMonthlyAnalysis, data) write.csv(data, file.path(temp_dir, paste0(FinalDiseaseForMonthlyAnalysis, ".csv"))))
list.files(temp_dir)
temp_dir
unlink(temp_dir, recursive = T)
This works. But now comes the "where are the files?" question.
Yes: I get the temp file and then the unlink.
But how do I save in a folder on S:/some stuff/
?
EDIT FINAL: SOLVED
I've read that in r everything is a list. And I found a way to split by two columns to do what I needed. Annoyingly it's linked in the comments in here:
https://blog.ouseful.info/2013/04/03/splitting-a-large-csv-file-into-separate-smaller-files-based-on-values-within-a-specific-column/
and I missed it.
I was also having problems generating a dir using dir.create. Who knew that dir.create needs to have recursive = TRUE when you're trying to do stuff? I DO NOW.
Anyway. here's what I did:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(write)
## set working directory
setwd("S:/somestuff")
#create the directories - pretty sure there's a way to avoid doing this long hand
dir.create("S:/somestuff/CSV source files", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA1", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA2", recursive = TRUE)
dir.create("S:/somestuff/CSV source files/LA3", recursive = TRUE)
#Read in the CSV
DF = read.csv("S:/somestuff/CSV source files/ALL.csv",
header = TRUE, sep = ",")
glimpse(DF)
#This splits the dataframe generated above (DF) and calls it DF4
DF4 <- split(DF,list(DF$LA,DF$FinalDiseaseForMonthlyAnalysis))
lapply(names(DF4), function(name) write.csv(DF4[[name]], file = paste("S:/somestuff/CSV source files/",gsub('','',name),sep = ''), row.names = F))
I'm guessing if I read in the dataframe I could then use dir.create to create paths from the names in LA in the data frame.
After returning to the problem. It's much easier in the latest version of dplyr
ourdata<-DF4%>%
group_by(DF$LA,DF$FinalDiseaseForMonthlyAnalysis)%>%
group_walk(~ write_csv(.x, paste0(.y$LA,.y$FinalDiseaseForMonthlyAnalysis, ".csv")))

This was really helpful to me! Thanks!! I tried to simplify the crux of the matter.
library(tidyverse)
library(reprex)
states4 <- tribble(~state,~name,~area,
"AL","Alabama",50645.3242,
"AZ","Arizona",113594.0781,
"AR","Arkansas",52035.4727,
"CA","California",155779.2031
)
chain4 <- states4 %>% split(.$state)
map(names(chain4),function(stateabbrev){write_csv(chain4[[stateabbrev]],paste0("~/Downloads/","testtoken_",stateabbrev,".csv"))})
#> [[1]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AL Alabama 50645.
#>
#> [[2]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AR Arkansas 52035.
#>
#> [[3]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 AZ Arizona 113594.
#>
#> [[4]]
#> # A tibble: 1 x 3
#> state name area
#> <chr> <chr> <dbl>
#> 1 CA California 155779.
list.files(path="~/Downloads", pattern = "testtoken.*csv")
#> [1] "testtoken_AL.csv" "testtoken_AR.csv" "testtoken_AZ.csv"
#> [4] "testtoken_CA.csv"
reprex()
Created on 2019-10-02 by the reprex package (v0.3.0)

In the end I used:
## trying to subset data
# generate data:
library(tidyr)
library(purrr)
library(dplyr)
library(stringr)
library(plyr)
library (car)
## set working directory
setwd("S:/Somestuff/Borough profile maps/Working")
## read data in from geocoded file
geocoded<-read.csv("geocoded 2015 - 2018.csv",na.strings=c(""," ","N/A"))
str(geocoded)
str(geocoded$GENDER)
levels(geocoded$LA)
#split geocoded data by LA
x <-split(geocoded,geocoded$LA)
str(x)
#Split geocoded data by LA and Final
#split(x, f, drop = FALSE, sep = ".", lex.order = FALSE, .)
y<-split(geocoded,list(geocoded$Final,geocoded$LA), drop = TRUE, sep = "_")
str(y)
#create dir and then write CSV files of geocoded to file locations
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/",, recursive = TRUE)
dir.create("S:/Somestuff/Borough profile maps/Working/TEST/TEST2",, recursive = TRUE)
lapply(names(x), function(name) write.csv(x[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/',gsub(' ','',name),sep = ''), row.names = F))
lapply(names(y),function(name) write.csv(y[[name]], file = paste('S:/Somestuff/Borough profile maps/Working/TEST/TEST2/',name,".csv")))
The problem was that in my original code you'll notice I was using read.csv BUT feeding in a .txt file. I changed the file to .csv and BANG. it worked. First time.
I realise that you don't need all the libraries I called at the beginning, but they're left in from my ridiculous number of attempts.

After returning to the problem. It's much easier in the latest version of dplyr
DF4%>%
group_by(DF$LA,DF$FinalDiseaseForMonthlyAnalysis)%>%
group_walk(~ write_csv(.x, paste0(.y$LA,.y$FinalDiseaseForMonthlyAnalysis, ".csv")))

Reading multiple xl files into dataframe

I have been using XLConnect function loadworkbook to load each xlsx file into R then rbind to merge them together. what is the best way of doing this instead of writing multiple df to later merge them. I am trying to use the code below to merge my excel files into 2 dataframes(2 sheet names for most files). The columns are always the same but the file names will change.
Current /slow way
require(XLConnect)
df <- loadWorkbook(paste(location,'UK.xlsx',sep=""))
dfb <- loadWorkbook(paste(location,'US.xlsx',sep=""))
UK <-readWorksheet(df,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="School",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
School <- rbind(UK,US)
UK <-readWorksheet(df,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
US <-readWorksheet(dfb,sheet="College",startRow=0,startCol=0,autofitRow=TRUE,endCol=21,header=TRUE)
College <- rbind(UK,US)
New code
require(readxl)
filelist<- list.files(location,pattern='xlsx',full.names = T)
How can I read each sheetname into a dataframe when not every file has both sheetnames. I need 2 dataframes 1 for School and 1 for College.
I think I need to try something like Schools <-lapply(filelist, read_excel, sheet="School") but I get Error: Sheet 'School' not found. I think this error is because sheet School is not on every file. I am using list.files because the filenames are not always the same.

What about this approach?
library(purrr)
library(readxl)
# filenames to xl-sheets
files <- sprintf("Mappe%i.xlsx", 1:3)
# read only df for xl-files with school-sheet
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~read_excel(.x))
# read only df for xl-files with college-sheet
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~read_excel(.x))
# combine school-files to data frame (repeat same for college)
school_df <- map_df(xl_school, function(x) if(is.data.frame(x)) x)
school_df
#> # A tibble: 3 × 1
#> Test
#> <chr>
#> 1 fdsf
#> 2 543534
#> 3 gfdgfdd
You might need to force the column type to be text. Just add col_types = "text" to the read_excel()-call:
# read only df for xl-files with school-sheet
xl_school <- map_if(files, ~ "School" %in% excel_sheets(.x), ~read_excel(.x, col_types = "text"))
# read only df for xl-files with college-sheet
xl_college <- map_if(files, ~ "College" %in% excel_sheets(.x), ~read_excel(.x, col_types = "text"))

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Ignore files with parsing errors in import (read_csv) - r

Related

Read all csv files in a directory and add the name of each file in a new column [duplicate]

How to merge files in a directory with r?

r - read in multiple files and select max value from column by group

R Plyr Write CSV

Reading multiple xl files into dataframe

Categories

Resources