R - read_csv without using paste - r

Use the read_csv function to read each of the files you got in the files object with code below:
path <- system.file("extdata", package = "dslabs")
files <- list.files(path)
files
I tried this code below but I get "vroom_ error. Please help.
for (f in files){
read_csv(f)
}

First, in list.files you should set full.names=TRUE to include the whole path. Next, if you look into files, there are also .xls and .pdf files included. You may want to filter just for .csv files, which can easily be done using grep.
files <- list.files(path, full.names=TRUE)
files <- grep('.csv$', files, value=TRUE)
However, even then readr::read_csv complains about column issues.
lst <- readr::read_csv(files)
# Error: Files must all have 2 columns:
# * File 2 has 57 columns
To avoid editing the columns by hand, I recommend to use rio::import_list instead, which gives just a warning, that a column name was guessed and can be changed if needed. You may even include the .xls in the grep.
files <- grep('.csv$|.xls', files, value=TRUE)
lst <- rio::import_list(files)
Note that rio::import_list (as well as readr::read_csv) is vectorized, so you won't need a loop.
Data:
path <- system.file("extdata", package="dslabs")

Related

Parsing issue, unexpected character when loading a folder

I am using this answer to load in a folder of Excel Files:
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
So far, so good.
# Load them in
#----------------------------#
# Method 1:
invisible(sapply(files, source, local=TRUE))
#-- OR --#
# Method 2:
sapply(files, function(f) eval(parse(text=f)))
But the source function (Method 1) gives me the error:
Error in source("C:/Users/Username/filename.xlsx") :
C:/Users/filename :1:3: unexpected input
1: PK
^
For method 2 get the error:
Error in parse(text = f) : <text>:1:3: unexpected '/'
1: C:/
^
EDIT: I tried circumventing the issue by setting the working directory to the directory of the folder, but that did not help.
Any ideas why this happens?
EDIT 2: It works when doing the following:
How can I read multiple (excel) files into R?
setwd("...")
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
just to provide a proper answer outside of the comment section...
If your target is to read many Excel files, you shouldn't use source.
source is dedicated to run external R code.
If you need to read many Excel files you can use the following code and the support of one of these libraries: readxl, openxlsx, tidyxl (with unpivotr).
filelist <- dir(folder, recursive = TRUE, full.names = TRUE, pattern = ".xlsx$|.xls$", ignore.case = TRUE)
l_df <- lapply(filelist, readxl::read_excel)
Note that we are using dir to list the full paths (full.names = TRUE) of all the files that ends with .xlsx, .xls (pattern = ".xlsx$|.xls$"), .XLSX, .XLS (ignore.case = TRUE) in the folder folder and all its subfolders (recursive = TRUE).
readxl is integrated with tidyverse. It is pretty easy to use. It is most likely what you're looking for.
Personally, I advice to use openxlsx if you need to write (rather than read) customized Excel files with many specific features.
tidyxl is the best package I've seen to read Excel files, but it may be rather complicated to use. However, it's really careful in the types preservation.
With the support of unpivotr it allows you to handle complicated Excel structures.
For example, when you find multiple headers and multiple left index columns.

reading multiple csv files using data.table doesn't work when given files path, possible bug?

I want to read multiple csv files where I only read two columns from each. So my code is this:
library(data.table)
files <- list.files(pattern="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\*.csv")
temp <- lapply(files, function(x) fread(x, select = c("screenNames", "retweetUserScreenName")))
data <- rbindlist(temp)
This yields character(0). However when I move those csv files out to where my script is, and change the files to this:
files <- list.files(pattern="*.csv")
#....
My dir() output is this:
[1] "adjaceny_list.R" "cleanusrnms_firstbatch"
[3] "RawCSV_firstBatch" "username_cutter.py"
everything gets read. Could you help me track down what's exactly going on please? The folder that contains these csv files are in same directory where the script is. SO even if I do patterm= "RawCSV_firstBatch\\*.csv" same problem.
EDIT:
also did:
files <- list.files(path="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\",pattern="*.csv")
#and
files <- list.files(pattern="C:/Users/XYZ/PROJECT/NAME/venv/RawCSV_firstBatch/*.csv")
Both yielded empty data frame.
#NelsonGon mentioned a workaround:
Do something like: list.files("./path/folder",pattern="*.csv$") Use ..
or . as required.(Not sure about using actual path). Can also utilise
~
So that works. Thank you. (sorry have 2 days limit before I tick this as answer)

Read several PDF files into R with pdf_text

I have several PDF files in my directory. I have downloaded them previously, no big deal so far.
I want to read all those files in R. My idea was to use the "pdf_text" function from the "pdftools" package and write a formula like this:
mypdftext <- pdf_text(files)
Where "files" is an object that gathers all the PDF file names, so that I don't have to write manually all the names. Because I have actually downlaoded a lot of files, it would avoid me to write:
mypdftext <- pdf_text("file1.pdf", "file2.pdf", and many more files...)
To create the object "pdflist", I used "files <- list.files (pattern = "pdf$")"
The “files” vector contains all the PDF file names.
But "files" does not work with pdf_text function, probably because it's a vector. What can I do instead?
maybe this is not the best solution but this works for me:
library(pdftools)
# Set your path here.
your_path = 'C:/Users/.../pdf_folder'
setwd(your_path)
getwd()
lf = list.files(path=getwd(), pattern=NULL, all.files=FALSE,
full.names=FALSE)
#Creating a list to iterate
my_pdfs = {}
#Iterate. Asssign each element of list files, to a list.
for (i in 1:length(lf)){my_pdfs[i] <- pdf_text(lf[i])}
#Calling the first pdf of the list.
my_pdfs[1]
Then you can assign each of the pdfs to a single file of whatever you want. Of course, each file will be saved in each element of the list. Does this solve your problem?
You could try using lapply over the vector that contains the location of every pdf file (files). I would recommend using list.files(..., full.names = T) to get the complete location of each pdf file. This should work.
mypdfs<-lapply(files, pdf_text)

Importing files with almost similar path and name

I have many txt files that I want to import into R. These files are imported one by one, I do the operations that I want, and then I import the next file.
All these files are located in a database system where all the folders have almost the same names, e.g.
database\type4\system50
database\type6\system50
database\type4\system30
database\type4\system50
Similarly, the names of the files are also almost the same, referring to the folder where they are positioned, e.g..
type4.system50.txt
type6.system50.txt
type4.system30.txt
type4.system50.txt
I have heard that there should be a easier way of importing these many files one by one, than simply multiple setwd and read.csv2 commands. As far as I understand this is possible by the macro import function in SAS, where you specify an overall path and then for each time you want to import a file you specify what is specific about this file name/folder name.
Is there a similar function in R? I tried to look at
Importing Data in R like SAS macro
, but this question did not really show me how to specify the folder name/file name.
Thank you for your help.
If you want to specify folder name / file name, try this
databasepath="path/to/database"
## list all files
list.files(getwd(),recursive = T,full.names = T,include.dirs = T) -> tmp
## filter files you want to read
readmyfile <- function(foldername,filename){
tmp[which(grepl(foldername,tmp) & grepl(filename,tmp))]
}
files_to_read <- readmyfile("type4", "system50")
some_files <- lapply(files_to_read, read.csv2)
## Or you can read all of them (if memory is large enough to hold them)
all_files <- lapply(tmp,read.csv2)
Instead of using setwd continuously, you could specify the absolute path for each file, save all of the paths to a vector, loop through the vector of paths and load the files into a list
library(data.table)
file_dir <- "path/to/files/"
file_vec <- list.files(path = file_dir, pattern = "*.txt")
file_list <- list()
for (n in 1:length(file_list)){
file_list[[n]] <- fread(input = paste0(file_dir, file_vec[n]))
}

How to read a csv file or load an excel workbook by ignoring some characters in the file path?

I'm writing a loop script which involves reading a file from a workbook (using the package XLConnect). The challenge is that the file names contain characters (representing time) that I want to ignore.
For example, here are 3 paths to those files:
G://User//Documents//daily_data//Op_Schedule_20160520_132025.xlsx
G://User//Documents//daily_data//Op_Schedule_20160521_142805.xlsx
G://User//Documents//daily_data//Op_Schedule_20160522_103052.xlsx
I need to import hundreds of those files. I can easily account for the character string representing the date (e.g. 20160522), but not the time.
Is there a way to tell R to ignore some characters located in the file path? Here is how I was thinking of writing my script (the "???" is where i need help). I know a loop is probably not the most efficient way, but i'm open to suggestions, should you have any:
require(XLConnect)
path= "G://User//Documents//daily_data//Op_Schedule_"
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
scheduleList = rep(list(matrix(1,1,1)),length(wd.seq))
for(i in 1:length(wd.seq)) {
wb = loadWorkbook(file= paste0(path,wd.seq[i],"???",".xlxs"))
scheduleList[[i]] = readWorksheet(wb,sheet='=SCHEDULE', header = TRUE)
}
`
Thanks for reading and suggestions, if any.
Mathieu
I don't know if this is helpful, but if you want to read all the files in a certain directory (which it seems to me is what you're after), you can read all the filenames into a list using the list.files() function, for example
fileList <- list.files(""G://User//Documents//daily_data//")
And then load the xlsx files looping through the list with a for loop
for(i in fileList) {
loadWorkbook(file = i)
}
I haven't used the XLConnect function before so that exact code probably doesn't work, but the loop will iterate through all the files in that directory and so you can construct your loading call using the i variable for the filename (it won't be an absolute path though, so you might need to use paste to add the first part of the filepath)
I realize there might be other files in the directory that are not excel files, you could use grepl to select only files containg "OP_Schedule_"
fileListClean <- fileList[grepl("Op_Schedule_",fileList)]
or perhaps only selecting .xlsx files in the directory:
fileListClean <- fileList[grepl(".xlsx",fileList)]
Edit to fit your reply:
Since you need to fit it to a sequence, you can do it as you did earlier:
wd.seq = format(seq(as.Date("2014-01-01"),as.Date("2016-12-31"),"days"),format="%Y%m%d")
wd.seq2 <- paste("Op_Schedule_", wd.seq, sep = "")
And then use grepl to only pick files starting with that extensions:
fileListClean <- fileList[grepl(paste(wd.seq2, collapse = "|"), fileList)]
Full disclosure: The last part i got from this SO answer: grep using a character vector with multiple patterns

Resources