I have a list of files in a directory that I'm trying to convert to CSV. I tried the rio package and the solutions suggested here.
The output is a list of empty CSV files with no content. It could be because the first 8 rows of the xls files contain an image and a few empty lines, with only a couple of cells filled with text.
Is there any way I could skip those first 8 rows in all of the xls files before converting?
I tried exploring options from the openxlsx and readxl packages; any suggestions or guidance would be helpful.
Please do not mark this as a duplicate, since my problem is different from the one that was already answered.
Maybe the following will work. At least it does for my own mock-up of an Excel file with a picture at the top:
library("readxl") # To read xlsx
library("readr") # Fast csv write
indata <- read_excel("~/cowexcel.xlsx", skip=8)
write_csv(indata, path="cow.csv")
If you are running this for several files, then combine it into a function. Note that the function below does no checking and might overwrite existing csv files:
convert_excel_to_csv <- function(name) {
  indata <- read_excel(name, skip = 8)
  write_csv(indata, path = paste0(tools::file_path_sans_ext(name), ".csv"))
}
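For example, one way to apply that function to every xls or xlsx file in the working directory might be the following sketch (the file pattern is an assumption; adjust it to your files):
# Hypothetical pattern: match .xls and .xlsx files in the current directory
files <- list.files(pattern = "\\.xlsx?$")
invisible(lapply(files, convert_excel_to_csv))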
Although I was not able to convert with rio, I read the files as xls and wrote them back as csv using the code below. Testing worked fine; I hope it works without a glitch in implementation.
library(xlsx)  # read.xlsx() with sheetIndex/startRow comes from the xlsx package

files <- list.files(pattern = '*.xls')
y <- NULL
for (i in files) {
  x <- read.xlsx(i, sheetIndex = 1, header = TRUE, startRow = 9)
  y <- rbind(y, x)
}
dt <- Sys.Date()
fn <- paste("path/", dt, ".csv", sep = "")
write.csv(y, fn, row.names = FALSE)
Related
I am trying to read multiple Excel files in RStudio so I can append them into one large dataset. For some reason it is not loading any of the Excel files. I have the correct working directory, but the result in R is always empty.
The names of the files are sbir_award_1, sbir_award_2, etc. Any help would be greatly appreciated.
library(tidyverse)
library(readxl)
library(writexl)
sbir <- list.files(pattern = "data/raw/sbir/sbir_award_.xlsx")
data <- list()
df.list <- lapply(file.list,read.xlsx)
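No answer is recorded for this one in the thread, but a likely issue (my reading, an assumption) is that list.files() takes the directory in path= and only a filename regex in pattern=, and that the lapply() call references file.list, which was never created. A sketch of how the loop might look with readxl, assuming the workbooks live under data/raw/sbir/:
library(readxl)
library(dplyr)

# Hypothetical location and pattern; adjust to where the sbir_award_*.xlsx files live
file.list <- list.files(path = "data/raw/sbir",
                        pattern = "^sbir_award_.*\\.xlsx$",
                        full.names = TRUE)
df.list <- lapply(file.list, read_excel)   # read the first sheet of each workbook
sbir <- bind_rows(df.list)                 # append into one large dataset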
I want to read multiple csv files, reading only two columns from each. So my code is this:
library(data.table)
files <- list.files(pattern="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\*.csv")
temp <- lapply(files, function(x) fread(x, select = c("screenNames", "retweetUserScreenName")))
data <- rbindlist(temp)
This yields character(0). However, when I move those csv files to where my script is and change the call to this:
files <- list.files(pattern="*.csv")
#....
My dir() output is this:
[1] "adjaceny_list.R" "cleanusrnms_firstbatch"
[3] "RawCSV_firstBatch" "username_cutter.py"
everything gets read. Could you help me track down what exactly is going on? The folder that contains these csv files is in the same directory as the script, so even if I use pattern = "RawCSV_firstBatch\\*.csv", I get the same problem.
EDIT:
I also tried:
files <- list.files(path="C:\\Users\\XYZ\\PROJECT\\NAME\\venv\\RawCSV_firstBatch\\",pattern="*.csv")
#and
files <- list.files(pattern="C:/Users/XYZ/PROJECT/NAME/venv/RawCSV_firstBatch/*.csv")
Both yielded an empty data frame.
@NelsonGon mentioned a workaround:
Do something like: list.files("./path/folder", pattern="*.csv$"). Use .. or . as required. (Not sure about using the actual path.) Can also utilise ~.
So that works, thank you. (Sorry, there is a 2-day limit before I can tick this as the answer.)
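For completeness, my reading of why the workaround helps (an assumption, not stated in the thread): pattern= is a filename regex, not a path, and fread() needs paths it can resolve from the working directory, which full.names = TRUE supplies. A sketch using the folder from the question:
library(data.table)

# full.names = TRUE returns complete paths that fread() can open from any working directory
files <- list.files(path = "C:/Users/XYZ/PROJECT/NAME/venv/RawCSV_firstBatch",
                    pattern = "\\.csv$", full.names = TRUE)
temp <- lapply(files, function(x)
  fread(x, select = c("screenNames", "retweetUserScreenName")))
data <- rbindlist(temp)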
I am trying to write an R script that will convert multiple xlsx workbook files within a folder, while also converting the sheets within each workbook to separate csv files.
I am looking for a single script that automatically applies the code to all workbooks and their spreadsheets.
For reading Excel files, there are several packages.
I personally am happy with the xlsx package, which you can use to read Excel files, as well as their individual sheets. This article looks like it will give you the gist of it.
Each worksheet you read in can then be exported to a CSV file using R's built-in write.csv (or write.csv2) function.
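A minimal sketch of that approach for a single workbook, assuming the xlsx package and a placeholder file name, might be:
library(xlsx)

# Hypothetical workbook name; replace with your own file
wb_path <- "my_workbook.xlsx"
sheet_names <- names(getSheets(loadWorkbook(wb_path)))

for (s in sheet_names) {
  df <- read.xlsx(wb_path, sheetName = s)
  write.csv(df, paste0(tools::file_path_sans_ext(wb_path), "-", s, ".csv"),
            row.names = FALSE)
}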
Below is an example to convert a single xlsx workbook to multiple csv files.
Note that type conversions are not guaranteed to be correct.
xlsx_path <- "path_to_xlsx.xlsx"
sheet_names <- readxl::excel_sheets(xlsx_path)

# read all sheets into a list of data frames
xlsx_data <- purrr::map(
  sheet_names,
  ~ readxl::read_excel(xlsx_path, .x, col_types = "text", col_names = FALSE)
)

# write the list of data frames to csv files
purrr::walk2(
  xlsx_data, sheet_names,
  ~ readr::write_csv(.x, paste0(xlsx_path, "-", .y, ".csv"), col_names = FALSE)
)

# csv files will be saved as:
# path_to_xlsx.xlsx-sheet1.csv, path_to_xlsx.xlsx-sheet2.csv, ...
If you need to apply this to many xlsx files, use list.files() to get the paths to all of them, and write a for loop or use another map function to iterate the process, as sketched below.
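For instance, wrapping the steps above in a function and mapping it over every workbook in a folder might look like this (the folder name is a placeholder assumption):
# Hypothetical folder; adjust to where your workbooks live
xlsx_files <- list.files("path_to_folder", pattern = "\\.xlsx$", full.names = TRUE)

convert_one_workbook <- function(xlsx_path) {
  sheet_names <- readxl::excel_sheets(xlsx_path)
  xlsx_data <- purrr::map(
    sheet_names,
    ~ readxl::read_excel(xlsx_path, .x, col_types = "text", col_names = FALSE)
  )
  purrr::walk2(
    xlsx_data, sheet_names,
    ~ readr::write_csv(.x, paste0(xlsx_path, "-", .y, ".csv"), col_names = FALSE)
  )
}

purrr::walk(xlsx_files, convert_one_workbook)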
If you are using RStudio, it is possible that you already have the readxl package installed. The readxl site explains many workflows for common use cases here: https://readxl.tidyverse.org/articles/articles/readxl-workflows.html
It also provides this nice code snippet to do what you are asking for:
library(readxl)

read_then_csv <- function(sheet, path) {
  pathbase <- tools::file_path_sans_ext(basename(path))
  df <- read_excel(path = path, sheet = sheet)
  write.csv(df, paste0(pathbase, "-", sheet, ".csv"),
            quote = FALSE, row.names = FALSE)
  df
}
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)
xl_list <- lapply(excel_sheets(path), read_then_csv, path = path)
names(xl_list) <- sheets
If you go here and put "excel" and "xls" in the search bar, you'll get a list of packages and functions which might help.
I am attempting to import a directory of daily crime statistics into R. My data files do not have headers, and when I import the CSV files into R it makes the first row of the dataset the header. I have tried col_names = FALSE but I am receiving an error; any help would be greatly appreciated.
folder <- "/Users/myname/Desktop/Stats 408/folder/"
file_list <- list.files(path = folder, pattern = "*.csv")
for (i in 1:length(file_list)) {
  assign(file_list[i],
         read.csv(paste(folder, file_list[i], sep = '')))
}
I generally prefer to import data like this into a list, since it tends to clutter up the workspace less than using assign (and it's less of a nuisance if you have files with strange names).
setNames(lapply(file_list, function(fname) {
  read.csv(paste0(folder, fname), header = FALSE)
}), file_list)
Have you tried the read.table function? I'm completely new to R, so I'm sure I won't be of much help, but I've gotten used to using the read.table function with header = FALSE when there are no headers.
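For example, a minimal sketch of that idea, reusing the folder and file_list from the question:
# header = FALSE keeps the first row as data instead of turning it into column names;
# sep = "," makes read.table behave like read.csv for these files
crime <- read.table(paste0(folder, file_list[1]), header = FALSE, sep = ",")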
I was trying to read an Excel spreadsheet into an R data frame. However, some of the columns have formulas or are linked to other external spreadsheets. Whenever I read the spreadsheet into R, many cells become NA. Is there a good way to fix this problem so that I can get the original values of those cells?
The R script I used to do the import is like the following:
options(java.parameters = "-Xmx8g")
library(XLConnect)
# Step 1 import the "raw" tab
path_cost = "..."
wb = loadWorkbook(...)
raw = readWorksheet(wb, sheet = '...', header = TRUE, useCachedValues = FALSE)
UPDATE: read_excel from the readxl package looks like a better solution. It's very fast (0.14 sec in the 1400 x 6 file I mentioned in the comments) and it evaluates formulas before import. It doesn't use java, so no need to set any java options.
library(readxl)
# sheet can be a string (name of sheet) or integer (position of sheet)
raw = read_excel(file, sheet=sheet)
For more information and examples, see the short vignette.
ORIGINAL ANSWER: Try read.xlsx from the xlsx package. The help file implies that by default it evaluates formulas before importing (see the keepFormulas parameter). I checked this on a small test file and it worked for me. Formula results were imported correctly, including formulas that depend on other sheets in the same workbook and formulas that depend on other workbooks in the same directory.
One caveat: If an externally linked sheet has changed since the last time you updated the links on the file you're reading into R, then any values read into R that depend on external links will be the old values, not the latest ones.
The code in your case would be:
# Set java options before loading xlsx, so the JVM starts with enough memory (xlsx also uses java)
options(java.parameters = "-Xmx8g")
library(xlsx)

# Replace file and sheetName with appropriate values for your file
# keepFormulas=FALSE and header=TRUE are the defaults; shown here only for illustration
raw = read.xlsx(file, sheetName=sheetName, header=TRUE, keepFormulas=FALSE)