Read multiple ".xlsx" files - R

I am trying to read multiple Excel files located in different folders using R.
Here is my solution:
library(readxl)
setwd("D:/data")
filename <- list.files(getwd(), full.names = TRUE)
# Four folders "epdata1" "epdata2" "epdata3" "epdata4" are inside the folder "data"
dataname <- list.files(filename, pattern = "\\.xlsx$", full.names = TRUE)
# Every folder inside "data" contains five Excel files
datalist <- lapply(dataname, read_xlsx)
Error: `path` does not exist:'D:/data/epidata1/出院舱随访1.xlsx'
But read_xlsx runs successfully when called on the same file directly:
read_xlsx("D:/data/epidata1/出院舱随访1.xlsx")
All the file paths exist under the "data" folder, so why does R fail to read those Excel files?
Your help will be much appreciated!

I don't see any reason why your code shouldn't work. Make sure your folder names are correct: in your comments you write "epdata1", but your error message says "epidata1".
I tried it with some csv and mixed xlsx files.
This is, again, what I would come up with to find the error/typo:
library(readxl)

pp <- function(...){ print(paste(...)) }

main <- function(){
  # Find / set up the main data folder.
  # You may change this to your needs.
  main_dir <- paste0(getwd(), "/data/")
  pp("working directory:", main_dir)
  pp("Found the following folders:")
  pp(list.files(main_dir, full.names = FALSE))
  data_folders <- list.files(main_dir, full.names = TRUE)
  pp("Found these files in the folders:", list.files(data_folders, full.names = TRUE))
  pp("Filtering *.xlsx files:", list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE))
  files <- list.files(data_folders, pattern = "\\.xlsx$", full.names = TRUE)
  datalist <- lapply(files, read_xlsx)
  print(datalist)
}

main()
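An alternative worth mentioning (a sketch, assuming all the Excel files sit somewhere below "D:/data"): listing recursively in a single call sidesteps building intermediate paths by hand, which is exactly where typos like "epdata1" vs "epidata1" creep in:
library(readxl)
# One call finds every .xlsx file at any depth below the data folder
files <- list.files("D:/data", pattern = "\\.xlsx$",
                    recursive = TRUE, full.names = TRUE)
datalist <- lapply(files, read_xlsx)
# Name each element after its source file for easier lookup
names(datalist) <- basename(files)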

Related

file.copy not working when called inside function

I'm trying to do a standardized directory setup through a function call. Inside this function I'm using two file.copy calls to copy some files from a self-made package into the working directory of a project.
If I run the code line by line, everything works fine, but if I run the whole function, only the directories get created and no files get copied. Unfortunately, the function does not throw any error, so I really do not understand what's going on or where to start troubleshooting.
Maybe one of you can give me a hint where to find the solution.
Abstract (non-working) example:
dir_setup <- function() {
  # list directories which shall be created
  dir_names <- c("dir1", "dir2", "dir3", "dir4")
  # create directories
  lapply(dir_names, function(x){ dir.create(path = paste(getwd(), x, sep = '/')) })
  # get path of package library
  lib_path <- .libPaths()
  # shorten list to vector of length 1
  if (length(lib_path) > 1) lib_path <- lib_path[1]
  # list files in source
  files <- list.files(paste0(lib_path, "/package/files/dir1"), full.names = TRUE)
  # copy resource files from package directory to working directory
  file.copy(files, paste(getwd(), "dir1", sep = '/'), overwrite = TRUE)
  # list more files
  files2 <- list.files(paste0(lib_path, "/package/files/dir2"), full.names = TRUE)
  # copy more files from package directory to working directory
  file.copy(files2, paste(getwd(), "dir2", sep = '/'), overwrite = TRUE)
}
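A way to start troubleshooting (a sketch, assuming the files ship inside an installed package here called "package", a hypothetical name): file.copy never throws on failure, it returns one logical per file, and list.files on a non-existent directory silently returns character(0), so checking both usually reveals why nothing gets copied:
dir_setup <- function() {
  dir_names <- c("dir1", "dir2", "dir3", "dir4")
  lapply(dir_names, function(x) dir.create(file.path(getwd(), x)))
  # system.file() resolves the installed location of the (hypothetical)
  # package and returns "" if the sub-directory does not exist
  src <- system.file("files/dir1", package = "package")
  if (src == "") stop("source directory not found in package library")
  files <- list.files(src, full.names = TRUE)
  if (length(files) == 0) warning("no files found to copy")
  # file.copy() returns one logical per file instead of throwing an error
  copied <- file.copy(files, file.path(getwd(), "dir1"), overwrite = TRUE)
  if (!all(copied)) warning("some files were not copied")
  invisible(copied)
}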

How can I create a data frame in R from a zip file with multiple levels located at a URL?

I have been trying to work this out, but I have not been able to do it...
I want to create a data frame with four columns: country - number - year - (content of the .txt file).
There is a .zip file at the following URL:
https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT
The file contains a folder with 49 folders in it, and each of them contains roughly 150 .txt files.
I first tried to download the zip file with get_dataset, but it did not work:
if (!require("dataverse")) devtools::install_github("iqss/dataverse-client-r")
library("dataverse")
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu")
"Error in get_dataset("=doi:10.7910/DVN/0TJX8Y/PZUURT", key = "", server = "dataverse.harvard.edu") :
Not Found (HTTP 404)."
Then I tried
temp <- tempfile()
download.file("https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT", temp)
UNGDC <- unz(temp, "UNGDC+1970-2018.zip")
It worked up to a point... I downloaded the .zip file and then created UNGDC, but nothing happened, because UNGDC only holds the following connection object:
UNGDC
A connection with
description "/var/folders/nl/ss_qsy090l78_tyycy03x0yh0000gn/T//RtmpTc3lvX/fileab730f392b3:UNGDC+1970-2018.zip"
class "unz"
mode "r"
text "text"
opened "closed"
can read "yes"
can write "yes"
Here I don't know what to do... I have not found relevant information on how to proceed... Can someone please give me some hints, or point me to a resource where I can learn how to do it?
Thanks for your attention and help!
How about this? I used the zip package to unzip, but possibly the base unzip might work as well.
library(zip)
dir.create(temp <- tempfile())
url <- 'https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/0TJX8Y/PZUURT'
download.file(url, paste0(temp, '/PZUURT.zip'), mode = 'wb')
unzip(paste0(temp, '/PZUURT.zip'), exdir = temp)
Note in particular I had to set the mode = 'wb' as I'm on a Windows machine.
I then saw that the unzipped archive had a _MACOSX folder and a Converted sessions folder. Assuming I don't need the MACOSX stuff, I did the following to get just the files I'm interested in:
root_folder <- paste0(temp, '/Converted sessions/')
filelist <- list.files(path = root_folder, pattern = '\\.txt$', recursive = TRUE)
filenames <- basename(filelist)
'filelist' contains the path of each text file relative to root_folder, while 'filenames' has just each file name, which I'll then break up to get the country, the number and the year:
df <- data.frame(t(sapply(strsplit(filenames, '_'),
                          function(x) c(x[1], x[2], substr(x[3], 1, 4)))))
colnames(df) <- c('Country', 'Number', 'Year')
Finally, I can read the text from each of the files and stick it into the dataframe as a new Text field:
df$Text <- sapply(paste0(root_folder, filelist), function(x) readChar(x, file.info(x)$size))
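For example, taking a hypothetical filename that follows the country_number_year pattern of this archive, the split works like this:
x <- strsplit("FRA_25_1970.txt", "_")[[1]]  # c("FRA", "25", "1970.txt")
c(x[1], x[2], substr(x[3], 1, 4))           # "FRA" "25"  "1970"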

How do I apply the same action to all Excel Files in the directory?

I need to reshape the data stored in Excel files and save it as new .csv files. I figured out what specific actions should be done, but can't understand how to use lapply.
All Excel files have the same structure. Each of the .csv files should have the name of its original file.
## the original actions, successfully performed on a single file
library(readxl)
library(reshape2)
DataSource <- read_excel("File1.xlsx", sheet = "Sheet10")
DataShaped <- melt(subset(DataSource[-(1),], select = -c(ng)),
                   id.vars = c("itemname","week"))
write.csv2(DataShaped, "C:/Users/Ol/Desktop/Meta/File1.csv")
## my attempt to apply the same actions to the rest of the files in the directory
lapply(Files, function(i) {
  write.csv2(melt(subset(read_excel(i, sheet = "Sheet10")[-(1),],
                         select = -c(ng)),
                  id.vars = c("itemname","week")))
})
R returns the result to the console but doesn't create any files. The printed output resembles the .csv structure.
Could anybody explain what I am doing wrong? I'm new to R and would be really grateful for the help.
Answer
Thanks to the prompt answer from @Parfait the code is working! So glad. Here it is:
library(readxl)
library(reshape2)
Files <- list.files(full.names = TRUE)
lapply(Files, function(i) {
  write.csv2(
    melt(subset(read_excel(i, sheet = "Decomp_Val")[-(1),],
                select = -c(ng)),
         id.vars = c("itemname","week")),
    file = paste0(sub(".xlsx", ".csv", i)))
})
It reads an Excel file in the directory, drops the first row (but keeps the headers) and the column named "ng", melts the data with the id variables "itemname" and "week", and writes the result as a .csv to the working directory under the name of the original file. And then: rinse and repeat.
Simply pass an actual file path to write.csv2. Otherwise, as noted in the docs (?write.csv), the default value for the file argument is the empty string "":
file: either a character string naming a file or a connection open for writing. "" indicates output to the console.
Below, the Excel file stem is concatenated to the specified directory path with a .csv extension:
path <- "C:/Users/Ol/Desktop/Meta/"
lapply(Files, function(i) {
  write.csv2(
    melt(subset(read_excel(i, sheet = "Sheet10")[-(1),],
                select = -c(ng)),
         id.vars = c("itemname","week")),
    file = paste0(path, sub(".xlsx", ".csv", i))
  )
})
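One caveat worth noting (a sketch, not part of the original answer): sub(".xlsx", ".csv", i) treats the dot as a regex wildcard and keeps any leading directory part of i, so with full.names = TRUE the output path can end up malformed. A more defensive construction:
path <- "C:/Users/Ol/Desktop/Meta"
out_file <- function(i) {
  # strip the directory part, swap the extension, then anchor under path
  file.path(path, paste0(tools::file_path_sans_ext(basename(i)), ".csv"))
}
out_file("./data/File1.xlsx")  # "C:/Users/Ol/Desktop/Meta/File1.csv"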

Split files present in different Folders in R

I have a folder that contains many folders, and each folder contains one csv file. I want to split each file on the basis of CN into its own folder. This is the layout of the files:
home -> folder -> f_5324 -> f_5324.csv
               -> f_5674 -> f_5674.csv
               -> f_8769 -> f_8769.csv  and so on
I want to write code that takes the first folder (f_5324), reads its csv file, splits that file and saves the pieces in that folder (f_5324); then takes the second folder (f_5674), reads its csv file, splits and saves in that folder (f_5674); and so on for all folders.
This is my code in R:
dir <- "/home/folder"
my_dirs <- list.dirs(dir, recursive = FALSE)
for (i in my_dirs) {
  a <- list.files(path = i, full.names = TRUE, recursive = TRUE)
  df <- read.csv(a)
  a0 <- df[df$CN == "cn=0", ]
  a1 <- df[df$CN == "cn=1", ]
  a3 <- df[df$CN == "cn=3", ]
  a4 <- df[df$CN == "cn=4", ]
  write.csv(a0, "cn0.csv")
  write.csv(a1, "cn1.csv")
  write.csv(a3, "cn3.csv")
  write.csv(a4, "cn4.csv")
}
I am trying hard, but it's not working properly: it splits the files but creates only one file each for cn0, cn1, cn3 and cn4, overwriting the results on every iteration. Please tell me how to pass the path of each folder so that each csv file gets its own results in its own folder.
Help will be appreciated.
Use:
dir <- "/home/folder"
my_dirs <- list.dirs(dir, recursive = FALSE)
for (i in my_dirs) {
  a <- list.files(path = i, full.names = TRUE, recursive = TRUE)
  df <- read.csv(a)
  a0 <- df[df$CN == "cn=0", ]
  a1 <- df[df$CN == "cn=1", ]
  a3 <- df[df$CN == "cn=3", ]
  a4 <- df[df$CN == "cn=4", ]
  write.csv(a0, paste(i, "cn0.csv", sep = "/"))
  write.csv(a1, paste(i, "cn1.csv", sep = "/"))
  write.csv(a3, paste(i, "cn3.csv", sep = "/"))
  write.csv(a4, paste(i, "cn4.csv", sep = "/"))
}
Explanation
In your initial implementation, write.csv(a0, "cn0.csv") writes a csv file named cn0.csv to your present working directory.
On every pass through the loop, it just overwrites that existing file again and again.
To avoid this, you need to specify the target directory for each csv write, which is done by changing the call to write.csv(a0, paste(i, "cn0.csv", sep = "/")), so each file lands in the folder it came from.
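A variant worth considering (a sketch, not from the original answer) that handles any set of CN values without hard-coding them, using split():
dir <- "/home/folder"
for (i in list.dirs(dir, recursive = FALSE)) {
  csv_file <- list.files(path = i, full.names = TRUE, recursive = TRUE)
  df <- read.csv(csv_file)
  # split() returns one data frame per distinct CN value
  pieces <- split(df, df$CN)
  for (cn in names(pieces)) {
    # "cn=0" becomes "cn0.csv", matching the original naming scheme
    out <- paste0(gsub("=", "", cn), ".csv")
    write.csv(pieces[[cn]], file.path(i, out))
  }
}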

How to download multiple gzip files?

I have to download a lot of files in the .gz format (one file is ~40 MB, 40k rows).
The files contain data from other countries as well; I would like to choose data only for France -> fr (limiting the number of columns).
I am trying to automate this process, but I have problems with the unpacking.
The data is on a webpage, and I'm interested in the data from the whole folder.
My plan is:
create a tempfile
download the gz file to the tempfile
unpack it, read it and select the rows
save as a new file and repeat for the next file
I would like to ask if this way of thinking is correct (the code below will go in a for loop):
temp <- tempfile()
temp1 <- "C:/Users/tdo/Desktop/data/test.txt"
download.file("https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz", temp)  # example
unzip(files = temp, exdir = temp1)
data <- read.table(..)
data[data$name == 'fr']
write.table(...)
This is how I created the links:
library(rvest)
library(dplyr)
dumpList <- read_html("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/")
links <- data_frame(filename = html_attr(html_nodes(dumpList, "a"), "href")) %>%
  filter(grepl(x = filename, "pageviews")) %>%  # data by project
  mutate(link = paste0("https://dumps.wikimedia.org/other/pageviews/2018/2018-04/", filename))
Why not read the gzipped files directly? I don't see the need to unpack the archives locally if all you want to do is subset/filter the data and store it as new local files.
I recommend using readr::read_table2, which can read a gzipped file directly.
Here is a minimal example:
# List of files to download:
# url is the link, target the local filename
lst.files <- list(
  list(
    url = "https://dumps.wikimedia.org/other/pageviews/2018/2018-06/pageviews-20180601-000000.gz",
    target = "pageviews-20180601-000000.gz"))
# Download gzipped files (only if the file does not exist yet)
lapply(lst.files, function(x)
  if (!file.exists(x$target)) download.file(x$url, x$target))
# Open files
library(readr)
lst <- lapply(lst.files, function(x) {
  df <- read_table2(x$target)
  # Filter/subset entries
  # Write to file with write_delim
})
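To complete the filter/write step, here is a sketch. It assumes the pageviews dump layout of four space-separated columns with no header row (project code, page title, view count, bytes transferred); adjust the column names if the actual layout differs:
library(readr)
library(dplyr)
lst <- lapply(lst.files, function(x) {
  # col_names gives the assumed column order; the dump has no header row
  df <- read_table2(x$target,
                    col_names = c("project", "page", "views", "bytes"))
  # Keep only the French-language rows
  fr <- filter(df, project == "fr")
  # Store next to the archive, e.g. pageviews-20180601-000000-fr.txt
  write_delim(fr, sub("\\.gz$", "-fr.txt", x$target))
  fr
})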
