R issues reading Excel files - r

I am working on a project where I have multiple Excel files, and each file has multiple worksheets. I have to get the data from one of the worksheets, let's say sheet = 6, and then store all these data in a new .xls or .csv file.
I am facing an issue while trying to read the data from the files and store it in a list. I am getting the following error:
Error: `path` does not exist: ‘BillingReport___Gurgaon-Apr-2019.xlsx’
I am trying the map_dfr function to get the data.
library(purrr)
library(readxl)
library(dplyr)
library(rio)
library(XLConnect)
library(tidyverse)
setwd("F:/Capstone/Billing Reports final/")
#Set path of Billing source folder
billingptah <- "F:/Capstone/Billing Reports final/"
#Set path of destination folder
csvexportpath <- "F:/Capstone/Billing_data/billing_data.csv"
#get the names of the files to be loaded
files_to_load <- list.files(path = billingptah)
files_to_load
#Load all the data into one file
billing_data <- map_dfr(files_to_load, function(x)
  map_dfr(excel_sheets(x), function(y)
    read_excel(path = x, sheet = 6, col_types = "text") %>%
      mutate(sheet = 6)) %>%
    mutate(filename = x))
The following is the error message:
Error: `path` does not exist:
‘BillingReport___Gurgaon-Apr-2019.xlsx’

It is all about the difference between relative and absolute paths. You're telling R to load a file located in your current working directory named ‘BillingReport___Gurgaon-Apr-2019.xlsx’. You need to prepend the directory path to each file name. Try this after building files_to_load:
files_to_load <- paste0(billingptah, files_to_load)
It tells R to access the files named in files_to_load, located in the billingptah directory.
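Alternatively, list.files() can return path-qualified names directly via full.names = TRUE, which avoids the manual paste0 step. A minimal sketch, using a temporary directory in place of the billing folder from the question:

```r
# stand-in for the billing folder from the question
billingptah <- file.path(tempdir(), "billing")
dir.create(billingptah, showWarnings = FALSE)
file.create(file.path(billingptah, "BillingReport___Gurgaon-Apr-2019.xlsx"))

# full.names = TRUE prefixes every file name with its directory,
# so read_excel() will find the files regardless of the working directory
files_to_load <- list.files(path = billingptah, pattern = "\\.xlsx$",
                            full.names = TRUE)
print(files_to_load)
```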
Edit
Let me point you to some useful links:
https://www.reed.edu/data-at-reed/resources/R/reading_and_writing.html
And for best practices: https://stat.ethz.ch/R-manual/R-devel/library/base/html/file.path.html

Related

Unzip a zipped file that is saved using name and date of download like this (Query Transaction History_20221126.zip)

I have a data-related problem.
I want to unzip a file in my Downloads folder.
This zip file is for daily use, meaning it is something I will keep downloading on a daily basis, and the zip file name usually contains the date and time of download.
This is how the zip file name usually looks:
Query Transaction History_20221125035935_42217.zip
and the Excel file inside the archive comes like this:
Query Transaction History_20221126035617_01.xls
If you look closely, the names of both the zip file and the xls file are a combination of a name (Query Transaction History_) and the date and time of download (20221126035617_01).
So I was able to come up with the script below.
library(plyr)
my_dir<-"C:/Users/Guest 1/Downloads"
zip_files <- list.files(path = "C:/Users/Guest 1/Downloads",
                        pattern = "Query Transaction History_20221126.*zip",
                        full.names = TRUE)
ldply(.data = zip_files, .fun = unzip, exdir = my_dir)
It works fine and extracts the files to the Downloads folder. But this is something I will keep doing on a daily basis, and the date keeps changing while the name stays constant, so I tried the code below.
library(glue)
sea <- format(Sys.Date(), "%Y%m%d") # format the date to match the date attached to the zip file name
zip_files <- list.files(path = "C:/Users/Guest 1/Downloads",
                        pattern = glue("Query Transaction History_{sea}.*zip", full.names = TRUE)) # use glue to splice the date variable into the pattern
It works fine
Now, to unzip, I use the apply function below:
ldply(.data = zip_files, .fun = unzip, exdir = my_dir) # using plyr to unzip
I get the error below
Warning message:
In FUN(X[[i]], ...) : error 1 in extracting from zip file
Thanks
With the zip files residing outside your getwd(), you'll need to specify the full file path, like so:
ldply(.data = zip_files,
      .fun = function(file_name) {
        unzip(file.path("C:/Users/Guest 1/Downloads", file_name),
              exdir = my_dir)
      })
Note: unless you need {plyr} anyway, you could just use base R sapply or Map instead of ldply.
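For illustration, a base-R version of the same call might look like the sketch below (it assumes the zip_files and my_dir objects from the question, so it is not run against real archives here):

```r
# unzip each archive without plyr; invisible() hides the returned file lists
invisible(sapply(zip_files, function(file_name) {
  unzip(file.path("C:/Users/Guest 1/Downloads", file_name),
        exdir = my_dir)
}))
```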

Parsing issue, unexpected character when loading a folder

I am using this answer to load in a folder of Excel Files:
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
So far, so good.
# Load them in
#----------------------------#
# Method 1:
invisible(sapply(files, source, local=TRUE))
#-- OR --#
# Method 2:
sapply(files, function(f) eval(parse(text=f)))
But the source function (Method 1) gives me the error:
Error in source("C:/Users/Username/filename.xlsx") :
C:/Users/filename :1:3: unexpected input
1: PK
^
For Method 2 I get the error:
Error in parse(text = f) : <text>:1:3: unexpected '/'
1: C:/
^
EDIT: I tried circumventing the issue by setting the working directory to the directory of the folder, but that did not help.
Any ideas why this happens?
EDIT 2: It works when doing the following:
How can I read multiple (excel) files into R?
setwd("...")
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
Just to provide a proper answer outside of the comment section...
If your goal is to read many Excel files, you shouldn't use source.
source is dedicated to running external R code.
If you need to read many Excel files, you can use the following code together with one of these libraries: readxl, openxlsx, or tidyxl (with unpivotr).
filelist <- dir(folder, recursive = TRUE, full.names = TRUE,
                pattern = "\\.xlsx$|\\.xls$", ignore.case = TRUE)
l_df <- lapply(filelist, readxl::read_excel)
Note that we are using dir to list the full paths (full.names = TRUE) of all the files that end with .xlsx or .xls (pattern = "\\.xlsx$|\\.xls$", with the dots escaped so they match literally), including .XLSX and .XLS (ignore.case = TRUE), in the folder folder and all its subfolders (recursive = TRUE).
readxl is integrated with the tidyverse. It is pretty easy to use and most likely what you're looking for.
Personally, I advise using openxlsx if you need to write (rather than read) customized Excel files with many specific features.
tidyxl is the best package I've seen for reading Excel files, but it may be rather complicated to use. However, it is very careful about type preservation.
With the support of unpivotr, it allows you to handle complicated Excel structures,
for example, when you find multiple header rows and multiple left index columns.
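Once l_df is built, the per-file data frames can be stacked into one, assuming the files share the same columns. A minimal sketch, with stand-in data frames in place of the files read by read_excel():

```r
library(dplyr)

# stand-ins for two Excel files already read into data frames
l_df <- list("a.xlsx" = data.frame(x = 1:2),
             "b.xlsx" = data.frame(x = 3:4))

# bind_rows() stacks the data frames; .id records the list names
# (here the file names) in a new column
all_data <- bind_rows(l_df, .id = "source_file")
print(all_data)
```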

Use of wildcards with readtext()

A basic question. I have a bunch of transcripts (.docx files) I want to read into a corpus. I use readtext() to read single files with no problem.
dat <- readtext("~/ownCloud/NLP/interview_1.docx")
As soon as I put "*.docx" in my readtext statement, it throws an error.
dat <- readtext("~/ownCloud/NLP/*.docx")
Error: '/var/folders/bl/61g7ngh55vs79cfhfhnstd4c0000gn/T//RtmpWD6KSx/readtext-aa71916b691c0cf3cabc73a2e04a45f7/word/document.xml' does not exist.
In addition: Warning message:
In utils::unzip(file, exdir = path) : error 1 in extracting from zip file
Why the reference to a zip file? I have only .docx files in the directory.
I was able to reproduce the same problem. The issue was that there were some hidden/temporary .docx files in that folder; if you delete them and then try the code, it works.
To see the hidden files, go to the folder you are reading the docx files from and, depending on your OS, choose a way to show them. On my Mac I used
CMD + SHIFT + .
Once you delete them, try the code again and it should work:
library(readtext)
dat <- readtext("~/ownCloud/NLP/*.docx")
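If you'd rather not delete anything by hand, you can filter the lock files out before reading. Word's temporary lock files have names starting with ~$, so a simple filter on the file list removes them. A sketch in a throwaway directory (the file names are made up to mirror the question):

```r
# demo folder with one real document and one Word lock file
nlp_dir <- file.path(tempdir(), "NLP")
dir.create(nlp_dir, showWarnings = FALSE)
file.create(file.path(nlp_dir, "interview_1.docx"))
file.create(file.path(nlp_dir, "~$interview_1.docx"))

docx_files <- list.files(nlp_dir, pattern = "\\.docx$", full.names = TRUE)
# drop Word's temporary lock files, whose names begin with "~$"
docx_files <- docx_files[!startsWith(basename(docx_files), "~$")]
print(basename(docx_files))
```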

How can I read all hdf files in a folder using R?

I have thousands of hdf files in a folder. Is there a way to create a loop to read all of the hdf files in that folder and write some specific data to another file?
I read the first file in the folder using the code below:
mydata <- h5read("/path to file/name of the file.he5", "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily")
But I have 1686 more files in the folder, and it is not possible to read them one by one. I think I need to write a for loop to read all the files in the folder.
I found some code that lists the txt files in a folder and then reads all the files:
nm <- list.files(path="path/to/file")
do.call(rbind, lapply(nm, function(x) read.table(file=x)[, 2]))
I tried to change the code as seen below:
nm <- list.files(path="path/to/file")
do.call(rbind, lapply(nm, function(x) h5read(file=x)[, 2]))
But the error message says:
Error in h5checktypeOrOpenLoc(file, readonly = TRUE, native = native) :
Error in h5checktypeOrOpenLoc(). Cannot open file. File 'D:\path to file\name of the file.he5' does not exist.
What should I do in that situation?
If you are not bound to a specific technology, you may want to take a look at HDFql. Using HDFql in R, your issue can be solved as follows (for the sake of this example, assume that (1) dataset /HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily exists in all the HDF5 files stored in the directory, (2) it has one dimension (size 1024), and (3) is of data type integer):
# load HDFql R wrapper (make sure it can be found by the R interpreter)
source("HDFql.R")

# create variable "values" and initialize it
values <- array(dim = c(1024))
for (x in 1:1024) {
  values[x] <- as.integer(0)
}

# show (i.e. get) files stored in directory "/path/to/hdf5/files" and populate HDFql default cursor with them
hdfql_execute("SHOW FILE /path/to/hdf5/files")

# iterate HDFql default cursor
while (hdfql_cursor_next() == HDFQL_SUCCESS) {
  file_name <- hdfql_cursor_get_char()

  # select (i.e. read) data from dataset "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily" and populate variable "values" with it
  hdfql_execute(paste("SELECT FROM", file_name,
                      "\"/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily\"",
                      "INTO MEMORY", hdfql_variable_transient_register(values)))

  # display values stored in variable "values"
  for (x in 1:1024) {
    print(values[x])
  }
}
Additional examples on how to read datasets using HDFql can be found in the quick start guide and reference manual.
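If you'd rather stay with rhdf5, the lapply approach from the question also works once two things are fixed: list.files() must return full paths (full.names = TRUE), and h5read() needs the dataset name as well as the file. A sketch, assuming every file contains the same dataset:

```r
library(rhdf5)

dataset <- "/HDFEOS/GRIDS/Northern Hemisphere/Data Fields/SWE_NorthernDaily"

# full.names = TRUE avoids the "File ... does not exist" error
nm <- list.files(path = "path/to/file", pattern = "\\.he5$", full.names = TRUE)

# read the dataset from every file and bind the results row-wise
do.call(rbind, lapply(nm, function(x) h5read(file = x, name = dataset)))
```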

Can't open .biom file for Phyloseq tree plotting

After trying to read a biom file:
rich_dense_biom <- system.file("extdata", "D:\sample_otutable.biom", package = "phyloseq")
myData <- import_biom(rich_dense_biom, treefilename, refseqfilename,
                      parseFunction = parse_taxonomy_greengenes)
the following errors show up:
Error in read_biom(biom_file = BIOMfilename) :
Both attempts to read input file:
either as JSON (BIOM-v1) or HDF5 (BIOM-v2).
Check file path, file name, file itself, then try again.
Are you sure D:\sample_otutable.biom really exists, and is a system file?
In R for Windows, it is at least safer (if not required) to separate file path components with \\.
This works for me
library("devtools")
install_github("biom", "joey711")
library(biom)
biom.file <- "C:\\Users\\Mark Miller\\Documents\\R\\win-library\\3.3\\biom\\extdata\\min_dense_otu_table.biom"
my.data <- import_biom(BIOMfilename = biom.file)
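As a side note, R on Windows also accepts forward slashes in paths, so building the path with file.path() sidesteps the backslash-escaping issue entirely (a sketch using the drive and file name from the question):

```r
# file.path() joins components with "/", which R on Windows accepts
biom.file <- file.path("D:", "sample_otutable.biom")
print(biom.file)  # "D:/sample_otutable.biom"
```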
