PDF File Import in R

I have multiple .pdf files (stored in a local folder) that contain text. I would like to import the .pdf files (i.e., the texts) into R. I applied the function 'read_dir' (R package: textreadr):
library ("textreadr")
Data <- read_dir("<MY PATH>")
The function works well. However, for several files whose names include special characters (i.e., letters such as 'ć'; e.g., 'filenameć.pdf'), the function did not work (error message: 'The following files failed to read in and were removed:' …).
What can I do?
I tried to rename the files via R, which might be a workaround, but that did not work (probably for the same reason).
I do not want to rename the files manually :)
Follow-Up (only for experts):
For several files, I got one of the following error messages (and I have no idea why):
PDF error: Mismatch between font type and embedded font file
or
PDF error: Couldn't find trailer dictionary
Any suggestions or hints how to solve this issue?

The issue likely concerns the encoding of the file names. If you absolutely want to use R to rename the files, the function you want is iconv: determine the encoding of the file names and then convert them to UTF-8.
However, a much better approach would be to rename them with bash from the command line. Can you provide a more complete set of examples?
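If renaming from within R is still preferred, a minimal sketch could look like the following. It does not try to detect the original encoding; it simply replaces every character outside printable ASCII with an underscore. The helper names to_ascii_name and rename_to_ascii are made up for illustration, and whether list.files returns the problematic names intact depends on the platform and locale, so treat this as a starting point rather than a guaranteed fix.

```r
# Hypothetical helper: build ASCII-only versions of a vector of file names.
to_ascii_name <- function(x) {
  x <- enc2utf8(x)                              # work on UTF-8 strings
  gsub("[^\\x20-\\x7E]", "_", x, perl = TRUE)   # replace non-ASCII characters
}

# Hypothetical helper: rename all PDFs in `dir` whose names are not pure ASCII.
# Note: no check for name collisions is made here.
rename_to_ascii <- function(dir) {
  old <- list.files(dir, pattern = "\\.pdf$")
  new <- to_ascii_name(old)
  changed <- old != new
  file.rename(file.path(dir, old[changed]), file.path(dir, new[changed]))
}

# Preview the new name before touching any files.
to_ascii_name("filename\u0107.pdf")
```

Calling rename_to_ascii("<MY PATH>") would then perform the renaming, after which read_dir can be run on the folder again.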

Related

How do I get R to work with spaces in a file path?

I believe a switch to OneDrive is causing some issues in various packages in R due to spaces being incorporated into the file path name. One shown below is the readxl package. Is there a way to get the package to read the spaces in the file path names? Or is it something other than the spaces that I might have overlooked?
Installation and the loading of the library work fine. However, when trying to import an excel file, it only works if I put the file in a location without spaces in the file path. I need the file to be in OneDrive so that it will be backed up.
install.packages("readxl")
library("readxl")
TRENDS_2020 <- read_excel("C:\\Users\\name03\\OneDrive - Specific Details Here (ABC)\\Backup_12_22_2020\\WQ_ALL_FINAL_WEBSITE_PIVOT_TRENDS_2020.xlsx")
I get the following error when running that:
Error in utils::unzip(zip_path, list = TRUE) :
zip file 'C:\Users\name03\OneDrive - Specific Details Here (ABC)\Backup_12_22_2020\TRENDS_2020.xlsx' cannot be opened
The following does work for the same file that I copy and pasted into my C drive:
TRENDS_2020 <- read_excel("C:\\TRENDS_2020.xlsx")
From the documentation for zip {utils} (Statistical Data Analysis, ETH Zurich):
"... treated as if passed to system, if the filepaths contain spaces they must be quoted e.g. by shQuote."
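To narrow down whether the spaces are really the culprit, a quick check along these lines may help. It is only a sketch: the directory name is invented to resemble the OneDrive path, and it uses base R functions only. R's own file functions accept paths with spaces unchanged; quoting with shQuote only matters when a path is handed to an external command.

```r
# Create a directory whose name contains spaces (hypothetical name).
dir_with_spaces <- file.path(tempdir(), "OneDrive - Example (ABC)")
dir.create(dir_with_spaces, showWarnings = FALSE)
path <- file.path(dir_with_spaces, "example.csv")
write.csv(data.frame(x = 1:3), path, row.names = FALSE)

# R's own file functions take the path as-is; no quoting is needed.
exists_ok  <- file.exists(path)
round_trip <- read.csv(path)

# Quoting is only needed when the path is passed to a shell command.
quoted <- shQuote(path, type = "sh")
```

If file.exists() returns FALSE for the real OneDrive path, the problem is more likely the path or file name itself (or a file that is not synced locally) than the spaces.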

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable(filename, stringsAsFactors = FALSE)  # filename: path to the downloaded file
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[[2]]  # double brackets return the data frame itself, not a one-element list
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
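Before choosing a reader, it can help to check what the file really is. The sketch below guesses the format from the first bytes of the file: .xlsx files are zip archives (starting with "PK"), old-style .xls files are OLE2 containers, and a disguised web page starts with markup. The function name sniff_format and the returned labels are made up for illustration.

```r
# Hypothetical helper: guess the real format of a file from its first bytes.
sniff_format <- function(path) {
  bytes   <- readBin(path, what = "raw", n = 4L)
  zip_sig <- as.raw(c(0x50, 0x4B, 0x03, 0x04))  # "PK..", used by .xlsx
  ole_sig <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0))  # OLE2 container, used by .xls
  if (identical(bytes, zip_sig)) return("xlsx (zip)")
  if (identical(bytes, ole_sig)) return("xls (OLE2)")
  first_line <- readLines(path, n = 1L, warn = FALSE, skipNul = TRUE)
  is_markup <- length(first_line) == 1L &&
    grepl("<(html|table|!doctype|\\?xml)", first_line,
          ignore.case = TRUE, useBytes = TRUE)
  if (is_markup) "HTML or XML text" else "unknown"
}

# Demo with a fake ".xls" file that is really an HTML table.
fake_xls <- tempfile(fileext = ".xls")
writeLines("<html><body><table><tr><td>1</td></tr></table></body></html>", fake_xls)
detected <- sniff_format(fake_xls)
```

If the result is "HTML or XML text", the readHTMLTable() route above is the one to try; otherwise a regular Excel reader should work.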
I seriously doubt the Excel file is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is go fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. Downloading them needs an added setting (mode="wb" below) to tell R that the file is binary and should be handled accordingly.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or use the downloader package and try something like this:
library(downloader)
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")

Error Reading Multiple Excel Sheets Using openxlsx package in R

I'm trying to load an Excel workbook with a large number of tabs into R, do some analysis, and then export the results back into Excel. I'm using the openxlsx package because of some of the features of that package that are not easily accessible using other packages (such as the ability to create "comments" in the output file, color code the tabs, and work with 64-bit R).
When I try to read in the workbooks, I sometimes get the following error message (or something similar):
Error in unzip(xlsxFile, exdir = xmlDir) :
cannot open file 'C:/Users/MENDEL~1/AppData/Local/Temp/RtmpIb3WOf/_excelXMLRead/xl/worksheets/sheet5.xml': Permission denied
This error message doesn't always show up - but sometimes it will appear and the program crashes.
Does anyone have any ideas how to fix this problem? I don't know why the program sometimes thinks it doesn't have permission to access the sheets.
Thank you in advance!
I can think of two possible scenarios for this error:
Scenario 1:
C:/Users/MENDEL~1/AppData/Local/ (This looks like you are trying to read a temporary file)
Solution:
If that is the case, try moving the file to a different location, such as the desktop, and make sure that you update your working directory accordingly.
Scenario 2:
C:/Users/MENDEL~1/AppData/Local/Temp/RtmpIb3WOf/_excelXMLRead/xl/worksheets/sheet5.xml (an .xlsx file is a zip archive of .xml files that openxlsx extracts into a temporary directory, so the failure occurs while unpacking Sheet5; there may be some issue with that sheet)
Solution:
Check if there is some issue with the format or contents of sheet5 in the file that you are trying to read.
For additional information, check the openxlsx documentation on CRAN.

Reading an Excel file into an R dataframe from a zipped folder

I have an Excel file (.xls extension) that is inside a zipped folder that I would like to read as a dataframe into R. I loaded the gdata library and set up my working directory to the folder that houses the zipped folder.
When I type in the following syntax:
data_frame1 <- read.xls( unz("./Data/Project1.zip","schools.xls"))
I get the following error messages:
Error in path.expand(xls) : invalid 'path' argument
Error in file.exists(tfn) : invalid 'file' argument
I'm guessing that I'm missing some arguments in the syntax, but I'm not entirely sure what else needs to be included.
Thanks for your help! This R newbie really appreciates it!
Unfortunately, after a quick survey of all the xls functions I know, there is no xls reading function that can recognize the unz output (I would love to be proven wrong here). If it were a 'csv' it would work fine. As it stands, until such a function is written, you must do the loading in two steps: extraction and then loading.
To give you a little more control, you can specify which file to unzip as well as the directory to place the files with unzip.
library(gdata)
# default exdir is current directory
unzip(zipfile="./Data/Project1.zip", files = "schools.xls", exdir=".")
dataframe_1 <- read.xls("schools.xls")
Sadly, this also means that you must do cleanup afterwards if you don't want the 'xls' file hanging around.
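One way to avoid the manual cleanup is to extract into a temporary directory and delete it automatically once the file has been read. This is only a sketch: read_from_zip is a made-up helper name, and the reader function is passed in as an argument so that any xls reader (for example gdata::read.xls, assuming gdata is installed) can be used.

```r
# Hypothetical helper: extract one member of a zip archive to a temporary
# directory, read it with `reader`, and delete the extracted copy afterwards.
read_from_zip <- function(zipfile, member, reader, ...) {
  stopifnot(file.exists(zipfile))
  tmp <- tempfile("unzipped_")
  dir.create(tmp)
  on.exit(unlink(tmp, recursive = TRUE), add = TRUE)  # cleanup, even on error
  extracted <- unzip(zipfile, files = member, exdir = tmp)
  if (length(extracted) == 0L) {
    stop("'", member, "' not found in ", zipfile)
  }
  reader(extracted[1], ...)
}

# Usage, assuming gdata is installed:
# schools <- read_from_zip("./Data/Project1.zip", "schools.xls", gdata::read.xls)
```

The on.exit() call removes the temporary directory when the function returns, so no stray 'xls' file is left behind.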

How to read excel file in Chinese character [R]?

I always convert Excel files to CSV to import them into R, as follows.
myDataFrame <- read.csv("mydatafile.csv", stringsAsFactors=F)
But I ran into a serious problem when converting an xlsx file that is written in Chinese. Most of the characters (not all of them) show up as '??' because of the encoding.
So I decided to use the xlsx package to import the file directly. The problem is that the size of the Excel file exceeds 10 MB.
It gave me an error message because of the JVM's memory limit. (I assume that xlsx uses Java internally.)
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: GC overhead limit exceeded
How can I import a Chinese Excel file into R? I tried 'Save as..' CSV, opened the file in Notepad, and saved it with the 'UTF-8' option, but the result was the same (it shows '??').
FYI, I can see the full Chinese characters in the original Excel file.
Your question is a mixed one. Let's assume that you have converted the xlsx file into csv. If you haven't, please refer to other threads like this one. I think this step is best carried out in some external tool rather than in R.
Now that we have a csv, two problems remain: size and encoding. For encoding, as you mentioned in the comment, you can tell R the file's encoding when reading; for read.csv this is the fileEncoding= option (the encoding= option only marks the strings that were read and does not convert them). For Chinese files coming out of Excel, the encoding is most probably "GB18030". If you cannot decide, the open file dialog of LibreOffice Calc may give you some clue.
If the file is large, you may first convert the encoding using the Linux command iconv, and then process it further in R.
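The conversion step can also be done inside R with the iconv() function instead of the command-line tool. The sketch below builds a tiny two-line CSV in GB18030 as a stand-in for the real file, converts it to UTF-8, and reads the result. It assumes that the platform's iconv supports "GB18030", which is common but not guaranteed; the file names and column names are invented.

```r
# Stand-in for the real file: a small CSV whose text is encoded in GB18030.
gb_file <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("name,value\n"),
           as.raw(c(0xD6, 0xD0, 0xCE, 0xC4)),  # two Chinese characters in GB18030
           charToRaw(",1\n")),
         gb_file)

# Read the lines as raw bytes and convert them to UTF-8.
raw_lines  <- readLines(gb_file, warn = FALSE)
utf8_lines <- iconv(raw_lines, from = "GB18030", to = "UTF-8")

# Write the converted text unchanged and read it back as UTF-8.
utf8_file <- tempfile(fileext = ".csv")
writeLines(utf8_lines, utf8_file, useBytes = TRUE)
converted <- read.csv(utf8_file, encoding = "UTF-8", stringsAsFactors = FALSE)
```

If iconv() returns NA for some lines, the guessed source encoding is probably wrong for those lines.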
Now for the size part. A 50 MB or even 500 MB csv can easily be handled by read.csv, although not necessarily quickly, provided that you have enough memory. If the file is larger than 1 GB, there are two options:
1) Use the sqldf package, which reads the csv into a temporary database, and then into a data.frame.
2) Process the csv line by line. First use file() to create a connection, then use readLines() to process it line by line. Finally, manually combine the results into a data.frame or other appropriate structure.
The first one is simpler; the second one can handle really large files.
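The second option could be sketched roughly as follows, reading a fixed number of lines at a time instead of one by one. The function name read_csv_in_chunks and the chunk size are made up, and the demo file is generated on the spot. For a really large file, each chunk would be filtered or aggregated before being kept, rather than stored whole as done here.

```r
# Hypothetical helper: read a CSV in chunks of `chunk_size` lines and
# row-bind the parsed pieces.
read_csv_in_chunks <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)
  header <- readLines(con, n = 1L)
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    # Parse this chunk; filtering or aggregation would go here.
    pieces[[length(pieces) + 1L]] <-
      read.csv(text = c(header, lines), stringsAsFactors = FALSE)
  }
  do.call(rbind, pieces)
}

# Demo: 25 rows read in chunks of 10 lines.
demo_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, score = (1:25) * 2), demo_file, row.names = FALSE)
result <- read_csv_in_chunks(demo_file, chunk_size = 10L)
```

Because the connection stays open between calls to readLines(), each call continues where the previous one stopped.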
Hope it helps.
