I have been trying to import the file
reuters <- Corpus(DirSource(directory = "E:\\R Programs\\Test\\Reuteurs\\reut2-000.xml", encoding = "UTF-8"),
                  readerControl = list(reader = readReut21578XMLasPlain))
However, I get the error below:
Error in DirSource(directory = "E:\\R Programs\\Test\\Reuteurs\\reut2-000.xml", :
empty directory
I have also checked other solutions provided on Stack Overflow, but they are not working for me. Am I missing anything?
However, the code below works. Why is the DirSource method not working for me?
reuters <- Corpus(URISource("file://E:\\R Programs\\Test\\Reuteurs\\reut2-000.xml", encoding = "UTF-8"),
                  readerControl = list(reader = readReut21578XMLasPlain))
Reference link which I referred:
R: Got problems in reading text file
Using R for Text Mining Reuters-21578
R Error in trying to access local data
reut2-000.xml is probably a file, not a directory. DirSource expects a directory path; pointing it at a file is what causes the "empty directory" error.
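A minimal sketch of the fix, assuming the path from the question: point DirSource at the containing directory and use its `pattern` argument to select the XML file.

```r
library(tm)

# DirSource takes a directory, not a file; `pattern` filters which
# files in that directory are read into the corpus.
reuters <- Corpus(
  DirSource(
    directory = "E:\\R Programs\\Test\\Reuteurs",
    pattern   = "reut2-000\\.xml",
    encoding  = "UTF-8"
  ),
  readerControl = list(reader = readReut21578XMLasPlain)
)
```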
I would suggest that you use the preprocessed Reuters Corpus from R package tm.corpus.Reuters21578 (as I've already recommended here: Using R for Text Mining Reuters-21578).
install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
These are the same data as in the original Reuters XML files, but without the encoding issues, the missing XML declaration, etc.
Finally, I found a way around this error (fread() is from the data.table package):
library(data.table)
words <- Corpus(VectorSource(fread(file, encoding = "UTF-8", sep = ",", verbose = TRUE)))
Hope this helps.
I have a secure Url which provides data in Excel format. How do I read it in R studio?
Please mention the necessary package and functions. I have tried read.xls(),
read_xlsx, read.URL and some more. Nothing seems to work.
You can do it in two steps. First, download it with something like download.file(), then read it with readxl::read_excel():
download.file("https://file-examples.com/wp-content/uploads/2017/02/file_example_XLS_10.xls", destfile = "/tmp/file.xls")
readxl::read_excel("/tmp/file.xls")
library(readxl)
library(httr)
url <- 'https://......xls'
GET(url, write_disk(TF <- tempfile(fileext = ".xls")))
read_excel(TF)
Have you tried importing it as a .csv dataset into RStudio? Might be worth a try! :)
I am trying to read the microdata from the extract that I downloaded from IPUMS USA into R. It seemed simple at first, but I can't get it to work. I already downloaded the DDI and CSV, and it is not working!
Would appreciate any help on how to get this data into R.
I've tried two different ways. I learned how to do this code from this website: https://tech.popdata.org/Integrating-IPUMS-Data-with-R/ (but apparently it was wrong).
Here's my code:
cps_ddi <- read_ipums_ddi(ipums_example("wagesdata.xml"))
cps_data <- read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), verbose = FALSE)
Console returns this:
Error in ipums_example("wagesdata.xml") :
Could not find file 'wagesdata.xml' in examples. Available files are: cps_00006.csv.gz, cps_00006.dat.gz, cps_00006.xml, cps_00010.dat.gz, cps_00010.xml, cps_00015.dat.gz, cps_00015.xml, nhgis0008_csv.zip, nhgis0008_shape_small.zip
cps_data <- read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), verbose = FALSE)
Error in read_ipums_micro(cps_ddi, data_file = ipums_example("usa_00004.csv"), : object 'cps_ddi' not found
The ipums_example() function is designed to find the example data included with the R package.
However, if you're working with your own data, you don't need it.
I believe this should work:
cps_ddi <- read_ipums_ddi("wagesdata.xml")
cps_data <- read_ipums_micro(cps_ddi, data_file = "usa_00004.csv", verbose = FALSE)
If it doesn't, then most likely you haven't downloaded the data to your current working directory. You can check where your session is by running getwd(), and see which files are currently available with list.files().
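A quick way to run that check, with a hypothetical download path for illustration:

```r
getwd()        # where R is currently looking for files
list.files()   # what is actually in that directory

# If the IPUMS files live elsewhere, point R at them first, e.g.:
# setwd("C:/Users/me/Downloads/ipums_extract")  # hypothetical path
```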
I am using the ‘openxlsx’ package in R. I want to add some data to an xlsx file. I used the following code to create the workbook and add a worksheet to it.
wb <- createWorkbook()
addWorksheet(wb, "sheet 1")
writeData(wb, sheet = 1, "From", startCol = 1, startRow = 1)
writeData(wb, sheet = 1, "To", startCol = 2, startRow = 1)
writeData(wb, sheet = 1, "From", startCol = 1, startRow = 2)
saveWorkbook(wb, "file.xlsx", overwrite = TRUE)
This code worked well for a long time, but recently I am facing this error:
Error in addWorksheet(wb, "sheet 1") : First argument must be a
Workbook.
How can this error be resolved?
I had the same issue. I did the following and it fixed the problem; maybe it will solve yours:
Close R or RStudio.
Make sure that your current working directory does not contain any other file or folder; in other words, the path in which you would like to save the xlsx file should be empty before running createWorkbook(). If you have already saved any files there, copy and paste them somewhere else.
Run your code again from the beginning.
I had the same problem, and I just unloaded the package, reinstalled it, and reloaded it, and it worked (without having to close RStudio):
detach("package:openxlsx", unload=TRUE)
install.packages("openxlsx")
library(openxlsx)
I was able to fix this error by detaching the xlsx package:
detach("package:xlsx", unload = TRUE)
What works for me is unloading and reinstalling (or uninstalling) the rio and openxlsx packages:
detach("package:rio", unload=TRUE)
detach("package:openxlsx", unload=TRUE)
install.packages(c("openxlsx", "rio"))
library(openxlsx)
I encountered the same issue; I removed library(xlsx) and library(readxl) from the script, and it now runs without issue. For context, the script now uses the libraries below:
library(openxlsx)
library(rio)
library(rJava)
I experienced a similar issue when I tried to re-run code to export an XLSX file. To get my code working, I just made sure the workbook (wb) was removed from my global environment. Inserting the following line right before your code may help:
rm(list = deparse(substitute(wb)), envir = .GlobalEnv)  # at the top level, equivalent to rm(wb)
I would like to read automatically in R the file which is located at
https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017
This link generates the automatic download of a zipfile. This zipfile contains the Excel file I want to read in R.
Does any of you have any suggestions on this? Thanks.
Panagiotis' comment to use download.file() is generally good advice, but I couldn't make it work here (and would be curious to know why). Instead I used httr.
(Edit: got it, I had reversed the arguments of download.file()... Repeat after me: always use named args...)
Another problem with this data: it appears not to be a regular xls file; I couldn't open it with the otherwise excellent readxl package.
It looks like a tab-separated flat file, but I had no success with read.table() either. readr::read_delim() handled it.
library(httr)
library(readr)
r <- GET("https://clients.rte-france.com/servlets/IndispoProdServlet?annee=2017")
# Write the archive on disk
writeBin(r$content, "./data/rte_data")
rte_data <- read_delim(
  unzip("./data/rte_data", exdir = "./data/"),
  delim = "\t",
  locale = locale(encoding = "ISO-8859-1"),
  col_names = TRUE
)
There are still parsing problems, but I'm not sure they should be dealt with in this SO question.
Is it possible to parse text data from PDF files in R? There does not appear to be a relevant package for such extraction, but has anyone attempted or seen this done in R?
In Python there is PDFMiner, but I would like to keep this analysis all in R if possible.
Any suggestions?
Linux systems have pdftotext, which I have had reasonable success with. By default, it creates foo.txt from a given foo.pdf.
That said, the text mining packages may have converters. A quick rseek.org search seems to concur with your crantastic search.
This is a very old thread, but for future reference: the pdftools R package extracts text from PDFs.
A colleague turned me on to this handy open-source tool: http://tabula.nerdpower.org/. Install, upload the PDF, and select the table in the PDF that requires data-ization. Not a direct solution in R, but certainly better than manual labor.
A purely R solution could be:
library(tm)
file <- 'namefile.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
                  readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
Then you'll have the PDF's lines in a character vector.
install.packages("pdftools")
library(pdftools)
download.file("http://www.nfl.com/liveupdate/gamecenter/56901/DEN_Gamebook.pdf",
"56901.DEN.Gamebook", mode = "wb")
txt <- pdf_text("56901.DEN.Gamebook")
cat(txt[1])
The Tabula PDF table extractor app is built around tabula-extractor, a command-line Java (JAR) application.
The R tabulizer package provides an R wrapper that makes it easy to pass in the path to a PDF file and get data extracted from data tables out.
Tabula will have a good go at guessing where the tables are, but you can also tell it which part of a page to look at by specifying a target area of the page.
Data can be extracted from multiple pages, and a different area can be specified for each page, if required.
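A sketch of specifying per-page areas with tabulizer; the file name and coordinates here are hypothetical, chosen only to illustrate the call shape.

```r
library(tabulizer)

# `area` takes c(top, left, bottom, right) in PDF points, one vector
# per entry in `pages`; guess = FALSE tells Tabula to use these areas
# instead of auto-detecting table boundaries.
tables <- extract_tables(
  "report.pdf",               # hypothetical file
  pages = c(1, 2),
  area  = list(
    c(100, 50, 500, 550),     # region to scan on page 1
    c(120, 50, 700, 550)      # a different region on page 2
  ),
  guess = FALSE
)
```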
For an example use case, see: When Documents Become Databases – Tabulizer R Wrapper for Tabula PDF Table Extractor.
I used an external utility to do the conversion and called it from R. All files had a leading table with the desired information.
Set the path to pdftotext.exe and convert each PDF to text:
library(stringr)  # for str_sub()

exeFile <- "C:/Projects/xpdfbin-win-3.04/bin64/pdftotext.exe"

for (i in seq_along(pdfFracList)) {
  fileNumber <- str_sub(pdfFracList[i], start = 1, end = -5)  # strip ".pdf"
  pdfSource <- paste0(reportDir, "/", fileNumber, ".pdf")
  txtDestination <- paste0(reportDir, "/", fileNumber, ".txt")
  print(paste0("File number ", i, ", processing file ", pdfSource))
  system(paste(exeFile, "-table", pdfSource, txtDestination), wait = TRUE)
}