Error in reading files in R

I'm a newcomer to the R community. While coding my first programs I've run into a silly problem! When trying to read an RDS file with the following code:
tweets <- readRDS("RDataMining-Tweets-20160212.rds")
the following error arises:
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'RDataMining-Tweets-20160212.rds', probable reason 'No such file or directory'
What's the problem here?

Since we don't have access to your file, it'll be difficult to know for sure, so let me give you some examples of what other types of files might give you.
First, some files:
ctypes <- list(FALSE, 'gzip', 'bzip2', 'xz')
saverds_names <- sprintf('saveRDS_%s.rds', ctypes)
save_names <- sprintf('save_%s.rda', ctypes)
ign <- mapply(function(fn,ct) saveRDS(mtcars, file=fn, compress=ct),
              saverds_names, ctypes)
ign <- mapply(function(fn,ct) save(mtcars, file=fn, compress=ct),
              save_names, ctypes)
A common (unix-y) utility is file, which uses file signatures to determine probable file type. (If you are on windows, it is usually installed with Rtools, so look for it there. If Sys.which("file") is empty, then look around for where you have Rtools installed, for something like c:\Rtools\bin\file.exe.)
Sys.which('file')
# file
# "c:\\Rtools\\bin\\file.exe"
With this, let's see what file thinks these files are likely to be:
str(lapply(saverds_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
# $ : chr "saveRDS_FALSE.rds: data"
# $ : chr "saveRDS_gzip.rds: gzip compressed data, from HPFS filesystem (OS/2, NT)"
# $ : chr "saveRDS_bzip2.rds: bzip2 compressed data, block size = 900k"
# $ : chr "saveRDS_xz.rds: XZ compressed data"
str(lapply(save_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
# $ : chr "save_FALSE.rda: data"
# $ : chr "save_gzip.rda: gzip compressed data, from HPFS filesystem (OS/2, NT)"
# $ : chr "save_bzip2.rda: bzip2 compressed data, block size = 900k"
# $ : chr "save_xz.rda: XZ compressed data"
Helps a little. If yours does not return one of these four strings, then you are likely looking at a corrupted file (or a mis-named one, i.e., not really the .rds format we are expecting).
If it does return one of them, though, know that readRDS (the first four) and load (the last four) will automatically determine the compress= argument to use, so if you still get an error the file is most likely corrupt (or some other form of compressed data; again, likely mis-named).
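As a quick sanity check (reusing the files created above), readRDS will read each of them without being told which compression was used; a minimal sketch:
# readRDS() detects the compression for each of the files created above
sapply(saverds_names, function(fn) identical(readRDS(fn), mtcars))
# all four elements should be TRUE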
In contrast to those four signatures, some other file types return these:
system2("file", "blank.accdb")
# blank.accdb: raw G3 data, byte-padded
system2("file", "Book1.xlsx")
# Book1.xlsx: Zip archive data, at least v2.0 to extract
system2("file", "Book1.xls")
# Book1.xls: OLE 2 Compound Document
system2("file", "j.fra.R") # code for this answer
# j.fra.R: ASCII text, with CRLF line terminators
(The with CRLF part is a windows-y thing. *sigh*) The last one will also be the case for CSV and similar text-based tabular files, etc.
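As a practical follow-up: if file reports something like "ASCII text" for your .rds, it is probably a plain-text file (CSV or similar) that was mis-named, and a text reader is the more promising first try. A minimal sketch, with a hypothetical file name:
sig <- system2("file", "mystery.rds", stdout = TRUE)   # hypothetical mis-named file
if (any(grepl("ASCII text", sig))) {
  head(read.csv("mystery.rds"))   # try a text reader instead of readRDS()
}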
@divibisan's suggestion that the file could be corrupt is the most likely culprit, in my mind, but it might give different output:
file.size(saverds_names[1])
# [1] 3798
head(readRDS(rawConnection(readBin(saverds_names[1], what=raw(1), n=file.size(saverds_names[1])))), n=2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
but incomplete data from truncated files looks different: I truncated the files (externally with dd), and the error I received was "Error in readRDS: error reading from connection\n".
Looking at the source for R, that error string is only present in R_gzread, suggesting that R thinks the file is compressed with "gzip" (which is the default, perhaps because it could not positively identify any other obvious compression method).
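To illustrate (reusing one of the files created above; only a rough sketch, and the exact wording of the error may differ by platform), truncating a compressed .rds and reading it back reproduces that kind of failure:
# deliberately truncate a copy of the gzip-compressed file from above
full <- readBin(saverds_names[2], what = raw(), n = file.size(saverds_names[2]))
writeBin(full[1:100], "truncated.rds")   # keep only the first 100 bytes
try(readRDS("truncated.rds"))            # expect "error reading from connection"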
This isn't much of an answer, but it might give you some appreciation for what could be wrong. The bottom line, unfortunately, is that it is highly unlikely that you will be able to recover any data from a corrupted file.

Try to open the file in this way, avoiding any kind of possible problem related to the file path:
tweets <- readRDS(file.choose())
An interactive window will open and you will be able to select your file.
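If you want to see (and keep) the path that gets picked, capture it first; a small sketch:
path <- file.choose()   # pick the .rds file interactively
print(path)             # the full path, handy for a later setwd()
tweets <- readRDS(path)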

There are two rds files at the website that supports the book you are following:
http://www.rdatamining.com/data/
One of them is named RDataMining-Tweets-20160203.rds and the other is named RDataMining-Tweets-20160212.rds. I suspect you put them in a "downloads" folder using your browser, and that happens to be a different folder than the one you will see if you execute:
getwd()
You should try re-downloading from that website and checking the file locations.
You can get a listing of files in your current working directory with:
list.files()
Make sure that the file name appears in that output.
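You can also test for the file directly before calling readRDS; a minimal sketch:
fn <- "RDataMining-Tweets-20160212.rds"
file.exists(fn)                       # FALSE means it is not in getwd()
normalizePath(fn, mustWork = FALSE)   # shows exactly where R is looking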
If that proves difficult, then this also should succeed:
tweets2 <- readRDS( url("http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds?attredirects=0&d=1") )
str(tweets2) # should be a complex and long output
The url function creates a "connection" that can be used to download RDS files.

First check your working directory with getwd(). Check whether the file you want to read is in this directory or not. If not, use setwd(dir = "<path where the file is>").
Also make sure you pass the file name (with its .rds extension) through the file argument, i.e. readRDS(file = "RDataMining-Tweets-20160212.rds").
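Put together, that workflow looks roughly like this (the path below is a placeholder for wherever you saved the file):
getwd()                               # where R is currently looking
setwd("C:/Users/yourname/Downloads")  # placeholder path
tweets <- readRDS(file = "RDataMining-Tweets-20160212.rds")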

Chuckle - I recognise that data file name...
You appear to be using the reference data from Yanchang Zhao's presentation on "Text Mining with R", the data set can be downloaded via the author's website [http://www.rdatamining.com/data] using this link - [RDataMining-Tweets-20160203.rds]
Looking at the error, as Devyani Balyan states, the file you are trying to load is likely not in your working directory. The code examples generally put the data file in a <project>\data structure. Yanchang follows this heuristic in his examples.
My recommendation is you get a copy of a good R book to help get you started. My personal recommendation would be "The R Book" by Michael J. Crawley.
Here is a working code example, I hope it helps you progress:
Code Example
# used to load multiple R packages at once
# and/or install them.
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
# Load twitterR Package
packages <- c("twitteR")
ipak(packages)
# Load Example dataset from the website
url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
destfile = "RDataMining-Tweets-20160212.rds"
# Download into the current directory for simplicity
# Normally I would recommend saving to <project>/data
# Check if file exists if not download it...
if (!file.exists(destfile)) {
download.file(url, destfile)
}
## load tweets into R
tweets <- readRDS(destfile)
head(tweets, n = 3)
Console output
> ipak <- function(pkg){
+ new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
+ if (length(new.pkg))
+ install.packages(new.pkg, dependencies = TRUE)
+ sapply(pkg, require, character.only = TRUE)
+ }
> # Load twitterR Package
> packages <- c("twitteR")
> ipak(packages)
twitteR
TRUE
> url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
> destfile = "RDataMining-Tweets-20160212.rds"
> # Check if file exists if not download it...
> if (!file.exists(destfile)) {
+ download.file(url, destfile)
+ }
> ## load tweets into R
> tweets <- readRDS(destfile)
> head(tweets, n = 3)
[[1]]
[1] "RDataMining: A Twitter dataset for text mining: #RDataMining Tweets extracted on 3 February 2016. Download it at **********"
[[2]]
[1] "RDataMining: Vacancy of Data Scientist – Healthcare Analytics, Adelaide, Australia\n***********"
[[3]]
[1] "RDataMining: Thanks to all for your ongoing support to ******** 3. Merry Christmas and Happy New Year!"

It seems your working directory is not properly set.
See which directory you downloaded the file in and copy the path.
Use the command setwd('Your/path/here') and then it should work. Just remember to change the orientation of your slashes (from '\' to '/').
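For example (placeholder path), a path copied from Windows Explorer needs its backslashes flipped or doubled before it will work:
# either forward slashes ...
setwd("C:/Users/yourname/Downloads")
# ... or escaped backslashes
setwd("C:\\Users\\yourname\\Downloads")
tweets <- readRDS("RDataMining-Tweets-20160212.rds")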

Related

sparklyr :: Error reading parquet file using sparklyr library in R

I am trying to read a parquet file from the Databricks FileStore
library(sparklyr)
parquet_dir has been pre-defined
parquet_dir = '/dbfs/FileStore/test/flc_next.parquet'
List the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)
"/dbfs/FileStore/test/flc_next.parquet/_committed_6244562942368589642"
[2] "/dbfs/FileStore/test/flc_next.parquet/_started_6244562942368589642"
[3] "/dbfs/FileStore/test/flc_next.parquet/_SUCCESS"
[4] "/dbfs/FileStore/test/flc_next.parquet/part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6769e6-925-1-c000.snappy.parquet"
Show the filenames and their sizes
data_frame(
filename = basename(filenames),
size_bytes = file.size(filenames)
)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 4 × 2
filename size_bytes
<chr> <dbl>
1 _committed_6244562942368589642 124
2 _started_6244562942368589642 0
3 _SUCCESS 0
4 part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6… 248643
Import the data into Spark
timbre_tbl <- spark_read_parquet("flc_next.parquet", parquet_dir)
Error : $ operator is invalid for atomic vectors
I would appreciate any help/suggestion
Thanks in advance
The first argument of spark_read_parquet expects a Spark connection; check sparklyr::spark_connect. If you are running the code in Databricks, then this should work:
sc <- spark_connect(method = "databricks")
timbre_tbl <- spark_read_parquet(sc, "flc_next.parquet", parquet_dir)
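For reference, outside Databricks the connection is created differently; a sketch assuming a local Spark installation:
library(sparklyr)
sc <- spark_connect(master = "local")
timbre_tbl <- spark_read_parquet(sc, name = "flc_next", path = parquet_dir)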

Download zip file to R when download link ends in '/download'

My issue is similar to this post, but the suggested solution does not appear applicable.
I have a lot of zipped data stored on an online server (B2Drop), which provides a download link with the extension "/download" instead of ".zip". I have been unable to get the method described here to work.
I have created a test download page https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq, where the download link https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download can be obtained by right clicking the download button. Here is my script:
temp <- tempfile()
download.file("https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download",temp, mode="wb")
data <- read.table(unz(temp, "Test_file1.csv"))
unlink(temp)
When I run it, I get the error:
download.file("https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download",temp, mode="wb")
trying URL 'https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download'
Content type 'application/zip' length 558 bytes
downloaded 558 bytes
data <- read.table(unz(temp, "Test_file1.csv"))
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file 'Test_file1.csv' in zip file 'C:\Users\User_name\AppData\Local\Temp\RtmpMZ6gXi\file3e881b1f230e'
which typically indicates a problem with the working directory where R is looking for the file. In this case that should be the temp wd.
Your internal path is wrong. You can use list=TRUE to list the files in the archive, analogous to the command-line unzip utility's -l argument.
unzip(temp, list=TRUE)
# Name Length Date
# 1 Test/Test_file1.csv 256 2021-09-27 10:13:00
# 2 Test/Test_file2.csv 286 2021-09-27 10:14:00
Better than read.table, though, use read.csv since it's comma-delimited.
data <- read.csv(unz(temp, "Test/Test_file1.csv"))
head(data, 3)
# ID Variable1 Variable2 Variable Variable3
# 1 1 f 54654 25 t1
# 2 2 t 421 64 t2
# 3 3 x 4521 85 t3
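If you need both files from the archive, the same pattern applies to each internal path; a short sketch:
inner_paths <- unzip(temp, list = TRUE)$Name
all_data <- lapply(inner_paths, function(f) read.csv(unz(temp, f)))
names(all_data) <- basename(inner_paths)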

Issue running purrr::possibly for error recovery inside ~ tibble

I was working through an example from Hadley Wickham's "The Joy of Functional Programming (for Data Science)" lecture on purrr (available on YouTube) and wanted to add in error recovery using possibly() to handle any zip files that fail to be read.
A bit of his code:
paths <- dir_ls("NSFopen_Alldata/", glob = "*.zip")
files <- map_dfr(paths, ~ tibble(path = .x, files = unzip(.x, list = TRUE)$Name))
Adding error recovery with possibly();
unzip_safe <- possibly(.f= unzip, otherwise = NA)
files<- map_dfr(paths, ~ tibble(path=.x, files= unzip_safe(.x, list = TRUE)$Name))
I get the following error: $ operator is invalid for atomic vectors.
Is this because possibly is in a tibble?
Files that fail return NA, and you are trying to extract $Name from it, which throws an error. See:
NA$Name
Error in NA$Name : $ operator is invalid for atomic vectors
Extract $Name from the successful files in possibly itself. Try :
library(purrr)
unzip_safe <- possibly(.f= ~unzip(., list = TRUE)$Name, otherwise = NA)
files <- map_dfr(paths, ~ tibble(path = .x, files = unzip_safe(.x)))
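Put together with the packages it needs (fs for dir_ls, plus purrr and tibble), the full snippet looks roughly like this; paths that fail to unzip simply contribute an NA instead of stopping map_dfr():
library(fs)
library(purrr)
library(tibble)
paths <- dir_ls("NSFopen_Alldata/", glob = "*.zip")
unzip_safe <- possibly(.f = ~ unzip(., list = TRUE)$Name, otherwise = NA_character_)
files <- map_dfr(paths, ~ tibble(path = .x, files = unzip_safe(.x)))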

how to categorize files according to user's answer

I'll try to make it simple. In my working directory "Laurent/R" I've got csv files (never more than 5) with names that change from one experiment to the other.
Is it possible to use for and if loops to display each file one after the other and ask for each of them something like "Is file.name[i] a control file?", and record the answer for the next steps?
Thanks
I think you're looking for something like this:
label_controls <- function(my_dir)
{
  filenames <- list.files(my_dir)
  is_control <- logical(length(filenames))
  for(i in seq_along(filenames))
  {
    cat(filenames[i], "\n")
    answer <- readline("Is this a control file (Y/N)? : ")
    is_control[i] <- grepl("Y|y", answer)
    cat("\n")
  }
  data.frame(filenames, is_control)
}
If you run this function with a particular directory, it will prompt you for each file whether it is a control file or not, to which you answer Y or N. It will return a data frame of all the files in the directory in one column, with a second column indicating whether that file is a control file or not:
df <- label_controls("Me/Subdir/files")
my_csv1.csv
Is this a control file (Y/N)? : N
my_csv2.csv
Is this a control file (Y/N)? : Y
my_csv3.csv
Is this a control file (Y/N)? : N
And you can review the results:
df
#> filenames is_control
#> 1 my_csv1.csv FALSE
#> 2 my_csv2.csv TRUE
#> 3 my_csv3.csv FALSE

Using R to access FTP Server and Download Files Results in Status "530 Not logged in"

What I'm Attempting to Do
I'm attempting to download several weather data files from the US National Climatic Data Centre's FTP server but am running into problems with an error message after successfully completing several file downloads.
After successfully downloading two station/year combinations I start getting an error "530 Not logged in" message. I've tried starting at the offending year and running from there and get roughly the same results. It downloads a year or two of data and then stops with the same error message about not being logged in.
Working Example
Following is a working example (or not) with the output truncated and pasted below.
options(timeout = 300)
ftp <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/"
td <- tempdir()
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999", "984260-41231", "984290-99999", "984300-99999", "984320-99999", "984330-99999")
years <- 1960:2016
for (i in years) {
  remote_file_list <- RCurl::getURL(
    paste0(ftp, "/", i, "/"), ftp.use.epsv = FALSE, ftplistonly = TRUE,
    crlf = TRUE, ssl.verifypeer = FALSE)
  remote_file_list <- strsplit(remote_file_list, "\r*\n")[[1]]
  file_list <- paste0(station, "-", i, ".op.gz")
  file_list <- file_list[file_list %in% remote_file_list]
  file_list <- paste0(ftp, i, "/", file_list)
  Map(function(ftp, dest) utils::download.file(url = ftp,
                                               destfile = dest, mode = "wb"),
      file_list, file.path(td, basename(file_list)))
}
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1960/983250-99999-1960.op.gz'
Content type 'unknown' length 7135 bytes
==================================================
downloaded 7135 bytes
...
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1961/984290-99999-1961.op.gz'
Content type 'unknown' length 7649 bytes
==================================================
downloaded 7649 bytes
trying URL 'ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz'
downloaded 0 bytes
Error in utils::download.file(url = ftp, destfile = dest, mode = "wb") :
cannot download all files In addition: Warning message:
In utils::download.file(url = ftp, destfile = dest, mode = "wb") :
URL ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/983250-99999-1962.op.gz':
status was '530 Not logged in'
Different Methods and Ideas I've Tried but Haven't Yet Been Successful
So far I've tried to slow the requests down using Sys.sleep in a for loop and any other manner of retrieving the files more slowly by opening then closing connections, etc. It's puzzling because: i) it works for a bit then stops and it's not related to the particular year/station combination per se; ii) I can use nearly the exact same code and download much larger annual files of global weather data without any errors over a long period of years like this; and iii) it's not always stopping after 1961 going to 1962, sometimes it stops at 1960 when it starts on 1961, etc., but it does seem to be consistently between years, not within from what I've found.
The login is anonymous, but you can use userpwd "ftp:your@email.address". So far I've been unsuccessful in using that method to ensure that I was logged in to download the station files.
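For reference, the kind of call I attempted looks roughly like this (the e-mail address is a placeholder):
RCurl::getURL("ftp://ftp.ncdc.noaa.gov/pub/data/gsod/1962/",
              userpwd = "ftp:your@email.address",
              ftp.use.epsv = FALSE, ftplistonly = TRUE,
              crlf = TRUE, ssl.verifypeer = FALSE)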
I think you're going to need a more defensive strategy when working with this FTP server:
library(curl) # ++gd > RCurl
library(purrr) # consistent "data first" functional & piping idioms FTW
library(dplyr) # progress bar
# We'll use this to fill in the years
ftp_base <- "ftp://ftp.ncdc.noaa.gov/pub/data/gsod/%s/"
dir_list_handle <- new_handle(ftp_use_epsv=FALSE, dirlistonly=TRUE, crlf=TRUE,
ssl_verifypeer=FALSE, ftp_response_timeout=30)
# Since you, yourself, noted the server was perhaps behaving strangely or under load
# it's prbly a much better idea (and a practice of good netizenship) to cache the
# results somewhere predictable rather than a temporary, ephemeral directory
cache_dir <- "./gsod_cache"
dir.create(cache_dir, showWarnings=FALSE)
# Given the sporadic efficacy of server connection, we'll wrap our calls
# in safe & retry functions. Change this variable if you want to have it retry
# more times.
MAX_RETRIES <- 6
# Wrapping the memory fetcher (for dir listings)
s_curl_fetch_memory <- safely(curl_fetch_memory)
retry_cfm <- function(url, handle) {
  i <- 0
  repeat {
    i <- i + 1
    res <- s_curl_fetch_memory(url, handle=handle)
    if (!is.null(res$result)) return(res$result)
    if (i==MAX_RETRIES) { stop("Too many retries...server may be under load") }
  }
}
# Wrapping the disk writer (for the actual files)
# Note the use of the cache dir. It won't waste your bandwidth or the
# server's bandwidth or CPU if the file has already been retrieved.
s_curl_fetch_disk <- safely(curl_fetch_disk)
retry_cfd <- function(url, path) {
  # you should prbly be a bit more thorough than `basename` since
  # i think there are issues with the 1971 and 1972 filenames.
  # Gotta leave some work up to the OP
  cache_file <- sprintf("%s/%s", cache_dir, basename(url))
  if (file.exists(cache_file)) return()
  i <- 0
  repeat {
    i <- i + 1
    if (i==6) { stop("Too many retries...server may be under load") }
    res <- s_curl_fetch_disk(url, cache_file)
    if (!is.null(res$result)) return()
  }
}
# the stations and years
station <- c("983240-99999", "983250-99999", "983270-99999", "983280-99999",
"984260-41231", "984290-99999", "984300-99999", "984320-99999",
"984330-99999")
years <- 1960:2016
# progress indicators are like bowties: cool
pb <- progress_estimated(length(years))
walk(years, function(yr) {
  # the year we're working on
  year_url <- sprintf(ftp_base, yr)
  # fetch the directory listing
  tmp <- retry_cfm(year_url, handle=dir_list_handle)
  con <- rawConnection(tmp$content)
  fils <- readLines(con)
  close(con)
  # sift out only the target stations
  map(station, ~grep(., fils, value=TRUE)) %>%
    keep(~length(.)>0) %>%
    flatten_chr() -> fils
  # grab the stations files
  walk(paste(year_url, fils, sep=""), retry_cfd)
  # tick off progress
  pb$tick()$print()
})
You may also want to set curl_interrupt to TRUE in the curl handle if you want to be able to stop/esc/interrupt the downloads.
