While working with an R Notebook I get the following error when getting the output of a linear regression model via broom.
This is a dummy example of what I encounter:
N <- 100
a <- rnorm(N)
b <- a + rnorm(N)
df1 <- data.frame(a, b)
lModel <- lm(b ~ a, df1)
summary(lModel)
Then if I want to get the output of tidy(lModel) I get the error:
Error in tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf") : temporary name too long
The thing is that I used the tidy() function from broom not long before and got the output without any problem. I wonder what the issue may be and how it can be fixed.
This is the traceback of the error above:
Error in tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf") : temporary name too long
4. tempfile(pattern = "_rs_rdf_", tmpdir = outputFolder, fileext = ".rdf")
3. overridePrint(o$x, o$options, o$className, o$nRow, o$nCol)
2. print.data.frame(x)
1. function (x, ...) UseMethod("print")(x)
Thanks a lot in advance.
This error occurs on Windows systems when directories are nested too deeply. The Windows API has a maximum path length of 260 characters.
Maximum Path Length Limitation
In the Windows API (with some exceptions discussed in the following paragraphs), the maximum length for a path is MAX_PATH, which is defined as 260 characters. A local path is structured in the following order: drive letter, colon, backslash, name components separated by backslashes, and a terminating null character. For example, the maximum path on drive D is "D:\some 256-character path string<NUL>" where "<NUL>" represents the invisible terminating null character for the current system codepage. (The characters < > are used here for visual clarity and cannot be part of a valid path string.)
This is pretty easy to avoid. Just adjust your working directory, or the structure where you're saving your temp file. Either your file name is too long, or your directory is nested too deeply, so that the full path exceeds Windows' path limit.
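As a quick sanity check you can measure how close you are to the limit from within R. This is only a sketch: the outputFolder in the traceback is created by RStudio next to the notebook, so the length of the working directory is used here as a rough proxy for it.
# How long is the directory the notebook lives in?
nchar(getwd())
# And roughly how long a temp file name of the kind RStudio builds would be:
nchar(tempfile(pattern = "_rs_rdf_", tmpdir = getwd(), fileext = ".rdf"))
If either number is anywhere near 260, moving the notebook to a shorter path should make the error go away.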
As an aside, on Unix systems the maximum path is significantly longer, but there is a maximum file name length of 255 characters.
This can be fixed with a registry change on Windows 10.
Open the Registry Editor by searching for regedit in the Start menu
Navigate to Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem
Change LongPathsEnabled from 0 to 1
Now the 260-character restriction will be ignored.
Update Aug 4th '18: If your R working directory is inside a Google Drive File Stream folder then the limit is still enforced. This is because GDFS is a virtual drive that has its own limitations.
Related
I received a series of 100+ files from a client. This client received the files as part of litigation, so they didn't have to be transmitted in a convenient fashion; they just all had to be present. They arrived in a single .zip file, with the files named Folder1.001, Folder1.002, Folder3.001, etc. When I unpackage these files using the 7-Zip program, they don't come out with a .txt, .csv, or any other file extension, so Windows simply reports them as a ".001 File" or ".002 File." That is not the issue, because I know that the files are delimited by a ~ and are 118 columns wide. Each file has between 2.5M and 4.9M rows, and each is about 1 GB in size when unzipped.
This is my first ever post here, so please excuse any breach of etiquette.
I am working in a .Rmd file on a virtual machine running Windows. I have R4.2.2 (64-bit), and RStudio 2022.12.0+353. All work is being done within a drive on the virtual machine that has 9+ GB free out of 300 GB total. The size of this virtual drive could be increased, if necessary.
My goal here is to examine one variable in each file, to see if cases fall within a given range for that variable, and save those rows that do. I have been saving them as .rds files using write_rds().
I have been bringing in the files using a read_delim() statement specifying 'delim = "~"'. I created a vector of 120 column names which I use because the columns are not labeled. These commands on their own are not an issue. A successful import looks like the below.
work1 <- read_delim("Data\\Folder1\\File1.001", delim = "~", col_names = vNames1)
Rows: 2577668 Columns: 120
── Column specification ──────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
It misreads the columns named Skip1 and Skip9 as logical, but those aren't a necessary part of my analysis.
I then filter and write the file using
work1 <- work1 %>% filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
write_rds(work1, "Data\\Working\\Folder1_001.rds")
I have also done this with the read_delim() and filter() piped into a single command. This is not the issue. NOTE: Before I read in the next file (File1.002), I now have a work1 object of at most 4,000 cases, down from millions when it was imported.
Since I have over 100 of these files, I have written multiple code chunks to do a few of these at a time. After one to three read_delim() statements in a row, I get the below error.
work2 <- read_delim("Data\\Folder1\\File1.002", delim = "~", col_names = vNames1)
Error std::bad_alloc
This, I understand, has to do with memory allocation. I can close out RStudio and restart, and that will allow me to do one or two more imports, filterings, then writings. Doing that for over 100 files is far too inefficient.
I condensed my code a step further by writing the read_delim() step within the write_rds() step, which looks like the below.
write_rds((read_delim("Data\\Folder1\\File003",
delim = "~", col_names = vNames1) %>%
filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)),
"Data\\Working\\Folder1_003.rds")
Rows: 2577668 Columns: 120
── Column specification ──────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Yet after 1 or 2 successful runs, I get the same
Error std::bad_alloc message.
Using traceback(), it seems like it is related to vroom::vroom(), but I'm not sure how to check any further.
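For reference, the per-file workflow described above can be collapsed into a single loop that reads, filters, writes, and then frees each table before touching the next file. This is only a sketch: it assumes the Data\Folder1 layout and the vNames1 vector from earlier, the output file naming (File1_001.rds etc.) is made up, and gc() merely asks R to release memory rather than guaranteeing the std::bad_alloc goes away.
library(readr)
library(dplyr)

files <- list.files("Data\\Folder1", full.names = TRUE)  # the 100+ delimited files

for (f in files) {
  out <- file.path("Data\\Working",
                   paste0(gsub("\\.", "_", basename(f)), ".rds"))  # e.g. File1_001.rds (hypothetical naming)
  read_delim(f, delim = "~", col_names = vNames1, show_col_types = FALSE) %>%
    filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900) %>%
    write_rds(out)
  gc()  # ask R to release the memory held by the discarded multi-million-row table
}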
I have been obtaining .zip archives of genome annotation from NCBI (mainly gff files). In order to save disk space I prefer not to unzip the archive, but to read these files directly into R using unz(). However, it seems that unz() is unable to extract files from the end of 'large' zip files:
ncbi.zip <- "file_location/name.zip"
files <- unzip(ncbi.zip, list=TRUE)
gff.files <- files$Name[ grep("gff$", files$Name) ]
## this works
gff.128 <- readLines( unz(ncbi.zip, gff.files[128]) )
## this gives an empty data structure (read.table() stops
## with an error saying no lines or similar)
gff.129 <- readLines( unz(ncbi.zip, gff.files[129]) )
## there are 31 more gff files after the 129th one.
## no lines are read from any of these.
The zip file itself seems to be fine; I can unzip the specific files using unzip on the command line and unzip -t does not report any errors.
I've tried this with R versions 3.5 (openSuse Leap 15.1), 3.6, and 4.2 (centOS 7) and with more than one zip file and get exactly the same result.
I attached strace to R whilst reading in the 128th and 129th files. In both cases I get a lot of lseek calls towards the end of the file (offset 2845892608, larger than 2^31) to start with. This is where I assume the zip directory can be found. For the 128th file (the one that can be read), I eventually get an lseek to an offset slightly below 2^31, followed by a set of lseeks and reads (that extend beyond 2^31).
For the 129th file, I get the same reads towards the end of the file, but then rather than finding a position within the file I get:
lseek(3, 2845933568, SEEK_SET) = 2845933568
lseek(3, 4294963200, SEEK_SET) = 4294963200
read(3, "", 4096) = 0
lseek(3, 4095, SEEK_CUR) = 4294967295
read(3, "", 4096) = 0
Which is a bit weird since the file itself is only about 2.8 GB. 4294967295 is, of course, 2^32 - 1.
To me this feels like an integer overflow bug, and I am considering posting a bug report. But I am wondering if anyone has seen something similar before or if I am doing something stupid.
Having done what I should have started with (reading the zip64 format specification), it's actually clear that this is not an integer overflow error.
Zip files contain a central directory at the end of the archive; this contains amongst other things the names of the compressed files and the offset of the compressed data in the zip archive. The offset (and file size fields) are only given 4 bytes each in the standard directory field; when the offset is larger than this it should instead be given in the extra fields section and the value in the standard field should be set to 0xFFFFFFFF. Since this is the offset that gets used when reading the file it seems clear that the problem lies in the parsing of the extra field.
I had a look at the source code for R 4.2.1 and it seems that the problem is due to the way the offset specified in the standard offset field is tested:
if(file_info.uncompressed_size == (ZPOS64_T)(unsigned long)-1)
changing this to == 0xFFFFFFFF seems to fix the problem.
I've submitted a bug report to R. Hopefully changing the check will not have any unintended consequences and the issue will be fixed.
Still, I'm curious as to whether anyone else has come across the same issue. Seems a bit unlikely that my experience is unique.
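In the meantime, a workaround that sidesteps unz() entirely is to stream the individual member through the command-line unzip (which, as noted above, handles these archives fine) via a pipe() connection. A sketch, assuming unzip is available on the PATH:
# unzip -p writes the named member to stdout; pipe() lets readLines() consume it directly
cmd <- sprintf("unzip -p %s %s", shQuote(ncbi.zip), shQuote(gff.files[129]))
gff.129 <- readLines(pipe(cmd))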
I have a program which creates and stores files automatically on GitHub. An example is
https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master/test-999-666.txt
However, the files are encoded on a DOS/Windows machine as UCS-2 LE BOM (according to Notepad++).
I am trying to read this text file into R but to no avail:
repo <- "https://raw.githubusercontent.com/VIC-Laboratory-ExperimentalData/test/master"
file <- "test-999-666.txt"
myurl <- paste(repo, file, sep="/")
library(RCurl)
cnt <- getURL(myurl)
I get an error (the message is from a French locale and corresponds to "embedded nul in string"):
Error in curlPerform(curl = curl, .opts = opts, .encoding = .encoding) :
caractère nul au milieu de la chaîne : '<ff><fe>*'
How can I configure getURL to read this file? I also tried with httr::GET (but received empty content).
This seems to be a relatively common pain point when working with files produced by Windows. I'm going to be honest and say that the solution I'm presenting doesn't seem the best, because it mainly bypasses getting everything into the right encoding and instead goes to the binary directly.
Using the same variables as you:
cnt <- getURLContent(myurl, binary = T)
cnt <- rawToChar(cnt[cnt != 00])
Should produce a parsable string.
The idea is that instead of trying to have curl read the file, we let it treat the file as binary and deal with the encoding later on. This gives us a vector of type raw. Then, since the main issue seems to be that null characters (i.e. \00) were causing a problem, we just exclude them from cnt before coercing cnt from raw to char.
In the end, from your example, I get
"ÿþ*** Header Start ***\r\nVersionPersist: 1\r\nLevelName: Session\r\nLevelName: Block\r\nLevelName: Trial\r\nLevelName: SubTrial\r\nLevelName: LogLevel5\r\nLevelName: LogLevel6\r\nLevelName: LogLevel7\r\nLevelName: LogLevel8\r\nLevelName: LogLevel9\r\nLevelName: LogLevel10\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\n*** Header End ***\r\nLevel: 1\r\n*** LogFrame Start ***\r\nExperiment: test\r\nSessionDate: 07-04-2019\r\nSessionTime: 12:35:06\r\nSessionStartDateTimeUtc: 2019-07-04 16:35:06\r\nSubject: 999\r\nSession: 666\r\nDataFile.Basename: test-999-666\r\nRandomSeed: -1018314635\r\nGroup: 1\r\nDisplay.RefreshRate: 60.005\r\nClock.Information: <?xml version=\"1.0\"?>\\n<Clock xmlns:dt=\"urn:schemas-microsoft-com:datatypes\"><Description dt:dt=\"string\">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt=\"int\">0</Timestamp><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt=\"r8\">2742255</Frequency><Timestamp dt:dt=\"r8\">492902384024</Timestamp><Current dt:dt=\"r8\">0</Current><DateUtc dt:dt=\"string\">2019-07-04T16:35:05Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\\n\r\nStudioVersion: 2.0.10.252\r\nRuntimeVersion: 2.0.10.356\r\nRuntimeVersionExpected: 2.0.10.356\r\nRuntimeCapabilities: Professional\r\nExperimentVersion: 1.0.0.543\r\nExperimentStuff.RT: 2555\r\n*** LogFrame End ***\r\n"
Which seems to contain all the right content.
If you want, you can try adding options(encoding = "UCS-2LE-BOM") before this code; I don't know if it changes anything, but it seems to affect rawToChar.
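Another option, if you would rather decode the UTF-16 properly instead of dropping the null bytes, is to save the raw bytes to a temporary file and let a connection do the re-encoding. This is only a sketch; it relies on iconv on your platform supporting UTF-16LE, which is typical:
tmp <- tempfile(fileext = ".txt")
download.file(myurl, tmp, mode = "wb")              # mode = "wb" keeps the bytes untouched
cnt <- readLines(file(tmp, encoding = "UTF-16LE"))  # connection-level decoding; the BOM may survive as "\ufeff"
unlink(tmp)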
I am a bit new to R and have a question about a program I am trying to write. I am hoping to take in files (as many as a user pleases) with a while loop (eventually using read.table on each), but it keeps breaking on me.
Here is what I have so far:
cat("Please enter the full path for your files, if you have no more files to add enter 'X': ")
fil<-readLines(con="stdin", 1)
cat(fil, "\n")
while (!input=='X' | !input=='x'){
inputfile=input
input<- readline("Please enter the full path for your files, if you have no more files to add enter 'X': ")
}
if(input=='X' | input=='x'){
exit -1
}
When I run it (from the command line (UNIX)) I get these results:
> library("lattice")
>
> cat("Please enter the full path for your files, if you have no more files to add enter 'X': ")
Please enter the full path for your files, if you have no more files to add enter 'X': > fil<-readLines(con="stdin", 1)
x
> cat(fil, "\n")
x
> while (!input=='X' | !input=='x'){
+ inputfile=input
+ input<- readline("Please enter the full path for your files, if you have no more files to add enter 'X': ")
+ }
Error: object 'input' not found
Execution halted
I am not quite sure how to fix the problem, but I am pretty sure that it is probably a simple problem.
Any suggestions?
Thanks!
When you first run the script, input doesn't exist. Assign
input<-c()
say, before your while statement, or put
inputfile = input
below your input <- readline(...) line.
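Put together, the loop could look something like the sketch below. Note also that the original condition !input=='X' | !input=='x' is always TRUE (one of the two tests always succeeds), so the exit test here uses a single toupper() comparison instead:
# note: readline() only prompts in an interactive session; when running the script
# with Rscript, use readLines(con = "stdin", 1) as in the question instead
input <- ""            # initialise so the while condition can be evaluated on the first pass
files <- character(0)  # collect the paths the user enters

while (toupper(input) != "X") {
  input <- readline("Please enter the full path for your files, if you have no more files to add enter 'X': ")
  if (toupper(input) != "X") {
    files <- c(files, input)  # keep everything except the final 'X'/'x'
  }
}
# 'files' now holds the entered paths, ready for read.table()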
I'm not exactly sure what the underlying problem is for your issue. It may be that you're inputting the directory path incorrectly.
Here's a solution I've used a few times. It makes things much easier for the user. Basically, your code will not require user input; all it requires is that you follow a certain naming convention for your files.
setwd("Your/Working/Directory") #This doesn't change
filecontents <- 1
i <- 1
while (filecontents != 0) {
mydata.csv <- try(read.csv(paste("CSV_file_",i,".csv", sep = ""), header = FALSE), silent = TRUE)
if (typeof(mydata.csv) != "list") { #checks to see if the imported data is a list
filecontents <- 0
}
else {
assign(paste('dataset', i, sep = ''), mydata.csv)
#Whatever operations you want to do on the files.
i <- i + 1
}
}
As you can see, the naming convention for the files is CSV_file_n, where n is any number of input files (I took this code out of one of my programs, in which I load CSVs). One of the problems I kept having was error messages popping up when my code looked for a file that wasn't there. With this loop, those messages won't arise. If it assigns the contents of a non-existent file to mydata.csv, it merely checks to see whether mydata.csv is a list. If it is, it continues operating. If not, it stops. If you're worried about differentiating between your data from different files within the code, just insert any relevant information about the file in a constant location within the file itself. For example, in my CSVs, my 3rd column always contained the name of the image from which I gathered the information contained in the rest of the CSV.
Hope this helps you a bit, even though I see you've already got a solution :-). It's really just an option if you want your program to be more autonomous.
What's the best way to determine if a character string is a valid file path? So CheckFilePath("my*file.csv") would return FALSE (on Windows, * is an invalid character), whereas CheckFilePath("c:\\users\\blabla\\desktop\\myfile.csv") would return TRUE.
Note that a file path can be valid but not exist on disk.
This is the code that save() uses to perform that check:
....
else file(file, "wb")
on.exit(close(con))
}
else if (inherits(file, "connection"))
con <- file
else stop("bad file argument")
......
Perhaps file.exists() is what you're after? From the help page:
file.exists returns a logical vector indicating whether the files named by its argument exist. (Here 'exists' is in the sense of the system's stat call: a file will be reported as existing only if you have the permissions needed by stat. Existence can also be checked by file.access, which might use different permissions and so obtain a different result.)
Several other functions to tap into the computer's file system are available as well, also referenced on the help page.
No, there's no way to do this (reliably). I don't see an operating system interface in either Windows or Linux to test this. You would normally try to create the file and get a failure message, or try to read the file and get a 'does not exist' kind of message.
So you should rely on the operating system to let you know if you can do what you want to do to the file (which will usually be read and/or write).
I can't think of a reason other than a quiz ("Enter a valid fully-qualified Windows file path:") to want to know this.
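For example, the "just try it" approach can be wrapped up in a small helper. This is only a sketch; note that it really does touch the file system (it creates and then deletes a probe file), and the function name is made up here:
can_create_file <- function(path) {
  if (file.exists(path)) return(TRUE)        # it already exists, so the path is clearly usable
  ok <- suppressWarnings(file.create(path))  # FALSE (with a warning) if the OS rejects the path
  if (isTRUE(ok)) unlink(path)               # remove the probe file again
  isTRUE(ok)
}
can_create_file("my*file.csv")                        # FALSE on Windows, TRUE on most Unix file systems
can_create_file(file.path(tempdir(), "myfile.csv"))   # TRUE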
I would suggest trying the checkPathForOutput function offered by the checkmate package. As stated in the linked documentation, the function:
Check[s] if a file path can safely be used to create a file and write to it.
Example
checkmate::checkPathForOutput(x = tempfile(pattern = "sample_test_file", fileext = ".tmp"))
# [1] TRUE
checkmate::checkPathForOutput(x = "c:\\users\\blabla\\desktop\\myfile.csv")
# [1] TRUE
Invalid path
The \0 character cannot be used in Linux¹ file names:
checkmate::check_path_for_output("my\0file.csv")
# Error: nul character not allowed (line 1)
¹ Not tested on Windows, but looking at the code of checkmate::check_path_for_output indicates that the function should work correctly on MS Windows systems as well.