This is the command I'm using :
dallData <- fread("data.csv", showProgress = TRUE, colClasses = c(rep("NULL", 2), "character", rep("NULL", 37)))
but I get this error when trying to load it: R character strings are limited to 2^31-1 bytes|
Anyway to skip those values ?
Here's a strategy that may work or at least narrow down the possible sources of error. It assumes you have enough working memory to hold the data and that your separators are really commas. If you actually have tabs as separators then you will need to modify accordingly. The plan is to read using readLines which will basically ignore the quotes that are probably mismatched. Then figure out which line or lines are at fault using count.fields, table, and which.
input <- readLines("data.csv") # ignores quotes
counts.def <- count.fields(textConnection(input),
sep=",") # defaults quotes are both ' and "
table(counts.def) # might show a variety of line counts.
# Second try with just double-quotes
counts.dbl <- count.fields(textConnection(input),
sep=",", quote="\"") # just dbl-quotes
table(counts.dbl) # if all the same, then all you do is change the quotes argument
Depending on the results you may need to edit cerain lines which can be identified using which(counts.def < 40) assuming most of them are 40 as your input efforts suggest is the expected number of fields per line.
(If the tag for [ram] means you are limited and getting warnings or using virtual memory which slows things down horribly, then you should restart your OS, and only load R before trying again. R needs contiguous block of memory and Windoze isn't very good at memory management.)
Here's a small test case to work with:
input <- readLines(textConnection(
"v1,v2,v3,v4,v5,v6
text, text, text, text, text, text
text, text, O'Malley, text,text,text
junk,junk, more junk, \"text\", tex\"t, nothing
3,4,5,6,7,8")
if you can help with converting a big text:
sample of the text :
X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y""1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA"19"II19II"K2LE1-109-1091-19"II2008-11-13II"109"II1091II1II2010-10-19II37
with separator "II" to a Dataframe.
i have used :
df_BSt7<-readLines("Komponente_K2LE1.txt")
df_BST7<-str_replace_all(df_BSt7,"II",",")
df_BST7<-read.table(df_BST7,sep = ",")
head(df_BST7)
but I am always getting an Error
could not allocate memory (206 Mb) in C function 'R_AllocStringBuffer'
and when i call head() I am getting
'"X1","ID_Sitze.x","Produktionsdatum.x","Herstellernummer.x","Werksnummer.x","Fehlerhaft.x","Fehlerhaft_Datum.x","Fehlerhaft_Fahrleistung.x","ID_Sitze.y","Produktionsdatum.y","Herstellernummer.y","Werksnummer.y","Fehlerhaft.y","Fehlerhaft_Datum.y","Fehlerhaft_Fahrleistung.y""1",1,"K2LE1-109-1091-2",2008-11-12,"109",1091,1,2010-10-18,37080,NA,NA,NA,NA,NA,NA,NA"2",2,"K2LE1-109-1091-1",2008-11-12,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"3",3,"K2LE1-109-1091-12",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"4",4,"K2LE1-109-1091-5",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"5",5,"K2LE1-109-1091-40",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"6",6,"K2LE1-109-1091-15",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"7",7,"K2LE1-109-1091-31",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"8",8,"K2LE1-109-1091-6",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"9",9,"K2LE1-109-1091-8",2008-11-13,"109",1091,0,NA,0,NA,NA,NA,NA,NA,NA,NA"10",10,"K2LE1-109-109 [... abgeschnitten]
So, there are several possible problems, some might be specific to your examples.
Clean example data
First, let's take a look at your example data. In what you provide, there are no newlines, everything is on a single line. Is that the case in the original "Komponente_K2LE1.txt" file? If yes, we might need some more work to find where to add newlines (see below).
The first column name, X1, only has a quote on the right. It can't work without the quote on the left: "X1"IIID_Sitze.
The saved dataframe has 16 columns, I expect because there is a row number at the beginning of each row which is not in the header. So we can add an additional column header to have 16 of them:
"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"
Then we have a small problem with line 19 which is truncated, I assume it comes from your copy/paste and that's not a problem with the full file. So let's forget about it for now. So I have this text:
raw_lines <- '"row_nb"II"X1"II"ID_Sitze.x"II"Produktionsdatum.x"II"Herstellernummer.x"II"Werksnummer.x"II"Fehlerhaft.x"II"Fehlerhaft_Datum.x"II"Fehlerhaft_Fahrleistung.x"II"ID_Sitze.y"II"Produktionsdatum.y"II"Herstellernummer.y"II"Werksnummer.y"II"Fehlerhaft.y"II"Fehlerhaft_Datum.y"II"Fehlerhaft_Fahrleistung.y"
"1"II1II"K2LE1-109-1091-2"II2008-11-12II"109"II1091II1II2010-10-18II37080IINAIINAIINAIINAIINAIINAIINA
"2"II2II"K2LE1-109-1091-1"II2008-11-12II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"3"II3II"K2LE1-109-1091-12"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"4"II4II"K2LE1-109-1091-5"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"5"II5II"K2LE1-109-1091-40"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"6"II6II"K2LE1-109-1091-15"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"7"II7II"K2LE1-109-1091-31"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"8"II8II"K2LE1-109-1091-6"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"9"II9II"K2LE1-109-1091-8"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"10"II10II"K2LE1-109-1091-25"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"11"II11II"K2LE1-109-1091-24"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"12"II12II"K2LE1-109-1091-36"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"13"II13II"K2LE1-109-1091-33"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"14"II14II"K2LE1-109-1091-42"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"15"II15II"K2LE1-109-1091-14"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"16"II16II"K2LE1-109-1091-21"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"17"II17II"K2LE1-109-1091-43"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA
"18"II18II"K2LE1-109-1091-44"II2008-11-13II"109"II1091II0IINAII0IINAIINAIINAIINAIINAIINAIINA'
Now you are replacing "II" with "," and reading it with read.table(), which is perfectly correct, except that read.table() would assume you're giving a file name and throw an error as it can't open that connection (that file). To make it work you need this:
df_BST7<-str_replace_all(raw_lines,'II',",")
df_BST7 <- read.table(text = df_BST7,sep = ",")
So now that does run on my computer.
Side note, since you're already using the tidyverse, you could as well use that equivalent code instead:
df_BST7 <- str_replace_all(raw_lines,'II',",")
df_BST7 <- read_csv(df_BST7)
which could help with something later
The error message
Now the error message you get suggests it's a memory problem. I see 2 possibilities: the table is so big it can't fit in your computer's memory, or indeed your whole input table is on a single line, so that makes a very long line, which won't fit in memory.
Whole table too big
I don't think it's the problem here, but just in case, check how big the file on the disk is, and how much memory is free on your computer, and whether you could free up enough memory by just closing a few programs. Possibly you could save your modified text to disk and delete it from R's memory with rm(df_BSt7), then load it directly from disk into df_BST7. Since the raw text fits in memory, that should work. If memory is a challenge, you can replace read_csv() with read_csv_chunked() and process one chunk at a time.
All on one line
I think this is the most likely. Again, there are two possibilities.
Missing carriage return
Actually line breaks can be described in 2 ways, Unix-like systems (MacOS and GNU/Linux) use the symbol newline (\n), whereas Windows uses a pair of carriage return and newline (\r\n). I'm not sure how this could create problems inside R, but if your file was generated on a Unix-like system and you're trying to read it on Windows that's an explanation. Then the goal would become to replace \n with \r\n.
No line breaks at all
If there is absolutely no line break, neither \r nor \n, then we need to guess where they are. On a Unix system you could try awk or sed, but there are ways to do it in R. The following code should work, except the last column will need some cleaning up afterwards:
raw_lines2 <- str_remove_all(raw_lines2, "\r")
all_fields <- raw_lines2 %>%
str_split("II") %>%
unlist()
nb_lines <- (length(all_fields) - 1)/15
reconstruct_lines <- map_chr(0:(nb_lines-1), ~ paste(all_fields[(2+15*.):(16+15*.)], collapse = ",")) %>%
paste(collapse = "\n")
cat(reconstruct_lines)
I am facing a problem with the fwrite function from the DataTable package in R. In fact it appends the wrong way and I'd end up with something like:
**user ref version status Type DataExtraction**
user1 2.02E+11 1 Pending 1 No
user2 2.02E+11 1 Saved 2 No"user3" 2.01806E+11 1 Saved NB No
I am using the function as follows :
library(data.table)
fwrite(Save, "~/Downloads/Register.csv", append = TRUE, sep = ",", quote = TRUE)
Reproducible example:
fwrite(data.table(user="user3",
ref="204094093",
version="2",
status="Pending",
Type="1",DataExtraction="No"),
"~/Downloads/test.csv", sep = ",", append = FALSE)
fwrite(data.table(user="user3",
ref="204094093",
version="2",
status="Pending",
Type="1",DataExtraction="No"),
"~/Downloads/test.csv", sep = ",", append = TRUE)
I'm not sure if it isolates the problem, but it seems that if I manually change something in the .csv file (for instance rename DataExtraction to Extraction), the problem of appending in the wrong way occurs.
Does someone know what is going wrong?
Thanks!
When I run your example code I have no problems with the behavior - the file comes out as desired. Based on your comments about manually changing what is in the file, and what the undesired output looks like, here is what I believe is happening. When fwrite() (and many other similar IO functions) write to a file, each line has at the end of it a line break (in R, this is generally represented as \n). This is desired, so that subsequent lines of data indeed appear on subsequent lines of the file. Generally this will also mean that when you open a file in a text editor, there will be a blank line at the very end, since this reflects the line break in the last line that was written. (different editors handle this differently though). So, I suspect what is happening is that when you go in and manually edit the file in your editor, you are somehow losing that last line break. What this means is that when you go to write again using append, there is no line break at the end of the file, and therefore you get the undesired behavior of two lines of data on a single line of the file.
So, the solution would be to either find how to prevent your manual editing from deleting that last line break character. Barring that, there are ways to just write a single line break character to the file using R. E.g. with the cat() function.
I am working with importing a series of text files from many zipped folders.
Before, I was unzipping the files with WinRAR, then applying read.table as:
main <- read.table(file.path(dir, paste0("Current/17/Folder/Main.txt")),
sep = '|',
header = FALSE,
stringsAsFactors = FALSE,
skipNul = TRUE,
comment.char="",
quote = ""
)
Which worked, but required unzipping myself (which takes up a prohibitive amount of space). I stumbled on unz, which would allow me to avoid this step as follows:
main <- read.table(unz(file.path(dir, paste0("Current/17.zip")), "Folder/Main.txt"),
sep = '|',
header = FALSE,
stringsAsFactors = FALSE,
skipNul = TRUE,
comment.char="",
quote = ""
)
The latter works in most cases, but for some files, the read.table throws an error
number of items read is not a multiple of the number of columns
I discerned that this is from some odd characters in the file. Comparing the two approaches in one case, I find that the manual unzip and read approach converts these special characters to "??" (with a white background, suggesting a foreign character) (which is desirable, if it cannot read them), whereas the second approach simply throws an error when it hits them (the ?? are not reported in the output), and does not read the rest of the current line or file. I looked at the line in question in Bash with:
sed -n '10905951 p' Main.txt
And it also reports the same "??" that show up in the read.table in R.
I have tried using fileEncoding, with a few different file formats, to no avail. Many other approaches I have seen for dealing with odd characters like this require manually replacing the characters, but I am struggling to even find out what the characters are, and why WinRAR's unzip converts them to something readable while unz does not. I have also considered other read options, but read.table is preferable to me because it allows for the skipNul option, which is relevant for my data.
Any help or ideas would be appreciated.
I've got a strange text file with a bunch of NUL characters in it (actually about 10 such files), and I'd like to programmatically replace them from within R. Here is a link to one of the files.
With the aid of this question I've finally figured out a better-than-ad-hoc way of going into each file and find-and-replacing the nuisance characters. It turns out that each pair of them should correspond to one space ([NUL][NUL]->) to maintain the intended line width of the file (which is crucial for reading these as fixed-width further down the road).
However, for robustness' sake, I prefer a more automable approach to the solution, ideally (for organization's sake) something I could add at the beginning of an R script I'm writing to clean up the files. This question looked promising but the accepted answer is insufficient - readLines throws an error whenever I try to use it on these files (unless I activate skipNul).
Is there any way to get the lines of this file into R so I could use gsub or whatever else to fix this issue without resorting to external programs?
You want to read the file as binary then you can substitute the NULs, e.g. to replace them by spaces:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(0x20) ## replace with 0x20 = <space>
writeBin(r, "00staff.txt")
str(readLines("00staff.txt"))
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__ ...
You could also substitute the NULs with a really rare character (such as "\01") and work on the string in place, e.g., let's say if you want to replace two NULs ("\00\00") with one space:
r = readBin("00staff.dat", raw(), file.info("00staff.dat")$size)
r[r==as.raw(0)] = as.raw(1)
a = gsub("\01\01", " ", rawToChar(r), fixed=TRUE)
s = strsplit(a, "\n", TRUE)[[1]]
str(s)
# chr [1:155432] "000540952Anderson Shelley J FW1949 2000R000000000000119460007620 3 0007000704002097907KGKG1616"| __truncated__