R data.table fwrite to fread space delimiter and empties - r

I am having problems using fread with " " as delimiter and interspersed blank values. For example, this:
dt <- data.table(1:5,1:5,1:5) #make a simple table
dt[3,"V2" := NA] #add a blank in the middle to illustrate the problem
fwrite(dt, file = "dt.csv", sep = " ") #save to file
dt <- fread("dt.csv", sep = " ") #try to retrieve
The fread fails with: "Stopped early on line 4. Expected 3 fields but found 2." The problem seems to be that with the NA value in the middle column, fwrite gives value|space|space|value, then fread doesn't recognize the implied blank value in the middle.
I understand it would be simple to use another delimiter in the first place. However, is it possible to get fread to reproduce the original dt here?
EDIT WITH A READ-SIDE SOLUTION:
I found the same question here. It's a bit confusing because it gives a solution, but then the solution later stopped working. On pursuing some other leads the closest I've now found to a read-side solution with fread() is with a Unix command like this:
dt <- fread(cmd="wsl sed -r 's/ /,/g' dt.csv") #converts spaces to commas on the way in
On Windows 10 I had to do some trial and error to get my system to run Unix commands. The "wsl" part seems to depend on the system. This video was helpful, and I used the first method he describes there. This and this question provides a bit more on sed with fread. The latter says sed comes with rTools, although I did not try that.

Maybe export NA as something other than "" by default
Here I use #
library(data.table)
dt <- data.table(1:5,1:5,1:5) #make a simple table
dt[3,"V2" := NA] #add a blank in the middle to illustrate the problem
fwrite(dt, file = "dt.csv", sep = " ", na="#") #save to file
dt <- fread("dt.csv", sep = " ",na.strings = "#") #try to retrieve

Related

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
these are the first two lines, with the first one that will serve as columns names, all separated by commas and with the values in quotation marks except for the first one, and I think it is that that creates troubles.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the dataframe has the columns' names nicely separated but the values are all in the first one
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets splitted in two columns so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
this is the best output for now, as I can manipulate it to get the desired one.
another option is to load the file as a character vector with readLines to fix it, but I am not an expert with regexs so I would be clueless.
the file is also 300k lines, so it would be hard to inspect it.
both read.delim and fread gives warning messages, I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me a more easily output to manipulate, it splits the regex and msg column in two but ne and class are distinct
I tried with the input you provided with read.csv and had no problems; when subsetting each column is accessible. As for your other options, you're getting the quote option wrong, it needs to be "\""; the double quote character needs to be escaped i.e.: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
tmp$ne
# > "BOU2-P-2"
tmp$class
# > "tengigabitethernet"
tmp$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
tmp$match
# > "4/2"
tmp$event
# > "lineproto-5-updown"
tmp$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

R- my first row has # character in the column names?

My test file is formatted very odly.
The first rows starts with:
If i ignore the first row and import the data by using the read.table it works well but then i donot have the column names. But if i try to import the data using col.names=TRUE, it says "more columns than column names". I guess i can separately import the first row and the rest of data and add the first (which is the column names) to the final output file. But when i import the the first row: it completely ignores the column names and jumps to the row with 0 0 0 0.... Is it because the first row has a # character. And also because of the # character there is an extra empty column in the data.
Here are a few possibilities:
1) process twice Read it in as a character vector of lines, L, using readLines. Then remove the # and read L using read.table:
L <- sub("#", "", readLines("myfile.dat"))
read.table(text = L, header = TRUE)
2) read header separately For smaller files the prior approach is short and should be fine but if the file is large you may not want to process it twice. In that case, use readLines to read in only the header line, fix it up and then read in the rest applying the column names.
File <- "myfile.dat"
col.names <- scan(text = readLines(File, 1), what = "", quiet = TRUE)[-1]
read.table(File, col.names = col.names)
3) pipe Another approach is to make use of external commands:
File <- "myfile.dat"
cmd <- paste("sed -e 1s/.//", File)
read.table(pipe(cmd), header = TRUE)
On UNIX-like systems sed should be available. On Windows you will need to install Rtools and either ensure sed is on the PATH or else use the path to the file:
cmd <- paste("C:/Rtools/bin/sed -e 1s/.//", File)
read.table(pipe(cmd), header = TRUE)
One approach would be to just do a single separate read of the first line to sniff out the column names. Then, do a read.table as you were already doing, and skip the first line.
f <- "path/to/yourfile.csv"
con <- file(f, "r")
header <- readLines(con, n=1)
close(con)
df <- read.table(f, header=FALSE, sep = " ", skip=1) # skip the first line
names(df) <- strsplit(header, "\\s+")[[1]][-1] # assign column names
But, I don't like this approach and would rather prefer that you fix the source of your flat files to not include that troublesome # symbol. Also, if you only need this requirement as a one time thing, you might also just edit the flat file manually to remove the # symbol.

R - write.table() creates table that isn't read properly by fread()

I have a data.table in R that I'm trying to write out to a .txt file, and then input back into R.
It's sizeable table of 6.5M observations and 20 variables, so I want to use fread().
When I use
write.table(data, file = "data.txt")
a table of about 2.2GB is written in data.txt. In manually inspecting it, I can see that there are column names, that it's separated by " ", and that there are quotes on character variables. So everything should be fine.
However,
data <- fread("data.txt")
returns a data.table of 6.5M observations and 1 variable. OK, maybe for some reason fread() isn't automatically understanding the separator string:
data <- fread("data.txt", sep = " ")
All the data is in the proper variables now, but
R has added an unnecessary row-number column
in one (only one) of my columns all NAs have been replaced by 9218868437227407266
All variable names are missing
Maybe fread() isn't recognizing the header, somehow.
data <- fread("data.txt", sep = " ", header = T)
Now my first set of observations is my column names. Not very useful.
I'm completely baffled. Does anyone understand what's happening here?
EDIT:
row.names = F solved the names problem, thanks Ananda Mahto.
Ran
datasub <- data[runif(1000,1,6497651), ]
write.table(datasub, file = "datasub.txt", row.names = F)
fread("datasub.txt")
fread() seems to work fine for the smaller dataset.
EDIT:
Here is the subset of data I created above:
https://github.com/cbcoursera1/ExploratoryDataAnalysisProject2/blob/master/datasub.txt
This data comes from the National Emissions Inventory (NEI) and is made available by the EPA. More information is available here:
http://www.epa.gov/ttn/chief/eiinformation.html
EDIT:
I can no longer reproduce this issue. It may be that row.names = F solved the issue, or possibly restarting R/clearing my environment/something random fixed the problem.

How do I write a csv file in R, where my input is written to the file as row?

This is a very simple issue and I'm surprised that there are no examples online.
I have a vector:
vector <- c(1,1,1,1,1)
I would like to write this as a csv as a simple row:
write.csv(vector, file ="myfile.csv", row.names=FALSE)
When I open up the file I've just written, the csv is written as a column of values.
It's as if R decided to put in newlines after each number 1.
Forgive me for being ignorant, but I always assumed that the point of having comma-separated-values was to express a sequence from left to right, of values, separated by commas. Sort of like I just did; in a sense mimicking the syntax of written word. Why does R cling so desperately to the column format when a csv so clearly should be a row?
All linguistic philosophy aside, I have tried to use the transpose function. I've dug through the documentation. Please help! Thanks.
write.csv is designed for matrices, and R treats a single vector as a matrix with a single column. Try making it into a matrix with one row and multiple columns and it should work as you expect.
write.csv(matrix(vector, nrow=1), file ="myfile.csv", row.names=FALSE)
Not sure what you tried with the transpose function, but that should work too.
write.csv(t(vector), file ="myfile.csv", row.names=FALSE)
Here's what I did:
cat("myVar <- c(",file="myVars.r.txt", append=TRUE);
cat( myVar, file="myVars.r.txt", append=TRUE, sep=", ");
cat(")\n", file="myVars.r.txt", append=TRUE);
this generates a text file that can immediately be re-loaded into R another day using:
source("myVars.r.txt")
Following up on what #Matt said, if you want a csv, try eol=",".
I tried with this:
write.csv(rbind(vector), file ="myfile.csv", row.names=FALSE)
Output is getting written column wise, but, with column names.
This one seems to be better:
write.table(rbind(vector), file = "myfile.csv", row.names =FALSE, col.names = FALSE,sep = ",")
Now, the output is being printed as:
1 1 1 1 1
in the .csv file, without column names.
write.table(vector, "myfile.csv", eol=" ", row.names=FALSE, col.names=FALSE)
You can simply change the eol to whatever you want. Here I've made it a space.
You can use cat to append rows to a file. The following code would write a vector as a line to the file:
myVector <- c("a","b","c")
cat(myVector, file="myfile.csv", append = TRUE, sep = ",", eol = "\n")
This would produce a file that is comma-separated, but with trailing commas on each line, hence it is not a CSV-file.
If you want a real CSV-file, use the solution given by #vamosrafa. The code is as follows:
write.table(rbind(myVector), file = "myfile.csv", row.names =FALSE, col.names = FALSE,sep = ",", append = TRUE)
The output will be like this:
"a","b","c"
If the function is called multiple times, it will add lines to the file.
One more:
write.table(as.list(vector), file ="myfile.csv", row.names=FALSE, col.names=FALSE, sep=",")
fwrite from data.table package is also another option:
library(data.table)
vector <- c(1,1,1,1,1)
fwrite(data.frame(t(vector)),file="myfile.csv",sep=",",row.names = FALSE)

read.table creates too few rows, but readLines has the right number

I am trying to import a tab separated list into R.
It is 81704 rows long. However, read.table is only creating 31376. Here is my code:
population <- read.table('population.txt', header=TRUE,sep='\t',na.strings = 'NA',blank.lines.skip = FALSE)
There are no # commenting anything out.
Here are the first few lines:
[1] "NAME\tSTATENAME\tPOP_2009" "Alabama\tAlabama\t4708708" "Abbeville city\tAlabama\t2934" "Adamsville city\tAlabama\t4782"
[5] "Addison town\tAlabama\t711"
When I read it raw, readLines gives the right number.
Any ideas are much appreciated!
Difficult to diagnose without seeing the input file, but the usual suspects are quotes and comment characters (even if you think there are none of the latter). You can try:
quote = "", comment.char = ""
as arguments to read.table() and see if that helps.
Check with count.fields what's in file:
n <- count.fields('population.txt', sep='\t', blank.lines.skip=FALSE)
Then you could check
length(n) # should be 81705 (it count header so rows+1), if yes then:
table(n) # show you what's wrong
Then you readLines your file and check rows with wrong number of fields. (e.g. x<-readLines('population.txt'); head(x[n!=6]))

Resources