read.table creates too few rows, but readLines has the right number

I am trying to import a tab separated list into R.
It is 81704 rows long, but read.table creates only 31376. Here is my code:
population <- read.table('population.txt', header = TRUE, sep = '\t', na.strings = 'NA', blank.lines.skip = FALSE)
There are no # characters commenting anything out.
Here are the first few lines:
[1] "NAME\tSTATENAME\tPOP_2009" "Alabama\tAlabama\t4708708" "Abbeville city\tAlabama\t2934" "Adamsville city\tAlabama\t4782"
[5] "Addison town\tAlabama\t711"
When I read it raw, readLines gives the right number.
Any ideas are much appreciated!

Difficult to diagnose without seeing the input file, but the usual suspects are quotes and comment characters (even if you think there are none of the latter). You can try:
quote = "", comment.char = ""
as arguments to read.table() and see if that helps.
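A minimal sketch of that suggestion, reusing the call from the question (only the quote and comment.char arguments are new):
# Same call as above, with quote and comment handling disabled
population <- read.table('population.txt', header = TRUE, sep = '\t',
                         na.strings = 'NA', blank.lines.skip = FALSE,
                         quote = "", comment.char = "")
nrow(population)  # should now be 81704 if quotes/comments were the cause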

Check what's in the file with count.fields:
n <- count.fields('population.txt', sep = '\t', blank.lines.skip = FALSE)
Then you could check:
length(n) # should be 81705 (it counts the header, so rows + 1); if yes, then:
table(n)  # shows you what's wrong
Then readLines your file and inspect the rows with the wrong number of fields, e.g. (the file shown above has 3 fields per row):
x <- readLines('population.txt')
head(x[n != 3])

Related

Is there a problem with read.table when using # as separator?

I'm trying to read in a text file (via read.table()) which looks like this:
1#2#3
4#5#6
7#8#9
The "#" should be used as the field's seperator.
My code:
test_data <- read.table("a_test_file.csv", sep= "#")
I'm only able to read in the first column (the values 1, 4, 7). Is there something I don't see, or any ideas for a workaround?
Edit: I just realized that # is used to insert comments, which are not treated as code text. Could it be that this character is somehow 'locked' for any purpose other than creating comments?
Set comment.char to something else, e.g.:
test_data <- read.table("a_test_file.csv", sep = "#", comment.char = "")
read.table comes with a comment.char argument set to # by default, which is why it discards everything after the first #.
read.table("a_test_file.csv", sep = "#", comment.char = "")
will do the job.
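You can reproduce the behaviour without a file, using read.table's text argument (a quick sketch):
# Default comment.char = "#" swallows everything after the first #
read.table(text = "1#2#3\n4#5#6\n7#8#9", sep = "#")
# With comment handling disabled, all three columns are parsed
read.table(text = "1#2#3\n4#5#6\n7#8#9", sep = "#", comment.char = "")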

Read a csv file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one serves as the column names. The fields are separated by commas, and the values are in quotation marks except for the first one, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; no worries.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (by the way, I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first column;
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, while the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect by hand.
Both read_delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it still splits the regex and msg columns in two, but ne and class are distinct.
I tried the input you provided with read.csv and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character needs to be escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with one row and six columns:
df <- read.csv("file.csv")
nrow(df)
# > 1
ncol(df)
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

R: Reading in data from tab-delimited file with duplicated tabs

I need to read a pretty large tab-delimited text file into R (about two gigabytes). The issue is that the file contains plenty of duplicated tabs (two subsequent tabs without anything in between). They seem to cause trouble in that (some?) of them are interpreted as the end of the line.
Since the data is huge, I uploaded a tiny fraction to illustrate the problem, please see the code below.
count.fields(file = "http://m.uploadedit.com/ba3c/1429271380882.txt", sep = "\t")
read.table(file = "http://m.uploadedit.com/ba3c/1429271380882.txt",
           header = TRUE, sep = "\t")
Thanks for your help.
Edit: The example does not perfectly illustrate the original problem. For the whole data, I should have a total of 6312 fields per row, but when I do count.fields() on it, rows are broken down in a 4571 - 1741 - 4571 - 1741 - ... pattern, i.e. there is an additional end of line after field number 4571.
It seems that there are \n strings randomly scattered throughout the column names. If we look for the first 5 or so occurrences of \n in the file using substr() and gregexpr(), the results seem strange:
library(readr) # useful pkg to read files
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
> substr(df, gregexpr("\n", df)[[1]][1]-10, gregexpr("\n", df)[[1]][1]+10)
[1] "1-024.Top \nAlleles\tCF"
> substr(df, gregexpr("\n", df)[[1]][2]-10, gregexpr("\n", df)[[1]][2]+10)
[1] "053.Theta\t\nCFF01-053."
> substr(df, gregexpr("\n", df)[[1]][3]-10, gregexpr("\n", df)[[1]][3]+10)
[1] "CFF01-072.\nTop Allele"
> substr(df, gregexpr("\n", df)[[1]][4]-10, gregexpr("\n", df)[[1]][4]+10)
[1] "CFF01-086.\nTheta\tCFF0"
> substr(df, gregexpr("\n", df)[[1]][5]-10, gregexpr("\n", df)[[1]][5]+10)
[1] "ype\tCFF01-\n303.Top Al"
So, the issue is apparently not two subsequent \t, but the randomly scattered line breaks. This obviously causes the read.table parser to break down.
But: if randomly scattered line breaks are the problem, let's remove them all and insert them at the correct position. The following code will correctly read the posted example data. You'd probably need to come up with a better regex for the ID_REF variable to automatically replace it with a \n before the ID string in case the ID string varies more than in the example data:
library(readr)
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
df <- gsub("\n", "", df)            # strip all (misplaced) line breaks
df <- gsub("abph1", "\nabph1", df)  # re-insert a break before each row ID
df <- read_delim(df, delim = "\t")  # parse the repaired string
Check your file for quote and comment characters. The default behavior is not to count tabs or other delimiters that are inside quotes (or after comments). The fact that your number of fields per line keeps alternating, and that the two values add up to the correct total, suggests there is a quote character after field 4570 on each line. So the parser reads the first 4570 fields of line 1, sees the quote, reads the rest of that line plus the first 4570 fields of the next line as a single field, then reads the remaining 1741 fields of line 2 as individual fields; repeat with lines 3 and 4, etc.
count.fields, read.table, and related functions have arguments to set the quoting and comment characters. Changing these to empty strings tells R to ignore quotes and comments; that is a quick way to test my theory.
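A quick sketch of that test, using the URL from the question (the expected field count comes from the edit above):
n <- count.fields("http://m.uploadedit.com/ba3c/1429271380882.txt",
                  sep = "\t", quote = "", comment.char = "")
table(n)  # if quoting was the culprit, the 4571/1741 alternation should disappear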
Well, I did not get to the root of the problem, but I figured out that you have duplicated row names in the table. I loaded your data into the R workspace like this:
to.load <- readLines("http://m.uploadedit.com/ba3c/1429271380882.txt")
data <- read.csv(text = to.load, sep = "\t", nrows = length(to.load) - 1, row.names = NULL)

How do I write a csv file in R, where my input is written to the file as a row?

This is a very simple issue and I'm surprised that there are no examples online.
I have a vector:
vector <- c(1,1,1,1,1)
I would like to write this to a csv file as a simple row:
write.csv(vector, file ="myfile.csv", row.names=FALSE)
When I open up the file I've just written, the csv is written as a column of values.
It's as if R decided to put in newlines after each number 1.
Forgive me for being ignorant, but I always assumed that the point of having comma-separated-values was to express a sequence from left to right, of values, separated by commas. Sort of like I just did; in a sense mimicking the syntax of written word. Why does R cling so desperately to the column format when a csv so clearly should be a row?
All linguistic philosophy aside, I have tried to use the transpose function. I've dug through the documentation. Please help! Thanks.
write.csv is designed for matrices, and R treats a single vector as a matrix with a single column. Try making it into a matrix with one row and multiple columns and it should work as you expect.
write.csv(matrix(vector, nrow=1), file ="myfile.csv", row.names=FALSE)
Not sure what you tried with the transpose function, but that should work too.
write.csv(t(vector), file ="myfile.csv", row.names=FALSE)
Here's what I did:
cat("myVar <- c(",file="myVars.r.txt", append=TRUE);
cat( myVar, file="myVars.r.txt", append=TRUE, sep=", ");
cat(")\n", file="myVars.r.txt", append=TRUE);
this generates a text file that can immediately be re-loaded into R another day using:
source("myVars.r.txt")
Following up on what @Matt said, if you want a csv, try eol=",".
I tried with this:
write.csv(rbind(vector), file = "myfile.csv", row.names = FALSE)
The output is now written row-wise, but with column names.
This one seems better:
write.table(rbind(vector), file = "myfile.csv", row.names = FALSE, col.names = FALSE, sep = ",")
Now the output is printed as:
1,1,1,1,1
in the .csv file, without column names.
write.table(vector, "myfile.csv", eol=" ", row.names=FALSE, col.names=FALSE)
You can simply change the eol to whatever you want. Here I've made it a space.
You can use cat to append rows to a file. The following code would write a vector as a line to the file:
myVector <- c("a","b","c")
# note: cat() has no eol argument, so eol = "\n" is passed through "..." as one
# more item; the separator before it produces the trailing comma, and the "\n"
# itself supplies the newline
cat(myVector, file = "myfile.csv", append = TRUE, sep = ",", eol = "\n")
This would produce a file that is comma-separated, but with trailing commas on each line, hence it is not a CSV-file.
If you want a real CSV file, use the solution given by @vamosrafa. The code is as follows:
write.table(rbind(myVector), file = "myfile.csv", row.names = FALSE, col.names = FALSE, sep = ",", append = TRUE)
The output will be like this:
"a","b","c"
If the function is called multiple times, it will add lines to the file.
One more:
write.table(as.list(vector), file ="myfile.csv", row.names=FALSE, col.names=FALSE, sep=",")
fwrite from the data.table package is another option:
library(data.table)
vector <- c(1, 1, 1, 1, 1)
fwrite(data.frame(t(vector)), file = "myfile.csv", sep = ",", row.names = FALSE)

Read tab delimited file with unusual characters, then write an exact copy

The problem
I have a tab delimited input file that looks like so:
Variable [1] Variable [2]
111 Something
Nothing 222
The first row contains the column names and the next two rows contain the column values. As you can see, the column names include both spaces and some tricky characters.
Now, what I want to do is to import this file into R and then output it again to a new text file, making it look exactly the same as the input. For this purpose I have created the following script (assuming that the input file is called "Test.txt"):
file <- "Test.txt"
x <- read.table(file, header = TRUE, sep = "\t")
write.table(x, file = "TestOutput.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
From this, I get an output that looks like this:
"Variable..1." "Variable..2."
"1" "111" "Something"
"2" "Nothing" "222"
Now, there are a couple of problems with this output:
1. The "[" and "]" signs have been converted to dots.
2. The spaces have been converted to dots.
3. Quote signs have appeared everywhere.
How can I make the output file look exactly the same as the input file?
What I've tried so far
Regarding problems one and two, I've tried specifying the column names by creating an internal vector, c("Variable [1]", "Variable [2]"), and then using the col.names option of read.table(). This gives me the exact same output. I've also tried different encodings, through the encoding option of read.table(). If I print the internally created vector mentioned above, the variable names appear as they should, so I guess there is a problem with the conversion between the "text -> R" and the "R -> text" phases of the process. That is, if I look at the data frame created by read.table() without any internally created vectors, the column names are wrong.
As for problem number three, I'm pretty much lost and haven't been able to figure out what I should try.
Given the following input file as test.txt:
Variable [1] Variable [2]
111 Something
Nothing 222
Where the columns are tab-separated, you can use the following code to create an exact copy:
a <- read.table(file = 'test.txt', check.names = FALSE, sep = '\t', header = TRUE,
                stringsAsFactors = FALSE)
write.table(x = a, file = 'test_copy.txt', quote = FALSE, row.names = FALSE,
            col.names = TRUE, sep = '\t')
check.names = FALSE keeps read.table from mangling the bracketed names, and quote = FALSE suppresses the quotes on output.
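To check the round trip, a small sketch (file names as above):
identical(readLines('test.txt'), readLines('test_copy.txt'))  # TRUE for an exact copy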
