R: Reading in data from tab-delimited file with duplicated tabs - r

I need to read a pretty large tab-delimited text file into R (about two gigabytes). The issue is that the file contains plenty of duplicated tabs (two consecutive tabs with nothing in between). They seem to cause trouble in that some of them (all?) are interpreted as the end of a line.
Since the data is huge, I uploaded a tiny fraction to illustrate the problem, please see the code below.
count.fields(file = "http://m.uploadedit.com/ba3c/1429271380882.txt", sep = "\t")
read.table(file = "http://m.uploadedit.com/ba3c/1429271380882.txt",
header = TRUE, sep = "\t")
Thanks for your help.
Edit: The example does not perfectly illustrate the original problem. For the full data, I should have a total of 6312 fields per row, but when I run count.fields() on it, rows are broken into a 4571 - 1741 - 4571 - 1741 - ... pattern, i.e. there is an additional end of line after field number 4571.

It seems that there are \n strings randomly scattered throughout the column names. If we look for the first 5 or so occurrences of \n in the file using substr() and gregexpr(), the results seem strange:
library(readr) # useful pkg to read files
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
> substr(df, gregexpr("\n", df)[[1]][1]-10, gregexpr("\n", df)[[1]][1]+10)
[1] "1-024.Top \nAlleles\tCF"
> substr(df, gregexpr("\n", df)[[1]][2]-10, gregexpr("\n", df)[[1]][2]+10)
[1] "053.Theta\t\nCFF01-053."
> substr(df, gregexpr("\n", df)[[1]][3]-10, gregexpr("\n", df)[[1]][3]+10)
[1] "CFF01-072.\nTop Allele"
> substr(df, gregexpr("\n", df)[[1]][4]-10, gregexpr("\n", df)[[1]][4]+10)
[1] "CFF01-086.\nTheta\tCFF0"
> substr(df, gregexpr("\n", df)[[1]][5]-10, gregexpr("\n", df)[[1]][5]+10)
[1] "ype\tCFF01-\n303.Top Al"
So, the issue is apparently not two subsequent \t, but the randomly scattered line breaks. This obviously causes the read.table parser to break down.
But if randomly scattered line breaks are the problem, we can remove them all and re-insert them at the correct positions. The following code correctly reads the posted example data. You would probably need a better regex for the ID_REF variable to automatically insert a \n before the ID string, in case the IDs vary more than in the example data:
library(readr)
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
df <- gsub("\n", "", df)             # strip ALL line breaks, good and bad
df <- gsub("abph1", "\nabph1", df)   # re-insert one before each row ID
df <- read_delim(df, delim = "\t")
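The same idea as a self-contained toy sketch (the data here is invented to mimic the structure of the real file, with a stray \n inside a field):

```r
# Toy data: a stray \n has been inserted inside a field of row 1
txt <- "ID_REF\tS1\tS2\nabph1.1\tTop\nAlleles\t0.9\nabph1.2\tAB\t0.8"

clean <- gsub("\n", "", txt)              # strip ALL line breaks, good and bad
clean <- gsub("abph1", "\nabph1", clean)  # re-insert one before each row ID
df <- read.table(text = clean, header = TRUE, sep = "\t")
```

After the round trip, df has 2 rows and 3 columns, with the broken "Top\nAlleles" value rejoined as "TopAlleles".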

Check your file for quote and comment characters. The default behaviour is not to count tabs or other delimiters that are inside quotes (or after a comment character). The fact that your number of fields per line keeps alternating, and that the two values add up to the correct total, suggests that you have a quote character after field 4570 on each line. So R reads the first 4570 fields of line one, sees the quote, and reads the rest of that line plus the first 4570 fields of the next line as a single field; it then reads the remaining 1741 fields of the second line individually, and the same thing happens with lines 3 and 4, etc.
The count.fields and read.table functions (and related ones) have arguments to set the quoting and comment characters. Setting these to empty strings tells R to ignore quotes and comments, which is a quick way to test this theory.
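A minimal sketch of that quick test (the one-line file here is invented for illustration):

```r
# A tab inside quotes is not counted as a separator by default
tmp <- tempfile()
writeLines('a\t"b\tc"\td', tmp)

count.fields(tmp, sep = "\t")                                  # 3 fields
count.fields(tmp, sep = "\t", quote = "", comment.char = "")   # 4 fields
```

If disabling quoting changes the counts like this, a stray quote character is the culprit.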

Well, I did not get to the root of the problem, but I figured out that you have duplicated row names in the table. I loaded your data into the R workspace like this:
to.load = readLines("http://m.uploadedit.com/ba3c/1429271380882.txt")
data = read.csv(text = to.load, sep = "\t", nrows=length(to.load) - 1, row.names=NULL)

Related

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as column names. All values are separated by commas and wrapped in quotation marks except for the first one, and I think that is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and unselect the ones I don't want later, it's no worries.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (by the way, I'm not French, so I'm not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first one.
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines() and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect by hand.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried with the input you provided using read.csv and had no problems; each column is accessible when subsetting. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# > 1
ncol(df)
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
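For reference, the same check can be run without writing a file, using read.csv's text argument on the two sample lines from the question:

```r
# The two lines from the question, as an in-memory character vector
csv <- c('ne,class,regex,match,event,msg',
         'BOU2-P-2,"tengigabitethernet","tengigabitethernet(?\'connector\'\\d{1,2}\\/\\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"')

df <- read.csv(text = csv, stringsAsFactors = FALSE)
df[, c("class", "msg")]  # the two columns of interest
```

The default quote handling copes with the embedded comma inside the quoted msg field, so all six columns come through intact.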

read data file with mixed numbers and texts in r

I have data files that contain the following:
The first 10 columns are numbers; the last column is text. They are separated by spaces. The problem is that the text in the last column may also contain spaces, so when I use read.table() I get the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 11 elements
What's the easiest way of reading the first 10 columns into a data matrix and the last column into a string vector? Should I use readLines() first and then process it?
If you cannot re-export or recreate your data files with a different, non-whitespace separator, or with quotation marks around the last column, you can use read.table(..., fill = TRUE) to read in a file with unequal columns, then combine columns 11 and up with dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep = " ")) (or something like that), drop the now-unwanted columns with dat[11:(ncol(dat) - 1)] <- NULL, and finally trim any whitespace from the end of the eleventh column with trimws(dat$col11).
Note that fill only considers the first five lines of your file, so you may need to find out the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).
Hinted by the useful fill = TRUE option of read.table() function, I used the following to solve my problem:
dat <- read.table(fname, fill = T)
dat <- dat[!((1:nrow(dat)) %in% (which(dat[, 11] == "No") + 1)), ]
The fill = TRUE option puts everything after the first space of the 11th column onto a new row (redundant rows that the original data does not have). The code above removes the redundant rows based on three assumptions: (1) the number of space-separated words in the 11th column is no more than 11, so we know there is only one extra row of text after a line whose 11th column contains spaces (that is what the +1 does); (2) we know that the 11th column of such lines starts with a certain word (in my case "No"); (3) keeping only the first word of the 11th column is sufficient (no ambiguity).
The following solved my problem:
nc <- max(count.fields(fname, sep = " "))
data <- read.table(fname, fill = T, col.names = paste0("V", seq_len(nc)), sep = " ", header = F)
Then the first 10 columns will be the numeric results I want and the remaining nc-10 columns can be combined into one string vector.
The most helpful post is:
How can you read a CSV file in R with different number of columns
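That approach can be sketched end-to-end on invented data (10 numeric columns, free text with spaces at the end):

```r
tmp <- tempfile()
writeLines(c("1 2 3 4 5 6 7 8 9 10 some text here",
             "11 12 13 14 15 16 17 18 19 20 more words"), tmp)

nc  <- max(count.fields(tmp, sep = " "))   # widest line decides the column count
dat <- read.table(tmp, fill = TRUE, sep = " ", header = FALSE,
                  col.names = paste0("V", seq_len(nc)), stringsAsFactors = FALSE)

nums <- as.matrix(dat[, 1:10])                              # the numeric block
txt  <- trimws(do.call(paste, c(dat[, 11:nc], sep = " ")))  # re-joined text
```

Short lines are padded with empty character fields, which the trimws() call removes after pasting.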
You could reformat your file before reading it in R.
For example, using perl in a terminal:
perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv
This replaces every space preceded by a digit with a comma.
Then read it into R using read.csv:
df = read.csv("myfile.csv")
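If you'd rather stay inside R, the same substitution can be done with gsub() and a perl-style lookbehind (the file contents below are invented for illustration):

```r
tmp <- tempfile()
writeLines(c("1 2 3 hello world",
             "4 5 6 foo bar"), tmp)

# Replace every space that follows a digit with a comma, as the perl one-liner does
txt <- gsub("(?<=[0-9]) ", ",", readLines(tmp), perl = TRUE)
df  <- read.csv(text = txt, header = FALSE, stringsAsFactors = FALSE)
```

Spaces inside the trailing text are untouched, so the free-text column survives as a single field.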

Read.Table - Having problems with percentage character

I am trying to read in a tab-separated file (available online, using ftp) using read.table() in R.
The issue seems to be that, for some rows, the character string in the third column contains characters such as the apostrophe ' and the percentage sign %, for example Capital gains 15.0% and Bob's earnings:
810| 13 | Capital gains 15.0% |
170| -20 | Bob’s earnings|
100| 80 | Income|
To handle the apostrophe character, I was using the following syntax
df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T)
Yet the syntax shown above does not handle the lines where the problematic third column contains a string with the percentage character, so the function jams the entire remainder of the table into one field.
I also tried to ignore the third column altogether by using colClasses, df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T, colClasses=c("numeric", "numeric", "NULL")), but the function still falls over on the lines where the percentage character is present in that third column.
Any ideas on how to address the issue?
Instead of the read.table wrapper, use the scan function. You can control everything, and it is faster, especially if you use the what argument to tell R what you are scanning.
dat <- scan(file = "data.tsv",     # substitute your url here
            what = list(0, 0, ""), # two numeric columns, then a character column
            skip = 0,              # lines to skip
            sep = "|")             # separator
Instead of a file name, use your url.
In this case scan expects 2 numeric columns followed by a character column.
Type
?scan
to get the complete list of options.
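A runnable sketch on the sample rows from the question. Note that the trailing | produces an empty fourth field, and quote = "" keeps the apostrophe in Bob's literal:

```r
tmp <- tempfile()
writeLines(c("810| 13 | Capital gains 15.0% |",
             "170| -20 | Bob's earnings|"), tmp)

dat <- scan(tmp, what = list(0, 0, "", ""),  # 2 numeric + 2 character fields
            sep = "|", quote = "", strip.white = TRUE)
dat[[3]]
# > "Capital gains 15.0%" "Bob's earnings"
```

With quoting disabled, the % and ' characters pass through as ordinary text.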

Read tab delimited file with unusual characters, then write an exact copy

The problem
I have a tab delimited input file that looks like so:
Variable [1] Variable [2]
111 Something
Nothing 222
The first row contains the column names and the next two rows contain column values. As you can see, the column names include both spaces and some tricky characters.
Now, what I want to do is to import this file into R and then output it again to a new text file, making it look exactly the same as the input. For this purpose I have created the following script (assuming that the input file is called "Test.txt"):
file <- "Test.txt"
x <- read.table(file, header = TRUE, sep = "\t")
write.table(x, file = "TestOutput.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
From this, I get an output that looks like this:
"Variable..1." "Variable..2."
"1" "111" "Something"
"2" "Nothing" "222"
Now, there are a couple of problems with this output.
The "[" and "]" signs have been converted to dots.
The spaces have been converted to dots.
Quote signs have appeared everywhere.
How can I make the output file look exactly the same as the input file?
What I've tried so far
Regarding problems one and two, I've tried specifying the column names by creating a vector, c("Variable [1]", "Variable [2]"), and then using the col.names option of read.table(). This gives me the exact same output. I've also tried different encodings through the encoding option of read.table(). If I print the vector mentioned above, the variable names appear as they should, so I guess there is a problem in the conversion between the "text -> R" and the "R -> text" phases of the process. That is, if I look at the data frame created by read.table() without any manually created vectors, the column names are wrong.
As for problem number three, I'm pretty much lost and haven't been able to figure out what I should try.
Given the following input file as test.txt:
Variable [1] Variable [2]
111 Something
Nothing 222
Where the columns are tab-separated you can use the following code to create an exact copy:
a <- read.table(file='test.txt', check.names=F, sep='\t', header=T,
stringsAsFactors=F)
write.table(x=a, file='test_copy.txt', quote=F, row.names=F,
col.names=T, sep='\t')
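To verify the round trip, here is the same answer as a self-contained sketch with a byte-for-byte comparison:

```r
tmp_in  <- tempfile()
tmp_out <- tempfile()
writeLines(c("Variable [1]\tVariable [2]",
             "111\tSomething",
             "Nothing\t222"), tmp_in)

# check.names = FALSE keeps "Variable [1]" intact; quote = FALSE on output
# prevents the added quotation marks
a <- read.table(tmp_in, check.names = FALSE, sep = "\t", header = TRUE,
                stringsAsFactors = FALSE)
write.table(a, tmp_out, quote = FALSE, row.names = FALSE,
            col.names = TRUE, sep = "\t")

identical(readLines(tmp_in), readLines(tmp_out))
# > TRUE
```

check.names = FALSE addresses the brackets-and-spaces problem, row.names = FALSE the spurious "1"/"2" column, and quote = FALSE the quotation marks.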

read.table creates too few rows, but readLines has the right number

I am trying to import a tab separated list into R.
It is 81704 rows long, but read.table only creates 31376 rows. Here is my code:
population <- read.table('population.txt', header=TRUE,sep='\t',na.strings = 'NA',blank.lines.skip = FALSE)
There are no # characters commenting anything out.
Here are the first few lines:
[1] "NAME\tSTATENAME\tPOP_2009" "Alabama\tAlabama\t4708708" "Abbeville city\tAlabama\t2934" "Adamsville city\tAlabama\t4782"
[5] "Addison town\tAlabama\t711"
When I read it raw, readLines gives the right number.
Any ideas are much appreciated!
Difficult to diagnose without seeing the input file, but the usual suspects are quotes and comment characters (even if you think there are none of the latter). You can try:
quote = "", comment.char = ""
as arguments to read.table() and see if that helps.
Check with count.fields what's in file:
n <- count.fields('population.txt', sep='\t', blank.lines.skip=FALSE)
Then you could check
length(n) # should be 81705 (it counts the header too, so rows + 1); if so, then:
table(n)  # shows you what's wrong
Then readLines your file and check the rows with the wrong number of fields, e.g. x <- readLines('population.txt'); head(x[n != 3]) (the file has 3 fields per line).
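Here is that workflow as a self-contained sketch on a toy file (the broken line is invented; the real file has 3 fields per line):

```r
tmp <- tempfile()
writeLines(c("NAME\tSTATENAME\tPOP_2009",
             "Alabama\tAlabama\t4708708",
             "Abbeville city\tAlabama",        # broken: only 2 fields
             "Addison town\tAlabama\t711"), tmp)

n <- count.fields(tmp, sep = "\t", blank.lines.skip = FALSE)
length(n)    # number of lines, header included
table(n)     # one line has 2 fields instead of 3

x <- readLines(tmp)
x[n != 3]    # inspect the offending line(s)
```

Comparing length(n) with the readLines count, and table(n) with the expected field count, pinpoints exactly which lines read.table is choking on.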
