Read data file with mixed numbers and text in R

I have data files with the following layout:
the first 10 columns are numbers and the last column is text, all separated by spaces. The problem is that the text in the last column may itself contain spaces, so when I used read.table() I got the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 11 elements
What's the easiest way of reading the first 10 columns into a data matrix and the last column into a string vector? Should I use readLines() first and then process the lines?

If you cannot re-export or recreate your data files with different, non-whitespace separators, or with quotation marks around the last column to avoid the problem, you can use read.table(..., fill = TRUE) to read in a file with unequal columns, then combine columns 11+ with dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep = " ")) (or something like that), and then drop the now unwanted columns with dat[11:(ncol(dat) - 1)] <- NULL. Finally, you may need to trim the trailing whitespace from the eleventh column with trimws(dat$col11).
Note that fill = TRUE only looks at the first five lines of your file to determine the number of columns, so you may need to find the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).
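Putting those steps together, a minimal sketch (assuming fname holds the file path, the text starts at field 11, and the first five lines already exhibit the maximum field count; otherwise supply col.names as noted above):
dat <- read.table(fname, fill = TRUE)
dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep = " "))  # combine columns 11+
dat[11:(ncol(dat) - 1)] <- NULL   # drop the now-unwanted pseudo-columns
dat$col11 <- trimws(dat$col11)    # strip the padding added to short rows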

Hinted by the useful fill = TRUE option of the read.table() function, I used the following to solve my problem:
dat <- read.table(fname, fill = T)
dat <- dat[subset(1:nrow(dat),!((1:nrow(dat)) %in% (which(dat[,11]=="No") + 1))),]
The fill = TRUE option puts everything after the first space in the 11th column onto a new row (redundant rows that the original data do not have). The code above removes those redundant rows based on three assumptions: (1) the 11th column contains no more than 11 space-separated words, so that there is only one extra row after a line whose 11th column contains spaces (that is what the + 1 does); (2) we know which word the 11th column of such lines starts with (in my case it is "No"); and (3) keeping only the first word of the 11th column is sufficient (no ambiguity).

The following solved my problem:
nc <- max(count.fields(fname, sep = " "))
data <- read.table(fname, fill = T, col.names = paste0("V", seq_len(nc)), sep = " ", header = F)
Then the first 10 columns hold the numeric results I want, and the remaining nc - 10 columns can be combined into one string vector, as sketched below.
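For instance, a sketch of that last step, reusing data and nc from the code above:
num <- as.matrix(data[, 1:10])   # numeric data matrix
txt <- trimws(do.call(paste, c(data[11:nc], sep = " ")))   # one string vector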
The most helpful post is:
How can you read a CSV file in R with different number of columns

You could reformat your file before reading it in R.
For example, using perl in a terminal:
perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv
This replaces every space that is preceded by a digit with a comma.
Then read it into R using read.csv:
df = read.csv("myfile.csv")
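If you'd rather stay in R, a similar one-off substitution can be done with readLines() and gsub() before parsing (a sketch using the same hypothetical file names; perl = TRUE enables the lookbehind):
lines <- readLines("myfile.txt")
# replace every space preceded by a digit with a comma, as the perl one-liner does
writeLines(gsub("(?<=[0-9]) ", ",", lines, perl = TRUE), "myfile.csv")
df <- read.csv("myfile.csv")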

Related

Read a CSV file with quotation marks and regex in R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one serves as the column names. Everything is separated by commas, and all values are in quotation marks except the first one, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know which locale or encoding the file was created with (by the way, I'm not French, so I'm not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first column
then with
library(readr)
df <- read_delim('file.csv',
                 delim = ",",
                 quote = "",
                 escape_double = FALSE,
                 escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, and the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines() and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect.
Both read.delim and fread give warning messages; I can include them if they might be useful.
Update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried with the input you provided using read.csv and had no problems; each column is accessible when subsetting. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character has to be escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example, I definitely get a data frame with 1 row and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# > 1
ncol(df)
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

R: fread multiple files with different decimal separators

When reading a CSV file via fread() and using colClasses to read the columns as numeric, I am having trouble with data that uses commas instead of dots. Since the data files have different origins, some use "." and some use "," as the decimal separator:
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", colClasses = list(numeric = 1:2), dec = ",")
I have 2 problems:
I want to read both columns as numeric, so I tried using dec = ",". I now get an error: Column number 2 (colClasses[[1]][2]) is out of range [1,ncol=1]
So I changed to colClasses = list(numeric = 1), but I don't quite understand why.
Still, the first column turns out to be character instead of numeric.
How could I allow dec to be either "." or ",", since I don't know in advance which decimal separator any of the hundreds of files uses? I tried a vector, but that did not work. What am I missing? Thanks for any help!
It is not normal for a file to mix two different decimal separators, so you should question the source of the file first.
Nevertheless, if you have such a file, the correct way to read it is to import the comma-separated variables as strings and then convert them to numeric.
library(data.table)
dt <- data.table(a=c("1,4","2,0","4,5","3,5","6,9"),c=(10:14))
write.csv(dt,"dt.csv",row.names=F)
dcsv <- fread("dt.csv", dec = ".")
dcsv[, a:= as.numeric(gsub("\"", "", gsub(",", ".", a)))]
If you don't know whether a variable uses a comma or a dot as its decimal separator, you can loop over the columns, test whether a column is a string consisting only of digits and commas, and convert only the ones fulfilling that condition.
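A minimal sketch of that loop, assuming all columns were read as character (e.g. with colClasses = "character"; the helper name convert_decimal is made up for illustration):
library(data.table)
# convert a character column to numeric only if every non-missing value
# looks like a number that may use a comma as the decimal separator
convert_decimal <- function(x) {
  if (!is.character(x)) return(x)
  ok <- grepl("^-?[0-9]+([.,][0-9]+)?$", x[!is.na(x)])
  if (length(ok) > 0 && all(ok)) as.numeric(gsub(",", ".", x, fixed = TRUE)) else x
}
dcsv <- fread("dt.csv", colClasses = "character")
cols <- names(dcsv)
dcsv[, (cols) := lapply(.SD, convert_decimal)]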

Combining CSV files and splitting the column into 2 columns using R

I have 40 CSV files with only 1 column each. I want to combine all 40 files data into 1 CSV file with 2 columns.
The data (shown as an image in the original post) is a single column in which each row holds two space-separated numbers.
I want to split this column on spaces, combine all 40 CSV files into one, and preserve the number format as well.
I tried the code below, but the number format is not preserved and an extra third column gets added for negative numbers. Not sure why.
My code:
filenames <- list.files(path="C://R files", full.names=TRUE)
merged <- data.frame(do.call("rbind", lapply(filenames, read.csv, header = FALSE)))
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," ",fixed=FALSE))
write.csv(data, "export1.csv", row.names=FALSE, na="NA")
The output I got (shown as an image in the original post) puts the negative numbers into an extra column. I just want to split on spaces and get two columns with exactly the same number format as the input.
The problem is that the source data is delimited by:
one space when the second number is negative, and
two spaces when the second number is positive (space for the absent minus sign).
The trick is to split the string on one or more spaces:
data <- do.call("rbind", strsplit(as.character(trimws(merged$V1))," +",fixed=FALSE))
I'm a bit OCD about charsets, unreliable files, etc., so I tend to use patterns such as "[[:space:]]+" instead, since that catches whitespace variants rather than just the space " " or tab "\t".
(In regex-speak, the + says "one or more". Other modifiers include ? as zero or one, and * as zero or more.)
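A tiny demonstration of the difference, using made-up values in the same layout as the question:
x <- c("0.123  4.567", "0.123 -4.567")   # two spaces, then one space
do.call("rbind", strsplit(trimws(x), "[[:space:]]+"))
#      [,1]    [,2]
# [1,] "0.123" "4.567"
# [2,] "0.123" "-4.567"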

R: Reading in data from a tab-delimited file with duplicated tabs

I need to read a pretty large tab-delimited text file into R (about two gigabytes). The issue is that the file contains plenty of duplicated tabs (two subsequent tabs without anything in between). They seem to cause trouble in that (some?) of them are interpreted as the end of the line.
Since the data is huge, I uploaded a tiny fraction to illustrate the problem, please see the code below.
count.fields(file = "http://m.uploadedit.com/ba3c/1429271380882.txt", sep = "\t")
read.table(file = "http://m.uploadedit.com/ba3c/1429271380882.txt",
           header = TRUE, sep = "\t")
Thanks for your help.
Edit: The example does not perfectly illustrate the original problem. For the whole data, I should have a total of 6312 fields per row, but when I run count.fields() on it, the rows are broken into a 4571 - 1741 - 4571 - 1741 - ... pattern, i.e. there is an additional end of line after field number 4571.
It seems that there are \n characters randomly scattered throughout the column names. If we look at the first 5 or so occurrences of \n in the file using substr() and gregexpr(), the results look strange:
library(readr) # useful pkg to read files
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
> substr(df, gregexpr("\n", df)[[1]][1]-10, gregexpr("\n", df)[[1]][1]+10)
[1] "1-024.Top \nAlleles\tCF"
> substr(df, gregexpr("\n", df)[[1]][2]-10, gregexpr("\n", df)[[1]][2]+10)
[1] "053.Theta\t\nCFF01-053."
> substr(df, gregexpr("\n", df)[[1]][3]-10, gregexpr("\n", df)[[1]][3]+10)
[1] "CFF01-072.\nTop Allele"
> substr(df, gregexpr("\n", df)[[1]][4]-10, gregexpr("\n", df)[[1]][4]+10)
[1] "CFF01-086.\nTheta\tCFF0"
> substr(df, gregexpr("\n", df)[[1]][5]-10, gregexpr("\n", df)[[1]][5]+10)
[1] "ype\tCFF01-\n303.Top Al"
So the issue is apparently not two subsequent \t characters but the randomly scattered line breaks, which obviously cause the read.table parser to break down.
But if randomly scattered line breaks are the problem, we can remove them all and re-insert them at the correct positions. The following code correctly reads the posted example data. You'd probably need to come up with a better regex for the ID_REF variable to automatically place a \n before the ID string in case the ID strings vary more than in the example data:
library(readr)
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")
df <- gsub("\n", "", df)
df <- gsub("abph1", "\nabph1", df)
df <- read_delim(df, delim = "\t")
Check your file for quote and comment characters. The default behavior is not to count tabs or other delimiters that are inside quotes (or after comments). The fact that your number of fields per line keeps alternating, and that the two values add up to the correct total, suggests that you have a quote character after field 4570 on each line. So R reads the first 4570 fields of the first line, sees the quote, and reads the rest of that line plus the first 4570 fields of the next line as a single field; it then reads the remaining 1741 fields of the second line as individual fields; and so on for lines 3 and 4, etc.
The count.fields and read.table and related functions have arguments to set the quoting and comment characters. Setting these to empty strings tells R to ignore quotes and comments; that is a quick way to test my theory.
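For example, a quick test of that theory on the posted sample (a sketch; run it against your full file): turn quoting and comments off and check whether every line now has the expected 6312 fields.
count.fields(file = "http://m.uploadedit.com/ba3c/1429271380882.txt",
             sep = "\t", quote = "", comment.char = "")
# if the theory holds, the alternating 4571/1741 pattern disappears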
Well, I did not get to the root of the problem, but I found that you have duplicated row names in the table. I loaded your data into the R workspace like this:
to.load = readLines("http://m.uploadedit.com/ba3c/1429271380882.txt")
data = read.csv(text = to.load, sep = "\t", nrows=length(to.load) - 1, row.names=NULL)

read.table - Having problems with the percentage character

I am trying to read in a tab-separated file (available online, using ftp) using read.table() in R.
The issue seems to be that in the third column, the character string for some of the rows contains characters such as the apostrophe ' and the percent sign %, for example Capital gains 15.0% and Bob's earnings:
810| 13 | Capital gains 15.0% |
170| -20 | Bob’s earnings|
100| 80 | Income|
To handle the apostrophe character, I used the following syntax:
df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T)
Yet the syntax shown above does not handle the lines where the problematic third column contains a string with the percent character, so the entire remainder of the table gets jammed into one field.
I also tried to ignore the third column altogether using colClasses: df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T, colClasses=c("numeric", "numeric", "NULL")), but the function still falls over on the lines where the percent character is present in that third column.
Any ideas on how to address the issue?
Instead of the read.table wrapper, use the scan function. You can control everything, and it is faster, especially if you use the "what" option to tell R what you are scanning.
dat <- scan(file = "data.tsv", # substitute your url for the file
            what = list(numeric(), numeric(), character()), # describe the columns
            skip = 0, # lines to skip
            sep = "|") # separator
Instead of file, use your url.
In this case scan expects two columns of numbers followed by a column of text.
Type
?scan
to get the complete list of options.
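For completeness, a sketch of assembling scan()'s result into a data frame (scan returns a list with one vector per column; the column names here are made up):
dat <- scan(file = "data.tsv",
            what = list(numeric(), numeric(), character()),
            sep = "|")
names(dat) <- c("V1", "V2", "V3")
df <- as.data.frame(dat, stringsAsFactors = FALSE)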
