Read a data set with an unequal number of columns in R

I have a .csv data set which is separated by "," and has about 5,000 rows and 5 columns.
However, some rows also contain extra commas, for example:
2660,11-01-2016,70.75,05-06-2013,I,,,
4080,26-02-2016,59.36,,D
Thus, when I try to read it with read_delim(), it throws warnings, although the result seems fine, for example:
Warning: 7 parsing failures.
   row expected  actual    file
1  309 5 columns 8 columns 'data/my_data.csv'
2  523 5 columns 7 columns 'data/my_data.csv'
3  588 5 columns 8 columns 'data/my_data.csv'
4 1661 5 columns 9 columns 'data/my_data.csv'
5 1877 5 columns 7 columns 'data/my_data.csv'
Is there any way for me to tackle this problem?
I guess I could use read_lines() to process the file line by line and then turn the result into a data frame.
Do you have any other ways to deal with such a situation?

1) read.table with fill=TRUE Using fill=TRUE with read.table results in no warnings:
Lines <- "2660,11-01-2016,70.75,05-06-2013,I,,,
4080,26-02-2016,59.36,,D"
# replace text = Lines with your filename
read.table(text = Lines, sep = ",", fill = TRUE)
giving:
V1 V2 V3 V4 V5 V6 V7 V8
1 2660 11-01-2016 70.75 05-06-2013 I NA NA NA
2 4080 26-02-2016 59.36 D NA NA NA
2) replace first 4 commas with semicolons Another approach would be:
# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
for(i in 1:4) L <- sub(",", ";", L)
read.table(text = L, sep = ";")
giving:
V1 V2 V3 V4 V5
1 2660 11-01-2016 70.75 05-06-2013 I,,,
2 4080 26-02-2016 59.36 D
3) remove commas at end of lines Another possibility is to remove commas at the end of lines. (If you are on Windows then sed is in the Rtools distribution.)
read.table(pipe("sed -e 's/,*$//' readtest.csv"), sep = ",")
giving:
V1 V2 V3 V4 V5
1 2660 11-01-2016 70.75 05-06-2013 I
2 4080 26-02-2016 59.36 D
3a) similar to (3) but without sed
# replace textConnection(Lines) with your filename
L <- readLines(textConnection(Lines))
read.table(text = sub(",*$", "", L), sep = ",")
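The read_lines() route the question mentions also works: strip the spurious trailing commas, then hand the cleaned text back to readr. A minimal sketch, using the question's two sample lines as a stand-in for reading "data/my_data.csv", and assuming readr >= 2.0 (where literal data must be wrapped in I()):

```r
library(readr)

# Stand-in for: lines <- read_lines("data/my_data.csv")
lines <- c("2660,11-01-2016,70.75,05-06-2013,I,,,",
           "4080,26-02-2016,59.36,,D")
lines <- sub(",+$", "", lines)               # strip spurious trailing commas
df <- read_csv(I(paste(lines, collapse = "\n")),
               col_names = FALSE, show_col_types = FALSE)
```

This parses cleanly as 5 columns with no warnings, at the cost of one extra pass over the lines.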

Related

How to read data with many blank fields in R

I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
id V1 V2 V3 V4 V5
1 1 A 1 NA NA NA
2 2 B 2 NA NA NA
But I expect this:
df
id V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.
I can't get it to work with read.table, so how about parsing the string manually:
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
  str_remove("\\n") %>%
  matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
  as.data.frame() %>%
  slice(-1) %>%
  mutate_if(is.factor, as.character) %>%
  mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: we split the string on "\t"; the position of the first entry containing "\n" tells us how many columns we have. We then tidy up the entries by removing the line-break characters "\n", reshape as a matrix and then as a data.frame, fix the header, and let readr::parse_guess infer the data type of every column.
For good measure we can roll everything into a function
read.my.data <- function(s) {
  entries <- unlist(str_split(s, "\t"))
  ncol <- str_which(entries, "\n")[1]
  entries %>%
    str_remove("\\n") %>%
    matrix(ncol = ncol, byrow = TRUE, dimnames = list(NULL, .[1:ncol])) %>%
    as.data.frame() %>%
    slice(-1) %>%
    mutate_if(is.factor, as.character) %>%
    mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
data.table's fread() had no problem reading in the string... but your data seems to have one \t too many (after each \n), which causes an extra column to be created.
It is probably best practice to fix this in the export that creates your files.
If this is not possible, you can adjust fread()'s arguments to get the desired output.
Here we use drop to delete the first column that was created due to the extra \t.
To get the right column names back, we read the first line of the file again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread(string,
                  drop = 1,
                  fill = TRUE,
                  col.names = as.matrix(fread(string, nrows = 1, header = FALSE))[1, ])
ID V1 V2 V3 V4 V5
1: 1 A NA NA NA 1
2: 2 B NA NA NA 2
As Quar already mentioned in their comment, your file has an extra tab at the beginning of every line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID V1 V2 V3 V4 V5
1 A 1
2 B 2
That would be OK if the additional first column contained unique row names.
So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced the file) or 2. fix the row-name issue.
Here is my suggestion using the second option:
As the data is tab-separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
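That last cleanup step can be sketched end to end: read with row.names=NULL as above, then drop the placeholder column that read.delim creates for the extra field.

```r
foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
df <- read.delim(text = foo, row.names = NULL)
df$row.names <- NULL          # drop the placeholder column for the extra tab
df
```

After the drop, the data frame has exactly the six intended columns ID, V1..V5.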
Hope that helps.

"fill" missing columns for fread() [duplicate]

Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
                 repos = "https://Rdatatable.github.io/data.table",
                 type = "source")
Original answer
This doesn't answer your question about fread: that question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike fread, you will have to help these functions out a little by providing them with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented; they would not be filled into separate columns as read.csv can do.
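The list-column idea described above can be approximated today without sep2: split each line yourself and store the ragged results in a data.table list column. A rough illustration of the shape such data would take (not fread's own behavior):

```r
library(data.table)

# Irregular sample rows, as in the question
lines <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2')

# Each cell of 'fields' is itself a character vector of varying length
dt <- data.table(fields = strsplit(gsub('"', '', lines), ","))
dt[, n := lengths(fields)]    # per-row field counts: 5, 7, 2
dt
```

Nothing is padded with NA; each row simply carries however many fields it had.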

R: read file, split into multiple dataframes

I have a file that is laid out in the following way:
# Query ID 1
# note
# note
tab delimited data across 12 columns
# Query ID 2
# note
# note
tab delimited data across 12 columns
I'd like to import this data into R so that each query is its own dataframe. Ideally as a list of dataframes with the query ID as the name of each item in the list. I've been searching for awhile, but I haven't seen a good way to do this. Is this possible?
Thanks
We have used commas instead of tabs to make the example easier to read, and have put the body of the file in a string; aside from making the obvious changes, the same approach applies to your file. First we use readLines to read in the file, then determine where the headers are and create a grp vector that has one element per line of the file and whose values are the header for that line. Finally we split the lines and apply Read to each group:
# test data
Lines <- "# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12"
L <- readLines(textConnection(Lines)) # L <- readLines("myfile")
isHdr <- grepl("Query", L)
grp <- L[isHdr][cumsum(isHdr)]
# Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, comment = "#")
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment = "#")
Map(Read, split(L, grp))
giving:
$`# Query ID 1`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
$`# Query ID 2`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
No packages needed.
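If you would rather have the query IDs without the leading "# " as the list names, one extra sub() call after the Map does it. A self-contained sketch of the same approach (shortened to 3 columns for brevity):

```r
# Same pattern as above, with a smaller sample
Lines <- "# Query ID 1
# note
1,2,3
# Query ID 2
# note
4,5,6"
L <- readLines(textConnection(Lines))      # L <- readLines("myfile")
isHdr <- grepl("Query", L)
grp <- L[isHdr][cumsum(isHdr)]
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment.char = "#")
out <- Map(Read, split(L, grp))
names(out) <- sub("^#\\s*", "", names(out))   # "Query ID 1" "Query ID 2"
```

The data frames themselves are unchanged; only the list names are tidied.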

How to read a csv file in which one column has space-separated numbers

How can I read a two-column csv file in which the columns are delimited by commas but the second column contains numbers separated by spaces? Another issue is that the number of values in the second column varies a lot, so I don't know how to accommodate that in R. I did this in MATLAB using a cell array, but I have no idea what to do next in R. If it helps, I'm only interested in the second column and I'd like to do operations on it (e.g., column sums, standard deviations, etc.).
Thanks in advance.
One option is to read the dataset using read.csv and then split the space-separated Col2 using cSplit from splitstackshape
library(splitstackshape)
cSplit(dat, 'Col2', sep=' ')
# Col1 Col2_1 Col2_2 Col2_3 Col2_4 Col2_5 Col2_6
#1: 25 35 -24 325 2 7 9
#2: 28 24 8 9 10 11 NA
Or you can change the , to ' ' and then use read.table to read the dataset with fill=TRUE
read.table(text = gsub(',', ' ', readLines('spaceSep.csv')),
           sep = '', skip = 1, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
#1 25 35 -24 325 2 7 9
#2 28 24 8 9 10 11 NA
If you don't need the first column,
dat1 <- read.table(text = gsub(".*,", "", readLines('spaceSep.csv')),
                   skip = 1, fill = TRUE)
and you can get the sum, sd etc for each column
colSums(dat1, na.rm=TRUE)
#V1 V2 V3 V4 V5 V6
#59 -16 334 12 18 9
apply(dat1,2, sd, na.rm=TRUE)
data
dat <- read.csv('spaceSep.csv')
dat <- structure(list(Col1 = c(25L, 28L), Col2 = structure(c(2L, 1L),
.Label = c(" 24 8 9 10 11", " 35 -24 325 2 7 9"), class = "factor")),
.Names = c("Col1", "Col2"), class = "data.frame", row.names = c(NA, -2L))
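If you only need per-row statistics on Col2, you can also skip the reshaping entirely and work on the split strings directly; a minimal base-R sketch using the same two sample rows:

```r
dat <- data.frame(Col1 = c(25L, 28L),
                  Col2 = c(" 35 -24 325 2 7 9", " 24 8 9 10 11"),
                  stringsAsFactors = FALSE)

# One numeric vector per row, of whatever length that row has
vals <- lapply(strsplit(trimws(dat$Col2), "\\s+"), as.numeric)
sapply(vals, sum)   # per-row sums: 354 62
sapply(vals, sd)    # per-row standard deviations
```

This avoids the NA padding that the wide reshapes above introduce.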
Just change the spaces to commas (for example, on Linux: sed 's/ /,/g' 1.txt, or just use a text editor and replace all spaces) and then read it via read.csv.

