How to read data with many blank fields in R

I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
id V1 V2 V3 V4 V5
1 1 A 1 NA NA NA
2 2 B 2 NA NA NA
But I expect this:
df
id V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.

I can't get it to work with read.table, so how about parsing the string manually:
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: We split the string on "\t"; the first occurrence of "\n" tells us how many columns we have. We then tidy up the entries by removing the line break characters "\n", reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess guess the data type of every column.
For good measure we can roll everything into a function
read.my.data <- function(s) {
entries <- unlist(str_split(s, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2

data.table's fread() had no problem reading the string, but your data seems to have one \t too many (after each \n), which creates an extra column.
It is probably best practice to fix this in the export that creates your files.
If that is not possible, you can adjust fread()'s arguments to get the desired output.
Here we use drop to delete the first column, which was created by the extra \t.
To get the right column names back, we read the first line of the file again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread( string,
drop = 1,
fill = TRUE,
col.names = as.matrix( data.table::fread(string, nrows = 1, header = FALSE))[1,] )
ID V1 V2 V3 V4 V5
1: 1 A NA NA NA 1
2: 2 B NA NA NA 2
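If the export cannot be changed, another way to deal with the surplus tab is to strip it before parsing. The following is only a minimal base-R sketch, assuming every data line carries exactly one extra leading tab and that the first real field is never legitimately empty; the names raw, lines and cleaned are illustrative and not part of the original code.

```r
# Same sample string as above; for a real file the lines would come from readLines().
raw <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

# Split into lines and strip a single leading tab where present.
# The header line has no leading tab, so it is left untouched.
lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]
cleaned <- sub("^\t", "", lines)

# Header and data rows now both have six fields; empty fields are read as NA.
df <- read.delim(text = cleaned, header = TRUE, na.strings = c("", "NA"))
df
```

Because the header and the data rows then have the same number of fields, no drop or col.names adjustment should be needed.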

As Quar already mentioned in his/her comment, your file has an extra tab at the beginning of every data line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID V1 V2 V3 V4 V5
1 A 1
2 B 2
That would be ok if the additional first column contained unique row names.
So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced that file) or 2. fix the row name issue.
Here is my suggestion using the second option:
As the data is tab-separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course, that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
Hope that helps.

Related

Replace NA by "No_"colname"_found"

I want to replace every NA in my dataframe with "No_[colname]_found".
(If there is a value, I want to keep it.) I know I can do it for every column separately but I have > 100 columns.
First, I tried replacing every NA in my dataframe with the colname. I know how to add "No_" and "_found" (by using paste).
This is what I have tried so far without success:
DF <- apply(DF, 2, function(x){ifelse(is.na(x), colnames(DF)[x], x)})
DF <- apply(DF, 2, function(x){ifelse(is.na(x), colnames(x), x)})
DF <- apply(DF, 2, function(x){ifelse(is.na(x), colnames(DF[x]), x)})
With what I tried so far, I don't get error messages. But my NA values don't change into colname, they stay NAs.
We can try using lapply over the names of the input data frame:
df <- data.frame(v1=c(1,NA,3), v2=c(4,5,6), v3=c(NA,8,NA))
output <- data.frame(lapply(names(df), function(x) {
ifelse(is.na(df[[x]]), paste0("No_", x, "_found"), df[[x]])
}))
names(output) <- names(df)
df
v1 v2 v3
1 1 4 NA
2 NA 5 8
3 3 6 NA
output
v1 v2 v3
1 1 4 No_v3_found
2 No_v1_found 5 8
3 3 6 No_v3_found
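A variation on the same idea, shown only as a sketch: Map() hands each column to the function together with its name, and assigning into filled[] keeps the data frame structure and column names, so the separate names() step is not needed. The name filled is illustrative. As with the lapply version, any column that receives a replacement string becomes character.

```r
df <- data.frame(v1 = c(1, NA, 3), v2 = c(4, 5, 6), v3 = c(NA, 8, NA))

filled <- df
# Map() walks over the columns and their names in parallel
filled[] <- Map(
  function(col, nm) ifelse(is.na(col), paste0("No_", nm, "_found"), col),
  df, names(df)
)
filled
```

Columns without any NA (v2 here) keep their original type, because ifelse() never has to mix in a string.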

R DataFrame leading NA's shift

I've searched longer than I'd like to admit for shifting leading NA's to the end.
Got close with a few stack questions "Cut out outer NAs in R","Rotate a Matrix in R","na.locf remove leading NAs, keep others [closed]" as well as looking over na.trim function in zoo package. Essentially I want to turn this:
D <- matrix(c(1:9), 3)
D[2,1]<- NA
D[3,1]<- NA
D[3,2]<- NA
D <- as.data.frame(D)
into this:
D1 <- data.frame(V1 = c(1,5,9),
V2 = c(4,8,NA),
V3 = c(7,NA,NA))
Any help is as always, much appreciated!
Thanks,
You can use sort(..., na.last = T) within row-wise apply:
as.data.frame(t(apply(D, 1, sort, na.last = T)))
# V1 V2 V3
#1 1 4 7
#2 5 8 NA
#3 9 NA NA
Update
To avoid ordering non-NA entries, you can do:
# Revised sample data
D <- matrix(c(1:9), 3)
D[2,1]<- NA
D[3,1]<- NA
D[3,2]<- NA
D <- as.data.frame(D)
D[2,2:3] <- c(8, 5);
D;
# V1 V2 V3
#1 1 4 7
#2 NA 8 5
#3 NA NA 9
as.data.frame(t(apply(D, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
#V1 V2 V3
#1 1 4 7
#2 8 5 NA
#3 9 NA NA

Checking if a column name exists in another dataset

So I have two different datasets and I am trying to check if a column name has a duplicate column name in another data set. For example:
V1 V2 V3
1 2 3
as one data set and
V4 V6 V1 V2
NA NA NA NA
And I am trying to make it so the second data set is like this
V4 V6 V1 V2
NA NA 1 NA
where only the minimum value in the original data set copies over, if that makes sense. I have tried using this function:
if(ncol((Session1t[grep(temp1, names(Session1t))])) != 0)
But this is not working. It returns the same value regardless of what is input. After entering the if statement I then work to copy over only the column that I want, and I have that figured out; I just cannot get the if statement to work effectively.
We can use ifelse and %in% to match column names and replace NA with 1.
# Create example data frame D1
D1 <- read.table(text = "V1 V2 V3
1 2 3",
header = TRUE)
# Create example data frame D2
D2 <- read.table(text = "V4 V6 V1 V2
NA NA NA NA",
header = TRUE)
# Replace NA with 1 if column names match
D2[1, ] <- ifelse(names(D2) %in% names(D1), 1, NA)
D2
# V4 V6 V1 V2
# 1 NA NA 1 1
Another option is intersect (here df1 is the source data set and df2 the one being filled):
nm1 <- intersect(names(df1), names(df2))
df2[nm1] <- df1[nm1]
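Applied to the example data from the question, the intersect approach might look like the following sketch. The data frames are rebuilt here so the block runs on its own; shared is an illustrative name. Note that this copies every matching column (V1 and V2), not only the minimum value the question asks about.

```r
D1 <- data.frame(V1 = 1, V2 = 2, V3 = 3)
D2 <- data.frame(V4 = NA, V6 = NA, V1 = NA, V2 = NA)

# Test for a single column name, as the if statement in the question tries to do
"V1" %in% names(D2)

# Column names present in both data sets
shared <- intersect(names(D1), names(D2))

# Copy the values of the shared columns from D1 into D2
D2[shared] <- D1[shared]
D2
```

To copy only one column, shared could be narrowed to that single name before the assignment.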

"fill" missing columns for fread() [duplicate]

Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table",
type = "source")
Original answer
This doesn't answer your question about fread: That question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike fread, you will have to help these functions out a little by providing them with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
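For comparison, count.fields can also be used with base read.csv. read.table is documented to determine the number of columns from the first five lines of input, so when a wider line could appear later in a file, passing explicit col.names is one way to make sure no fields are cut off. The sketch below assumes the same sample lines as above; n_fields, n_cols and ragged are illustrative names.

```r
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2',
           '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
writeLines(myvec, x)

# Number of comma-separated fields on each line
n_fields <- count.fields(x, sep = ",")
n_cols <- max(n_fields)

# Supplying col.names fixes the column count up front; fill = TRUE pads short rows with NA
ragged <- read.csv(x, header = FALSE, fill = TRUE,
                   col.names = paste0("V", seq_len(n_cols)))
ragged
```

This does not match fread for speed; it only illustrates how the expected column count can be supplied explicitly.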
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but without padding with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) when sep2 is implemented; not filled in separate columns as read.csv can do.
