Fill option for fread - r

Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However, fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?

Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table",
type = "source")
Original answer
This doesn't answer your question about fread: that question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike fread, you will have to help these functions out a little by providing them with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
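The same idea of fixing the column count up front also works in base R. The following is a minimal sketch rather than part of the original answer: `tmp`, `n_fields`, `max_cols` and `ragged` are illustrative names, and the file is a shortened copy of the sample data. Without `col.names`, `read.table` infers the number of columns from the first five lines only, so supplying the names explicitly is the safer choice for long ragged files.

```r
# Hypothetical sample file with ragged rows (a shortened copy of the sample data)
tmp <- tempfile(fileext = ".txt")
writeLines(c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2'), tmp)

# count.fields() returns the number of fields found on each line
n_fields <- count.fields(tmp, sep = ",")
max_cols <- max(n_fields)

# Supplying col.names fixes the column count, and fill = TRUE pads short rows with NA
ragged <- read.table(tmp, sep = ",", fill = TRUE, header = FALSE,
                     col.names = paste0("V", seq_len(max_cols)),
                     stringsAsFactors = FALSE)
ragged
```

This reads the file twice (once to count fields, once to parse), so it trades some speed for robustness.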

Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Variable-length vectors could then be read into a list column where each cell is itself a vector, but without padding with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented, rather than filled into separate columns as read.csv can do.


R Transform number/character to be a superscript or subscript

I want to append a number (or character, it doesn't matter) as a superscript to an existing string.
This is what my initial dataframe looks like:
testframe = as.data.frame(c("A34", "B21", "C64", "D83", "E92", "F24"))
testframe$V2 = c(1,3,2,2,3,NA)
colnames(testframe)[1] = "V1"
V1 V2
1 A34 1
2 B21 3
3 C64 2
4 D83 2
5 E92 3
6 F24 NA
What I want to do now is use V2 as a kind of "footnote", i.e. as a superscript or subscript (it doesn't matter which). When there is no entry in V2, I just want to keep V1 as it is.
I found a similar question where I saw this answer:
> paste0("H", "\u2081", "O")
[1] "H₁O"
This is what I want, but my problem is that it has to be created automatically since I have way too many rows in my real dataframe.
I tried to add an extra column "V3" to enter the Superscripts and Subscripts Codes:
testframe$V3 = c("u2081", "u2083", "u2082", "u2082", "u2083", NA)
V1 V2 V3
1 A34 1 u2081
2 B21 3 u2083
3 C64 2 u2082
4 D83 2 u2082
5 E92 3 u2083
6 F24 NA <NA>
But when I try paste(testframe$V1, testframe$V3, sep = "\") it gives me an error. How can I use the \ in this case?
If the subscripts will always be digits 0 through 9, you can index into a vector of Unicode subscript digits:
subscripts <- c(
"\u2080",
"\u2081",
"\u2082",
"\u2083",
"\u2084",
"\u2085",
"\u2086",
"\u2087",
"\u2088",
"\u2089"
)
testframe$V1 <- paste0(
testframe$V1,
ifelse(
is.na(testframe$V2),
"",
subscripts[testframe$V2 + 1]
)
)
testframe
V1 V2
1 A34₁ 1
2 B21₃ 3
3 C64₂ 2
4 D83₂ 2
5 E92₃ 3
6 F24 NA
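If some footnote numbers have more than one digit, the lookup can be applied digit by digit. This is only a sketch under stated assumptions: `to_subscript` and `sub_digits` are hypothetical names, and the input is assumed to be small non-negative whole numbers (values that `as.character()` would print in scientific notation are not handled).

```r
# Unicode subscript digits 0 through 9, in order
sub_digits <- c("\u2080", "\u2081", "\u2082", "\u2083", "\u2084",
                "\u2085", "\u2086", "\u2087", "\u2088", "\u2089")

# Hypothetical helper: convert each number to a string of subscript digits;
# NA becomes an empty string so that nothing is appended
to_subscript <- function(n) {
  vapply(as.character(n), function(s) {
    if (is.na(s)) return("")
    digits <- as.integer(strsplit(s, "")[[1]])
    paste(sub_digits[digits + 1], collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

testframe <- data.frame(V1 = c("A34", "B21", "F24"),
                        V2 = c(1, 12, NA),
                        stringsAsFactors = FALSE)
testframe$V1 <- paste0(testframe$V1, to_subscript(testframe$V2))
testframe
```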

How to read data with many blank fields in R

I have a tab-delimited file that looks like this:
"ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
I use this code to read in the data:
df <- read.table("path/to/file",header=TRUE,fill=TRUE)
The result is this:
df
id V1 V2 V3 V4 V5
1 1 A 1 NA NA NA
2 2 B 2 NA NA NA
But I expect this:
df
id V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
I've tried sep="\t" and na.strings=c(""," ",NULL) but those don't help.
I can't get it to work with read.table, so how about parsing the string manually?
ss <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
library(tidyverse)
entries <- unlist(str_split(ss, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
Explanation: We split the string on "\t"; the first occurrence of "\n" tells us how many columns we have. We then tidy up the entries by removing the line break characters "\n", reshape as matrix and then as data.frame, fix the header, and let readr::parse_guess guess the data type of every column.
For good measure we can roll everything into a function
read.my.data <- function(s) {
entries <- unlist(str_split(s, "\t"))
ncol <- str_which(entries, "\n")[1]
entries %>%
str_remove("\\n") %>%
matrix(ncol = ncol, byrow = T, dimnames = list(NULL, .[1:ncol])) %>%
as.data.frame() %>%
slice(-1) %>%
mutate_if(is.factor, as.character) %>%
mutate_all(parse_guess)
}
and confirm
read.my.data(ss)
# ID V1 V2 V3 V4 V5
#1 1 A NA NA NA 1
#2 2 B NA NA NA 2
data.table's fread() had no problem reading in the string... but your data seems to have one \t too many (after each \n), which causes the creation of an extra column.
It is probably best practice to fix this in the export that creates your files.
If this is not possible, you can adjust fread()'s arguments to get the desired output.
Here we use drop to delete the first column that was created due to the extra \t.
To get the right column names back, we read the first line of the file again:
string <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
data.table::fread( string,
drop = 1,
fill = TRUE,
col.names = as.matrix( fread(string, nrows = 1, header = FALSE))[1,] )
ID V1 V2 V3 V4 V5
1: 1 A NA NA NA 1
2: 2 B NA NA NA 2
As Quar already mentioned in his/her comment, your file has an extra tab in the beginning of every line, so the number of column labels does not match the number of data fields:
> foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"
> cat(foo, "\n")
ID V1 V2 V3 V4 V5
1 A 1
2 B 2
That would be ok if the additional first column contained unique row names.
So there are two ways to address the problem: 1. remove the empty column (ideally by fixing the process that produced that file) or 2. fix the row name issue.
Here is my suggestion using the second option:
As the data is tab separated, I'd use read.delim, which is just read.table with reasonable defaults for this kind of file. Of course, that throws an error when used without some tweaking ("duplicate 'row.names' are not allowed"). To fix that, we need to tell it to use automatic row numbering. That way you get almost exactly what you want:
> read.delim(text=foo, row.names=NULL)
row.names ID V1 V2 V3 V4 V5
1 1 A NA NA NA 1
2 2 B NA NA NA 2
All that's left to do is get rid of the row.names column. Alternatively, you may want the ID column to be turned into row.names:
> read.delim(text=foo, row.names='ID')
row.names V1 V2 V3 V4 V5
1 A NA NA NA 1
2 B NA NA NA 2
Hope that helps.
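The first option mentioned above (removing the empty leading column) can also be sketched in base R when the export itself cannot be fixed. This is an illustration only, not part of the answers: `lines`, `n_fields`, `fixed_lines` and `df` are made-up names, and it assumes every data row starts with exactly one surplus tab.

```r
foo <- "ID\tV1\tV2\tV3\tV4\tV5\n\t1\tA\t\t\t\t1\n\t2\tB\t\t\t\t2"

# Split into lines and count tab-separated fields per line to confirm the mismatch
lines <- strsplit(foo, "\n", fixed = TRUE)[[1]]
n_fields <- lengths(strsplit(lines, "\t", fixed = TRUE))
n_fields  # the header has one field fewer than the data rows

# Drop a single leading tab where present (the header line has none, so it is untouched)
fixed_lines <- sub("^\t", "", lines)

# Read the repaired text; empty fields become NA
df <- read.delim(text = paste(fixed_lines, collapse = "\n"), na.strings = "")
df
```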

Filling missing string values in dataframe R

I'm having trouble with something that must be quite easy in R; I want to fill the missing values in a column (of a data.frame) with the corresponding values. So like this:
V1 V2
cat tree
cat NA
NA tree
dog house
NA house
dog NA
horse NA
NA car
horse car
The string corresponding to cat is tree, so "tree" must be filled in wherever there is an NA in the "cat group", and "house" must be filled in wherever there is an NA in the "dog group". (Originally I intended to take the first word in each group as the "leading" word to fill in. EDIT: it is better not to rely on the first one, in case the first value is itself NA.)
There are a lot of NA's in V1, and a few in V2, and I want to fill only the NA's of V2.
In SPSS it's done with the aggregate function, but I don't think the aggregate function in R is comparable in this case, or is it? Does anyone know how to do this?
Thanks!
The OP has requested that the missing values need to be filled in by group. So, the zoo::na.locf() approach might fail here.
There is a method called update join which can be used to fill in the missing values per group:
library(data.table) # version 1.10.4 used
setDT(DT)
DT[DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1], on = "V1", V2 := fillin][]
# V1 V2
# 1: 1 tree
# 2: 1 tree
# 3: 1 tree
# 4: 2 house
# 5: 2 house
# 6: 2 house
# 7: 3 lawn
# 8: 3 lawn
# 9: 4 NA
#10: 4 NA
#11: NA NA
#12: NA tree
Note that the input data have been supplemented to cover some corner cases.
Explanation
The approach consists of two steps. First, the values to be filled in are determined by group; then the update join modifies DT in place.
fill_by_group <- DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1]
fill_by_group
# V1 fillin
#1: 2 house
#2: 3 lawn
#3: 1 tree
#4: 4 NA
DT[fill_by_group, on = "V1", V2 := fillin][]
order(V2) ensures that any NA values are sorted last, so that first(V2) picks the correct value to fill in.
The update join approach has been benchmarked as the fastest method in another case.
Variant using na.omit()
docendo discimus suggested in a comment to use na.omit(). This can be used in the update join as well, replacing order()/first():
DT[DT[!is.na(V1), .(fillin = na.omit(V2)), by = V1], on = "V1", V2 := fillin][]
Note that na.omit(V2) works as well as na.omit(V2)[1] or first(na.omit(V2)), here.
Data
Edit: The OP has changed his originally posted data set substantially. As a quick fix, I've updated the sample data below to include cases where V1 is NA.
library(data.table)
DT <- fread(
"1 tree
1 NA
1 tree
2 house
2 house
2 NA
3 NA
3 lawn
4 NA
4 NA
NA NA
NA tree")
Note that the data given by the OP have been supplemented to cover three additional cases:
The first V2 value in each group is NA.
All V2 values in a group are NA.
V1 is NA.
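For readers without data.table, the same per-group fill can be approximated in base R with ave(). This is a sketch under assumptions, not the method benchmarked above: `df`, `fill_first` and `has_key` are hypothetical names, and the data mimic the string layout from the question, including a group whose V2 values are all NA and rows where V1 is NA.

```r
df <- data.frame(
  V1 = c("cat", "cat", NA, "dog", NA, "dog", "horse"),
  V2 = c("tree", NA, "tree", "house", "house", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace NAs in a group with its first non-NA value, if any
fill_first <- function(v) {
  known <- v[!is.na(v)]
  if (length(known) > 0) v[is.na(v)] <- known[1]
  v
}

# Only rows with a known V1 belong to a group; rows with V1 == NA are left untouched
has_key <- !is.na(df$V1)
df$V2[has_key] <- ave(df$V2[has_key], df$V1[has_key], FUN = fill_first)
df
```

A group with no known V2 value (horse here) stays NA, which matches the behaviour of the update join for group 4 above.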
You can use dplyr and try the following (this assumes each group has exactly one distinct non-NA value of V2):
library(dplyr)
mydata %>%
group_by(V1) %>%
mutate(V2 = unique(V2[!is.na(V2)]))
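You can use the following, which copies the value from the preceding row (so it assumes the row above each NA belongs to the same group and is not itself NA):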
mydata<-read.table(text="1 tree
1 NA
1 tree
2 house
2 house
2 NA")
mydata[is.na(mydata$V2),]$V2<-mydata[which(is.na(mydata$V2))-1,]$V2

combining text files in R with different separators

I am trying to read in and combine multiple text files into R. The issue with this is that I have been given some data where field separators between files are different (e.g. tab for one and commas for another). How could I combine these efficiently? An example of layout:
Data1 (tab):
v1 v2 v3 v4 v5
1 2 3 4 urban
4 5 3 2 city
Data2 (comma):
v1,v2,v3,v4,v5
5,6,7,8,rural
6,4,3,1,city
This example is obviously not real; the real data has nearly half a million points, so I cannot reshape the original files. The code I have used so far is:
filelist <- list.files(path = "~/Documents/", pattern='.dat', full.names=T)
data1 <- ldply(filelist, function(x) read.csv(x, sep="\t"))
data2 <- ldply(filelist, function(x) read.csv(x, sep=","))
This gives me the data both ways, which I then need to manually clean and combine. Is there a way of using sep that avoids this? Column names are the same among files. I know that stringr or other concatenating functions may be useful, but I also need to load the data in at the same time, and am unsure how to set this up within the read commands.
I would suggest using fread from the "data.table" package. It's fast, and does a pretty good job of automatically detecting a delimiter in a file.
Here's an example:
## Create some example files
cat('v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n', file = "file1.dat")
cat('v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n', file = "file2.dat")
## Get a character vector of the file names
files <- list.files(pattern = "\\.dat$") ## Use what you're already doing; pattern is a regex, not a glob
library(data.table)
lapply(files, fread)
# [[1]]
# v1 v2 v3 v4 v5
# 1: 1 2 3 4 urban
# 2: 4 5 3 2 city
#
# [[2]]
# v1 v2 v3 v4 v5
# 1: 5 6 7 8 rural
# 2: 6 4 3 1 city
## Fancy work: Bind it all to one data.table...
## with a column indicating where the file came from....
rbindlist(setNames(lapply(files, fread), files), idcol = TRUE)
# .id v1 v2 v3 v4 v5
# 1: file1.dat 1 2 3 4 urban
# 2: file1.dat 4 5 3 2 city
# 3: file2.dat 5 6 7 8 rural
# 4: file2.dat 6 4 3 1 city
You can also add an if clause to your function:
data = ldply(filelist, function(x) {
  if (grepl(",", readLines(x, n = 1))) {
    read.csv(x, sep = ",")
  } else {
    read.csv(x, sep = "\t")
  }
})
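The same first-line check also works without plyr. The sketch below is illustrative only: `demo_dir`, `read_any` and `combined` are hypothetical names, the two files are throwaway copies of the example data, and it assumes a file is comma-separated whenever its header line contains a comma.

```r
# Create two small example files with different separators in a temporary directory
demo_dir <- tempfile("sepdemo")
dir.create(demo_dir)
cat("v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n",
    file = file.path(demo_dir, "file1.dat"))
cat("v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n",
    file = file.path(demo_dir, "file2.dat"))

# Hypothetical helper: pick the separator by peeking at the header line
read_any <- function(path) {
  first <- readLines(path, n = 1)
  sep <- if (grepl(",", first, fixed = TRUE)) "," else "\t"
  read.table(path, header = TRUE, sep = sep, stringsAsFactors = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.dat$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read_any))
combined
```

With half a million rows, fread is likely to be faster; this version mainly shows the detection logic on its own.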
