Combining text files in R with different separators

I am trying to read in and combine multiple text files in R. The issue is that the field separator differs between files (e.g. tab for one and commas for another). How can I combine them efficiently? An example of the layout:
Data1 (tab):
v1 v2 v3 v4 v5
1 2 3 4 urban
4 5 3 2 city
Data2 (comma):
v1,v2,v3,v4,v5
5,6,7,8,rural
6,4,3,1,city
This example is obviously not real; the real data has nearly half a million points, so I cannot reshape the original files by hand. The code I have used so far is:
library(plyr)
filelist <- list.files(path = "~/Documents/", pattern = "\\.dat$", full.names = TRUE)
data1 <- ldply(filelist, function(x) read.csv(x, sep = "\t"))
data2 <- ldply(filelist, function(x) read.csv(x, sep = ","))
This gives me the data both ways, which I then have to clean manually before combining. Is there a way of handling sep that removes this step? Column names are the same across files. I know that stringr or other string functions may be useful, but I also need to load the data at the same time, and I am unsure how to set this up within the read commands.

I would suggest using fread from the "data.table" package. It's fast, and does a pretty good job of automatically detecting a delimiter in a file.
Here's an example:
## Create some example files
cat('v1\tv2\tv3\tv4\tv5\n1\t2\t3\t4\turban\n4\t5\t3\t2\tcity\n', file = "file1.dat")
cat('v1,v2,v3,v4,v5\n5,6,7,8,rural\n6,4,3,1,city\n', file = "file2.dat")
## Get a character vector of the file names
files <- list.files(pattern = "\\.dat$") ## Use what you're already doing
library(data.table)
lapply(files, fread)
# [[1]]
# v1 v2 v3 v4 v5
# 1: 1 2 3 4 urban
# 2: 4 5 3 2 city
#
# [[2]]
# v1 v2 v3 v4 v5
# 1: 5 6 7 8 rural
# 2: 6 4 3 1 city
## Fancy work: Bind it all to one data.table...
## with a column indicating where the file came from....
rbindlist(setNames(lapply(files, fread), files), idcol = TRUE)
# .id v1 v2 v3 v4 v5
# 1: file1.dat 1 2 3 4 urban
# 2: file1.dat 4 5 3 2 city
# 3: file2.dat 5 6 7 8 rural
# 4: file2.dat 6 4 3 1 city

You can also add an if clause to your function that checks the first line of each file for a comma:
data <- ldply(filelist, function(x)
  if (grepl(",", readLines(x, n = 1))) read.csv(x, sep = ",") else read.csv(x, sep = "\t"))
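If more separators are possible, a slightly more general version of the same idea is to count the fields that each candidate separator produces on the header line and pick the winner. This is a minimal sketch; detect_sep() is a hypothetical helper, not part of any package:
detect_sep <- function(file, candidates = c("\t", ",", ";")) {
  header <- readLines(file, n = 1)
  ## Count how many fields each candidate separator yields on the header line
  fields <- sapply(candidates, function(s) length(strsplit(header, s, fixed = TRUE)[[1]]))
  candidates[which.max(fields)]
}
data <- ldply(filelist, function(x) read.csv(x, sep = detect_sep(x)))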


How to apply division operation across values in a single column, based on other columns?

I have a fairly large data frame with one numeric column and a bunch of factors. One of these factors has only two values ("High" and "Low"). I want to generate a new, smaller data frame that divides each "High" value of the numeric variable by the matching "Low" value in the same column.
Example Data:
set.seed(1)
V1 <- rep(c("a","b"), each =8)
V2 <- 1:4
V3 <- rep(c("High","Low"), each=4)
V4 <- rnorm(16)
foo <- data.frame(V1,V2,V3,V4)
Which gives me the following data frame:
V1 V2 V3 V4
1 a 1 High -0.62645381
2 a 2 High 0.18364332
3 a 3 High -0.83562861
4 a 4 High 1.59528080
5 a 1 Low 0.32950777
6 a 2 Low -0.82046838
7 a 3 Low 0.48742905
8 a 4 Low 0.73832471
9 b 1 High 0.57578135
10 b 2 High -0.30538839
11 b 3 High 1.51178117
12 b 4 High 0.38984324
13 b 1 Low -0.62124058
14 b 2 Low -2.21469989
15 b 3 Low 1.12493092
16 b 4 Low -0.04493361
I want to generate a smaller data frame that divides V4(High) by the matching V4(Low)
V1 V2 V4
1 a 1 -1.901181 #foo[1,4]/foo[5,4]
2 a 2 -0.223827 #foo[2,4]/foo[6,4]
...
The problem is that my real data is messier than this. I do know that V3 repeats regularly (there is a High for every Low), but V1 and V2 do not repeat as regularly as I've shown here. They are not highly irregular, but there are a few dropped values (e.g. the b3 Low and High rows might both have been dropped).
I'm assuming I'm going to have to restructure my data frame somehow, but I have no idea where to even start. Thanks in advance.
Here's an option using dplyr and reshape2:
library(dplyr)
library(reshape2)
foo %>%
  dcast(V1 + V2 ~ V3, value.var = "V4") %>%
  mutate(Ratio = High / Low) %>%
  select(V1, V2, Ratio)
V1 V2 Ratio
1 a 1 -1.9011807
2 a 2 -0.2238274
3 a 3 -1.7143595
4 a 4 2.1606764
5 b 1 -0.9268251
6 b 2 0.1378915
7 b 3 1.3438880
8 b 4 -8.6759832
Get rid of the select statement if you want to keep the High and Low columns in the final result.
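A similar reshape works with "tidyr" in place of "reshape2" (assuming you have tidyr installed); spread() plays the role of dcast() here:
library(dplyr)
library(tidyr)
foo %>%
  spread(V3, V4) %>%
  mutate(Ratio = High / Low) %>%
  select(V1, V2, Ratio)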
Or with dplyr alone:
foo %>%
  group_by(V1, V2) %>%
  summarise(Ratio = V4[V3 == "High"] / V4[V3 == "Low"])
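One caveat for messy data like yours: if a group ever lacks its High or Low row, V4[V3 == "High"] returns a zero-length vector and summarise() will fail. Indexing the first element is a small defensive tweak that yields NA for such groups instead:
foo %>%
  group_by(V1, V2) %>%
  summarise(Ratio = V4[V3 == "High"][1] / V4[V3 == "Low"][1])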
Or with data.table:
library(data.table)
setDT(foo)[ , list(Ratio = V4[V3=="High"]/V4[V3=="Low"]), by=list(V1, V2)]
One way to do it would be to first split the dataframe by V3. Then, if they're ordered correctly, it's straightforward. If not, then merge them into a single dataframe and proceed from there. For example:
# Split foo
fooSplit <- split(foo, foo$V3)
#If ordered correctly (as in the example)
fooSplit[[1]]$V4 / fooSplit[[2]]$V4
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
#If not ordered correctly, merge into new dataframe
#Rename variables in prep for merge
names(fooSplit[[1]])[4] <- "High"
names(fooSplit[[2]])[4] <- "Low"
#Merge into a single dataframe, drop V3
d <- merge(fooSplit[[1]][,-3], fooSplit[[2]][,-3], by = 1:2, all = TRUE)
d$High / d$Low
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
I think the dplyr package could help you with that.
Following on from your code:
You can create a "key" column to use when cross-referencing the "High" and "Low" values.
foo <- mutate(foo, key = paste(V1, V2))
Now that you have the "key" column, you can use filter to split the data set into two groups ("High" and "Low"), merge to join them on the "key" column, and select to tidy up the data set and keep just the important columns.
foo <- select(merge(filter(foo, V3 == "High"), filter(foo, V3 == "Low"),
                    by = "key"), V1.x, V2.x, V4.x, V4.y)
Finally, once the data are in the same table, you can create a new calculated column using mutate. We use select again to keep the data set as simple as possible.
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
So, if you execute:
foo <- mutate(foo, key = paste(V1, V2))
foo <- select(merge(filter(foo, V3 == "High"), filter(foo, V3 == "Low"),
                    by = "key"), V1.x, V2.x, V4.x, V4.y)
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
you will get:
# V1.x V2.x V4
#1 a 1 -1.9011807
#2 a 2 -0.2238274
#3 a 3 -1.7143595
#4 a 4 2.1606764
#5 b 1 -0.9268251
#6 b 2 0.1378915
#7 b 3 1.3438880
#8 b 4 -8.6759832
It's probably not the shortest way to do it, but I hope it helps.

R: read file, split into multiple dataframes

I have a file that is laid out in the following way:
# Query ID 1
# note
# note
tab delimited data across 12 columns
# Query ID 2
# note
# note
tab delimited data across 12 columns
I'd like to import this data into R so that each query is its own dataframe. Ideally as a list of dataframes with the query ID as the name of each item in the list. I've been searching for awhile, but I haven't seen a good way to do this. Is this possible?
Thanks
We have used commas instead of tabs to make the example easier to see, and have put the body of the file in a string, but aside from making the obvious changes (noted in the code comments), try this. First we use readLines to read in the file, then determine which lines are headers and build a grp vector with one element per line of the file, whose value is the header for that line. Finally we split the lines by grp and apply Read to each group:
# test data
Lines <- "# Query ID 1
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12
# Query ID 2
# note
# note
1,2,3,4,5,6,7,8,9,10,11,12
1,2,3,4,5,6,7,8,9,10,11,12"
L <- readLines(textConnection(Lines)) # L <- readLines("myfile")
isHdr <- grepl("Query", L)
grp <- L[isHdr][cumsum(isHdr)]
# Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, comment = "#")
Read <- function(x) read.table(text = x, sep = ",", fill = TRUE, comment = "#")
Map(Read, split(L, grp))
giving:
$`# Query ID 1`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
$`# Query ID 2`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
1 1 2 3 4 5 6 7 8 9 10 11 12
2 1 2 3 4 5 6 7 8 9 10 11 12
No packages needed.
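If you would rather have the bare query IDs as the list names instead of the full header lines, a small cleanup of the names (assuming every header looks like "# Query ID n") still needs no packages:
result <- Map(Read, split(L, grp))
names(result) <- sub("^#\\s*", "", names(result))
## names(result) is now "Query ID 1", "Query ID 2"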

Fill option for fread

Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
                 repos = "https://Rdatatable.github.io/data.table",
                 type = "source")
Original answer
This doesn't answer your question about fread: that question has already been addressed by @Matt.
It does, however, give you an alternative to consider that should give you good speed improvements over base R's read.csv.
Unlike fread, you will have to help these functions out a little by providing them with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
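Since every column is read in as character here, one possible follow-up (a sketch using base R's type.convert; adjust if your real columns should stay character) is to convert each column back to its natural type afterwards:
res <- input.file(x, formatter = dstrsplit, sep = ",",
                  col_types = rep("character", max(count.fields(x, ","))))
## Let R re-guess each column's type; as.is = TRUE avoids factors
res[] <- lapply(res, type.convert, as.is = TRUE)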
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented; not filled in separate columns as read.csv can do.

Numbers as factor after read.delim

I have a data frame that looks like this:
A B C D
1 2 3 4
E F G H
5 6 7 8
I would like to subset only the numeric portion using the following code:
sub_num = DF[sapply(DF, is.numeric)]
The problem is that the numbers are factors after reading the data frame with read.delim. If I set stringsAsFactors = FALSE, the numbers are characters instead. This may be a basic problem, but I'm not able to solve it.
Try the following instead:
## The first nrow(DF) elements of the flattened matrix correspond to the first column
sub_num <- DF[!is.na(suppressWarnings(as.numeric(sapply(DF, as.character))))[1:nrow(DF)], ]
# V1 V2 V3 V4
# 2 1 2 3 4
# 4 5 6 7 8
As for your sapply statement, sapply(DF, is.numeric): is.numeric returns FALSE for factor columns, and it also returns FALSE for character vectors, so converting with as.character does not fix it:
sapply(DF, function(X) is.numeric(as.character(X)))
Moreover, that tests whole columns, and since every column here mixes letters and numbers, it would not index your DF as you expect anyway.
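A more explicit sketch (assuming, as in your example, that letter rows and number rows alternate within the same columns) is to flag the numeric rows from one column, subset, and then convert the result to numeric:
## Flag rows whose first field parses as a number
first_col <- as.character(DF[[1]])
num_rows <- !is.na(suppressWarnings(as.numeric(first_col)))
sub_num <- DF[num_rows, ]
## Convert the factor/character columns of the subset to numeric
sub_num[] <- lapply(sub_num, function(x) as.numeric(as.character(x)))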
