I'm having trouble with something that should be quite easy in R: I want to fill the missing values in a column of a data.frame with the value that belongs to the same group. So like this:
V1 V2
cat tree
cat NA
NA tree
dog house
NA house
dog NA
horse NA
NA car
horse car
The string corresponding to cat is tree, so "tree" must be filled in wherever there is an NA in the "cat" group; likewise, "house" must be filled in for every NA in the "dog" group. (My first idea was to take the first word of each group as the "leading" word to fill in everywhere - EDIT: it is better not to treat the first value as the leading one, since the first value of a group may itself be NA.)
There are a lot of NA's in V1, and a few in V2, and I want to fill only the NA's of V2.
In SPSS this is done with the AGGREGATE function, but I don't think R's aggregate function is comparable in this case, or is it? Does anyone know how to do this?
Thanks!
The OP has requested that the missing values be filled in by group, so the plain zoo::na.locf() approach (last observation carried forward) might fail here.
There is a method called update join which can be used to fill in the missing values per group:
library(data.table) # version 1.10.4 used
setDT(DT)
DT[DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1], on = "V1", V2 := fillin][]
# V1 V2
# 1: 1 tree
# 2: 1 tree
# 3: 1 tree
# 4: 2 house
# 5: 2 house
# 6: 2 house
# 7: 3 lawn
# 8: 3 lawn
# 9: 4 NA
#10: 4 NA
#11: NA NA
#12: NA tree
Note that the input data have been supplemented to cover some corner cases.
Explanation
The approach consists of two steps. First, the value to fill in is determined for each group; then an update join modifies DT in place.
fill_by_group <- DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1]
fill_by_group
# V1 fillin
#1: 2 house
#2: 3 lawn
#3: 1 tree
#4: 4 NA
DT[fill_by_group, on = "V1", V2 := fillin][]
order(V2) ensures that any NA values are sorted last, so that first(V2) picks the correct value to fill in.
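This relies on order()'s default na.last = TRUE; a quick check:
order(c(NA, "tree", "house"))
# [1] 3 2 1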
The update join approach has been benchmarked as the fastest method in another case.
Variant using na.omit()
docendo discimus has suggested in a comment to use na.omit(). This can be used in the update join as well, replacing order()/first():
DT[DT[!is.na(V1), .(fillin = na.omit(V2)), by = V1], on = "V1", V2 := fillin][]
Note that na.omit(V2) works as well as na.omit(V2)[1] or first(na.omit(V2)) here, because within each group all non-NA V2 values are identical, so it does not matter which of the matched rows supplies the fill value.
Data
Edit: The OP has changed his originally posted data set substantially. As a quick fix, I've updated the sample data below to include cases where V1 is NA.
library(data.table)
DT <- fread(
"1 tree
1 NA
1 tree
2 house
2 house
2 NA
3 NA
3 lawn
4 NA
4 NA
NA NA
NA tree")
Note that the data given by the OP have been supplemented to cover three additional cases:
The first V2 value in each group is NA.
All V2 values in a group are NA.
V1 is NA.
You can use dplyr and try:
mydata %>%
  group_by(V1) %>%
  mutate(V2 = unique(V2[!is.na(V2)]))
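Be aware that mutate() requires the right-hand side to be length 1 or the group size, so this errors when a group has no non-NA V2 at all, or has several distinct ones. A slightly more defensive sketch, assuming a single fill value per group is wanted:
mydata %>%
  group_by(V1) %>%
  mutate(V2 = V2[!is.na(V2)][1])  # yields NA for all-NA groups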
You can use the following:
mydata<-read.table(text="1 tree
1 NA
1 tree
2 house
2 house
2 NA")
# copy each NA's value from the row directly above it
mydata[is.na(mydata$V2), ]$V2 <- mydata[which(is.na(mydata$V2)) - 1, ]$V2
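Note that this copies from the row directly above, so it only works when every NA sits immediately below a filled row of the same group. An order-independent base R sketch using ave(), assuming at most one distinct non-NA value per group:
mydata$V2 <- as.character(mydata$V2)  # read.table may have produced a factor
mydata$V2 <- ave(mydata$V2, mydata$V1,
                 FUN = function(x) replace(x, is.na(x), x[!is.na(x)][1]))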
Let's say I have this txt file:
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
"ZZ",2
"AA",3,3,3,3
"CC","ad",2,2,2,2,2
With read.csv I can:
> read.csv("linktofile.txt", fill=T, header=F)
V1 V2 V3 V4 V5 V6 V7
1 AA 3 3 3 3 NA NA
2 CC ad 2 2 2 2 2
3 ZZ 2 NA NA NA NA NA
4 AA 3 3 3 3 NA NA
5 CC ad 2 2 2 2 2
However fread gives
> library(data.table)
> fread("linktofile.txt")
V1 V2 V3 V4 V5 V6 V7
1: CC ad 2 2 2 2 2
Can I get the same result with fread?
Major update
It looks like development plans for fread changed and fread has now gained a fill argument.
Using the same sample data from the end of this answer, here's what I get:
library(data.table)
packageVersion("data.table")
# [1] ‘1.9.7’
fread(x, fill = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1: AA 3 3 3 3 NA NA
# 2: CC ad 2 2 2 2 2
# 3: ZZ 2 NA NA NA NA NA
# 4: AA 3 3 3 3 NA NA
# 5: CC ad 2 2 2 2 2
Install the development version of "data.table" with:
install.packages("data.table",
repos = "https://Rdatatable.github.io/data.table",
type = "source")
Original answer
This doesn't answer your question about fread: that question has already been addressed by @Matt.
It does, however, offer an alternative worth considering, one that should give you a good speed improvement over base R's read.csv.
Unlike fread, you will have to help this function out a little by providing it with some information about the data you are trying to read.
You can use the input.file function from "iotools". By specifying the column types, you can tell the formatter function how many columns to expect.
library(iotools)
input.file(x, formatter = dstrsplit, sep = ",",
           col_types = rep("character", max(count.fields(x, ","))))
Sample data
x <- tempfile()
myvec <- c('"AA",3,3,3,3', '"CC","ad",2,2,2,2,2', '"ZZ",2', '"AA",3,3,3,3', '"CC","ad",2,2,2,2,2')
cat(myvec, file = x, sep = "\n")
## Uncomment for bigger sample data
## cat(rep(myvec, 200000), file = x, sep = "\n")
Not currently; I wasn't aware of read.csv's fill feature. The plan was to add the ability to read dual-delimited files (sep2 as well as sep, as mentioned in ?fread). Then variable-length vectors could be read into a list column where each cell is itself a vector, but not padded with NA.
Could you add it to the list please? That way you'll get notified when its status changes.
Are there many irregular data formats like this out there? I only recall ever seeing regular files, where the incomplete lines would be considered an error.
UPDATE: Very unlikely to be done. fread is optimized for regular delimited files (where each row has the same number of columns). However, irregular files could be read into list columns (each cell itself a vector) once sep2 is implemented; they would not be filled into separate columns the way read.csv does it.
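In the meantime, a minimal sketch of the list-column idea, assuming the file path x from the sample data above: read the raw lines yourself and split them, so that each cell holds a vector.
library(data.table)
DT <- data.table(line = readLines(x))
DT[, fields := strsplit(line, ",")]  # 'fields' is a list column; each cell is a character vector (quotes retained)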
I have a fairly large data frame with one numeric column and a bunch of factors. One of these factors has only two values. I want to generate a new, smaller data frame that divides the numeric variable by another value in the same column.
Example Data:
set.seed(1)
V1 <- rep(c("a","b"), each =8)
V2 <- 1:4
V3 <- rep(c("High","Low"), each=4)
V4 <- rnorm(16)
foo <- data.frame(V1,V2,V3,V4)
Which gives me the following data frame:
V1 V2 V3 V4
1 a 1 High -0.62645381
2 a 2 High 0.18364332
3 a 3 High -0.83562861
4 a 4 High 1.59528080
5 a 1 Low 0.32950777
6 a 2 Low -0.82046838
7 a 3 Low 0.48742905
8 a 4 Low 0.73832471
9 b 1 High 0.57578135
10 b 2 High -0.30538839
11 b 3 High 1.51178117
12 b 4 High 0.38984324
13 b 1 Low -0.62124058
14 b 2 Low -2.21469989
15 b 3 Low 1.12493092
16 b 4 Low -0.04493361
I want to generate a smaller data frame that divides V4(High) by the matching V4(Low)
V1 V2 V4
1 a 1 -1.901181 #foo[1,4]/foo[5,4]
2 a 2 -0.223827 #foo[2,4]/foo[6,4]
...
The problem is that my real data is messier than this. I do know that V3 repeats regularly (there is a High for every Low), but V1 and V2 do not repeat as regularly as I've demonstrated here. They are not highly irregular, but there are a few dropped values (e.g. b3Low and b3High might both have been dropped).
I'm assuming I'm going to have to restructure my data frame somehow, but I have no idea where to even start. Thanks in advance.
Here's an option using dplyr and reshape2:
library(dplyr)
library(reshape2)
foo %>% dcast(V1 + V2 ~ V3, value.var = "V4") %>%
  mutate(Ratio = High/Low) %>%
  select(V1, V2, Ratio)
V1 V2 Ratio
1 a 1 -1.9011807
2 a 2 -0.2238274
3 a 3 -1.7143595
4 a 4 2.1606764
5 b 1 -0.9268251
6 b 2 0.1378915
7 b 3 1.3438880
8 b 4 -8.6759832
Get rid of the select statement if you want to keep the High and Low columns in the final result.
Or with dplyr alone:
foo %>% group_by(V1, V2) %>%
  summarise(Ratio = V4[V3 == "High"] / V4[V3 == "Low"])
Or with data.table:
library(data.table)
setDT(foo)[ , list(Ratio = V4[V3=="High"]/V4[V3=="Low"]), by=list(V1, V2)]
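Note that if a group is missing its High or its Low row (the dropped values mentioned in the question), V4[V3 == "High"] returns a length-0 vector, so the two grouped summaries above either error or silently drop that group. A hedged variant that yields NA for incomplete groups (the dcast() approach handles them already, since missing combinations become NA):
foo %>% group_by(V1, V2) %>%
  summarise(Ratio = V4[V3 == "High"][1] / V4[V3 == "Low"][1])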
One way to do it would be to first split the dataframe by V3. Then, if they're ordered correctly, it's straightforward. If not, then merge them into a single dataframe and proceed from there. For example:
# Split foo
fooSplit <- split(foo, foo$V3)
#If ordered correctly (as in the example)
fooSplit[[1]]$V4 / fooSplit[[2]]$V4
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
#If not ordered correctly, merge into new dataframe
#Rename variables in prep for merge
names(fooSplit[[1]])[4] <- "High"
names(fooSplit[[2]])[4] <- "Low"
#Merge into a single dataframe, drop V3
d <- merge(fooSplit[[1]][,-3], fooSplit[[2]][,-3], by = 1:2, all = TRUE)
d$High / d$Low
# [1] -1.9011807 -0.2238274 -1.7143595 2.1606764 -0.9268251 0.1378915 1.3438880 -8.6759832
I think the dplyr package could help you with that.
Continuing from your code, you can create a "key" column to use when cross-referencing the "High" and "Low" values:
foo <- mutate(foo, key = paste(V1, V2))
Now you have the "key" column you could use filter, to split the data set in two groups ("High" and "Low"), merge to join them using the "key column and select to tide up the data set and just keep the important columns.
foo <- select(merge(filter(foo,V3=="High"), filter(foo,V3=="Low"),
by="key"), V1.x, V2.x, V4.x, V4.y)
Finally, once the data are in the same table, you can create the new calculated column using mutate. We use select again to keep the data set as simple as possible.
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
So, if you execute:
foo <- mutate(foo, key = paste(V1, V2))
foo <- select(merge(filter(foo, V3 == "High"), filter(foo, V3 == "Low"),
                    by = "key"), V1.x, V2.x, V4.x, V4.y)
foo <- select(mutate(foo, V4 = V4.x / V4.y), 1, 2, 5)
you will get:
# V1.x V2.x V4
#1 a 1 -1.9011807
#2 a 2 -0.2238274
#3 a 3 -1.7143595
#4 a 4 2.1606764
#5 b 1 -0.9268251
#6 b 2 0.1378915
#7 b 3 1.3438880
#8 b 4 -8.6759832
It's probably not the shortest way to do it, but I hope it helps.
Given
index = c(1,2,3,4,5)
codes = c("c1","c1,c2","","c3,c1","c2")
df=data.frame(index,codes)
df
index codes
1 1 c1
2 2 c1,c2
3 3
4 4 c3,c1
5 5 c2
How can I create a new df that looks like
df1
index codes
1 1 c1
2 2 c1
3 2 c2
4 3
5 4 c3
6 4 c1
7 5 c2
so that I can perform aggregates on the codes? The "index" of the actual data set are a series of timestamps, so I'll want to aggregate by day or hour.
Roland's method is quite good, provided the variable index has unique keys. You can gain some speed by working with the lists directly. Take into account that:
in your original data frame, codes is a factor. There is no point in that; you want it to be character.
in your original data frame, "" is used instead of NA. Because splitting "" yields a zero-length vector, you can get into all kinds of trouble later on. I'd use NA there: " " is an actual value and "" is no value at all, but what you want is a missing value. Hence NA.
So my idea would be:
The data:
index = c(1,2,3,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2")
df=data.frame(index,codes,stringsAsFactors=FALSE)
Then :
X <- strsplit(df$codes,",")
data.frame(
  index = rep(df$index, sapply(X, length)),
  codes = unlist(X)
)
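For the sample data above, this yields:
#   index codes
# 1     1    c1
# 2     2    c1
# 3     2    c2
# 4     3  <NA>
# 5     4    c3
# 6     4    c1
# 7     5    c2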
Or, if you insist on using "" instead of NA:
X <- strsplit(df$codes,",")
ll <- sapply(X,length)
X[ll==0] <- NA
data.frame(
  index = rep(df$index, pmax(1, ll)),
  codes = unlist(X)
)
Neither method assumes a unique key in index; both work perfectly well with non-unique timestamps.
You need to split the string (using strsplit) and then combine the resulting list with the data.frame.
The following relies on the assumption that codes are unique in each row. If you have many codes in some rows and only a few in others, this might waste a lot of RAM, and it might be better to loop.
#to avoid character(0), which would be omitted in rbind
levels(df$codes)[levels(df$codes)==""] <- " "
#rbind fills each row by propagating the values to the "empty" columns for each row
df2 <- cbind(df, do.call(rbind,strsplit(as.character(df$codes),",")))[,-2]
library(reshape2)
df2 <- melt(df2, id="index")[-2]
#here the assumption is needed
df2 <- df2[!duplicated(df2),]
df2[order(df2[,1], df2[,2]),]
# index value
#1 1 c1
#2 2 c1
#7 2 c2
#3 3
#9 4 c1
#4 4 c3
#5 5 c2
Here's another alternative using "data.table". The sample data includes NA instead of a blank space and includes duplicated index values:
index = c(1,2,3,2,4,5)
codes = c("c1","c1,c2",NA,"c3,c1","c2","c3")
df = data.frame(index,codes,stringsAsFactors=FALSE)
library(data.table)
## We could create the data.table directly, but I'm
## assuming you already have a data.frame ready to work with
DT <- data.table(df)
DT[, list(codes = unlist(strsplit(codes, ","))), by = "index"]
# index codes
# 1: 1 c1
# 2: 2 c1
# 3: 2 c2
# 4: 2 c3
# 5: 2 c1
# 6: 3 NA
# 7: 4 c2
# 8: 5 c3
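Once the codes are one per row, aggregation is straightforward. As a sketch, counting how often each code occurs per index (a stand-in for the per-day or per-hour grouping mentioned in the question):
long <- DT[, list(codes = unlist(strsplit(codes, ","))), by = "index"]
long[, .N, by = list(index, codes)]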
Doing a merge between a populated data.table and another one that is empty introduces one NA row in the resulting data.table:
a = data.table(c=c(1,2),key='c')
b = data.table(c=3,key='c')
b=b[c!=3]
b
# Empty data.table (0 rows) of 1 col: c
merge(a,b,all=T)
# c
# 1: NA
# 2: 1
# 3: 2
Why? I expected that it would return only the rows of data.table a, as it does with merge.data.frame:
> merge.data.frame(a,b,all=T,by='c')
# c
#1 1
#2 2
The example in the question is far too simple to show the problem, hence the confusion and discussion. Using two one-column data.tables isn't enough to show what merge does!
Here's a better example :
> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
P Q
1: 1 3
2: 2 4
> b
P R
1: 2 5
2: 3 6
> merge(a,b) # correct
P Q R
1: 2 4 5
> merge(a,b,all=TRUE) # correct.
P Q R
1: 1 3 NA
2: 2 4 5
3: 3 NA 6
> merge(a,b[0],all=TRUE) # incorrect result when y is empty, agreed
P Q R
1: NA NA NA
2: NA NA NA
3: 1 3 NA
4: 2 4 NA
> merge.data.frame(a,b[0],all=TRUE) # correct
P Q R
1 1 3 NA
2 2 4 NA
Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS:
merge no longer returns spurious NA row(s) when y is empty and
all.y=TRUE (or all=TRUE), #2633. Thanks
to Vinicius Almendra for reporting. Test added.
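With 1.8.9 or later, the empty-y merge should therefore match the merge.data.frame result shown above:
merge(a, b[0], all = TRUE)
#    P Q  R
# 1: 1 3 NA
# 2: 2 4 NA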
all: logical; all = TRUE is shorthand to save setting both all.x = TRUE and all.y = TRUE.
all.x: logical; if TRUE, then extra rows will be added to the output, one for each row in x that has no matching row in y. These rows will have NAs in those columns that are usually filled with values from y. The default is FALSE, so that only rows with data from both x and y are included in the output.
all.y: logical; analogous to all.x above.
This is taken from the data.table documentation. For more, look at the description of the arguments of the merge function there.
I think this answers your question.
Given the way you define a and b, a simple rbind(a, b) would return only the rows of a.
However, if you want to merge an empty data.table b with some other non-empty data.table a, there is a different approach. I had a similar problem when I had to merge data.tables within nested loops, and I used this workaround.
# some loop that returns a data.table named a
# another loop starts
if (isTRUE(all.equal(a, b <- data.table()))) {
  b <- a
  next
}
merge(a, b, by = c("Factor1", "Factor2"))
That helped me, maybe it will help you too.
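A simpler guard with the same intent might be (a sketch, assuming b starts out empty and Factor1/Factor2 are the join columns):
if (nrow(b) == 0) b <- a else b <- merge(a, b, by = c("Factor1", "Factor2"))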
That's to be expected: for merge.data.frame, all = TRUE requests a full outer join, so you get all keys of both tables; see ?merge.