I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my klunky histogram function:
histo<-function(vec){
foo<-c("a"=0,"b"=0,"c"=0,"d"=0)
for(i in vec){foo[i]=foo[i]+1}
return(foo)}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
pseduocode of desired function and output
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
list(
sapply(letters[1:4], function(d) {
b <- sum(!is.na(grep(d,x)))
assign(d, b)
}))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,2,0
2: d a a b a 3,1,1,1
3: a a a c c 3,0,3,0
> df$hist
[[1]]
a b c d
3 1 2 0
[[2]]
a b c d
3 1 1 1
[[3]]
a b c d
3 0 3 0
NOTE: Answer has been updated to to OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
B=c("b","a","a"),
C=c("c","a","a"),
D=c("a","b","c"),
E=c("a","a","c"))
fun <- function(vec, char) {
sum(vec==char)
}
DT[, Vec_Nr:= paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
by=1:nrow(DT),
.SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences for one character. To see how
that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function because you don't want to know that for only one character "b", but for many. To pass a vector of arguments to a function,
use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberatly used the letters and LETTERS here to show you how make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this iterates over columns. This works because .SD is a special variable holding the subset of data associated with the call to by. In this case, that subset is the data.table consisting of one of the rows. histo2(DT[1]) works the same way.
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0
Related
I have a data frame with 4 columns, each column represent a different treatment. Each column is fill with protein numbers on it and the columns have different number of rows between each other. Theres a way to compare all 4 columns and have as a result a fifth column saying if a value is found in which of the columns? I know I have some values that will happen in two or even maybe 3 of the colums and I was wondering if theres a way to get this as end result in a new column.
I tried Data$A %in% Data$B but this just gives me TRUE or FALSE between two columns. I was looking for some option like match or even contain, but all options seens that can only give me a true or false answer.
What I need is something like this.
A B C
1 DSFG DSFG DSGG
2 DDEG DDED DDEE
3 HUGO HUGI HUGO
So if this is my table, I want the result like this
D(?) E
1 DSFG A,B
2 DSGG C
4 DDEG A
5 DDED B
6 DDEE C
7 HUGO A,C
8 HUGI B
Solution
An idea via base R is to use stack to convert to long, and aggregate to get the required output.
aggregate(ind ~ values, stack(df), toString)
# values ind
#1 DDED B
#2 DDEE C
#3 DDEG A
#4 DSFG A, B
#5 DSGG C
#6 HUGI B
#7 HUGO A, C
NOTE: Your columns need to be as.character for this to work. (df[] <- lapply(df, as.character))
Explanations
Stacking turns data into "long format":
stack(df)
values ind
1 DSFG A
2 DDEG A
3 HUGO A
4 DSFG B
5 DDED B
6 HUGI B
7 DSGG C
8 DDEE C
9 HUGO C
toString() simply joins elements in a vector by comma
toString(c("A", "B", "C"))
[1] "A, B, C"
Aggregating returns a vector of "ind"s for each value, and these are then turned into a string using the function above:
aggregate(ind ~ values, stack(df), FUN=toString)
Doing it the tidy way:
Input
df <- data.frame(A = c("DSFG", "DDEG", "HUGO"), B = c("DSFG", "DDED", "HUGI"), C = c("DSGG", "DDEE", "HUGO"))
Summarizing data
library(tidyverse)
df %>%
gather("Column", "Value", 1:3) %>%
group_by(Value) %>%
summarise(Cols = paste(Column, collapse = ","))
Output
Value Cols
DDED B
DDEE C
DDEG A
DSFG A,B
DSGG C
HUGI B
HUGO A,C
Say we have this toy data.table example:
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
"V" "GR"
A 1
B 1
C 1
D 2
A 2
I would like to generate all ordered combinations with combn within each subset defined by GR and create with it a new data.table and with a new column with the grouping factor.
For example, for GR=1 we have (A,B),(A,C),(B,C)
for GR=2 we have (D,A)
If I create the result manually it would be
cbind(V=c(1,1,1,2),rbind(t(combn(c("A", "B", "C"),2)),t(combn(c( "D","A"),2))))
1 A B
1 A C
1 B C
2 D A
But I would like to do it with data.table easily instead.
This two option don't work:
temp[,cbind(rep(.GRP,.N),as.data.frame(t(combn(V,2)))),by=GR]
temp[,cbind(rep(.BY,.N),as.data.frame(t(combn(V,2)))),by=GR]
This one work, but I don't understand why. I'm afraid it could copy the whole B vector as is instead of the proper value.
temp[,.(GR,as.list(as.data.frame((combn(V,2))))),by=GR]
And I guess it should be a shorter way to write it.
This works:
> temp[, {v_comb = combn(V,2); .(v_comb[1,], v_comb[2,])}, by=GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
In general, I would avoid when possible all the reshaping operations within the data.table using cbind(), rep(), as.data.frame() or t()... It takes many trials-and-errors to figure out the right way to do it, and produces code that is very hard to maintain.
On the other hand, using code blocks {...} improves the readability of the code.
This uses data.table, though not all within [] using .BY or .GRP.
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A"), GR=c(1,1,1,2,2))
tempfunc <- function(x){
dat <- as.data.table(t(combn(temp[GR == x, V], 2)))
dat[, GR := x]
setcolorder(dat, c("GR", "V1", "V2"))
dat[]
}
rbindlist(lapply(unique(temp$GR), tempfunc))
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Here are two other approaches which also work if there is a group with just one row, e.g., row 6 below:
library(data.table)
temp <- data.table(V=c("A", "B", "C", "D","A","E"), GR=c(1,1,1,2,2,3))
temp
V GR
1: A 1
2: B 1
3: C 1
4: D 2
5: A 2
6: E 3
Using cominat::combn2
temp[, as.data.table(combinat::combn2(V)), by = GR]
GR V1 V2
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 D A
Using a non-equi join
temp[, V := factor(V)][temp, on = .(GR, V < V), .(GR, x.V, i.V),
nomatch = 0L, allow = TRUE]
GR x.V i.V
1: 1 A B
2: 1 A C
3: 1 B C
4: 2 A D
I have one solution but it seems long and complex too.
temp[,do.call(c, apply(t(combn(V,2)), 2, list)),by=GR]
I've also found that combn is 10 times slower than some specialized packages such as iterpc or combinat
temp[,do.call(c, apply(combn2(V), 2, list)),by=GR]
You must also first filter out any group having just one row because otherwise it would cause an error.
And this is my final version, much faster and needs much less memory:
temp[,.(from=rep(V,(.N-1):0),to=V[unlist(sapply(2:.N, seq, .N, simplify = T))]), by=GR]
This is what my data table looks like:
library(data.table)
dt <- fread('
Product Group LastProductOfPriorGroup
A 1 NA
B 1 NA
C 2 B
D 2 B
E 2 B
F 3 E
G 3 E
')
The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from last row of the prior group. So in the first two rows, there are no prior groups and therefore it is NA. In the third row, the product in the last row of the prior group 1 is B. I am trying to accomplish this by
dt[,LastGroupProduct:= shift(Product,1), by=shift(Group,1)]
to no avail.
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
Product Group LastProductOfPriorGroup newcol
1: A 1 NA NA
2: B 1 NA NA
3: C 2 B B
4: D 2 B B
5: E 2 B B
6: F 3 E E
7: G 3 E E
Let's break the code down from the inside out. I will use ... to denote the accumulated code:
dt[, last(Product), by = Group]$V1 is getting the last values from each group as a character vector.
shift(...) shifts the character vector in the previous call
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
Another way is to save the last group's value in a variable.
this = NA_character_ # initialize
dt[, LastProductOfPriorGroup:={ last<-this; this<-last(Product); last }, by=Group]
dt
Product Group LastProductOfPriorGroup
1: A 1 NA
2: B 1 NA
3: C 2 B
4: D 2 B
5: E 2 B
6: F 3 E
7: G 3 E
NB: last() is a data.table function which returns the last item of a vector (of the Product column in this case).
This should also be fast since no logic is being invoked to fetch the last group's value; it just relies on the groups running in order (which they do).
I have two categorical columns (A,B) and numerical column (C). I want to obtain the value of A where C is the maximum of groups defined by B. I'm looking for a data.table solution.
library(data.table)
dt <- data.table( A = c("a","b","c"),
B = c("d","d","d"),
C = c(1,2,3))
dt
A B C
1: a d 1
2: b d 2
3: c d 3
# I want to find the value of A for the maximum value
# of C when grouped by B
dt[,max(C), by=c("B")]
B V1
1: d 3
#how can I get the A column, value = "c"
Another option is to sort by C and just extract the unique B groups. This should be faster for a big data set with many groups because it doesn't calculate maxima per group, rather sorts only once
unique(dt[order(-C)], by = "B")
# A B C
# 1: c d 3
You can use which.max to find the index of the maximum of C and select the corresponding element in A:
dt[,.(A=A[which.max(C)]), B]
# B A
#1: d c
Ok, I have a matrix of values with certain identifiers, such as:
A 2
B 3
C 4
D 5
E 6
F 7
G 8
I would like to pull out a subset of these values (using R) based on a list of the identifiers ("B", "D", "E") for example, so I would get the following output:
B 3
D 5
E 6
I'm sure there's an easy way to do this (some sort of apply?) but I can't seem to figure it out. Any ideas? Thanks!
If the letters are the row names, then you can just use this:
m <- matrix(2:8, dimnames = list(LETTERS[1:7], NULL))
m[c("B","D","E"),]
# B D E
# 3 5 6
Note that there is a subtle but very important difference between: m[c("B","D","E"),] and m[rownames(m) %in% c("B","D","E"),]. Both return the same rows, but not necessarily in the same order.
The former uses the character vector c("B","D","E") as in index into m. As a result, the rows will be returned in the order of character vector. For instance:
# result depends on order in c(...)
m[c("B","D","E"),]
# B D E
# 3 5 6
m[c("E","D","B"),]
# E D B
# 6 5 3
The second method, using %in%, creates a logical vector with length = nrow(m). For each element, that element is T if the row name is present in c("B","D","E"), and F otherwise. Indexing with a logical vector returns rows in the original order:
# result does NOT depend on order in c(...)
m[rownames(m) %in% c("B","D","E"),]
# B D E
# 3 5 6
m[rownames(m) %in% c("E","D","B"),]
# B D E
# 3 5 6
This is probably more than you wanted to know...
Your matrix:
> m <- matrix(2:8, dimnames = list(LETTERS[1:7]))
You can use %in% to filter out the desired rows. If the original matrix only has a single column, using drop = FALSE will keep the matrix structure. Otherwise it will be converted to a named vector.
> m[rownames(m) %in% c("B", "D", "E"), , drop = FALSE]
# [,1]
# B 3
# D 5
# E 6