Assume that we have a data.table with ids and subgroups as follows:
DT <- data.table(id=c("A","A","B","B"), subgroup=c("k","m","k","m"), C=c(4,9,6,5))
> DT
id subgroup C
1: A k 4
2: A m 9
3: B k 6
4: B m 5
Now we want to introduce new subgroups for each id, whose value C depends on another subgroup. In this example the new subgroup l should be 0.5 times subgroup k, for a given id:
id subgroup C
1: A k 4
2: A l 2
3: A m 9
4: B k 6
5: B l 3
6: B m 5
How could one achieve this efficiently using data.table? The only solution I have come up with is reshaping to wide and then creating new columns; but if one has a large set of ids, this would be fairly clumsy.
Note: in a real-life application there would be many more subgroups and ids.
Update: the solution should also handle the more complex case of having more than 2 subgroups.
You can do:
# add a new row per id based on the given condition (l is half of k)
DT <- rbind(DT, DT[subgroup == "k"][, `:=`(subgroup = "l", C = C / 2)])
# order the data by id (and subgroup, to match the desired output)
DT <- DT[order(id, subgroup)]
Alternate format as suggested by @Frank:
res <- DT[, rbind(.SD, copy(.SD[subgroup == "k"])[, `:=`(subgroup = "l", C = C / 2)]), by = id]
setorder(res, id, subgroup)
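If, as the question notes, there are many more subgroups to derive, one option is to encode the rules in a small lookup table and join. This is only a sketch, assuming each new subgroup is a scaled copy of an existing one and starting from the original DT; the rules table and its columns (source, new, mult) are made up for illustration:
# hypothetical rule table: each new subgroup is mult times an existing source subgroup
rules <- data.table(source = c("k", "m"),
                    new    = c("l", "n"),
                    mult   = c(0.5, 2))
# join the rules onto the matching rows, compute the new values, and append
new_rows <- DT[rules, on = c(subgroup = "source"),
               .(id, subgroup = i.new, C = C * i.mult)]
res <- rbind(DT, new_rows)
setorder(res, id, subgroup)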
I am trying to "tidy" a large dataset, where multiple different types of data is merged in columns, and some data in column names. This is a common scenario in biological dataset.
My data table has replicate measurements which I want to collapse into a mean. Converting the data into tidy format, these replicate values become additional rows. If I try to aggregate/group by several columns and calculate the mean of the replicates:
collapsed.data <- tidy.dt[, mean(expression, na.rm = T), by=list(Sequence.window,Gene.names,ratio,enrichment.type,condition)]
I get a resulting table that has only the columns used in the by statement, followed by the mean(expression) as column V1. Is it possible to get all the other (unchanged) columns as well?
A minimalist example showing what I am trying to achieve is as follows:
library(data.table)
dt <- data.table(a = c("a", "a", "b", "b", "c", "a", "c", "a"), b = rnorm(8),
                 c = c(1, 1, 1, 1, 1, 2, 1, 2), d = rep('x', 8), e = rep('test', 8))
dt[, mean(b), by = list(a, c)]
# a c V1
#1: a 1 -0.7597186
#2: b 1 -0.3001626
#3: c 1 -0.6893773
#4: a 2 -0.1589146
As you can see the columns d and e are dropped.
One possibility is to include d and e in the grouping:
res <- dt[, mean(b), by = list(a, c, d, e)]
res
# a c d e V1
#1: a 1 x test 0.9271986
#2: b 1 x test -0.3161799
#3: c 1 x test 1.3709635
#4: a 2 x test 0.1543337
If you want to keep all columns except the one you want to aggregate, you can do this in a more programmatic way:
cols_to_group_by <- setdiff(colnames(dt), "b")
res <- dt[, mean(b), by = cols_to_group_by]
The result is the same as above.
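If you prefer a descriptive name over the default V1, you can wrap the aggregation in .(); mean_b below is just an illustrative name:
res <- dt[, .(mean_b = mean(b)), by = cols_to_group_by]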
This reduces the number of rows. If you want to keep all rows instead, you can add an additional column:
dt[, mean_b := mean(b), by = list(a, c)]
dt
# a b c d e mean_b
#1: a 1.1127632 1 x test 0.9271986
#2: a 0.7416341 1 x test 0.9271986
#3: b 0.9040880 1 x test -0.3161799
#4: b -1.5364479 1 x test -0.3161799
#5: c 1.9846982 1 x test 1.3709635
#6: a 0.2615139 2 x test 0.1543337
#7: c 0.7572287 1 x test 1.3709635
#8: a 0.0471535 2 x test 0.1543337
Here, dt is modified by reference, i.e., without copying all of dt, which can save time and memory on large data.
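As a further sketch (not part of the answer above): if you want one row per group but still want to carry the other columns along, you can combine .SD[1] with the aggregate; this keeps the first value of every non-grouping column:
# one row per (a, c) group: first values of the remaining columns plus the mean of b
res <- dt[, c(.SD[1], .(mean_b = mean(b))), by = .(a, c)]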
I have a dataframe "df" as below:
V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a
Is there a way that I can automate the following 3 steps in R?
Step 1:
R identifies that, in the 12 rows of the dataframe "df", the pattern "a b" repeats the majority of the time.
Step 2:
Based on the majority pattern from Step 1, R subsets the dataframe to only those rows which contain that pattern.
Step 3:
R outputs the new subsetted dataframe from Step 2.
Is there a package to do this or a function that I can build? Any guidance will be very valuable. Thank You
Do you need duplicates of the most common combination? If not, there's a really simple way to do it with data.table
library("data.table")
#Create sample data, set seed to have the same output
set.seed(1)
df <- data.table(V1 = sample(c("a", "b", "c"), 10, replace = TRUE),
                 V2 = sample(c("a", "b", "c"), 10, replace = TRUE),
                 V3 = sample(c("a", "b", "c"), 10, replace = TRUE))
#Subset
cols <- names(df)
df[, .N, by = cols][order(-N)][1,]
Output (N is the number of occurrences):
V1 V2 V3 N
1: b c b 2
With your updated Q, you may try this with data.table:
library(data.table)
setDT(df)
cols <- c("V1", "V2")
df[, .N, by = cols][N == max(N)][, N := NULL][df, on = cols, nomatch = 0]
# V1 V2 id
#1: a b 3
#2: a b 5
#3: a b 6
#4: a b 7
#5: a b 9
#6: a b 10
#7: a b 11
Explanation
setDT(df) coerces the data.frame to data.table without copying.
The relevant columns are defined to save typing.
The number of occurrences of each combination of the relevant columns is counted.
Only the combination with the highest number of occurrences is kept. This completes Step 1 of the Q, which asks to find the majority pattern.
Note that in case of a tie, i.e., two or more combinations have the same maximum number of occurrences, all combinations are returned. The OP hasn't specified how they want to handle this case (a simple tie-break is sketched after this explanation).
The counts are removed as they are no longer needed.
Finally, the original df is joined so that all rows of df which match the majority pattern are selected. nomatch = 0 specifies an inner join. This completes Step 2 and perhaps also Step 3 (but this is unclear to me, see comment).
Note that the row numbers in the id column are kept in the result without any additional effort. This would be the case for any other additional column in df as well.
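If exactly one pattern is needed even in case of a tie, a simple option (a sketch, with ties broken by order of appearance) is to sort by the count and keep only the top combination before joining:
# keep only the single most frequent combination, then select its rows from df
top <- df[, .N, by = cols][order(-N)][1][, N := NULL]
df[top, on = cols, nomatch = 0]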
Data
df <- fread("id V1 V2
1 b a
2 b a
3 a b
4 b a
5 a b
6 a b
7 a b
8 b a
9 a b
10 a b
11 a b
12 b a")
# EDIT: By request of the OP, the row number (id) should be kept
# df[, id := NULL]
What exactly is "similarity" in your query - just finding the most common row and taking only that? If yes, you just need to group by all variables and sort by occurrence.
If you're talking about similar, but not exactly matching, text in your columns, you need to look into edit distances (the stringdist package is good for the task: https://cran.r-project.org/web/packages/stringdist/stringdist.pdf).
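A minimal sketch of what an edit-distance check could look like, assuming the stringdist package is installed ("lv" is the Levenshtein distance; the example strings are made up):
library(stringdist)
# number of single-character edits needed to turn one string into the other
stringdist("a b", "b a", method = "lv")
# [1] 2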
This is what my data table looks like:
library(data.table)
dt <- fread('
Product Group LastProductOfPriorGroup
A 1 NA
B 1 NA
C 2 B
D 2 B
E 2 B
F 3 E
G 3 E
')
The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from the last row of the prior group. So in the first two rows, there are no prior groups and therefore it is NA. In the third row, the product in the last row of the prior group 1 is B. I am trying to accomplish this by
dt[,LastGroupProduct:= shift(Product,1), by=shift(Group,1)]
to no avail.
You could do
dt[, newcol := shift(dt[, last(Product), by = Group]$V1)[.GRP], by = Group]
This results in the following updated dt, where newcol matches your desired column with the unnecessarily long name. ;)
Product Group LastProductOfPriorGroup newcol
1: A 1 NA NA
2: B 1 NA NA
3: C 2 B B
4: D 2 B B
5: E 2 B B
6: F 3 E E
7: G 3 E E
Let's break the code down from the inside out. I will use ... to denote the accumulated code:
dt[, last(Product), by = Group]$V1 is getting the last values from each group as a character vector.
shift(...) shifts the character vector in the previous call
dt[, newcol := ...[.GRP], by = Group] groups by Group and uses the internal .GRP values for indexing
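For the example data, the two inner pieces evaluate to:
dt[, last(Product), by = Group]$V1
# [1] "B" "E" "G"
shift(dt[, last(Product), by = Group]$V1)
# [1] NA  "B" "E"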
Update: Frank brings up a good point about my code above calculating the shift for every group over and over again. To avoid that, we can use either
shifted <- shift(dt[, last(Product), Group]$V1)
dt[, newcol := shifted[.GRP], by = Group]
so that we don't calculate the shift for every group. Or, we can take Frank's nice suggestion in the comments and do the following.
dt[dt[, last(Product), by = Group][, v := shift(V1)], on="Group", newcol := i.v]
Another way is to save the last group's value in a variable.
this = NA_character_ # initialize
dt[, LastProductOfPriorGroup:={ last<-this; this<-last(Product); last }, by=Group]
dt
Product Group LastProductOfPriorGroup
1: A 1 NA
2: B 1 NA
3: C 2 B
4: D 2 B
5: E 2 B
6: F 3 E
7: G 3 E
NB: last() is a data.table function which returns the last item of a vector (of the Product column in this case).
This should also be fast since no logic is being invoked to fetch the last group's value; it just relies on the groups running in order (which they do).
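For example, with the data above:
dt[Group == 2, last(Product)]
# [1] "E"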
I'm trying to use data.table for my relatively large datasets and I can't figure out how to apply a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table(), but one that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
my clunky histogram function:
histo <- function(vec) {
  foo <- c("a" = 0, "b" = 0, "c" = 0, "d" = 0)
  for (i in vec) { foo[i] <- foo[i] + 1 }
  return(foo)
}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
pseudocode of desired function and output
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
  list(
    sapply(letters[1:4], function(d) {
      b <- sum(!is.na(grep(d, x)))
      assign(d, b)
    }))
}), recursive = FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,2,0
2: d a a b a 3,1,1,1
3: a a a c c 3,0,3,0
> df$hist
[[1]]
a b c d
3 1 2 0
[[2]]
a b c d
3 1 1 1
[[3]]
a b c d
3 0 3 0
NOTE: Answer has been updated in response to the OP's request and mnel's comment
OK, how do you like this solution:
library(data.table)
DT <- data.table(A = c("a", "d", "a"),
                 B = c("b", "a", "a"),
                 C = c("c", "a", "a"),
                 D = c("a", "b", "c"),
                 E = c("a", "a", "c"))
fun <- function(vec, char) {
  sum(vec == char)
}
DT[, Vec_Nr := paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse = ","),
   by = 1:nrow(DT),
   .SDcols = LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences of one character. To see how that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function, because you don't want the count for only one character "b" but for many. To pass a vector of arguments to a function, use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberately used letters and LETTERS here to show you how to make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
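For example, on the first row's values as a plain character vector:
histo2(c("a", "b", "c", "a", "a"))
# a b c d
# 3 1 1 0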
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this handles all the columns at once. It works because .SD is a special variable holding the subset of data associated with the call to by; in this case, that subset is the one-row data.table for the current row. histo2(DT[1]) works the same way.
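Assuming the columns have been converted to character as noted in the EDIT above, that call gives:
histo2(DT[1])
# a b c d
# 3 1 1 0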
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0