I need to calculate a formula in a data frame. Each set of values across a few columns has to be, let's say for simplicity's sake, aggregated. However, I do not want a calculation across rows; I want to combine each set with another set based on a condition elsewhere.
This is what I mean:
I have a data.table.
data = data.table(A = c("a","c","b","b","a"),
                  B = 1:5,
                  C = 1:5)
setorder(data, A)
> data
A B C
1: a 1 1
2: a 5 5
3: b 3 3
4: b 4 4
5: c 2 2
In column D I need an aggregate of each row's values of B and C with the values of B and C in the rows where A is "a". As I have more than one "a", multiple aggregations are needed per row, and the minimum of those aggregates should be written in.
Here is an example.
For row 1: (1+1)+(1+1) = 4 and (5+5)+(1+1) = 12, so 4 is the minimum and D1 = 4.
For row 3: (3+3)+(1+1) = 8 and (3+3)+(5+5) = 16, so D3 = 8. And so on.
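In code form, a minimal sketch of the arithmetic for row 1 (values taken straight from the example):
a_sums <- c(1 + 1, 5 + 5)   # B + C for each of the two "a" rows
min((1 + 1) + a_sums)       # row 1's B + C added to each "a" aggregate
# [1] 4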
This is what I expect
> data_new
A B C D
1: a 1 1 4
2: a 5 5 12
3: b 3 3 8
4: b 4 4 10
5: c 2 2 6
I tried this and ran into issues:
for (i in data)data[i, D:=(min((data[i,B+C]) + (data[a=="a",(B+C)])))]
The expression below for the minimum selection works fine on its own: when I substitute a row number for i, min() receives the two candidate sums and returns the proper value. The answer below is 8.
min((data[3,B+C]) + (data[A=="a",(B+C)]))
My previous attempts involved expand.grid() and intersect(). However, with the size of my data set I ran into memory issues and RStudio quit on me. As a side note, I need to run the calculations because I cannot project the smallest outcome per "a" beforehand - the values are coordinates and do not correlate with the magnitude of the answer.
Any suggestion where my glaring issue is?
You can store the values of B + C where A == 'a' in a variable (val). For each row you can then take the minimum of B + C + val.
library(data.table)
val <- data[A == 'a', B + C]
data[, D := min(B + C + val), by = seq_len(nrow(data))]
data
# A B C D
#1: a 1 1 4
#2: a 5 5 12
#3: b 3 3 8
#4: b 4 4 10
#5: c 2 2 6
You can also use sapply (lapply here would give a list column):
data[, D := sapply(B + C, function(x) min(x + val))]
An option is also to compute the min of B + C over the 'a' rows once and then add it directly to the row sums of the 'B' and 'C' columns. The advantage is that we don't have to group or loop.
library(data.table)
Reduce(`+`, data[, .(B, C)]) + data[A == 'a', min(B + C)]
#[1] 4 12 8 10 6
Or in a single line
data[, D := B + C + min((B + C)[A == 'a'])]
data$D
#[1] 4 12 8 10 6
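A quick sanity check, on the example data, that the per-row minimum and the single-line formula agree:
val <- data[A == 'a', B + C]
identical(data[, sapply(B + C, function(x) min(x + val))],
          data[, B + C + min((B + C)[A == 'a'])])
# [1] TRUE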
A common task of mine is filtering (subsetting) datasets in the data.table format. I want to subset rows in i in a complex way, with multiple column-specific boolean conditions. When I get a new dataset, it will have the same type of columns, and I will want to filter it in the same way every time.
To illustrate my task, let me first create an example data.table.
library(data.table)
dt <- data.table(a = seq(1,6), b = letters[seq(1,6)], c = rep(c(4,3,2)))
This yields
a b c
1: 1 a 4
2: 2 b 3
3: 3 c 2
4: 4 d 4
5: 5 e 3
6: 6 f 2
Suppose I want to apply the following filtering criteria to the columns:
dt[b != 'd'][c < 4][a < 6]
yielding
a b c
1: 2 b 3
2: 3 c 2
3: 5 e 3
Is there a way to convert that filtering criteria into a variable so that I can just tag it onto the end of the data.table?
I tried
x <- [b != 'd'][c < 4][a < 6]
dt[x]
but this throws the error
Error: unexpected '[' in "x <- ["
This would be great because I could update the filtering strategy by changing just the variable x and have this filter then apply to all data.tables.
If it is to be applied to different datasets, quote the expression and evaluate it on each dataset:
i1 <- quote(b != 'd' & c < 4 & a < 6)
dt[dt[, eval(i1)]]
# a b c
#1: 2 b 3
#2: 3 c 2
#3: 5 e 3
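The quoted expression can then be reused on any dataset with the same columns. For example, with a hypothetical second table dt2:
dt2 <- data.table(a = 1:6, b = letters[6:1], c = rep(c(4,3,2), 2))
dt2[dt2[, eval(i1)]]
#    a b c
# 1: 2 e 3
# 2: 5 b 3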
I have a really big problem, and looping through the data.table to do what I want is too slow, so I am trying to get around looping. Let's assume I have a data.table as follows:
a <- data.table(i = c(1,2,3), j = c(2,2,6), k = list(c("a","b"),c("a","c"),c("b")))
> a
i j k
1: 1 2 a,b
2: 2 2 a,c
3: 3 6 b
And I want to group based on the values in k. So something like this:
a[, sum(j), by = k]
Right now I am getting the following error:
Error in `[.data.table`(a, , sum(j), by = k) :
  The items in the 'by' or 'keyby' list are length (2,2,1). Each must be same length as rows in x or number of rows returned by i (3).
The answer I am looking for is to first group all the rows having "a" in column k and calculate sum(j), then all the rows having "b", and so on. So the desired answer would be:
k V1
a 4
b 8
c 2
Any hint how to do it efficiently? I can't melt column k by repeating the rows, since the resulting data.table would be too big for my case.
I think this might work:
a[, .(k = unlist(k)), by=.(i,j)][,sum(j),by=k]
k V1
1: a 4
2: b 8
3: c 2
If we are using tidyr, a compact option would be
library(tidyr)
unnest(a, k)[, sum(j), k]
# k V1
#1: a 4
#2: b 8
#3: c 2
Or using the dplyr/tidyr pipes
unnest(a, k) %>%
  group_by(k) %>%
  summarise(V1 = sum(j))
# k V1
# <chr> <dbl>
#1 a 4
#2 b 8
#3 c 2
Since by-group operations can be slow, I'd consider...
dat = a[rep(1:.N, lengths(k)), c(.SD, .(k = unlist(a$k))), .SDcols=setdiff(names(a), "k")]
i j k
1: 1 2 a
2: 1 2 b
3: 2 2 a
4: 2 2 c
5: 3 6 b
We're repeating the rows of columns i:j to match the unlisted k. The data should probably be kept in this long format instead of using a list column. From there, as in @MikeyMike's answer, we can do dat[, sum(j), by=k].
In data.table 1.9.7+, we can similarly do
dat = a[, c(.SD[rep(.I, lengths(k))], .(k = unlist(k))), .SDcols=i:j]
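Either way, the grouped sum then reproduces the desired output:
dat[, sum(j), by = k]
#    k V1
# 1: a  4
# 2: b  8
# 3: c  2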
I was wondering what causes the following behavior, which surprised me a bit. I defined a data.table dt_3, then set dt_1 equal to dt_3. When I then used set() to replace row elements in dt_1, the corresponding elements of dt_3 were changed as well:
refcols=c("A","B")
dt_3 = data.table(A=c(1,1,3,5,6,7), B = c("x","y","z","q","w","e"), C = rep("NO",6))
dt_2 = data.table(A=c(3,5,7), B = c("z","q","x"), D=c(3,5,99))
dt_1 = dt_3
dt_3
A B C
1: 1 x NO
2: 1 y NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
for(j in refcols){
  set(dt_1, 2, j, dt_2[3, get(j)])
}
Warning messages:
1: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
2: In set(dt_1, 2, j, dt_2[3, get(j)]) :
Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
dt_3
A B C
1: 1 x NO
2: 7 x NO
3: 3 z NO
4: 5 q NO
5: 6 w NO
6: 7 e NO
What is causing this, and is there an easier way to subset by explicit row indices for specific columns like this?
We can use copy so that when we replace elements in one dataset, the other won't change:
dt_1 <- copy(dt_3)
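To see why the original assignment behaved this way, compare the memory addresses (a minimal sketch using data.table's address()): plain assignment copies only the reference, while copy() creates an independent object.
dt_1 = dt_3                      # plain assignment: both names point to the same object
address(dt_1) == address(dt_3)
# [1] TRUE
dt_1 <- copy(dt_3)               # copy(): dt_1 is now independent
address(dt_1) == address(dt_3)
# [1] FALSE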
Regarding the second part, the intended row index is not very clear. If the replacement is based only on the column index:
for(j in refcols){
  set(dt_1, i = NULL, j = j, value = dt_2[[j]])
}
dt_1
# A B C
#1: 3 z NO
#2: 5 q NO
#3: 7 x NO
#4: 3 z NO
#5: 5 q NO
#6: 7 x NO
If the 2nd row of the "A" and "B" columns in 'dt_1' should be replaced by the 3rd row of 'dt_2' for the corresponding columns (based on 'refcols'):
for(j in refcols){
  set(dt_1, i = 2L, j = j, value = dt_2[[j]][3])
}
dt_1
# A B C
#1: 1 x NO
#2: 7 x NO
#3: 3 z NO
#4: 5 q NO
#5: 6 w NO
#6: 7 e NO
I have a data frame defined as follows:
t1 <- data.frame(x=c("A","B","C"),y=c(5,7,9))
> t1
x y
1 A 5
2 B 7
3 C 9
and a vector of picks:
picks <- c("B","C","B")
How do I get these rows, with replacement and in this order, selected from the data frame?
I want:
x y
B 7
C 9
B 7
I tried
> t1[t1$x %in% picks,]
x y
2 B 7
3 C 9
and several other combinations of match, grep, which, etc., and cannot get what I want. It seems like it should be easy, but I'm not finding the path.
Or you can perform a right join using data.table:
library(data.table)
picks <- data.table(x = picks)
setDT(t1)[picks, on = "x"]
# x y
#1: B 7
#2: C 9
#3: B 7
By default the merged data.table follows the order of x in picks.
We can also use (with picks as the original character vector):
setNames(t1$y, t1$x)[picks]
#B C B
#7 9 7
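For completeness, a base R alternative (not part of the original answers) that returns the full rows in the order of picks, assuming t1 is still a plain data.frame and picks the original character vector:
t1[match(picks, t1$x), ]
#     x y
# 2   B 7
# 3   C 9
# 2.1 B 7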
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence on the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for(i in 1:dim(a)[1]){
  s <- seq(a[i,1], a[i,2])
  df <- rbind(df, data.frame(s, rep(a[i,3], length(s))))
}
colnames(df) <- c("V1","V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
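If group labels could repeat across rows, one workaround (a sketch, not part of the original answer) is to expand by row number so each row is its own group:
setDT(a)[, .(group, V1 = start:end), by = .(row = seq_len(nrow(a)))][, .(group, V1)]
# same 12-row result as above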
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "groups" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
           values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
  temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
  data.frame(group = rep(indf[["group"]], lengths(temp)),
             values = unlist(temp, use.names = FALSE))
}
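Applied to the example data, this reproduces the expected expansion (with columns named group and values):
myFun(a)
#    group values
# 1      A      1
# 2      A      2
# 3      A      3
# 4      B      2
# 5      B      3
# 6      B      4
# 7      C      3
# 8      C      4
# 9      C      5
# 10     D      4
# 11     D      5
# 12     D      6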
Then, if you want some sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of distinct values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)