complicated sums in r - different columns in different dfs - r

I am trying to compute sums in R. I have two data frames. One consists of 3 columns (tag, doy (= day of year) at the beginning, doy at the end). The other consists of 2 columns (doy, bbb (= an amount per day)).
Now, for each row of df1, I want the sum of bbb from doy.0 to doy.end.
# creating df1
tag<-c(1:5)
doy.0<-c(200:204)
doy.end<-c(207:211)
df1<-data.frame(tag, doy.0, doy.end)
# creating df2
doy<-c(200:211)
bbb<-c(12,10,18,16,20,11,15,19,25,23,21,20)
df2<-data.frame(doy,bbb)
tag doy.0 doy.end
1 1 200 207
2 2 201 208
3 3 202 209
4 4 203 210
5 5 204 211
doy bbb
1 200 12
2 201 10
3 202 18
4 203 16
5 204 20
6 205 11
7 206 15
8 207 19
9 208 25
10 209 23
11 210 21
12 211 20
So I want an additional column in df1 with the sum of bbb. For example for tag 1, I want the bbb from doy 200 to doy 207 (it should be 121 for tag 1, 134 for tag 2, etc).
I have played around a bit with for loops but couldn't figure it out. I would really appreciate your help!
Also, if you can think of a better title for this question, feel free to change it. I don't even know what to call this problem, that's how annoying it is...

df1$sum.bbb<-0
for(i in 1: nrow(df1)){
df1$sum.bbb[i]<-sum(df2[which(df2$doy[] == df1$doy.0[i]):which(df2$doy[] == df1$doy.end[i]),2])
}
> df1
tag doy.0 doy.end sum.bbb
1 1 200 207 121
2 2 201 208 134
3 3 202 209 147
4 4 203 210 150
5 5 204 211 154
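The same sums can be obtained without positional indexing. Below is a minimal base R sketch that assumes only the df1 and df2 from the question. For each row it sums the bbb values whose doy falls in the closed range from doy.0 to doy.end, so it does not rely on df2 being sorted or on every day being present.

```r
# Data from the question
df1 <- data.frame(tag = 1:5, doy.0 = 200:204, doy.end = 207:211)
df2 <- data.frame(doy = 200:211,
                  bbb = c(12, 10, 18, 16, 20, 11, 15, 19, 25, 23, 21, 20))

# For each (start, end) pair, sum bbb over the days inside the closed interval
df1$sum.bbb <- mapply(
  function(start, end) sum(df2$bbb[df2$doy >= start & df2$doy <= end]),
  df1$doy.0, df1$doy.end
)
df1
```

If a start or end day is missing from df2, this version still returns the sum over the days that are present, whereas the which()-based indexing would fail.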

Does your sum always have the pattern that it should be the sum of 8 consecutive 'bbb' - values? Then this will work:
library(dplyr)
library(zoo)
df1 %>%
mutate(newvar = rollsum(df2$bbb, 8))
tag doy.0 doy.end newvar
1 1 200 207 121
2 2 201 208 134
3 3 202 209 147
4 4 203 210 150
5 5 204 211 154
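If the window length is not always 8, a prefix-sum variant may be an option. This is only a sketch under two assumptions that hold for the example data: df2 is sorted by doy with no gaps, and every doy.0 and doy.end appears in df2.

```r
# Data from the question
df1 <- data.frame(tag = 1:5, doy.0 = 200:204, doy.end = 207:211)
df2 <- data.frame(doy = 200:211,
                  bbb = c(12, 10, 18, 16, 20, 11, 15, 19, 25, 23, 21, 20))

# Prefix sums: cs[k + 1] holds the sum of the first k values of bbb
cs <- c(0, cumsum(df2$bbb))

# Row positions of each window's first and last day in df2
first <- match(df1$doy.0, df2$doy)
last  <- match(df1$doy.end, df2$doy)

# Sum over rows first..last = (sum of first `last` values) - (sum of first `first - 1` values)
df1$newvar <- cs[last + 1] - cs[first]
df1
```

Each window sum then costs two lookups, whatever the window length.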

A solution using tidyverse, the loop is hidden in purrr::map :
replyr::replyr_bind_rows(
purrr::map(
replyr::replyr_split(df1,"tag"),
function(x) data.frame(tag=x$tag,
df2 %>% filter((doy>=x$doy.0)&(doy<=x$doy.end)) %>% summarise(bbb=sum(bbb)))
))
# tag bbb
#1 1 121
#2 2 134
#3 3 147
#4 4 150
#5 5 154

With data.frame:
df1b <- do.call(rbind,
apply(df1,
1,
function(x) data.frame(tag = rep(x["tag"], x["doy.end"] - x["doy.0"] + 1),
doy = x["doy.0"]:x["doy.end"])))
merge(df1, aggregate(bbb ~ tag, merge(df1b, df2), sum))
tag doy.0 doy.end bbb
1 1 200 207 121
2 2 201 208 134
3 3 202 209 147
4 4 203 210 150
5 5 204 211 154
And using data.table:
library(data.table)
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)
df1[df2,
on = .(doy.0 <= doy, doy.end >= doy),
allow.cartesian = TRUE][,
.(doy.0 = min(doy.0), doy.end = max(doy.end), bbb = sum(bbb)),
by = .(tag)]
tag doy.0 doy.end bbb
1: 1 200 207 121
2: 2 201 208 134
3: 3 202 209 147
4: 4 203 210 150
5: 5 204 211 154

You can use data.table and a non-equi join for this. If your sum always has the same pattern, the answer of @Len is very good. If your sum has different patterns, data.table is a very fast solution.
library(data.table)
# add sum of bbb to table 1 from table 2
dt1[, bbb := dt2[dt1, on=.(doy >= doy.0, doy <= doy.end), sum(bbb), by=.EACHI]$V1]
dt1
tag doy.0 doy.end bbb
1: 1 200 207 121
2: 2 201 208 134
3: 3 202 209 147
4: 4 203 210 150
5: 5 204 211 154
data:
tag<-c(1:5)
doy.0<-c(200:204)
doy.end<-c(207:211)
dt1<- data.table(tag, doy.0, doy.end) # data.table instead of data.frame
# creating dt2
doy<-c(200:211)
bbb<-c(12,10,18,16,20,11,15,19,25,23,21,20)
dt2<- data.table(doy,bbb) # data.table instead of data.frame

We could do a fuzzy join and aggregate:
library(fuzzyjoin)
library(dplyr)
fuzzy_join(df1, df2, c(doy.0 = "doy", doy.end = "doy"),
list(`<=`,`>=`)) %>%
group_by(tag,doy.0,doy.end) %>%
summarize_at("bbb",sum) %>%
ungroup
# # A tibble: 5 x 4
# tag doy.0 doy.end bbb
# <int> <int> <int> <dbl>
# 1 1 200 207 121
# 2 2 201 208 134
# 3 3 202 209 147
# 4 4 203 210 150
# 5 5 204 211 154
And a base R translation:
x <- expand.grid(tag= df1$tag,doy = df2$doy)
x <- merge(x,df1,all.x=TRUE)
x <- merge(x,df2,all.x=TRUE)
x <- subset(x, doy >= doy.0 & doy <= doy.end)
x <- aggregate(bbb ~ tag, x, sum)
merge(df1,x)
# tag doy.0 doy.end bbb
# 1 1 200 207 121
# 2 2 201 208 134
# 3 3 202 209 147
# 4 4 203 210 150
# 5 5 204 211 154

Related

decrease the time of script running (for loop based)

I wrote a script based on two for loops that I would like to optimize to speed up its running time.
Below are reproducible data that I simplified with the code that I am using on my own data.
nuc is a vector with 101 positions and
tel is a data frame with different coordinates "aa" and "bb".
The aim is to count, for each position, the number of coordinate pairs (aa, bb) that the position lies between. For example, position 111 lies between 3 pairs of coordinates: G, H and J
#data
tel=data.frame(aa=c(153,113,163,117,193,162,110,109,186,103),
bb=c(189,176,185,130,200,189,156,123,198,189),
ID=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J"))
> tel
aa bb ID
1 153 189 A
2 113 176 B
3 163 185 C
4 117 130 D
5 193 200 E
6 162 189 F
7 110 156 G
8 109 123 H
9 186 198 I
10 103 189 J
nuc2=100:200
# Loop
count_occ=0
count_occ_int=NULL
count_occ_fin=NULL
for (j in 1:length(nuc2)){
for (i in 1:nrow(tel)) {
if (nuc2[j]< tel$bb[i] & nuc2[j]>tel$aa[i])
{count_occ=count_occ+1}
}
count_occ_int=count_occ
count_occ_fin=c(count_occ_fin,count_occ_int)
count_occ=0
}
nuc_occ=data.frame(nuc=nuc2, occ=count_occ_fin)
> head(nuc_occ,20)
nuc occ
1 100 0
2 101 0
3 102 0
4 103 0
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 1
11 110 2
12 111 3
13 112 3
14 113 3
15 114 4
16 115 4
17 116 4
18 117 4
19 118 5
20 119 5
In my data, the length of my nuc vector is 9304567 and the number of coordinate pairs is 53 (I will have a few hundred soon), and it took more than 60 hours to run the code!
Any ideas on how to speed up this code?
I thought of the apply function, but I am not sure how to combine the two for loop operations.
You can use data.table non-equi join like this:
library(data.table)
setDT(tel)[SJ(v=nuc2), on=.(aa<=v, bb>=v)][,.(occ = sum(!is.na(ID))), by=.(nuc=aa)]
Explanation:
setDT(tel) sets the tel data.frame to be of class data.table
SJ(v=nuc2) is a convenience function for converting a vector to a data.table; in this case it converts nuc2 to a data.table with one column v. I'm doing this because I want to join two data.tables: tel (with columns aa, bb and ID) and one with a single column v holding the values in nuc2
the join conditions are in the on=.. param of the setDT(tel)[...] clause; here the join condition is that the v value must be >= the aa value and must be <= the bb value
the final step (i.e. the next chained data.table operation) simply counts the number of rows where ID is not NA, by nuc value (by=.(nuc=aa))
Output:
nuc occ
<int> <int>
1: 100 0
2: 101 0
3: 102 0
4: 103 1
5: 104 1
---
97: 196 2
98: 197 2
99: 198 2
100: 199 1
101: 200 1
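Note that the join uses inclusive bounds (aa <= v and bb >= v), while the loop in the question uses strict ones, which is why position 103 gets a count of 1 here but 0 in the question. A small base R sketch to compare the two definitions on the example data (the names inclusive, strict and cmp are made up for this illustration):

```r
tel <- data.frame(aa = c(153, 113, 163, 117, 193, 162, 110, 109, 186, 103),
                  bb = c(189, 176, 185, 130, 200, 189, 156, 123, 198, 189),
                  ID = LETTERS[1:10])
nuc2 <- 100:200

# Logical matrices: one row per position, one column per (aa, bb) pair
inclusive <- outer(nuc2, tel$aa, ">=") & outer(nuc2, tel$bb, "<=")
strict    <- outer(nuc2, tel$aa, ">")  & outer(nuc2, tel$bb, "<")

cmp <- data.frame(nuc = nuc2,
                  occ_inclusive = rowSums(inclusive),
                  occ_strict    = rowSums(strict))

# Positions where the two definitions disagree (those equal to some aa or bb)
head(cmp[cmp$occ_inclusive != cmp$occ_strict, ])
```

The outer() matrices are fine for 101 positions, but they hold one cell per position/pair combination, so this is meant as a check on small data rather than something to run on millions of positions.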
Here's a tidyverse solution:
library(dplyr)
library(tibble)
lapply(
100:200,
\(x) tel %>%
filter(aa <= x & x <= bb) %>%
summarise(occ=n(), .groups="drop") %>%
add_column(nuc=x, .before=1)
) %>%
bind_rows() %>%
as_tibble()
# A tibble: 101 × 2
nuc occ
<int> <int>
1 100 0
2 101 0
3 102 0
4 103 1
5 104 1
6 105 1
7 106 1
8 107 1
9 108 1
10 109 2
# … with 91 more rows
Using microbenchmark to assess performance, this gives
Unit: nanoseconds
expr min lq mean median uq max neval
lapply 7 9 8.8 9 9 9 10
original 8 9 23.8 9 9 158 10
In other words, the mean run time drops by about two-thirds (the medians are the same). And the tidyverse is not known for speed. A base R solution is likely to be faster still.
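One possible base R version, given here as a sketch rather than a benchmarked solution: loop over the (few) coordinate pairs instead of the (many) positions, and let each iteration do one vectorised comparison over all positions. It keeps the strict inequalities of the loop in the question, so it reproduces the question's output.

```r
tel <- data.frame(aa = c(153, 113, 163, 117, 193, 162, 110, 109, 186, 103),
                  bb = c(189, 176, 185, 130, 200, 189, 156, 123, 198, 189),
                  ID = LETTERS[1:10])
nuc2 <- 100:200

# One counter per position, updated once per coordinate pair
occ <- integer(length(nuc2))
for (i in seq_len(nrow(tel))) {
  # TRUE counts as 1 for every position strictly inside pair i
  occ <- occ + (nuc2 > tel$aa[i] & nuc2 < tel$bb[i])
}

nuc_occ <- data.frame(nuc = nuc2, occ = occ)
head(nuc_occ, 20)
```

With 53 pairs this is 53 vectorised passes over the position vector and no growing of a result vector with c(). Timings on the full data are not shown here.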

Calculate mean of all groups except the current group

I have a data frame with two grouping variables, 'mkt' and 'mdl', and some values 'pr':
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2),
mdl = c('a','a','b','b','b','a','b','a','b'),
pr = c(120,120,110,110,145,130,145,130, 145))
df
mkt mdl pr
1 1 a 120
2 1 a 120
3 1 b 110
4 1 b 110
5 2 b 145
6 2 a 130
7 2 b 145
8 2 a 130
9 2 b 145
Within each 'mkt', the mean 'pr' for each 'mdl' should be calculated as the mean of 'pr' of all other 'mdl' in the same 'mkt', except the current 'mdl'.
For example, for the group defined by mkt == 1 and mdl == a, the 'avgother' is calculated as the average of 'pr' for mkt == 1 (same 'mkt') and mdl == b (all other 'mdl' than the current group a).
Desired result:
# mkt mdl pr avgother
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
First compute the average for each combination of mkt and mdl. Then, within each mkt, exclude the current group and take the average of the remaining group averages.
library(dplyr)
library(purrr)
df %>%
group_by(mkt, mdl) %>%
summarise(avgother = mean(pr)) %>%
mutate(avgother = map_dbl(row_number(), ~mean(avgother[-.x]))) %>%
ungroup %>%
inner_join(df, by = c('mkt', 'mdl'))
# mkt mdl avgother pr
# <dbl> <chr> <dbl> <dbl>
#1 1 a 110 120
#2 1 a 110 120
#3 1 b 120 110
#4 1 b 120 110
#5 2 a 145 130
#6 2 a 145 130
#7 2 b 130 145
#8 2 b 130 145
#9 2 b 130 145
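One thing to be aware of: averaging the other groups' means is not always the same as averaging the other groups' rows. The two agree on the data above because each market has only one other group, but they can differ when there are several other groups of unequal size. A base R sketch on a small made-up market (the data frame d is hypothetical, not from the question):

```r
# Hypothetical single market: group b has two rows, a and c have one each
d <- data.frame(mdl = c("a", "b", "b", "c"),
                pr  = c(10, 20, 20, 50),
                stringsAsFactors = FALSE)

# Mean over all rows that belong to other groups
row_based <- unname(sapply(d$mdl, function(m) mean(d$pr[d$mdl != m])))

# Mean over the per-group means of the other groups
grp_means <- tapply(d$pr, d$mdl, mean)
group_based <- unname(sapply(d$mdl, function(m) mean(grp_means[names(grp_means) != m])))

data.frame(d, row_based, group_based)
```

For mdl a, the row-based value is mean(20, 20, 50) = 30, while the group-based value is mean(20, 50) = 35. Which one is wanted depends on the definition of 'avgother'.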
Using data.table, calculate sum and length by 'mkt'. Then, within each mkt-mdl group, calculate mean as (mkt sum - group sum) / (mkt length - group length)
library(data.table)
setDT(df)[ , `:=`(s = sum(pr), n = .N), by = mkt]
df[ , avgother := (s - sum(pr)) / (n - .N), by = .(mkt, mdl)]
df[ , `:=`(s = NULL, n = NULL)]
# mkt mdl pr avgother
# 1: 1 a 120 110
# 2: 1 a 120 110
# 3: 1 b 110 120
# 4: 1 b 110 120
# 5: 2 b 145 130
# 6: 2 a 130 145
# 7: 2 b 145 130
# 8: 2 a 130 145
# 9: 2 b 145 130
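To see why the formula works, it can be checked by hand for a single group. A small base R sketch for mkt == 2 and mdl == 'b', using the data from the question (the helper names in_mkt and in_grp are made up for this check):

```r
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2),
                 mdl = c('a','a','b','b','b','a','b','a','b'),
                 pr  = c(120,120,110,110,145,130,145,130,145))

in_mkt <- df$mkt == 2
in_grp <- in_mkt & df$mdl == "b"

s <- sum(df$pr[in_mkt])   # market sum: 695
n <- sum(in_mkt)          # market row count: 5

# (market sum - group sum) / (market count - group count)
by_formula <- (s - sum(df$pr[in_grp])) / (n - sum(in_grp))

# Direct definition: mean of pr over the other rows in the same market
direct <- mean(df$pr[in_mkt & df$mdl != "b"])

c(by_formula = by_formula, direct = direct)
```

Both give (695 - 435) / (5 - 3) = 130.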
Consider base R with multiple ave calls for different level grouping calculation using the decomposed version of mean with sum / count:
df <- within(df, {
avgoth <- (ave(pr, mkt, FUN=sum) - ave(pr, mkt, mdl, FUN=sum)) /
(ave(pr, mkt, FUN=length) - ave(pr, mkt, mdl, FUN=length))
})
df
# mkt mdl pr avgoth
# 1 1 a 120 110
# 2 1 a 120 110
# 3 1 b 110 120
# 4 1 b 110 120
# 5 2 b 145 130
# 6 2 a 130 145
# 7 2 b 145 130
# 8 2 a 130 145
# 9 2 b 145 130
For the sake of completeness, here is another data.table approach which uses grouping by each i, i.e., join and aggregate simultaneously.
For demonstration, an enhanced sample dataset is used which has a third market with 3 products:
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3]),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3))
library(data.table)
mdt <- setDT(df)[, .(mdl, s = sum(pr), .N), by = .(mkt)]
df[mdt, on = .(mkt, mdl), avgother := (sum(pr) - s) / (.N - N), by = .EACHI][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
The temporary table mdt contains the sum and count of prices within each mkt but replicated for each product mdl within the market:
mdt
mkt mdl s N
1: 1 a 460 4
2: 1 a 460 4
3: 1 b 460 4
4: 1 b 460 4
5: 2 b 695 5
6: 2 a 695 5
7: 2 b 695 5
8: 2 a 695 5
9: 2 b 695 5
10: 3 a 6 3
11: 3 b 6 3
12: 3 c 6 3
Having mkt and mdl in mdt allows for grouping by each i (by = .EACHI)
Here is an approach which computes avgother directly by subsetting pr values which do not belong to the actual value of mdl before computing the averages.
This is quite different from the other answers posted so far, which justifies posting it as a separate answer, IMHO.
# enhanced sample dataset covering more corner cases
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3,4),
mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3],'d'),
pr = c(120,120,110,110,145,130,145,130, 145, 1:3, 9))
library(data.table)
setDT(df)[, avgother := sapply(mdl, function(m) mean(pr[m != mdl])), by = mkt][]
mkt mdl pr avgother
1: 1 a 120 110.0
2: 1 a 120 110.0
3: 1 b 110 120.0
4: 1 b 110 120.0
5: 2 b 145 130.0
6: 2 a 130 145.0
7: 2 b 145 130.0
8: 2 a 130 145.0
9: 2 b 145 130.0
10: 3 a 1 2.5
11: 3 b 2 2.0
12: 3 c 3 1.5
13: 4 d 9 NaN
Difference between approaches
The other answers share more or less the same approach (although implemented in different manners)
compute sums and counts of pr for each mkt
compute sums and counts of pr for each mkt and mdl
subtract mkt/mdl sums and counts from mkt sums and counts
compute avgother
This approach
groups by mkt
loops through mdl within each mkt,
subsets pr to drop the values which belong to the current value of mdl
before computing mean() directly.
Caveat concerning performance: Although the code essentially is a one-liner it does not imply it is the fastest.
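For readers without data.table, the same direct-subsetting idea can be sketched in base R with split() and unsplit(). The helper other_mean is a name made up for this illustration; the data is the enhanced sample dataset from above.

```r
df <- data.frame(mkt = c(1,1,1,1,2,2,2,2,2,3,3,3,4),
                 mdl = c('a','a','b','b','b','a','b','a','b', letters[1:3], 'd'),
                 pr  = c(120,120,110,110,145,130,145,130,145, 1:3, 9),
                 stringsAsFactors = FALSE)

# For one market: for each row, mean of pr over rows with a different mdl
other_mean <- function(g) {
  vapply(g$mdl, function(m) mean(g$pr[g$mdl != m]), numeric(1), USE.NAMES = FALSE)
}

# Apply per market and put the results back in the original row order
df$avgother <- unsplit(lapply(split(df, df$mkt), other_mean), df$mkt)
df
```

Market 4 has only one mdl, so there are no other rows to average and the result is NaN, as in the data.table output.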

Eliminate rows that have a match in one of multiple columns of the preceding rows

I have a large data.frame. Here is a simplified version for clarity.
ID <- rep(c(1,2,3),each=4)
Bed <- rep(c(1,1,2,2),3)
ERRBeg <- c(90,140,190,200,290,340,390,100,490,540,560,610)
POST1Beg <- c(100,150,200,250,300,350,400,450,500,550,600,650)
POST2Beg <- c(110,160,210,260,310,360,410,460,510,560,610,660)
DATA <- data.frame(ID,Bed,ERRBeg,POST1Beg,POST2Beg)
It looks like that:
I want to delete all rows that have the following match:
The value of ERRBeg is found in POST1Beg or POST2Beg (I have more such variables) in one of the previous rows (only if ID and Bed are the same)
ID Bed ERRBeg POST1Beg POST2Beg LAG_ERRBeg LAG_POST1Beg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 90 100 110 NA NA
2 1 1 140 150 160 90 100
3 1 2 190 200 210 NA NA
4 1 2 200 250 260 190 200
5 2 1 290 300 310 NA NA
6 2 1 340 350 360 290 300
7 2 2 390 400 410 NA NA
8 2 2 100 450 460 390 400
9 3 1 490 500 510 NA NA
10 3 1 540 550 560 490 500
11 3 2 560 600 610 NA NA
12 3 2 610 650 660 560 600
I tried this, which gives me exactly the rows where the two variables match. However, if I turn it around using filter(!ERRBeg == lag(POST1Beg)), it deletes all lines where ID and Bed have duplicates.
DATA %>%
group_by(ID, Bed)%>%
filter(ERRBeg == lag(POST1Beg) ) %>%
ungroup()
I also tried this, which does not work. I know I might be missing something trivial, but I do not see it.
DATA_xx <- DATA %>%
group_by(ID, Bed)%>%
filter(ERRBeg %in% c(lag(ERRBeg),lag(POST1Beg)) ) %>%
ungroup()
Desired Output:
ID Bed ERRBeg POST1Beg POST2Beg LAG_ERRBeg LAG_POST1Beg
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 90 100 110 NA NA
2 1 1 140 150 160 90 100
3 1 2 190 200 210 NA NA
5 2 1 290 300 310 NA NA
6 2 1 340 350 360 290 300
7 2 2 390 400 410 NA NA
8 2 2 100 450 460 390 400
9 3 1 490 500 510 NA NA
10 3 1 540 550 560 490 500
11 3 2 560 600 610 NA NA
DATA %>%
group_by(ID, Bed)%>%
filter(!ERRBeg %in% POST1Beg ) %>%
ungroup()
Here I switched the lag() comparison to %in%, and I think it works.
Edit: This will not work if the ERRBeg value appears in POST1Beg in a later row of the same group.
Putting lag() back around POST1Beg should fix this, I believe:
DATA %>%
group_by(ID, Bed)%>%
filter(!ERRBeg %in% lag(POST1Beg) ) %>%
ungroup()
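Keep in mind that lag() only looks at the immediately preceding row. If "one of the previous rows" is meant literally (any earlier row in the same ID/Bed group), something like the following base R sketch could be used. It follows the rule as stated in the question, i.e. it checks ERRBeg against POST1Beg and POST2Beg only; the variable names keep, groups and seen are made up for this illustration.

```r
ID <- rep(c(1, 2, 3), each = 4)
Bed <- rep(c(1, 1, 2, 2), 3)
ERRBeg <- c(90, 140, 190, 200, 290, 340, 390, 100, 490, 540, 560, 610)
POST1Beg <- c(100, 150, 200, 250, 300, 350, 400, 450, 500, 550, 600, 650)
POST2Beg <- c(110, 160, 210, 260, 310, 360, 410, 460, 510, 560, 610, 660)
DATA <- data.frame(ID, Bed, ERRBeg, POST1Beg, POST2Beg)

keep <- rep(TRUE, nrow(DATA))

# Row indices of each ID/Bed group, in their original order
groups <- split(seq_len(nrow(DATA)), list(DATA$ID, DATA$Bed), drop = TRUE)

for (rows in groups) {
  seen <- numeric(0)  # POST values of all earlier rows in this group
  for (r in rows) {
    if (DATA$ERRBeg[r] %in% seen) keep[r] <- FALSE
    seen <- c(seen, DATA$POST1Beg[r], DATA$POST2Beg[r])
  }
}

DATA[keep, ]
```

On the example data this drops rows 4 and 12, matching the desired output. More POST columns can be added to the `seen` update in the same way.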
Found the problem and the solution. :)
DATA %>%
group_by(ID, Bed)%>%
filter(!ERRBeg %in% c(lag(ERRBeg),lag(POST1Beg),lag(POST2Beg)) | is.na(lag(ERRBeg)) ) %>%
ungroup()
The problem was that I do not only get TRUE and FALSE, but also NA as a result of the comparison in the filter.
ID Bed ERRBeg POST1Beg POST2Beg FILTER
<dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
1 1 1 90 100 110 NA
2 1 1 140 150 160 FALSE
3 1 2 190 200 210 NA
4 2 1 290 300 310 NA
5 2 1 340 350 360 FALSE
6 2 2 390 400 410 NA
7 2 2 100 450 460 FALSE
8 3 1 490 500 510 NA
9 3 1 540 550 560 FALSE
10 3 2 560 600 610 NA
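The NA behaviour can be reproduced in base R without dplyr. In this sketch, `lagged` is built by hand to mimic what lag() returns for the first ID/Bed group (an NA followed by the previous value); dplyr's filter() drops rows where the condition is NA, which is why the first row of every group disappeared.

```r
# First group (ID 1, Bed 1) from the example data
err   <- c(90, 140)
post1 <- c(100, 150)

# Hand-made equivalent of lag(post1): shift down by one, pad with NA
lagged <- c(NA, head(post1, -1))

eq_cmp <- err == lagged      # NA for the first row, FALSE for the second
in_cmp <- err %in% lagged    # %in% never returns NA: FALSE, FALSE

data.frame(err, lagged, eq_cmp, not_eq = !eq_cmp, not_in = !in_cmp)
```

Negating the `==` comparison leaves the NA in place, so that row is still dropped; negating the `%in%` test gives TRUE for it, so the row is kept.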

Indexing subgroups by sorted positions in R dataframe

I have a dataframe which contains information about several categories, and some associated variables. It is of the form:
ID category sales score
227 A 109 21
131 A 410 24
131 A 509 1
123 B 2 61
545 B 19 5
234 C 439 328
654 C 765 41
What I would like to do is be able to introduce two new columns, salesRank and scoreRank, where I find the item index per category, had they been ordered by sales and score, respectively. I can solve the general case like this:
dF <- dF[order(-dF$sales),]
dF$salesRank<-seq.int(nrow(dF))
but this doesn't account for the categories and so far I've only solved this by breaking up the dataframe. What I want would result in the following:
ID category sales score salesRank scoreRank
227 A 109 21 3 2
131 A 410 24 2 1
131 A 509 1 1 3
123 B 2 61 2 1
545 B 19 5 1 2
234 C 439 328 2 1
654 C 765 41 1 2
Many thanks!
Try:
library(dplyr)
df %>%
group_by(category) %>%
mutate(salesRank = row_number(desc(sales)),
scoreRank = row_number(desc(score)))
Which gives:
#Source: local data frame [7 x 6]
#Groups: category
#
# ID category sales score salesRank scoreRank
#1 227 A 109 21 3 2
#2 131 A 410 24 2 1
#3 131 A 509 1 1 3
#4 123 B 2 61 2 1
#5 545 B 19 5 1 2
#6 234 C 439 328 2 1
#7 654 C 765 41 1 2
From the help:
row_number(): equivalent to rank(ties.method = "first")
min_rank(): equivalent to rank(ties.method = "min")
desc(): transform a vector into a format that will be sorted in descending
order.
As @thelatemail pointed out, for this particular dataset you might want to use min_rank() instead of row_number(), which will account for ties in sales/score more appropriately:
> row_number(c(1,2,2,4))
#[1] 1 2 3 4
> min_rank(c(1,2,2,4))
#[1] 1 2 2 4
Use ave in base R with rank (the - is to reverse the rankings from low-to-high to high-to-low):
dF$salesRank <- with(dF, ave(-sales, category, FUN=rank) )
#[1] 3 2 1 2 1 2 1
dF$scoreRank <- with(dF, ave(-score, category, FUN=rank) )
#[1] 2 1 3 1 2 1 2
I have just a base R solution with tapply.
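# rank(-x) gives the descending rank of each element; order() would return the
# sorting permutation instead, which only happens to coincide on this sample
salesRank <- tapply(dat$sales, dat$category, function(x) rank(-x, ties.method = "first"))
scoreRank <- tapply(dat$score, dat$category, function(x) rank(-x, ties.method = "first"))
# unlist() assumes the rows of dat are already sorted by category
cbind(dat, salesRank = unlist(salesRank), scoreRank= unlist(scoreRank))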
ID category sales score salesRank scoreRank
A1 227 A 109 21 3 2
A2 131 A 410 24 2 1
A3 131 A 509 1 1 3
B1 123 B 2 61 2 1
B2 545 B 19 5 1 2
C1 234 C 439 328 2 1
C2 654 C 765 41 1 2
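When ranking within groups, order() and rank() are easy to confuse: order() returns the positions that would sort the vector, while rank() returns the rank of each element. A minimal sketch with a made-up vector where the two differ:

```r
x <- c(10, 30, 20)  # hypothetical values, not from the question

ord <- order(x, decreasing = TRUE)  # positions of the largest, 2nd largest, ...
rnk <- rank(-x)                     # descending rank of each element

rbind(x = x, order = ord, rank = rnk)

# Applying order() twice turns the sorting permutation into ranks (no ties here)
order(order(-x))
```

Here order() gives 2 3 1 (the largest value sits in position 2), while the ranks are 3 1 2 (the first value is the smallest). For a rank column it is the second result that is wanted.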

in r, how can one trim or winsorize data by a factor

I'm trying to apply the winsor function at each level of a factor (subjects) in order to remove extreme cases. I can apply the winsor function to the entire column, but would like to do it within subject.
Subject RT
1 402
1 422
1 155
1 460
2 283
2 224
2 346
2 447
3 415
3 161
3 1
3 343
Ideally, I'd like the output to be a vector containing the same number of rows as the input but with outliers (e.g. the second last value of Subject 3) to be removed and replaced as per the winsor function.
You are looking for the ?by function
# for example (winsor() comes from the psych package):
library(psych)
by(myDF, myDF$Subject, function(d) winsor(d$RT))
However, using data.table (instead of data.frame) might be better suited for you
### broken down step by step:
library(data.table)
myDT <- data.table(myDF)
myDT[, winsorResult := winsor(RT), by=Subject]
library(psych)
transform(dat,win = ave(RT,Subject,FUN=winsor))
Subject RT win
1 1 402 402.0
2 1 422 422.0
3 1 155 303.2
4 1 460 437.2
5 2 283 283.0
6 2 224 259.4
7 2 346 346.0
8 2 447 386.4
9 3 415 371.8
10 3 161 161.0
11 3 1 97.0
12 3 343 343.0
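To make clear what happens within each subject, here is a base R sketch that does not need psych. The helper winsorize_vec is a made-up name, not a psych function; it clamps values to the 20% and 80% quantiles, which is an assumption chosen because it reproduces the output shown above.

```r
dat <- data.frame(Subject = rep(1:3, each = 4),
                  RT = c(402, 422, 155, 460, 283, 224, 346, 447, 415, 161, 1, 343))

# Hypothetical helper: pull values outside the [trim, 1 - trim] quantiles
# back to those quantiles
winsorize_vec <- function(x, trim = 0.2) {
  lims <- quantile(x, probs = c(trim, 1 - trim), names = FALSE)
  pmin(pmax(x, lims[1]), lims[2])
}

# ave() applies the helper per Subject and keeps the original row order
dat$win <- ave(dat$RT, dat$Subject, FUN = winsorize_vec)
dat
```

The result has the same number of rows as the input, with the extreme values of each subject replaced rather than removed.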
