R: Multiplying columns of two data frames

I have two data frames, one with quantities and one with prices:
Quantities <- data.frame(region=c("US","US","EU","China","EU"),a = 1:5, b = 5:9, c=8:12)
prices_frame <- data.frame(region=c("US","EU","China"),a = c(10,20,30), b = c(10,20,30), c=c(1000,2000,100))
Quantities
region a b c
1 US 1 5 8
2 US 2 6 9
3 EU 3 7 10
4 China 4 8 11
5 EU 5 9 12
prices_frame
region a b c
1 US 10 10 1000
2 EU 20 20 2000
3 China 30 30 100
Is there a way to quickly multiply the quantities by the prices of the matching region without having to loop through the entire quantity frame?
Best Alex

We can use match, assuming the columns a, b, c are in the same order in both data frames:
Quantities[-1] * prices_frame[match(Quantities$region, prices_frame$region), -1]
# a b c
#1 10 50 8000
#2 20 60 9000
#3 60 140 20000
#4 120 240 1100
#5 100 180 24000
To get the dataframe with the same number of columns,
new_df <- cbind(Quantities[1],
Quantities[-1] * prices_frame[match(Quantities$region, prices_frame$region), -1])
# region a b c
#1 US 10 50 8000
#2 US 20 60 9000
#3 EU 60 140 20000
#4 China 120 240 1100
#5 EU 100 180 24000
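The match approach above relies on the value columns being in the same order in both data frames. A minimal sketch that drops this assumption by selecting the price columns by name follows. The names value_cols, row_idx and result are illustrative and not part of the original answer, and the price columns are shuffled here on purpose.

```r
Quantities <- data.frame(region = c("US", "US", "EU", "China", "EU"),
                         a = 1:5, b = 5:9, c = 8:12)
# Price columns deliberately stored in a different order than in Quantities
prices_frame <- data.frame(region = c("US", "EU", "China"),
                           c = c(1000, 2000, 100),
                           a = c(10, 20, 30),
                           b = c(10, 20, 30))

# Value columns are everything except the key column
value_cols <- setdiff(names(Quantities), "region")

# Row of prices_frame that belongs to each row of Quantities
row_idx <- match(Quantities$region, prices_frame$region)
stopifnot(!anyNA(row_idx))  # fail early if a region has no price

# Index the price columns by name so that column order no longer matters
result <- cbind(Quantities["region"],
                Quantities[value_cols] * prices_frame[row_idx, value_cols])
result
```

The stopifnot() check is a design choice for this sketch: without it, a region missing from prices_frame would silently produce NA rows.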

We can use a join in data.table
library(data.table)
nm1 <- names(Quantities)[-1]
setDT(Quantities)[prices_frame, (nm1) := Map(`*`, mget(nm1),
mget(paste0("i.", nm1))) , on = "region"]
Quantities
# region a b c
#1: US 10 50 8000
#2: US 20 60 9000
#3: EU 60 140 20000
#4: China 120 240 1100
#5: EU 100 180 24000

Related

R: Conditional Sum by Row in DataTable

I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
Datatable format as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A B C
1 500 10
1 510 20
1 540 10
2 500 20
2 540 10
3 500 50
3 510 20
3 519 50
3 540 20
4 500 10
I need the sum of column C (in a new column, D) subject to A == A, and B >= B & B < B + 20 (by row). So the output table would look like the following:
A B C D
1 500 10 30
1 510 20 30
1 540 10 10
2 500 20 20
2 540 10 10
3 500 50 120
3 510 20 120
3 519 50 120
3 540 20 20
4 500 10 10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A B C D
1 500 10 10
1 510 20 30
1 540 10 10
2 500 20 20
2 540 10 10
3 500 50 50
3 510 20 70
3 519 50 120
3 540 20 20
4 500 10 10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative issue than with speed. Ultimately, what I'm trying to get to is the largest sum of C, by A, subject to the B values being within 20 of each other. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
DT, on = .(A, B >= B, B < Bp20), mult = "last"][
, .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
, Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := T]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust the logic for B, as not all edge cases are clear from the question. For example, if for one A value we have B values c(30, 40, 50, 60), are all of those rows in one group?
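One way to read the requirement is as "chained" runs: within each A, a new group starts whenever the gap to the previous B is 20 or more. The sketch below assumes that interpretation and uses a plain data.frame with base R instead of data.table; new_run and run_id are illustrative names. It reproduces the desired D column for the sample data, but it has not been tested on other edge cases.

```r
DT <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 B = c(500, 510, 540, 500, 540, 500, 510, 519, 540, 500),
                 C = c(10, 20, 10, 20, 10, 50, 20, 50, 20, 10))

# The run logic needs rows sorted by A, then B
DT <- DT[order(DT$A, DT$B), ]

# A new run starts on the first row, when A changes,
# or when B jumps by 20 or more relative to the previous row
new_run <- c(TRUE, diff(DT$A) != 0 | diff(DT$B) >= 20)
run_id <- cumsum(new_run)

# Sum C within each run and repeat the total on every row of the run
DT$D <- ave(DT$C, run_id, FUN = sum)
DT
```

Under this interpretation, B values of c(30, 40, 50, 60) within one A form a single run, because every consecutive gap is below 20.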

Make a spaghetti plot of data

I would like to make a spaghetti plot of the data below. Treatment C should be set as the reference (1) and compared to treatments A and B. Does anyone have a suggestion for how to do that? Thanks in advance! :)
id <- rep(c(300,450), each=6)
trt <- rep(c("A","B","C"),2)
q1 <- c(100,89, 60,85,40,10)
df <- data.frame(id,trt,q1)
df
id trt q1
1 300 A 100
2 300 B 89
3 300 C 60
4 300 A 85
5 300 B 40
6 300 C 10
7 450 A 100
8 450 B 89
9 450 C 60
10 450 A 85
11 450 B 40
12 450 C 10
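A possible starting point, sketched in base R under two assumptions that the question does not confirm: the k-th A, B and C rows within an id belong to the same replicate, and "C as reference" means dividing each q1 by the q1 of treatment C in that replicate. The columns rep and rel and the function plot_spaghetti are hypothetical names introduced for this illustration.

```r
id  <- rep(c(300, 450), each = 6)
trt <- rep(c("A", "B", "C"), 2)
q1  <- c(100, 89, 60, 85, 40, 10)
df  <- data.frame(id, trt, q1)

# Assumption: the k-th occurrence of a treatment within an id is replicate k
df$rep <- ave(seq_along(df$q1), df$id, df$trt, FUN = seq_along)

# q1 of treatment C within each id/replicate, repeated on every row
ref <- ave(ifelse(df$trt == "C", df$q1, 0), df$id, df$rep, FUN = sum)

# Relative value: treatment C becomes exactly 1
df$rel <- df$q1 / ref

# One line per id/replicate across the treatments
plot_spaghetti <- function(d) {
  trts <- sort(unique(d$trt))
  plot(NA, xlim = c(1, length(trts)), ylim = range(d$rel), xaxt = "n",
       xlab = "Treatment", ylab = "q1 relative to C")
  axis(1, at = seq_along(trts), labels = trts)
  for (s in split(d, list(d$id, d$rep), drop = TRUE)) {
    s <- s[order(s$trt), ]
    lines(match(s$trt, trts), s$rel, type = "b",
          col = match(s$id[1], unique(d$id)))
  }
}

if (interactive()) plot_spaghetti(df)
```

Lines are coloured by id. In the sample data both ids carry identical values, so their lines overlap.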

Select nth observation and sum by group using data.table

I would like to turn the first table into the second by selecting the last observation of a group for a and b, the first observation for c, sum each observation for the group for d and e, and for f, check if a valid date exists and use that date.
Table 1:
ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976
Table 2:
ID a b c d e f
1 11 111 1000 30030 300300 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 60010 600100 20/12/1978
4 44 444 4000 240150 2401500 7/06/1944
5 55 555 5000 100010 1000100 31/05/1976
I have looked up StackOverflow questions and I have only seen elements of this. I can do a through to e in the following steps.
library(data.table)
setwd('D:/Work/BRB/StackOverflow')
DT = data.table(fread('datatable.csv', header=TRUE))
AB = DT[ , .SD[.N], ID ]
AB = AB[ , c('a', 'b') ]
C = DT[ , .SD[1], ID ]
C = C[ , 'c' ]
DE = DT[ , .(d = sum(d), e = sum(e)) , by = ID ]
Final = cbind(AB, C, DE)
Final
My question is, can I do the operations on variables a, b, c, d, e in one transformation without having to split it into 3?
Also, I have no idea how to do f. Any suggestions?
Finally, I am new to R. Anything else I can improve about my code?
There are several things you can improve:
fread will return a data.table, so no need to wrap it in data.table. You can check with class(DT).
Use the na.strings parameter when reading in the data. See below for an example.
Summarise with:
DT[, .(a = a[.N],
b = b[.N],
c = c[1],
d = sum(d),
e = sum(e),
f = unique(na.omit(f)))
, by = ID]
you will then get:
ID a b c d e f
1: 1 11 111 1000 30030 300300 5/07/1977
2: 2 22 222 2000 20000 200000 6/02/1980
3: 3 33 333 3000 60010 600100 20/12/1978
4: 4 44 444 4000 240150 2401500 7/06/1944
5: 5 55 555 5000 100010 1000100 31/05/1976
Some explanations & other notes:
Subsetting with [1] will give you the first value of a group. You could also use the first-function which is optimized in data.table, and thus faster.
Subsetting with [.N] will give you the last value of a group. You could also use the last-function which is optimized in data.table, and thus faster.
Don't use variable names that are also functions in R (in this case, don't use c as a variable name). See also ?c for an explanation of what the c-function does.
For summarising the f-variable, I used unique in combination with na.omit. If there is more than one unique date by ID, you could also use for example na.omit(f)[1].
If speed is an issue, you could optimize the above to (thanks to @Frank):
DT[order(f)
, .(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = first(f))
, by = ID]
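Ordering by f will put NA-values last. As a result now the internal GForce-optimization is used for all calculations. Note that reordering by f also changes the row order within each ID, so last(a), last(b) and first(c) are then taken relative to the new order rather than the original one; check that this still selects the rows you want.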
Used data:
DT <- fread("ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976", na.strings='?')
We can use tidyverse. After grouping by 'ID', we summarise the columns based on the first or last observation
library(dplyr)
DT %>%
group_by(ID) %>%
summarise(a = last(a),
b = last(b),
c = first(c),
d = sum(d),
e = sum(e),
f = f[f!="?"][1])
# A tibble: 5 × 7
# ID a b c d e f
# <int> <int> <int> <int> <int> <int> <chr>
#1 1 11 111 1000 30030 300300 5/07/1977
#2 2 22 222 2000 20000 200000 6/02/1980
#3 3 33 333 3000 60010 600100 20/12/1978
#4 4 44 444 4000 240150 2401500 7/06/1944
#5 5 55 555 5000 100010 1000100 31/05/1976
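For comparison, the same per-ID summary can be sketched in base R without additional packages. This is only an illustration: summarise_id and res are hypothetical names, and the sketch assumes "?" is read as NA (via na.strings) and that the first non-missing date per ID is the one wanted.

```r
DT <- read.table(text = "ID a b c d e f
1 10 100 1000 10000 100000 ?
1 10 100 1001 10010 100100 5/07/1977
1 11 111 1002 10020 100200 5/07/1977
2 22 222 2000 20000 200000 6/02/1980
3 33 333 3000 30000 300000 20/12/1978
3 33 333 3001 30010 300100 ?
4 40 400 4000 40000 400000 ?
4 40 400 4001 40010 400100 ?
4 40 400 4002 40020 400200 7/06/1944
4 44 444 4003 40030 400300 ?
4 44 444 4004 40040 400400 ?
4 44 444 4005 40050 400500 ?
5 55 555 5000 50000 500000 31/05/1976
5 55 555 5001 50010 500100 31/05/1976",
                 header = TRUE, na.strings = "?", stringsAsFactors = FALSE)

# Summarise one ID group: last a and b, first c, sums of d and e,
# and the first non-missing date (NA if the group has none)
summarise_id <- function(g) {
  n <- nrow(g)
  data.frame(ID = g$ID[1],
             a = g$a[n], b = g$b[n], c = g$c[1],
             d = sum(g$d), e = sum(g$e),
             f = na.omit(g$f)[1])
}

res <- do.call(rbind, lapply(split(DT, DT$ID), summarise_id))
rownames(res) <- NULL
res
```

Because the rows are never reordered, first and last refer to the original row order within each ID.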

Ordering followed by cumulative sum + by

Here is my data :
class x1 x2
c 6 90
b 5 50
c 3 70
b 9 40
a 5 30
b 1 60
a 7 20
c 4 80
a 2 10
I first want to order it by class (increasing or decreasing doesn't really matter) and then by x1 (decreasing), so I do the following :
df <- df[with(df, order(class, x1, decreasing = TRUE))]
class x1 x2
c 6 90
c 4 80
c 3 70
b 9 40
b 5 50
b 1 60
a 7 20
a 5 30
a 2 10
And then I would like the cumulative sum over x2 for each class:
class x1 x2 cumsum
c 6 90 90
c 4 80 170 # 90+80
c 3 70 240 # 90+80+70
b 9 40 40
b 5 50 90 # 40+50
b 1 60 150 # 40+50+60
a 7 20 20
a 5 30 50 # 20+30
a 2 10 60 # 20+30+10
Following this answer, I did this :
df$cumsum <- unlist(by(df$x2, df$class, cumsum))
# (Also tried this, same result)
df$cumsum <- unlist(by(df[,x2], df[,class], cumsum))
But what I get is a cumulative sum over the whole set + misordered. To be more specific, Here is what I get :
class x1 x2 cumsum
c 6 90 20 # this cumsum
c 4 80 50 # and this cumsum
c 3 70 60 # and this cumsum are the cumsum of the lines of class a,
b 9 40 100 # then it adds the 'x2' values of class b : 60 ('cumsum' from the previous line) + 40
b 5 50 150 # and keeps doing so : 100 + 50
b 1 60 210 # 150 + 60
a 7 20 300 # 210 + 90
a 5 30 380 # 300 + 80
a 2 10 450 # 380 + 70
Any idea on how I could solve this? Thanks
dplyr can work here too
library(dplyr)
df %>%
group_by(class) %>%
arrange(desc(x1)) %>%
mutate(cumsum=cumsum(x2))
## class x1 x2 cumsum
## (fctr) (int) (int) (int)
## 1 a 7 20 20
## 2 a 5 30 50
## 3 a 2 10 60
## 4 b 9 40 40
## 5 b 5 50 90
## 6 b 1 60 150
## 7 c 6 90 90
## 8 c 4 80 170
## 9 c 3 70 240
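As described here (https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html) and elsewhere, in the dplyr version used above, group_by in conjunction with arrange means that the data will be sorted by the grouping variable first. In current dplyr versions arrange() ignores grouping unless .by_group = TRUE is passed, so the printed row order may differ, although the per-class cumulative sums are unaffected.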
We can use data.table
library(data.table)
setDT(df)[, x2:= cumsum(x2) , class]
df
# class x1 x2
#1: c 6 90
#2: c 4 170
#3: c 3 240
#4: b 9 40
#5: b 5 90
#6: b 1 150
#7: a 7 20
#8: a 5 50
#9: a 2 60
NOTE: In the above I used the ordered data
If we need to order also,
setorder(setDT(df), -class, -x1)[, x2:=cumsum(x2), class]
You can use base R transform and ave to cumsum over the class column
transform(df[order(df$class, decreasing = T), ], cumsum = ave(x2, class, FUN=cumsum))
# class x1 x2 cumsum
#1 c 6 90 90
#3 c 3 70 160
#8 c 4 80 240
#2 b 5 50 50
#4 b 9 40 90
#6 b 1 60 150
#5 a 5 30 30
#7 a 7 20 50
#9 a 2 10 60
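The output above is ordered by class only, so the running totals within a class follow the original row order rather than decreasing x1. A minimal base R sketch that sorts by both class and x1 (both decreasing, as in the question) before applying ave is shown below. The data frame is retyped from the question, so treat the column types as an assumption.

```r
df <- data.frame(class = c("c", "b", "c", "b", "a", "b", "a", "c", "a"),
                 x1 = c(6, 5, 3, 9, 5, 1, 7, 4, 2),
                 x2 = c(90, 50, 70, 40, 30, 60, 20, 80, 10))

# Sort by class, then x1, both decreasing (note the comma: rows are selected)
df <- df[order(df$class, df$x1, decreasing = TRUE), ]

# ave() returns each group's cumsum in the positions of that group's rows,
# so no re-alignment is needed afterwards
df$cumsum <- ave(df$x2, df$class, FUN = cumsum)
df
```

Unlike unlist(by(...)), which returns the groups in factor-level order (a, b, c) regardless of how the rows are sorted, ave keeps every value aligned with its own row.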

Replicating an Excel SUMIFS formula

I need to replicate, or at least find an alternative solution for, a SUMIFS function I have in Excel.
I have a transactional database:
SegNbr Index Revenue SUMIF
A 1 10 30
A 1 20 30
A 2 30 100
A 2 40 100
B 1 50 110
B 1 60 110
B 3 70 260
B 3 80 260
and I need to create another column that sums the Revenue, by SegmentNumber, for all indexes that are less than or equal to the Index in that row. It is a distorted rolling revenue, as it will be the same for each SegmentNumber/Index key. The formula is:
=SUMIFS([Revenue],[SegNbr],[#SegNbr],[Index],"<="&[#Index])
Let's say you have this sample data.frame
dd<-read.table(text="SegNbr Index Revenue
A 1 10
A 1 20
A 2 30
A 2 40
B 1 50
B 1 60
B 3 70
B 3 80", header=T)
Now if we make sure the data is ordered by segment and index, we can do
dd<-dd[order(dd$SegNbr, dd$Index), ] #sort data
dd$OUT<-with(dd,
ave(
ave(Revenue, SegNbr, FUN=cumsum), #get running sum per seg
interaction(SegNbr, Index, drop=TRUE),
FUN=max) #find largest sum per index per seg
)
dd
This gives
SegNbr Index Revenue OUT
1 A 1 10 30
2 A 1 20 30
3 A 2 30 100
4 A 2 40 100
5 B 1 50 110
6 B 1 60 110
7 B 3 70 260
8 B 3 80 260
as desired.
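For small data, the SUMIFS condition can also be written out row by row, which does not depend on the sort order. This is a sketch rather than the method from the answer above: it is quadratic in the number of rows, and sumifs_row is a hypothetical helper name.

```r
dd <- data.frame(SegNbr = c("A", "A", "A", "A", "B", "B", "B", "B"),
                 Index = c(1, 1, 2, 2, 1, 1, 3, 3),
                 Revenue = c(10, 20, 30, 40, 50, 60, 70, 80))

# For row i: sum Revenue over rows in the same segment
# whose Index is less than or equal to the Index of row i
sumifs_row <- function(i) {
  keep <- dd$SegNbr == dd$SegNbr[i] & dd$Index <= dd$Index[i]
  sum(dd$Revenue[keep])
}

dd$OUT <- sapply(seq_len(nrow(dd)), sumifs_row)
dd
```

This mirrors the Excel formula term by term, which makes it easy to check against the spreadsheet. For large tables the sorted cumsum approach should scale better.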
