let average = materialize(FooTable
| summarize avg(value) by group, class
| summarize arg_min(avg_value, class) by group);
This should output something like the following (i.e., for each group, the minimum of its per-class averages):
group  class  avg_value
G1     C1     100
G2     C2     150
...    ...    ...
Now I would like to display every group, class, and value row together with its delta from the group's minimum class average, as calculated by the query above.
FooTable
| where value > (here I want to insert the query that gets the min per group and class)
Output should be something like:
group  class  min_avg_value  delta
G1     C1     100            0
G1     C2     120            20
G2     C1     200            50
G2     C2     150            0
...    ...    ...            ...
Thanks for the help in advance!
lookup
let FooTable = datatable (group:string, class:string, value:int)
[
'G1' ,'C1', 100
,'G1' ,'C2', 120
,'G2' ,'C1', 200
,'G2' ,'C2', 150
];
let average = materialize(
FooTable
| summarize avg(value) by group, class
| summarize min(avg_value) by group
);
FooTable
| lookup kind=inner average on group
| extend delta = value - min_avg_value
group  class  value  min_avg_value  delta
G1     C1     100    100            0
G1     C2     120    100            20
G2     C1     200    150            50
G2     C2     150    150            0
join
let FooTable = datatable (group:string, class:string, value:int)
[
'G1' ,'C1', 100
,'G1' ,'C2', 120
,'G2' ,'C1', 200
,'G2' ,'C2', 150
];
let average = materialize(
FooTable
| summarize avg(value) by group, class
| summarize min(avg_value) by group
);
average
| join kind=inner FooTable on group
| extend delta = value - min_avg_value
group  min_avg_value  group1  class  value  delta
G1     100            G1      C1     100    0
G1     100            G1      C2     120    20
G2     150            G2      C1     200    50
G2     150            G2      C2     150    0
Related
I have a data frame with daily channel revenue from multiple channels. The data frame looks like the following:
orders_dataframe:
Order |Channel | Revenue |
1 |TV | 120 |
2 |Email | 30 |
3 |Retail | 300 |
4 |Shop1 | 50 |
5 |Shop2 | 90 |
6 |Email | 20 |
7 |Retail | 250 |
What I would like to do is to split those revenues coming from Retail and divide them between Shop1 and Shop2 according to a predefined ratio (e.g., 60%/40% split). For example, I would like that all rows with revenue coming from "Retail" get attributed 60% to Shop1 and 40% to Shop2. This can be reflected by replacing all retail-revenue rows with two new rows, as seen for Order 3 and Order 7 in the final table I want to get below:
orders_dataframe:
Order |Channel | Revenue |
1 |TV | 120 |
2 |Email | 30 |
3 |Shop1 | 180 |
3 |Shop2 | 120 |
4 |Shop1 | 50 |
5 |Shop2 | 90 |
6 |Email | 20 |
7 |Shop1 | 150 |
7 |Shop2 | 100 |
Ideally, since I am performing this with various datasets, I would like to take the percentages from a data frame (split_dataframe) instead of manually assigning the figures 60% and 40%. I would like to use the figures from a dataset like below:
split_dataframe:
Channel |Percent |
Shop1 |60% |
Shop2 |40% |
Here is a reproducible example of the two data frames:
orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"),
Revenue = c(120,30,300,50,90,20,250))
split_dataframe <- data.frame(Channel = c("Shop1", "Shop2"),
Percent = c(0.6, 0.4))
Thank you very much!
With dplyr,
library(dplyr)

split_dataframe %>%
  mutate(Index = "Retail") %>%
  merge(., orders_dataframe, by.x = "Index", by.y = "Channel") %>%
  mutate(Revenue = Revenue * Percent) %>%
  select(Order, Channel, Revenue) %>%
  bind_rows(orders_dataframe %>% filter(Channel != "Retail"), .) %>%
  arrange(., Order)
gives,
Order Channel Revenue
1 1 TV 120
2 2 Email 30
3 3 Shop1 180
4 3 Shop2 120
5 4 Shop1 50
6 5 Shop2 90
7 6 Email 20
8 7 Shop1 150
9 7 Shop2 100
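For reference, the same pipeline can stay entirely in dplyr by swapping merge() for inner_join(); a sketch under that substitution:

library(dplyr)

# Sketch: inner_join() stands in for merge(); otherwise the logic is identical.
split_dataframe %>%
  mutate(Index = "Retail") %>%
  inner_join(orders_dataframe, by = c("Index" = "Channel")) %>%
  mutate(Revenue = Revenue * Percent) %>%
  select(Order, Channel, Revenue) %>%
  bind_rows(filter(orders_dataframe, Channel != "Retail")) %>%
  arrange(Order)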
Here is a data.table approach... see comments in code for explanation
library( data.table )
#make them data.tables
setDT( orders_dataframe ); setDT( split_dataframe )
#split into retail and non-retail orders
orders_retail <- orders_dataframe[ Channel == "Retail", ]
orders_no_retail <- orders_dataframe[ !Channel == "Retail", ]
#divide the retail orders over the two shops (multiple steps)
#create a new column per shop
shop_cols <- split_dataframe$Channel
orders_retail[, (shop_cols) := Revenue ]
#melt to long format
orders_retail.melt <- melt( orders_retail,
id.vars = "Order",
measure.vars = (shop_cols),
variable.name = "Channel",
value.name = "Revenue")
#and update the molten data with the percentages in the split_dataframe
orders_retail.melt[ split_dataframe,
Revenue := Revenue * i.Percent,
on = .( Channel )]
#merge everything back together and order on Order id
ans <- rbind( orders_no_retail, orders_retail.melt )
setorder( ans, Order )
# Order Channel Revenue
# 1: 1 TV 120
# 2: 2 Email 30
# 3: 3 Shop1 180
# 4: 3 Shop2 120
# 5: 4 Shop1 50
# 6: 5 Shop2 90
# 7: 6 Email 20
# 8: 7 Shop1 150
# 9: 7 Shop2 100
You can do this in base R.
orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"),
Revenue = c(120,30,300,50,90,20,250))
# Coerce the channel factor to a string.
# Do you really want this as a factor?
orders_dataframe$Channel <- as.character(orders_dataframe$Channel)
# Create a vector of the replacement values.
# The prob = c() argument lets you pick the
# probabilities of each replacement.
replacement <- sample(x = c("Store1","Store2"),
size = length(which(orders_dataframe$Channel == "Retail")),
replace = TRUE, prob = c(0.6, 0.4))
# Replace the Channel columnn with the replacement vector.
orders_dataframe$Channel[which(orders_dataframe$Channel == "Retail")] <- replacement
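Note that sample() randomly assigns each whole retail order to a single shop in roughly 60/40 proportion; it does not split each order's revenue into two rows as in the question's target table. A minimal base R sketch of the deterministic split, reusing the data frames defined above (variable names are illustrative):

# Separate retail orders from the rest.
retail <- orders_dataframe[orders_dataframe$Channel == "Retail", ]
rest   <- orders_dataframe[orders_dataframe$Channel != "Retail", ]

# Cross join: merge() with no common columns returns the Cartesian
# product, giving one row per (retail order, shop) pair.
split_rows <- merge(retail[, c("Order", "Revenue")], split_dataframe)
split_rows$Revenue <- split_rows$Revenue * split_rows$Percent

# Recombine and restore the original order.
result <- rbind(rest, split_rows[, c("Order", "Channel", "Revenue")])
result[order(result$Order), ]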
I have 10 different data frames.
The data frame named Group1:
ï.. calories components
1 meal1 177 150 oats + 250 skimmed milk
2 snack1 145 200 yougurt + 100 blackberries
3 meal2 560 200 beans + 100 lamb
4 snack2 66 apple
5 meal3 160 1pc crumpet + 25 spread cheese
I want to get the total calories (I did sum(Group1$calories) and it worked fine). Similarly, I have nine more groups. Now I have another data frame called participants:
> participants
  SubjectId Gender Groups ExtraCalories       GW
1         1      F     G3  -1310.000000 0.000000
2         2      M     G6   -920.796555 4.331278
3         3      M     G2    -25.395170 4.727376
4         4      M     G1    169.256448 3.543941
5         5      M     G4   -340.672353 4.591774
I want to add a new column named total calories with the values of those totals I calculated earlier. The problem is that the total calories of data frame Group1 must be placed on the row with Groups == G1, and so on for the other groups.
If you have data frames Group1, Group2, Group3, ..., Group10, you can get all of them into a list, compute the sum of the calories column for each one, and merge the result with the participants data frame.
merge(
  transform(
    stack(sapply(mget(paste0('Group', 1:10)), function(x) sum(x$calories))),
    ind = paste0('G', 1:10)
  ),
  participants, by.x = "ind", by.y = "Groups"
)
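To see what the intermediate steps produce, here is a sketch with two mock groups (Group1 and Group2 below are made-up data for illustration only):

# Hypothetical data, only to illustrate the pipeline.
Group1 <- data.frame(calories = c(177, 145, 560, 66, 160))
Group2 <- data.frame(calories = c(100, 200))

# mget() fetches the data frames by name; sapply() sums each one.
totals <- sapply(mget(paste0('Group', 1:2)), function(x) sum(x$calories))
totals
#> Group1 Group2
#>   1108    300

# stack() turns the named vector into a two-column data frame
# that merge() can then join against participants.
stack(totals)
#>   values    ind
#> 1   1108 Group1
#> 2    300 Group2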
I have the following two tables:
library(data.table)
df <- data.table(id = c("01","02","03"), tariff = c("1A","1B","1A"), summer = c(0,0,1), expenditure = c(150,200,90))
id tariff summer expenditure
1: 01 1A 0 150
2: 02 1B 0 200
3: 03 1A 1 90
catalogue <- data.table(tariff = c("1A","1A","1A","1A","1B","1B","1B","1B"), summer = c(0,0,1,1,0,0,1,1),
lb_quant = c(0,50,0,80,0,80,0,100), ub_quant = c(50,Inf,80,Inf,80,Inf,100,Inf), case = letters[1:8])
tariff summer lb_quant ub_quant case
1: 1A 0 0 50 a
2: 1A 0 50 Inf b
3: 1A 1 0 80 c
4: 1A 1 80 Inf d
5: 1B 0 0 80 e
6: 1B 0 80 Inf f
7: 1B 1 0 100 g
8: 1B 1 100 Inf h
I want to merge df and catalogue by tariff, summer and expenditure. However, expenditure is numeric, so merging will not work directly.
I'm looking for a vectorized way to merge the two tables together if:
tariff and summer match
catalogue$lb_quant < df$expenditure <= catalogue$ub_quant
As an example, I would like to match df[id == "01"] with the second line of catalogue, because tariff == "1A" and summer == 0 and expenditure falls within (50, Inf). So assign case = "b" to df[id == "01"].
The real df is huge and I want to avoid using loops. Is there a vectorized way to achieve this in R or Python?
You can also use a non-equi update join in this case.
See the following one-liner (added linebreaks for readability)
df[ catalogue,
    `:=`( lb_quant = i.lb_quant,
          ub_quant = i.ub_quant,
          case     = i.case ),
    on = .( tariff,
            summer,
            expenditure > lb_quant,
            expenditure <= ub_quant ) ][]
output
id tariff summer expenditure lb_quant ub_quant case
1: 01 1A 0 150 50 Inf b
2: 02 1B 0 200 80 Inf f
3: 03 1A 1 90 80 Inf d
data.table::foverlaps does merging with intervals in two tables. To do that, you need to:
1. ensure both tables have the intervals defined; where only one column is defined, explicitly copy it to a new field, thereby creating intervals of width 0 (it seems odd, but it is minor and temporary); and
2. set the table keys to include, in order: the joining keys, then the interval columns.
df[, exp2 := expenditure]
setkey(df, tariff, summer, expenditure, exp2)
setkey(catalogue, tariff, summer, lb_quant, ub_quant)
foverlaps(df, catalogue)
# tariff summer lb_quant ub_quant case id expenditure exp2
# 1: 1A 0 50 Inf b 01 150 150
# 2: 1A 1 80 Inf d 03 90 90
# 3: 1B 0 80 Inf f 02 200 200
(after the merge, you can remove keys and the extra columns if desired)
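A minimal sketch of that cleanup, reusing the names from the code above (result is just an illustrative name):

result <- foverlaps(df, catalogue)

# Drop the interval bounds and the zero-width helper column.
result[, c("lb_quant", "ub_quant", "exp2") := NULL]
df[, exp2 := NULL]

# Remove the keys set on the source tables if no longer needed.
setkey(df, NULL)
setkey(catalogue, NULL)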
I have a data frame that is in the following format
df <- data.frame(name=LETTERS[1:5], location=c(2000,2021,4532,1931,3457),
value=c(1,0,1,1,0))
name location value
A 2000 1
B 2021 0
C 4532 1
D 1931 1
E 3457 0
There are approximately a million rows in the data frame. How would I create a new data frame that contains the distance between every pair of locations that are within 1000 of each other, and that also checks whether the values are 1 for both locations?
For the above dataset, the new data frame would have only three rows, with values of 21 (absolute value of 2000 - 2021), 69 (absolute value of 2000 - 1931), and 90 (absolute value of 2021 - 1931), because those are the only differences less than 1000. It would also have a column containing 0 (because the A and B values are not both 1), 1 (because the A and D values are both 1), and 0 (because the B and D values are not both 1). So it would look like:
21 0
69 1
90 0
I've tried using loops but since there are so many rows, it's inefficient. Is there some built in function that I should use to do this faster?
Thanks in advance.
library(sqldf)
sqldf("
select a.location
, b.location
, a.location - b.location as locdiff
, a.value*b.value as value
from df a
inner join df b
on a.location - b.location between 1 and 1000
")
This gives
a.location b.location locdiff value
1 2000 1931 69 1
2 2021 2000 21 0
3 2021 1931 90 0
Or with data.table. This is just @MKR's solution, but adding a column to avoid a large join result. I'm not sure whether it's possible to achieve this without creating a new column.
setDT(df)
df[, loc2 := location - 1000]
df[df
, .( locdiff = i.location - x.location
, locationA = i.location
, locationB = x.location
, value = x.value*i.value)
, on = .(location >= loc2
, location < location)
, nomatch = 0]
gives
locdiff locationA locationB value
1: 69 2000 1931 1
2: 90 2021 1931 0
3: 21 2021 2000 0
I agree with @Gregor's comment that sqldf is the better option in the above scenario, in the sense that it avoids a Cartesian join of a million records.
But I tried to optimize the data.table-based solution by first joining on x.location > i.location and then filtering on diff <= 1000.
df <- data.frame(name=LETTERS[1:5], location=c(2000,2021,4532,1931,3457),
value=c(1,0,1,1,0))
library(data.table)
setDT(df)
df[df,.(name, diff = x.location - i.location, value = x.value*i.value),
on=.(location > location), nomatch=0][diff<=1000]
# name diff value
# 1: B 21 0
# 2: A 69 1
# 3: B 90 0
I have a dataframe, test, that looks like
c1 c2 c3
1 98 0 2013-08
2 231 0 2011-01
3 231 2.68 2011-03
4 231 1 2011-01
... ... ... ...
That continues on for many more rows. Column c1 has values from 1-297, while c3 has year-month values that consecutively move from 2011-01 to 2015-01. There are multiple rows that have the same c1 and c3 values.
I want to sum up each instance of c1 at each time step (so for all rows where c1 = x and c3 = y, sum those elements and get a result) and output that to a new data frame where each row represents 1 of the types from c1 (1-297), and each column is the corresponding year-month.
I am attempting to use acast (based on a suggestion) to transform it into a data frame with the rows based on c1 values and the columns from c3, so it looks like
2011-01 2011-02 2011-03 ...
1 0 1.5 2.3 ...
2 0 3.4 0 ...
3 5 2.2 1.1 ...
4 4 2.2 4.4 ...
... ... ... ...
I have been attempting to transform this via acast:
acast(test, test$c3 ~ test$c1, value.var = "c2")
But I end up with a matrix of type int. The rows and columns are correct (1-297 and 2011-01 through 2015-01); however, the values inside the cells are wrong.
Again just to clarify, in the new data frame each element would represent the sum of elements in the first data frame for all elements that share the same c1 and c3 values.
I believe the issue is that acast sees matching combinations and does something that I don't want it to do. How would I solve this problem? I do not need acast if another solution presents itself.
You should use tidyverse packages dplyr and tidyr:
library(dplyr)
library(tidyr)
df <- test %>%
group_by(c1, c3) %>%
summarise(total = sum(c2)) %>%
spread(c3, total)
Example
I used your simple data frame as an example:
#> c1 c2 c3
#> 1 98 0.00 2013-08
#> 2 231 0.00 2011-01
#> 3 231 2.68 2011-03
#> 4 231 1.00 2011-01
And after running the code, df looks like this:
#> c1 `2011-01` `2011-03` `2013-08`
#> 1 98 NA NA 0
#> 2 231 1 2.68 NA
Explanation
group_by(c1, c3) groups the variables c1 and c3 in your data frame
summarise(total = sum(c2)) sums up c2 (taking into account the c1, c3 groupings)
spread(c3, total) transforms the data frame into a "wide" format with the c3 variables going across the columns
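For completeness, the original acast call can also produce these sums if you pass an aggregation function: when (c1, c3) combinations repeat, reshape2 falls back to fun.aggregate = length, which is exactly why the question's cells contained integer counts. A minimal sketch:

library(reshape2)

# c1 ~ c3 puts c1 in the rows and c3 across the columns;
# fun.aggregate = sum replaces the default length-counting.
# Missing (c1, c3) combinations become sum(numeric(0)) = 0.
acast(test, c1 ~ c3, value.var = "c2", fun.aggregate = sum)

Note also that spread() is superseded in tidyr 1.0+; pivot_wider(names_from = c3, values_from = total) is the current equivalent.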