Aggregating by indices and reweighting in R - r

I have a ton of Price data that is indexed by State, Date, and UPC (Product Code). I want to aggregate away UPC, and combine the Prices by taking a weighted average. I will try to explain it, but you may just want to read the code below.
Each observation in the dataset is: UPC, date, state, price, and weight. I would like to aggregate away the UPC index in this way:
Take all data points with the same date and state, multiply their prices by their weights, and sum them up. This creates a weighted average, which I call the priceIndex. However, for some date & state combos the weights do not add up to 1. Therefore, I want to create two additional columns. The first is the total of the weights for each date & state combo. The second is a reweighted average: i.e., if the original two weights were .5 and .3, change them to .5/(.5+.3)=.625 and .3/(.5+.3)=.375, and then recalculate the weighted average into another price index.
This is what I mean:
upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200603,200601,200602,200601,200602,200603,200601,200602,200601,200602,200603,200601)
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
state=c(1,1,1,2,2,2,1,1,2,2,2,1,1,1,2)
weight=c(.3,.2,.6,.4,.4,.5,.5,.5,.45,.15,.5,.2,.15,.3,.45)
# This is what I have:
data <- data.frame(upc,date,state,price,weight)
data
# These are a few of the weighted calculations:
# .3*26+85*.5+24*.2 = 55.1
# 28*.2+84*.5+19*.15 = 50.45
# 27*.6+98*.3 = 45.6
# Etc. etc.
# Here is the reweighted calculation for date=200602 & state==1:
# 28*(.2/.85)+84*(.5/.85)+19*(.15/.85) = 59.35294
# Or, equivalently:
# (28*.2+84*.5+19*.15)/.85 = 59.35294
# This is what I want:
date=c(200601,200602,200603,200601,200602,200603)
state=c(1,1,1,2,2,2)
priceIndex=c(55.1,50.45,45.6,42.5,51,46.575)
totalWeight=c(1,.85,.9,1,1,.85)
reweightedIndex=c(55.1,59.35294,50.66667,42.5,51,54.79412)
index <- data.frame(date,state,priceIndex,totalWeight,reweightedIndex)
index
Also, not that it should matter, but there are about 35 states, 150 UPCs, and 84 dates in the dataset -- so there are a lot of observations.
Thanks a lot in advance.

We can use a group-by summarise operation. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)). Grouped by 'date' and 'state', we compute the sum of the product of 'price' and 'weight' and sum(weight) as temporary variables, and then create the 3 columns in the list from them.
library(data.table)
setDT(data)[, {tmp1 = sum(price*weight)
               tmp2 = sum(weight)
               list(priceIndex=tmp1, totalWeight=tmp2,
                    reweightedIndex = tmp1/tmp2)}, .(date, state)]
# date state priceIndex totalWeight reweightedIndex
#1: 200601 1 55.100 1.00 55.10000
#2: 200602 1 50.450 0.85 59.35294
#3: 200603 1 45.600 0.90 50.66667
#4: 200603 2 46.575 0.85 54.79412
#5: 200601 2 42.500 1.00 42.50000
#6: 200602 2 51.000 1.00 51.00000
Or using dplyr, we can use summarise to create the 3 columns after doing grouping by 'date' and 'state'.
library(dplyr)
data %>%
group_by(date, state) %>%
summarise(priceIndex = sum(price*weight),
totalWeight = sum(weight),
reweightedIndex = priceIndex/totalWeight)
# date state priceIndex totalWeight reweightedIndex
# (dbl) (dbl) (dbl) (dbl) (dbl)
#1 200601 1 55.100 1.00 55.10000
#2 200601 2 42.500 1.00 42.50000
#3 200602 1 50.450 0.85 59.35294
#4 200602 2 51.000 1.00 51.00000
#5 200603 1 45.600 0.90 50.66667
#6 200603 2 46.575 0.85 54.79412
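For readers without data.table or dplyr, the same aggregation can be sketched in base R. This is only an illustration using the question's sample data: the data frame name `prices` and the helper column `pw` are made up here. It also cross-checks one group against `weighted.mean()`, which divides by the sum of the weights and so matches the reweighted index.

```r
# Sample data from the question (the name `prices` is an assumption)
upc <- c(rep(1153801013, 6), rep(2105900750, 5), rep(2173300001, 4))
date <- c(200601, 200602, 200603, 200603, 200601, 200602, 200601, 200602,
          200603, 200601, 200602, 200601, 200602, 200603, 200601)
price <- c(26, 28, 27, 27, 23, 24, 85, 84, 79.5, 81, 78, 24, 19, 98, 47)
state <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 1, 2)
weight <- c(.3, .2, .6, .4, .4, .5, .5, .5, .45, .15, .5, .2, .15, .3, .45)
prices <- data.frame(upc, date, state, price, weight)

# Sum price * weight and weight within each date/state group
prices$pw <- prices$price * prices$weight
index <- aggregate(cbind(pw, weight) ~ date + state, data = prices, FUN = sum)
names(index)[names(index) == "pw"] <- "priceIndex"
names(index)[names(index) == "weight"] <- "totalWeight"

# Dividing by the total weight rescales the weights so they sum to 1
index$reweightedIndex <- index$priceIndex / index$totalWeight

# Cross-check one group: weighted.mean() normalises the weights itself
g <- prices[prices$date == 200602 & prices$state == 1, ]
check <- weighted.mean(g$price, g$weight)

index
```

With 35 states, 150 UPCs and 84 dates this should still be fast, although the data.table version is likely the quickest of the three.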

Related

sum function in R data frame not capturing correct data

I have a dataframe that looks like this:
channel start.time stop.time vp duration X id overlaps
1: 4_speech 14.183 16.554 CH1 2.371 NA 165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825
2: 4_speech 21.196 22.259 CH1 1.063 NA 166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485
3: 4_speech 28.001 31.518 CH1 3.517 NA 167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519
4: 4_speech 34.867 36.549 CH1 1.682 NA 168 2_body_CH1_3.308
5: 4_speech 41.019 42.265 CH1 1.246 NA 169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288
6: 4_speech 55.262 57.800 CH1 2.538 NA 170 2_body_CH1_2.494;2_body_N_6.571
The first 7 columns show information about a particular observation. The 8th column, 'overlaps', shows a list of other observations from a different data frame that co-occur with the observations in this data frame. Each of the observations in the overlaps column is structured like this: 'channel_vp_duration'. For example, the first observation in 'overlaps' in row 1 shows that '1_hands' is the channel, 'CH1' is the vp (a kind of value), and 1.145 is the duration of that observation.
I want a sum of all the durations for a given observation type. I can sort of get this with the following code that was adapted from an answer provided by a stack user on a question I previously asked about how to get the overlaps data in the first place.
library(data.table)
library(stringr)
setDT(speech_rows)
speech_rows[, id := .I]
setkey(speech_rows, id)
#self join on subset by row
speech_rows[speech_rows, durs := {
temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == "1_hands", ]
sum(temp$duration)
}, by = .EACHI]
This adds another column, 'durs', which is supposed to show the total duration of all the numeric values attached to a '1_hands' string in the overlaps column. It produces the following (first 7 columns removed to save space):
overlaps durs
1: 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825 0.000
2: 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485 1.417
3: 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519 1.750
4: 2_body_CH1_3.308 0.000
5: 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288 5.559
6: 2_body_CH1_2.494;2_body_N_6.571 0.000
But there is a problem: the sum() function does not capture all of the relevant strings. In row 1, there is the string "1_hands_CH1_1.145". It is the only '1_hands' string in that row, so the value under durs for row 1 should be '1.145', but the function ignores it for some reason. In row 2, the durs sum is correct. In row 3, it counts only one of the 1_hands values and ignores the other. In row 5, it correctly finds both of the 1_hands values and adds them together. Rows 4 and 6 have correct 'durs' values because there are no 1_hands observations in them.
This is very strange, and I don't know why it correctly detects the numeric values at some times but not at others. This is problem #1.
Problem #2: I cannot specify what I want beyond '1_hands', what I really want to do is get the sum of durations for all 1_hands_CH1 values, NOT all 1_hands values. To do this, I assume that you would just need to change the strings in 'channel == 1_hands'
temp <- df[!id == i.id & start.time < i.stop.time & stop.time > i.start.time & channel == "1_hands", ]
But if I change it to something like "1_hands_CH1", all of the durs values will be zero; it can't match anything past '1_hands'.
So in sum, I want to know why the math isn't working like I want it to, and why I can't select more specific strings.
Here is one way you could get durations out of your overlaps column using the tidyverse. You can set text_string equal to what you want durations for. I have provided some examples of how to enter your text string. The example below returns durations for all "1_hands" observations. If you wanted durations just for the "1_hands_CH1", then you would just set text_string <- "1_hands_CH1".
# Load tidyverse
library(tidyverse)
# Set text_string Equal To Specific String You Want Durations For
text_string <- "1_hands_[A-Z0-9]+"
# Examples For text_string
# text_string <- "1_hands_CH1" ## example for getting 1_hands_CH1
# text_string <- "2_body_N" ## example for getting 2_body_N
# text_string <- "1_hands_[A-Z0-9]+" ## example for getting all 1_hands
# text_string <- "2_body_[A-Z0-9]+" ## example for getting all 2_body
# df With Durations
df_with_durs <- df %>%
as_tibble() %>%
mutate(str_matches = str_match_all(overlaps, str_glue("{text_string}_[0-9.]+")),
durs = map(str_matches,
function(x) {
durs <- str_remove(x, str_glue("{text_string}_"))
num_durs <- as.numeric(durs)
sum_durs <- sum(num_durs)
return(sum_durs)
}
)
) %>%
unnest(cols = durs) %>%
select(-str_matches)
# View Output
df_with_durs
# channel start.time stop.time vp duration X id overlaps durs
# <chr> <dbl> <dbl> <chr> <dbl> <lgl> <int> <chr> <dbl>
# 4_speech 14.2 16.6 CH1 2.37 NA 165 1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825 1.14
# 4_speech 21.2 22.3 CH1 1.06 NA 166 1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485 1.42
# 4_speech 28.0 31.5 CH1 3.52 NA 167 1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519 5.31
# 4_speech 34.9 36.5 CH1 1.68 NA 168 2_body_CH1_3.308 0
# 4_speech 41.0 42.3 CH1 1.25 NA 169 1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288 5.56
# 4_speech 55.3 57.8 CH1 2.54 NA 170 2_body_CH1_2.494;2_body_N_6.571 0
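The same idea can be sketched without the tidyverse by splitting each overlaps string on ";" and keeping only the entries that start with the wanted prefix. The function name `sum_durations` and the vector `overlaps` below are made up for illustration, and the sketch assumes every entry ends in "_<duration>" as in the sample rows.

```r
# The six overlaps strings from the question
overlaps <- c(
  "1_hands_CH1_1.145;2_body_CH1_1.883;2_body_N_14.272;2_body_N_4.825",
  "1_hands_N_1.417;2_body_CH1_1.075;2_body_N_5.485",
  "1_hands_CH1_3.557;1_hands_CH2_1.75;2_body_CH1_3.445;2_body_N_3.519",
  "2_body_CH1_3.308",
  "1_hands_CH1_4.896;1_hands_N_0.663;2_body_CH1_5.288",
  "2_body_CH1_2.494;2_body_N_6.571"
)

# Hypothetical helper: total duration of all entries beginning with `prefix`
sum_durations <- function(overlap_string, prefix) {
  entries <- strsplit(overlap_string, ";", fixed = TRUE)[[1]]
  # The trailing "_" stops "1_hands_CH1" from also matching "1_hands_CH10"
  hits <- entries[startsWith(entries, paste0(prefix, "_"))]
  # The duration is whatever follows the last underscore
  sum(as.numeric(sub(".*_", "", hits)))
}

durs_hands <- sapply(overlaps, sum_durations, prefix = "1_hands",
                     USE.NAMES = FALSE)
durs_hands_ch1 <- sapply(overlaps, sum_durations, prefix = "1_hands_CH1",
                         USE.NAMES = FALSE)
```

Rows with no matching entry come out as 0 because the sum of an empty numeric vector is 0.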

Calculate the slope for each individual

I have longitudinal data:
ID<-c(1,1,1,2,2,2,2,3,3,4,4,4)
time<-c(0,12,36,0,7,23,68,0,23,0,32,45)
Age<-rnorm(12,45,9)
Sexe<-c("F","F","F","M","M","M","M","M","M","F","F","F")
biology1<-rnorm(12,12,3)
biology2<-rnorm (12,100,20)
biology3<-rnorm(12,45,9)
biology4<-rnorm(12,20,2)
Death<-c(0,0,1,0,0,0,0,0,0,0,0,1)
data<-data.frame(ID,time,Age,Sexe,biology1,biology2,biology3,biology4,Death)
I would like to calculate the slope (from the beginning to the end of the follow-up) for each numerical variable (biology1, biology2, biology3, biology4) and for each individual. Mainly, I need a function that calculates the slope for each variable without retyping a new line of code for each variable. I have no idea how to do it.
Here's an approach with dplyr. Here are the tricks:
Use group_by to group the data for each individual.
Use summarise to perform an action for each group.
Use across to do so for multiple columns
Use starts_with to select the appropriate columns
Use list(slope = ...) to name the columns.
Use $coef to extract the coefficients and [2] to get the slope rather than the intercept.
library(dplyr)
data %>%
group_by(ID) %>%
summarise(across(starts_with("biology"),
list(slope = ~lm(. ~ time)$coef[2])))
# A tibble: 4 x 5
ID biology1_slope biology2_slope biology3_slope biology4_slope
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0459 -1.61 -0.204 -0.00106
2 2 0.131 -0.553 0.0783 -0.0340
3 3 -0.0462 -0.427 -0.402 -0.191
4 4 -0.0524 -1.10 0.379 -0.0736
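Here is a data.table option with lm. The data frame is first converted to a data.table, and 'time' is taken from the current group rather than from the whole table:
library(data.table)
dt <- as.data.table(data)
dt[,
  lapply(
    .SD,
    function(x) coef(lm(x ~ time))["time"]
  ),
  by = ID,
  .SDcols = patterns("^biology")
]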
which gives results like
ID biology1 biology2 biology3 biology4
1: 1 0.07223152 0.07187708 -2.960618e-02 0.022861337
2: 2 -0.05728224 0.13207814 -2.349493e-01 -0.018541982
3: 3 -0.03925044 -0.63219541 -3.166489e-05 -0.009484951
4: 4 -0.01801599 0.04758699 -6.547484e-03 -0.004253647
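Because the question's data are random, the slopes above cannot be checked by eye. As a sanity check, here is a small base R sketch on made-up data where the true slope of each individual is known in advance. The data frame `toy` and its values are assumptions for illustration only.

```r
# Made-up data: individual 1 has slope 0.5, individual 2 has slope -2
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  time = c(0, 12, 36, 0, 7, 23)
)
toy$biology1 <- ifelse(toy$ID == 1, 10 + 0.5 * toy$time, 50 - 2 * toy$time)

# Fit one regression per individual and keep the coefficient on time
slopes <- sapply(split(toy, toy$ID), function(g) {
  coef(lm(biology1 ~ time, data = g))[["time"]]
})
slopes
```

The second coefficient of `lm(y ~ time)` is the least-squares slope over all visits of an individual, not only the first and last ones.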

R - Function or script to populate a dataframe with accumulative partial calculations (not cum sum)

This is quite a specific challenge. Let's say I have this table, think of a typical bank database (working with data.table by the way):
customer_id; month; balance
1;1;100
1;2;110
1;3;140
1;4;70
I need a script or function that returns, for every row, the ratio relative to the max historical balance so far for each customer.
customer_id; month; balance; ratio
1;1;100;1 # 1 because 100 balance is both the current datapoint and the max value so far
1;2;110;1.1 # 1.1 because 110 balance is 1.1 of the prior max value, 100
1;3;140;1.27 # 1.27 because it's 140 divided by the prior max value, 110
1;4;70;0.5 # 0.5 because it's 70 divided by the prior max value, 140
I know some dplyr or data.table methods can work with cumulative calculations, such as cumsum.
However this has a twist I can't quite find online.
Thanks.
You can do this fairly easily with dplyr, using cummax (the cumulative maximum) and lag (to get previous values):
library(dplyr)
dd %>%
group_by(customer_id) %>%
mutate(ratio = balance/lag(cummax(balance), default=first(balance)))
# customer_id month balance ratio
# <int> <int> <int> <dbl>
# 1 1 1 100 1
# 2 1 2 110 1.1
# 3 1 3 140 1.27
# 4 1 4 70 0.5
where
dd <- read.table(text="
customer_id; month; balance
1;1;100
1;2;110
1;3;140
1;4;70", sep=";", header=TRUE)
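To see what cummax and lag contribute, the intermediate vectors for the single customer in the example can be sketched in base R. The variable names `running_max` and `prior_max` are made up here.

```r
balance <- c(100, 110, 140, 70)

# Highest balance seen up to and including each month
running_max <- cummax(balance)

# Shift by one month so each row sees the maximum of the earlier months;
# the first month has no history, so it is compared with itself
prior_max <- c(balance[1], head(running_max, -1))

ratio <- balance / prior_max
round(ratio, 2)
```

For several customers this would have to be repeated per customer_id, which is what group_by does in the dplyr version.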

R- Subtracting the mean of a group from each element of that group in a dataframe

I am trying to merge a vector 'means' to a dataframe.
My dataframe looks like this Data = growth
I first calculated all the means for the different groups (1 group = population + temperature + size + replicat) using this command:
means<-aggregate(TL ~ Population + Temperature + Replicat + Size + Measurement, data=growth, list=growth$Name, mean)
Then, I selected the means for Measurement 1 as follows as I am only interested in these means.
meansT0<-means[which(means$Measurement=="1"),]
Now, I would like to merge this vector of means values to my dataframe (=growth) so that the right mean of each group corresponds to the right part of the dataframe.
The goal is then to subtract the mean of each group (at Measurement 1) from each element of the dataframe, based on the group it belongs to (and for all other Measurements except Measurement 1). Maybe there is no need to add the means column to the dataframe? Do you know any command to do that?
[27.06.18]
I made up this simplified dataframe; I hope this helps understanding.
So, what I want is to subtract, for each individual in the dataframe and for each measurement (here only Measurement 1 and Measurement 2, normally I have more), the mean of its group at MEASUREMENT 1.
So, if I get the means by group (1 group= Population + Temperature + Measurement):
means<-aggregate(TL ~ Population + Temperature + Measurement, data=growth, list=growth$Name, mean)
means
I got these values of means (in this example) :
Population Temperature Measurement TL
JUB 15 1 12.00000
JUB 20 1 15.66667
JUB 15 2 17.66667
JUB 20 2 18.66667
JUB 15 3 23.66667
JUB 20 3 24.33333
We are only interested in the means at MEASUREMENT 1. For each individual in the dataframe, I want to subtract the mean of its group at Measurement 1: in this example (see dataframe with R command):
-for the group JUB+15+Measurement 1, mean = 12
-for the group JUB+20+Measurement 1, mean = 15.66
growth<-data.frame(Population=c("JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB","JUB", "JUB", "JUB"), Measurement=c("1","1","1","1","1","1","2","2","2","2","2","2", "3", "3", "3", "3", "3", "3"),Temperature=c("15","15","15","20", "20", "20","15","15","15","20", "20", "20","15","15","15","20", "20", "20"),TL=c(11,12,13,15,18,14, 16,17,20,21,19,16, 25,22,24,26,24,23), New_TL=c("11-12", "12-12", "13-12", "15-15.66", "18-15.66", "14-15.66", "16-12", "17-12", "20-12", "21-15.66", "19-15.66", "16-15.66", "25-12", "22-12", "24-12", "26-15.66", "24-15.66", "23-15.66"))
print(growth)
I hope that with this you can better understand what I am trying to do. I have a lot of data, and if I have to do this manually, it will take me a lot of time and increase the risk of making mistakes.
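Here is an option with tidyverse. Group by the columns that define a group, but leave out Measurement so that the Measurement 1 rows stay in the same group as the rows they are compared against. Then use mutate with across on the columns of interest and take the difference of each column (.) from its mean at Measurement 1. (mutate_at and funs do the same job in older dplyr, but they are superseded and deprecated in current versions.)
library(tidyverse)
growth %>%
  group_by(Population, Temperature, Replicat, Size) %>%
  mutate(across(c(HL, TL), ~ . - mean(.[Measurement == 1]),
                .names = "{.col}_MeanGroupDiff"))
Using a reproducible example with the mtcars dataset:
data(mtcars)
mtcars %>%
  group_by(cyl, vs) %>%
  mutate(across(c(mpg, disp), ~ . - mean(.[am == 1]),
                .names = "{.col}_MeanGroupDiff"))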
Have you considered using the data.table package? It is very well suited to the kinds of grouping, filtering, joining, and aggregation operations you describe, and might save you a great deal of time in the long run.
The code below shows how a workflow similar to the one you described but based on the built in mtcars data set might look using data.table.
To be clear, there are also ways to do what you describe using base R as well as other packages like dplyr, just throwing out a suggestion based on what I have found the most useful for my personal work.
library(data.table)
## Convert mtcars to a data.table
## only include columns `mpg`, `cyl`, `am` and `gear` for brevity
DT <- as.data.table(mtcars)[, .(mpg, cyl,am, gear)]
## Take a subset where `cyl` is equal to 6
DT <- DT[cyl == 6]
## Calculate grouped mean based on `gear` and `am` as grouping variables
DT[,group_mpg_avg := mean(mpg), keyby = .(gear, am)]
## Calculate each row's difference from the group mean
DT[,mpg_diff_from_group := mpg - group_mpg_avg]
print(DT)
# mpg cyl am gear group_mpg_avg mpg_diff_from_group
# 1: 21.4 6 0 3 19.75 1.65
# 2: 18.1 6 0 3 19.75 -1.65
# 3: 19.2 6 0 4 18.50 0.70
# 4: 17.8 6 0 4 18.50 -0.70
# 5: 21.0 6 1 4 21.00 0.00
# 6: 21.0 6 1 4 21.00 0.00
# 7: 19.7 6 1 5 19.70 0.00
Consider by to subset your data frame by factors (but leave out Measurement in order to compare group 1 and all other groups). Then, run an ifelse conditional logic calculation for needed columns. Since by will return a list of data frames, bind all outside with do.call():
df_list <- by(growth, growth[,c("Population", "Temperature")], function(sub) {
# TL CORRECTION
sub$Correct_TL <- ifelse(sub$Measurement != 1,
sub$TL - mean(subset(sub, Measurement == 1)$TL),
sub$TL)
# ADD OTHER CORRECTIONS
return(sub)
})
final_df <- do.call(rbind, df_list)
Output (using posted data)
final_df
# Population Measurement Temperature TL New_TL Correct_TL
# 1 JUB 1 15 11 11-12 11.0000000
# 2 JUB 1 15 12 12-12 12.0000000
# 3 JUB 1 15 13 13-12 13.0000000
# 7 JUB 2 15 16 16-12 4.0000000
# 8 JUB 2 15 17 17-12 5.0000000
# 9 JUB 2 15 20 20-12 8.0000000
# 13 JUB 3 15 25 25-12 13.0000000
# 14 JUB 3 15 22 22-12 10.0000000
# 15 JUB 3 15 24 24-12 12.0000000
# 4 JUB 1 20 15 15-15.66 15.0000000
# 5 JUB 1 20 18 18-15.66 18.0000000
# 6 JUB 1 20 14 14-15.66 14.0000000
# 10 JUB 2 20 21 21-15.66 5.3333333
# 11 JUB 2 20 19 19-15.66 3.3333333
# 12 JUB 2 20 16 16-15.66 0.3333333
# 16 JUB 3 20 26 26-15.66 10.3333333
# 17 JUB 3 20 24 24-15.66 8.3333333
# 18 JUB 3 20 23 23-15.66 7.3333333
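The question also asked how to merge the Measurement 1 means back onto the data frame. A minimal base R sketch of that route, using the posted sample data, could look like this. The column names `TL_baseline` and `TL_diff` are made up for illustration, and unlike the by() answer above this version subtracts the baseline from the Measurement 1 rows as well.

```r
# Sample data from the question, without the hand-written New_TL column
growth <- data.frame(
  Population  = rep("JUB", 18),
  Measurement = rep(c("1", "2", "3"), each = 6),
  Temperature = rep(rep(c("15", "20"), each = 3), times = 3),
  TL = c(11, 12, 13, 15, 18, 14, 16, 17, 20, 21, 19, 16,
         25, 22, 24, 26, 24, 23)
)

# Mean TL at Measurement 1 for each Population/Temperature group
baseline <- aggregate(TL ~ Population + Temperature,
                      data = growth[growth$Measurement == "1", ],
                      FUN = mean)
names(baseline)[names(baseline) == "TL"] <- "TL_baseline"

# Attach each row's group baseline, then subtract it
merged <- merge(growth, baseline, by = c("Population", "Temperature"))
merged$TL_diff <- merged$TL - merged$TL_baseline
```

With the real data, Replicat and Size would be added to both the aggregate formula and the merge keys.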

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this but more lab-values variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
For each patient (with a unique ID), I want to calculate the decrease or slope per day between the first and last measurement, so that I can compare the slopes between patients. Time is in days. So, eventually I want a new variable for each lab value, e.g. diff_labvalues, that gives me for labvalue1:
For patient 1: (2.93-4.04)/ (7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc for labvalue2, and so forth.
So far I have used dplyr, created the first1 and last1 functions, because first() and last() did not work with the NA values.
Thereafter, I have grouped_by 'id', used mutate_all (because there are more lab-values in the original df) calculated the difference between the last1() and first1() lab-values for that patient.
But cannot find HOW to extract the values of the corresponding time values (the delta-time value) which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try the following, shown here for labvalue1:
library(dplyr)
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])) , id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
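Both answers handle one lab value at a time. To cover every lab column at once, one possible base R sketch is a small helper that drops the NA values together with their time points. The names `first_last_slope`, `lab_cols` and `slopes` are made up for illustration, and the sketch assumes each patient has at least two non-missing measurements per lab value.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 2),
                 time = c(0, 3, 7, 35, 0, 7, 14, 28, 42),
                 labvalue1 = c(4.04, NA, 2.93, NA, NA, 3.78, 3.66, NA, 2.54),
                 labvalue2 = c(NA, 63.8, 62.8, 61.2, 78.1, NA, 77.6, 75.3, NA))

# Hypothetical helper: change per day between first and last non-NA value
first_last_slope <- function(value, time) {
  keep <- !is.na(value)
  v <- value[keep]
  t <- time[keep]
  (v[length(v)] - v[1]) / (t[length(t)] - t[1])
}

# Apply the helper to every lab column within each patient
lab_cols <- grep("^labvalue", names(df), value = TRUE)
slopes <- t(sapply(split(df, df$id), function(g) {
  sapply(lab_cols, function(col) first_last_slope(g[[col]], g$time))
}))
slopes
```

The result is a matrix with one row per patient id and one column per lab value. Filtering value and time with the same mask is what keeps each measurement paired with its own time point.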
