Combining indices according to another dataframe in R? - r

I have two data.frames, and I'd like to use one as reference for combining observations in the other one.
First, I have data:
> data
Source: local data frame [15 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 72105922753 1 3 27.10816 163 2008 4418.6306
7 72105922765 2 2 24.79000 3 2010 74.3700
8 72105922765 2 2 25.99000 1 2009 25.9900
9 72105922765 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
Then, I have z, which is the reference:
> z
upc code
3 1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2 1071917100 50174
1256 8723610700 50174
I want to combine data points in data whose upcs share the same code in z.
In the sample I gave to you, there are 5 different upcs.
1071917100 is also in z, with the code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it remains unchanged.
1461503541 is not in z at all, so it also remains unchanged.
1153801013, 72105922753, and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all the observations with these upcs.
I want to do this in a really specific way:
First, I want to choose the upc with the greatest amount of sales across the data. 1153801013 has 9668.616 in sales (simply the sum of all sales with that upc). 72105922753 has 4418.631 in sales. 72105922765 has 406.9118 in sales. Therefore, I choose 1153801013 as the upc for all of them.
Now having chosen this upc, I want to change 72105922753 and 72105922765 to 1153801013 in data.
Now we have a data set that looks like this:
> data1
Source: local data frame [15 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 1153801013 1 3 27.10816 163 2008 4418.6306
7 1153801013 2 2 24.79000 3 2010 74.3700
8 1153801013 2 2 25.99000 1 2009 25.9900
9 1153801013 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
Finally, I want to combine all the data points with the same year, mymonth, and fips_state_code. The way this will happen is by adding up the sales and units numbers of data points with the same upc, fips_state_code, mymonth, and year, and then recalculating the weighted price. (I.e., price = total Sales / total Units.)
And so, the final data set should look like this:
> data2
Source: local data frame [12 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.856
2 1153801013 1 2 28.07844 121 2009 3397.491
3 1153801013 2 2 27.74000 8 2009 221.920
4 1153801013 1 1 27.99000 4 2009 111.960
5 1153801013 1 3 27.14448 170 2008 4614.561
6 1153801013 2 2 24.79000 3 2010 74.370
7 1071917100 2 2 300.07000 1 2009 300.070
8 1071917100 1 3 307.07000 2 2008 614.140
9 1071917100 2 3 269.99000 1 2010 269.990
10 1461503541 2 2 0.65200 8 2008 5.216
11 1461503541 2 2 13.99000 11 2010 153.890
12 1461503541 1 1 0.87000 1 2008 0.870
I did try to do this myself, but it took me many lines of code, and I couldn't accomplish the last part successfully. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code:
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541
), fips_state_code = c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2,
3, 3, 2, 2, 1), price = c(25.8461971831, 28.6198113208, 27.99,
27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909, 300.07,
307.07, 269.99, 0.652, 13.99, 0.87), units = c(235L, 108L, 7L,
4L, 7L, 163L, 3L, 1L, 13L, 1L, 2L, 1L, 8L, 11L, 1L), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, -15L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541),
fips_state_code = c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2,
2, 2, 3, 3, 2, 2, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87), units = c(235L,
108L, 7L, 4L, 7L, 163L, 3L, 1L, 13L, 1L, 2L, 1L, 8L, 11L,
1L), year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009,
2009, 2009, 2008, 2010, 2008, 2010, 2008), sales = c(6073.8563380285,
3090.9396226464, 195.93, 111.96, 195.93, 4418.6306122439,
74.37, 25.99, 306.5518181817, 300.07, 614.14, 269.99, 5.216,
153.89, 0.87)), .Names = c("upc", "fips_state_code", "mymonth",
"price", "units", "year", "sales"), row.names = c(NA, -15L), class = c("tbl_df",
"data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541), fips_state_code = c(2L, 1L, 2L, 1L,
1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 2,
2, 3, 3, 2, 2, 1), price = c(25.8461971831, 28.07844, 27.74,
27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652, 13.99,
0.87), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1, 8, 11, 1),
year = c(2008, 2009, 2009, 2009, 2008, 2010, 2009, 2008,
2010, 2008, 2010, 2008), sales = c(6073.8563380285, 3397.491,
221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -12L))

I think this works. The rows of the final result are in a different order than your data2, but at a glance they look the same.
library(dplyr)
# join data
joined = data %>% left_join(z, by = "upc")
# set aside the rows not in z
not_in_z = filter(joined, is.na(code))
modified = joined %>%
  filter(!is.na(code)) %>%                      # for the rows in z
  group_by(upc) %>%
  mutate(upc_sales = sum(sales)) %>%            # total sales per upc
  group_by(code) %>%                            # group by code
  mutate(upc = upc[which.max(upc_sales)]) %>%   # change all UPC codes to the one with
                                                # highest total sales (within group)
  ungroup() %>%
  select(-upc_sales) %>%
  bind_rows(not_in_z)                           # tack back on the rows that weren't in z
The modified data should match your data1 (it has a code column too, but you could drop that).
final = modified %>%
ungroup() %>% # redo the grouping
group_by(upc, fips_state_code, mymonth, year) %>%
summarize( # add your summary columns
sales = sum(sales),
units = sum(units),
price = sales / units
) %>%
select( # get columns in the same order as your "data2"
upc, fips_state_code, mymonth, price, units, year, sales
)
final
# Source: local data frame [12 x 7]
# Groups: upc, fips_state_code, mymonth [10]
#
# upc fips_state_code mymonth price units year sales
# (dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
# 1 1071917100 1 3 307.07000 2 2008 614.140
# 2 1071917100 2 2 300.07000 1 2009 300.070
# 3 1071917100 2 3 269.99000 1 2010 269.990
# 4 1153801013 1 1 27.99000 4 2009 111.960
# 5 1153801013 1 2 28.07844 121 2009 3397.491
# 6 1153801013 1 3 27.14447 170 2008 4614.561
# 7 1153801013 2 2 27.74000 8 2009 221.920
# 8 1153801013 2 2 24.79000 3 2010 74.370
# 9 1153801013 2 3 25.84620 235 2008 6073.856
# 10 1461503541 1 1 0.87000 1 2008 0.870
# 11 1461503541 2 2 0.65200 8 2008 5.216
# 12 1461503541 2 2 13.99000 11 2010 153.890
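The selection step hinges on total sales per upc rather than on any single row. A minimal base R check of the totals quoted in the question is sketched below; it re-enters only the upc and sales columns from the dput of data, and the names sales_by_upc, totals, group_52161 and winner are illustrative, not from the original post.

```r
# upc and sales columns copied from the dput of `data` in the question
sales_by_upc <- data.frame(
  upc = c(rep(1153801013, 5), 72105922753, rep(72105922765, 3),
          rep(1071917100, 3), rep(1461503541, 3)),
  sales = c(6073.8563380285, 3090.9396226464, 195.93, 111.96, 195.93,
            4418.6306122439, 74.37, 25.99, 306.5518181817,
            300.07, 614.14, 269.99, 5.216, 153.89, 0.87)
)

# total sales per upc (names of the result are the upcs as character strings)
totals <- tapply(sales_by_upc$sales, sales_by_upc$upc, sum)

# the three upcs present in `data` that share code 52161 in z
group_52161 <- c("1153801013", "72105922753", "72105922765")

# upc with the greatest total sales within that group
winner <- names(which.max(totals[group_52161]))
```

Here winner is the upc that should replace the other two, which is the same choice the question arrives at by hand.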

Here's a data.table approach.
First initialize data.table:
library(data.table)
setDT(data); setDT(z)
Re-assign upc:
#merge to add `code` to `data`
data[z, code := i.code, on = "upc"]
#add a new column with sales by `upc`
data[ , upc_sales := sum(sales), by = upc]
#re-assign (skip rows with no code; otherwise all NA codes form one group)
data[!is.na(code), upc := upc[which.max(upc_sales)], by = code]
Aggregate:
data2 <- data[ , .(sales = sum(sales), units = sum(units)),
by = .(upc, fips_state_code, mymonth, year)
][ , price := sales / units]
There are minor differences vis-a-vis your data2, but these are all readily fixed with setcolorder and := NULL.
This could also be accomplished in two commands, but it's a tad less legible:
data[z, code := i.code, on = "upc"]
data[!is.na(code), upc :=
       .SD[ , .(s = sum(sales)), by = upc][which.max(s), upc],
     by = code][ , {sl <- sum(sales); us <- sum(units)
                    .(sales = sl, units = us, price = sl/us)},
                 by = .(upc, fips_state_code, mymonth, year)]

Related

Why is group_by and mutate giving me the unexpected result?

This is an excerpt of my dataset:
check = structure(list(currency = c("AED", "ATS", "AUD", "BEF", "BND",
"CAD"), year = c(2005, 2005, 2005, 2005, 2005, 2005), value = c(0,
0, 14628, 0, 27, 1604), month = c("1", "1", "1", "1", "1", "1"
), quarter = c(1, 1, 1, 1, 1, 1)), row.names = c(NA, 6L), class = "data.frame")
Running this code:
check2 = check %>% group_by(currency) %>% mutate(sum = sum(value))
gives me
currency year value month quarter sum
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 AED 2005 0 1 1 16259
2 ATS 2005 0 1 1 16259
3 AUD 2005 14628 1 1 16259
4 BEF 2005 0 1 1 16259
5 BND 2005 27 1 1 16259
6 CAD 2005 1604 1 1 16259
Shouldn't it give me a different value for each currency? When I tried to group by different combinations of variables, it gives me the same value 16259. Could someone point out where I did it wrong? Thank you.
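No answer is recorded for this question. For reference, the per-currency sums the asker expects can be computed in base R with ave(), as sketched below on the currency and value columns from the excerpt. One common cause of the symptom, stated here as an assumption because the post does not show which packages were loaded, is plyr being attached after dplyr: plyr's mutate() ignores the grouping, so sum(value) is taken over the whole column (0 + 0 + 14628 + 0 + 27 + 1604 = 16259). In that case, calling dplyr::mutate() explicitly would restore the grouped behaviour.

```r
# currency and value columns from the excerpt in the question
check <- data.frame(
  currency = c("AED", "ATS", "AUD", "BEF", "BND", "CAD"),
  value = c(0, 0, 14628, 0, 27, 1604)
)

# ave() applies FUN within each level of the grouping variable
check$sum <- ave(check$value, check$currency, FUN = sum)

# the ungrouped total, which is what the question's output shows on every row
grand_total <- sum(check$value)
```

Because each currency appears only once in this excerpt, the grouped sum equals value on every row.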

Problems of joining datasets on R

I have a dataset containing variables and a quantity of goods sold: for some days, however, there are no values.
I created a dataset with all 0 values in sales and all NA in the rest. How can I add those lines to the initial dataset?
At the moment, I have this:
sales
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
4 1 2018 11 0 987
sales.NA
day month year employees holiday sales
1 1 2018 NA NA 0
2 1 2018 NA NA 0
3 1 2018 NA NA 0
4 1 2018 NA NA 0
I would like to create a new dataset, inserting the days where I have no observations, value 0 to sales, and NA on all other variables. Like this
new.data
day month year employees holiday sales
1 1 2018 14 0 1058
2 1 2018 25 1 2174
3 1 2018 NA NA 0
4 1 2018 11 0 987
I tried using something like this:
merge(sales.NA,sales, all.y=T, by = c("day","month","year"))
But it does not work
Using dplyr, you could use a "right_join". For example:
sales <- data.frame(day = c(1,2,4),
month = c(1,1,1),
year = c(2018, 2018, 2018),
employees = c(14, 25, 11),
holiday = c(0,1,0),
sales = c(1058, 2174, 987)
)
sales.NA <- data.frame(day = c(1,2,3,4),
month = c(1,1,1,1),
year = c(2018,2018,2018, 2018)
)
right_join(sales, sales.NA)
This leaves you with
day month year employees holiday sales
1 1 1 2018 14 0 1058
2 2 1 2018 25 1 2174
3 3 1 2018 NA NA NA
4 4 1 2018 11 0 987
This leaves NA in sales where you want 0, but that could be fixed by including the sales data in sales.NA, or you could use "tidyr"
right_join(sales, sales.NA) %>% mutate(sales = replace_na(sales, 0))
Here is another data.table solution:
jvars = c("day","month","year")
merge(sales.NA[, ..jvars], sales, by = jvars, all.x = TRUE)[is.na(sales), sales := 0L][]
day month year employees holiday sales
1: 1 1 2018 14 0 1058
2: 2 1 2018 25 1 2174
3: 3 1 2018 NA NA 0
4: 4 1 2018 11 0 987
Or with some neater syntax:
sales[sales.NA[, ..jvars], on = jvars][is.na(sales), sales := 0][]
Reproducible data:
sales <- structure(list(day = c(1L, 2L, 4L), month = c(1L, 1L, 1L), year = c(2018L,
2018L, 2018L), employees = c(14L, 25L, 11L), holiday = c(0L,
1L, 0L), sales = c(1058L, 2174L, 987L)), row.names = c(NA, -3L
), class = c("data.table", "data.frame"))
sales.NA <- structure(list(day = 1:4, month = c(1L, 1L, 1L, 1L), year = c(2018L,
2018L, 2018L, 2018L), employees = c(NA, NA, NA, NA), holiday = c(NA,
NA, NA, NA), sales = c(0L, 0L, 0L, 0L)), row.names = c(NA, -4L
), class = c("data.table", "data.frame"))
This answer uses the data.table package, since I am more familiar with the syntax, but regular data.frames should work in much the same way. I would also switch to a proper date format, which will make life easier for you down the line.
With this approach you would not need the sales.NA table at all: every day that is missing from the sales data shows up with NAs after the join.
library(data.table)
dt.dates <- data.table(Date = seq.Date(from = as.Date("2018-01-01"), to = as.Date("2018-12-31"),by = "day" ))
dt.sales <- data.table(day = c(1,2,4)
, month = c(1,1,1)
, year = c(2018,2018,2018)
, employees = c(14, 25, 11)
, holiday = c(0,1,0)
, sales = c(1058, 2174, 987)
)
dt.sales[, Date := as.Date(paste(year,month,day, sep = "-")) ]
merge( x = dt.dates
, y = dt.sales
, by.x = "Date"
, by.y = "Date"
, all.x = TRUE
)
> Date day month year employees holiday sales
1: 2018-01-01 1 1 2018 14 0 1058
2: 2018-01-02 2 1 2018 25 1 2174
3: 2018-01-03 NA NA NA NA NA NA
4: 2018-01-04 4 1 2018 11 0 987
...
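The merge() call in the question keeps the wrong side: all.y = TRUE retains every row of sales, which has no day 3. A base R sketch of the same idea as the answers above, keeping every row of the key columns from sales.NA and then filling the missing sales with 0, might look like this. The names keys and new.data are used only for illustration.

```r
sales <- data.frame(day = c(1, 2, 4), month = 1, year = 2018,
                    employees = c(14, 25, 11), holiday = c(0, 1, 0),
                    sales = c(1058, 2174, 987))
sales.NA <- data.frame(day = c(1, 2, 3, 4), month = 1, year = 2018,
                       employees = NA, holiday = NA, sales = 0)

keys <- c("day", "month", "year")

# keep only the key columns of sales.NA so the value columns do not clash,
# and use all.x = TRUE so every day listed in sales.NA survives the join
new.data <- merge(sales.NA[, keys], sales, by = keys, all.x = TRUE)

# days without an observation get 0 sales; other columns stay NA
new.data$sales[is.na(new.data$sales)] <- 0
```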

Months to integer R

This is part of the dataframe I am working on. The first column represents the year, the second the month, and the third one the number of observations for that month of that year.
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3
I have observations from 2000 to 2018. I would like to run a kernel regression on this data, so I need to create a continuous integer index from a date-class vector. For instance, Jan 2000 would be 1, Jan 2001 would be 13, Jan 2002 would be 25, and so on. With that I will be able to run the kernel regression. Later on, I need to translate that back (1 would be Jan 2000, 2 would be Feb 2000, and so on) to plot my model.
Just use a little algebra:
df$cont <- (df$year - 2000L) * 12L + df$month
You could go backward with modulus and integer division.
df$year <- (df$cont - 1L) %/% 12L + 2000L
df$month <- (df$cont - 1L) %% 12L + 1L # subtract 1 first so December maps to 12 in its own year
Here, %% is the modulus operator and %/% is the integer division operator. See ?"%%" for an explanation of these and other arithmetic operators.
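A quick round-trip check of the arithmetic is sketched below on a few hand-picked dates, including the December edge case. The backward step subtracts 1 before dividing, which keeps December in the right year (12 %/% 12 would otherwise roll over into the following year). The names back_year and back_month are illustrative.

```r
df <- data.frame(year = c(2000L, 2000L, 2001L, 2005L, 2018L),
                 month = c(1L, 12L, 1L, 7L, 12L))

# forward: Jan 2000 -> 1, Dec 2000 -> 12, Jan 2001 -> 13, ...
df$cont <- (df$year - 2000L) * 12L + df$month

# backward: shift to a 0-based index, split it, then shift the month back
back_year <- (df$cont - 1L) %/% 12L + 2000L
back_month <- (df$cont - 1L) %% 12L + 1L
```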
What you can do is something like the following. First create a dates data.frame with expand.grid so we have all the years and months from 2000 01 to 2018 12. Next put this in the correct order and last add an order column so that 2000 01 starts with 1 and 2018 12 is 228. If you merge this with your original table you get the below result. You can then remove columns you don't need. And because you have a dates table you can return the year and month columns based on the order column.
dates <- expand.grid(year = seq(2000, 2018), month = seq(1, 12))
dates <- dates[order(dates$year, dates$month), ]
dates$order <- seq_along(dates$year)
merge(df, dates, by.x = c("year", "month"), by.y = c("year", "month"))
year month obs order
1 2005 10 4 70
2 2005 12 2 72
3 2005 7 2 67
4 2006 1 4 73
5 2006 10 3 82
6 2006 2 1 74
7 2006 7 2 79
8 2006 8 1 80
data:
df <- structure(list(year = c(2005L, 2005L, 2005L, 2006L, 2006L, 2006L, 2006L, 2006L),
month = c(7L, 10L, 12L, 1L, 2L, 7L, 8L, 10L),
obs = c(2L, 4L, 2L, 4L, 1L, 2L, 1L, 3L)),
class = "data.frame",
row.names = c(NA, -8L))
An option is to use the yearmon type from the zoo package and then calculate the difference in months from Jan 2000 by subtracting yearmon values.
library(zoo)
# +1 has been added to difference so that Jan 2000 is treated as 1
df$slNum = (as.yearmon(paste0(df$year, df$month),"%Y%m")-as.yearmon("200001","%Y%m"))*12+1
# year month obs slNum
# 1 2005 7 2 67
# 2 2005 10 4 70
# 3 2005 12 2 72
# 4 2006 1 4 73
# 5 2006 2 1 74
# 6 2006 7 2 79
# 7 2006 8 1 80
# 8 2006 10 3 82
Data:
df <- read.table(text =
"year month obs
2005 07 2
2005 10 4
2005 12 2
2006 01 4
2006 02 1
2006 07 2
2006 08 1
2006 10 3",
header = TRUE, stringsAsFactors = FALSE)

Using dplyr to summarize by multiple groups

I'm trying to use dplyr to summarize a dataset based on 2 groups: "year" and "area". This is what the dataset looks like:
Year Area Num
1 2000 Area 1 99
2 2001 Area 3 85
3 2000 Area 1 60
4 2003 Area 2 90
5 2002 Area 1 40
6 2002 Area 3 30
7 2004 Area 4 10
...
The end result should look something like this:
Year Area Mean
1 2000 Area 1 100
2 2000 Area 2 80
3 2000 Area 3 89
4 2001 Area 1 80
5 2001 Area 2 85
6 2001 Area 3 59
7 2002 Area 1 90
8 2002 Area 2 88
...
Excuse the values for "mean", they're made up.
The code for the example dataset:
df <- structure(list(
Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
Area = structure(c(1L, 3L, 1L, 2L, 1L, 3L, 4L),
.Label = c("Area 1", "Area 2", "Area 3", "Area 4"),
class = "factor"),
Num = structure(c(7L, 5L, 4L, 6L, 3L, 2L, 1L),
.Label = c("10", "30", "40", "60", "85", "90", "99"),
class = "factor")),
.Names = c("Year", "Area", "Num"),
class = "data.frame", row.names = c(NA, -7L))
df$Num <- as.numeric(df$Num)
Things I've tried:
df.meanYear <- df %>%
group_by(Year) %>%
group_by(Area) %>%
summarize_each(funs(mean(Num)))
But it just replaces every value with the mean, instead of the intended result.
If possible, please also provide alternative (i.e. non-dplyr) methods, because I'm still new to R.
Is this what you are looking for?
library(dplyr)
df <- group_by(df, Year, Area)
df <- summarise(df, avg = mean(Num))
We can use data.table
library(data.table)
setDT(df)[, .(avg = mean(Num)) , by = .(Year, Area)]
I had a similar problem in my code, and I fixed it with the .groups argument:
df %>%
group_by(Year,Area) %>%
summarise(avg = mean(Num), .groups="keep")
Also verified with the added example (as.numeric corrupted Num values, so I used as.numeric(as.character(df$Num)) to fix it):
Year Area avg
<dbl> <fct> <dbl>
1 2000 Area 1 79.5
2 2001 Area 3 85
3 2002 Area 1 40
4 2002 Area 3 30
5 2003 Area 2 90
6 2004 Area 4 10
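Since the question also asks for a non-dplyr method, here is a minimal base R sketch using aggregate(). It rebuilds the example data with Num as a factor, as in the question, and converts it through character first, because as.numeric() on a factor returns the internal level codes rather than the labels. The name res is illustrative.

```r
df <- data.frame(
  Year = c(2000, 2001, 2000, 2003, 2002, 2002, 2004),
  Area = factor(c("Area 1", "Area 3", "Area 1", "Area 2",
                  "Area 1", "Area 3", "Area 4")),
  Num = factor(c("99", "85", "60", "90", "40", "30", "10"))
)

# factor -> character -> numeric keeps the printed values (99, 85, ...)
df$Num <- as.numeric(as.character(df$Num))

# mean of Num for every Year/Area combination that occurs in the data
res <- aggregate(Num ~ Year + Area, data = df, FUN = mean)
```

Only combinations that actually occur in the data appear in res, so the result has six rows for this example.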

Combining datapoints using an index dataframe in R

I have two dataframes, and I'd like to use one as reference for combining observations in the other one.
First, I have data:
> data
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 72105922753 1 3 27.10816 163 2008 4418.6306
7 72105922765 2 2 24.79000 3 2010 74.3700
8 72105922765 2 2 25.99000 1 2009 25.9900
9 72105922765 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Then, I have z, which is the reference:
> z
upc code
3 1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2 1071917100 50174
1256 8723610700 50174
I want to combine datapoints in data whose upcs share the same code in z.
In the sample I gave to you, there are 7 different UPCs.
1071917100 is also in z, with the code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it remains unchanged.
1461503541, 11111111, and 11111112 are not in z at all, so they also remain unchanged.
1153801013, 72105922753, and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all the observations with these upcs.
I want to do this in a really specific way:
First, I want to choose the UPC with the greatest amount of sales across the data. 1153801013 has 9668.616 in sales (simply the sum of all sales with that upc). 72105922753 has 4418.631 in sales. 72105922765 has 406.9118 in sales. Therefore, I choose 1153801013 as the upc for all of them.
Now having chosen this upc, I want to change 72105922753 and 72105922765 to 1153801013 in data.
Now we have a dataset that looks like this:
> data1
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 1153801013 1 3 27.10816 163 2008 4418.6306
7 1153801013 2 2 24.79000 3 2010 74.3700
8 1153801013 2 2 25.99000 1 2009 25.9900
9 1153801013 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Finally, I want to combine all the datapoints with the same year, mymonth, and fips_state_code. The way this will happen is by adding up the sales and unit numbers of datapoints with the same upc, fips_state_code, mymonth, and year, and then recalculating the weighted price. (I.e., price = total Sales / total Units.)
And so, the final data set should look like this:
> data2
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.856
2 1153801013 1 2 28.07844 121 2009 3397.491
3 1153801013 2 2 27.74000 8 2009 221.920
4 1153801013 1 1 27.99000 4 2009 111.960
5 1153801013 1 3 27.14448 170 2008 4614.561
6 1153801013 2 2 24.79000 3 2010 74.370
7 1071917100 2 2 300.07000 1 2009 300.070
8 1071917100 1 3 307.07000 2 2008 614.140
9 1071917100 2 3 269.99000 1 2010 269.990
10 1461503541 2 2 0.65200 8 2008 5.216
11 1461503541 2 2 13.99000 11 2010 153.890
12 1461503541 1 1 0.87000 1 2008 0.870
13 11111111 1 1 3.00000 2 2008 6.000
14 11111112 1 1 6.00000 5 2008 30.000
I did try to do this myself, but it seems like it could be done more efficiently than my code using dplyr, and I couldn't accomplish the last part successfully. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code:
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541,
11111111, 11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2,
2, 1, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3,
2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 11111111,
11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2,
2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541, 11111111, 11111112), fips_state_code = c(2,
1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2,
1, 3, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.07844,
27.74, 27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652,
13.99, 0.87, 3, 6), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1,
8, 11, 1, 2, 5), year = c(2008, 2009, 2009, 2009, 2008, 2010,
2009, 2008, 2010, 2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285,
3397.491, 221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), row.names = c(NA,
14L), class = c("tbl_df", "data.frame"))
This is what I have attempted so far:
w <- z[match(unique(z$code), z$code),]
w <- plyr::rename(w,c("upc"="upc1"))
data <- merge(x=data,y=z,by="upc",all.x=T,all.y=F)
data <- merge(x=data,y=w,by="code",all.x=T,all.y=F)
data <- within(data, upc2 <- ifelse(!is.na(upc1),upc1,upc))
data$upc <- data$upc2
data$upc1 <- data$upc2 <- data$code <- NULL
data <- data[complete.cases(data),]
attach(data)
data <- aggregate(data,by=list(upc,fips_state_code,year,mymonth),FUN=sum)
data$price <- data$sales / data$units
detach(data)
data$Group.1 <- data$Group.2 <- data$Group.3 <- data$Group.4 <- NULL
I can't figure out how to make the upc chosen be the one with the most sales. It would also be great if there were a way to do this in fewer lines of code and more elegantly.
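No answer is recorded for this version of the question. The steps described above can be sketched in base R as follows. This is one possible approach under stated assumptions, not the asker's method: it re-enters the columns it needs from the dput above (price is left out because it is recomputed), and the helper names upc_sales, has_code, rows and out are illustrative. Rows whose upc is not in z are left untouched, so 11111111 and 11111112 are not merged with each other.

```r
data <- data.frame(
  upc = c(rep(1153801013, 5), 72105922753, rep(72105922765, 3),
          rep(1071917100, 3), rep(1461503541, 3), 11111111, 11111112),
  fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1),
  mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1),
  units = c(235, 108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5),
  year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009,
           2008, 2010, 2008, 2010, 2008, 2008, 2008),
  sales = c(6073.8563380285, 3090.9396226464, 195.93, 111.96, 195.93,
            4418.6306122439, 74.37, 25.99, 306.5518181817, 300.07,
            614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)
)
z <- data.frame(
  upc = c(1153801013, 72105922753, 72105922765, 81153801013,
          81153801041, 1071917100, 8723610700),
  code = c(52161L, 52161L, 52161L, 52161L, 52161L, 50174L, 50174L)
)

# 1. look up each row's code (NA when the upc is not in z)
data$code <- z$code[match(data$upc, z$upc)]

# 2. total sales per upc, repeated on every row of that upc
data$upc_sales <- ave(data$sales, data$upc, FUN = sum)

# 3. within each non-missing code, overwrite upc with the best-selling one
has_code <- !is.na(data$code)
for (cd in unique(data$code[has_code])) {
  rows <- which(data$code == cd)   # which() drops the NA comparisons
  data$upc[rows] <- data$upc[rows][which.max(data$upc_sales[rows])]
}

# 4. sum units and sales per upc/state/month/year, then recompute price
out <- aggregate(cbind(units, sales) ~ upc + fips_state_code + mymonth + year,
                 data = data, FUN = sum)
out$price <- out$sales / out$units
```

The row order of out differs from data2 in the question, since aggregate() sorts by the grouping columns.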
