I have two data frames, and I'd like to use one as a reference for combining observations in the other.
First, I have data:
> data
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 72105922753 1 3 27.10816 163 2008 4418.6306
7 72105922765 2 2 24.79000 3 2010 74.3700
8 72105922765 2 2 25.99000 1 2009 25.9900
9 72105922765 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Then, I have z, which is the reference:
> z
upc code
3 1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2 1071917100 50174
1256 8723610700 50174
I want to combine data points in data whose upcs share the same code in z.
In the sample I gave you, there are seven different UPCs.
1071917100 is also in z, with the code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it remains unchanged.
1461503541, 11111111, and 11111112 are not in z at all, so they also remain unchanged.
1153801013, 72105922753, and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all the observations with these UPCs.
I want to do this in a really specific way:
First, I want to choose the UPC with the greatest amount of sales across the data. 1153801013 has 9668.616 in sales (simply the sum of all sales with that upc). 72105922753 has 4418.631 in sales. 72105922765 has 406.9118 in sales. Therefore, I choose 1153801013 as the upc for all of them.
Now having chosen this upc, I want to change 72105922753 and 72105922765 to 1153801013 in data.
Now we have a dataset that looks like this:
> data1
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 1153801013 1 3 27.10816 163 2008 4418.6306
7 1153801013 2 2 24.79000 3 2010 74.3700
8 1153801013 2 2 25.99000 1 2009 25.9900
9 1153801013 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Finally, I want to combine all the data points with the same upc, fips_state_code, mymonth, and year: add up the sales and units of the matching data points, then recalculate the weighted price. (I.e., price = total sales / total units.)
And so, the final data set should look like this:
> data2
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.856
2 1153801013 1 2 28.07844 121 2009 3397.491
3 1153801013 2 2 27.74000 8 2009 221.920
4 1153801013 1 1 27.99000 4 2009 111.960
5 1153801013 1 3 27.14448 170 2008 4614.561
6 1153801013 2 2 24.79000 3 2010 74.370
7 1071917100 2 2 300.07000 1 2009 300.070
8 1071917100 1 3 307.07000 2 2008 614.140
9 1071917100 2 3 269.99000 1 2010 269.990
10 1461503541 2 2 0.65200 8 2008 5.216
11 1461503541 2 2 13.99000 11 2010 153.890
12 1461503541 1 1 0.87000 1 2008 0.870
13 11111111 1 1 3.00000 2 2008 6.000
14 11111112 1 1 6.00000 5 2008 30.000
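As a quick sanity check of the weighted-price rule, row 2 of data2 can be reproduced by hand from the two upc-1153801013 rows of data1 with fips_state_code 1, mymonth 2, and year 2009:

```r
# Weighted-price check for one group of data1:
# upc 1153801013, fips_state_code 1, mymonth 2, year 2009
sales <- c(3090.9396226464, 306.5518181817)  # rows 2 and 9 of data1
units <- c(108, 13)

total_sales <- sum(sales)            # 3397.491
total_units <- sum(units)            # 121
price <- total_sales / total_units   # 28.07844, matching row 2 of data2
```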
I did try to do this myself, but it seems it could be done more efficiently with dplyr than with my code, and I couldn't accomplish the last part successfully. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code:
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541,
11111111, 11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2,
2, 1, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3,
2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 11111111,
11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2,
2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541, 11111111, 11111112), fips_state_code = c(2,
1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2,
1, 3, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.07844,
27.74, 27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652,
13.99, 0.87, 3, 6), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1,
8, 11, 1, 2, 5), year = c(2008, 2009, 2009, 2009, 2008, 2010,
2009, 2008, 2010, 2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285,
3397.491, 221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), row.names = c(NA,
14L), class = c("tbl_df", "data.frame"))
This is what I have attempted so far:
w <- z[match(unique(z$code), z$code),]
w <- plyr::rename(w,c("upc"="upc1"))
data <- merge(x=data,y=z,by="upc",all.x=T,all.y=F)
data <- merge(x=data,y=w,by="code",all.x=T,all.y=F)
data <- within(data, upc2 <- ifelse(!is.na(upc1),upc1,upc))
data$upc <- data$upc2
data$upc1 <- data$upc2 <- data$code <- NULL
data <- data[complete.cases(data),]
attach(data)
data <- aggregate(data,by=list(upc,fips_state_code,year,mymonth),FUN=sum)
data$price <- data$sales / data$units
detach(data)
data$Group.1 <- data$Group.2 <- data$Group.3 <- data$Group.4 <- NULL
I can't figure out how to make the chosen upc the one with the most sales. It would also be great if there were a way to do this in fewer lines of code and more elegantly.
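For the specific sticking point of making the chosen upc the one with the highest total sales, here is one possible dplyr sketch. It runs on a cut-down, partly pre-summed stand-in for the data (only the upc and sales columns matter for this step), not on the full dput above:

```r
library(dplyr)

# Cut-down stand-ins for `data` and `z` (values condensed from the question)
data_small <- tibble::tibble(
  upc   = c(1153801013, 1153801013, 72105922753, 72105922765, 1461503541),
  sales = c(6073.8563, 3594.7596, 4418.6306, 406.9118, 159.976)
)
z_small <- tibble::tibble(
  upc  = c(1153801013, 72105922753, 72105922765, 1071917100),
  code = c(52161, 52161, 52161, 50174)
)

recoded <- data_small %>%
  left_join(z_small, by = "upc") %>%
  group_by(code, upc) %>%
  mutate(upc_sales = sum(sales)) %>%        # total sales per upc
  group_by(code) %>%
  mutate(upc = if_else(is.na(code), upc,    # upcs not in z stay as-is
                       upc[which.max(upc_sales)])) %>%
  ungroup() %>%
  select(-code, -upc_sales)
```

After this step the usual group_by/summarise aggregation takes care of the rest.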
I have two data frames that I want to merge. Both of them contain information about people, per id & year.
One of them is the "main" one, and the other adds information. However, I can't merge them in the regular way (i.e., with merge() or dplyr::left_join()) because the year values don't necessarily match per id. So I want to carry the information from the second table forward chronologically into the per-year rows of the main table.
In the following example, I have two tables about army officers. The "main" one has 3 columns for id, year, and another col_1:
df_main_info <-
tribble(~id, ~year, ~col_1,
1, 2008, "foo",
1, 2005, "bar",
1, 2010, "blah",
1, 2020, "bar",
2, 1999, "foo",
2, 2020, "foo",
3, 2002, "bar",
3, 2010, "bar",
4, 2003, "foo",
4, 2010, "bar"
)
I have an additional table with id and year columns, for when each officer got their rank, and what rank it was:
df_ranks_history <-
tribble(~id, ~year, ~army_rank,
1, 2005, "second_lieutenant",
1, 2010, "first_lieutenant",
1, 2018, "major",
1, 2021, "colonel",
2, 2002, "major",
2, 2018, "colonel",
3, 1995, "second_lieutenant",
3, 2000, "captain",
3, 2012, "colonel"
)
The years don't match strictly. But if, for example, officer id = 3 became "captain" in 2000, then we know that in 2002 it was still the case, so we can enter "captain" into df_main_info in row 7.
The desired output should therefore be:
desired_output <-
tribble(~id, ~year, ~col_1, ~army_rank,
1, 2008, "foo", "second_lieutenant",
1, 2005, "bar", "second_lieutenant",
1, 2010, "blah", "first_lieutenant",
1, 2020, "bar", "major",
2, 1999, "foo", NA,
2, 2020, "foo", "colonel",
3, 2002, "bar", "captain",
3, 2010, "bar", "captain",
4, 2003, "foo", NA,
4, 2010, "bar", NA
)
In case this is relevant, the ranks go in a certain order:
us_army_officer_ranks <- c("second_lieutenant",
"first_lieutenant",
"captain",
"major",
"lieutenant_colonel",
"colonel")
# colonel > lieutenant_colonel > major > captain > first_lieutenant > second_lieutenant
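A side note on the rank vector above: if the ordering ever matters (e.g. to sanity-check that an officer's rank never decreases), it can be encoded as an ordered factor. A small sketch, not something the solutions below depend on:

```r
us_army_officer_ranks <- c("second_lieutenant", "first_lieutenant", "captain",
                           "major", "lieutenant_colonel", "colonel")

# id 1's rank history from df_ranks_history, as an ordered factor
ranks <- factor(c("second_lieutenant", "first_lieutenant", "major", "colonel"),
                levels = us_army_officer_ranks, ordered = TRUE)

ranks[4] > ranks[1]                               # colonel outranks second_lieutenant
is_monotone <- all(diff(as.integer(ranks)) >= 0)  # ranks only go up over time
```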
library(dplyr)
library(tidyr)
df_main_info %>%
full_join(df_ranks_history, by = c("id", "year")) %>%
group_by(id) %>%
arrange(id, year) %>%
fill(army_rank, .direction = "down") %>%
filter(!is.na(col_1))
# # A tibble: 10 × 4
# # Groups: id [4]
# id year col_1 army_rank
# <dbl> <dbl> <chr> <chr>
# 1 1 2005 bar second_lieutenant
# 2 1 2008 foo second_lieutenant
# 3 1 2010 blah first_lieutenant
# 4 1 2020 bar major
# 5 2 1999 foo NA
# 6 2 2020 foo colonel
# 7 3 2002 bar captain
# 8 3 2010 bar captain
# 9 4 2003 foo NA
# 10 4 2010 bar NA
library(data.table)
setDT(df_main_info)
setDT(df_ranks_history)
df_ranks_history[df_main_info, on = list(id, year), roll = +Inf]
id year army_rank col_1
1: 1 2008 second_lieutenant foo
2: 1 2005 second_lieutenant bar
3: 1 2010 first_lieutenant blah
4: 1 2020 major bar
5: 2 1999 <NA> foo
6: 2 2020 colonel foo
7: 3 2002 captain bar
8: 3 2010 captain bar
9: 4 2003 <NA> foo
10: 4 2010 <NA> bar
I am a novice R user, working with RStudio 1.3.1093 on Windows.
I'm working on a panel data set of all activist interventions by hedge funds in Europe for the period 2005–2019 (the time variable), so I have data at the firm-year level. I created the ID variable, giving a number to each company, and I have already calculated some financial ratios for each firm-year: ROA, EBITDA margin, sales growth, leverage, etc.
I also have data on the book-to-market ratio and the log of market value. I would like to run a binary probability model explaining the occurrence of hedge fund targeting (targeted = 1, not targeted = 0) with the variables mentioned above (lagged by one year).
This is part of the dataset:
ï..Company.code Company Targeted T.of.intervation TRBC Year Book.to.market Capex.to.sales EBITDA.MARGIN Leverage Ln.of.Mv
<int> <chr> <int> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 BALDA ~ 0 2006 5110 2005 0.387 0.0816 0.185 0.219 5.65
2 1 BALDA ~ 1 2006 5110 2006 0.554 0.0935 -0.0548 0.426 5.46
3 1 BALDA ~ 1 2006 5110 2007 0.292 0.137 -0.0993 0.337 5.69
4 1 BALDA ~ 1 2006 5110 2008 3.55 0.144 -0.00861 0.263 4.44
5 2 SUEZ SA 0 2006 5910 2005 0.733 0.0925 0.180 0.445 6.65
6 2 SUEZ SA 1 2006 5910 2006 1.11 0.0877 0.175 0.417 6.51
7 2 SUEZ SA 1 2006 5910 2007 0.949 0.0941 0.168 0.526 6.58
8 2 SUEZ SA 1 2006 5910 2008 0.600 0.0925 0.150 0.551 6.77
9 3 ASM IN~ 0 2007 5710 2006 0.321 0.0449 0.193 0.340 5.93
10 3 ASM IN~ 1 2007 5710 2007 0.354 0.0494 0.185 0.260 5.95
# ... with 3,357 more rows, and 7 more variables: Nwc.to.sales <dbl>, ROA <dbl>, Sales.Growth <dbl>, Industrial <int>,
# NR <int>, Tmt <int>, Consumer <int>
The mechanics of building a logistic regression (LR) model are quite simple. Base R's stats package supports it via glm():
library(tidyverse) #this helps import data from example
df <- tribble(
~id, ~Company.code, ~Company, ~Targeted, ~T.of.intervation, ~TRBC, ~Year, ~Book.to.market, ~Capex.to.sales, ~EBITDA.MARGIN, ~Leverage, ~Ln.of.Mv,
1, 1, "BALDA", 0, 2006, 5110, 2005, 0.387, 0.0816, 0.185, 0.219, 5.65,
2, 1, "BALDA", 1, 2006, 5110, 2006, 0.554, 0.0935, -0.0548, 0.426, 5.46,
3, 1, "BALDA", 1, 2006, 5110, 2007, 0.292, 0.137, -0.0993, 0.337, 5.69,
4, 1, "BALDA", 1, 2006, 5110, 2008, 3.55, 0.144, -0.00861, 0.263, 4.44,
5, 2, "SUEZ", 0, 2006, 5910, 2005, 0.733, 0.0925, 0.180, 0.445, 6.65,
6, 2, "SUEZ", 1, 2006, 5910, 2006, 1.11, 0.0877, 0.175, 0.417, 6.51,
7, 2, "SUEZ", 1, 2006, 5910, 2007, 0.949, 0.0941, 0.168, 0.526, 6.58,
8, 2, "SUEZ", 1, 2006, 5910, 2008, 0.600, 0.0925, 0.150, 0.551, 6.77,
9, 3, "ASM", 0, 2007, 5710, 2006, 0.321, 0.0449, 0.193, 0.340, 5.93,
10, 3, "ASM", 1, 2007, 5710, 2007, 0.354, 0.0494, 0.185, 0.260, 5.95
)
lr <- glm(Targeted~Book.to.market+Capex.to.sales+EBITDA.MARGIN+Leverage+Ln.of.Mv, family = "binomial", data = df) #this is the model training (fails for this dataset)
prediction <- predict(lr, df, type = "response") #this applies model to the data
#note: this example doesn't make sense - I didn't have enough data to make both training and validation
# datasets - you should keep part of your data and use it instead `df` in this step
More details can be found in many tutorials.
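One thing the snippet above does not cover is the lag: the question wants the explanatory variables lagged by one year. A hedged sketch of how that could be added with dplyr, assuming one row per company-year (Leverage.l1 is a made-up name for the lagged column, and the panel values are invented):

```r
library(dplyr)

# Toy firm-year panel (made-up values in the spirit of the question's data)
panel <- tibble::tibble(
  Company.code = c(1, 1, 1, 2, 2, 2),
  Year         = c(2005, 2006, 2007, 2005, 2006, 2007),
  Leverage     = c(0.219, 0.426, 0.337, 0.445, 0.417, 0.526)
)

panel_lagged <- panel %>%
  group_by(Company.code) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(Leverage.l1 = lag(Leverage)) %>%  # previous year's value, NA in year 1
  ungroup()
```

The same mutate(across(..., lag)) pattern extends to all the regressors; the lagged columns then go into the glm() formula in place of the contemporaneous ones.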
I am working in R with a longitudinal database (register data, so pretty large) about individuals, with several rows per ID (named "vn" in the database) and their attributes in columns. My variable "observation" indicates each year of observation. Sometimes (but not in all cases) one or more years are skipped because nothing changes for the individual. I would like to add those "missing rows" to my database so that each individual has an entry for every year between their first and last observation (which aren't necessarily the same for everyone). Since individuals' attributes can change over time, each added row must carry the same attribute values as the previous one (in the example below, if a row is added for 2010, the individual will have a value of 3 in maritalstatus and 5584 in municipality).
Here is an overview of an individual in my database:
structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")
My current code (below) adds rows to my dataset when an observation year is missing, keeping all the information from the previous row, except that some information is now duplicated: some observation years appear twice.
test<-test %>% expand(vn, municipality, yearofbirth, sex, maritalstatus, full_seq(observation,1))
I was also thinking of using rep() but can't find a way to do what I want.
Does anyone have an idea for a code that could help me?
If we have at least one observation per year then this could be achieved via tidyr::complete and tidyr::fill like so:
Edit 1: If not all years are present in the dataset the approach still works by first converting observation to factor with the levels set to the range of years:
Edit 2: To take account of differing year ranges one has to filter after the fill. To this end I added a variable last_obs containing the last year observed for an individual. This variable can be used to filter after the fill.
d <- structure(list(vn = c("555", "555", "555", "555", "555", "555", "555", "555", "555", "555", "555"), municipality = c(5586, 5586, 5586, 5586, 5586, 5586, 5611, 5611, 5584, 5584, 5584), yearofbirth = c(1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957, 1957), sex = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), maritalstatus = c(2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), observation = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2012)), row.names = 470:480, class = "data.frame")
library(dplyr)
library(tidyr)
# Add some data
d1 <- d %>%
mutate(vn = "556") %>%
filter(observation <= 2010, observation %% 2 == 0)
# Bind data
d2 <- bind_rows(d, d1)
d2 %>%
# Add year of last obs by vn
group_by(vn) %>%
mutate(last_obs = last(observation)) %>%
ungroup() %>%
# Convert to fct
mutate(observation = factor(observation, levels = 2000:2016)) %>%
# Complete and fill
tidyr::complete(vn, observation) %>%
tidyr::fill(everything()) %>%
# Convert back to numeric
mutate(observation = as.integer(as.character(observation))) %>%
# Drop obs after year of last obs
filter(as.numeric(observation) <= last_obs) %>%
# Drop helper
select(-last_obs)
#> # A tibble: 22 x 6
#> vn observation municipality yearofbirth sex maritalstatus
#> <chr> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 555 2000 5586 1957 2 2
#> 2 555 2001 5586 1957 2 2
#> 3 555 2002 5586 1957 2 3
#> 4 555 2003 5586 1957 2 3
#> 5 555 2004 5586 1957 2 3
#> 6 555 2005 5586 1957 2 3
#> 7 555 2006 5611 1957 2 3
#> 8 555 2007 5611 1957 2 3
#> 9 555 2008 5584 1957 2 3
#> 10 555 2009 5584 1957 2 3
#> # ... with 12 more rows
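A possibly simpler variant (a sketch, shown here only on a shortened single-individual example): grouping before complete() lets full_seq() build each individual's own first-to-last year range, which avoids the factor round-trip and the last_obs filtering:

```r
library(dplyr)
library(tidyr)

# Shortened version of the example individual
d <- data.frame(vn = "555",
                municipality  = c(5586, 5611, 5584),
                yearofbirth   = 1957,
                sex           = 2,
                maritalstatus = c(2, 3, 3),
                observation   = c(2000, 2006, 2009))

d_filled <- d %>%
  group_by(vn) %>%
  complete(observation = full_seq(observation, 1)) %>%  # per-vn year range
  fill(everything(), .direction = "down") %>%           # carry attributes forward
  ungroup()
```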
I have two data frames, and I'd like to use one as a reference for combining observations in the other.
First, I have data:
> data
Source: local data frame [15 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 72105922753 1 3 27.10816 163 2008 4418.6306
7 72105922765 2 2 24.79000 3 2010 74.3700
8 72105922765 2 2 25.99000 1 2009 25.9900
9 72105922765 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
Then, I have z, which is the reference:
> z
upc code
3 1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2 1071917100 50174
1256 8723610700 50174
I want to combine data points in data whose upcs share the same code in z.
In the sample I gave you, there are five different UPCs.
1071917100 is also in z, with the code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it remains unchanged.
1461503541 is not in z at all, so it also remains unchanged.
1153801013, 72105922753, and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all the observations with these UPCs.
I want to do this in a really specific way:
First, I want to choose the upc with the greatest amount of sales across the data. 1153801013 has 9668.616 in sales (simply the sum of all sales with that upc). 72105922753 has 4418.631 in sales. 72105922765 has 406.9118 in sales. Therefore, I choose 1153801013 as the upc for all of them.
Now having chosen this upc, I want to change 72105922753 and 72105922765 to 1153801013 in data.
Now we have a data set that looks like this:
> data1
Source: local data frame [15 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 1153801013 1 3 27.10816 163 2008 4418.6306
7 1153801013 2 2 24.79000 3 2010 74.3700
8 1153801013 2 2 25.99000 1 2009 25.9900
9 1153801013 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
Finally, I want to combine all the data points with the same upc, fips_state_code, mymonth, and year: add up the sales and units of the matching data points, then recalculate the weighted price. (I.e., price = total sales / total units.)
And so, the final data set should look like this:
> data2
Source: local data frame [12 x 7]
upc fips_state_code mymonth price units year sales
(dbl) (int) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1153801013 2 3 25.84620 235 2008 6073.856
2 1153801013 1 2 28.07844 121 2009 3397.491
3 1153801013 2 2 27.74000 8 2009 221.920
4 1153801013 1 1 27.99000 4 2009 111.960
5 1153801013 1 3 27.14448 170 2008 4614.561
6 1153801013 2 2 24.79000 3 2010 74.370
7 1071917100 2 2 300.07000 1 2009 300.070
8 1071917100 1 3 307.07000 2 2008 614.140
9 1071917100 2 3 269.99000 1 2010 269.990
10 1461503541 2 2 0.65200 8 2008 5.216
11 1461503541 2 2 13.99000 11 2010 153.890
12 1461503541 1 1 0.87000 1 2008 0.870
I did try to do this myself, but it took me many lines of code, and I couldn't accomplish the last part successfully. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code:
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541
), fips_state_code = c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2, 2,
3, 3, 2, 2, 1), price = c(25.8461971831, 28.6198113208, 27.99,
27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909, 300.07,
307.07, 269.99, 0.652, 13.99, 0.87), units = c(235L, 108L, 7L,
4L, 7L, 163L, 3L, 1L, 13L, 1L, 2L, 1L, 8L, 11L, 1L), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, -15L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541),
fips_state_code = c(2L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 2L,
1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2,
2, 2, 3, 3, 2, 2, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87), units = c(235L,
108L, 7L, 4L, 7L, 163L, 3L, 1L, 13L, 1L, 2L, 1L, 8L, 11L,
1L), year = c(2008, 2009, 2009, 2009, 2008, 2008, 2010, 2009,
2009, 2009, 2008, 2010, 2008, 2010, 2008), sales = c(6073.8563380285,
3090.9396226464, 195.93, 111.96, 195.93, 4418.6306122439,
74.37, 25.99, 306.5518181817, 300.07, 614.14, 269.99, 5.216,
153.89, 0.87)), .Names = c("upc", "fips_state_code", "mymonth",
"price", "units", "year", "sales"), row.names = c(NA, -15L), class = c("tbl_df",
"data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541), fips_state_code = c(2L, 1L, 2L, 1L,
1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L), mymonth = c(3, 2, 2, 1, 3, 2,
2, 3, 3, 2, 2, 1), price = c(25.8461971831, 28.07844, 27.74,
27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652, 13.99,
0.87), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1, 8, 11, 1),
year = c(2008, 2009, 2009, 2009, 2008, 2010, 2009, 2008,
2010, 2008, 2010, 2008), sales = c(6073.8563380285, 3397.491,
221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), class = c("tbl_df",
"data.frame"), row.names = c(NA, -12L))
I think this works. The rows of the final result are in a different order than your data2, but at a glance they look the same.
# join data
joined = data %>% left_join(z)
# set aside the rows not in z
not_in_z = filter(joined, is.na(code))
modified = joined %>%
filter(!is.na(code)) %>% # for the rows in z
group_by(code) %>% # group by code
arrange(desc(sales)) %>% # sort by sales (so highest sales is first)
mutate(upc = first(upc)) %>% # change all UPC codes to the one with
# highest sales (within group)
bind_rows(not_in_z) # tack back on the rows that weren't in z
The modified data should match your data1 (it has a code column too, but you could drop that).
final = modified %>%
ungroup() %>% # redo the grouping
group_by(upc, fips_state_code, mymonth, year) %>%
summarize( # add your summary columns
sales = sum(sales),
units = sum(units),
price = sales / units
) %>%
select( # get columns in the same order as your "data2"
upc, fips_state_code, mymonth, price, units, year, sales
)
final
# Source: local data frame [12 x 7]
# Groups: upc, fips_state_code, mymonth [10]
#
# upc fips_state_code mymonth price units year sales
# (dbl) (int) (dbl) (dbl) (int) (dbl) (dbl)
# 1 1071917100 1 3 307.07000 2 2008 614.140
# 2 1071917100 2 2 300.07000 1 2009 300.070
# 3 1071917100 2 3 269.99000 1 2010 269.990
# 4 1153801013 1 1 27.99000 4 2009 111.960
# 5 1153801013 1 2 28.07844 121 2009 3397.491
# 6 1153801013 1 3 27.14447 170 2008 4614.561
# 7 1153801013 2 2 27.74000 8 2009 221.920
# 8 1153801013 2 2 24.79000 3 2010 74.370
# 9 1153801013 2 3 25.84620 235 2008 6073.856
# 10 1461503541 1 1 0.87000 1 2008 0.870
# 11 1461503541 2 2 0.65200 8 2008 5.216
# 12 1461503541 2 2 13.99000 11 2010 153.890
Here's a data.table approach.
First initialize data.table:
library(data.table)
setDT(data); setDT(z)
Re-assign upc:
#merge to add `code` to `data`
data[z, code := i.code, on = "upc"]
#add a new column with sales by `upc`
data[ , upc_sales := sum(sales), by = upc]
#re-assign
data[ , upc := upc[which.max(upc_sales)], by = code]
Aggregate:
data2 <- data[ , .(sales = sum(sales), units = sum(units)),
by = .(upc, fips_state_code, mymonth, year)
][ , price := sales / units]
There are minor differences vis-a-vis your data2, but these are all readily fixed with setcolorder and := NULL.
This could also be accomplished in two commands, but it's a tad less legible:
data[z, code := i.code, on = "upc"]
data[, upc :=
upc[which.max(.SD[ , sum(sales), by = upc]$V1)],
by = code][ , {sl <- sum(sales); us <- sum(units)
.(sales = sl, units = us, price = sl/us)},
by = .(upc, fips_state_code, mymonth, year)]
I have a set of 24 grouped (hierarchical) time series supposedly running over 3 years, and I want to look at monthly sales, but it turns out that a number of them have missing observations, e.g.
getCounts(Shop1, ...)
2011-01 2011-02 2011-03 2011-04 2011-05 2011-06 2011-07 2011-08 2011-09 2011-10 2011-11 2011-12 2012-02 2012-03 2012-04 2012-05 2012-06 2012-07 2012-08 2012-09 2012-10 2012-11
10 22 10 12 36 31 25 19 7 7 7 5 1 9 9 11 10 16 25 3 2 5
is missing an observation for January 2012 and ends in November 2012 although it's supposed to run to December 2013.
getCounts uses the command
with(myDF, tapply(varName, substr(dateName, 1, 7), sum))
to get the monthly counts.
I want to replace the missing observations, both in the middle of the time series and at the end, with NAs, so that all my time series have the same number of observations and any "holes" will be visible in a plot.
Can anybody help me do this?
Thanks!
Edit: My preferred output would be something like this:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2011 1 NA 2 3 4 5 6 NA 7 8 9 10
2012 2 3 4 5 6 NA NA NA NA NA NA NA
where each NA is replacing a missing observation.
Edit 2: getCounts() looks like this:
getCounts <- function(dataObject, dateName, varName){
dataNameString <- deparse(substitute(dataObject))
countsStr <- paste0("with(", dataNameString,", tapply(", varName, ", substr(", dateName, ", 1, 7), sum))")
counts <- eval(parse(text = countsStr))
return(counts)
}
And here's the dput:
structure(c(10, 22, 10, 12, 36, 31, 25, 19, 7, 7, 7, 5, 1, 9,
9, 11, 10, 16, 25, 3, 2, 5), .Dim = 22L, .Dimnames = list(c("2011-01",
"2011-02", "2011-03", "2011-04", "2011-05", "2011-06", "2011-07",
"2011-08", "2011-09", "2011-10", "2011-11", "2011-12", "2012-02",
"2012-03", "2012-04", "2012-05", "2012-06", "2012-07", "2012-08",
"2012-09", "2012-10", "2012-11")))
Try this
df <- data.frame(Year = substr(names(x), 1, 4),
Month = factor(month.abb[as.numeric(substr(names(x), 6, 7))],
levels = month.abb),
Value = x)
library(tidyr)
spread(df, Month, Value)
# Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1 2011 10 22 10 12 36 31 25 19 7 7 7 5
# 2 2012 NA 1 9 9 11 10 16 25 3 2 5 NA
Data
x <- structure(c(10, 22, 10, 12, 36, 31, 25, 19, 7, 7, 7, 5, 1, 9,
9, 11, 10, 16, 25, 3, 2, 5), .Dim = 22L, .Dimnames = list(c("2011-01",
"2011-02", "2011-03", "2011-04", "2011-05", "2011-06", "2011-07",
"2011-08", "2011-09", "2011-10", "2011-11", "2011-12", "2012-02",
"2012-03", "2012-04", "2012-05", "2012-06", "2012-07", "2012-08",
"2012-09", "2012-10", "2012-11")))