Probit model with panel data - r

I am a novice R user, working with version 1.3.1093 on Windows.
I'm working on a panel data set of all activist interventions by hedge funds in Europe for the period 2005-2019 (the time variable), so I have data at the firm-year level. I created an ID variable, giving a number to each company, and I have already calculated some financial ratios for each firm-year: ROA, EBITDA margin, sales growth, leverage, etc.
I also have data on the book-to-market ratio and the natural log of market value. I would like to run a binary probability model explaining the occurrence of hedge fund targeting (targeted = 1, not targeted = 0) using the several variables mentioned above, lagged by one year.
This is a part of the dataset:
Company.code Company Targeted T.of.intervation TRBC Year Book.to.market Capex.to.sales EBITDA.MARGIN Leverage Ln.of.Mv
<int> <chr> <int> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 BALDA ~ 0 2006 5110 2005 0.387 0.0816 0.185 0.219 5.65
2 1 BALDA ~ 1 2006 5110 2006 0.554 0.0935 -0.0548 0.426 5.46
3 1 BALDA ~ 1 2006 5110 2007 0.292 0.137 -0.0993 0.337 5.69
4 1 BALDA ~ 1 2006 5110 2008 3.55 0.144 -0.00861 0.263 4.44
5 2 SUEZ SA 0 2006 5910 2005 0.733 0.0925 0.180 0.445 6.65
6 2 SUEZ SA 1 2006 5910 2006 1.11 0.0877 0.175 0.417 6.51
7 2 SUEZ SA 1 2006 5910 2007 0.949 0.0941 0.168 0.526 6.58
8 2 SUEZ SA 1 2006 5910 2008 0.600 0.0925 0.150 0.551 6.77
9 3 ASM IN~ 0 2007 5710 2006 0.321 0.0449 0.193 0.340 5.93
10 3 ASM IN~ 1 2007 5710 2007 0.354 0.0494 0.185 0.260 5.95
# ... with 3,357 more rows, and 7 more variables: Nwc.to.sales <dbl>, ROA <dbl>, Sales.Growth <dbl>, Industrial <int>,
# NR <int>, Tmt <int>, Consumer <int>

The mechanics of building a logistic regression (LR) model are quite easy. The stats package in base R supports it through glm():
library(tidyverse) # this helps import the data from the example
df <- tribble(
  ~id, ~Company.code, ~Company, ~Targeted, ~T.of.intervation, ~TRBC, ~Year, ~Book.to.market, ~Capex.to.sales, ~EBITDA.MARGIN, ~Leverage, ~Ln.of.Mv,
  1, 1, "BALDA", 0, 2006, 5110, 2005, 0.387, 0.0816, 0.185, 0.219, 5.65,
  2, 1, "BALDA", 1, 2006, 5110, 2006, 0.554, 0.0935, -0.0548, 0.426, 5.46,
  3, 1, "BALDA", 1, 2006, 5110, 2007, 0.292, 0.137, -0.0993, 0.337, 5.69,
  4, 1, "BALDA", 1, 2006, 5110, 2008, 3.55, 0.144, -0.00861, 0.263, 4.44,
  5, 2, "SUEZ", 0, 2006, 5910, 2005, 0.733, 0.0925, 0.180, 0.445, 6.65,
  6, 2, "SUEZ", 1, 2006, 5910, 2006, 1.11, 0.0877, 0.175, 0.417, 6.51,
  7, 2, "SUEZ", 1, 2006, 5910, 2007, 0.949, 0.0941, 0.168, 0.526, 6.58,
  8, 2, "SUEZ", 1, 2006, 5910, 2008, 0.600, 0.0925, 0.150, 0.551, 6.77,
  9, 3, "ASM", 0, 2007, 5710, 2006, 0.321, 0.0449, 0.193, 0.340, 5.93,
  10, 3, "ASM", 1, 2007, 5710, 2007, 0.354, 0.0494, 0.185, 0.260, 5.95
)
lr <- glm(Targeted ~ Book.to.market + Capex.to.sales + EBITDA.MARGIN + Leverage + Ln.of.Mv,
          family = "binomial", data = df) # this fits the model (it fails for this tiny example dataset)
prediction <- predict(lr, df, type = "response") # this applies the model to the data
# note: this example doesn't make sense as-is - there isn't enough data here to build both training and
# validation datasets - you should hold out part of your data and use it instead of `df` in this step
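Since the question asks for a probit specification with regressors lagged by one year rather than a logit, here is a minimal sketch of that variant. It assumes consecutive firm-years within each company and uses the column names from the excerpt above; treat it as an illustration, not a tested solution.
library(dplyr)
# lag each regressor by one year within each company, then fit a probit
# (binomial family with a probit link) instead of the default logit
df_lagged <- df %>%
  group_by(Company.code) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(across(c(Book.to.market, Capex.to.sales, EBITDA.MARGIN, Leverage, Ln.of.Mv),
                lag, .names = "{.col}.lag1")) %>%
  ungroup()
probit_fit <- glm(Targeted ~ Book.to.market.lag1 + Capex.to.sales.lag1 + EBITDA.MARGIN.lag1 +
                    Leverage.lag1 + Ln.of.Mv.lag1,
                  family = binomial(link = "probit"),
                  data = df_lagged)
summary(probit_fit)
The first year of each firm is lost to the lag and dropped by glm()'s default na.omit handling, which is the usual cost of lagging panel regressors.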
More details can be found in many online tutorials.

Related

remove duplicate coordinates from X and Y column

Based on the data below, how can I remove the rows with duplicate X and Y coordinates? In the example below, you will notice that one of the X coordinates, -1.52, is repeated twice, but it is not a duplicate since its corresponding Y coordinates are different.
I don't know if it matters, but please note that the original dataset has more than 2 decimal places for the X and Y values.
Sample data:
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), X = c(-1.01,
-1.11, -1.11, -2.13, -2.13, -1.52, -1.52, -1.98, -3.69, -4.79),
Y = c(2.11, 3.33, 3.33, 6.66, 6.66, 7.77, 8.88, 9.99, 1.11,
6.68)), class = "data.frame", row.names = c(NA, -10L))
Desired data:
id X Y
1 -1.01 2.11
2 -1.11 3.33
4 -2.13 6.66
6 -1.52 7.77
7 -1.52 8.88
8 -1.98 9.99
9 -3.69 1.11
10 -4.79 6.68
Use duplicated on the data without the first (id) column, so duplicates are judged on X and Y only:
subset(df1, !duplicated(df1[-1]))
Output:
id X Y
1 1 -1.01 2.11
2 2 -1.11 3.33
4 4 -2.13 6.66
6 6 -1.52 7.77
7 7 -1.52 8.88
8 8 -1.98 9.99
9 9 -3.69 1.11
10 10 -4.79 6.68
Or with distinct
library(dplyr)
df1 %>%
  distinct(X, Y, .keep_all = TRUE)
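Since the original data has more decimal places than shown, duplicated() and distinct() only drop rows whose stored values match exactly. If coordinates that agree to a chosen precision should also count as duplicates, one option (an assumption beyond the question) is to round before the check:
# treat coordinates as duplicates when they agree to 2 decimal places (base R)
subset(df1, !duplicated(round(df1[c("X", "Y")], 2)))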

Calculating cumulated values using grouping

I am trying to calculate cumulated acetone and acetaldehyde emission from different soil incubations across three time points. Emission of the compounds was measured from six soils (of different soil_types) on three days. I wish to calculate the cumulated emission for each soil for each time point.
The end goal is to calculate the average emission from all soils and present a graph similar to this one (except there should be error bars on my graph):
Can anyone spot where I'm going wrong?
Here's the code:
library(tidyverse)
library(plotrix)
df %>%
  group_by(soil, compound, days) %>%
  mutate(cum_emission = cumsum(emission)) %>%
  summarise(mean = mean(cum_emission, na.rm = TRUE),
            sd = sd(cum_emission, na.rm = TRUE),
            se = std.error(cum_emission, na.rm = TRUE))
Here's the data:
df <- structure(list(days = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4,
4, 4, 4, 4, 4, 4, 4, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 4, 4, 4, 4), soil = c(12, 12, 2, 2, 1, 1, 9, 9, 13, 13,
3, 3, 12, 12, 2, 2, 1, 1, 9, 9, 12, 12, 2, 2, 1, 1, 9, 9, 13,
13, 3, 3, 13, 13, 3, 3), soil_type = c("organic", "organic",
"mineral", "mineral", "mineral", "mineral", "organic", "organic",
"organic", "organic", "mineral", "mineral", "organic", "organic",
"mineral", "mineral", "mineral", "mineral", "organic", "organic",
"organic", "organic", "mineral", "mineral", "mineral", "mineral",
"organic", "organic", "organic", "organic", "mineral", "mineral",
"organic", "organic", "mineral", "mineral"), compound = c("Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde",
"Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone",
"Acetaldehyde", "Acetone", "Acetaldehyde", "Acetone", "Acetaldehyde"
), emission = c(0.01, 0, 0.03, 0.03, 0.07, 0.06, 0.33, 0.1, 0.02,
0.01, 0.01, 0, 0.02, 0.01, 0.07, 0.08, 0.09, 0.07, 0.32, 0.22,
0.01, 0, 0.06, 0.06, 0.08, 0.06, 0.23, 0.14, 0.4, 0.04, 0.14,
0, 0.05, 0.05, 0.14, 0)), row.names = c(NA, -36L), class = c("tbl_df",
"tbl", "data.frame"))
This only addresses the setup of the data, not the plotting. (sorry for the partial answer!)
You wrote that you wanted to group by soil, compound, days; did you mean soil_type, compound, days? As @maarvd pointed out, with soil, every row is unique.
When I modified the content to
df %>%
  group_by(soil_type, compound, days) %>%
  mutate(cum_emission = cumsum(emission)) %>%
  summarise(mean = mean(cum_emission, na.rm = TRUE),
            sd = sd(cum_emission, na.rm = TRUE),
            se = std.error(cum_emission, na.rm = TRUE))
I was able to render the following results
# A tibble: 12 x 6
# Groups: soil_type, compound [4]
soil_type compound days mean sd se
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 mineral Acetaldehyde 0 0.0700 0.0346 0.02
2 mineral Acetaldehyde 4 0.127 0.0404 0.0233
3 mineral Acetaldehyde 10 0.10 0.0346 0.02
4 mineral Acetone 0 0.08 0.0436 0.0252
5 mineral Acetone 4 0.177 0.116 0.0669
6 mineral Acetone 10 0.16 0.111 0.0643
7 organic Acetaldehyde 0 0.07 0.0608 0.0351
8 organic Acetaldehyde 4 0.173 0.144 0.0829
9 organic Acetaldehyde 10 0.107 0.0945 0.0546
10 organic Acetone 0 0.237 0.197 0.113
11 organic Acetone 4 0.25 0.201 0.116
12 organic Acetone 10 0.297 0.319 0.184
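As an aside: if the intention is a running total across the three sampling days for each individual soil (rather than a cumulative sum within a single day), one possible reading, purely an assumption about the intended grouping, is to accumulate within soil and compound first and only then average by soil type:
library(tidyverse)
library(plotrix)
df %>%
  group_by(soil, compound) %>%
  arrange(days, .by_group = TRUE) %>%
  mutate(cum_emission = cumsum(emission)) %>%   # running total per soil over time
  group_by(soil_type, compound, days) %>%       # then average across soils of each type
  summarise(mean = mean(cum_emission, na.rm = TRUE),
            sd = sd(cum_emission, na.rm = TRUE),
            se = std.error(cum_emission, na.rm = TRUE),
            .groups = "drop")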
** Changes based on @Tiptop's comment
If you're looking for cumulative, moving averages, how about this?
I'm sure I didn't originally write some of this, but wherever it originated, I've repurposed it many times.
You won't need plotrix, but you will need the tidyquant library.
library(tidyverse)
library(tidyquant)
UDF_roll <- function(x, na.rm = TRUE) {
  m  <- mean(x, na.rm = na.rm)  # calculate the average (for the rolling average)
  s  <- sd(x, na.rm = na.rm)    # calculate the sd to find the confidence interval
  hi <- m + 2 * s               # CI high
  lo <- m - 2 * s               # CI low
  vals <- c(Mean = m, SD = s, HI.95 = hi, LO.95 = lo)
  return(vals)
}
# loop over each type of compound (I'm assuming that the data you provided is a sample and you have more.)
trends <- vector("list")   # empty list to store the results
cp <- unique(df$compound)  # create a vector of unique compound names
for (i in 1:length(cp)) {  # loop through each compound
  trends[[i]] <- df %>% as.data.frame() %>%  # add results to the list
    filter(compound == cp[i]) %>%            # for one compound
    arrange(days) %>%
    # the rolling functions require a time series with a date, so random dates are added as a controller
    mutate(time = seq(as.Date("2010/1/1"),
                      by = "month",
                      length.out = nrow(.)),
           cum_emission = cumsum(emission)) %>%
    arrange(compound, -days) %>%       # most recent on top for TS
    tq_mutate(select = cum_emission,   # collect mean, sd, error
              mutate_fun = rollapply,
              width = 2,               # 2: current & previous reading
              align = "right",
              by.column = FALSE,
              FUN = UDF_roll,          # calls the function UDF_roll
              na.rm = TRUE) %>%
    ggplot(aes(x = seq_along(time))) +
    geom_point(aes(y = cum_emission),
               color = "black", alpha = 0.2) +    # cumulative
    geom_ribbon(aes(ymin = LO.95, ymax = HI.95),
                fill = "azure3", alpha = 0.4) +   # confidence interval
    geom_jitter(aes(y = Mean, color = Mean),
                size = 1, alpha = 0.9) +          # rolling average
    labs(title = paste0(cp[[i]], ": Trends and Volatility\nIncremental Moving Average with 95% CI Bands (+/-2 SD)"),
         x = "", y = "Soil Emissions") +
    scale_color_viridis_c(end = .8) + theme_bw() +
    theme(legend.position = "none")
}
trends[[1]]
trends[[2]]
trends[[1]]$data # you can NULL the time column if you use the data another way
This makes the data a time series; trends[[1]] and trends[[2]] above produce the plots for each compound.
The data is shown below. If you wanted to group it differently, you'll have to add the argument .groups = "drop" to the summarise() call, or you won't be able to get it through tq_mutate.
# A tibble: 18 x 11
days soil soil_type compound emission time cum_emission Mean SD HI.95 LO.95
<dbl> <dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 12 organic Acetone 0.01 2010-01-01 0.01 NA NA NA NA
2 0 2 mineral Acetone 0.03 2010-02-01 0.04 0.025 0.0212 0.0674 -0.0174
3 0 1 mineral Acetone 0.07 2010-03-01 0.11 0.075 0.0495 0.174 -0.0240
4 0 9 organic Acetone 0.33 2010-04-01 0.44 0.275 0.233 0.742 -0.192
5 0 13 organic Acetone 0.02 2010-05-01 0.46 0.45 0.0141 0.478 0.422
6 0 3 mineral Acetone 0.01 2010-06-01 0.47 0.465 0.00707 0.479 0.451
7 4 12 organic Acetone 0.02 2010-07-01 0.49 0.48 0.0141 0.508 0.452
8 4 2 mineral Acetone 0.07 2010-08-01 0.56 0.525 0.0495 0.624 0.426
9 4 1 mineral Acetone 0.09 2010-09-01 0.65 0.605 0.0636 0.732 0.478
10 4 9 organic Acetone 0.32 2010-10-01 0.97 0.81 0.226 1.26 0.357
11 4 13 organic Acetone 0.05 2010-11-01 1.02 0.995 0.0354 1.07 0.924
12 4 3 mineral Acetone 0.14 2010-12-01 1.16 1.09 0.0990 1.29 0.892
13 10 12 organic Acetone 0.01 2011-01-01 1.17 1.16 0.00707 1.18 1.15
14 10 2 mineral Acetone 0.06 2011-02-01 1.23 1.2 0.0424 1.28 1.12
15 10 1 mineral Acetone 0.08 2011-03-01 1.31 1.27 0.0566 1.38 1.16
16 10 9 organic Acetone 0.23 2011-04-01 1.54 1.42 0.163 1.75 1.10
17 10 13 organic Acetone 0.4 2011-05-01 1.94 1.74 0.283 2.31 1.17
18 10 3 mineral Acetone 0.14 2011-06-01 2.08 2.01 0.0990 2.21 1.81
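On the .groups = "drop" remark above: if you regroup the data with summarise() before feeding it into tq_mutate(), a minimal illustration of adding that argument (using the earlier grouping, just as an example) would be:
library(tidyverse)
df %>%
  group_by(soil_type, compound, days) %>%
  summarise(mean_emission = mean(emission, na.rm = TRUE),
            .groups = "drop")  # drop the grouping so downstream functions see a plain tibble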

How to group sales data using R?

I am trying to find a way to match new products with the products for which I have historical data. Then I will use the historical data from the previous years' products to make some predictions for the new products.
Please consider the following subset of the data:
# A tibble: 13 x 11
prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January February March April sales_total
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.00 WUW SW BH B21 2017 2.00 10.0 5.00 4.00 21.0
2 2.00 WUW SW BK R21 2017 7.00 9.00 4.00 5.00 25.0
3 3.00 MUW NW UW P1 2018 6.00 8.00 10.0 6.00 32.0
4 4.00 LNG KW LW L1 2016 8.00 9.00 12.0 7.00 36.0
5 5.00 QKQ MZ KA AQ 2013 10.0 8.67 16.7 8.00 43.3
6 6.00 MUW NW UW P1 2019 0 0 0 0 0
7 7.00 WUW SW BK R21 2019 0 0 0 0 0
8 8.00 LNG NW UW P2 2014 15.1 8.67 28.7 11.0 63.4
9 9.00 QKQ KW LW L2 2016 16.8 8.67 32.7 12.0 70.1
10 10.0 WUW MZ KA AQ 2017 18.5 8.67 36.7 13.0 76.8
11 11.0 QKQ MZ KA AQ 2019 0 0 0 0 0
12 12.0 WUW MZ KA AQ 2019 0 0 0 0 0
13 13.0 MUW NW UW P1 2019 0 0 0 0 0
prdct_grp stands for a product group (for example, prdct_grp_1 = WUW means the product is in "women underwear", prdct_grp_2 = SW specifies that it is in the "swimwear" group, and so on). If products share the same prdct_grp_1 through prdct_grp_4, then I assume they will have very similar sales figures.
I would like to have the following outcome:
# A tibble: 3 x 11
new_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_s January February March April sales_total
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 6~3 MUW NW UW P1 2019 6.00 8.00 10.0 6.00 32.0
2 7~2 WUW SW BK R21 2019 7.00 9.00 4.00 5.00 25.0
3 11~5 QKQ MZ KA AQ 2019 10.0 9.00 17.0 8.00 43.0
I used tidyverse to try to produce the outcome I wanted, but the result was not very good.
Another problem is when a product matches more than one product, or matches a product whose start season is also 2019. How could I handle this?
Thank you for your help.
Best
A
Below is a possible dplyr solution along with detailed comments. Please always make sure that your problem is reproducible by providing dput() output or at least a code snippet for creating your dataset.
# import required package
library(dplyr)
# reproduce your data frame (or at least something similar to it)
# please give more details next time
prdct_df <- data_frame(
  prdct_id = 1:13,
  prdct_grp_1 = c("WUW", "WUW", "MUW", "LNG", "QKQ", "MUW", "WUW", "LNG", "QKQ", "WUW", "QKQ", "WUW", "MUW"),
  prdct_grp_2 = c("SW", "SW", "NW", "KW", "MZ", "NW", "SW", "NW", "KW", "MZ", "MZ", "MZ", "NW"),
  prdct_grp_3 = c("BH", "BK", "UW", "LW", "KA", "UW", "BK", "UW", "LW", "KA", "KA", "KA", "UW"),
  prdct_grp_4 = c("B21", "R21", "P1", "L1", "AQ", "P1", "R21", "P2", "L2", "AQ", "AQ", "AQ", "P1"),
  Start_season = c(2017, 2017, 2018, 2016, 2013, 2019, 2019, 2014, 2016, 2017, 2019, 2019, 2019),
  January = c(2, 7, 6, 8, 10, 0, 0, 15.1, 16.8, 18.5, 0, 0, 0),
  February = c(10, 9, 8, 9, 8.67, 0, 0, 8.86, 8.67, 8.67, 0, 0, 0),
  March = c(4, 5, 10, 12, 16.7, 0, 0, 28.7, 32.7, 36.7, 0, 0, 0),
  April = c(4, 5, 6, 7, 8, 0, 0, 11, 12, 13, 0, 0, 0),
  sales_total = c(21, 25, 32, 36, 43.3, 0, 0, 63.4, 70.1, 76.8, 0, 0, 0)
)
# define the new seasons in case you have additional seasons in the future
new_prdct_seasons <- 2019 # with new seasons: c(2019, 2020, 2021) and so on
# keep the historical and new data separate (optional but clean)
# filter your data to separate new products
new_prdct_df <- prdct_df %>%
  filter(Start_season %in% new_prdct_seasons)
# filter your data to separate old products
old_prdct_df <- prdct_df %>%
  filter(!(Start_season %in% new_prdct_seasons))
# match the new and old products to get the data frame you want
final_df <- old_prdct_df %>%
  inner_join(
    # only the first 6 columns are needed from the new product data frame
    new_prdct_df[1:6],
    # inner join by product group features
    by = c("prdct_grp_1", "prdct_grp_2", "prdct_grp_3", "prdct_grp_4")
  ) %>%
  # reorder the columns and change their names when necessary
  select(
    new_prdct_id = 12,
    old_prdct_id = 1,
    2:5,
    Start_season = 13,
    7:11
  )
# we obtained the data frame you asked for
# note that we avoided matches among new products by keeping new and old products in distinct data frames
final_df
# # A tibble: 5 x 12
# new_prdct_id old_prdct_id prdct_grp_1 prdct_grp_2 prdct_grp_3 prdct_grp_4 Start_season January
# <int> <int> <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 7 2 WUW SW BK R21 2019 7
# 2 6 3 MUW NW UW P1 2019 6
# 3 13 3 MUW NW UW P1 2019 6
# 4 11 5 QKQ MZ KA AQ 2019 10
# 5 12 10 WUW MZ KA AQ 2019 18.5
# # ... with 4 more variables: February <dbl>, March <dbl>, April <dbl>, sales_total <dbl>
# you can also exclude matches with more than one old product if needed
final_df[-3, ] # this removes the match 13-3 as there is already 6-3
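If you would rather drop the extra matches programmatically instead of by row index, one possibility (an assumption on my part that the old product with the larger sales_total should win when a new product matches several old ones) is:
library(dplyr)
# keep, for each new product, only the matching old product with the largest sales_total
final_df %>%
  group_by(new_prdct_id) %>%
  slice_max(sales_total, n = 1, with_ties = FALSE) %>%
  ungroup()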

Combining datapoints using an index dataframe in R

I have two dataframes, and I'd like to use one as reference for combining observations in the other one.
First, I have data:
> data
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 72105922753 1 3 27.10816 163 2008 4418.6306
7 72105922765 2 2 24.79000 3 2010 74.3700
8 72105922765 2 2 25.99000 1 2009 25.9900
9 72105922765 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Then, I have z, which is the reference:
> z
upc code
3 1153801013 52161
1932 72105922753 52161
1934 72105922765 52161
2027 81153801013 52161
2033 81153801041 52161
2 1071917100 50174
1256 8723610700 50174
I want to combine datapoints in data whose upc is the same in z.
In the sample I gave to you, there are 7 different UPC's.
1071917100 is also in z, with the code 50174. However, the only other upc with this code is 8723610700, which is not in data. Therefore, it remains unchanged.
1461503541, 11111111, and 11111112 are not in z at all, so therefore they also remains unchanged.
1153801013, 72105922753, and 72105922765 all share the same code in z, 52161. Therefore, I want to combine all the observations with these upc's.
I want to do this in a really specific way:
First, I want to choose the UPC with the greatest amount of sales across the data. 1153801013 has 9668.616 in sales (simply the sum of all sales with that upc). 72105922753 has 4418.631 in sales. 72105922765 has 406.9118 in sales. Therefore, I choose 1153801013 as the upc for all of them.
Now having chosen this upc, I want to change 72105922753 and 72105922765 to 1153801013 in data.
Now we have a dataset that looks like this:
> data1
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.8563
2 1153801013 1 2 28.61981 108 2009 3090.9396
3 1153801013 2 2 27.99000 7 2009 195.9300
4 1153801013 1 1 27.99000 4 2009 111.9600
5 1153801013 1 3 27.99000 7 2008 195.9300
6 1153801013 1 3 27.10816 163 2008 4418.6306
7 1153801013 2 2 24.79000 3 2010 74.3700
8 1153801013 2 2 25.99000 1 2009 25.9900
9 1153801013 1 2 23.58091 13 2009 306.5518
10 1071917100 2 2 300.07000 1 2009 300.0700
11 1071917100 1 3 307.07000 2 2008 614.1400
12 1071917100 2 3 269.99000 1 2010 269.9900
13 1461503541 2 2 0.65200 8 2008 5.2160
14 1461503541 2 2 13.99000 11 2010 153.8900
15 1461503541 1 1 0.87000 1 2008 0.8700
16 11111111 1 1 3.00000 2 2008 6.0000
17 11111112 1 1 6.00000 5 2008 30.0000
Finally, I want to combine all the datapoints with the same year, mymonth, and fips_state_code. The way this will happen is by adding up the sales and unit numbers of datapoints with the same upc, fips_state_code, mymonth, and year, and then recalculating the weighted price. (I.e., price = total Sales / total Units.)
And so, the final data set should look like this:
> data2
upc fips_state_code mymonth price units year sales
1 1153801013 2 3 25.84620 235 2008 6073.856
2 1153801013 1 2 28.07844 121 2009 3397.491
3 1153801013 2 2 27.74000 8 2009 221.920
4 1153801013 1 1 27.99000 4 2009 111.960
5 1153801013 1 3 27.14448 170 2008 4614.561
6 1153801013 2 2 24.79000 3 2010 74.370
7 1071917100 2 2 300.07000 1 2009 300.070
8 1071917100 1 3 307.07000 2 2008 614.140
9 1071917100 2 3 269.99000 1 2010 269.990
10 1461503541 2 2 0.65200 8 2008 5.216
11 1461503541 2 2 13.99000 11 2010 153.890
12 1461503541 1 1 0.87000 1 2008 0.870
13 11111111 1 1 3.00000 2 2008 6.000
14 11111112 1 1 6.00000 5 2008 30.000
I did try to do this myself, but it seems like it could be done more efficiently with dplyr than with my code, and I couldn't accomplish the last part successfully. Please let me know if anything is unclear, and thank you very much in advance.
Here is the dput code:
data<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 72105922753, 72105922765, 72105922765, 72105922765,
1071917100, 1071917100, 1071917100, 1461503541, 1461503541, 1461503541,
11111111, 11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2,
2, 1, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3,
2, 2, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
z<-structure(list(upc = c(1153801013, 72105922753, 72105922765,
81153801013, 81153801041, 1071917100, 8723610700), code = c(52161L,
52161L, 52161L, 52161L, 52161L, 50174L, 50174L)), .Names = c("upc",
"code"), row.names = c(3L, 1932L, 1934L, 2027L, 2033L, 2L, 1256L
), class = "data.frame")
data1<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1153801013, 1153801013, 1153801013, 1071917100,
1071917100, 1071917100, 1461503541, 1461503541, 1461503541, 11111111,
11111112), fips_state_code = c(2, 1, 2, 1, 1, 1, 2, 2, 1, 2,
1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2, 1, 3, 3, 2, 2, 2,
2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.6198113208,
27.99, 27.99, 27.99, 27.1081632653, 24.79, 25.99, 23.5809090909,
300.07, 307.07, 269.99, 0.652, 13.99, 0.87, 3, 6), units = c(235,
108, 7, 4, 7, 163, 3, 1, 13, 1, 2, 1, 8, 11, 1, 2, 5), year = c(2008,
2009, 2009, 2009, 2008, 2008, 2010, 2009, 2009, 2009, 2008, 2010,
2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285, 3090.9396226464,
195.93, 111.96, 195.93, 4418.6306122439, 74.37, 25.99, 306.5518181817,
300.07, 614.14, 269.99, 5.216, 153.89, 0.87, 6, 30)), .Names = c("upc",
"fips_state_code", "mymonth", "price", "units", "year", "sales"
), row.names = c(NA, 17L), class = c("tbl_df", "data.frame"))
data2<-structure(list(upc = c(1153801013, 1153801013, 1153801013, 1153801013,
1153801013, 1153801013, 1071917100, 1071917100, 1071917100, 1461503541,
1461503541, 1461503541, 11111111, 11111112), fips_state_code = c(2,
1, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1), mymonth = c(3, 2, 2,
1, 3, 2, 2, 3, 3, 2, 2, 1, 1, 1), price = c(25.8461971831, 28.07844,
27.74, 27.99, 27.14448, 24.79, 300.07, 307.07, 269.99, 0.652,
13.99, 0.87, 3, 6), units = c(235, 121, 8, 4, 170, 3, 1, 2, 1,
8, 11, 1, 2, 5), year = c(2008, 2009, 2009, 2009, 2008, 2010,
2009, 2008, 2010, 2008, 2010, 2008, 2008, 2008), sales = c(6073.8563380285,
3397.491, 221.92, 111.96, 4614.561, 74.37, 300.07, 614.14, 269.99,
5.216, 153.89, 0.87, 6, 30)), .Names = c("upc", "fips_state_code",
"mymonth", "price", "units", "year", "sales"), row.names = c(NA,
14L), class = c("tbl_df", "data.frame"))
This is what I have attempted so far:
w <- z[match(unique(z$code), z$code),]
w <- plyr::rename(w,c("upc"="upc1"))
data <- merge(x=data,y=z,by="upc",all.x=T,all.y=F)
data <- merge(x=data,y=w,by="code",all.x=T,all.y=F)
data <- within(data, upc2 <- ifelse(!is.na(upc1),upc1,upc))
data$upc <- data$upc2
data$upc1 <- data$upc2 <- data$code <- NULL
data <- data[complete.cases(data),]
attach(data)
data <- aggregate(data,by=list(upc,fips_state_code,year,mymonth),FUN=sum)
data$price <- data$sales / data$units
detach(data)
data$Group.1 <- data$Group.2 <- data$Group.3 <- data$Group.4 <- NULL
I can't figure out how to make the upc chosen be the one with the most sales. It would also be great if there were a way to do this in fewer lines of code and more elegantly.
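For reference, a possible dplyr sketch of the steps described above (pick, within each code in z, the upc with the largest total sales; remap the other upcs to it; then aggregate and recompute the weighted price). It is only lightly checked against the sample data, and the column order will differ slightly from data2:
library(dplyr)
# 1. within each code in z, find the upc with the largest total sales in data
best_upc <- data %>%
  inner_join(z, by = "upc") %>%
  group_by(code, upc) %>%
  summarise(total_sales = sum(sales), .groups = "drop_last") %>%
  slice_max(total_sales, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(code, new_upc = upc)
# 2. remap every upc that has a code to that code's best-selling upc
# 3. aggregate by upc, state, month, year and recompute the weighted price
data %>%
  left_join(z, by = "upc") %>%
  left_join(best_upc, by = "code") %>%
  mutate(upc = coalesce(new_upc, upc)) %>%
  group_by(upc, fips_state_code, mymonth, year) %>%
  summarise(units = sum(units),
            sales = sum(sales),
            price = sales / units,
            .groups = "drop")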

Strip leading zero from numeric vector without changing class

I have the following data, which is a few Major League Baseball statistics.
Year AVG SLG TB OBP IsoPow RC
1 1986 0.223 0.300 172 0.330 0.194 64.1
2 1987 0.261 0.356 271 0.329 0.230 92.8
3 1988 0.283 0.357 264 0.368 0.208 100.0
4 1989 0.248 0.328 247 0.351 0.178 91.9
5 1990 0.301 0.374 293 0.406 0.264 128.0
6 1991 0.292 0.367 262 0.410 0.222 118.2
Usually, percentage-type MLB statistics are displayed as a decimal, but with the leading zero removed. I'd like to do the same, but also preserve the class of the variable, which in this case is numeric.
For example, with bonds$AVG I'd like the result to be a numeric vector that looks exactly like
[1] .223 .261 .283 .248 .301 .292
Using sub, the vector goes from numeric to character, then back to its original numeric state after wrapping it with as.numeric.
> sub(0, "", bonds$AVG)
# [1] ".223" ".261" ".283" ".248" ".301" ".292"
> as.numeric(sub(0, "", bonds$AVG))
# [1] 0.223 0.261 0.283 0.248 0.301 0.292
Is this possible in R?
bonds <-
structure(list(Year = c(1986, 1987, 1988, 1989, 1990, 1991),
AVG = c(0.223, 0.261, 0.283, 0.248, 0.301, 0.292), SLG = c(0.3,
0.356, 0.357, 0.328, 0.374, 0.367), TB = c(172, 271, 264,
247, 293, 262), OBP = c(0.33, 0.329, 0.368, 0.351, 0.406,
0.41), IsoPow = c(0.194, 0.23, 0.208, 0.178, 0.264, 0.222
), RC = c(64.1, 92.8, 100, 91.9, 128, 118.2)), .Names = c("Year",
"AVG", "SLG", "TB", "OBP", "IsoPow", "RC"), row.names = c(NA,
6L), class = "data.frame")
Perhaps you could generalize the following by modifying print.data.frame?
f1 <- function(x) noquote(sub(0, "", x))
f1(bonds$AVG)
[1] .223 .261 .283 .248 .301 .292
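A numeric vector has no notion of a leading zero; that only exists in the printed representation, so any solution has to act at print or format time. As an alternative sketch (not from the original answer), formatting to a fixed number of decimals before stripping the zero also handles negative values:
strip_zero <- function(x, digits = 3) {
  # fixed-decimal formatting, then drop the leading "0" before the decimal point
  noquote(sub("^(-?)0\\.", "\\1.", formatC(x, format = "f", digits = digits)))
}
strip_zero(bonds$AVG)
# [1] .223 .261 .283 .248 .301 .292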
