Tips on differencing values in R data frame by group [duplicate] - r

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Beginner tips on using plyr to calculate year-over-year change across groups
What is a good way to calcualte a year-on-year difference (new variable) for an existing data frame variable (i.e. sales) across multiple variable groups (i.e. Region and Food)?
Below is a example of the data frame structure:
Date Region Type Sales
1/1/2001 East Food 120
1/1/2001 West Housing 130
1/1/2001 North Food 130
1/2/2001 East Food 133
1/3/2001 West Housing 140
1/4/2001 North Food 150
….
….
1/29/2013 East Food 125
1/29/2013 West Housing 137
1/29/2013 North Food 1350
Also, in addition to differening the data, I would like to calcuate a a trailing (say 7 day) moving average.
Any guidance would be greatly appreciated.

Here is something to get you started. data.table is a great package for this sort of things as it provides a concise and easy-to-use syntax (once you are past the learning curve) for these kinds of things.
library(data.table)
Create a reproducible example
set.seed(128)
regions = c("East", "West", "North", "South")
types = c("Food", "Housing")
dates <- seq(as.Date('2009-01-01'), as.Date('2011-12-31'), by = 1)
n <- length(dates)
dt <- data.table(Date = dates,
Region = sample(regions, n, replace = TRUE),
Type = sample(types, n, replace = TRUE),
Sales = round(rnorm(n, mean = 100, sd = 10)))
Add Year column
dt[, Year := year(Date)]
> dt
Date Region Type Sales Year
1: 2009-01-01 West Food 119 2009
2: 2009-01-02 North Housing 102 2009
3: 2009-01-03 North Housing 102 2009
4: 2009-01-04 North Food 101 2009
5: 2009-01-05 West Food 101 2009
---
1091: 2011-12-27 East Housing 122 2011
1092: 2011-12-28 East Housing 88 2011
1093: 2011-12-29 North Food 115 2011
1094: 2011-12-30 West Housing 96 2011
1095: 2011-12-31 East Food 101 2011
Calculate summary by year
summary <- dt[, list(Sales = sum(Sales)), by = 'Year,Region,Type']
setkey(summary, 'Year')
> head(summary)
Year Region Type Sales
1: 2009 West Food 4791
2: 2009 North Housing 3517
3: 2009 North Food 6774
4: 2009 South Housing 4380
5: 2009 East Food 4144
6: 2009 West Housing 4275
Function to create year-on-year diffs for each region/product combo.
YoYdiff <- function(dt) {
# Calculate year-on-year difference for Sales column
data.table(Sales.Diff = diff(dt$Sales), Year = dt$Year[-1])
}
Calculate year-on-year difference by column. This works for my example as setkey(dt, Year) sorts the data table by Year, but if your example misses some years for some products/regions you have to be more careful.
> summary[, YoYdiff(.SD), by = 'Region,Type']
Region Type Sales.Diff Year
1: West Food -412 2010
2: West Food 121 2011
3: North Housing 1907 2010
4: North Housing -1457 2011
5: North Food -3087 2010
6: North Food 369 2011
7: South Housing -539 2010
8: South Housing 575 2011
9: East Food 1264 2010
10: East Food -1732 2011
11: West Housing 298 2010
12: West Housing -410 2011
13: South Food -889 2010
14: South Food 1045 2011
15: East Housing 1146 2010
16: East Housing 1169 2011

Related

Find the nth largest value based on criteria [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
This is the basically same problem I had in Excel a few days ago (Excel - find nth largest value based on criteria), but this time in R (the data set contains half a million entries and that is more than Excel seems to be able to handle).
I have a table that looks like this that I have imported from Excel:
Country Region Code Product name Year Value
Sweden Stockholm 123 Apple 1991 244
Sweden Kirruna 123 Apple 1987 100
Japan Kyoto 543 Pie 1987 544
Denmark Copenhagen 123 Apple 1998 787
Denmark Copenhagen 123 Apple 1987 100
Denmark Copenhagen 543 Pie 1991 320
Denmark Copenhagen 126 Candy 1999 200
Sweden Gothenburg 126 Candy 2013 300
Sweden Gothenburg 157 Tomato 1987 150
Sweden Stockholm 125 Juice 1987 250
Sweden Kirruna 187 Banana 1998 310
Japan Kyoto 198 Ham 1987 157
Japan Kyoto 125 Juice 1987 550
Japan Tokyo 125 Juice 1991 100
What I want to do is to make a code that can give me the sum of the nth largest value of products that have been sold in a specific country. For instance, the most sold product in Sweden is Apple so I want to code to find that apple is the most sold product (in total, which is what I am interested in) and then summaries all the values of the sold apples in the country Sweden, 344.
I also want to be able to find the nth largest value based on both country and year. That is, if I am looking for the most sold product in Sweden in the year 2013, it should return the product Candy and the value 300.
Solution for your first question (find most sold product per country, summarise value for this product) using dplyr:
library(tidyverse)
df %>%
group_by(Country, Product_name) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country) %>%
filter(sum_value == max(sum_value))
# A tibble: 3 x 3
# Groups: Country [3]
Country Product_name sum_value
<fctr> <fctr> <int>
1 Denmark Apple 887
2 Japan Juice 650
3 Sweden Apple 344
Solution for second question (show nth most sold products per country and year, sum value):
df %>%
group_by(Country, Product_name, Year) %>%
summarise(sum_value = sum(Value, na.rm = TRUE)) %>%
ungroup() %>%
group_by(Country, Year) %>%
arrange(desc(sum_value), .by_group = TRUE) %>%
slice(., 1:2)
Had to change the data a bit to get a decent output, so here's the output with all years set to 1987 (change the 2 in the 1:2 within the last row for a different n):
# A tibble: 6 x 4
# Groups: Country, Year [3]
Country Product_name Year sum_value
<fctr> <fctr> <int> <int>
1 Denmark Apple 1987 887
2 Denmark Pie 1987 320
3 Japan Juice 1987 650
4 Japan Pie 1987 544
5 Sweden Apple 1987 344
6 Sweden Banana 1987 310

Merging data by 2 variables in R

I am attempting to merge two data sets. In the past I have used merge() with by equal to the variable I want to merge by. However, now I would like to do so with two variables. My first data set looks something like this:
Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South Carolina
2013 Tennessee Texas
Then I have another data set with a rank of each team (this is very simplified) for each year. Like this:
Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51
I would like to merge them so I have a data set that looks like this:
Year Winning_Tm Winning_TM_rank Losing_Tm Losing_Tm_rank
2011 Texas 32 Washington 34
2012 Alabama 12 South Carolina 45
2013 Tennessee 51 Texas 6
My hope is that there is a simple way to do this but it may be more complicated. Thanks!
I reproduced your data (try to include a dput of it next time):
A <- data.frame(
Year = c(2011, 2012, 2013),
Winning_Tm = c("Texas","Alabama","Tennessee"),
Losing_Tm = c("Washington","South Carolina", "Texas"),
stringsAsFactors = FALSE
)
B <- data.frame(
Year = c("2011","2011","2012","2012","2013","2013"),
Team = c("Texas","Washington","South Carolina","Alabama","Texas","Tennessee"),
Rank = c(32,34,45,12,6,51),
stringsAsFactors = FALSE
)
You can melt the first dataframe using the reshape2 package:
library(reshape2)
A <- melt(A, id.vars = "Year")
names(A)[3] <- "Team"
Now it looks like this:
> A
Year variable Team
1 2011 Winning_Tm Texas
2 2012 Winning_Tm Alabama
3 2013 Winning_Tm Tennessee
4 2011 Losing_Tm Washington
5 2012 Losing_Tm South Carolina
6 2013 Losing_Tm Texas
You can then merge the datasets together by the two columns of interest:
AB <- merge(A, B, by=c("Year","Team"))
Which looks like this:
> AB
Year Team variable Rank
1 2011 Texas Winning_Tm 32
2 2011 Washington Losing_Tm 34
3 2012 Alabama Winning_Tm 12
4 2012 South Carolina Losing_Tm 45
5 2013 Tennessee Winning_Tm 51
6 2013 Texas Losing_Tm 6
Then using the reshape command from base R you can change AB to a wide format:
reshape(AB, idvar = "Year", timevar = "variable", direction = "wide")
The result:
Year Team.Winning_Tm Rank.Winning_Tm Team.Losing_Tm Rank.Losing_Tm
1 2011 Texas 32 Washington 34
3 2012 Alabama 12 South Carolina 45
5 2013 Tennessee 51 Texas 6
Two separate merges. You would need to wrap your list of by variables in c(), and since the variables have different names, you need by.x and by.y. Afterward you could rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
If you are familiar with SQL a rather complicated, but fast way to do this all in one step would be:
res <- sqldf("SELECT l.*,
max(case when l.Winning_Tm = r.Team then r.Rank else 0 end) as Winning_Tm_rank,
max(case when l.Losing_Tm = r.Team then r.Rank else 0 end) as Losing_Tm_rank
FROM df1 as l
inner join df2 as r
on (l.Winning_Tm = r.Team
OR l.Losing_Tm = r.Team)
AND l.Year = r.Year
group by l.Year, l.Winning_Tm, l.Losing_Tm")
res
Year Winning_Tm Losing_Tm Winning_Tm_rank Losing_Tm_rank
1 2011 Texas Washington 32 34
2 2012 Alabama South_Carolina 12 45
3 2013 Tennessee Texas 51 6
Data:
df1 <- read.table(header=T,text="Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South_Carolina
2013 Tennessee Texas")
df2<- read.table(header=T,text="Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South_Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51")

How to do Group By Rollup in R? (Like SQL)

I have a dataset and I want to perform something like Group By Rollup like we have in SQL for aggregate values.
Below is a reproducible example. I know aggregate works really well as explained here but not a satisfactory fit for my case.
year<- c('2016','2016','2016','2016','2017','2017','2017','2017')
month<- c('1','1','1','1','2','2','2','2')
region<- c('east','west','east','west','east','west','east','west')
sales<- c(100,200,300,400,200,400,600,800)
df<- data.frame(year,month,region,sales)
df
year month region sales
1 2016 1 east 100
2 2016 1 west 200
3 2016 1 east 300
4 2016 1 west 400
5 2017 2 east 200
6 2017 2 west 400
7 2017 2 east 600
8 2017 2 west 800
now what I want to do is aggregation (sum- by year-month-region) and add the new aggregate row in the existing dataframe
e.g. there should be two additional rows like below with a new name for region as 'USA' for the aggreagted rows
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 USA 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 USA 2000
I have figured out a way (below) but I am very sure that there exists an optimum solution for this OR a better workaround than mine
df1<- setNames(aggregate(df$sales, by=list(df$year,df$month, df$region), FUN=sum),
c('year','month','region', 'sales'))
df2<- setNames(aggregate(df$sales, by=list(df$year,df$month), FUN=sum),
c('year','month', 'sales'))
df2$region<- 'USA' ## added a new column- region- for total USA
df2<- df2[, c('year','month','region', 'sales')] ## reordering the columns of df2
df3<- rbind(df1,df2)
df3<- df3[order(df3$year,df3$month,df3$region),] ## order by
rownames(df3)<- NULL ## renumbered the rows after order by
df3
Thanks for the support!
melt/dcast in the reshape2 package can do subtotalling. After running dcast we replace "(all)" in the month column with the month using na.locf from the zoo package:
library(reshape2)
library(zoo)
m <- melt(df, measure.vars = "sales")
dout <- dcast(m, year + month + region ~ variable, fun.aggregate = sum, margins = "month")
dout$month <- na.locf(replace(dout$month, dout$month == "(all)", NA))
giving:
> dout
year month region sales
1 2016 1 east 400
2 2016 1 west 600
3 2016 1 (all) 1000
4 2017 2 east 800
5 2017 2 west 1200
6 2017 2 (all) 2000
In recent devel data.table 1.10.5 you can use new feature called "grouping sets" to produce sub totals:
library(data.table)
setDT(df)
res = groupingsets(df, .(sales=sum(sales)), sets=list(c("year","month"), c("year","month","region")), by=c("year","month","region"))
setorder(res, na.last=TRUE)
res
# year month region sales
#1: 2016 1 east 400
#2: 2016 1 west 600
#3: 2016 1 NA 1000
#4: 2017 2 east 800
#5: 2017 2 west 1200
#6: 2017 2 NA 2000
You can substitute NA to USA using res[is.na(region), region := "USA"].
plyr::ddply(df, c("year", "month", "region"), plyr::summarise, sales = sum(sales))

create random subsets in R without duplicates

my task is to divide a dataset of 32 rows into 8 groups without having duplicated entries.
i am trying to do this with a loop and by creating a new dataset after each cycle.
the data:
year pos country elo fifa cont hcountry hcont
1 2010 FRA 1851 1044 Europe RSA Africa
2 2010 MEX 1872 895 South America RSA Africa
3 2010 URU 1819 899 South America RSA Africa
4 2010 RSA 1569 392 Africa RSA Africa
5 2010 GRE 1726 964 Europe RSA Africa
6 2010 KOR 1766 632 Asia RSA Africa
8 2010 ARG 1899 1076 South America RSA Africa
9 2010 USA 1749 957 North America RSA Africa
10 2010 SVN 1648 860 Europe RSA Africa
11 2010 ALG 1531 821 Africa RSA Africa
...
my solution so far:
for (i in 1:8){
assign(paste("group", i, sep = ""), droplevels(subset(wc2010[sample(nrow(wc2010), 4),])))
wc2010 <- subset(wc2010, !(country %in% group[i]$country))
}
problem is of course: i don't know how to use the loop-variable.... :-(
help would be deeply appreciated!
thanks
Bob
Here is one way to create a random partition:
random.groups <- function(n.items = 32L, n.groups = 8L)
1L + (sample.int(n.items) %% n.groups)
So then you just have to do:
wc2010$group <- random.groups(nrow(wc2010), n.groups = 8L)
Then you might also be interested in doing
groups <- split(wc2010, wc2010$group)
Edit: this was not asked by the OP, but I realize that soccer draws for big tournaments usually involves hats: before the draw, teams are grouped by regions and/or rankings. Then groups are formed by randomly picking one team from each hat, so that two teams from a same hat cannot end up in the same group.
Here is a modification to my function so it can also take hats as an input:
random.groups <- function(n.items = 32L, n.groups = 8L,
hats = rep(1L, n.items)) {
splitted.items <- split(seq.int(n.items), hats)
shuffled <- lapply(splitted.items, sample)
1L + (order(unlist(shuffled)) %% n.groups)
}
Here is an example, where say, the first 8 teams are in hat #1, the next 8 teams are in hat #2, etc.:
# set.seed(123)
random.groups(32, 8, c(rep(1, 8), rep(2, 8), rep(3, 8), rep(4, 8)))
# [1] 7 8 2 6 5 3 1 4 8 7 5 3 2 4 1 6 3 2 7 6 5 8 1 4 7 6 5 4 3 2 1 8

Calculate Concentration Index by Region and Year (panel data)

This is my first post and very stuck on trying to build my first function that calculates Herfindahl measures on Firm gross output, using panel data (year=1998:2007) with firms = obs. by year (1998-2007) and region ("West","Central","East","NE") and am having problems with passing arguments through the function. I think I need to use two loops (one for time and one for region). Any help would be useful.. I really dont want to have to subset my data 400+ times to get herfindahl measures one at a time. Thanks in advance!
Below I provide: 1) My starter code (only returns one value); 2) desired output (2-bins that contain the hefindahl measures by 1) year and by 2) year-region); and 3) original data
1) My starter Code
myherf<- function (x, time, region){
time = year # variable is defined in my data and includes c(1998:2007)
region = region # Variable is defined in my data, c("West", "Central","East","NE")
for (i in 1:length(time)) {
for (j in 1:length(region)) {
herf[i,j] <- x/sum(x)
herf[i,j] <- herf[i,j]^2
herf[i,j] <- sum(herf[i,j])^1/2
}
}
return(herf[i,j])
}
myherf(extractiveoutput$x, i, j)
Error in herf[i, j] <- x/sum(x) : object 'herf' not found
2) My desired outcome is the following two vectors:
A. (1x10 vector)
Year herfindahl(yr)
1998 x
1999 x
...
2007 x
B. (1x40 vector)
Year Region hefindahl(yr-region)
1998 West x
1998 Central x
1998 East x
1998 NE x
...
2007 West x
2007 Central x
2007 East x
2007 northeast x
3) Original Data
Obs. industry year region grossoutput
1 06 1998 Central 0.048804830
2 07 1998 Central 0.011222478
3 08 1998 Central 0.002851575
4 09 1998 Central 0.009515881
5 10 1998 Central 0.0067931
...
12 06 1999 Central 0.050861447
13 07 1999 Central 0.008421093
14 08 1999 Central 0.002034649
15 09 1999 Central 0.010651283
16 10 1999 Central 0.007766118
...
111 06 1998 East 0.036787413
112 07 1998 East 0.054958377
113 08 1998 East 0.007390260
114 09 1998 East 0.010766598
115 10 1998 East 0.015843418
...
436 31 2007 West 0.166044176
437 32 2007 West 0.400031011
438 33 2007 West 0.133472059
439 34 2007 West 0.043669662
440 45 2007 West 0.017904620
You can use the conc function from the ineq library. The solution gets really simple and fast using data.table.
library(ineq)
library(data.table)
# convert your data.frame into a data.table
setDT(df)
# calculate inequality of grossoutput by region and year
df[, .(inequality = conc(grossoutput, type = "Herfindahl")), by=.(region, year) ]

Resources