Calculate Concentration Index by Region and Year (panel data) - r

This is my first post, and I am stuck trying to build my first function, which calculates Herfindahl measures on firm gross output using panel data: firms are observations by year (1998-2007) and region ("West", "Central", "East", "NE"). I am having problems passing arguments through the function. I think I need two loops (one for time and one for region). Any help would be useful; I really don't want to subset my data 400+ times to get Herfindahl measures one at a time. Thanks in advance!
Below I provide: 1) my starter code (it only returns one value); 2) the desired output (two tables containing the Herfindahl measures, by year and by year-region); and 3) the original data.
1) My starter code
myherf <- function(x, time, region) {
  time = year      # variable is defined in my data and includes c(1998:2007)
  region = region  # variable is defined in my data, c("West", "Central", "East", "NE")
  for (i in 1:length(time)) {
    for (j in 1:length(region)) {
      herf[i, j] <- x / sum(x)
      herf[i, j] <- herf[i, j]^2
      herf[i, j] <- sum(herf[i, j])^1/2
    }
  }
  return(herf[i, j])
}
myherf(extractiveoutput$x, i, j)
Error in herf[i, j] <- x/sum(x) : object 'herf' not found
2) My desired outcome is the following two vectors:
A. (1x10 vector)
Year herfindahl(yr)
1998 x
1999 x
...
2007 x
B. (1x40 vector)
Year Region herfindahl(yr-region)
1998 West x
1998 Central x
1998 East x
1998 NE x
...
2007 West x
2007 Central x
2007 East x
2007 NE x
3) Original Data
Obs. industry year region grossoutput
1 06 1998 Central 0.048804830
2 07 1998 Central 0.011222478
3 08 1998 Central 0.002851575
4 09 1998 Central 0.009515881
5 10 1998 Central 0.0067931
...
12 06 1999 Central 0.050861447
13 07 1999 Central 0.008421093
14 08 1999 Central 0.002034649
15 09 1999 Central 0.010651283
16 10 1999 Central 0.007766118
...
111 06 1998 East 0.036787413
112 07 1998 East 0.054958377
113 08 1998 East 0.007390260
114 09 1998 East 0.010766598
115 10 1998 East 0.015843418
...
436 31 2007 West 0.166044176
437 32 2007 West 0.400031011
438 33 2007 West 0.133472059
439 34 2007 West 0.043669662
440 45 2007 West 0.017904620

You can use the conc() function from the ineq package; the solution gets really simple and fast with data.table.
library(ineq)
library(data.table)
# convert your data.frame into a data.table
setDT(df)
# calculate inequality of grossoutput by region and year
df[, .(inequality = conc(grossoutput, type = "Herfindahl")), by=.(region, year) ]
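The same call grouped only by year gives your desired output A, and since the Herfindahl index is just the sum of squared output shares, a base-R cross-check should return the same numbers (a sketch assuming df is your original data.frame):
# desired output A: one Herfindahl value per year
df[, .(herfindahl = conc(grossoutput, type = "Herfindahl")), by = .(year)]
# base-R cross-check: Herfindahl = sum of squared output shares
aggregate(grossoutput ~ year + region, data = df,
          FUN = function(x) sum((x / sum(x))^2))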

Related

Create time series in R with weekly measurements for 30 years period

I have a set of weekly data for 30 years (1991-2020). The data was collected weekly between 5 May and 10 October every year. This gives me 23 weeks of data every year for 30 years.
I want to create a time series in R with this data. How do I do that, please? There should be just 690 entries in the output, but my code generates 1531 entries. See my code and data below:
I saw a similar question HERE, but mine repeats for 30 years.
myts <- ts(df$Kc_Kamble, start = c(1991, 1), end = c(2020, 23), frequency = 52)
Output in R:
Time Series:
Start = c(1991, 1)
End = c(2020, 23)
Frequency = 52
Sample data:
Year Week Kc_Kamble
1991 1 0.357445197
1991 2 0.36902168
1991 3 0.383675947
1991 4 0.400703221
1991 5 0.418901921
1991 6 0.437049406
1991 7 0.453742803
1991 8 0.467291036
1991 9 0.475942834
1991 10 0.476898402
1991 11 0.464632341
1991 12 0.436298927
1991 13 0.396338825
1991 14 0.352731819
1991 15 0.313539638
1991 16 0.283932169
1991 17 0.2627343
1991 18 0.247373874
1991 19 0.235647483
1991 20 0.225655859
1991 21 0.216663659
1991 22 0.208550065
1991 23 0.203605036
1992 1 0.336754943
1992 2 0.334735193
1992 3 0.342654691
1992 4 0.363520428
1992 5 0.397733301
1992 6 0.4399758
1992 7 0.483592219
1992 8 0.521920773
1992 9 0.548597061
1992 10 0.560150059
1992 11 0.557210705
1992 12 0.542114151
1992 13 0.5173071
1992 14 0.485236257
1992 15 0.448348321
1992 16 0.409089999
1992 17 0.369907993
1992 18 0.333162073
1992 19 0.300014261
1992 20 0.270225988
1992 21 0.243406301
1992 22 0.219247646
1992 23 0.204966601
Let me suggest the following steps to set up and start analyzing your time series.
Initialize your time series by creating a 'dates' sequence and a 'data' vector (set to NA). Use the xts library to create the time series.
library(xts)
dates <- seq(as.Date("1991-01-01"), as.Date("2020-01-01"), by = "weeks")
data <- rep(NA, length(dates))
myxts <- xts(x = data, order.by = dates)
str(myxts); head(myxts); tail(myxts)
Collect your data.
Data is collected weekly between 5 May and 10 October every year.
Let's read the data and work with the Weekly Total Precipitation for the year 2014.
ts_data <- read.table("https://www.dropbox.com/s/k2cxpja3cpsyoyc/ts_data.txt?dl=1", header =TRUE, sep="\t")
year.2014 <- ts_data[which(ts_data$Year == 2014),]
year.2014 # 23 rows of data for 2014.
start <- as.Date("2014-5-5"); end <- as.Date("2014-10-10")
collect <- which(index(myxts) >= start & index(myxts) <= end)
myxts[collect] <- year.2014$PRPtot
# year.2014 and collect must have the same number of rows
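The same collection step can be repeated for every year in the data. A minimal sketch (assuming ts_data covers several years with the columns Year and PRPtot used above, and that each year's May-October window lines up with the 23 weekly index points; years where the counts do not match are simply skipped):
for (y in unique(ts_data$Year)) {
  start <- as.Date(paste0(y, "-05-05"))
  end   <- as.Date(paste0(y, "-10-10"))
  collect <- which(index(myxts) >= start & index(myxts) <= end)
  yr <- ts_data[ts_data$Year == y, ]
  # only assign when the weekly grid and the yearly data line up
  if (length(collect) == nrow(yr)) myxts[collect] <- yr$PRPtot
}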
Verify the collected data. You should see data inside each time window, and NA outside the time windows.
myxts2 <- window(myxts, start=start-50, end=end+50)
str(myxts2); myxts2
Visualize the collected data. You could view the complete time series (i.e. myxts). Note that autoplot drops all NAs.
library(ggplot2)
autoplot(myxts2, geom = "point")

R: move everything after a word to a new column and then only keep the last four digits in the new column

My data frame has a column called "State" that contains the state name, HB/HF number, and the date the law went into effect. I want the State column to contain only the state name and a second column to contain just the year. How would I do this?
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
library(dplyr)  # for the pipe used below
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()), but I'm not sure how to do this since there's no specific pattern.
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the code below. It works, but I don't know why it works. From my understanding, \d means that \d is the actual character it's searching for, the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so it instead used characters. Isn't \d only supposed to be for digits? Would someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select the State column.
Then we use str_extract to pull out the four-digit number at the end of the string (\\d{4}$); this gives us the Year column.
Finally, we use the built-in state.name vector to build a pattern, apply str_extract again to keep only the state name, and remove the NA rows.
library(dplyr)
library(stringr)
mintz %>%
  select(State) %>%
  mutate(Year = str_extract(State, '\\d{4}$'), .after = State,
         State = str_extract(State, paste(state.name, collapse = '|'))) %>%
  na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
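Regarding the sub() call in the question's edit: .* matches any run of characters, (\\d{4}) captures a run of exactly four digits, and \\1 in the replacement refers back to that captured group, so the whole string is replaced by just the year. If a string contains no four-digit run, the pattern does not match and sub() returns the string unchanged, which is why Minnesota kept its original characters. A small illustration with hypothetical strings (not rows from the actual data):
sub('.*(\\d{4}).*', '\\1', c("Some State HB 123 - 2016", "No Year Here"))
# [1] "2016"         "No Year Here"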

r ddply error undefined columns selected

I have a time series data set like below:
age time income
16 to 24 2004 q1 400
16 to 24 2004 q2 500
… …
65 and over 2014 q3 800
It has 60 quarters of income data for each age group. As the income data is seasonal, I am trying to apply a decomposition function to filter out the trend. What I have done so far is below, but R consistently throws an error ("undefined columns selected") at me. Any idea how to go about it?
fun = function(x) {
  ts = ts(x, frequency = 4, start = c(2004, 1))
  ts.d = decompose(ts, type = 'additive')
  as.vector(ts.d$trend)
}
trend.dt = ddply(my.dat, .(age), transform, trend = fun(income))
The expected result is below (the NAs are because, after decomposition, the first and last observations will not have a trend value, but the rest should):
age time income trend
16 to 24 2004 q1 400 NA
16 to 24 2004 q2 500 489
… …
65 and over 2014 q3 800 760
65 and over 2014 q3 810 NA
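For what it's worth, here is a minimal sketch of one way to compute the per-group trend without plyr, using dplyr::group_by() and mutate() (assuming my.dat has columns age, time, and income, and each age group is a complete quarterly series starting in 2004 Q1; this does not diagnose the ddply error itself):
library(dplyr)
trend.dt <- my.dat %>%
  group_by(age) %>%
  mutate(trend = as.vector(decompose(ts(income, frequency = 4, start = c(2004, 1)),
                                     type = "additive")$trend)) %>%
  ungroup()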

create random subsets in R without duplicates

My task is to divide a dataset of 32 rows into 8 groups without duplicated entries.
I am trying to do this with a loop, creating a new dataset after each cycle.
The data:
year pos country elo fifa cont hcountry hcont
1 2010 FRA 1851 1044 Europe RSA Africa
2 2010 MEX 1872 895 South America RSA Africa
3 2010 URU 1819 899 South America RSA Africa
4 2010 RSA 1569 392 Africa RSA Africa
5 2010 GRE 1726 964 Europe RSA Africa
6 2010 KOR 1766 632 Asia RSA Africa
8 2010 ARG 1899 1076 South America RSA Africa
9 2010 USA 1749 957 North America RSA Africa
10 2010 SVN 1648 860 Europe RSA Africa
11 2010 ALG 1531 821 Africa RSA Africa
...
My solution so far:
for (i in 1:8) {
  assign(paste("group", i, sep = ""), droplevels(subset(wc2010[sample(nrow(wc2010), 4), ])))
  wc2010 <- subset(wc2010, !(country %in% group[i]$country))
}
The problem is, of course, that I don't know how to use the loop variable... :-(
Help would be deeply appreciated!
Thanks,
Bob
Here is one way to create a random partition:
random.groups <- function(n.items = 32L, n.groups = 8L)
  1L + (sample.int(n.items) %% n.groups)
So then you just have to do:
wc2010$group <- random.groups(nrow(wc2010), n.groups = 8L)
Then you might also be interested in doing
groups <- split(wc2010, wc2010$group)
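A quick sanity check on the result (each of the 8 groups should contain exactly 4 teams, and no team can be duplicated because the group label is simply added as a new column):
table(wc2010$group)  # each of the 8 groups should have 4 teams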
Edit: this was not asked by the OP, but I realize that soccer draws for big tournaments usually involve hats: before the draw, teams are grouped by regions and/or rankings. Then groups are formed by randomly picking one team from each hat, so that two teams from the same hat cannot end up in the same group.
Here is a modification to my function so it can also take hats as an input:
random.groups <- function(n.items = 32L, n.groups = 8L,
                          hats = rep(1L, n.items)) {
  splitted.items <- split(seq.int(n.items), hats)
  shuffled <- lapply(splitted.items, sample)
  1L + (order(unlist(shuffled)) %% n.groups)
}
Here is an example, where say, the first 8 teams are in hat #1, the next 8 teams are in hat #2, etc.:
# set.seed(123)
random.groups(32, 8, c(rep(1, 8), rep(2, 8), rep(3, 8), rep(4, 8)))
# [1] 7 8 2 6 5 3 1 4 8 7 5 3 2 4 1 6 3 2 7 6 5 8 1 4 7 6 5 4 3 2 1 8
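A quick way to check the hats version: every hat/group combination should appear exactly once, i.e. no group gets two teams from the same hat.
hats <- rep(1:4, each = 8)
groups <- random.groups(32, 8, hats)
table(hats, groups)  # every cell should be 1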

Tips on differencing values in R data frame by group [duplicate]

Possible Duplicate:
Beginner tips on using plyr to calculate year-over-year change across groups
What is a good way to calculate a year-on-year difference (as a new variable) for an existing data frame variable (i.e. Sales) across multiple variable groups (i.e. Region and Type)?
Below is an example of the data frame structure:
Date Region Type Sales
1/1/2001 East Food 120
1/1/2001 West Housing 130
1/1/2001 North Food 130
1/2/2001 East Food 133
1/3/2001 West Housing 140
1/4/2001 North Food 150
….
….
1/29/2013 East Food 125
1/29/2013 West Housing 137
1/29/2013 North Food 1350
Also, in addition to differencing the data, I would like to calculate a trailing (say 7-day) moving average.
Any guidance would be greatly appreciated.
Here is something to get you started. data.table is a great package for this sort of thing, as it provides a concise and easy-to-use syntax (once you are past the learning curve).
library(data.table)
Create a reproducible example
set.seed(128)
regions = c("East", "West", "North", "South")
types = c("Food", "Housing")
dates <- seq(as.Date('2009-01-01'), as.Date('2011-12-31'), by = 1)
n <- length(dates)
dt <- data.table(Date = dates,
                 Region = sample(regions, n, replace = TRUE),
                 Type = sample(types, n, replace = TRUE),
                 Sales = round(rnorm(n, mean = 100, sd = 10)))
Add Year column
dt[, Year := year(Date)]
> dt
Date Region Type Sales Year
1: 2009-01-01 West Food 119 2009
2: 2009-01-02 North Housing 102 2009
3: 2009-01-03 North Housing 102 2009
4: 2009-01-04 North Food 101 2009
5: 2009-01-05 West Food 101 2009
---
1091: 2011-12-27 East Housing 122 2011
1092: 2011-12-28 East Housing 88 2011
1093: 2011-12-29 North Food 115 2011
1094: 2011-12-30 West Housing 96 2011
1095: 2011-12-31 East Food 101 2011
Calculate summary by year
summary <- dt[, list(Sales = sum(Sales)), by = 'Year,Region,Type']
setkey(summary, 'Year')
> head(summary)
Year Region Type Sales
1: 2009 West Food 4791
2: 2009 North Housing 3517
3: 2009 North Food 6774
4: 2009 South Housing 4380
5: 2009 East Food 4144
6: 2009 West Housing 4275
Function to create year-on-year diffs for each region/product combo.
YoYdiff <- function(dt) {
  # Calculate the year-on-year difference for the Sales column
  data.table(Sales.Diff = diff(dt$Sales), Year = dt$Year[-1])
}
Calculate the year-on-year difference by group. This works for my example because setkey(summary, 'Year') sorts the summary table by Year, but if your data misses some years for some products/regions you have to be more careful.
> summary[, YoYdiff(.SD), by = 'Region,Type']
Region Type Sales.Diff Year
1: West Food -412 2010
2: West Food 121 2011
3: North Housing 1907 2010
4: North Housing -1457 2011
5: North Food -3087 2010
6: North Food 369 2011
7: South Housing -539 2010
8: South Housing 575 2011
9: East Food 1264 2010
10: East Food -1732 2011
11: West Housing 298 2010
12: West Housing -410 2011
13: South Food -889 2010
14: South Food 1045 2011
15: East Housing 1146 2010
16: East Housing 1169 2011
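The question also asks about a trailing 7-day moving average. Here is a minimal sketch using data.table::frollmean() (right-aligned, i.e. trailing, by default; requires data.table 1.12.0 or later) on the daily dt created above, under the assumption that the average should be taken per Region/Type with rows ordered by Date:
setkey(dt, Region, Type, Date)  # sort by Date within each Region/Type (this reorders dt)
dt[, Sales.MA7 := frollmean(Sales, 7), by = .(Region, Type)]
head(dt)  # the first six rows of each group come out as NA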
