R: move everything after a word to a new column and then only keep the last four digits in the new column - r

My data frame has a column called "State" and contains the state name, HB/HF number, and the date the law went into effect. I want the state column to only contain the state name and the second column to contain just the year. How would I do this?
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()) but I'm not sure how to this since there's no specific pattern
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the below. It works, but I don't know why it works. From my understanding \d means that \d is the actual character it's searching for. the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so it instead used characters. Isn't \d only supposed to be for digits? Someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)

One way could be:
For demonstration purposes select the State column.
Then we use str_extract to extract all numbers with 4 digits with that are at the end of the string \\d{4}-> this gives us the Year column.
Finally we make use of the inbuilt state.name function make a pattern of it an use it again with str_extract and remove NA rows.
library(dplyr)
library(stringr)
mintz %>%
select(State) %>%
mutate(Year = str_extract(State, '\\d{4}$'), .after=State,
State = str_extract(State, paste(state.name, collapse='|'))
) %>%
na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018

Related

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have this dataframe here which I filter down to only include counties in the state of Washington and only include columns that are relevant for the answer I am looking for. What I want to do is filter down the dataframe so that I have 10 rows only, which have the highest Black Prison Population out of all of the counties in Washington State regardless of year. The part that I am struggling with is that there can't be repeated counties, so each row should include the highest Black Prison Populations for the top 10 unique county names in the state of Washington. Some of the counties have Null data for the populations for the black prison populations as well. for You should be able to reproduce this to get the updated dataframe.
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35
You can group_by the county county_name, and then use slice_max taking the row with maximum value for black_prison_pop. If you set n = 1 option you will get one row for each county. If you set with_ties to FALSE, you also will get one row even in case of ties.
You can arrange in descending order the black_prison_pop value to get the overall top 10 values across all counties.
library(dplyr)
incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA") %>%
group_by(county_name) %>%
slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
arrange(desc(black_prison_pop)) %>%
head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA

R moving average between data frame variables

I am trying to find a solution but haven't, yet.
I have a dataframe structured as follows:
country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54
I want to find the moving average for every 3 years (i.e. 2014-16, 2015-17, etc) to be placed in ad-hoc columns.
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
France Paris 23 34 54 12 23 21 37 33.3 29.7 18.7
US NYC 1 2 2 12 95 54 etc etc etc etc
Any hint?
1) Using the data shown reproducibly in the Note at the end we apply rollmean to each column in the transpose of the data and then transpose back. We rollapply the appropriate paste command to create the names.
library(zoo)
DF2 <- DF[-(1:2)]
cbind(DF, setNames(as.data.frame(t(rollmean(t(DF2), 3))),
rollapply(names(DF2), 3, function(x) paste(range(x), collapse = "-"))))
giving:
country City 2014 2015 2016 2017 2018 2019 2014-2016 2015-2017 2016-2018 2017-2019
1 France Paris 23 34 54 12 23 21 37.000000 33.333333 29.66667 18.66667
2 US NYC 1 2 2 12 95 54 1.666667 5.333333 36.33333 53.66667
2) This could also be expressed using dplyr/tidyr/zoo like this:
library(dplyr)
library(tidyr)
library(zoo)
DF %>%
pivot_longer(-c(country, City)) %>%
group_by(country, City) %>%
mutate(value = rollmean(value, 3, fill = NA),
name = rollapply(name, 3, function(x) paste(range(x), collapse="-"), fill=NA)) %>%
ungroup %>%
drop_na %>%
pivot_wider %>%
left_join(DF, ., by = c("country", "City"))
Note
Lines <- "country City 2014 2015 2016 2017 2018 2019
France Paris 23 34 54 12 23 21
US NYC 1 2 2 12 95 54 "
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE, check.names = FALSE)

A new table in R based on fields and data from existing one

I need to raise the question again as it was closed as duplicated, but the issue hasn't been resolved.
So, I'm working on international trade data and have the following table at the moment with 5 different values for commodity_code (commod_codes = c('85','84','87','73','29')):
year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 Organic chemicals 150863100
2 2013 Import Belarus China 29 Organic chemicals 151614000
3 2014 Import Belarus China 29 Organic chemicals 73110200
4 2015 Import Belarus China 29 Organic chemicals 140396300
5 2016 Import Belarus China 29 Organic chemicals 135311600
6 2012 Import Belarus China 73 Articles of iron or steel 100484600
I need to create a new table that looks simple (commodity codes in top row, years in first column and corresponding trade values in cells):
year commodity_code
29 73 84 85 87
1998 value1 ... value 5
1999
…
2016
* I used reshape() but didn't succeed.
Would appreciate your support.
In case there are duplicate permutations, I would suggest to use this code (though not in base R - uses dplyr and tidyr packages)
as.data.frame(trade_data[,c("year","commodity_code","trade_value_usd")] %>% group_by (year,commodity_code)%>% summarise( sum(trade_value_usd))%>%spread(commodity_code,3))
Provided I understood you correctly, here is a one-liner in base R.
xtabs(trade_value_usd ~ year + commodity_code, data = df);
#year 29 73
# 2012 150863100 100484600
# 2013 151614000 0
# 2014 73110200 0
# 2015 140396300 0
# 2016 135311600 0
Explanation: Use xtabs to cross-tabulate trade_value_usd as a function of year (rows) and commodity_code (columns).
Sample data
df <- read.table(text =
"year trade_flow reporter partner commodity_code commodity trade_value_usd
1 2012 Import Belarus China 29 'Organic chemicals' 150863100
2 2013 Import Belarus China 29 'Organic chemicals' 151614000
3 2014 Import Belarus China 29 'Organic chemicals' 73110200
4 2015 Import Belarus China 29 'Organic chemicals' 140396300
5 2016 Import Belarus China 29 'Organic chemicals' 135311600
6 2012 Import Belarus China 73 'Articles of iron or steel' 100484600
", header = T, row.names = 1)

Merging data by 2 variables in R

I am attempting to merge two data sets. In the past I have used merge() with by equal to the variable I want to merge by. However, now I would like to do so with two variables. My first data set looks something like this:
Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South Carolina
2013 Tennessee Texas
Then I have another data set with a rank of each team (this is very simplified) for each year. Like this:
Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51
I would like to merge them so I have a data set that looks like this:
Year Winning_Tm Winning_TM_rank Losing_Tm Losing_Tm_rank
2011 Texas 32 Washington 34
2012 Alabama 12 South Carolina 45
2013 Tennessee 51 Texas 6
My hope is that there is a simple way to do this but it may be more complicated. Thanks!
I reproduced your data (try to include a dput of it next time):
A <- data.frame(
Year = c(2011, 2012, 2013),
Winning_Tm = c("Texas","Alabama","Tennessee"),
Losing_Tm = c("Washington","South Carolina", "Texas"),
stringsAsFactors = FALSE
)
B <- data.frame(
Year = c("2011","2011","2012","2012","2013","2013"),
Team = c("Texas","Washington","South Carolina","Alabama","Texas","Tennessee"),
Rank = c(32,34,45,12,6,51),
stringsAsFactors = FALSE
)
You can melt the first dataframe using the reshape2 package:
library(reshape2)
A <- melt(A, id.vars = "Year")
names(A)[3] <- "Team"
Now it looks like this:
> A
Year variable Team
1 2011 Winning_Tm Texas
2 2012 Winning_Tm Alabama
3 2013 Winning_Tm Tennessee
4 2011 Losing_Tm Washington
5 2012 Losing_Tm South Carolina
6 2013 Losing_Tm Texas
You can then merge the datasets together by the two columns of interest:
AB <- merge(A, B, by=c("Year","Team"))
Which looks like this:
> AB
Year Team variable Rank
1 2011 Texas Winning_Tm 32
2 2011 Washington Losing_Tm 34
3 2012 Alabama Winning_Tm 12
4 2012 South Carolina Losing_Tm 45
5 2013 Tennessee Winning_Tm 51
6 2013 Texas Losing_Tm 6
Then using the reshape command from base R you can change AB to a wide format:
reshape(AB, idvar = "Year", timevar = "variable", direction = "wide")
The result:
Year Team.Winning_Tm Rank.Winning_Tm Team.Losing_Tm Rank.Losing_Tm
1 2011 Texas 32 Washington 34
3 2012 Alabama 12 South Carolina 45
5 2013 Tennessee 51 Texas 6
Two separate merges. You would need to wrap your list of by variables in c(), and since the variables have different names, you need by.x and by.y. Afterward you could rename the rank variables.
I'll call your data winlose and teamrank, respectively. Then you'd need:
first_merge <- merge(winlose, teamrank, by.x = c('Year', 'Winning_Tm'), by.y = c('Year', 'Team'))
second_merge <- merge(first_merge, teamrank, by.x = c('Year', 'Losing_Tm'), by.y = c('Year', 'Team'))
Renaming the variables:
names(second_merge)[names(second_merge) == 'Rank.x'] <- 'Winning_Tm_rank'
names(second_merge)[names(second_merge) == 'Rank.y'] <- 'Losing_Tm_rank'
If you are familiar with SQL a rather complicated, but fast way to do this all in one step would be:
res <- sqldf("SELECT l.*,
max(case when l.Winning_Tm = r.Team then r.Rank else 0 end) as Winning_Tm_rank,
max(case when l.Losing_Tm = r.Team then r.Rank else 0 end) as Losing_Tm_rank
FROM df1 as l
inner join df2 as r
on (l.Winning_Tm = r.Team
OR l.Losing_Tm = r.Team)
AND l.Year = r.Year
group by l.Year, l.Winning_Tm, l.Losing_Tm")
res
Year Winning_Tm Losing_Tm Winning_Tm_rank Losing_Tm_rank
1 2011 Texas Washington 32 34
2 2012 Alabama South_Carolina 12 45
3 2013 Tennessee Texas 51 6
Data:
df1 <- read.table(header=T,text="Year Winning_Tm Losing_Tm
2011 Texas Washington
2012 Alabama South_Carolina
2013 Tennessee Texas")
df2<- read.table(header=T,text="Year Team Rank
2011 Texas 32
2011 Washington 34
2012 South_Carolina 45
2012 Alabama 12
2013 Texas 6
2013 Tennessee 51")

Calculate Concentration Index by Region and Year (panel data)

This is my first post and very stuck on trying to build my first function that calculates Herfindahl measures on Firm gross output, using panel data (year=1998:2007) with firms = obs. by year (1998-2007) and region ("West","Central","East","NE") and am having problems with passing arguments through the function. I think I need to use two loops (one for time and one for region). Any help would be useful.. I really dont want to have to subset my data 400+ times to get herfindahl measures one at a time. Thanks in advance!
Below I provide: 1) My starter code (only returns one value); 2) desired output (2-bins that contain the hefindahl measures by 1) year and by 2) year-region); and 3) original data
1) My starter Code
myherf<- function (x, time, region){
time = year # variable is defined in my data and includes c(1998:2007)
region = region # Variable is defined in my data, c("West", "Central","East","NE")
for (i in 1:length(time)) {
for (j in 1:length(region)) {
herf[i,j] <- x/sum(x)
herf[i,j] <- herf[i,j]^2
herf[i,j] <- sum(herf[i,j])^1/2
}
}
return(herf[i,j])
}
myherf(extractiveoutput$x, i, j)
Error in herf[i, j] <- x/sum(x) : object 'herf' not found
2) My desired outcome is the following two vectors:
A. (1x10 vector)
Year herfindahl(yr)
1998 x
1999 x
...
2007 x
B. (1x40 vector)
Year Region hefindahl(yr-region)
1998 West x
1998 Central x
1998 East x
1998 NE x
...
2007 West x
2007 Central x
2007 East x
2007 northeast x
3) Original Data
Obs. industry year region grossoutput
1 06 1998 Central 0.048804830
2 07 1998 Central 0.011222478
3 08 1998 Central 0.002851575
4 09 1998 Central 0.009515881
5 10 1998 Central 0.0067931
...
12 06 1999 Central 0.050861447
13 07 1999 Central 0.008421093
14 08 1999 Central 0.002034649
15 09 1999 Central 0.010651283
16 10 1999 Central 0.007766118
...
111 06 1998 East 0.036787413
112 07 1998 East 0.054958377
113 08 1998 East 0.007390260
114 09 1998 East 0.010766598
115 10 1998 East 0.015843418
...
436 31 2007 West 0.166044176
437 32 2007 West 0.400031011
438 33 2007 West 0.133472059
439 34 2007 West 0.043669662
440 45 2007 West 0.017904620
You can use the conc function from the ineq library. The solution gets really simple and fast using data.table.
library(ineq)
library(data.table)
# convert your data.frame into a data.table
setDT(df)
# calculate inequality of grossoutput by region and year
df[, .(inequality = conc(grossoutput, type = "Herfindahl")), by=.(region, year) ]

Resources