How to automatically assign new variables after group_split? - r

I am trying to split a dataframe by two variables, year and sector. I did split it with group_split, but every time I need one of the pieces, I have to access it with the $ operator. I want to name the pieces automatically so I do not need $ for every usage. I know I can assign them to new names by hand, but I have more than 70 values, so that is time-consuming.
dummy <- data.frame(year = rep(2014:2020, 12),
                    sector = rep(c("auto", "retail", "sales", "medical"), 21),
                    emp = sample(1:2000, size = 84))
dummy %>%
  group_split(year) %>%
  set_names(nm = unique(dummy$year)) -> dummy_year
head(dummy_year$`2014`)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
I want to call them like
some_kind_of_function(dummy_year, assign new variable by date)
head(year_2014)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
maybe a for loop?

Maybe you want something like this:
library(dplyr)
dummy %>%
  split(f = paste0("year_", as.factor(.$year)))

group_split doesn't create a named list. We can use split from base R:
lst1 <- split(dummy, dummy$year)
names(lst1) <- paste0('year_', names(lst1))
If we want to create separate objects in the global environment (not recommended), use list2env:
list2env(lst1, .GlobalEnv)
Output:
> year_2014
year sector emp
1 2014 auto 740
8 2014 medical 123
15 2014 sales 700
22 2014 retail 166
29 2014 auto 323
36 2014 medical 653
43 2014 sales 986
50 2014 retail 1814
57 2014 auto 1381
64 2014 medical 661
71 2014 sales 1362
78 2014 retail 641
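Since the question asks "maybe a for loop?", here is a minimal sketch of that route: split with group_split, name the pieces, then assign each one in a loop. It rebuilds the example data so it is self-contained; the `year_` prefix is just an illustrative naming choice.

```r
library(dplyr)
library(purrr)

dummy <- data.frame(year = rep(2014:2020, 12),
                    sector = rep(c("auto", "retail", "sales", "medical"), 21),
                    emp = sample(1:2000, size = 84))

# group_split orders pieces by the sorted group values, so name them accordingly
dummy_year <- dummy %>%
  group_split(year) %>%
  set_names(paste0("year_", sort(unique(dummy$year))))

# Assign each list element to its own variable (same caveat as list2env:
# creating many loose objects is usually not recommended)
for (nm in names(dummy_year)) assign(nm, dummy_year[[nm]])

head(year_2014)
```

This is equivalent to the list2env answer above; the loop just makes the naming step explicit.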

Related

Making a table for streamgraph

Hi guys, I am trying to plot a streamgraph using the data at the following link: https://www.kaggle.com/START-UMD/gtd.
My aim is to streamgraph the frequency of terrorist attacks for each terrorist group in the variable gname, but my problem is that I don't know how to filter the data frame to get all the parameters necessary to plot a streamgraph, which are data, key, value, and date.
I tried to get to that subset of the original dataframe by using the following code
str <- terror %>%
  filter(gname != "Unknown") %>%
  group_by(gname) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(20)
But all I managed to get is the frequency of attacks for each terrorist group, without getting the number of attacks for each year.
Could you suggest any way to do it? That would be amazing!
Thanks for reading guys and for the help.
Dario and Kent are correct. You need to add the iyear variable in the group_by function:
terror %>%
  filter(gname != "Unknown") %>%
  group_by(gname, iyear) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(20) -> str
str
# A tibble: 20 x 3
# Groups: gname [7]
gname iyear total
<chr> <int> <int>
1 Islamic State of Iraq and the Levant (ISIL) 2016 1454
2 Islamic State of Iraq and the Levant (ISIL) 2017 1315
3 Islamic State of Iraq and the Levant (ISIL) 2014 1249
4 Taliban 2015 1249
5 Islamic State of Iraq and the Levant (ISIL) 2015 1221
6 Taliban 2016 1065
7 Taliban 2014 1035
8 Taliban 2017 894
9 Al-Shabaab 2014 871
10 Taliban 2012 800
11 Taliban 2013 775
12 Al-Shabaab 2017 570
13 Al-Shabaab 2016 564
14 Boko Haram 2015 540
15 Shining Path (SL) 1989 509
16 Communist Party of India - Maoist (CPI-Maoist) 2010 505
17 Shining Path (SL) 1984 502
18 Boko Haram 2014 495
19 Shining Path (SL) 1983 493
20 Farabundo Marti National Liberation Front (FML~ 1991 492
Then send that to the streamgraph:
str %>% streamgraph("gname", "total", "iyear")
I've always had difficulty annotating these graphs; as far as I know, it has to be done manually:
str %>% streamgraph("gname", "total", "iyear") %>%
  sg_annotate(label = "ISIL", x = as.Date("2016-01-01"), y = 1454, size = 14)

Testing whether n% of data values exist in a variable grouped by posix date

I have a data frame with hourly observational climate data over multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
                as.POSIXct("2012-12-31"),
                by = (60 * 60))
WS <- sample(0:20, 8761, rep = TRUE)
WD <- sample(0:390, 8761, rep = TRUE)
Temp <- sample(0:40, 8761, rep = TRUE)
df <- data.frame(dateTime, WS, WD, Temp)
df$WS[WS > 15] <- NA
I need to group by year (or, in this example, by month) to find whether df$WS has 75% or more valid data for that month. My filtering criterion is NA, as 0 is still a valid observation; I have real NAs because it is observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something in a longer script that works in a looping function that will go through all my stations and all the years in each station to produce a wind rose if this criteria is met for that year / station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one is quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
Here is a method using dplyr. It will work even if you have missing data.
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>%
  mutate(Month = format(dateTime, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(No.Obs = sum(!is.na(WS)),
            Max.Obs = 24 * days_in_month(as.Date(paste0(first(Month), "-01")))) %>%
  mutate(Obs.Rate = No.Obs / Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
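To apply the 75% threshold from the question, the monthly summary can simply be filtered. A self-contained sketch (rebuilding the dummy data, and using the same Obs.Rate idea as the dplyr answer above):

```r
library(dplyr)
library(lubridate)  # for days_in_month

set.seed(1)
dateTime <- seq(as.POSIXct("2012-01-01"), as.POSIXct("2012-12-31"), by = 60 * 60)
df <- data.frame(dateTime, WS = sample(0:20, length(dateTime), replace = TRUE))
df$WS[df$WS > 15] <- NA

# Share of non-NA hourly observations per month, kept only if >= 75%
valid_months <- df %>%
  mutate(Month = format(dateTime, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(Obs.Rate = sum(!is.na(WS)) /
              (24 * days_in_month(as.Date(paste0(first(Month), "-01"))))) %>%
  filter(Obs.Rate >= 0.75)
```

In a station loop, `valid_months$Month` would then give the months (or years) for which a wind rose should be produced.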

Aggregate/Group_by second minimum value in R

I have used either group_by() from dplyr or the aggregate() function to aggregate across columns in R. For my current problem I want to group by individual while finding the second-lowest value of one column (Number) and the lowest of another (Year). So, if my data looks like this:
Number Individual Year Value
123 M. Smith 2010 234
435 M. Smith 2011 346
435 M. Smith 2012 356
524 M. Smith 2015 432
119 J. Jones 2010 345
119 J. Jones 2012 432
254 J. Jones 2013 453
876 J. Jones 2014 654
I want it to become:
Number Individual Year Value
435 M. Smith 2011 346
254 J. Jones 2013 453
Thank you.
We can use the dplyr package. dt2 is the final output. The idea is to filter out the minimum in the Number column, then arrange the data frame by Individual, Number, and Year. Finally, select the first row of each group.
# Load package
library(dplyr)
# Create example data frame
dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
  group_by(Individual) %>%
  filter(Number != min(Number)) %>%
  arrange(Individual, Number, Year) %>%
  slice(1)
We can use dplyr
library(dplyr)
df1 %>%
  group_by(Individual) %>%
  arrange(Individual, Number) %>%
  filter(Number != max(Number)) %>%
  slice(which.max(Number))
# A tibble: 2 x 4
# Groups: Individual [2]
# Number Individual Year Value
# <int> <chr> <int> <int>
#1 254 J. Jones 2013 453
#2 435 M. Smith 2011 346
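A variant that makes "second-lowest distinct Number" explicit is dense_rank, which handles the tied minima (the duplicated 119s and 435s) by ranking distinct values; this is a sketch on the example data, not the answerers' exact code:

```r
library(dplyr)

dt <- read.table(text = "Number Individual Year Value
123 'M. Smith' 2010 234
435 'M. Smith' 2011 346
435 'M. Smith' 2012 356
524 'M. Smith' 2015 432
119 'J. Jones' 2010 345
119 'J. Jones' 2012 432
254 'J. Jones' 2013 453
876 'J. Jones' 2014 654",
header = TRUE, stringsAsFactors = FALSE)

res <- dt %>%
  group_by(Individual) %>%
  filter(dense_rank(Number) == 2) %>%  # rows with the second-lowest distinct Number
  slice_min(Year, n = 1) %>%           # then the earliest Year among those
  ungroup()
```

For J. Jones the ranks of 119, 119, 254, 876 are 1, 1, 2, 3, so 254 is selected; for M. Smith both 435 rows have rank 2 and slice_min keeps the 2011 row.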

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
to which I must add a column named "Rate", with values coming from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to a 'data.table' (setDT(A)), melt the second dataset, and join on 'Year' and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable = "Country", value = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names):
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
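For completeness, the same lookup also works in base R with matrix indexing, with no reshaping or packages at all. A sketch on the example data (the approach generalizes: build one match() per criterion):

```r
A <- data.frame(Country = c("USA", "Canada", "Canada"),
                Year = c(1994, 1997, 1998),
                Number = c(455, 342, 987),
                stringsAsFactors = FALSE)
B <- data.frame(Year = 1993:1998,
                USA = c(21, 41, 56, 85, 65, 1),
                Canada = c(654, 321, 789, 123, 456, 999))

# Index the rate columns of B by (row = matching Year, column = matching Country)
A$Rate <- as.matrix(B[-1])[cbind(match(A$Year, B$Year),
                                 match(A$Country, names(B)[-1]))]
A
```

cbind(...) builds a two-column index matrix, so each row of A pulls exactly one cell of B.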

Calculate Concentration Index by Region and Year (panel data)

This is my first post, and I am very stuck trying to build my first function, which calculates Herfindahl measures on firm gross output using panel data (year = 1998:2007), with firms as observations by year (1998-2007) and region ("West", "Central", "East", "NE"). I am having problems passing arguments through the function. I think I need to use two loops (one for time and one for region). Any help would be useful. I really don't want to have to subset my data 400+ times to get Herfindahl measures one at a time. Thanks in advance!
Below I provide: 1) my starter code (it only returns one value); 2) the desired output (two bins that contain the Herfindahl measures by 1) year and 2) year-region); and 3) the original data.
1) My starter Code
myherf <- function(x, time, region) {
  time = year     # variable is defined in my data and includes c(1998:2007)
  region = region # variable is defined in my data, c("West", "Central", "East", "NE")
  for (i in 1:length(time)) {
    for (j in 1:length(region)) {
      herf[i, j] <- x / sum(x)
      herf[i, j] <- herf[i, j]^2
      herf[i, j] <- sum(herf[i, j])^1/2
    }
  }
  return(herf[i, j])
}
myherf(extractiveoutput$x, i, j)
Error in herf[i, j] <- x/sum(x) : object 'herf' not found
2) My desired outcome is the following two vectors:
A. (1x10 vector)
Year herfindahl(yr)
1998 x
1999 x
...
2007 x
B. (1x40 vector)
Year Region herfindahl(yr-region)
1998 West x
1998 Central x
1998 East x
1998 NE x
...
2007 West x
2007 Central x
2007 East x
2007 NE x
3) Original Data
Obs. industry year region grossoutput
1 06 1998 Central 0.048804830
2 07 1998 Central 0.011222478
3 08 1998 Central 0.002851575
4 09 1998 Central 0.009515881
5 10 1998 Central 0.0067931
...
12 06 1999 Central 0.050861447
13 07 1999 Central 0.008421093
14 08 1999 Central 0.002034649
15 09 1999 Central 0.010651283
16 10 1999 Central 0.007766118
...
111 06 1998 East 0.036787413
112 07 1998 East 0.054958377
113 08 1998 East 0.007390260
114 09 1998 East 0.010766598
115 10 1998 East 0.015843418
...
436 31 2007 West 0.166044176
437 32 2007 West 0.400031011
438 33 2007 West 0.133472059
439 34 2007 West 0.043669662
440 45 2007 West 0.017904620
You can use the conc function from the ineq library. The solution gets really simple and fast using data.table.
library(ineq)
library(data.table)
# convert your data.frame into a data.table
setDT(df)
# calculate inequality of grossoutput by region and year
df[, .(inequality = conc(grossoutput, type = "Herfindahl")), by=.(region, year) ]
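The same measure can also be computed without extra packages, since the Herfindahl index is just the sum of squared output shares within each group. A sketch on made-up panel data (the data frame and its dimensions are illustrative, not the poster's actual data):

```r
# Herfindahl index: sum of squared shares
herf <- function(x) sum((x / sum(x))^2)

set.seed(42)
df <- expand.grid(industry = 1:10,
                  year = 1998:2007,
                  region = c("West", "Central", "East", "NE"))
df$grossoutput <- runif(nrow(df))

# A. one Herfindahl value per year (10 values)
by_year <- aggregate(grossoutput ~ year, data = df, FUN = herf)

# B. one Herfindahl value per year-region pair (40 values)
by_year_region <- aggregate(grossoutput ~ year + region, data = df, FUN = herf)
```

This produces the two desired outputs (by year, and by year-region) without any explicit loops or repeated subsetting.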
