Making a table for streamgraph - r

Hi guys I am trying to plot a streamgraph using data at the following link: https://www.kaggle.com/START-UMD/gtd.
My aim is to draw a streamgraph of the frequency of terrorist attacks for each terrorist group in the variable gname, but my problem is that I don't know how to filter the data frame so that it has all the parameters needed to plot a streamgraph: data, key, value, and date.
I tried to get that subset of the original data frame with the following code:
str <- terror %>%
filter(gname != "Unknown") %>%
group_by(gname) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20)
But all I managed to get is the frequency of attacks for each terrorist group, without getting the number of attacks for each year.
Could you suggest any way to do it? That would be amazing!
Thanks for reading guys and for the help.

Dario and Kent are correct. You need to add the iyear variable in the group_by function:
terror %>%
filter(gname != "Unknown") %>%
group_by(gname, iyear) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20) -> str
str
# A tibble: 20 x 3
# Groups: gname [7]
gname iyear total
<chr> <int> <int>
1 Islamic State of Iraq and the Levant (ISIL) 2016 1454
2 Islamic State of Iraq and the Levant (ISIL) 2017 1315
3 Islamic State of Iraq and the Levant (ISIL) 2014 1249
4 Taliban 2015 1249
5 Islamic State of Iraq and the Levant (ISIL) 2015 1221
6 Taliban 2016 1065
7 Taliban 2014 1035
8 Taliban 2017 894
9 Al-Shabaab 2014 871
10 Taliban 2012 800
11 Taliban 2013 775
12 Al-Shabaab 2017 570
13 Al-Shabaab 2016 564
14 Boko Haram 2015 540
15 Shining Path (SL) 1989 509
16 Communist Party of India - Maoist (CPI-Maoist) 2010 505
17 Shining Path (SL) 1984 502
18 Boko Haram 2014 495
19 Shining Path (SL) 1983 493
20 Farabundo Marti National Liberation Front (FML~ 1991 492
Then send that to the streamgraph:
str %>% streamgraph("gname", "total", "iyear")
I've always had difficulty annotating these graphs; as far as I know, it has to be done manually:
str %>% streamgraph("gname", "total", "iyear") %>%
sg_annotate(label="ISIL", x=as.Date("2016-01-01"), y=1454, size=14)
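One possible refinement, offered as a sketch rather than as part of the original answer: head(20) applied to group-year counts keeps only the 20 busiest group-year pairs, so most groups end up with just a few years each. An alternative is to pick the most active groups first and then count every year for those groups. The `terror` data frame below is a made-up stand-in for the GTD data, and the names `top_groups` and `attacks_by_year` are hypothetical.

```r
library(dplyr)

# Toy stand-in for the GTD data (assumption: only gname and iyear are needed).
terror <- data.frame(
  gname = c("A", "A", "A", "B", "B", "C", "Unknown", "A", "B", "C"),
  iyear = c(2014, 2014, 2015, 2014, 2015, 2015, 2015, 2016, 2016, 2016),
  stringsAsFactors = FALSE
)

# Step 1: the most active named groups over the whole period (top 2 here).
top_groups <- terror %>%
  filter(gname != "Unknown") %>%
  count(gname, sort = TRUE) %>%
  head(2) %>%
  pull(gname)

# Step 2: yearly counts for those groups only, giving key, date and value columns.
attacks_by_year <- terror %>%
  filter(gname %in% top_groups) %>%
  count(gname, iyear, name = "total")

attacks_by_year
```

The result has one row per group and year, which is the shape passed to streamgraph("gname", "total", "iyear") above.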

Related

How to assign new variables after group_split by automatically?

I am trying to split a data frame by two variables, year and sector. I split it with group_split, but every time I need one of the pieces I have to call it with the $ operator. I want to name the pieces automatically so that I do not need $ for every use. I know I can assign them to new names by hand, but I have more than 70 values, so that is rather time consuming.
dummy <- data.frame(year = rep(2014:2020, 12),
sector = rep(c("auto","retail","sales","medical"),3),
emp = sample(1:2000, size = 84))
dummy%>%
group_by(year)%>%
group_split(year)%>%
set_names(nm = unique(dummy$year)) -> dummy_year
head(dummy_year$`2014`)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
I want to call them like
some_kind_of_function(dummy_year, assign new variable by date)
head(year_2014)
year sector emp
<int> <chr> <int>
2014 auto 171
2014 medical 1156
2014 sales 1838
2014 retail 1386
2014 auto 1360
2014 medical 1403
maybe a for loop?
Maybe you want something like this:
library(dplyr)
dummy %>%
split(f = paste0("year_", as.factor(.$year)))
group_split doesn't create a named list. We can use split from base R instead:
lst1 <- split(dummy, dummy$year)
names(lst1) <- paste0('year_', names(lst1))
If we want to create separate objects in the global environment (not recommended), use list2env:
list2env(lst1, .GlobalEnv)
-output
> year_2014
year sector emp
1 2014 auto 740
8 2014 medical 123
15 2014 sales 700
22 2014 retail 166
29 2014 auto 323
36 2014 medical 653
43 2014 sales 986
50 2014 retail 1814
57 2014 auto 1381
64 2014 medical 661
71 2014 sales 1362
78 2014 retail 641
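Since creating separate objects is not recommended, here is a minimal sketch of working with the named list directly. It assumes a `dummy` data frame built like the one in the question; `by_year` and `mean_emp` are illustrative names only.

```r
set.seed(1)
dummy <- data.frame(
  year   = rep(2014:2020, 12),
  sector = rep(c("auto", "retail", "sales", "medical"), 21),
  emp    = sample(1:2000, size = 84)
)

# One data frame per year, named year_2014 ... year_2020.
by_year <- split(dummy, dummy$year)
names(by_year) <- paste0("year_", names(by_year))

# Access a single piece by name; no new global objects are needed.
head(by_year[["year_2014"]])

# Apply the same computation to every year at once.
mean_emp <- sapply(by_year, function(d) mean(d$emp))
mean_emp
```

Keeping the pieces in one list means a loop or sapply call covers all 70+ values without naming each one by hand.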

How to find number of storms per year since 2010?

The question says: Find the number of storms per year since 2010.
So far, I have this as my code in R.
The data set is "storms", which ships with the dplyr package and is a subset of the NOAA Atlantic hurricane database.
storms %>%
select(status, year) %>%
filter(year == 2010) %>%
tally()
What I don't know is whether the "since" keyword means before 2010, or whether I should just count the number of storms found in 2010.
"Storms per year since 2010" means the number of storms in each year from 2010 onwards, with 2010 included. Maybe this is what the question is asking:
storms2 = storms %>% filter(year>= 2010)
storms2 %>% count(year)
# A tibble: 11 × 2
year n
<dbl> <int>
1 2010 402
2 2011 323
3 2012 454
4 2013 202
5 2014 139
6 2015 220
7 2016 396
8 2017 306
9 2018 266
10 2019 330
11 2020 570
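One caveat worth checking: in dplyr's storms data each row is an observation of a storm (positions are recorded every six hours), so count(year) counts observations rather than storms. Below is a minimal sketch of counting distinct storm names per year instead. It uses a made-up `obs` data frame with the same `name` and `year` columns, and the storm names and years in it are invented for illustration.

```r
library(dplyr)

# Toy stand-in for storms: several observation rows per storm.
obs <- data.frame(
  name = c("Ana", "Alex", "Alex", "Alex", "Bonnie", "Bonnie",
           "Alex", "Colin", "Colin"),
  year = c(2009, 2010, 2010, 2010, 2010, 2010, 2016, 2016, 2016),
  stringsAsFactors = FALSE
)

storms_per_year <- obs %>%
  filter(year >= 2010) %>%       # "since 2010" includes 2010 itself
  distinct(year, name) %>%       # one row per storm per year
  count(year, name = "n_storms")

storms_per_year
```

The same pipeline should work on the real data by replacing `obs` with `storms`, assuming a storm is identified by its name within a year.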

How to filter a dataframe so that it finds the maximum value for 10 unique occurrences of another variable

I have this dataframe, which I filter down to only the counties in the state of Washington and only the columns relevant to my question. What I want is to filter it further so that I have just 10 rows: the counties with the highest Black prison population in Washington State, regardless of year. The part I am struggling with is that counties can't be repeated, so each row should hold the highest Black prison population for one of the top 10 unique county names. Some counties also have null (NA) values for the Black prison population. You should be able to reproduce the filtered dataframe with this code:
library(dplyr)
incarceration <- read.csv("https://raw.githubusercontent.com/vera-institute/incarceration-trends/master/incarceration_trends.csv")
blackPrisPop <- incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA")
Sample of what the updated dataframe looks like (should include 1911 rows):
fips county_name state year black_pop_15to64 black_prison_pop
130 53005 Benton County WA 2001 1008 25
131 53005 Benton County WA 2002 1143 20
132 53005 Benton County WA 2003 1208 21
133 53005 Benton County WA 2004 1236 27
134 53005 Benton County WA 2005 1310 32
135 53005 Benton County WA 2006 1333 35
You can group_by county_name and then use slice_max to take the row with the maximum value of black_prison_pop. Setting n = 1 gives one row per county, and setting with_ties = FALSE guarantees a single row even when several years tie for the maximum.
You can arrange in descending order the black_prison_pop value to get the overall top 10 values across all counties.
library(dplyr)
incarceration %>%
select(black_prison_pop, black_pop_15to64, year, fips, county_name, state) %>%
filter(state == "WA") %>%
group_by(county_name) %>%
slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>%
arrange(desc(black_prison_pop)) %>%
head(10)
Output
black_prison_pop black_pop_15to64 year fips county_name state
<dbl> <dbl> <int> <int> <chr> <chr>
1 1845 73480 2002 53033 King County WA
2 975 47309 2013 53053 Pierce County WA
3 224 5890 2005 53063 Spokane County WA
4 172 19630 2015 53061 Snohomish County WA
5 137 8129 2016 53011 Clark County WA
6 129 5146 2003 53035 Kitsap County WA
7 102 5663 2009 53067 Thurston County WA
8 58 706 1991 53021 Franklin County WA
9 50 1091 1991 53077 Yakima County WA
10 46 1748 2008 53073 Whatcom County WA
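Because the question mentions counties with missing values, here is a small sketch of the same pipeline on made-up data, with an explicit filter that drops NA rows first. The `pris` data frame and its values are invented. Dropping NAs up front is one possible way to keep counties with no data out of the result, not necessarily what the answer above intended.

```r
library(dplyr)

# Toy data: county C has no recorded values at all.
pris <- data.frame(
  county_name      = c("A", "A", "B", "B", "C", "C"),
  year             = c(2001, 2002, 2001, 2002, 2001, 2002),
  black_prison_pop = c(10, 25, NA, 7, NA, NA),
  stringsAsFactors = FALSE
)

top <- pris %>%
  filter(!is.na(black_prison_pop)) %>%                      # drop missing values
  group_by(county_name) %>%
  slice_max(black_prison_pop, n = 1, with_ties = FALSE) %>% # best year per county
  ungroup() %>%
  arrange(desc(black_prison_pop)) %>%
  head(10)

top
```

Here county A contributes its 2002 row, county B its 2002 row, and county C is absent because it never has a non-missing value.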

R: How to spread, group_by, summarise and mutate at the same time

I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
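In current tidyr, spread is superseded by pivot_wider, so the sum-then-reshape approach might look like the sketch below. The `orders` data frame is invented for illustration. Note also that this sketch computes (2015 - 2014) / 2014, so growth comes out positive, which is the opposite sign to the formula used in the question and in the code above.

```r
library(dplyr)
library(tidyr)

# Toy data in long format: several order rows per country and year.
orders <- data.frame(
  CountryName = c("UK", "UK", "UK", "US", "US", "US"),
  Year        = c(2014, 2014, 2015, 2014, 2015, 2015),
  Orders      = c(10, 10, 15, 8, 2, 2),
  stringsAsFactors = FALSE
)

pct <- orders %>%
  group_by(CountryName, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = sum_orders) %>%  # one column per year
  mutate(percent_inc = (`2015` - `2014`) / `2014` * 100)

pct
```

For the toy data, UK goes from 20 to 15 orders (-25%) and US from 8 to 4 (-50%).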
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: You're likely getting the error 'duplicate identifiers for rows' because of spread. spread wants to make N columns from your N unique key values, but it needs to know which unique row to place each value in. If you have duplicate value combinations, for instance if:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row the data belongs in. The quick fix is to add a unique row id before spreading: data %>% mutate(row = row_number()) %>% spread(...).
Error 2: You're likely getting the error 'sum not meaningful for factors' because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`, na.rm = TRUE), `2015_Sum` = sum(`2015`, na.rm = TRUE)).
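A small sketch of the first fix on made-up data: two rows that share the same identifying columns, which spread cannot tell apart until a unique row id is added. The `dup` data frame is hypothetical.

```r
library(dplyr)
library(tidyr)

# Two rows with identical identifiers (CountryName) and the same Year.
dup <- data.frame(
  CountryName = c("UK", "UK"),
  Year        = c(2014, 2014),
  Orders      = c(13, 5),
  stringsAsFactors = FALSE
)

# spread(dup, Year, Orders) fails here because both rows map to the same cell.
# Adding a row id makes every row uniquely identifiable.
wide <- dup %>%
  mutate(row = row_number()) %>%
  spread(Year, Orders)

wide
```

Each original row stays on its own line in the wide table, so the sums still have to be computed afterwards; summing before reshaping avoids the issue entirely.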

R t-test of mean vs observations for multiple factor levels

I have a dataset of some 39k rows of data, an excerpt is below:
'Country', 'Group', 'Item', 'Year' are categorical
'Production' and 'Waste' are numerical
'LF' is also numerical, but is the result of 'Waste'/'Production'
Region Country Group Item Year Production Waste LF
Europe Bulgaria Cereals Wheat 1961 2040 274 0.134313725
Europe Bulgaria Cereals Wheat 1962 2090 262 0.125358852
Europe Bulgaria Cereals Wheat 1963 1894 277 0.14625132
Europe Bulgaria Cereals Wheat 1964 2121 286 0.134842056
Europe Bulgaria Cereals Wheat 1965 2923 341 0.116660965
Europe Bulgaria Cereals Wheat 1966 3193 385 0.120576261
Europe Bulgaria Cereals Barley 1961 612 15 0.024509804
Europe Bulgaria Cereals Barley 1962 599 16 0.026711185
Europe Bulgaria Cereals Barley 1963 618 16 0.025889968
Europe Bulgaria Cereals Barley 1964 764 21 0.027486911
Europe Bulgaria Cereals Barley 1965 876 22 0.025114155
Europe Bulgaria Cereals Barley 1966 1064 24 0.022556391
I have used the following code to generate 991 different means by Country and Item:
df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')
The results of this function look ok.
I would like to test whether the respective means of LF in df2 differ from the underlying annual observations in df1 for each Country-Item combination (i.e. if FALSE, then LF is really just a static ratio; if TRUE, then 'Waste' is independent of 'Production').
How might this best be done? There seem to be 991 tests to conduct for this dataset alone and I don't know how to mix the apply and t.test functions in this manner.
Thanks!
A two-sample t.test compares two groups on a numeric dependent variable. Here, it seems to me that for each combination of country and item you want to compare the means across all the different years. In other words, you are trying to investigate whether year influences LF for each combination of country and item.
The easiest way to do this is to create a linear model (LF ~ Year) for each combination of country and item and interpret the coefficient and p value of the variable year.
library(dplyr)
library(broom)
set.seed(115)
# example dataset
dt = data.frame(Country = rep("country1",12),
Item = c(rep("item1",6), rep("item2",6)),
Year = rep(1961:1966,2),
LF = runif(12,0,1))
# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))
# each years means by country and item
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))
# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
Hope this helps. Let me know if I'm missing something and I'll update my code.
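For reference, the same per-group regression can be sketched in base R without do(), which newer dplyr versions mark as superseded. This reuses the toy data from the answer; `year_effects` is an illustrative name, and the approach is an alternative under the same assumptions, not a different test.

```r
set.seed(115)
dt <- data.frame(
  Country = rep("country1", 12),
  Item    = c(rep("item1", 6), rep("item2", 6)),
  Year    = rep(1961:1966, 2),
  LF      = runif(12, 0, 1)
)

# Fit LF ~ Year separately for each Country-Item combination and keep
# the row of the coefficient table that belongs to Year.
fits <- lapply(
  split(dt, list(dt$Country, dt$Item), drop = TRUE),
  function(d) coef(summary(lm(LF ~ Year, data = d)))["Year", ]
)

# One row per combination: estimate, standard error, t value, p value.
year_effects <- do.call(rbind, fits)
year_effects
```

A small p value in the last column for a given combination would suggest LF drifts with Year there, rather than being a static ratio.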
