The question says: find the number of storms per year since 2010.
This is my R code so far.
The data set is "storms", which is available in R (it ships with the dplyr package) and is a subset of the NOAA Atlantic hurricane database.
storms %>%
select(status, year) %>%
filter(year == 2010) %>%
tally()
What I don't know is whether "since 2010" means the years before 2010, or whether I should just count the storms recorded in 2010.
"Since 2010" means 2010 and every later year, and "per year" means a separate count for each of those years. This may be what the question is asking for:
storms2 = storms %>% filter(year>= 2010)
storms2 %>% count(year)
# A tibble: 11 × 2
year n
<dbl> <int>
1 2010 402
2 2011 323
3 2012 454
4 2013 202
5 2014 139
6 2015 220
7 2016 396
8 2017 306
9 2018 266
10 2019 330
11 2020 570
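One caveat, assuming the data is dplyr's storms: each row there is a single observation of a storm rather than a whole storm, so count(year) returns observations per year. If the question wants distinct storms, a rough sketch is to count distinct names within each year. The obs tibble below is a made-up stand-in so the example runs on its own.

```r
library(dplyr)

# Toy stand-in for dplyr::storms (hypothetical values): one row per observation,
# so a single storm appears on several rows.
obs <- tibble(
  name = c("Alex", "Alex", "Alex", "Bonnie", "Arlene", "Arlene", "Alex"),
  year = c(2010, 2010, 2010, 2010, 2011, 2011, 2009)
)

storms_per_year <- obs %>%
  filter(year >= 2010) %>%                # "since 2010" = 2010 and later
  group_by(year) %>%
  summarise(n_storms = n_distinct(name),  # distinct storms in that year
            n_obs    = n())               # rows, which is what count(year) reports

storms_per_year
```

With the real data, replacing obs with storms should give both numbers side by side, so you can see which one the question is after.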
Hi guys, I am trying to plot a streamgraph using the data at the following link: https://www.kaggle.com/START-UMD/gtd.
My aim is to plot a streamgraph of the frequency of terrorist attacks for each terrorist group in the variable gname, but I don't know how to reshape the data frame so that it has all the parameters a streamgraph needs: data, key, value, date.
I tried to get to that subset of the original dataframe by using the following code
str <- terror %>%
filter(gname != "Unknown") %>%
group_by(gname) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20)
But all I managed to get is the frequency of attacks for each terrorist group, without getting the number of attacks for each year.
Could you suggest any way to do it? That would be amazing!
Thanks for reading guys and for the help.
Dario and Kent are correct. You need to add the iyear variable in the group_by function:
terror %>%
filter(gname != "Unknown") %>%
group_by(gname, iyear) %>%
summarise(total=n()) %>%
arrange(desc(total)) %>%
head(20) -> str
str
# A tibble: 20 x 3
# Groups: gname [7]
gname iyear total
<chr> <int> <int>
1 Islamic State of Iraq and the Levant (ISIL) 2016 1454
2 Islamic State of Iraq and the Levant (ISIL) 2017 1315
3 Islamic State of Iraq and the Levant (ISIL) 2014 1249
4 Taliban 2015 1249
5 Islamic State of Iraq and the Levant (ISIL) 2015 1221
6 Taliban 2016 1065
7 Taliban 2014 1035
8 Taliban 2017 894
9 Al-Shabaab 2014 871
10 Taliban 2012 800
11 Taliban 2013 775
12 Al-Shabaab 2017 570
13 Al-Shabaab 2016 564
14 Boko Haram 2015 540
15 Shining Path (SL) 1989 509
16 Communist Party of India - Maoist (CPI-Maoist) 2010 505
17 Shining Path (SL) 1984 502
18 Boko Haram 2014 495
19 Shining Path (SL) 1983 493
20 Farabundo Marti National Liberation Front (FML~ 1991 492
Then send that to the streamgraph:
str %>% streamgraph("gname", "total", "iyear")
I've always had difficulty annotating these graphs; as far as I know, it has to be done manually:
str %>% streamgraph("gname", "total", "iyear") %>%
sg_annotate(label="ISIL", x=as.Date("2016-01-01"), y=1454, size=14)
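Note that head(20) after arrange(desc(total)) keeps the 20 largest group-year pairs, not 20 groups, which is why only 7 groups appear above. If the aim is the most active groups with a count for every year, one possible approach is to rank the groups first and then count per year. This is only a sketch: terror_toy and n_groups are made up for illustration.

```r
library(dplyr)

# Hypothetical miniature of the GTD data: one row per attack.
terror_toy <- tibble(
  gname = c("A", "A", "A", "B", "B", "C", "Unknown", "A", "B"),
  iyear = c(2014, 2014, 2015, 2014, 2015, 2015, 2015, 2015, 2015)
)

n_groups <- 2  # use 20 for the real data

# Step 1: the most active groups over the whole period.
top_groups <- terror_toy %>%
  filter(gname != "Unknown") %>%
  count(gname, sort = TRUE) %>%
  head(n_groups)

# Step 2: yearly attack counts for those groups only.
yearly <- terror_toy %>%
  semi_join(top_groups, by = "gname") %>%
  count(gname, iyear, name = "total")

yearly
```

The yearly table has the same three columns as the one above, so it can be sent to streamgraph in the same way.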
This question already has answers here:
How to sum a variable by group
I have the dataframe below.
year<-c(2016,2016,2017,2017,2016,2016,2017,2017)
city<-c("NY","NY","NY","NY","WS","WS","WS","WS")
spec<-c("df","df","df","df","vb","vb","vb","vb")
num<-c(45,67,89,90,45,67,89,90)
df<-data.frame(year,city,spec,num)
I would like to know if it is possible to sum num by the year, city and spec columns, taking the data from this form:
year city spec num
1 2016 NY df 45
2 2016 NY df 67
3 2017 NY df 89
4 2017 NY df 90
5 2016 WS vb 45
6 2016 WS vb 67
7 2017 WS vb 89
8 2017 WS vb 90
to this:
year city spec num
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Possible duplicate, but here is an answer:
library(tidyverse)
df %>%
group_by(year,city,spec) %>%
summarise(sum = sum(num))
...results in ...
# A tibble: 4 x 4
# Groups: year, city [4]
year city spec sum
<dbl> <fct> <fct> <dbl>
1 2016 NY df 112
2 2016 WS vb 112
3 2017 NY df 179
4 2017 WS vb 179
One way is to use the sqldf package:
library(sqldf)
sqldf("Select year, city, spec, sum(num) from df
group by year, city, spec order by city")
year city spec sum(num)
1 2016 NY df 112
2 2017 NY df 179
3 2016 WS vb 112
4 2017 WS vb 179
Using dplyr:
library(dplyr)
df %>%
group_by(year, city, spec) %>%
summarise(SumNum = sum(num)) %>%
arrange(city)
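If you would rather not load a package, base R's aggregate() with a formula should give the same totals. A minimal sketch using the data from the question:

```r
# Same example data as in the question.
df <- data.frame(
  year = c(2016, 2016, 2017, 2017, 2016, 2016, 2017, 2017),
  city = c("NY", "NY", "NY", "NY", "WS", "WS", "WS", "WS"),
  spec = c("df", "df", "df", "df", "vb", "vb", "vb", "vb"),
  num  = c(45, 67, 89, 90, 45, 67, 89, 90)
)

# Sum num within each year/city/spec combination.
totals <- aggregate(num ~ year + city + spec, data = df, FUN = sum)
totals
```

The result is a plain data frame with one row per combination, so no grouping attributes are left over afterwards.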
I have a df that resembles this:
Year Country Sales($M)
2013 Australia 120
2013 Australia 450
2013 Armenia 80
2013 Armenia 175
2013 Armenia 0
2014 Australia 500
2014 Australia 170
2014 Armenia 0
2014 Armenia 100
I'd like to combine the rows that match Year and Country, adding the Sales column. The result should be:
Year Country Sales($M)
2013 Australia 570
2013 Armenia 255
2014 Australia 670
2014 Armenia 100
I'm sure I could write a long loop to check whether Year and Country are the same and then add the Sales from those rows, but this is R so there must be a simple function that I'm totally missing.
Many thanks in advance.
library(tidyverse)
df %>%
group_by(Year,Country) %>%
summarise(Sales = sum(Sales))
I have a data frame of hourly observational climate data covering multiple years. I have included a dummy data frame below that will hopefully illustrate my question.
dateTime <- seq(as.POSIXct("2012-01-01"),
as.POSIXct("2012-12-31"),
by=(60*60))
WS <- sample(0:20,8761,replace=TRUE)
WD <- sample(0:390,8761,replace=TRUE)
Temp <- sample(0:40,8761,replace=TRUE)
df <- data.frame(dateTime,WS,WD,Temp)
df$WS[WS>15] <- NA
I need to group by year (or, in this example, by month) to find out whether df$WS has at least 75% valid data for that month. Only NA counts as invalid, since 0 is still a valid observation. My real data contain genuine NAs because they are observational climate data.
I have tried dplyr piping with %>% to filter by a new column "Month", and I have reviewed several questions on here:
Calculate the percentages of a column in a data frame - "grouped" by column,
Making a data frame of count of NA by variable for multiple data frames in a list,
R group by date, and summarize the values
None of these have really answered my question.
My hope is to put something into a longer script that loops through all my stations, and all the years for each station, and produces a wind rose whenever this criterion is met for that year and station. Please let me know if I need to clarify more.
Cheers
There are many ways of doing this. This one seems quite instructive.
First create a new variable which will denote month (and account for year if you have more than one year). Split on this variable and count the number of NAs. Divide this by the number of values and multiply by 100 to get percentage points.
df$monthyear <- format(df$dateTime, format = "%m %Y")
out <- split(df, f = df$monthyear)
sapply(out, function(x) (sum(is.na(x$WS))/nrow(x)) * 100)
01 2012 02 2012 03 2012 04 2012 05 2012 06 2012 07 2012
23.92473 21.40805 24.09152 25.00000 20.56452 24.58333 27.15054
08 2012 09 2012 10 2012 11 2012 12 2012
22.31183 25.69444 23.22148 21.80556 24.96533
You could also use data.table.
library(data.table)
setDT(df)
df[, (sum(is.na(WS))/.N) * 100, by = monthyear]
monthyear V1
1: 01 2012 23.92473
2: 02 2012 21.40805
3: 03 2012 24.09152
4: 04 2012 25.00000
5: 05 2012 20.56452
6: 06 2012 24.58333
7: 07 2012 27.15054
8: 08 2012 22.31183
9: 09 2012 25.69444
10: 10 2012 23.22148
11: 11 2012 21.80556
12: 12 2012 24.96533
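Here is a method using dplyr. It will work even if some hourly rows are missing from the data altogether, because the denominator is the number of hours in the month rather than the number of rows.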
library(lubridate) #for the days_in_month function
library(dplyr)
df2 <- df %>% mutate(Month=format(dateTime,"%Y-%m")) %>%
group_by(Month) %>%
summarise(No.Obs=sum(!is.na(WS)),
Max.Obs=24*days_in_month(as.Date(paste0(first(Month),"-01")))) %>%
mutate(Obs.Rate=No.Obs/Max.Obs)
df2
Month No.Obs Max.Obs Obs.Rate
<chr> <int> <dbl> <dbl>
1 2012-01 575 744 0.7728495
2 2012-02 545 696 0.7830460
3 2012-03 560 744 0.7526882
4 2012-04 537 720 0.7458333
5 2012-05 567 744 0.7620968
6 2012-06 557 720 0.7736111
7 2012-07 553 744 0.7432796
8 2012-08 568 744 0.7634409
9 2012-09 546 720 0.7583333
10 2012-10 544 744 0.7311828
11 2012-11 546 720 0.7583333
12 2012-12 554 744 0.7446237
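To apply the 75% rule from the question, the per-month rate can be followed by a filter. The sketch below uses a tiny made-up data set (toy) so the result is easy to check by hand. Note that it takes the share of non-NA values among the rows present, which is a simpler denominator than the hours-in-month one used above.

```r
library(dplyr)

# Small deterministic stand-in for the hourly data (hypothetical values):
# January has 1 of 8 readings missing, February has 3 of 8 missing.
toy <- tibble(
  dateTime = as.POSIXct("2012-01-01", tz = "UTC") +
    c(0:7, 31 * 24 + 0:7) * 3600,
  WS = c(5, NA, 3, 8, 2, 9, 4, 7,
         NA, NA, 6, NA, 1, 2, 3, 4)
)

valid_months <- toy %>%
  mutate(Month = format(dateTime, "%Y-%m")) %>%
  group_by(Month) %>%
  summarise(Obs.Rate = mean(!is.na(WS))) %>%  # share of non-NA rows present
  filter(Obs.Rate >= 0.75)                    # keep months with >= 75% valid data

valid_months
```

Only January survives here (7 of 8 valid). For several stations and years, adding the station and year columns to group_by() should avoid an explicit loop, assuming such columns exist in your data.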
I want to spread this data below (first 12 rows shown here only) by the column 'Year', returning the sum of 'Orders' grouped by 'CountryName'. Then calculate the % change in 'Orders' for each 'CountryName' from 2014 to 2015.
CountryName Days pCountry Revenue Orders Year
United Kingdom 0-1 days India 2604.799 13 2014
Norway 8-14 days Australia 5631.123 9 2015
US 31-45 days UAE 970.8324 2 2014
United Kingdom 4-7 days Austria 94.3814 1 2015
Norway 8-14 days Slovenia 939.8392 3 2014
South Korea 46-60 days Germany 1959.4199 15 2014
UK 8-14 days Poland 1394.9096 6. 2015
UK 61-90 days Lithuania -170.8035 -1 2015
US 8-14 days Belize 1687.68 5 2014
Australia 46-60 days Chile 888.72 2. 0 2014
US 15-30 days Turkey 2320.7355 8 2014
Australia 0-1 days Hong Kong 672.1099 2 2015
I can make this work with a smaller test dataframe, but can only seem to return endless errors like 'sum not meaningful for factors' or 'duplicate identifiers for rows' with the full data. After hours of reading the dplyr docs and trying things I've given up. Can anyone help with this code...
data %>%
spread(Year, Orders) %>%
group_by(CountryName) %>%
summarise_all(.funs=c(Sum='sum'), na.rm=TRUE) %>%
mutate(percent_inc=100*((`2014_Sum`-`2015_Sum`)/`2014_Sum`))
The expected output would be a table similar to below. (Note: these numbers are for illustrative purposes, they are not hand calculated.)
CountryName percent_inc
UK 34.2
US 28.2
Norway 36.1
... ...
Edit
I had to make a few edits to the variable names, please note.
Sum first, while your data are still in long format, then spread. Here's an example with fake data:
library(dplyr)
library(tidyr)
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2014:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
spread(Year, sum_orders) %>%
mutate(Pct = (`2014` - `2015`)/`2014` * 100)
Country `2014` `2015` Pct
1 A 575 599 -4.173913
2 B 457 486 -6.345733
3 C 481 319 33.679834
4 D 423 481 -13.711584
5 E 528 551 -4.356061
If you have multiple years, it's probably easier to just keep it in long format until you're ready to make a nice output table:
set.seed(2)
dat = data.frame(Country=sample(LETTERS[1:5], 500, replace=TRUE),
Year = sample(2010:2015, 500, replace=TRUE),
Orders = sample(-1:20, 500, replace=TRUE))
dat %>% group_by(Country, Year) %>%
summarise(sum_orders = sum(Orders, na.rm=TRUE)) %>%
group_by(Country) %>%
arrange(Country, Year) %>%
mutate(Pct = c(NA, -diff(sum_orders))/lag(sum_orders) * 100)
Country Year sum_orders Pct
<fctr> <int> <int> <dbl>
1 A 2010 205 NA
2 A 2011 144 29.756098
3 A 2012 226 -56.944444
4 A 2013 119 47.345133
5 A 2014 177 -48.739496
6 A 2015 303 -71.186441
7 B 2010 146 NA
8 B 2011 159 -8.904110
9 B 2012 152 4.402516
10 B 2013 180 -18.421053
# ... with 20 more rows
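Two side notes on the code above. In current tidyr, spread() is superseded by pivot_wider(), and (2014 - 2015)/2014 is positive when orders fall, which is the opposite sign to the usual "percent change". Here is a sketch of the same idea with pivot_wider() and the conventional sign, assuming dplyr 1.0 or later and tidyr 1.0 or later; the orders table is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical order data with repeated Country/Year combinations.
orders <- tibble(
  Country = c("UK", "UK", "UK", "US", "US", "US"),
  Year    = c(2014, 2014, 2015, 2014, 2015, 2015),
  Orders  = c(10, 10, 30, 8, 2, 4)
)

result <- orders %>%
  group_by(Country, Year) %>%
  summarise(sum_orders = sum(Orders, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = Year, values_from = sum_orders) %>%
  # Conventional percent change: positive means orders grew from 2014 to 2015.
  mutate(pct_change = (`2015` - `2014`) / `2014` * 100)

result
```

Because the sums are taken before reshaping, every Country/Year pair is unique by the time it is widened, so the duplicate-identifier problem does not come up.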
This is not an answer because you haven't really asked a reproducible question, but just to help out.
Error 1: you're probably getting "duplicate identifiers for rows" because of spread. spread wants to make N columns out of your N unique values, but it needs to know which unique row to place those values in. If you have duplicate value combinations, for instance if:
CountryName Days pCountry Revenue
United Kingdom 0-1 days India 2604.799
United Kingdom 0-1 days India 2604.799
shows up twice, then spread cannot tell which row the data belong in. The quick fix is to add a row identifier before spreading: data %>% mutate(row=row_number()) %>% spread...
Error 2: you're probably getting "sum not meaningful for factors" because of summarise_all. summarise_all operates on all columns, but some columns contain strings (or factors). What does United Kingdom + United Kingdom equal? Try instead summarise(`2014_Sum` = sum(`2014`), `2015_Sum` = sum(`2015`)). The backticks are needed because these column names start with a digit.