R - dplyr- Reducing data package 'storms'

I am working with dplyr and its built-in 'storms' dataset.
I need a table that lists every measured storm, and then I want to give each storm an ID.
So far I have:
storm_ID <- storms %>%
select(year,month,name) %>%
group_by(year,month,name) %>%
summarise(ID = n())
storm_ID
View(storm_ID)
The problem is that this doesn't give me what I need.
I don't quite understand how I can see every single storm in the table. I had previously sorted them by name, which gives me 214 storms. However, storms with the same name occur in several years.
In the end I want something like:
name | year | month | day | ID
| | | | |
Zeta 2005 12 31 Zeta1
Zeta 2006 1 1 Zeta1
| | | | |
Zeta 2020 10 24 Zeta2
To do this, I need to know whether a storm spanned two years (e.g. from 2005-12-31 to 2006-01-01); such a storm should be counted only once.
After that I should be able to evaluate the duration, wind speed difference and pressure difference per storm, which I had already computed, but with the wrong sorting.
Any help would be appreciated.
Thanks in advance.

If records with the same name count as one storm as long as they continue from day to day without long gaps (days without a record of that name), then the following code might be what you want.
The variable Thresh sets the maximum gap, in days, between consecutive records of the same name for them to still be counted as the same storm.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
count(name, year, month, day) %>%
group_by(name) %>%
mutate(Date = as.Date(ISOdate(year, month, day)),
DDiff = c(0, diff(Date)) > Thresh,
DDiff = cumsum(DDiff)) %>%
group_by(name, DDiff) %>%
mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
ungroup() %>%
group_by(name) %>%
summarise(name = first(name),
year = first(year),
n = sum(n))
#> # A tibble: 512 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 AL011993 1993 8
#> 2 AL012000 2000 4
#> 3 AL021992 1992 5
#> 4 AL021994 1994 6
#> 5 AL021999 1999 4
#> 6 AL022000 2000 12
#> 7 AL022001 2001 5
#> 8 AL022003 2003 4
#> 9 AL022006 2006 5
#> 10 AL031987 1987 32
#> # ... with 502 more rows
Created on 2022-04-15 by the reprex package (v2.0.1)
Edit
After seeing the OP's answer, I have revised mine and they are now nearly identical.
The main difference is that, even with the allowed gap Thresh set to 5, storm Dorian is split in two: it has 5 consecutive days without records, between July 27th, 2013 and August 2nd, 2013, so the gap between records is 6 days. It should still be considered one storm. To get this result, increase Thresh to an appropriate value, for instance 30 (days); the outputs then match.
I have left it like this to show this point and to show what the variable Thresh is meant for.
In the code that follows, I assign the result of my code above to the data.frame rui. The OP's result is cbind'ed with id, piped to a count instruction and saved in storm_count. The two outputs are then compared for differences with anti_join, after removing the id suffix from my name column.
suppressPackageStartupMessages(library(dplyr))
data("storms", package = "dplyr")
Thresh <- 5
storms %>%
count(name, year, month, day) %>%
group_by(name) %>%
mutate(Date = as.Date(ISOdate(year, month, day)),
DDiff = c(0, diff(Date)) > Thresh,
DDiff = cumsum(DDiff)) %>%
group_by(name, DDiff) %>%
mutate(name = ifelse(DDiff > 0, paste(name, cur_group_id(), sep = "."), name)) %>%
ungroup() %>%
group_by(name) %>%
summarise(name = first(name),
year = first(year),
n = sum(n)) -> rui
id <- c()
j <- 1
k <- 1
for(i in storms$name) {
if(k-1 == 0){
id <- append(id, j)
k <- k+1
next
}
if(i != storms$name[k-1])
{
j <- j+1
}
id <- append(id, j)
k <- k+1
}
cbind(storms, id) %>%
count(name, id) -> storm_count
# two rows
anti_join(
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
storm_count,
by = c("name", "n")
)
#> # A tibble: 2 x 3
#> name year n
#> <chr> <dbl> <int>
#> 1 Dorian 2013 16
#> 2 Dorian 2013 4
# just one row
anti_join(
storm_count,
rui %>% mutate(name = sub("\\.\\d+$", "", name)),
by = c("name", "n")
)
#> name id n
#> 1 Dorian 397 20
# see here the dates of 2013-07-27 and 2013-08-02
storms %>%
filter(name == "Dorian", year == 2013) %>%
count(name, year, month, day)
#> # A tibble: 7 x 5
#> name year month day n
#> <chr> <dbl> <dbl> <int> <int>
#> 1 Dorian 2013 7 23 1
#> 2 Dorian 2013 7 24 4
#> 3 Dorian 2013 7 25 4
#> 4 Dorian 2013 7 26 4
#> 5 Dorian 2013 7 27 3
#> 6 Dorian 2013 8 2 1
#> 7 Dorian 2013 8 3 3
Created on 2022-04-15 by the reprex package (v2.0.1)
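To see what Thresh does in isolation, here is a minimal base R sketch that uses only the Dorian dates listed above. The helper name run_id is made up for this illustration; it applies the same cumsum(c(0, diff(Date)) > Thresh) idea as the pipeline.

```r
# Dates on which Dorian (2013) has records, copied from the output above
dorian <- as.Date(c("2013-07-23", "2013-07-24", "2013-07-25", "2013-07-26",
                    "2013-07-27", "2013-08-02", "2013-08-03"))

# run_id() is a hypothetical helper: it starts a new run whenever the gap
# to the previous record is larger than `thresh` days
run_id <- function(dates, thresh) {
  gaps <- c(0, as.numeric(diff(dates)))  # gap in days to the previous record
  cumsum(gaps > thresh)
}

run_id(dorian, thresh = 5)   # the 6-day gap exceeds 5, so a second run starts
run_id(dorian, thresh = 30)  # no gap exceeds 30, so everything is one run
```

With thresh = 5 the last two dates fall into a second run, which is why Dorian shows up as two storms until Thresh is raised.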

For your first problem:
storm_ID <- storms %>%
select(year,month,name) %>%
group_by(year,month,name) %>%
mutate(ID = stringr::str_c(name, cur_group_id()))
This creates a unique storm-name ID, e.g. Amy1, Amy2, etc.
This is how you can check whether a storm has occurred in consecutive years:
storms %>%
group_by(name) %>%
mutate(consec_helper = cumsum(c(1, diff(year) != 1))) %>%
group_by(name, consec_helper) %>%
filter(n() > 1)
I find this to be true only for Zeta
name year
<chr> <dbl>
1 Zeta 2005
2 Zeta 2006
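The consec_helper expression is easier to follow on a small vector. The sketch below uses made-up years for a single name (not taken from storms) and base R only.

```r
# Made-up years for one storm name, in row order
year <- c(2005, 2005, 2006, 2006, 2020)

# diff(year) != 1 is TRUE wherever the year does NOT advance by exactly one,
# so the counter increases on every row except the one that crosses into
# the following year
consec_helper <- cumsum(c(1, diff(year) != 1))
consec_helper

# Helper values shared by more than one row mark the year boundary
# (the same idea as filter(n() > 1) in the grouped pipeline)
boundary <- consec_helper %in% consec_helper[duplicated(consec_helper)]
year[boundary]
```

Only the last row of one year and the first row of the next share a helper value, so the filter returns exactly the pair of rows that straddles the year change.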

Thanks for your approaches; unfortunately, none of them is quite the solution I need.
I asked my professor for help (I didn't expect an answer), and he said I could solve it with a loop. So I went through the names in order and checked where they change. The dataset is sorted by date, so Zeta does not appear in consecutive rows unless it is the same storm.
My current solution is:
install.packages("dplyr")
library(dplyr)
id <- c()
j <- 1
k <- 1
for(i in storms$name) {
if(k-1 == 0){
id <- append(id, j)
k <- k+1
next
}
if(i != storms$name[k-1])
{
j <- j+1
}
id <- append(id, j)
k <- k+1
}
storms <- cbind(storms, id)
View(storms)
I have now manually checked the dataset and think it is the appropriate solution to my problem.
This brings me to 511 different storms. (As of 22-04-15)
Nevertheless, thank you for all the solutions, I appreciate it very much.
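The loop assigns a new id whenever the name differs from the name in the previous row. As a sketch, the same numbering can be computed without a loop in base R. The vector below is made up, and storm_run_id is a hypothetical name.

```r
# Made-up sequence of names in date order; "Zeta" returns later as a new storm
name <- c("Zeta", "Zeta", "Alpha", "Alpha", "Alpha", "Zeta")

# storm_run_id() is a hypothetical helper: TRUE marks the first row and every
# row whose name differs from the row before it; cumsum() numbers the runs
storm_run_id <- function(x) {
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

id <- storm_run_id(name)
id
```

Applied to the data, this would be storm_run_id(storms$name). Like the loop, it relies on the rows being in date order.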

Related

How do I remove observations within 7 days of each other within a specific ID group?

ID Date
101 10-17-2021
101 10-19-2021
101 10-20-2021
101 10-31-2021
101 11-01-2021
For each ID I want to remove observations that are within 7 days of each other. I want to keep the earliest date of the dates that are within 7 days of each other. So in this case I would want to keep "10-17-2021" and "10-31-2021". This process would continue until I have unique dates for each ID that are at least 7 days apart and do not contain other dates in between.

You can do it using the group_by() and slice() functions, but first the Date column must be converted with as.Date(). Here is the code to remove observations within a 7-day interval and keep only the earliest date:
library(tidyverse)
df$Date <- as.Date(df$Date, format = "%m-%d-%Y")
df %>%
group_by(ID) %>%
slice(c(1, which(c(0, diff(Date)) >= 7)))
output
ID Date
101 2021-10-17
101 2021-10-31

In your example, you can't evaluate every observation independently because some of them may be removed when compared to the first value. Perhaps I'm not thinking about it the right way, but I think you need a loop to do this. Here's what I came up with (note: I made the sequence of dates longer to make sure it works):
library(dplyr)
d <- tibble(
ID = 101,
Date = seq(lubridate::mdy("01-01-2023"),
lubridate::mdy("02-07-2023"), by="days")
)
i <- 1
while(i < nrow(d)){
d <- d %>% mutate(diff = Date - d$Date[i])
d <- d %>% filter(diff <= 0 | diff > 7)
if(i < nrow(d)){
i <- i+1
}
}
d <- d %>% select(-diff)
d
#> # A tibble: 5 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)
Essentially, what happens is that the loop initializes with the first observation and removes every observation within seven days. If more observations remain, it increments the counter and moves to the next day and evaluates all subsequent dates from there, keeping everything that came before.
These loops are difficult to do in the tidyverse, but you could split the data by group, run the loop on each group and then put the groups back together. Here's an example:
library(dplyr)
d <- tibble(
ID = 101,
Date = seq(lubridate::mdy("01-01-2023"),
lubridate::mdy("02-07-2023"), by="days")
)
d2 <- d %>% mutate(ID = 102)
alldat <- bind_rows(d, d2)
split_dat <- alldat %>%
group_by(ID) %>%
group_split()
result <- purrr::map(split_dat, function(d){
i <- 1
while(i < nrow(d)){
d <- d %>% mutate(diff = Date - d$Date[i])
d <- d %>% filter(diff <= 0 | diff > 7)
if(i < nrow(d)){
i <- i+1
}
}
d <- d %>% select(-diff)
d
})
result <- bind_rows(result)
result
#> # A tibble: 10 × 2
#> ID Date
#> <dbl> <date>
#> 1 101 2023-01-01
#> 2 101 2023-01-09
#> 3 101 2023-01-17
#> 4 101 2023-01-25
#> 5 101 2023-02-02
#> 6 102 2023-01-01
#> 7 102 2023-01-09
#> 8 102 2023-01-17
#> 9 102 2023-01-25
#> 10 102 2023-02-02
Created on 2023-02-08 by the reprex package (v2.0.1)

You can try using a recursive function as in this answer.
f <- function(d, ind = 1) {
ind.next <- dplyr::first(which(difftime(d, d[ind], units="days") > 7))
if (is.na(ind.next))
return(ind)
else
return(c(ind, f(d, ind.next)))
}
Starting from the first date, the function finds the next index ind.next where the date is more than 7 days later. It then recurses from that index to find the next such date, and in the end returns all the selected row indexes.
To use this function, group_by(ID) and then slice() to retain the rows whose indexes it returns.
library(dplyr)
df %>%
group_by(ID) %>%
slice(f(Date))
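The same greedy selection can also be written as a plain loop, which avoids deep recursion on long date vectors. This is only a sketch in base R: keep_spaced is a hypothetical helper, it assumes the dates are sorted in ascending order within each ID, and it uses the same "more than 7 days" rule as the function above.

```r
# Example data from the question
df <- data.frame(
  ID = 101,
  Date = as.Date(c("10-17-2021", "10-19-2021", "10-20-2021",
                   "10-31-2021", "11-01-2021"), format = "%m-%d-%Y")
)

# keep_spaced() is a hypothetical helper. It assumes `dates` is sorted in
# ascending order and keeps a date only if it lies more than `min_gap` days
# after the most recently kept date.
keep_spaced <- function(dates, min_gap = 7) {
  keep <- logical(length(dates))
  last_kept <- NULL
  for (i in seq_along(dates)) {
    if (is.null(last_kept) || as.numeric(dates[i] - last_kept) > min_gap) {
      keep[i] <- TRUE
      last_kept <- dates[i]
    }
  }
  keep
}

# Run the helper separately for each ID and map the flags back to the rows
keep <- unsplit(lapply(split(df$Date, df$ID), keep_spaced), df$ID)
result <- df[keep, ]
result
```

For the question's data this keeps 2021-10-17 and 2021-10-31, the two dates the question asks for.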

How to create a new column that specifies which range of years a date belongs to (like academic year)?

In some cases, a "year" doesn't necessarily cycle from January 1st. For example, academic year starts at the end of August in the US. Another example is the NBA season.
My question: given data containing a date column, I want to create another column that refers to which period it falls in. For example, consider that we are given the following tib:
library(lubridate, warn.conflicts = FALSE)
library(tibble)
tib <- tibble(my_dates = as_date(c("1999-01-01", "2010-08-09", "2010-09-02", "1995-03-02")))
tib
#> # A tibble: 4 x 1
#> my_dates
#> <date>
#> 1 1999-01-01
#> 2 2010-08-09
#> 3 2010-09-02
#> 4 1995-03-02
and we want to mutate a column that refers to the academic year each date belongs to, provided that the academic year starts on August 31st:
desired_output <-
tib %>%
add_column(belongs_to_school_year = c("1998-1999", "2009-2010", "2010-2011", "1994-1995"))
desired_output
#> # A tibble: 4 x 2
#> my_dates belongs_to_school_year
#> <date> <chr>
#> 1 1999-01-01 1998-1999
#> 2 2010-08-09 2009-2010
#> 3 2010-09-02 2010-2011
#> 4 1995-03-02 1994-1995
How can I create the column belongs_to_school_year using mutate(), based on my_dates?

You can use dplyr and lubridate for this:
desired_output <- tib %>%
mutate(school_year = case_when(month(my_dates) <= 8 ~ paste(year(my_dates)-1, year(my_dates), sep = "-"),
month(my_dates) > 8 ~ paste(year(my_dates), year(my_dates)+1, sep = "-")))
or:
desired_output <- tib %>%
mutate(school_year = if_else(month(my_dates) <= 8,
paste(year(my_dates)-1, year(my_dates), sep = "-"),
paste(year(my_dates), year(my_dates)+1, sep = "-")))
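Both versions above switch years at the start of September, so a date of August 31st itself would be assigned to the previous academic year. If the exact start day matters, one possible sketch in base R compares month and day together. The function name school_year and the parameters start_month and start_day are made up for this illustration.

```r
# school_year() is a hypothetical helper; start_month and start_day are
# illustrative parameters giving the first day of the academic year
school_year <- function(dates, start_month = 8, start_day = 31) {
  y  <- as.integer(format(dates, "%Y"))
  md <- as.integer(format(dates, "%m%d"))  # e.g. 831 for August 31st
  # Dates on or after the start day belong to the year that begins then
  start <- ifelse(md >= start_month * 100 + start_day, y, y - 1L)
  paste(start, start + 1L, sep = "-")
}

# The four dates from the question plus the boundary date itself
my_dates <- as.Date(c("1999-01-01", "2010-08-09", "2010-09-02",
                      "1995-03-02", "2010-08-31"))
labels <- school_year(my_dates)
labels
```

Inside mutate() this could be used as mutate(belongs_to_school_year = school_year(my_dates)).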

Weighted mean of a group, where weight is from another group

Suppose you have a long data.frame of the following form:
ID Group Year Field VALUE
1 1 2016 AA 10
2 1 2016 AA 16
1 1 2016 TOTAL 100
2 1 2016 TOTAL 120
etc..
and you want to create a grouped output of weighted.mean(VALUE, ??) for each group_by(Group, Year, Field), using the VALUE of Field == TOTAL as the weight, for years > 2013.
So far I am using dplyr:
dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,VALUE[Field == 'TOTAL'])) %>%
ungroup()
The problem, as I understand it, is that once I group by Field, the rows with Field == 'TOTAL' are no longer visible inside another group such as Field == 'AA', so I cannot refer to them there.
Transforming the data from long to wide is not a solution, as I have more than 1000 different Field values, which may increase over time, and this code will be run daily at some point.

First of all, this is a hacky solution, and I am sure there is a better approach to this issue. The goal is to make a new column containing the weights, and this approach does so using the filling nature of left_join(), but I am sure you could do this with fill() or across().
library(tidyverse)
#> Warning: package 'tidyverse' was built under R version 4.0.3
# Example data from OP
dat <- data.frame(ID = c(1,2,1,2), Group = rep(1,4), Year = rep(2016,4),Field = c("AA","AA","TOTAL","TOTAL"), VALUE = c(10,16,100,120))
# Make a new dataframe containing the TOTAL values
weights <- dat %>% filter(Field == "TOTAL") %>% mutate(w = VALUE) %>% select(-Field,-VALUE)
weights
#> ID Group Year w
#> 1 1 1 2016 100
#> 2 2 1 2016 120
# Make a new frame containing the original values and the weights
new_dat <- left_join(dat,weights, by = c("Group","Year","ID"))
# Add a column for weight
new_dat %>%
filter(Year>2013) %>%
group_by(Group, Year, Field) %>%
summarize(m = weighted.mean(VALUE,w)) %>%
ungroup()
#> `summarise()` regrouping output by 'Group', 'Year' (override with `.groups` argument)
#> # A tibble: 2 x 4
#> Group Year Field m
#> <dbl> <dbl> <chr> <dbl>
#> 1 1 2016 AA 13.3
#> 2 1 2016 TOTAL 111.
Created on 2020-11-03 by the reprex package (v0.3.0)
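The join-then-summarise idea does not depend on the tidyverse. As a cross-check, here is a base R sketch on the same example data. For AA the expected value is (10*100 + 16*120) / (100 + 120) ≈ 13.27, in line with the output above. The object names totals, merged and groups are only illustrative.

```r
dat <- data.frame(ID = c(1, 2, 1, 2), Group = rep(1, 4), Year = rep(2016, 4),
                  Field = c("AA", "AA", "TOTAL", "TOTAL"),
                  VALUE = c(10, 16, 100, 120))

# One weight per ID/Group/Year, taken from the TOTAL rows
totals <- dat[dat$Field == "TOTAL", c("ID", "Group", "Year", "VALUE")]
names(totals)[names(totals) == "VALUE"] <- "w"

# Attach the weight to every row, then keep only the requested years
merged <- merge(dat, totals, by = c("ID", "Group", "Year"))
merged <- merged[merged$Year > 2013, ]

# Weighted mean within each Group/Year/Field combination
groups <- split(merged, list(merged$Group, merged$Year, merged$Field), drop = TRUE)
m <- sapply(groups, function(g) weighted.mean(g$VALUE, g$w))
m
```

The result is a named vector with one entry per Group.Year.Field combination.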

Deleting duplicated rows based on condition (position)

I have a dataset that looks something like this
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
I have multiple observations for each id and time identifier, e.g. I have 3 different Alpha 1970 values. I would like to retain only one observation per id/year, namely the last one that appears for each id/year.
The final dataset should look something like this:
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can approach the problem?
Thanks a lot in advance for your help

If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group.

How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
Created on 2019-09-29 by the reprex package (v0.3.0)
This translates to "within each id-Year group, take only the row where the row number is equal to the size of the group, i.e. it's the last row under the current ordering."
You could also use either filter(), e.g. filter(row_number() == n()), or distinct() (and then you wouldn't even have to group), e.g. distinct(id, Year, .keep_all = TRUE) - but distinct functions take the first distinct row, so you'd need to reverse the row ordering here first.
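Reversing the row order is not the only way to make a "first distinct row" tool pick the last row. As a base R sketch of the same idea, duplicated() has a fromLast argument that scans from the bottom instead. The name last_rows is only illustrative.

```r
df <- data.frame(id = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
                 Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
                 Val = c(2, 3, -2, 5, 2, 5, 3, 5))

# duplicated(..., fromLast = TRUE) scans from the bottom, so every row except
# the last one of each id/Year pair is flagged as a duplicate
last_rows <- df[!duplicated(df[c("id", "Year")], fromLast = TRUE), ]
last_rows
```

The original row order is preserved, so no re-sorting is needed afterwards.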

An option with base R
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row
aggregate(Val ~ ., df, head, 1)

max([column])where name = (each unique name in the name column) for each year in R

I am using the baby names data in R for practice.
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
The dataframe output looked like this:
For each name, there's all the years, and the related pct for that year. I am stuck with getting the year with the highest pct for each name. How do I do this?

Pretty simple, once you know where the babynames data comes from. You had everything needed:
library(dplyr)
library(babynames)
total_n <-babynames %>%
mutate(name_gender = paste(name,sex))%>%
group_by(year) %>%
summarise(total_n = sum(n, na.rm=TRUE)) %>%
arrange(total_n)
bn <- inner_join(babynames,total_n,by = "year")
df <- bn%>%
mutate(pct_of_names = n/total_n)%>%
group_by(name, year)%>%
summarise(pct =sum(pct_of_names))
You were missing this final step:
df %>%
group_by(name) %>%
filter(pct == max(pct))
# A tibble: 95,025 x 3
# Groups: name [95,025]
name year pct
<chr> <dbl> <dbl>
1 Aaban 2014 4.338256e-06
2 Aabha 2014 2.440269e-06
3 Aabid 2003 1.316094e-06
4 Aabriella 2015 1.363073e-06
5 Aada 2015 1.363073e-06
6 Aadam 2015 5.997520e-06
7 Aadan 2009 6.031433e-06
8 Aadarsh 2014 4.880538e-06
9 Aaden 2009 3.335645e-04
10 Aadesh 2011 1.370356e-06
# ... with 95,015 more rows
group_by and filter are your friends.
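One detail worth knowing: pct == max(pct) keeps every row that ties for the maximum, so a name can appear more than once. The sketch below shows this on made-up data (not babynames) in base R, along with one way to keep a single row per name. The names toy, peak and peak_first are only illustrative.

```r
# Made-up data in the same shape as df (name, year, pct)
toy <- data.frame(name = c("Ann", "Ann", "Ann", "Bob", "Bob"),
                  year = c(2000, 2001, 2002, 2000, 2001),
                  pct  = c(0.10, 0.30, 0.20, 0.05, 0.05))

# ave() returns, for each row, the maximum pct within that row's name
group_max <- ave(toy$pct, toy$name, FUN = max)

# Same condition as filter(pct == max(pct)): ties are all kept
peak <- toy[toy$pct == group_max, ]
peak

# Keep only the earliest year when a name has tied maxima
peak_first <- peak[!duplicated(peak$name), ]
peak_first
```

Here Bob has the same pct in both years, so peak contains two Bob rows while peak_first keeps only the first.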