R cumulative sum using dplyr with reset

I am trying to make a table that counts the number of consecutive years, grouped by the columns "state" and "p", so that it looks like this:
data_right <- data.table(
  state = c("NY", "NY", "NY", "NY", "NY", "NY", "PA", "PA", "PA", "PA", "PA", "PA"),
  p = c("n", "n", "n", "n", "p", "p", "n", "n", "n", "p", "p", "p"),
  Year = c("1973", "1974", "1977", "1978", "1988", "1989", "1991", "1992", "1993", "1920", "1929", "1931"),
  Consecutive_Yrs = c(1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)
The code I am using right now is not working properly. I have been trying mutate() and group_by() statements in dplyr but am having no luck. I also cannot use the data.table package because my R version is not up to date.
Any help to get this output is greatly appreciated!

library(dplyr)
data_right %>%
  group_by(state, p) %>%
  mutate(grp = cumsum(c(TRUE, diff(as.integer(Year)) > 1))) %>%
  group_by(state, p, grp) %>%
  mutate(cy = row_number()) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 12 x 5
# state p Year Consecutive_Yrs cy
# <chr> <chr> <chr> <dbl> <int>
# 1 NY n 1973 1 1
# 2 NY n 1974 2 2
# 3 NY n 1977 1 1
# 4 NY n 1978 2 2
# 5 NY p 1988 1 1
# 6 NY p 1989 2 2
# 7 PA n 1991 1 1
# 8 PA n 1992 2 2
# 9 PA n 1993 3 3
# 10 PA p 1920 1 1
# 11 PA p 1929 1 1
# 12 PA p 1931 1 1
This assumes the data is already ordered by Year within each state/p group.
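If that ordering is not guaranteed, sorting first is a one-line addition. A minimal sketch reusing the pipeline above, with arrange() on the integer year to keep each state/p group chronological:

library(dplyr)

data_right %>%
  arrange(state, p, as.integer(Year)) %>%   # ensure chronological order within groups
  group_by(state, p) %>%
  mutate(grp = cumsum(c(TRUE, diff(as.integer(Year)) > 1))) %>%  # new run whenever the gap exceeds 1 year
  group_by(state, p, grp) %>%
  mutate(cy = row_number()) %>%             # running count within each consecutive run
  ungroup() %>%
  select(-grp)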
Data:
data_right <- data.table(
  state = c("NY", "NY", "NY", "NY", "NY", "NY", "PA", "PA", "PA", "PA", "PA", "PA"),
  p = c("n", "n", "n", "n", "p", "p", "n", "n", "n", "p", "p", "p"),
  Year = c("1973", "1974", "1977", "1978", "1988", "1989", "1991", "1992", "1993", "1920", "1929", "1931"),
  Consecutive_Yrs = c(1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 1, 1)
)

Related

How to Create a New Variable Based on a List of Vectors

In R, with
a) a list containing the regions (Northeast, South, North Central, West) that each state belongs to
regions <- list(
west = c("WA", "OR", "CA", "NV", "AZ", "ID", "MT", "WY",
"CO", "NM", "UT"),
south = c("TX", "OK", "AR", "LA", "MS", "AL", "TN", "KY",
"GA", "FL", "SC", "NC", "VA", "WV"),
midwest = c("KS", "NE", "SD", "ND", "MN", "MO", "IA", "IL",
"IN", "MI", "WI", "OH"),
northeast = c("ME", "NH", "NY", "MA", "RI", "VT", "PA",
"NJ", "CT", "DE", "MD", "DC")
)
And
b) a data.frame with States and Deaths
#A tibble:
state Deaths
<chr> <int>
1 AL 29549
2 AK 741
3 AR 50127
4 NJ 15142
5 CA 175213
6 IA 1647
...
I want to create a new variable matching each state to its region, and then summarize Deaths. What's the best approach to do this?
We may stack the list into a two-column data.frame and do a join
library(dplyr)
stack(regions) %>%
  left_join(df1, ., by = c("state" = "values")) %>%
  rename(region = 'ind')
-output
state Deaths region
1 AL 29549 south
2 AK 741 <NA>
3 AR 50127 south
4 NJ 15142 northeast
5 CA 175213 west
6 IA 1647 midwest
If df1 has duplicate rows, we may do a group-by summarise
stack(regions) %>%
  left_join(df1, ., by = c("state" = "values")) %>%
  group_by(state, region = ind) %>%
  summarise(Deaths = sum(Deaths, na.rm = TRUE), .groups = 'drop')
data
df1 <- structure(list(state = c("AL", "AK", "AR", "NJ", "CA", "IA"),
                      Deaths = c(29549L, 741L, 50127L, 15142L, 175213L, 1647L)),
                 class = "data.frame",
                 row.names = c("1", "2", "3", "4", "5", "6"))
What I did here was just turn the list into a data frame, with one column denoting the region and the other the states.
I then used a dplyr join, which "lines up" the rows of the two tables based on matching values. So here we line up the corresponding region for each state.
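For reference, the same result can also be written with right_join, keeping every row of df1 as the y side. A minimal sketch (the key column keeps the name "values" from the stacked table, hence the rename afterwards):

library(dplyr)

stack(regions) %>%                                  # two columns: values (state), ind (region)
  right_join(df1, by = c("values" = "state")) %>%   # keep every row of df1
  rename(state = values, region = ind)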

Keeping pairs of data that have the same date and time in R

I have 2 sets of data that look like this (this is a very small subset of it).
data1 <- data.frame(
  Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al"),
  Type = c("F", "F", "F", "F", "F", "F", "F"),
  Date = c("2000-01-01", "2000-01-01", "2000-01-02", "2000-01-03",
           "2000-01-03", "2000-01-07", "2000-01-07"),
  Time = c("11:00:00", "12:00:00", "15:00:00", "13:00:00",
           "17:00:00", "20:00:00", "20:00:00"),
  Value = c(100, 200, 300, 100, 400, 500, 500))
data2 <- data.frame(
  Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al"),
  Type = c("P", "P", "P", "P", "P", "P", "P"),
  Date = c("2000-01-01", "2000-01-01", "2000-01-01", "2000-01-03",
           "2000-01-03", "2000-01-04", "2000-01-07"),
  Time = c("11:00:00", "11:00:00", "14:00:00", "17:00:00",
           "13:00:00", "16:00:00", "20:00:00"),
  Value = c(100, 100, 200, 900, 100, 400, 999))
I want to keep the rows from both tables that have the same date and time and put them in a new table (data3). Sometimes there are duplicates within data1 and data2; I don't want data3 to contain those duplicates, just one of them together with its pair from the other table. I would also like the output table ordered so that the pairs from each table sit under each other (so my "Type" column would alternate F, P, F, P, etc.).
Here is my desired output
data3 <- data.frame(
  Metal = c("Al", "Al", "Al", "Al", "Al", "Al", "Al", "Al"),
  Type = c("F", "P", "F", "P", "F", "P", "F", "P"),
  Date = c("2000-01-01", "2000-01-01", "2000-01-03", "2000-01-03",
           "2000-01-03", "2000-01-03", "2001-01-07", "2001-01-07"),
  Time = c("11:00:00", "11:00:00", "13:00:00", "13:00:00",
           "17:00:00", "17:00:00", "20:00:00", "20:00:00"),
  Value = c(100, 100, 100, 100, 400, 900, 500, 999))
I have tried using various types of joins from dplyr, but they aren't joining the way I'd like them to.
Thank you for your help!!
We may need to bind the data and then filter out the duplicates after grouping
library(dplyr)
library(data.table)   # for rowid()

bind_rows(data1, data2, .id = 'grp') %>%
  group_by(Metal, Date, Time) %>%
  filter(n() > 1) %>%                   # keep only date/times that occur more than once
  arrange(Date, Time, rowid(grp)) %>%
  slice(match(c("F", "P"), Type)) %>%   # one F row and one P row per group, in that order
  ungroup %>%
  select(-grp)
-output
# A tibble: 8 × 5
Metal Type Date Time Value
<chr> <chr> <chr> <chr> <dbl>
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2000-01-07 20:00:00 500
8 Al P 2000-01-07 20:00:00 999
-OP's desired data3, for comparison (the 2001-01-07 dates here look like a typo in the question; data1 and data2 contain 2000-01-07, which is what the computed output above shows)
> data3
Metal Type Date Time Value
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2001-01-07 20:00:00 500
8 Al P 2001-01-07 20:00:00 999
This was not easy :-)
library(dplyr)
bind_rows(data1, data2) %>%
  group_by(Date, Time) %>%
  filter(n() > 1) %>%
  ungroup() %>%
  group_by(Type) %>%
  arrange(Time) %>%
  ungroup() %>%
  mutate(Flag = ifelse(Type == "P" & lag(Type, default = last(Type)) == "F", 1, NA)) %>%
  mutate(Flag1 = lead(Flag)) %>%
  filter(if_any(.cols = starts_with("Flag"), .fns = ~ . == 1)) %>%
  select(-starts_with("Flag"))
Metal Type Date Time Value
<chr> <chr> <chr> <chr> <dbl>
1 Al F 2000-01-01 11:00:00 100
2 Al P 2000-01-01 11:00:00 100
3 Al F 2000-01-03 13:00:00 100
4 Al P 2000-01-03 13:00:00 100
5 Al F 2000-01-03 17:00:00 400
6 Al P 2000-01-03 17:00:00 900
7 Al F 2000-01-07 20:00:00 500
8 Al P 2000-01-07 20:00:00 999
An approach with inner_join
The difficulty here is getting the output into the right shape; the filtering itself is already done by the inner_join.
library(dplyr)
library(tidyr)
joined <- inner_join(data1 %>% distinct(), data2 %>% distinct(),
                     by = c("Metal", "Date", "Time"))
joined
Metal Type.x Date Time Value.x Type.y Value.y
1 Al F 2000-01-01 11:00:00 100 P 100
2 Al F 2000-01-03 13:00:00 100 P 100
3 Al F 2000-01-03 17:00:00 400 P 900
4 Al F 2000-01-07 20:00:00 500 P 999
Arranging data
joined %>%
  pivot_longer(starts_with("Type"), values_to = "Type") %>%
  rowwise() %>%
  mutate(Value = c_across(starts_with("Value"))[c(F = 1, P = 2)[Type]]) %>%
  select(-contains("."), -name) %>%
  ungroup()
# A tibble: 8 × 5
Metal Date Time Type Value
<chr> <chr> <chr> <chr> <dbl>
1 Al 2000-01-01 11:00:00 F 100
2 Al 2000-01-01 11:00:00 P 100
3 Al 2000-01-03 13:00:00 F 100
4 Al 2000-01-03 13:00:00 P 100
5 Al 2000-01-03 17:00:00 F 400
6 Al 2000-01-03 17:00:00 P 900
7 Al 2000-01-07 20:00:00 F 500
8 Al 2000-01-07 20:00:00 P 999
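A note on the reshaping step: because the .x/.y suffix pairs belong together, tidyr's ".value" sentinel can pull Type and Value out in a single pivot_longer() call and avoid the rowwise() indexing. A sketch of that variant, assuming the joined table from above:

library(dplyr)
library(tidyr)

joined %>%
  pivot_longer(cols = ends_with(c(".x", ".y")),
               names_to = c(".value", "src"),   # ".value" keeps Type/Value as columns
               names_sep = "\\.") %>%
  select(-src)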

Calculate median grouping in multiple year increments R

I'm trying to use dplyr to calculate medians by grouping 3 different columns and in 3 year increments.
My data looks like this:
data <- data.frame(
  Year = c("1990", "1990", "1992", "1993", "1994", "1990", "1991", "1990", "1991", "1992", "1994", "1995"),
  Type = c("Al", "Al", "Al", "Al", "Al", "Al", "Al", "Cu", "Cu", "Cu", "Cu", "Cu"),
  Frac = c("F", "F", "F", "F", "F", "UF", "UF", "F", "F", "UF", "UF", "UF"),
  Value = c(0.1, 0.2, 0.3, 0.6, 0.7, 1.3, 1.5, 0.4, 0.2, 0.9, 2.3, 2.9))
I would like to calculate the median of "Value" in 3 year groupings and also grouping by "Type" and "Frac".
The problem is that sometimes a year is missing, so I want the 3-year increments to be based on the data that I have. To show what I mean with my example data, it would be grouped like this: (1990, 1992, 1993) for Al and F, then just (1994) for Al and F since there's no more data for that combination; then (1990, 1991) for Al and UF since there's only 2 years' worth of data. So basically I want groups of 3 years where possible, and whatever is left over otherwise.
This is the end table I would like to have:
stats_wanted <- data.frame(
  Year = c("1990, 1992, 1993", "1994", "1990, 1991", "1990, 1991", "1992, 1994, 1995"),
  Type = c("Al", "Al", "Al", "Cu", "Cu"),
  Frac = c("F", "F", "UF", "F", "UF"),
  Median = c(0.25, 0.7, 1.4, 0.3, 2.3))
Hopefully this makes sense... let me know if you have any questions :)!
I do not know dplyr, but here is a data.table solution.
library(data.table)
setDT(data)
data = data[order(Type,Frac,Year)]
# data = data[order(Year)] also works fine
data[
!duplicated(.SD,by=c('Year','Type','Frac')),
yeargroup:=0:(.N-1) %/% 3,
.(Type,Frac)]
# !duplicated... keeps only the first row per unique Year,Type,Frac
# 0:(.N-1) gives 0 to N-1 within each Type,Frac group
# %/% 3 is integer division, so each block of 3 distinct years shares one group number
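To see the grouping arithmetic in isolation:
0:4 %/% 3
#> [1] 0 0 0 1 1
The first three distinct years land in group 0, the next ones in group 1, and so on.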
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 NA <- NA because dupe Year,Type,Frac
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
# handle dupe Year,Type,Frac rows:
data[,yeargroup:=max(yeargroup,na.rm=T),.(Year,Type,Frac)]
> data
Year Type Frac Value yeargroup
1: 1990 Al F 0.1 0
2: 1990 Al F 0.2 0 <- fixed NA
3: 1992 Al F 0.3 0
4: 1993 Al F 0.6 0
5: 1994 Al F 0.7 1
6: 1990 Al UF 1.3 0
7: 1991 Al UF 1.5 0
8: 1990 Cu F 0.4 0
9: 1991 Cu F 0.2 0
10: 1992 Cu UF 0.9 0
11: 1994 Cu UF 2.3 0
12: 1995 Cu UF 2.9 0
stats_wanted = data[,
  .(Year = paste0(unique(Year), collapse = ', '), Median = median(Value)),
  .(Type, Frac, yeargroup)]
> stats_wanted
Type Frac yeargroup Year Median
1: Al F 0 1990, 1992, 1993 0.25
2: Al F 1 1994 0.70
3: Al UF 0 1990, 1991 1.40
4: Cu F 0 1990, 1991 0.30
5: Cu UF 0 1992, 1994, 1995 2.30
PS: @ronak-shah posted a concise dplyr solution, which inspired me to post another data.table solution that is even more concise:
> data[
order(Year),
.(Year,Value,group=(rleid(Year)-1)%/%3),
.(Type,Frac)
][,
.(Year=paste0(unique(Year),collapse=', '),Median=median(Value)),
.(Type,Frac,group)
]
Here's a dplyr solution -
For each Type and Frac, we create a group column which assigns the same number to every 3 distinct years. For each group, we concatenate the Year values and calculate the median.
library(dplyr)
data %>%
  group_by(Type, Frac) %>%
  mutate(group = match(Year, unique(Year)),
         group = ceiling(group / 3)) %>%
  group_by(group, .add = TRUE) %>%
  summarise(Year = toString(unique(Year)),
            Median = median(Value), .groups = 'drop') %>%
  select(Year, Type, Frac, Median)
# Year Type Frac Median
# <chr> <chr> <chr> <dbl>
#1 1990, 1992, 1993 Al F 0.25
#2 1994 Al F 0.7
#3 1990, 1991 Al UF 1.4
#4 1990, 1991 Cu F 0.3
#5 1992, 1994, 1995 Cu UF 2.3

Merge two datasets in R based on column values [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 2 years ago.
I have two datasets I would like to merge in R: one is a long catch dataset and the other is a small effort dataset. I would like to join these so that I can multiply values for the same year AND industry together. E.g., the small effort columns will be repeated many times over, as they are industry-wide characteristics. I think this is a very simple merge but am having trouble making it work!
Catch <- data.frame(
Species = c("a", "a", "c", "c", "a", "b"),
Industry= c( "ag", "fi", "ag", "fi", "ag", "fi" ),
Year = c("1990", "1990", "1991", "1992", "1990", "1990"),
Catch = c(0,1,4,7,5,6))
Effort<-data.frame(
Industry= c( "ag", "ag", "ag" , "fi", "fi", "fi"),
Year = c("1990", "1991", "1992", "1990", "1991", "1992"),
Effort = c(0,1,4,7,5,6))
What I have tried so far (this fails because by.x/by.y expect quoted column names, and joining on Year alone would ignore Industry):
effort_catch <- merge(Effort, Catch, by.x = Year, by.y = Year)
I am not sure which of these is what you need
transform(
merge(Catch, Effort, by = c("Industry", "Year"), all.x = TRUE),
prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 fi 1990 a 1 7 7
5 fi 1990 b 6 7 42
6 fi 1992 c 7 6 42
or
transform(
merge(Catch, Effort, by = c("Industry", "Year"), all = TRUE),
prod = Catch * Effort
)
Industry Year Species Catch Effort prod
1 ag 1990 a 0 0 0
2 ag 1990 a 5 0 0
3 ag 1991 c 4 1 4
4 ag 1992 <NA> NA 4 NA
5 fi 1990 a 1 7 7
6 fi 1990 b 6 7 42
7 fi 1991 <NA> NA 5 NA
8 fi 1992 c 7 6 42
Here's a solution using dplyr
library(dplyr)
full_join(Catch, Effort) %>%
  mutate(Multiplied = Catch * Effort)
#> Joining, by = c("Industry", "Year")
#> Species Industry Year Catch Effort Multiplied
#> 1 a ag 1990 0 0 0
#> 2 a fi 1990 1 7 7
#> 3 c ag 1991 4 1 4
#> 4 c fi 1992 7 6 42
#> 5 a ag 1990 5 0 0
#> 6 b fi 1990 6 7 42
#> 7 <NA> ag 1992 NA 4 NA
#> 8 <NA> fi 1991 NA 5 NA
Based on your provided data...
Catch <- data.frame(
Species = c("a", "a", "c", "c", "a", "b"),
Industry= c( "ag", "fi", "ag", "fi", "ag", "fi" ),
Year = c("1990", "1990", "1991", "1992", "1990", "1990"),
Catch = c(0,1,4,7,5,6))
Effort<-data.frame(
Industry= c( "ag", "ag", "ag" , "fi", "fi", "fi"),
Year = c("1990", "1991", "1992", "1990", "1991", "1992"),
Effort = c(0,1,4,7,5,6))
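As a side note, spelling the keys out silences the "Joining, by = ..." message from the dplyr answer above; a minimal variant:

library(dplyr)

full_join(Catch, Effort, by = c("Industry", "Year")) %>%   # explicit join keys
  mutate(Multiplied = Catch * Effort)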

Resetting the cumulative sum when a condition is met in R [duplicate]

This question already has answers here:
R Cumulative Sum with a condition and a reset
(3 answers)
Closed 3 years ago.
So I have a table that looks like this currently:
data_wrong <- data.table(State = c("NY", "NY", "NY", "NY", "PA", "PA", "PA",
"NJ", "NJ", "NJ"), Year = c("1973", "1974", "1975", "2005", "1992", "1993",
"2001", "1930", "1931", "1932"), Consecutive_Yrs = c(1,2,3,1,1,6,1,1,9,10))
And I'd like it to look like this:
data <- data.table(State = c("NY", "NY", "NY", "NY", "PA", "PA", "PA", "NJ",
"NJ", "NJ"), Year = c("1973", "1974", "1975", "2005", "1992", "1993",
"2001", "1930", "1931", "1932"), Consecutive_Yrs = c(1,2,3,1,1,2,1,1,2,3))
This is the code I'm using right now to get my table:
data$diff <- NA
data <- data %>%
  group_by(State) %>%
  arrange(State) %>%
  mutate(diff = Year - lag(Year, default = first(Year)))
# note: Year is character here, so Year - lag(Year) errors without as.numeric()
data$Consecutive_Yrs <- 1
data$Consecutive_Yrs <- ifelse(data$diff == 1, cumsum(data$Consecutive_Yrs), 1)
Any help would be greatly appreciated :)
As it is a data.table, an option is to use data.table methods
library(data.table)
data_wrong[, grp := cumsum(c(TRUE, diff(as.numeric(Year)) > 1)),
           .(State)][, Consecutive_Yrs := as.numeric(seq_len(.N)), .(State, grp)]
data_wrong
# State Year Consecutive_Yrs grp
# 1: NY 1973 1 1
# 2: NY 1974 2 1
# 3: NY 1975 3 1
# 4: NY 2005 1 2
# 5: PA 1992 1 1
# 6: PA 1993 2 1
# 7: PA 2001 1 2
# 8: NJ 1930 1 1
# 9: NJ 1931 2 1
#10: NJ 1932 3 1
Or use rowid
data_wrong[, Consecutive_Yrs2 := rowid(rleid(as.numeric(Year) -
             shift(as.numeric(Year), fill = as.numeric(Year[1])) > 1)), .(State)]
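Since the OP was already working in dplyr, the same reset logic can be written there, mirroring the pattern from the first question above; a minimal sketch:

library(dplyr)

data_wrong %>%
  group_by(State) %>%
  mutate(grp = cumsum(c(TRUE, diff(as.numeric(Year)) > 1))) %>%  # new run whenever the year gap exceeds 1
  group_by(State, grp) %>%
  mutate(Consecutive_Yrs = row_number()) %>%
  ungroup() %>%
  select(-grp)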
