How to select a row based on date conditions of another row? - r

I have df1:
State date fips score score1
1 Alabama 2020-03-24 1 242 0
2 Alabama 2020-03-26 1 538 3
3 Alabama 2020-03-28 1 720 4
4 Alabama 2020-03-21 1 131 0
5 Alabama 2020-03-15 1 23 0
6 Alabama 2020-03-18 1 51 0
7 Texas 2020-03-14 2 80 0
7 Texas 2020-03-16 2 102 0
7 Texas 2020-03-20 2 702 1
8 Texas 2020-03-23 2 1005 1
I would like to see which date a State surpasses a score of 100. I would then like to select the row 7 days after that date? For example, Alabama passes 100 on March 21st, so I would like to keep the March 28th data.
State date fips score score1
3 Alabama 2020-03-28 1 720 4
8 Texas 2020-03-23 2 1005 1

Here is a solution tidyverse and lubridate.
library(tidyverse)
library(lubridate)
df %>%
#Convert date column to date format
mutate_at(vars(date), ymd) %>%
#Group by State
group_by(State) %>%
#Ignore scores under 100
filter(score > 100) %>%
#Stay only with the date of the first date with score over 100 + 7 days
filter(date == min(date) + days(7))

Using a by approach (assuming date + 7 is available).
res <- do.call(rbind, by(dat, dat$state, function(x) {
st <- x[x$cases > 100, ]
st[as.Date(st$date) == as.Date(st$date[1]) + 7, ]
}))
head(res)
# date state fips cases deaths
# Alabama 2020-03-27 Alabama 1 639 4
# Alaska 2020-04-04 Alaska 2 169 3
# Arizona 2020-03-28 Arizona 4 773 15
# Arkansas 2020-03-28 Arkansas 5 409 5
# California 2020-03-15 California 6 478 6
# Colorado 2020-03-21 Colorado 8 475 6

Related

Fill in Column Based on Other Rows (R)

I am looking for a way to fill in a column in R based on values in a different column. Below is what my data looks like.
year
action
player
end
2001
1
Mike
2003
2002
0
Mike
NA
2003
0
Mike
NA
2004
0
Mike
NA
2001
0
Alan
NA
2002
0
Alan
NA
2003
1
Alan
2004
2004
0
Alan
NA
I would like to either change the "action" column or create a new column such that it reflects the duration between the "year" and "end" variables. Below is what it would look like:
year
action
player
end
2001
1
Mike
2003
2002
1
Mike
NA
2003
1
Mike
NA
2004
0
Mike
NA
2001
0
Alan
NA
2002
0
Alan
NA
2003
1
Alan
2004
2004
1
Alan
NA
I have tried to do this with the following loop:
i <- 0
z <- 0
for (i in 1:nrow(df)){
i <- z + i + 1
if (df[i, 2] == 0) {}
else {df[i, 5] = (df[i, 4] - df[i, 1])}
z <- df[i,5]
for (z in i:nrow(df)){df[i, 2] = 1}
}
Here, my i value is skyrocketing, breaking the loop. I am not sure why that is occuring. I'd be interested to either know how to fix my approach or how to do this in a smarter fashion.
There's no need for explicit loops here.
First group your data frame by player. Then find the rows where the cumulative sum (cumsum) of action is greater than 0 and the year is less than or equal to the end year of the group. If the row meets these conditions, set action to 1, otherwise to 0.
Using the dplyr package you could achieve this in a couple of lines:
library(dplyr)
df %>%
group_by(player) %>%
mutate(action = as.numeric(cumsum(action) > 0 & year <= na.omit(end)[1]))
#> # A tibble: 8 x 4
#> # Groups: player [2]
#> year action player end
#> <int> <dbl> <chr> <int>
#> 1 2001 1 Mike 2003
#> 2 2002 1 Mike NA
#> 3 2003 1 Mike NA
#> 4 2004 0 Mike NA
#> 5 2001 0 Alan NA
#> 6 2002 0 Alan NA
#> 7 2003 1 Alan 2004
#> 8 2004 1 Alan NA

summing a column based on values in two other columns

I have a data frame that lists individual mass shootings for each state between 1991-2020. I would like to 1) sum the total victims each year for each state, and 2) sum the total number of mass shootings each state had each year.
So far, I've only managed to get a total sum of victims between 1991-2020 for each state. And I'm not even sure how I could get a column with the total incidents per year, per state. Are there any adjustments I can make to the aggregate function, or is there some other function to get the information I want?
What I have:
combined = read.csv('https://raw.githubusercontent.com/bandcar/massShootings/main/combo1991_2020_states.csv')
> head(combined)
state date year fatalities injured total_victims
3342 Alabama 04/07/2009 2009 4 0 4
3351 Alabama 03/10/2009 2009 10 6 16
3285 Alabama 01/29/2012 2012 5 0 5
135 Alabama 12/28/2013 2013 3 5 8
267 Alabama 07/06/2013 2013 0 4 4
557 Alabama 06/08/2014 2014 1 4 5
q = aggregate(total_victims ~ state,data=combined,FUN=sum)
> head(q)
state total_victims
1 Alabama 364
2 Alaska 19
3 Arizona 223
4 Arkansas 205
5 California 1816
6 Colorado 315
What I want for each state for each year:
year state total_victims total_shootings
1 2009 Alabama 20 2
2 2012 Alabama 5 1
3 2013 Alabama 12 2
4 2014 Alabama 5 1
You can use group_by in combination with summarise() from the tidyverse packages.
library(tidyverse)
combined |>
group_by(state, year) |>
summarise(total_victims = sum(total_victims),
total_shootings = n())
This is the result you get:
# A tibble: 457 x 4
# Groups: state [52]
state year total_victims total_shootings
<chr> <int> <int> <int>
1 Alabama 2009 20 2
2 Alabama 2012 5 1
3 Alabama 2013 12 2
4 Alabama 2014 10 2
5 Alabama 2015 17 4

How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
|id | year | profit
|:----| :----| :----|
|1 |2001 |-0.4
|1 |2002 |-0.89
|2 |2001 |1.89
|2 |2002 |2.79
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start? Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the formula for ROC for the profit values of 2001 and 2002:
ROC=(current_value​/previous_value − 1) ∗ 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 500)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
This groups the data by id and then uses lag to compare the current year to the year prior

R - calculate annual population conditional on survival in every year

I have a data frame with three columns: birth_year, death_year, gender.
I have to calculate total alive male and female population for every year in a given range (1950:1980).
The data frame looks like this:
birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female
The person is alive in year x if death_year > x & birth year <= x
The output I am looking for is something like this:
year male female
1950 3 4
1951 2 3
1952 4 3
1953 4 5
.
.
1980 6 3
Thanks!
Does this work:
library(tidyr)
library(purrr)
library(dplyr)
df %>% mutate(year = map2(1950,1980, seq)) %>% unnest(year) %>%
mutate(isalive = case_when(year >= birth_year & year < death_year ~ 1, TRUE ~ 0)) %>%
group_by(year, gender) %>% summarise(alive = sum(isalive)) %>%
pivot_wider(names_from = gender, values_from = alive) %>% print( n = 50)
`summarise()` regrouping output by 'year' (override with `.groups` argument)
# A tibble: 31 x 3
# Groups: year [31]
year female male
<int> <dbl> <dbl>
1 1950 4 3
2 1951 4 3
3 1952 4 3
4 1953 4 3
5 1954 4 3
6 1955 4 3
7 1956 4 2
8 1957 4 2
9 1958 4 2
10 1959 4 2
11 1960 4 2
12 1961 4 2
13 1962 4 2
14 1963 4 2
15 1964 4 2
16 1965 4 2
17 1966 4 1
18 1967 4 1
19 1968 4 1
20 1969 4 1
21 1970 4 1
22 1971 4 1
23 1972 4 1
24 1973 4 1
25 1974 4 1
26 1975 4 1
27 1976 3 1
28 1977 3 1
29 1978 3 1
30 1979 3 1
31 1980 3 1
Data used:
df
# A tibble: 9 x 3
birth_year death_year gender
<dbl> <dbl> <chr>
1 1934 1988 male
2 1922 1993 female
3 1890 1966 male
4 1901 1956 male
5 1946 2009 female
6 1909 1976 female
7 1899 1945 male
8 1887 1949 male
9 1902 1984 female
Here's a simple base R solution. Summing a logical vector will get you your count of alive or dead because TRUE is 1 and FALSE is 0.
number_alive <- function(range, df){
sapply(range, function(x) sum((df$death_year > x) & (df$birth_year <= x)))
}
output <- data.frame('year' = 1950:1980,
'female' = number_alive(1950:1980, df[df$gender == 'female']),
'male' = number_alive(1950:1980, df[df$gender == 'male']))
# year female male
# 1 1950 4 3
# 2 1951 4 3
# 3 1952 4 3
# 4 1953 4 3
# 5 1954 4 3
# 6 1955 4 3
# 7 1956 4 2
# 8 1957 4 2
# 9 1958 4 2
# 10 1959 4 2
# 11 1960 4 2
# 12 1961 4 2
# 13 1962 4 2
# 14 1963 4 2
# 15 1964 4 2
# 16 1965 4 2
# 17 1966 4 1
# 18 1967 4 1
# 19 1968 4 1
# 20 1969 4 1
# 21 1970 4 1
# 22 1971 4 1
# 23 1972 4 1
# 24 1973 4 1
# 25 1974 4 1
# 26 1975 4 1
# 27 1976 3 1
# 28 1977 3 1
# 29 1978 3 1
# 30 1979 3 1
# 31 1980 3 1
This approach uses an ifelse to determine if alive (1) or dead (0).
Data:
df <- "birth_year death_year gender
1934 1988 male
1922 1993 female
1890 1966 male
1901 1956 male
1946 2009 female
1909 1976 female
1899 1945 male
1887 1949 male
1902 1984 female"
df <- read.table(text = df, header = TRUE)
Code:
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
df %>%
mutate(year = map2(1950,1980, seq)) %>%
unnest(year) %>%
select(year, birth_year, death_year, gender) %>%
mutate(
alive = ifelse(year >= birth_year & year <= death_year, 1, 0)
) %>%
group_by(year, gender) %>%
summarise(
is_alive = sum(alive)
) %>%
pivot_wider(
names_from = gender,
values_from = is_alive
) %>%
select(year, male, female)
Output:
#> # A tibble: 31 x 3
#> # Groups: year [31]
#> year male female
#> <int> <dbl> <dbl>
#> 1 1950 3 4
#> 2 1951 3 4
#> 3 1952 3 4
#> 4 1953 3 4
#> 5 1954 3 4
#> 6 1955 3 4
#> 7 1956 3 4
#> 8 1957 2 4
#> 9 1958 2 4
#> 10 1959 2 4
#> # … with 21 more rows
Created on 2020-11-11 by the reprex package (v0.3.0)

grouping data in R and summing by decade

I have the following dataset:
ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931
I need to summarise the data by 1920's and 1930's. So I need total points for ireland, england and france in the 1920-1922 and then another total point for ireland,england and france in 1930,1931.
Any ideas? I have tried but failed.
Dataset:
x <- read.table(text = "ireland england france
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931", header = T)
How about dividing the years by 10 and then summarizing?
library(dplyr)
x %>% mutate(decade = floor(year/10)*10) %>%
group_by(decade) %>%
summarize_all(sum) %>%
select(-year)
# A tibble: 2 x 5
# decade ireland england france
# <dbl> <int> <int> <int>
# 1 1920 15 8 7
# 2 1930 5 6 7
An R base solution
As A5C1D2H2I1M1N2O1R2T1 mentioned, you can use findIntervals() to set corresponding decade for each year and then, an aggregate() to group py decade
txt <-
"ireland england france year
5 3 2 1920
4 3 4 1921
6 2 1 1922
3 1 5 1930
2 5 2 1931"
df <- read.table(text=txt, header=T)
decades <- c(1920, 1930, 1940)
df$decade<- decades[findInterval(df$year, decades)]
aggregate(cbind(ireland,england,france) ~ decade , data = df, sum)
Output:
decade ireland england france
1 1920 15 8 7
2 1930 5 6 7

Resources