summing a column based on values in two other columns - r

I have a data frame that lists individual mass shootings for each state between 1991-2020. I would like to 1) sum the total victims each year for each state, and 2) sum the total number of mass shootings each state had each year.
So far, I've only managed to get a total sum of victims between 1991-2020 for each state. And I'm not even sure how I could get a column with the total incidents per year, per state. Are there any adjustments I can make to the aggregate function, or is there some other function to get the information I want?
What I have:
combined = read.csv('https://raw.githubusercontent.com/bandcar/massShootings/main/combo1991_2020_states.csv')
> head(combined)
state date year fatalities injured total_victims
3342 Alabama 04/07/2009 2009 4 0 4
3351 Alabama 03/10/2009 2009 10 6 16
3285 Alabama 01/29/2012 2012 5 0 5
135 Alabama 12/28/2013 2013 3 5 8
267 Alabama 07/06/2013 2013 0 4 4
557 Alabama 06/08/2014 2014 1 4 5
q = aggregate(total_victims ~ state,data=combined,FUN=sum)
> head(q)
state total_victims
1 Alabama 364
2 Alaska 19
3 Arizona 223
4 Arkansas 205
5 California 1816
6 Colorado 315
What I want for each state for each year:
year state total_victims total_shootings
1 2009 Alabama 20 2
2 2012 Alabama 5 1
3 2013 Alabama 12 2
4 2014 Alabama 5 1

You can use group_by in combination with summarise() from the tidyverse packages.
library(tidyverse)
combined |>
group_by(state, year) |>
summarise(total_victims = sum(total_victims),
total_shootings = n())
This is the result you get:
# A tibble: 457 x 4
# Groups: state [52]
state year total_victims total_shootings
<chr> <int> <int> <int>
1 Alabama 2009 20 2
2 Alabama 2012 5 1
3 Alabama 2013 12 2
4 Alabama 2014 10 2
5 Alabama 2015 17 4
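Since the question asks whether aggregate itself can be adjusted, here is a minimal base R sketch of the same per-state, per-year summary. The toy data frame copies the six Alabama rows shown in the question; the helper column total_shootings (a 1 for every incident) is an assumption added for illustration, not a column of the original file.

```r
# Toy copy of the six rows printed by head(combined) in the question
toy <- data.frame(
  state         = rep("Alabama", 6),
  year          = c(2009, 2009, 2012, 2013, 2013, 2014),
  total_victims = c(4, 16, 5, 8, 4, 5)
)

# Hypothetical helper column: one incident per row, so summing it counts rows
toy$total_shootings <- 1

# Sum both columns within every state/year combination
res <- aggregate(cbind(total_victims, total_shootings) ~ state + year,
                 data = toy, FUN = sum)
res
```

Adding year to the right-hand side of the formula is the only change needed to go from per-state totals to per-state, per-year totals.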


Is it possible to make groups based on an ID of a person in R?

I have this data:
data <- data.frame(id_pers=c(4102,13102,27101,27102,28101,28102, 42101,42102,56102,73102,74103,103104,117103,117104,117105),
birthyear=c(1992,1994,1993,1992,1995,1999,2000,2001,2000, 1994, 1999, 1978, 1986, 1998, 1999))
I want to group the persons by family in a new column, so that persons 27101 and 27102 (siblings) are group/family 1, 42101 and 42102 are group 2, 117103, 117104 and 117105 are group 3, and so on.
Person "4102" has no siblings and should be a NA in the new column.
Two or more persons are always siblings if their IDs differ by no more than 6.
I have a far larger dataset with over 3000 rows. How could I do it the most efficient way?
You can group on id_pers rounded to the nearest 10 using round with digits = -1 (or digits = -2 if a family can have more than 10 members). If you want the family id to be consecutive integers starting from 1, you can use cur_group_id:
library(dplyr)
data %>%
group_by(fam_id = round(id_pers - 5, digits = -1)) %>%
mutate(fam_gp = cur_group_id())
output
# A tibble: 15 × 4
# Groups: fam_id [10]
id_pers birthyear fam_id fam_gp
<dbl> <dbl> <dbl> <int>
1 4102 1992 4100 1
2 13102 1994 13100 2
3 27101 1993 27100 3
4 27102 1992 27100 3
5 28101 1995 28100 4
6 28102 1999 28100 4
7 42101 2000 42100 5
8 42102 2001 42100 5
9 56102 2000 56100 6
10 73102 1994 73100 7
11 74103 1999 74100 8
12 103104 1978 103100 9
13 117103 1986 117100 10
14 117104 1998 117100 10
15 117105 1999 117100 10
It looks like we can use the 1000s digit (and above) to delineate groups.
library(dplyr)
data %>%
mutate(
famgroup = trunc(id_pers/1000),
famgroup = match(famgroup, unique(famgroup))
)
# id_pers birthyear famgroup
# 1 4102 1992 1
# 2 13102 1994 2
# 3 27101 1993 3
# 4 27102 1992 3
# 5 28101 1995 4
# 6 28102 1999 4
# 7 42101 2000 5
# 8 42102 2001 5
# 9 56102 2000 6
# 10 73102 1994 7
# 11 74103 1999 8
# 12 103104 1978 9
# 13 117103 1986 10
# 14 117104 1998 10
# 15 117105 1999 10
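Neither answer marks persons without siblings as NA, which the question asks for. The following is a base R sketch of one way to do that, under the assumption that "not further apart than 6" can be read as a chain: after sorting, an ID belongs to the same family as the previous ID when the gap is at most 6. The column name family is hypothetical.

```r
data <- data.frame(
  id_pers = c(4102, 13102, 27101, 27102, 28101, 28102, 42101, 42102,
              56102, 73102, 74103, 103104, 117103, 117104, 117105)
)

# Work on sorted IDs so that the gap to the previous ID is meaningful
ord <- order(data$id_pers)
run <- cumsum(c(TRUE, diff(data$id_pers[ord]) > 6))  # new run when gap > 6

# Put the run ids back in the original row order
grp <- integer(nrow(data))
grp[ord] <- run

# Runs with a single member have no siblings: mark them NA
size <- ave(grp, grp, FUN = length)
grp[size < 2] <- NA

# Renumber the remaining families 1, 2, 3, ... in order of appearance
data$family <- match(grp, unique(grp[!is.na(grp)]))
data
```

Because everything is vectorised, this should scale comfortably to a few thousand rows.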

r data.table adjust min and max years only if each set has at least one incrementing obs

I have a data set that holds an id, location, start year, end year, age1 and age2. For each group defined by id, location, age1 and age2, I would like to create a new start and end year. For instance, I may have three entries for china encompassing age 0 - age 4. One is 2000 - 2000, another is 2001 - 2001, and the last is 2005 - 2005. Since the years increment by 1 in the first two entries, I'd want their corresponding newstart and newend to be 2000 - 2001. The third entry would have newstart == 2005 and newend == 2005, as it is not a part of a continuous set of years.
The data table I have resembles the following, except that it has thousands of entries and many combinations:
id location start end age1 age2
1 brazil 2000 2000 0 4
1 brazil 2001 2001 0 4
1 brazil 2002 2002 0 4
2 argentina 1990 1991 1 1
2 argentina 1991 1991 2 2
2 argentina 1992 1992 2 2
2 argentina 1993 1993 2 2
3 belize 2001 2001 0.5 1
3 belize 2005 2005 1 2
I want to alter the data table so that it will look like the following
id location start end age1 age2 newstart newend
1 brazil 2000 2000 0 4 2000 2002
1 brazil 2001 2001 0 4 2000 2002
1 brazil 2002 2002 0 4 2000 2002
2 argentina 1990 1991 1 1 1991 1991
2 argentina 1991 1991 2 2 1991 1993
2 argentina 1992 1992 2 2 1991 1993
2 argentina 1993 1993 2 2 1991 1993
3 belize 2001 2001 0.5 1 2001 2001
3 belize 2005 2005 1 2 2005 2005
I have tried creating a variable that tracks the difference of the previous year and the current year using lag and then calculating the difference between these two years. I then created the newstart and newend by placing the min start and max end. I have found that this only works if there is a set of 2 in continuous years. If I have a larger set, this doesn't work as it has no way of tracking the number of obs in which the years increase by 1 for each grouping. I believe I need some type of loop.
Is there a more efficient way to accomplish this?
data.table
You tagged with data.table, so my first suggestion is this:
library(data.table)
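# start a new run whenever the gap to the previous start year is not 1
# (assumes rows are sorted by start within each id)
dat[, contiguous := cumsum(c(TRUE, diff(start) != 1)), by = .(id)]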
dat[, c("newstart", "newend") := .(min(start), max(end)), by = .(id, contiguous)]
dat[, contiguous := NULL]
dat
# id location start end age1 age2 newstart newend
# 1: 1 brazil 2000 2000 0.0 4 2000 2002
# 2: 1 brazil 2001 2001 0.0 4 2000 2002
# 3: 1 brazil 2002 2002 0.0 4 2000 2002
# 4: 2 argentina 1990 1991 1.0 1 1990 1993
# 5: 2 argentina 1991 1991 2.0 2 1990 1993
# 6: 2 argentina 1992 1992 2.0 2 1990 1993
# 7: 2 argentina 1993 1993 2.0 2 1990 1993
# 8: 3 belize 2001 2001 0.5 1 2001 2001
# 9: 3 belize 2005 2005 1.0 2 2005 2005
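To make the run-labelling step concrete, here is a small base R sketch on a bare vector of years with two gaps. The vector is made up for illustration and is assumed to be sorted. A cumulative sum over the "gap is not 1" flags gives each contiguous run its own id, after which min and max per run give the new bounds.

```r
# Hypothetical sorted start years: three runs (2000-2001, 2005-2006, 2010)
years <- c(2000, 2001, 2005, 2006, 2010)

# TRUE marks the first year of each run; cumsum turns the marks into run ids
run_id <- cumsum(c(TRUE, diff(years) != 1))

# Broadcast each run's first and last year back onto its rows
newstart <- ave(years, run_id, FUN = min)
newend   <- ave(years, run_id, FUN = max)

data.frame(years, run_id, newstart, newend)
```

The same expression can be used per id inside a grouped operation, as long as the rows are ordered by year within each group.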
base R
If instead you really just mean data.frame, then
dat <- transform(dat, contiguous = ave(start, id, FUN = function(a) cumsum(c(TRUE, diff(a) != 1))))
dat <- transform(dat,
newstart = ave(start, id, contiguous, FUN = min),
newend = ave(end , id, contiguous, FUN = max)
)
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to min; returning Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
# Warning in FUN(X[[i]], ...) :
# no non-missing arguments to max; returning -Inf
dat
# id location start end age1 age2 newstart newend contiguous
# 1 1 brazil 2000 2000 0.0 4 2000 2002 1
# 2 1 brazil 2001 2001 0.0 4 2000 2002 1
# 3 1 brazil 2002 2002 0.0 4 2000 2002 1
# 4 2 argentina 1990 1991 1.0 1 1990 1993 1
# 5 2 argentina 1991 1991 2.0 2 1990 1993 1
# 6 2 argentina 1992 1992 2.0 2 1990 1993 1
# 7 2 argentina 1993 1993 2.0 2 1990 1993 1
# 8 3 belize 2001 2001 0.5 1 2001 2001 1
# 9 3 belize 2005 2005 1.0 2 2005 2005 2
dat$contiguous <- NULL
Interesting point I just learned about ave: it uses interaction(...) (all grouping variables), which is going to give all possible combinations, not just the combinations observed in the data. Because of that, the FUNction may be called with zero data. In this case, it did, giving the warnings. One could suppress this with function(a) suppressWarnings(min(a)) instead of just min.
We could use dplyr. After grouping by 'id', take the difference between 'start' and the lag of 'start', start a new run whenever that difference is not 1 (the cumulative sum of those breaks gives the run id), and create 'newstart' and 'newend' as the min of 'start' and the max of 'end' within each run.
library(dplyr)
library(tidyr) # for replace_na
df1 %>%
group_by(id) %>%
group_by(grp = cumsum(replace_na(start - lag(start), 0) != 1),
.add = TRUE) %>%
mutate(newstart = min(start), newend = max(end))
Output:
# A tibble: 9 x 9
# Groups: id, grp [4]
# id location start end age1 age2 grp newstart newend
# <int> <chr> <int> <int> <dbl> <int> <int> <int> <int>
#1 1 brazil 2000 2000 0 4 1 2000 2002
#2 1 brazil 2001 2001 0 4 1 2000 2002
#3 1 brazil 2002 2002 0 4 1 2000 2002
#4 2 argentina 1990 1991 1 1 1 1990 1993
#5 2 argentina 1991 1991 2 2 1 1990 1993
#6 2 argentina 1992 1992 2 2 1 1990 1993
#7 2 argentina 1993 1993 2 2 1 1990 1993
#8 3 belize 2001 2001 0.5 1 1 2001 2001
#9 3 belize 2005 2005 1 2 2 2005 2005
Or with data.table
library(data.table)
setDT(df1)[, grp := cumsum(c(TRUE, diff(start) != 1)), by = id
][, c('newstart', 'newend') := .(min(start), max(end)), by = .(id, grp)][, grp := NULL]

Calculating percentage change of panel data for other entities

I have a very large data frame that takes the form of panel data. The data has economic information on production for each industry within countries for a range of years. I would like code that calculates the year-to-year percentage change in this output within each industry and then, for each row, aggregates those changes over all countries other than that row's own.
This is difficult to explain, so here is an example. Using this code:
panel <- cbind.data.frame(industry = rep(c("Logging" , "Automobile") , each = 9) ,
country = rep(c("Austria" , "Belgium" , "Croatia") , each = 3 , times = 2) ,
year = rep(c(2000:2002) , times = 6) ,
output = c(2,3,4,1,5,8,1,2,4,2,3,4,6,7,8,9,10,11))
That gives this data frame:
industry country year output
1 Logging Austria 2000 2
2 Logging Austria 2001 3
3 Logging Austria 2002 4
4 Logging Belgium 2000 1
5 Logging Belgium 2001 5
6 Logging Belgium 2002 8
7 Logging Croatia 2000 1
8 Logging Croatia 2001 2
9 Logging Croatia 2002 4
10 Automobile Austria 2000 2
11 Automobile Austria 2001 3
12 Automobile Austria 2002 4
13 Automobile Belgium 2000 6
14 Automobile Belgium 2001 7
15 Automobile Belgium 2002 8
16 Automobile Croatia 2000 9
17 Automobile Croatia 2001 10
18 Automobile Croatia 2002 11
I compute percentage changes per industry using tidyverse:
library(tidyverse)
panel <- panel %>%
group_by(country , industry) %>%
mutate(per_change = (output - lag(output)) / lag(output))
giving:
# A tibble: 18 x 5
# Groups: country, industry [6]
industry country year output per_change
<fct> <fct> <int> <dbl> <dbl>
1 Logging Austria 2000 2 NA
2 Logging Austria 2001 3 0.5
3 Logging Austria 2002 4 0.333
4 Logging Belgium 2000 1 NA
5 Logging Belgium 2001 5 4
6 Logging Belgium 2002 8 0.6
7 Logging Croatia 2000 1 NA
8 Logging Croatia 2001 2 1
9 Logging Croatia 2002 4 1
10 Automobile Austria 2000 2 NA
11 Automobile Austria 2001 3 0.5
12 Automobile Austria 2002 4 0.333
13 Automobile Belgium 2000 6 NA
14 Automobile Belgium 2001 7 0.167
15 Automobile Belgium 2002 8 0.143
16 Automobile Croatia 2000 9 NA
17 Automobile Croatia 2001 10 0.111
18 Automobile Croatia 2002 11 0.1
So I would like code that gives NA for row 1; for row 2, the sum of the percentage changes for the logging industry in 2001 in all countries except Austria (4 + 1 = 5); for row 3, the sum of the percentage changes for the logging industry in 2002 except Austria (0.6 + 1 = 1.6); for row 4, NA again; for row 5, the sum of the percentage changes for logging in 2001 except Belgium (0.5 + 1 = 1.5); and so on.
I wouldn't know how to do this other than by hand.
The code should also be flexible enough to handle N countries and Y industries.
You can do this in three steps:
first, group the "panel" table by industry and year and sum "per_change";
second, join this grouped table back onto your main table;
lastly, subtract each row's own "per_change" from the grouped sum.
After your code:
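# drop the grouping so that aggregate() and the join see a plain data frame
d1 <- as.data.frame(panel)
# sum per_change over all countries within each industry and year
# (the formula interface drops the NA rows for the first year)
d2 <- aggregate(per_change ~ industry + year, data = d1, FUN = sum)
library(dplyr)
# per_change.x is the row's own change, per_change.y the industry/year total
panel <- left_join(d1, d2, by = c("industry", "year"))
panel$exc_per_change <- panel$per_change.y - panel$per_change.x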
output is
> head(panel)
industry country year output per_change.x per_change.y exc_per_change
1 Logging Austria 2000 2 NA NA NA
2 Logging Austria 2001 3 0.5000000 5.500000 5.000000
3 Logging Austria 2002 4 0.3333333 1.933333 1.600000
4 Logging Belgium 2000 1 NA NA NA
5 Logging Belgium 2001 5 4.0000000 5.500000 1.500000
6 Logging Belgium 2002 8 0.6000000 1.933333 1.333333
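The same "group total minus own value" idea can be sketched without a join, using ave from base R. This is only an illustration on the question's example data; it assumes the rows are sorted by year within each country and industry, and the helper name pct is hypothetical.

```r
panel <- data.frame(
  industry = rep(c("Logging", "Automobile"), each = 9),
  country  = rep(c("Austria", "Belgium", "Croatia"), each = 3, times = 2),
  year     = rep(2000:2002, times = 6),
  output   = c(2, 3, 4, 1, 5, 8, 1, 2, 4, 2, 3, 4, 6, 7, 8, 9, 10, 11)
)

# Year-to-year relative change; the first year of each series has no predecessor
pct <- function(x) c(NA, diff(x) / head(x, -1))
panel$per_change <- ave(panel$output, panel$country, panel$industry, FUN = pct)

# Total change over all countries for each industry and year ...
total <- ave(panel$per_change, panel$industry, panel$year, FUN = sum)

# ... minus the row's own change leaves the sum over the other countries
panel$exc_per_change <- total - panel$per_change
head(panel)
```

Nothing here depends on the number of countries or industries, so it extends to N countries and Y industries unchanged.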

How to select a row based on date conditions of another row?

I have df1:
State date fips score score1
1 Alabama 2020-03-24 1 242 0
2 Alabama 2020-03-26 1 538 3
3 Alabama 2020-03-28 1 720 4
4 Alabama 2020-03-21 1 131 0
5 Alabama 2020-03-15 1 23 0
6 Alabama 2020-03-18 1 51 0
7 Texas 2020-03-14 2 80 0
7 Texas 2020-03-16 2 102 0
7 Texas 2020-03-20 2 702 1
8 Texas 2020-03-23 2 1005 1
I would like to find the date on which each State first surpasses a score of 100, and then select the row 7 days after that date. For example, Alabama passes 100 on March 21st, so I would like to keep the March 28th data.
State date fips score score1
3 Alabama 2020-03-28 1 720 4
8 Texas 2020-03-23 2 1005 1
Here is a solution using tidyverse and lubridate.
library(tidyverse)
library(lubridate)
df1 %>%
#Convert date column to date format
mutate(date = ymd(date)) %>%
#Group by State
group_by(State) %>%
#Ignore scores of 100 or less
filter(score > 100) %>%
#Keep only the row dated 7 days after the first date with a score over 100
filter(date == min(date) + days(7))
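Using a by approach (assuming date + 7 is available and the rows are sorted by date within each state). Here dat is a larger data set with columns state and cases in place of State and score.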
res <- do.call(rbind, by(dat, dat$state, function(x) {
st <- x[x$cases > 100, ]
st[as.Date(st$date) == as.Date(st$date[1]) + 7, ]
}))
head(res)
# date state fips cases deaths
# Alabama 2020-03-27 Alabama 1 639 4
# Alaska 2020-04-04 Alaska 2 169 3
# Arizona 2020-03-28 Arizona 4 773 15
# Arkansas 2020-03-28 Arkansas 5 409 5
# California 2020-03-15 California 6 478 6
# Colorado 2020-03-21 Colorado 8 475 6
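For comparison, here is a base R sketch run on the question's own df1, which is not sorted by date. It works on dates as day counts so that the first date above the threshold can be found per State with ave. The intermediate names day, day_over and first_over are hypothetical.

```r
df1 <- data.frame(
  State = c(rep("Alabama", 6), rep("Texas", 4)),
  date  = as.Date(c("2020-03-24", "2020-03-26", "2020-03-28", "2020-03-21",
                    "2020-03-15", "2020-03-18", "2020-03-14", "2020-03-16",
                    "2020-03-20", "2020-03-23")),
  score = c(242, 538, 720, 131, 23, 51, 80, 102, 702, 1005)
)

day <- as.numeric(df1$date)                        # days since 1970-01-01
day_over <- ifelse(df1$score > 100, day, Inf)      # ignore rows at or below 100
first_over <- ave(day_over, df1$State, FUN = min)  # first day above 100, per State

# Keep the row exactly 7 days after that first day (none if it is missing)
res <- df1[day == first_over + 7, ]
res
```

A State that never passes 100 gets Inf as its first day and therefore contributes no rows.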

Add lines with NA values

I have a data frame like this:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2005 hiv 3
4 1 Italy 2000 cancer 4
5 1 Italy 2001 cancer 5
6 1 Italy 2002 cancer 6
7 1 Italy 2003 cancer 7
8 1 Italy 2004 cancer 8
9 1 Italy 2005 cancer 9
10 4 France 2000 hiv 10
11 4 France 2004 hiv 11
12 4 France 2005 hiv 12
13 4 France 2001 cancer 13
14 4 France 2002 cancer 14
15 4 France 2003 cancer 15
16 4 France 2004 cancer 16
17 2 Spain 2000 hiv 17
18 2 Spain 2001 hiv 18
19 2 Spain 2002 hiv 19
20 2 Spain 2003 hiv 20
21 2 Spain 2004 hiv 21
22 2 Spain 2005 hiv 22
23 2 Spain ... ... ...
indx is a value linked to the country (same country = same indx).
In this example I used only 3 countries (country) and 2 diseases (death); in the original data frame there are many more.
I would like to have one row for each country for each disease from 2000 to 2005.
What I would like to get is:
indx country year death value
1 1 Italy 2000 hiv 1
2 1 Italy 2001 hiv 2
3 1 Italy 2002 hiv NA
4 1 Italy 2003 hiv NA
5 1 Italy 2004 hiv NA
6 1 Italy 2005 hiv 3
7 1 Italy 2000 cancer 4
8 1 Italy 2001 cancer 5
9 1 Italy 2002 cancer 6
10 1 Italy 2003 cancer 7
11 1 Italy 2004 cancer 8
12 1 Italy 2005 cancer 9
13 4 France 2000 hiv 10
14 4 France 2001 hiv NA
15 4 France 2002 hiv NA
16 4 France 2003 hiv NA
17 4 France 2004 hiv 11
18 4 France 2005 hiv 12
19 4 France 2000 cancer NA
20 4 France 2001 cancer 13
21 4 France 2002 cancer 14
22 4 France 2003 cancer 15
23 4 France 2004 cancer 16
24 4 France 2005 cancer NA
25 2 Spain 2000 hiv 17
26 2 Spain 2001 hiv 18
27 2 Spain 2002 hiv 19
28 2 Spain 2003 hiv 20
29 2 Spain 2004 hiv 21
30 2 Spain 2005 hiv 22
31 2 Spain ... ... ...
I.e. I would like to add lines with value = NA at the missing years for each country for each disease.
For example, the HIV data for Italy is missing for 2002 to 2004, so I would add those rows with value = NA.
How can I do that?
For a reproducible example:
indx <- c(rep(1, times=9), rep(4, times=7), rep(2, times=6))
country <- c(rep("Italy", times=9), rep("France", times=7), rep("Spain", times=6))
year <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death <- c(rep("hiv", times=3), rep("cancer", times=6), rep("hiv", times=3), rep("cancer", times=4), rep("hiv", times=6))
value <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)
Using base R, you could do:
# setDF(dfl) # run this first if you have a data.table
merge(expand.grid(lapply(dfl[c("country", "death", "year")], unique)), dfl, all.x = TRUE)
This first creates all combinations of the unique values in country, death, and year and then merges it to the original data, to add the values and where combinations were not in the original data, it adds NAs.
In the package tidyr, there's a special function that does this for you with a single command:
library(tidyr)
complete(dfl, country, year, death)
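One thing to watch with both one-liners: the added rows get NA in indx as well as in value, because indx is not part of the expanded grid. Below is a base R sketch that keeps each indx paired with its country while expanding. The intermediate names keys, grid and full are hypothetical, and the year range 2000 to 2005 is taken from the question.

```r
indx    <- c(rep(1, times = 9), rep(4, times = 7), rep(2, times = 6))
country <- c(rep("Italy", times = 9), rep("France", times = 7), rep("Spain", times = 6))
year    <- c(2000, 2001, 2005, 2000:2005, 2000, 2004, 2005, 2001:2004, 2000:2005)
death   <- c(rep("hiv", times = 3), rep("cancer", times = 6), rep("hiv", times = 3),
             rep("cancer", times = 4), rep("hiv", times = 6))
value   <- c(1:22)
dfl <- data.frame(indx, country, year, death, value)

# One row per (indx, country) pair, so the two stay linked
keys <- unique(dfl[c("indx", "country")])

# merge() with no shared columns yields the Cartesian product
grid <- merge(keys, expand.grid(death = unique(dfl$death),
                                year  = as.numeric(2000:2005)))

# Left join the observed values; missing combinations get value = NA
full <- merge(grid, dfl, all.x = TRUE)
full <- full[order(full$indx, full$death, full$year), ]
head(full, 12)
```

With 3 countries, 2 causes and 6 years this gives 36 rows, 14 of which have value = NA.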
Here is a longer base R method. You create two new data.frames, one that contains all combinations of the country, year, and death, and a second that contains an index key.
# get data.frame with every combination of country, year, and death
dfNew <- with(df, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death)))
# get index key
indexKey <- unique(df[, c("indx", "country")])
# merge these together
dfNew <- merge(indexKey, dfNew, by="country")
# merge onto original data set
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
This returns
dfNew
indx country year death value
1 1 Italy 2000 cancer 4
2 1 Italy 2000 hiv 1
3 1 Italy 2001 cancer 5
4 1 Italy 2001 hiv 2
5 1 Italy 2002 cancer 6
6 1 Italy 2002 hiv NA
7 1 Italy 2003 cancer 7
8 1 Italy 2003 hiv NA
9 1 Italy 2004 cancer 8
10 1 Italy 2004 hiv NA
11 1 Italy 2005 cancer 9
12 1 Italy 2005 hiv 3
13 2 Spain 2000 cancer NA
14 2 Spain 2000 hiv 17
15 2 Spain 2001 cancer NA
...
If df is a data.table, here are the corresponding lines of code:
# CJ is a cross-join
setkey(df, country, year, death)
dfNew <- df[CJ(country, year, death, unique=TRUE),
.(country, year, death, value)]
indexKey <- unique(df[, .(indx, country)])
dfNew <- merge(indexKey, dfNew, by="country")
dfNew <- merge(df, dfNew, by=c("indx", "country", "year", "death"), all=TRUE)
Note that rather than using CJ, it is also possible to use expand.grid as in the data.frame version:
dfNew <- df[, expand.grid("country"=unique(country), "year"=unique(year),
"death"=unique(death))]
tidyr::complete creates all combinations of the variables you pass it, but if two columns carry the same information (here indx and country), passing both will over-expand, and passing only one leaves NAs in the other. As a workaround you can use dplyr grouping (df %>% group_by(indx, country) %>% complete(death, year)) or just merge the two columns into one temporarily:
library(tidyr)
# merge indx and country into a single column so they won't over-expand
df %>% unite(indx_country, indx, country) %>%
# fill in missing combinations of new column, death, and year
complete(indx_country, death, year) %>%
# separate indx and country back to how they were
separate(indx_country, c('indx', 'country'))
# Source: local data frame [36 x 5]
#
# indx country death year value
# (chr) (chr) (fctr) (int) (int)
# 1 1 Italy cancer 2000 4
# 2 1 Italy cancer 2001 5
# 3 1 Italy cancer 2002 6
# 4 1 Italy cancer 2003 7
# 5 1 Italy cancer 2004 8
# 6 1 Italy cancer 2005 9
# 7 1 Italy hiv 2000 1
# 8 1 Italy hiv 2001 2
# 9 1 Italy hiv 2002 NA
# 10 1 Italy hiv 2003 NA
# .. ... ... ... ... ...
