Count of crashes and injuries? - r

I have a dataset from dot.gov website that I have to analyze as part of our school project. It contains a lot of information, but I am just focusing on crashes and injuries. How do I count the number of crashes or injuries from the year 2007-2014 for example?
Do I have to subset my data per year or is there a more efficient way to do it? Thank you!
Below is a sample of my dataset:

Without a reproducible example of your dataset to test against, it is hard to be sure this will work, but with the dplyr and lubridate packages you can try the following (assuming your dataset is called df and that YEARTXT holds dates that ymd() can parse):
library(dplyr)
library(lubridate)
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
  mutate(Year = year(YEARTXT)) %>%
  filter(Year %in% 2007:2014) %>%
  # na.rm = TRUE so that missing values do not turn the totals into NA
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y", na.rm = TRUE))
To get the counts of crashes and injuries per year, add group_by to the same sequence:
df %>% mutate(YEARTXT = ymd(YEARTXT)) %>%
  mutate(Year = year(YEARTXT)) %>%
  group_by(Year) %>%
  filter(Year %in% 2007:2014) %>%
  # na.rm = TRUE so that missing values do not turn the totals into NA
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y", na.rm = TRUE))
If this is not working, please provide a reproducible example of your dataset: How to make a great R reproducible example
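For illustration, here is a minimal sketch of the same idea on a made-up data frame. The toy rows, and the assumption that YEARTXT is stored as text in year-month-day order, are mine and not taken from the real dot.gov file, so adjust the parsing step to whatever your column actually contains.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data; the real dataset will have many more columns.
df <- data.frame(
  YEARTXT = c("20060115", "20070310", "20070822", "20101201", "20140704", "20150101"),
  CRASH   = c("Y", "Y", "N", "Y", NA, "Y"),
  INJURED = c(2, 1, 0, NA, 3, 5),
  stringsAsFactors = FALSE
)

# Keep only 2007-2014, dropping the 2006 and 2015 rows.
in_range <- df %>%
  mutate(Year = year(ymd(YEARTXT))) %>%
  filter(Year %in% 2007:2014)

# One row with the overall totals for the whole period.
totals <- in_range %>%
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y", na.rm = TRUE))

# One row per year.
by_year <- in_range %>%
  group_by(Year) %>%
  summarise(INJURED = sum(INJURED, na.rm = TRUE),
            CRASH = sum(CRASH == "Y", na.rm = TRUE))
```

So there is no need to subset the data year by year: a single filter plus an optional group_by covers both the overall and the per-year counts.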

Related

Having trouble summarising data in R studio

I have a data set of species and how many were observed, however when I try to take the average for each species, I get the mean for the individual observation, so nothing is changing.
Along the top row of my data are the species names, and going down the column underneath them is the count. I am trying to summarize this data so I may plot it, but rather than the mean of the entire column being taken, it just takes the mean of the individual observation.
grassData <- read.csv("Dykebooke.csv", header = TRUE, sep = ",")
View(grassData)
summary.grass <- grassData %>% group_by(Cordgrass) %>%
summarise(mean = mean(Cordgrass), variance = var(Cordgrass))
summary.lavender <- grassData %>% group_by(Lavender) %>%
summarise(mean(Lavender), var(Lavender))
summary.goldenrod <- grassData %>% group_by(Goldenrod) %>%
summarise(mean(Goldenrod), var(Goldenrod))
summary.crab <- grassData %>% group_by(Crab) %>%
summarise(mean(Crab), var(Crab))
summary.iva <- grassData %>% group_by(Iva) %>%
summarise(mean(Iva), var(Iva))
summary.grasshopper <- grassData %>% group_by(Grasshopper) %>%
summarise(mean(Grasshopper), var(Grasshopper))
This is what I have done so far, but each summary just returns the individual observations.
I have not used R in a few years, so I am very rusty; any help is appreciated.
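A likely cause is the group_by() call: grouping by Cordgrass makes every distinct count its own group, so the mean within each group is just that count. Below is a minimal sketch of one way around it, assuming every column in the file is a numeric species count. The toy values are made up, and pivot_longer() comes from tidyr, which the original code does not load.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Dykebooke.csv: one column per species.
grassData <- data.frame(
  Cordgrass = c(2, 4, 6),
  Lavender  = c(1, 1, 4)
)

# Reshape to one row per (species, count), then summarise per species.
summary_grass <- grassData %>%
  pivot_longer(everything(), names_to = "species", values_to = "count") %>%
  group_by(species) %>%
  summarise(mean = mean(count), variance = var(count))
```

The result has one row per species, which is usually the shape a plotting function expects.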

R - Difference between two rows conditional on variables

Lately, for every problem I manage to solve on my own, there is another point I need to clarify on this board, so thanks in advance for your help.
The issue is as follows:
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,4,3,4,5),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
Contract_Size = c(100,20,120,300,120,30,25,60,80,90)  # one value per row; the other columns have 10 entries
)
setDT(Data)  # := only works on a data.table, not on a plain data.frame
Data[, totalcont := sum(Contract_Size), by=c("Code_ID_Buy","Month")]
View(Data)
I would like to calculate the difference in total contract size from one period to the next. Does anybody know whether this is possible? I have been working on it for weeks without finding a solution.
Kind regards,
This should solve it:
library(tidyverse)
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,4,3,4,5),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","96V","25A","25A","25A","25A"),
Contract_Size = c(100,20,120,300,120,30,25,60,80,90)
)
Data %>%
  group_by(Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
If the difference has to be computed separately for each value of another variable (here the buyer), add it to the grouping:
Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
If you need to keep the original row structure, you can join the summary back onto the data with left_join:
result <- Data %>%
  group_by(Code_ID_Buy, Month) %>%
  summarise(total_contract = sum(Contract_Size)) %>%
  mutate(dif_contract = total_contract - lag(total_contract))
Data %>%
  left_join(result, by = c("Code_ID_Buy", "Month"))

dplyr not grouping correctly or else using data from previous groups

I am working with JHU data on coronavirus infections, and I'm trying to compute new cases (and deaths) by group. Here's the code:
base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-"
world.confirmed <- read.csv(paste0(base,"Confirmed.csv"), sep=',', head=T)
world.confirmed <- gather( world.confirmed, Date, Cases, X1.22.20:X3.21.20)
world.deaths <- read.csv(paste0(base,"Deaths.csv"), sep=',', head=T)
world.deaths <- gather(world.deaths, Date, Deaths, X1.22.20:X3.21.20)
world.data <- merge(world.confirmed, world.deaths,
by=c("Province.State","Country.Region","Lat", "Long", "Date"))
world.data$Date <- as.Date(world.data$Date, "X%m.%d.%y")
world.data <- world.data %>%
group_by(Province.State,Country.Region,Date) %>%
arrange(Province.State, Country.Region, as.Date(Date))
Following solutions to this question in SO I have tried to compute differences by group using something like this:
world.data <- world.data %>%
group_by(Lat,Long) %>%
mutate(New.Cases = Cases - lag(Cases))
That does not work, however, and neither does any other grouping. Here are the results at the boundary between the first two countries:
I have also tried inserting an arrange step, and even tried to zero the first element of each group. Same problem. Any ideas?
Update: I'm using R 3.4.4 and dplyr 0.8.5.
This might help:
library(dplyr)
world.data %>%
  mutate(Date = as.Date(Date, "X%m.%d.%y")) %>%
  arrange(Country.Region, Lat, Long, Date) %>%
  group_by(Country.Region, Lat, Long) %>%
  mutate(New_Cases = Cases - lag(Cases),
         New_deaths = Deaths - lag(Deaths))
We sort the data by Date within each location, then compute New_Cases for each group by subtracting the previous day's cumulative cases from the current day's. New_deaths is computed the same way. The first row of each group has no previous day, so its value is NA.
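To see the grouped lag in isolation, here is a small sketch on made-up numbers (the two countries and their counts are hypothetical). It calls dplyr::lag explicitly, on the assumption that another loaded package might be masking lag, and uses default = first(Cases) so that the first day of each group shows 0 instead of NA.

```r
library(dplyr)

# Hypothetical cumulative case counts for two countries.
cases <- data.frame(
  Country.Region = c("A", "A", "A", "B", "B"),
  Date = as.Date(c("2020-01-22", "2020-01-23", "2020-01-24",
                   "2020-01-22", "2020-01-23")),
  Cases = c(1, 3, 6, 10, 15),
  stringsAsFactors = FALSE
)

new_cases <- cases %>%
  arrange(Country.Region, Date) %>%
  group_by(Country.Region) %>%
  # The lag restarts in every group, so country B never sees country A's values.
  mutate(New.Cases = Cases - dplyr::lag(Cases, default = first(Cases))) %>%
  ungroup()
```

If the differences still leak across groups on the real data, it is worth checking that the grouping columns identify each location uniquely and that no other package's lag or mutate is being picked up.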

transform() to add rows with dplyr()

I've got a data frame (df) with two variables, site and purchase.
I'd like to use dplyr() to group my data by site and purchase, and get the counts and percentages for the grouped data. I'd however also like the tibble to feature rows called ALLSITES, representing the data of all the sites grouped by purchase, so that I end up with a tibble looking similar to dfgoal.
The problem's that my current code doesn't get me the ALLSITES rows. I've tried adding a base R function into dplyr(), which doesn't work.
Any help would be much appreciated.
Starting point (df):
df <- data.frame(site=c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR","LON","MAD","LON","MAD","MAD","MAD"),purchase=c("a1","a2","a1","a1","a1","a1","a1","a1","a1","a2","a1","a2","a1","a2","a1"))
Desired outcome:
dfgoal <- data.frame(site=c("LON","LON","MAD","MAD","PAR","ALLSITES","ALLSITES"),purchase=c("a1","a2","a1","a2","a1","a1","a2"),bin=c(1,2,6,2,4,11,4),bin_per=c(33.33333,66.66667,75.00000,25.00000,100.00000,73.33333,26.66667))
Current code:
library(dplyr)
df %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
df %>%
rbind(df, transform(df, site = "ALLSITES") %>%
group_by(site, purchase) %>%
summarize(bin = sum(purchase==purchase)) %>%
group_by(site) %>%
mutate(bin_per = (bin/sum(bin)*100))
We can start from the output of the first code block (stored here as df1). Group it by a constant 'site' value of 'ALLSITES' together with 'purchase', sum 'bin', compute 'bin_per' from those sums, and then use bind_rows to row-bind the result to df1:
df1 %>%
  ungroup() %>%
  group_by(site = 'ALLSITES', purchase) %>%
  summarise(bin = sum(bin)) %>%
  ungroup() %>%
  mutate(bin_per = 100*(bin/sum(bin))) %>%
  bind_rows(df1, .)
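For completeness, here is an end-to-end sketch on the question's data. It is a variation rather than the exact code above: it uses count() in place of sum(purchase==purchase), and the intermediate names df1, all_sites and result are just illustrative.

```r
library(dplyr)

df <- data.frame(
  site = c("LON","MAD","PAR","MAD","PAR","MAD","PAR","MAD","PAR",
           "LON","MAD","LON","MAD","MAD","MAD"),
  purchase = c("a1","a2","a1","a1","a1","a1","a1","a1","a1",
               "a2","a1","a2","a1","a2","a1"),
  stringsAsFactors = FALSE
)

# Counts and within-site percentages per site and purchase.
df1 <- df %>%
  count(site, purchase, name = "bin") %>%
  group_by(site) %>%
  mutate(bin_per = 100 * bin / sum(bin)) %>%
  ungroup()

# The same summary over all sites combined.
all_sites <- df %>%
  count(purchase, name = "bin") %>%
  mutate(site = "ALLSITES", bin_per = 100 * bin / sum(bin))

# bind_rows matches columns by name, so the column order does not matter.
result <- bind_rows(df1, all_sites)
```

The ALLSITES rows end up at the bottom of result, matching the layout of dfgoal.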

efficient dplyr summarise in one data frame based on intervals in another one

I frequently need to calculate means of many parameters in time series datasets based on intervals defined as "events" in a second dataset.
The example code below illustrates my current approach, which does work nicely.
As my datasets will be increasing, though, I am wondering if there is a more efficient way (example runs in ~30 s on my PC).
It is important to stay within dplyr/tidyverse (data.table ways are appreciated, but won't really help).
library(tidyverse)
#generate time series data
data <- bind_cols(
data_frame(td=seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-31 23:59"),
by = 60)),
as_data_frame(replicate(20,runif(525600))))
#generate events
events <- data_frame(
event = as.character(1:669),
start_cet = seq(from = as.POSIXct("2010-01-01 00:00"),
to = as.POSIXct("2010-12-01 00:00"),
by = 43200),
stop_cet = seq(from = as.POSIXct("2010-01-01 02:00"),
to = as.POSIXct("2010-12-01 02:00"),
by = 43200)
)
#calculate means of data columns within event intervals
system.time(
means <- events %>%
rowwise() %>%
mutate(s = list(data %>% select(td) %>% filter(td >= start_cet & td < stop_cet))) %>%
unnest() %>%
select(event,td) %>%
left_join(.,data) %>%
group_by(event) %>%
summarise_at(vars(V1:V20),funs(mean=mean)) %>%
ungroup()
)
Here's an efficient way of doing it using a non-equi join in the latest devel (1.9.7+) version of data.table, which takes about 10 milliseconds to run on the OP's sample:
library(data.table)
setDT(data); setDT(events)
# td < stop_cet matches the half-open interval used in the question's filter
data[events, on = .(td >= start_cet, td < stop_cet), lapply(.SD, mean), by = .EACHI]
Answering my own question after about three years:
The mutate step in the dplyr solution above was unnecessarily complicated, as JDLong also pointed out in a comment. I now use
means2 <- events %>%
rowwise() %>%
mutate(td = list(seq(start_cet, stop_cet - 60, "min"))) %>%
unnest() %>%
select(event,td) %>%
left_join(.,data) %>%
group_by(event) %>%
summarise_at(vars(V1:V20),funs(mean=mean)) %>%
ungroup()
which is about 25 times faster than the old dplyr solution above.
The data.table solution is still about 5 times faster than this dplyr chain. However, its output is a bit messed up: instead of a column with the events, I get two columns named td, which hold the start and stop times of the events. Does any data.table expert know how to fix this?
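Another option that stays within dplyr is to skip the expansion of every event into one row per minute and instead look up the event for each timestamp with base R's findInterval(). The sketch below is only that, a sketch: it assumes the events are sorted by start time and do not overlap, uses a tiny made-up dataset in place of the real one, and relies on across(), which needs dplyr 1.0 or later. I have not benchmarked it against the solutions above.

```r
library(dplyr)

# Hypothetical small stand-ins for the real `data` and `events`.
data <- data.frame(
  td = seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"), by = 60, length.out = 1440),
  V1 = as.numeric(1:1440),
  V2 = 1
)
events <- data.frame(
  event = c("1", "2"),
  start_cet = as.POSIXct(c("2010-01-01 00:00", "2010-01-01 12:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
events$stop_cet <- events$start_cet + 2 * 60 * 60  # each event lasts two hours

means <- data %>%
  # Index of the last event starting at or before td (0 = before the first event).
  mutate(idx = findInterval(as.numeric(td), as.numeric(events$start_cet))) %>%
  filter(idx > 0) %>%
  mutate(event = events$event[idx], stop_cet = events$stop_cet[idx]) %>%
  # Keep only timestamps inside the half-open interval [start_cet, stop_cet).
  filter(td < stop_cet) %>%
  group_by(event) %>%
  summarise(across(starts_with("V"), mean))
```

Because each timestamp is matched to at most one event, this approach would not work for overlapping events; in that case a join-based solution is still needed.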
