Plot new cases per day using ggplot in R - r

I downloaded the US coronavirus data set from The New York Times, which includes the date and the cumulative number of cases up to that date. How can I extract and plot new cases per day using ggplot in R?
The data set: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv

Assuming you have only two columns, one for dates and one for cumulative cases, you can get the number of new cases by subtracting the previous day's cumulative value from the current day's.
In dplyr, you can use the lag function for that.
Here is a fake, reproducible dataset (I intentionally keep the original case values to show that the calculation is correct):
library(lubridate) # provides ymd()
df <- data.frame(date = seq(ymd("2020-01-01"),ymd("2020-01-10"),by = "day"),
cases = sample(10:100,10))
df$cumCase <- cumsum(df$cases)
library(dplyr)
df %>% mutate(Orig_cases = ifelse(row_number()==1, cumCase, cumCase - lag(cumCase)))
date cases cumCase Orig_cases
1 2020-01-01 88 88 88
2 2020-01-02 49 137 49
3 2020-01-03 14 151 14
4 2020-01-04 35 186 35
5 2020-01-05 67 253 67
6 2020-01-06 23 276 23
7 2020-01-07 95 371 95
8 2020-01-08 63 434 63
9 2020-01-09 17 451 17
10 2020-01-10 90 541 90
Now that you have the correct calculation, you can pass it to ggplot:
library(dplyr)
library(ggplot2)
df %>% mutate(Orig_cases = ifelse(row_number()==1, cumCase, cumCase - lag(cumCase))) %>%
ggplot(aes(x = date, y = Orig_cases))+
geom_col()+
geom_line(aes(y = cumCase, group = 1))
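For the actual NYT file, which has one row per state per date, the same lag idea has to be applied within each state. A minimal sketch, assuming the columns are named date, state and cases as in us-states.csv; the toy values below are made up for illustration and are not real counts:

```r
library(dplyr)

# Toy stand-in for us-states.csv (hypothetical values, not real counts)
covid <- data.frame(
  date  = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), times = 2)),
  state = rep(c("Washington", "New York"), each = 3),
  cases = c(10, 15, 27, 1, 1, 6)   # cumulative cases per state
)

daily <- covid %>%
  arrange(state, date) %>%          # lag() depends on row order
  group_by(state) %>%               # difference within each state only
  mutate(new_cases = cases - lag(cases, default = 0)) %>%
  ungroup()
```

With the real data, covid would come from read.csv() on the URL in the question, and daily could then be passed to ggplot(aes(x = date, y = new_cases)) + geom_col(), optionally after filtering to a single state.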

Related

how to calculate mean based on conditions in for loop in r

I have what I think is a simple question but I can't figure it out! I have a data frame with multiple columns. Here's a general example:
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
test.df
I would like for R to calculate average activity based on the age of the colony in the data frame. Specifically, I want it to only calculate the average activity of the colonies that are the same age or older than the colony in that row, not including the activity of the colony in that row. For example, colony 29683 is 21 years old. I want the average activity of colonies older than 21 for this row of my data. That would include colony 25077 and colony 4865; and the mean would be (45+33)/2 = 39. I want R to do this for each row of the data by identifying the age of the colony in the current row, then identifying the colonies that are older than that colony, and then averaging the activity of those colonies.
I've tried doing this in a for loop in R. Here's the code I used:
test.avg = vector("numeric",nrow(test.df))
for (i in 1:10){
test.avg[i] <- mean(subset(test.df$activity,test.df$age >= age[i])[-i])
}
R returns a list of values where half of them are correct and the other half are not (I'm not even sure how it calculated those incorrect numbers). The numbers that are correct are also out of order compared to how they're listed in the data frame. It's clearly able to do the right thing for some iterations of the loop but not all. If anyone could help me out with my code, I would greatly appreciate it!
colony = c('29683','25077','28695','4865','19858','2235','1948','1849','2370','23196')
age = c(21,23,4,25,7,4,12,14,9,7)
activity = c(19,45,78,33,2,49,22,21,112,61)
test.df = data.frame(colony,age,activity)
library(tidyverse)
test.df %>%
mutate(result = map_dbl(age, ~mean(activity[age > .x])))
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
# base
test.df$result <- with(test.df, sapply(age, FUN = function(x) mean(activity[age > x])))
test.df
#> colony age activity result
#> 1 29683 21 19 39.00000
#> 2 25077 23 45 33.00000
#> 3 28695 4 78 39.37500
#> 4 4865 25 33 NaN
#> 5 19858 7 2 42.00000
#> 6 2235 4 49 39.37500
#> 7 1948 12 22 29.50000
#> 8 1849 14 21 32.33333
#> 9 2370 9 112 28.00000
#> 10 23196 7 61 42.00000
Created on 2021-03-22 by the reprex package (v1.0.0)
The issue in your solution is that [-i] is applied to the already-subsetted vector, so position i no longer refers to the same row as in the original data.frame.
Try something like this: first store the age of the current colony, then exclude the current row and calculate the average activity of the remaining rows whose age is at least that stored value.
for (i in 1:10){
test.avg[i] <- {amin=age[i]; mean(subset(test.df[-i,], age >= amin)$activity)}
}
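The same idea can be written without subset(), using a logical mask that switches off the current row. This is a sketch on the question's vectors; the name avg_older is made up for illustration, and it uses >= (same age or older), so replace it with > if only strictly older colonies should count:

```r
age      <- c(21, 23, 4, 25, 7, 4, 12, 14, 9, 7)
activity <- c(19, 45, 78, 33, 2, 49, 22, 21, 112, 61)

avg_older <- vapply(seq_along(age), function(i) {
  keep <- age >= age[i]   # colonies at least as old as colony i
  keep[i] <- FALSE        # never include colony i itself
  mean(activity[keep])    # NaN when no other colony qualifies
}, numeric(1))
```

Because the mask has the same length as the original vectors, the index i always refers to the intended row.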
You can also use map_df, dropping the current row before filtering:
library(tidyverse)
test.df %>%
mutate(map_df(seq_len(nrow(test.df)), ~
test.df %>%
slice(-.x) %>% # exclude the current row
filter(age >= test.df$age[.x]) %>%
summarise(av_acti = mean(activity))))

Performing operation among levels of grouped variable in R/dplyr

I want to perform a calculation among levels of a grouping variable and fit this into a dplyr/tidyverse style workflow. I know this is confusing wording, but I hope the example below helps to clarify.
Below, I want to find the difference between levels "A" and "B" for each year that I have data. One solution was to cast the data from long to wide format, and use mutate() in order to find the difference between A and B and create a new column with the results.
Ultimately, I'm working with a much larger dataset in which for each of N species, and for every year of sampling, I want to find the response ratio of some measured variable. Being able to keep the calculation in a long-format workflow would greatly help with later uses of the data.
library(tidyverse)
library(reshape)
set.seed(34)
test = data.frame(Year = rep(seq(2011,2020),2),
Letter = rep(c('A','B'),each = 10),
Response = sample(100,20))
test.results = test %>%
cast(Year ~ Letter, value = 'Response') %>%
mutate(diff = A - B)
#test.results
Year A B diff
2011 93 48 45
2012 33 44 -11
2013 9 80 -71
2014 10 61 -51
2015 50 67 -17
2016 8 43 -35
2017 86 20 66
2018 54 99 -45
2019 29 100 -71
2020 11 46 -35
Is there some solution where I could group by Year, and then use a function like summarize() to calculate between the levels of the variable "Letter"?
test %>%
group_by(Year) %>%
summarise(<something here to perform a calculation between levels A and B of the variable "Letter">)
You can subset the Response values for "A" and "B" and then take the difference.
library(dplyr)
test %>%
group_by(Year) %>%
summarise(diff = Response[Letter == 'A'] - Response[Letter == 'B'])
# Year diff
# <int> <int>
# 1 2011 45
# 2 2012 -11
# 3 2013 -71
# 4 2014 -51
# 5 2015 -17
# 6 2016 -35
# 7 2017 66
# 8 2018 -45
# 9 2019 -71
#10 2020 -35
In this example, we can also sort the data so that "B" comes before "A" within each year (hence desc(Letter)); diff() then returns the second value minus the first, i.e. A - B:
test %>%
arrange(Year, desc(Letter)) %>%
group_by(Year) %>%
summarise(diff = diff(Response))
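The subsetting approach carries over to the response ratio mentioned in the question. A minimal sketch on made-up numbers (the toy data frame and the column name ratio are illustrative only), assuming each year has exactly one "A" row and one "B" row:

```r
library(dplyr)

# Hypothetical long-format data: one A and one B value per year
toy <- data.frame(
  Year     = rep(2011:2012, times = 2),
  Letter   = rep(c("A", "B"), each = 2),
  Response = c(10, 30, 5, 60)
)

ratios <- toy %>%
  group_by(Year) %>%
  summarise(ratio = Response[Letter == "A"] / Response[Letter == "B"])
```

With several species, adding the species column to group_by() should give one ratio per species and year while the data stays in long format.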

R aggregate variable but duplicate new values in original dataframe

I am new to R, and I've run into what I imagine is a very simple problem:
I am currently trying to aggregate an hourly variable to daily averages. The trick is I want to keep these new daily averages in my original data frame. While I have been able to use aggregate() or summaryBy() for a new daily aggregated data frame, I would like to simply repeat averaged values within my original data frame. Shown below is a head from my frame:
- x y
50 4.650097 2017-3-12-16
51 6.499223 2017-3-12-17
52 8.741650 2017-3-12-18
53 8.358922 2017-3-12-19
54 8.650971 2017-3-12-20
55 6.928252 2017-3-12-21
What I want to do is aggregate x, which is an hourly measurement, into a single daily average, but include those repeated averages as new columns.
For example, let's say the average of x was '6.12' for the first 24 rows. I want '6.12' to repeat as a new column for 24 rows, instead of creating a new single value vector.
Thank you in advance for any advice!
Here is a dplyr solution:
library(dplyr);
df %>%
mutate(date = as.Date(as.POSIXct(strptime(y, "%Y-%m-%d-%H")))) %>%
group_by(date) %>%
mutate(mean.x = mean(x))
## A tibble: 9 x 5
## Groups: date [2]
# X. x y date mean.x
# <int> <dbl> <fct> <date> <dbl>
#1 50 4.65 2017-3-12-16 2017-03-12 7.30
#2 51 6.50 2017-3-12-17 2017-03-12 7.30
#3 52 8.74 2017-3-12-18 2017-03-12 7.30
#4 53 8.36 2017-3-12-19 2017-03-12 7.30
#5 54 8.65 2017-3-12-20 2017-03-12 7.30
#6 55 6.93 2017-3-12-21 2017-03-12 7.30
#7 100 5.00 2017-4-23-16 2017-04-23 5.00
#8 101 6.00 2017-4-23-17 2017-04-23 5.00
#9 102 4.00 2017-4-23-18 2017-04-23 5.00
Explanation: Convert y to POSIXct format, extract date component, group_by date, and create new column with daily mean.
Sample data
df <- read.table(text =
"- x y
50 4.650097 2017-3-12-16
51 6.499223 2017-3-12-17
52 8.741650 2017-3-12-18
53 8.358922 2017-3-12-19
54 8.650971 2017-3-12-20
55 6.928252 2017-3-12-21
100 5.0 2017-4-23-16
101 6.0 2017-4-23-17
102 4.0 2017-4-23-18", header = T)
This is untested as you haven't provided a reproducible form of your data (check out dput), but this should at least point you in the right direction. Just replace mydf with whatever your dataframe is called.
library(tidyr)
library(dplyr)
aggregated_df <- mydf %>%
separate(y, c("date", "hour"), sep = -3) %>%
group_by(date) %>%
mutate(daily_average = mean(x))
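A base R alternative is ave(), which returns the group mean repeated for every row of the group. The sketch below uses made-up values and a hypothetical data frame called hourly; it takes the day by removing everything after the last hyphen, so it should also cope with single-digit hours such as "2017-4-23-9":

```r
# Hypothetical hourly data in the same "year-month-day-hour" layout
hourly <- data.frame(
  x = c(4, 6, 8, 5, 9),
  y = c("2017-3-12-16", "2017-3-12-17", "2017-3-12-18",
        "2017-4-23-9", "2017-4-23-10")
)

hourly$day    <- sub("-[^-]*$", "", hourly$y)   # drop the trailing "-hour" part
hourly$mean.x <- ave(hourly$x, hourly$day)      # daily mean, repeated per row
```

ave() uses mean as its default function, so no extra argument is needed here.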

How to group in R with partial match and assign a column with the aggregated value?

Below is the data frame I have:
Quarter Revenue
1 2014-Q1 10
2 2014-Q2 20
3 2014-Q3 30
4 2014-Q4 40
5 2015-Q1 50
6 2015-Q2 60
7 2015-Q3 70
8 2015-Q4 80
I want to find the mean of the quarters containing Q1, Q2, Q3 and Q4 separately (e.g. for text containing Q1, I have two revenue values, 10 and 50, the mean of which is 30) and insert a column containing that mean. The output should look like this:
Quarter Revenue Aggregate
1 2014-Q1 10 30
2 2014-Q2 20 40
3 2014-Q3 30 50
4 2014-Q4 40 60
5 2015-Q1 50 30
6 2015-Q2 60 40
7 2015-Q3 70 50
8 2015-Q4 80 60
Could you please let me know whether there are ways to do this both with and without the popular packages?
Thanks!
We can separate the "Quarter" into "Year", "Quart", group by "Quart", and get the mean of "Revenue"
library(dplyr)
library(tidyr)
separate(df1, Quarter, into = c("Year", "Quart"), remove = FALSE) %>%
group_by(Quart) %>%
mutate(Aggregate = mean(Revenue)) %>%
ungroup() %>%
select(-Quart, -Year)
# Quarter Revenue Aggregate
# <chr> <int> <dbl>
#1 2014-Q1 10 30
#2 2014-Q2 20 40
#3 2014-Q3 30 50
#4 2014-Q4 40 60
#5 2015-Q1 50 30
#6 2015-Q2 60 40
#7 2015-Q3 70 50
#8 2015-Q4 80 60
Or we can do this compactly with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by the substring of 'Quarter' (with the year and "-" removed), we assign (:=) the mean of 'Revenue' to create 'Aggregate'.
library(data.table)
setDT(df1)[, Aggregate := mean(Revenue) ,.(sub(".*-", "", Quarter))]
One possible solution using functions from the base package.
qtr <- c("Q1", "Q2", "Q3", "Q4")
avg <- numeric()
for (n in 1:length(qtr)) {
ind <- grep(qtr[n], df1$Quarter)
avg[length(avg) + 1] <- mean(df1$Revenue[ind])
}
df1 <- transform(df1, Aggregate = avg)
Functions from other packages (e.g., dplyr) make the code less verbose.
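The loop stores one mean per quarter and relies on the four values being recycled down the column, which only lines up when the rows are ordered Q1 to Q4 within each year. A base R sketch that does not depend on row order uses ave() with the quarter label as the grouping key (same sample data as in the question):

```r
df1 <- data.frame(
  Quarter = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4",
              "2015-Q1", "2015-Q2", "2015-Q3", "2015-Q4"),
  Revenue = seq(10, 80, by = 10)
)

# Strip the "year-" prefix to get the quarter label, then average per label
df1$Aggregate <- ave(df1$Revenue, sub(".*-", "", df1$Quarter))
```

ave() returns a vector as long as the input, so each row receives the mean of its own quarter directly.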

How to assign a value depending on two conditions including column names. (add environmental variable to tracking data)

I have a data frame (track) with the position (longitude and latitude) and date (day of the year) of tracking points for different animals, and another data frame (Var) which gives the mean temperature for every day of the year at different locations.
I would like to add a new column Temp to my data frame (track), where the value comes from (Var) and corresponds to the date and GPS location of each tracking point in (track).
Here is a really simple subset of my data and what I would like to obtain.
track = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5))
Var = data.frame(
Longitude=c(117,117,116,116),
Latitude=c(18,20,18,20),
Day1=c(22,23,24,21),
Day2=c(21,28,27,29),
Day3=c(12,13,14,11),
Day4=c(17,19,20,23),
Day5=c(32,33,34,31)
)
TrackPlusVar = data.frame(
animals=c(1,1,1,2,2),
Longitude=c(117,116,117,117,116),
Latitude=c(18,20,20,18,20),
Day=c(1,3,4,1,5),
Temp= c(22,11,19,22,31)
)
I've no idea how to assign the value from the same date and GPS location as it is a column name. Any idea would be very useful !
This is a dplyr and tidyr approach.
library(dplyr)
library(tidyr)
# reshape table Var
Var %>%
gather(Day,Temp,-Longitude, -Latitude) %>%
mutate(Day = as.numeric(gsub("Day","",Day))) -> Var2
# join tables
track %>% left_join(Var2, by=c("Longitude", "Latitude", "Day"))
# animals Longitude Latitude Day Temp
# 1 1 117 18 1 22
# 2 1 116 20 3 11
# 3 1 117 20 4 19
# 4 2 117 18 1 22
# 5 2 116 20 5 31
If the process that creates your tables guarantees that every tracking point has a match in Var2, you can use inner_join instead of left_join; it returns the same rows in that case and may be slightly faster.
If you're still not happy with the speed you can use a data.table join process to check if it is faster, like:
library(data.table)
Var2 = setDT(Var2, key = c("Longitude", "Latitude", "Day"))
track = setDT(track, key = c("Longitude", "Latitude", "Day"))
Var2[track][order(animals,Day)]
# Longitude Latitude Day Temp animals
# 1: 117 18 1 22 1
# 2: 116 20 3 11 1
# 3: 117 20 4 19 1
# 4: 117 18 1 22 2
# 5: 116 20 5 31 2
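If reshaping is not wanted, the lookup can also be done in base R by turning the location into a row index and the day into a column index of Var. This is only a sketch on the question's sample data; row_idx and col_idx are illustrative names, and it assumes every location and day in track exists in Var (otherwise match() returns NA):

```r
track <- data.frame(
  animals   = c(1, 1, 1, 2, 2),
  Longitude = c(117, 116, 117, 117, 116),
  Latitude  = c(18, 20, 20, 18, 20),
  Day       = c(1, 3, 4, 1, 5)
)
Var <- data.frame(
  Longitude = c(117, 117, 116, 116),
  Latitude  = c(18, 20, 18, 20),
  Day1 = c(22, 23, 24, 21),
  Day2 = c(21, 28, 27, 29),
  Day3 = c(12, 13, 14, 11),
  Day4 = c(17, 19, 20, 23),
  Day5 = c(32, 33, 34, 31)
)

# Row of Var holding each tracking point's location
row_idx <- match(paste(track$Longitude, track$Latitude),
                 paste(Var$Longitude, Var$Latitude))
# Column of Var named after each tracking point's day ("Day1", "Day3", ...)
col_idx <- match(paste0("Day", track$Day), names(Var))

# A two-column index matrix picks one (row, column) cell per tracking point
track$Temp <- as.matrix(Var)[cbind(row_idx, col_idx)]
```

The resulting Temp column matches the TrackPlusVar example from the question.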
