I'm trying to make a COVID animation using the COVID data from my country.
But I keep getting it wrong, and for most of the issues I have no idea how to solve them.
libraries:
library(ggplot2)
library(tidyverse)
library(dplyr)
library(hrbrthemes)
library(rgdal)
library(raster)
library(ggmap)
library(tmap)
require(sp)
library(geobr)
library(readr)
library(gganimate)
library(gifski)
First of all, you can get the dataframe from here:
caso <- readr::read_csv("https://data.brasil.io/dataset/covid19/caso.csv.gz")
caso$date <- as.Date(caso$date)
caso$state <- as.factor(caso$state)
tibble [399,497 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:399497], format: "2020-07-22" "2020-07-21" "2020-07-20" ...
$ state : Factor w/ 27 levels "AC","AL","AM",..: 4 4 4 4 4 4 4 4 4 4 ...
$ city : chr [1:399497] NA NA NA NA ...
$ place_type : chr [1:399497] "state" "state" "state" "state" ...
$ confirmed : num [1:399497] 34660 34405 34145 33705 33585 ...
$ deaths : num [1:399497] 544 533 515 507 505 499 493 488 483 478 ...
$ order_for_place : num [1:399497] 124 123 122 121 120 119 118 117 116 115 ...
$ is_last : logi [1:399497] TRUE FALSE FALSE FALSE FALSE FALSE ...
$ estimated_population_2019 : num [1:399497] 845731 845731 845731 845731 845731 ...
$ city_ibge_code : num [1:399497] 16 16 16 16 16 16 16 16 16 16 ...
$ confirmed_per_100k_inhabitants: num [1:399497] 4098 4068 4037 3985 3971 ...
$ death_rate : num [1:399497] 0.0157 0.0155 0.0151 0.015 0.015 0.0149 0.0149 0.01
> head(caso)
# A tibble: 6 x 12
date state city place_type confirmed deaths order_for_place is_last estimated_popul~
<date> <fct> <chr> <chr> <dbl> <dbl> <dbl> <lgl> <dbl>
1 2020-07-22 AP NA state 34660 544 124 TRUE 845731
2 2020-07-21 AP NA state 34405 533 123 FALSE 845731
3 2020-07-20 AP NA state 34145 515 122 FALSE 845731
4 2020-07-19 AP NA state 33705 507 121 FALSE 845731
5 2020-07-18 AP NA state 33585 505 120 FALSE 845731
6 2020-07-17 AP NA state 33436 499 119 FALSE 845731
# ... with 3 more variables: city_ibge_code <dbl>, confirmed_per_100k_inhabitants <dbl>
The Brazil map is also available:
Estados <- read_state(year=2018)
So far, I've been making plots by summarizing the data, like this:
ggplot() +
geom_sf(data=ontem, aes(fill=deaths), color="#FEBF57", size=.15, show.legend = TRUE) +
labs(title = "Mortes por COVID",size=8) +
scale_fill_distiller(palette = "BrBG",
name= "Mortes Confirmadas", limits=c(min(ontem$deaths),max(ontem$deaths)))+
theme_void() + theme(plot.title = element_text(hjust = 0.5))
options(scipen=10000)
which results in this map:
Where "ontem" is a data frame holding the previous day's COVID status (a subset of caso):
ontem <- caso %>% filter(date == Sys.Date()-1,place_type == 'state')
But I would like to make an animation of how the deaths (for example) increase each day. I tried using the same code plus transition_time(date), but I keep getting warning/error messages.
Can someone help me with this? I've been stuck for days!
The transition_time() function requires its variable to be in a date or time format. So you must either ensure that your time variable is in a format that gganimate accepts (it is pretty finicky with date formats), or compute an integer that tracks the time sequence (1, 2, 3, 4...) after sorting by date/time and pass that sequence vector to transition_states(). The latter approach, I've found, is a lot easier.
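The integer-sequence approach can be sketched roughly as follows. This is a minimal illustration, not the asker's actual pipeline: the `toy` data frame, its AC values, and the `frame` column name are assumptions made up for the example, and a bar chart stands in for the map. With the real data, the same `frame` column would presumably be computed on `caso` before joining it to the `Estados` geometry and plotting with geom_sf().

```r
# Toy stand-in for the per-state daily data (AC values are made up)
toy <- data.frame(
  date   = as.Date(rep(c("2020-07-20", "2020-07-21", "2020-07-22"), times = 2)),
  state  = rep(c("AP", "AC"), each = 3),
  deaths = c(515, 533, 544, 10, 12, 15)
)

# Integer frame index: 1 for the earliest date, 2 for the next, and so on
toy$frame <- match(toy$date, sort(unique(toy$date)))

# Build the animation only when the plotting packages are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("gganimate", quietly = TRUE)) {
  library(ggplot2)
  library(gganimate)

  anim <- ggplot(toy, aes(x = state, y = deaths, fill = deaths)) +
    geom_col() +
    labs(title = "Mortes por COVID, frame {closest_state}") +
    # one animation state per integer frame value
    transition_states(frame, transition_length = 1, state_length = 1)

  # animate(anim, renderer = gifski_renderer())  # uncomment to render a GIF
}
```

Because `frame` is a plain integer, the animation no longer depends on how the date column was parsed.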
I'd like to conditionally remove rows from a data frame using dates and means. In my example:
# Package
library(tidyverse)
# Open dataset
RES_all_files_better <- read.csv("https://raw.githubusercontent.com/Leprechault/trash/main/RES_all_files_better_df.csv")
str(RES_all_files_better)
# 'data.frame': 507 obs. of 11 variables:
# $ STAND : chr "ARROIOXAVIER024B" "ARROIOXAVIER024B" "ARROIOXAVIER024B" "ARROIOXAVIER024B" ...
# $ ESPACAMENT: int 6 6 6 6 6 6 6 6 6 6 ...
# $ ESPECIE : chr "benthamii" "benthamii" "benthamii" "benthamii" ...
# $ IDADE : int 6 6 6 6 6 6 6 7 7 7 ...
# $ DATE_S2 : chr "2019-01-28" "2019-02-22" "2019-03-24" "2019-05-18" ...
# $ NDVI_avg : num 0.877 0.895 0.879 0.912 0.908 ...
# $ NDVI_sd : num 0.0916 0.0808 0.0758 0.1175 0.1132 ...
# $ NDVI_min : num -0.235 -0.1783 0.0844 -0.5666 -0.6093 ...
# $ NDVI_max : num 0.985 0.998 0.993 0.999 0.999 ...
# $ MONTH : int 1 2 3 5 7 8 9 11 12 12 ...
# $ NDVI_ref : num 0.823 0.823 0.823 0.823 0.823 ...
In my case, I am looking for an operation that removes rows from the data set if (NDVI_max+NDVI_min)/2 is lower than NDVI_avg, grouped by (ESPACAMENT, ESPECIE, IDADE), on the date (DATE_S2) before the current date. An example for RES_all_files_better$STAND=="QUEBRACANGA012F":
# Original dataset:
STAND DATE_S2 NDVI_avg NDVI_min NDVI_max
...
208 QUEBRACANGA012F 2021-08-30 0.8748818 0.8238573 0.9072955
209 QUEBRACANGA012F 2021-11-08 0.5707210 0.2847520 0.8908801
210 QUEBRACANGA012F 2021-11-13 0.5515253 0.2275358 0.8940712
211 QUEBRACANGA012F 2021-12-28 0.5956103 0.2469136 0.9122636
212 QUEBRACANGA012F 2022-01-12 0.5952482 0.2084076 0.9031508
213 QUEBRACANGA012F 2022-01-22 0.5773518 0.2088580 0.8783236
214 QUEBRACANGA012F 2022-02-16 0.4246735 0.1674446 0.6224726
215 QUEBRACANGA012F 2022-02-26 0.4064463 0.1378491 0.6111995
#Final dataset:
STAND DATE_S2 NDVI_avg NDVI_min NDVI_max
...
208 QUEBRACANGA012F 2021-08-30 0.8748818 0.8238573 0.9072955
Rows 209 to 215 were removed because (NDVI_max+NDVI_min)/2 = 0.5878161, which is lower than NDVI_avg = 0.8748818 on the last date, 2021-08-30.
Could anyone help with this, please?
We may need to filter on the minimum of the computed value ('New'):
library(dplyr)
RES_all_files_better %>%
# convert to `Date` class and create a sequence column for checking
mutate(rn = row_number(), DATE_S2 = as.Date(DATE_S2)) %>%
# grouped by columns
group_by(ESPACAMENT,ESPECIE,IDADE) %>%
# create computed column: the midrange of NDVI_max and NDVI_min
mutate(New = (NDVI_max + NDVI_min) / 2) %>%
# keep the rows where NDVI_avg is greater than the group's minimum midrange
filter(NDVI_avg > min(New)) %>%
ungroup() #%>%
# select(-rn, -New)
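Operator precedence matters for the computed column, because R evaluates division before addition. A small check, using the NDVI_max and NDVI_min values of row 209 from the question, shows that only the parenthesised midrange reproduces the quoted 0.5878161. The variable names here are illustrative only.

```r
# Values of row 209 (QUEBRACANGA012F, 2021-11-08) from the question
ndvi_max <- 0.8908801
ndvi_min <- 0.2847520

# Without parentheses only NDVI_min is halved
unparenthesised <- ndvi_max + ndvi_min / 2

# Midrange: the sum is halved
midrange <- (ndvi_max + ndvi_min) / 2

print(round(c(unparenthesised = unparenthesised, midrange = midrange), 7))
```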
I am working on the gafa_stock dataframe in the tsibbledata package. I want to find the maximum closing stock price for each of the four stocks in the dataframe. Since the dataframe has four stocks, I want a table with four rows, each giving the maximum value of one stock. I followed the instructions in "Extract the maximum value within each group in a dataframe" and wrote this code:
gafa_stock %>%
group_by(Symbol) %>%
summarise(maximum = max(Close))
The gafa_stock dataframe looks like this
str(gafa_stock) gives these results
str(gafa_stock)
tsibble [5,032 x 8] (S3: tbl_ts/tbl_df/tbl/data.frame)
$ Symbol : chr [1:5032] "AAPL" "AAPL" "AAPL" "AAPL" ...
$ Date : Date[1:5032], format: "2014-01-02" "2014-01-03" "2014-01-06" ...
$ Open : num [1:5032] 79.4 79 76.8 77.8 77 ...
$ High : num [1:5032] 79.6 79.1 78.1 78 77.9 ...
$ Low : num [1:5032] 78.9 77.2 76.2 76.8 77 ...
$ Close : num [1:5032] 79 77.3 77.7 77.1 77.6 ...
$ Adj_Close: num [1:5032] 67 65.5 65.9 65.4 65.8 ...
$ Volume : num [1:5032] 5.87e+07 9.81e+07 1.03e+08 7.93e+07 6.46e+07 ...
- attr(*, "key")= tibble [4 x 2] (S3: tbl_df/tbl/data.frame)
..$ Symbol: chr [1:4] "AAPL" "AMZN" "FB" "GOOG"
..$ .rows : list<int> [1:4]
.. ..$ : int [1:1258] 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int [1:1258] 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 ...
.. ..$ : int [1:1258] 2517 2518 2519 2520 2521 2522 2523 2524 2525 2526 ...
.. ..$ : int [1:1258] 3775 3776 3777 3778 3779 3780 3781 3782 3783 3784 ...
.. ..# ptype: int(0)
..- attr(*, ".drop")= logi TRUE
- attr(*, "index")= chr "Date"
..- attr(*, "ordered")= logi TRUE
- attr(*, "index2")= chr "Date"
- attr(*, "interval")= interval [1:1] 1D
..# .regular: logi TRUE
And, my final results look like this
This command creates a table that has all 5032 rows and three columns: Symbol, Date and the closing price labeled as maximum. What am I doing wrong? Is this because of some special characteristic of a ts or tsibble dataframe?
We can convert to a tibble first, because the object carries other class attributes as well (tbl_ts), if the version of tsibble is < 0.9.3
gafa_stock %>%
as_tibble %>%
group_by(Symbol) %>%
summarise(maximum = max(Close), .groups = 'drop')
-output
# A tibble: 4 x 2
# Symbol maximum
# <chr> <dbl>
#1 AAPL 232.
#2 AMZN 2040.
#3 FB 218.
#4 GOOG 1268.
In the newer version (0.9.3), it works without the conversion
gafa_stock %>%
group_by(Symbol) %>%
summarise(maximum = max(Close), .groups = 'drop')
# A tibble: 4 x 2
# Symbol maximum
# <chr> <dbl>
#1 AAPL 232.
#2 AMZN 2040.
#3 FB 218.
#4 GOOG 1268.
According to tsibble (0.9.2)
Each observation should be uniquely identified by index and key in a valid tsibble.
Here, the attribute for index is "Date"
attr(gafa_stock, "index")[1]
#[1] "Date"
I think this is what you want:
gafa_stock %>%
group_by(Symbol) %>%
filter(Close == max(Close))
Result:
# A tsibble: 4 x 8 [!]
# Key: Symbol [4]
# Groups: Symbol [4]
Symbol Date Open High Low Close Adj_Close Volume
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2018-10-03 230. 233. 230. 232. 230. 28654800
2 AMZN 2018-09-04 2026. 2050. 2013 2040. 2040. 5721100
3 FB 2018-07-25 216. 219. 214. 218. 218. 58954200
4 GOOG 2018-07-26 1251 1270. 1249. 1268. 1268. 2405600
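The two answers return different things: summarise() gives one maximum per group, while filter() keeps the whole row where the maximum occurs. A base R sketch on a small made-up data frame shows the same distinction without any tsibble classes involved. The `stocks` values below are invented for illustration and are not taken from gafa_stock.

```r
# Made-up prices for two symbols (not the real gafa_stock data)
stocks <- data.frame(
  Symbol = c("AAPL", "AAPL", "GOOG", "GOOG"),
  Date   = as.Date(c("2018-10-02", "2018-10-03", "2018-07-25", "2018-07-26")),
  Close  = c(229, 232, 1260, 1268)
)

# One row per symbol holding only the maximum close
max_close <- aggregate(Close ~ Symbol, data = stocks, FUN = max)

# Full rows on which each symbol reaches its maximum close
group_max <- ave(stocks$Close, stocks$Symbol, FUN = max)
top_rows  <- stocks[stocks$Close == group_max, ]

print(max_close)
print(top_rows)
```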
I changed my dataset to data.table and I'm using sapply (apply family), but so far that hasn't been sufficient. Is this fully correct?
I already went from this:
library(data.table)
library(lubridate)
buying_volume_before_breakout <- list()
for (e in 1:length(df_1_30sec_5min$date_time)) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
buying_volume_before_breakout[[e]] <- sum(interval$Quantity)
}
To this (created a function and used sapply):
fun_buying_volume_before_breakout <- function(e) {
interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
Type == "Buyer"]
return(sum(interval$Quantity))
}
buying_volume_before_breakout <- sapply(1:length(df_1_30sec_5min$date_time), fun_buying_volume_before_breakout)
I couldn't make my data reproducible but here are some more insights about its structure.
> str(dolar_tick_data_unified_dt)
Classes ‘data.table’ and 'data.frame': 3120650 obs. of 6 variables:
$ date_time : POSIXct, format: "2017-06-02 09:00:35" "2017-06-02 09:00:35" "2017-06-02 09:00:35" ...
$ Buyer_from : Factor w/ 74 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 29 44 19 44 44 44 44 17 17 17 ...
$ Price : num 3271 3271 3272 3271 3271 ...
$ Quantity : num 5 5 5 5 5 5 10 5 50 25 ...
$ Seller_from: Factor w/ 73 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 34 34 42 28 28 28 28 34 45 28 ...
$ Type : Factor w/ 4 levels "Buyer","Direct",..: 1 3 1 1 1 1 1 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(df_1_30sec_5min)
Classes ‘data.table’ and 'data.frame': 3001 obs. of 13 variables:
$ date_time : POSIXct, format: "2017-06-02 09:33:30" "2017-06-02 09:49:38" "2017-06-02 10:00:41" ...
$ Price : num 3251 3252 3256 3256 3260 ...
$ fast_small_mm : num 3250 3253 3254 3256 3259 ...
$ slow_small_mm : num 3254 3253 3254 3256 3259 ...
$ fast_big_mm : num 3255 3256 3256 3256 3258 ...
$ slow_big_mm : num 3258 3259 3260 3261 3262 ...
$ breakout_strength : num 6.5 2 0.5 2 2.5 0.5 1 2.5 1 0.5 ...
$ buying_volume_before_breakout: num 1285 485 680 985 820 ...
$ total_volume_before_breakout : num 1285 485 680 985 820 ...
$ average_buying_volume : num 1158 338 318 394 273 ...
$ average_total_volume : num 1158 338 318 394 273 ...
$ relative_strenght : num 1 1 1 1 1 1 1 1 1 1 ...
$ relative_strenght_last_6min : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
First, separate the 'buyer' data from the rest. Then add a column for the start of the time interval and do a non-equi join in data.table, which is what @chinsoon is suggesting. I've made a reproducible example below:
library(data.table)
set.seed(123)
N <- 1e5
# Filter buyer details first
buyer_dt <- data.table(
tm = Sys.time()+runif(N,-1e6,+1e6),
quantity=round(runif(N,1,20))
)
time_dt <- data.table(
t = seq(
min(buyer_dt$tm),
max(buyer_dt$tm),
by = 15*60
)
)
t_int <- 300
time_dt[,t1:=t-t_int]
library(rbenchmark)
benchmark(
a={ # Your sapply code
bv1 <- sapply(1:nrow(time_dt), function(i){
buyer_dt[between(tm,time_dt$t[i]-t_int,time_dt$t[i]),sum(quantity)]
})
},
b={ # data.table non-equi join
all_intervals <- buyer_dt[time_dt,.(t,quantity),on=.(tm>=t1,tm<=t)]
bv2 <- all_intervals[,sum(quantity),by=.(t)]
}
,replications = 9
)
#> test replications elapsed relative user.self sys.self user.child
#> 1 a 9 42.75 158.333 81.284 0.276 0
#> 2 b 9 0.27 1.000 0.475 0.000 0
#> sys.child
#> 1 0
#> 2 0
Edit: In general, any join of two tables A and B is a subset of the cross join [A x B] (the Cartesian product). The rows of [A x B] are all possible combinations of the rows of A and the rows of B. An equi join subsets [A x B] by checking equality conditions, i.e. if x and y are the join columns in A and B, the join is: rows from [A x B] where A.x=B.x and A.y=B.y
In a non-equi join, the subset condition uses comparison operators other than =. For example, in your case you want rows such that A.x <= B.x <= A.x + delta.
I don't know much about how they are implemented, but data.table has a pretty fast one that has worked well for me with large data frames.
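The "join as a subset of all row combinations" idea can be illustrated in base R on tiny made-up tables. This is only a conceptual sketch: it materialises every combination, which data.table's non-equi join avoids, so it would not scale to the tick data. The tables `A` and `B`, their columns, and `delta` are hypothetical names chosen for the example.

```r
# Hypothetical tables: A holds interval starts, B holds timestamped quantities
A <- data.frame(x = c(1, 5, 10))
B <- data.frame(t = c(2, 6, 7, 20), q = c(10, 20, 30, 40))
delta <- 3

# All combinations of rows of A and B (3 x 4 = 12 rows)
cross <- merge(A, B, by = NULL)

# Non-equi condition: keep combinations where x <= t <= x + delta
non_equi <- cross[cross$t >= cross$x & cross$t <= cross$x + delta, ]

# Sum the quantity per interval start
totals <- aggregate(q ~ x, data = non_equi, FUN = sum)
print(totals)
```

Interval starts with no matching rows (here x = 10) simply do not appear in `totals`.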
EDIT: The problem was not within the geoMean function, but with a wrong use of aggregate(), as explained in the comments
I am trying to calculate the geometric mean of multiple measurements for several different species, which includes NAs. An example of my data looks like this:
species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)
When I try to calculate the geometric mean for the species Ae with the built-in function geoMean from the package EnvStats like this
library("EnvStats")
aggregate(df[, 2:3], list(df$species), geoMean, na.rm=TRUE)
it works wonderfully and skips the NAs to give me the geometric means per species.
Group.1 phen hveg
1 Ae 4.238536 50.555696
2 Be 1.414214 1.414214
When I do this with my large dataset, however, the function stumbles over NAs and returns NA as result even though there are e.g 10 numerical values and only one NA. This happens for example with the column SLA_mm2/mg.
My large data set looks like this:
> str(cut2trait1)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 22 obs. of 19 variables:
$ Cut : chr "15_08" "15_08" "15_08" "15_08" ...
$ Block : num 1 1 1 1 1 1 1 1 1 1 ...
$ ID : num 451 512 431 531 591 432 551 393 511 452 ...
$ Plot : chr "1_1" "1_1" "1_1" "1_1" ...
$ Grazing : chr "n" "n" "n" "n" ...
$ Acro : chr "Leuc.vulg" "Dact.glom" "Cirs.arve" "Trif.prat" ...
$ Sp : chr "Lv" "Dg" "Ca" "Tp" ...
$ Label_neu : chr "Lv021" "Dg022" "Ca021" "Tp021" ...
$ PlantFunctionalType: chr "forb" "grass" "forb" "forb" ...
$ PlotClimate : chr "AC" "AC" "AC" "AC" ...
$ Season : chr "Aug" "Aug" "Aug" "Aug" ...
$ Year : num 2015 2015 2015 2015 2015 ...
$ Tiller : num 6 3 3 5 6 8 5 2 1 7 ...
$ Hveg : num 25 38 70 36 68 65 23 58 71 27 ...
$ Hrep : num 39 54 77 38 76 70 65 88 98 38 ...
$ Phen : num 8 8 7 8 8 7 6.5 8 8 8 ...
$ SPAD : num 40.7 42.4 48.7 43 31.3 ...
$ TDW_in_g : num 4.62 4.85 11.86 5.82 8.99 ...
$ SLA_mm2/mg : num 19.6 19.8 20.3 21.2 21.7 ...
and the result of my code
gm_cut2trait1 <- aggregate(cut2trait1[, 13:19], list(cut2trait1$Sp), geoMean, na.rm=TRUE)
is (only the first two rows):
Group.1 Tiller Hveg Hrep Phen SPAD TDW_in_g SLA_mm2/mg
1 Ae 13.521721 73.43485 106.67933 NA 28.17698 1.2602475 NA
2 Be 8.944272 43.95452 72.31182 5.477226 20.08880 0.7266361 9.309672
Here, the geometric mean of SLA for Ae is NA, even though there are 9 numeric measurements and only one NA in the column used to calculate the geometric mean.
I tried to use the geometric mean function suggested here:
Geometric Mean: is there a built-in?
But instead of NAs, this returned the value 1.000 when used with my big dataset, which doesn't solve my problem.
So my question is: What is the difference between my example df and the big dataset that throws the geoMean function off the rails?
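As a cross-check that does not depend on EnvStats, the per-species geometric mean with NAs skipped can be sketched in base R on the small example df. The helper `geo_mean` is a hypothetical function written for this illustration, not geoMean itself. It returns NA when a group contains non-positive values, since the logarithm is undefined there.

```r
# Hypothetical helper: geometric mean as exp(mean(log(x))), NAs dropped by default
geo_mean <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  if (length(x) == 0 || anyNA(x) || any(x <= 0)) return(NA_real_)
  exp(mean(log(x)))
}

species <- c("Ae", "Ae", "Ae", "Be", "Be")
phen <- c(2, NA, 3, 1, 2)
hveg <- c(NA, 15, 12, 60, 59)
df <- data.frame(species, phen, hveg)

# Apply the helper to each trait column within each species
gm <- aggregate(df[, c("phen", "hveg")],
                by = list(species = df$species),
                FUN = geo_mean)
print(gm)
```

Running the same call on a single column of the large dataset and comparing the result against geoMean may help narrow down which column and which group produce the NA.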
I'm trying to plot data on a map of Switzerland
using this code
require("rgdal")
require("maptools")
require("ggplot2")
require("plyr")
require("maps")
require("ggmap")
ggplot() + geom_polygon(data = da, aes(x=long, y = lat)) +
coord_fixed(1.3)+
geom_point(data=de, aes(x=lat, y=lon), color="orange")
Where data da is a map from the swissmap package:
da<- shp_df[[6]]
and data de is:
'data.frame': 115 obs. of 5 variables:
$ FB : Factor w/ 3 levels "I","II","IV": 2 2 2 3 1 2 1 3 1 1
$ Nom : Factor w/ 115 levels "\"Patient Education\" Programm unipolare Depression",..: 9 31 95 112 92 41 70 84 13 21 ...
$ lon : num 7.36 8.54 7.08 NA 7.45 ...
$ lat : num 46.2 47.4 46.1 NA 46.9 ...
$ Coûts: int 100000 380000 150000 300000 2544000 300000 1897000 500000 2930000 2400000 ...
I got this result.
This is not what I want; I'm trying to plot the data in the de dataset at their locations (sometimes several at the same place).
Any help or advice would be appreciated.
Thank you