Plot map using ggplot2 - r

I'm trying to plot data on a map of Switzerland using this code:
require("rgdal")
require("maptools")
require("ggplot2")
require("plyr")
require("maps")
require("ggmap")
ggplot() +
  geom_polygon(data = da, aes(x = long, y = lat)) +
  coord_fixed(1.3) +
  geom_point(data = de, aes(x = lat, y = lon), color = "orange")
where the data da is a map layer from the swissmap package:
da <- shp_df[[6]]
and the data de is:
'data.frame': 115 obs. of 5 variables:
$ FB : Factor w/ 3 levels "I","II","IV": 2 2 2 3 1 2 1 3 1 1
$ Nom : Factor w/ 115 levels "\"Patient Education\" Programm unipolare Depression",..: 9 31 95 112 92 41 70 84 13 21 ...
$ lon : num 7.36 8.54 7.08 NA 7.45 ...
$ lat : num 46.2 47.4 46.1 NA 46.9 ...
$ Coûts: int 100000 380000 150000 300000 2544000 300000 1897000 500000 2930000 2400000 ...
I got this result.
This is not what I want; I'm trying to plot the data in the de dataset at their locations (sometimes the same place).
Any kind of help or advice will be appreciated.
Thank you.
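One thing that stands out in the geom_point() call above is that lon and lat are swapped (x should be longitude and y latitude), and the polygon layer has no group aesthetic. A minimal sketch of the corrected overlay, assuming the fortified data in da has the usual long, lat and group columns, and dropping points with NA coordinates:
library(ggplot2)

# keep only points that actually have coordinates
de_ok <- de[!is.na(de$lon) & !is.na(de$lat), ]

ggplot() +
  geom_polygon(data = da, aes(x = long, y = lat, group = group),
               fill = "grey90", colour = "white") +
  geom_point(data = de_ok, aes(x = lon, y = lat), color = "orange") +
  coord_fixed(1.3)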

Related

Making animations using gganimate: I'm not getting it right

I'm trying to make a COVID animation using the COVID data from my country, but I keep getting it wrong, and for most of the issues I have no idea how to solve the problem.
Libraries:
library(ggplot2)
library(tidyverse)
library(dplyr)
library(hrbrthemes)
library(rgdal)
library(raster)
library(ggmap)
library(tmap)
library(sp)
library(geobr)
library(readr)
library(gganimate)
library(gifski)
First of all, you can get the dataframe from here:
caso <- readr::read_csv("https://data.brasil.io/dataset/covid19/caso.csv.gz")
caso$date <- as.Date(caso$date)
caso$state <- as.factor(caso$state)
tibble [399,497 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:399497], format: "2020-07-22" "2020-07-21" "2020-07-20" ...
$ state : Factor w/ 27 levels "AC","AL","AM",..: 4 4 4 4 4 4 4 4 4 4 ...
$ city : chr [1:399497] NA NA NA NA ...
$ place_type : chr [1:399497] "state" "state" "state" "state" ...
$ confirmed : num [1:399497] 34660 34405 34145 33705 33585 ...
$ deaths : num [1:399497] 544 533 515 507 505 499 493 488 483 478 ...
$ order_for_place : num [1:399497] 124 123 122 121 120 119 118 117 116 115 ...
$ is_last : logi [1:399497] TRUE FALSE FALSE FALSE FALSE FALSE ...
$ estimated_population_2019 : num [1:399497] 845731 845731 845731 845731 845731 ...
$ city_ibge_code : num [1:399497] 16 16 16 16 16 16 16 16 16 16 ...
$ confirmed_per_100k_inhabitants: num [1:399497] 4098 4068 4037 3985 3971 ...
$ death_rate : num [1:399497] 0.0157 0.0155 0.0151 0.015 0.015 0.0149 0.0149 0.01
> head(caso)
# A tibble: 6 x 12
date state city place_type confirmed deaths order_for_place is_last estimated_popul~
<date> <fct> <chr> <chr> <dbl> <dbl> <dbl> <lgl> <dbl>
1 2020-07-22 AP NA state 34660 544 124 TRUE 845731
2 2020-07-21 AP NA state 34405 533 123 FALSE 845731
3 2020-07-20 AP NA state 34145 515 122 FALSE 845731
4 2020-07-19 AP NA state 33705 507 121 FALSE 845731
5 2020-07-18 AP NA state 33585 505 120 FALSE 845731
6 2020-07-17 AP NA state 33436 499 119 FALSE 845731
# ... with 3 more variables: city_ibge_code <dbl>, confirmed_per_100k_inhabitants <dbl>
The Brazil map is also available:
Estados <- read_state(year=2018)
So far, I've been making plots by summarizing the data, like this:
ggplot() +
  geom_sf(data = ontem, aes(fill = deaths), color = "#FEBF57", size = .15, show.legend = TRUE) +
  labs(title = "Mortes por COVID", size = 8) +
  scale_fill_distiller(palette = "BrBG",
                       name = "Mortes Confirmadas",
                       limits = c(min(ontem$deaths), max(ontem$deaths))) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))
options(scipen = 10000)
which results in this map:
where the "ontem" df is a data frame with the last day's COVID status (a subset of caso):
ontem <- caso %>% filter(date == Sys.Date()-1, place_type == 'state')
But I would like to make an animation of how the deaths (for example) increase each day. I tried using roughly the same code plus transition_time(date), but I keep getting warning/error messages.
Can someone help me with this? I've been stuck for days!
The transition_time() function requires a vector in a date or time format. So you must either ensure that your time variable is in a format that gganimate likes (it is pretty finicky with date formats), OR you can compute an integer that tracks the sequence of time (1, 2, 3, 4...) after sorting by date/time, and use transition_states() with that sequence vector. The latter approach, I've found, is a lot easier.
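A minimal sketch of that second approach, under the assumption that geobr::read_state() returns the state abbreviation in a column named abbrev_state that matches caso$state (object names such as por_dia, mapa and anim are purely illustrative):
library(sf)        # sf join/plot methods
library(dplyr)
library(ggplot2)
library(gganimate)

# State-level rows only, with an integer sequence per date for transition_states()
por_dia <- caso %>%
  filter(place_type == "state") %>%
  mutate(day = as.integer(factor(date)))   # 1, 2, 3, ... in date order

# Join the daily figures onto the sf geometry (assumed column: abbrev_state)
mapa <- Estados %>%
  left_join(por_dia, by = c("abbrev_state" = "state"))

anim <- ggplot(mapa) +
  geom_sf(aes(fill = deaths), color = "#FEBF57", size = .15) +
  scale_fill_distiller(palette = "BrBG", name = "Mortes Confirmadas") +
  theme_void() +
  labs(title = "Mortes por COVID - dia {closest_state}") +
  transition_states(day, transition_length = 1, state_length = 1)

animate(anim, nframes = 100, renderer = gifski_renderer())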

After inserting an apply instead of a loop

I changed my dataset to data.table and I'm using sapply (apply family), but so far that wasn't sufficient. Is this fully correct?
I already went from this:
library(data.table)
library(lubridate)
buying_volume_before_breakout <- list()
for (e in 1:length(df_1_30sec_5min$date_time)) {
  interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
                                           date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
                                           Type == "Buyer"]
  buying_volume_before_breakout[[e]] <- sum(interval$Quantity)
}
To this (created a function and am using sapply):
fun_buying_volume_before_breakout <- function(e) {
  interval <- dolar_tick_data_unified_dt[date_time <= df_1_30sec_5min$date_time[e] &
                                           date_time >= df_1_30sec_5min$date_time[e] - time_to_collect_volume &
                                           Type == "Buyer"]
  return(sum(interval$Quantity))
}
buying_volume_before_breakout <- sapply(1:length(df_1_30sec_5min$date_time), fun_buying_volume_before_breakout)
I couldn't make my data reproducible but here are some more insights about its structure.
> str(dolar_tick_data_unified_dt)
Classes ‘data.table’ and 'data.frame': 3120650 obs. of 6 variables:
$ date_time : POSIXct, format: "2017-06-02 09:00:35" "2017-06-02 09:00:35" "2017-06-02 09:00:35" ...
$ Buyer_from : Factor w/ 74 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 29 44 19 44 44 44 44 17 17 17 ...
$ Price : num 3271 3271 3272 3271 3271 ...
$ Quantity : num 5 5 5 5 5 5 10 5 50 25 ...
$ Seller_from: Factor w/ 73 levels "- - ","- - BGC LIQUIDEZ DTVM",..: 34 34 42 28 28 28 28 34 45 28 ...
$ Type : Factor w/ 4 levels "Buyer","Direct",..: 1 3 1 1 1 1 1 3 3 3 ...
- attr(*, ".internal.selfref")=<externalptr>
> str(df_1_30sec_5min)
Classes ‘data.table’ and 'data.frame': 3001 obs. of 13 variables:
$ date_time : POSIXct, format: "2017-06-02 09:33:30" "2017-06-02 09:49:38" "2017-06-02 10:00:41" ...
$ Price : num 3251 3252 3256 3256 3260 ...
$ fast_small_mm : num 3250 3253 3254 3256 3259 ...
$ slow_small_mm : num 3254 3253 3254 3256 3259 ...
$ fast_big_mm : num 3255 3256 3256 3256 3258 ...
$ slow_big_mm : num 3258 3259 3260 3261 3262 ...
$ breakout_strength : num 6.5 2 0.5 2 2.5 0.5 1 2.5 1 0.5 ...
$ buying_volume_before_breakout: num 1285 485 680 985 820 ...
$ total_volume_before_breakout : num 1285 485 680 985 820 ...
$ average_buying_volume : num 1158 338 318 394 273 ...
$ average_total_volume : num 1158 338 318 394 273 ...
$ relative_strenght : num 1 1 1 1 1 1 1 1 1 1 ...
$ relative_strenght_last_6min : num 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
First, separate the 'buyer' data from the rest. Then add a column for the start of the time interval and do a non-equi join in data.table, which is what @chinsoon is suggesting. I've made a reproducible example below:
library(data.table)
set.seed(123)
N <- 1e5

# Filter buyer details first
buyer_dt <- data.table(
  tm = Sys.time() + runif(N, -1e6, +1e6),
  quantity = round(runif(N, 1, 20))
)
time_dt <- data.table(
  t = seq(
    min(buyer_dt$tm),
    max(buyer_dt$tm),
    by = 15*60
  )
)
t_int <- 300
time_dt[, t1 := t - t_int]

library(rbenchmark)
benchmark(
  a = { # Your sapply code
    bv1 <- sapply(1:nrow(time_dt), function(i) {
      buyer_dt[between(tm, time_dt$t[i] - t_int, time_dt$t[i]), sum(quantity)]
    })
  },
  b = { # data.table non-equi join
    all_intervals <- buyer_dt[time_dt, .(t, quantity), on = .(tm >= t1, tm <= t)]
    bv2 <- all_intervals[, sum(quantity), by = .(t)]
  },
  replications = 9
)
#> test replications elapsed relative user.self sys.self user.child
#> 1 a 9 42.75 158.333 81.284 0.276 0
#> 2 b 9 0.27 1.000 0.475 0.000 0
#> sys.child
#> 1 0
#> 2 0
Edit: In general, any join of two tables A and B is a subset of the outer join [A x B]. The rows of [A x B] will have all possible combinations of the rows of A and the rows of B. An equi join subsets [A x B] by checking equality conditions, i.e. if x and y are the join columns in A and B, your join will be: the rows from [A x B] where A.x = B.x and A.y = B.y.
In a NON-equi join, the subset condition uses comparison operators OTHER than =, for example, as in your case, where you want the rows such that A.x <= B.x <= A.x + delta.
I don't know much about how they are implemented, but data.table has a pretty fast one that has worked well for me with large data frames.
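A tiny, self-contained illustration of that last condition in data.table syntax (the tables A and B and the delta value are made up for the example):
library(data.table)

A <- data.table(ax = c(1, 5, 10), id = c("a", "b", "c"))
B <- data.table(bx = c(2, 6, 20))
delta <- 3

# rows where A.ax <= B.bx <= A.ax + delta
A[, ax_hi := ax + delta]
A[B, on = .(ax <= bx, ax_hi >= bx), nomatch = NULL]
# note: in the result, the join columns ax and ax_hi display B's bx values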

R circular wheel chart

I'm trying to make a wheel chart that has rings. My result looks like the lines all go back to zero before continuing to the next point. Is it a discrete/continuous issue? I've tried making Lap.Time and Lap both numeric, to no avail:
f1 <- read.csv("F1 2011 Turkey - Fuel Corrected Lap Times.csv", header = T)
str(f1)
# data.frame: 1263 obs. of 5 variables:
# $ Driver : Factor w/ 23 levels "1","2","3","4",..: 23 23 23 23 23 23 23 23 23 23 ...
# $ Lap : int 1 2 3 4 5 6 7 8 9 10 ...
# $ Lap.Time : num 107 99.3 98.4 97.5 97.4 ...
# $ Fuel.Adjusted.Laptime : num 102.3 94.7 93.9 93.1 93.1 ...
# $ Fuel.and.fastest.lap.adjusted.laptime: num 9.73 2.124 1.321 0.54 0.467 ...
library(ggplot2)
f1$Driver <- as.factor(f1$Driver)
p1 <- ggplot(data = subset(f1, Lap.Time <= 120), aes(x = Lap, y = Lap.Time, colour = Driver)) +
  geom_point(aes(colour = Driver))
p2 <- ggplot(subset(f1, Lap.Time <= 120),
             aes(x = Lap, y = Lap.Time, colour = Driver, group = 1)) +
  geom_line(aes(colour = Driver))
pout <- p1 + coord_polar()
pout2 <- p2 + coord_polar()
pout
pout2
resulting chart image
All the data is in this csv:
https://docs.google.com/spreadsheets/d/1Ef2ewd1-0FM1mJL1o00C6c2gf7HFmanJh8an1EaAq2Q/edit?hl=en_GB&authkey=CMSemOQK#gid=0
Sample of csv:
Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
25,3,98.38,93.925,1.321
25,4,97.518,93.144,0.54
25,5,97.364,93.071,0.467
25,6,97.853,93.641,1.037
25,7,98.381,94.25,1.646
25,8,98.142,94.092,1.488
25,9,97.585,93.616,1.012
25,10,97.567,93.679,1.075
25,11,97.566,93.759,1.155
25,12,97.771,94.045,1.441
25,13,98.532,94.887,2.283
25,14,99.146,95.582,2.978
25,15,98.529,95.046,2.442
25,16,99.419,96.017,3.413
25,17,114.593,111.272,18.668
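A likely cause, judging from the code above, is the group = 1 aesthetic in p2: it puts every driver's laps into a single group, so geom_line draws one continuous path that falls back as it jumps from the last lap of one driver to the first lap of the next. A minimal sketch grouping by Driver instead (untested against the full CSV):
library(ggplot2)

p2 <- ggplot(subset(f1, Lap.Time <= 120),
             aes(x = Lap, y = Lap.Time, colour = Driver, group = Driver)) +
  geom_line()

p2 + coord_polar()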

Passing arguments to ggplot and facet_grid

I need some help with these lines of code.
My data set:
> str(data.tidy)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 9480 obs. of 11 variables:
$ Country.Name : Factor w/ 248 levels "Afghanistan",..: 234 12 13 20 22 31 17 16 25 28 ...
$ Country.Code : Factor w/ 248 levels "ABW","AFG","AGO",..: 7 12 13 16 17 18 19 21 27 28 ...
$ Year : Factor w/ 56 levels "1960","1961",..: 1 1 1 1 1 1 1 1 1 1 ...
$ InfantMortality : num 137.3 20.3 37.3 29.5 186.9 ...
$ AdolFertilityRate: num 176.9 44.8 48.4 27.1 85.8 ...
$ FertilityRate : num 6.93 3.45 2.69 2.54 6.28 ...
$ LifeExpectancy : num 52.2 70.8 68.6 69.7 37.3 ...
$ TotalUnemp : num NA NA NA NA NA NA NA NA NA NA ...
$ TotalPop : num 92612 10276477 7047539 9153489 2431620 ...
$ Region : Factor w/ 8 levels "","East Asia & Pacific",..: 5 2 3 3 8 8 7 5 4 4 ...
$ IncomeGroup : Factor w/ 6 levels "","High income: nonOECD",..: 2 3 3 3 4 4 5 2 5 6 ...
Reference code that I want to 'functionize':
ggplot(data.tidy, aes(as.numeric(as.character(Year)), y = InfantMortality)) +
  geom_line(aes(color = Country.Name)) +
  facet_grid(. ~ IncomeGroup) +
  theme(legend.position = "none") +
  theme(strip.text.x = element_text(size = 7)) +
  labs(x = 'Year', title = 'Change in mortality rate over time') +
  geom_smooth(color = 'black')
I want to replace data.tidy, InfantMortality, IncomeGroup and title in the example above.
Here was my attempt at the code:
facetedlineplot <- function(df, y, facet, title) {
  ggplot(df, aes(as.numeric(as.character(Year)), y = y)) +
    geom_line(aes(color = Country.Name)) +
    facet_grid(. ~ facet) +
    theme(legend.position = "none") +
    theme(strip.text.x = element_text(size = 7)) +
    labs(x = 'Year', title = title) +
    geom_smooth(color = 'black')
}
The error:
> facetedlineplot(data.tidy,y = 'InfantMortality',facet = 'IncomeGroup',title = 'Title goes here')
Error in layout_base(data, cols, drop = drop) :
At least one layer must contain all variables used for facetting
I have tried aes_string, but I couldn't get it to work. What does the error mean? How can I work around this issue?
Update:
I have some code that partially works now, using reformulate()
facetedlineplot <- function(df, y, facet, title) {
  year <- as.numeric(as.character(df$Year))
  ggplot(df, aes(x = year, y = y)) +
    geom_line(aes(color = Country.Name)) +
    facet_grid(paste('.~', reformulate(facet))) +
    theme(legend.position = "none") +
    theme(strip.text.x = element_text(size = 7)) +
    labs(x = 'Year', title = title) +
    geom_smooth(color = 'black')
}
> facetedlineplot(data.tidy,y = 'InfantMortality', facet = 'IncomeGroup', title = 'Title goes here')
Warning message:
Computation failed in `stat_smooth()`:
x has insufficient unique values to support 10 knots: reduce k.
>
Still, the plot is incorrect.
Thank you in advance,
Rahul
I have the solution. Three steps worked for me:
- Change the datatype of the Year variable in data.tidy from factor to numeric.
- Use aes_string for the ggplot argument.
- For facet_grid(), many things worked:
  - Use as.formula() to pass '~IncomeGroup'
  - Just pass '~IncomeGroup' directly to facet_grid()
Final code:
facetedlineplot <- function(df, y, facet, title) {
  ggplot(df, aes_string(x = 'Year', y = y)) +
    geom_line(aes(color = Country.Name)) +
    facet_grid(facet) +
    theme(legend.position = "none") +
    theme(strip.text.x = element_text(size = 9)) +
    labs(x = 'Year', title = title) +
    geom_smooth(color = 'black')
}
d <- data.tidy
d$Year <- as.numeric(as.character(d$Year))
facetedlineplot(d, 'InfantMortality', '~IncomeGroup', 'Title')
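As a side note, aes_string() is deprecated in recent ggplot2 releases; a sketch of the same function using the .data pronoun (assuming a reasonably recent ggplot2, and note the facet argument becomes the bare column name rather than a formula):
library(ggplot2)

facetedlineplot <- function(df, y, facet, title) {
  ggplot(df, aes(x = Year, y = .data[[y]])) +
    geom_line(aes(color = Country.Name)) +
    facet_grid(cols = vars(.data[[facet]])) +
    theme(legend.position = "none") +
    theme(strip.text.x = element_text(size = 9)) +
    labs(x = "Year", title = title) +
    geom_smooth(color = "black")
}

facetedlineplot(d, "InfantMortality", "IncomeGroup", "Title")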

Data Subset error in R using %in% wildcard

My df:
> str(merged)
'data.frame': 714 obs. of 9 variables:
$ Date : Date, format: "2013-03-29" "2013-03-29" "2013-03-29" "2013-03-29" ...
$ patch : Factor w/ 7 levels "BVG1","BVG11",..: 1 2 3 4 5 6 7 1 2 3 ...
$ prod : num 2.93 2.77 2.86 2.87 3.01 ...
$ workmix_pct : int 100 10 16 13 17 21 22 100 11 19 ...
$ jobcounts : int 9480 968 1551 1267 1625 1946 2123 7328 810 1374 ...
$ travel : num 30.7 34.3 33.8 29.1 28.1 24.9 34 31.8 32.7 36.4 ...
$ FWIHweeklyAvg: num 1.63 4.48 3.1 1.36 1.55 ...
$ CST.NAME : Factor w/ 7 levels "Central Scotland",..: 4 2 3 1 5 7 6 4 2 3 ...
$ month : chr "March" "March" "March" "March" ...
> head(merged)
Date patch prod workmix_pct jobcounts travel FWIHweeklyAvg CST.NAME month
1 2013-03-29 BVG1 2.932208 100 9480 30.7 1.627024 Scotland March
2 2013-03-29 BVG11 2.769156 10 968 34.3 4.475714 Highlands & Islands March
3 2013-03-29 BVG12 2.857344 16 1551 33.8 3.098571 North East Scotland March
4 2013-03-29 BVG13 2.870111 13 1267 29.1 1.361429 Central Scotland March
5 2013-03-29 BVG14 3.011260 17 1625 28.1 1.550000 South East Scotland March
6 2013-03-29 BVG15 3.236246 21 1946 24.9 1.392857 West Central Scotland March
I am trying to subset on patch BVG1 by:
data=merged[patch %in% c("BVG1"),]
But getting an error:
Error in match(x, table, nomatch = 0L) : object 'patch' not found
I don't understand why...
I am trying to plot a separate time series per patch using ggplot.
This is what I have tried:
ggplot(data = merged, aes(x = merged$Date, y = merged$prod, group = patch)) +
  geom_line() +
  xlab("") +
  ylab("Weekly Prods") +
  scale_x_date(labels = date_format("%b-%Y"), breaks = "1 month")
This plots all patches on one graph, but I want to show the BVG1 time series only, and this is what I was trying:
ggplot(data = merged[patch %in% c("BVG1"),], aes(x = merged$Date, y = merged$prod, group = patch)) +
  geom_line() +
  xlab("") +
  ylab("Weekly Prods") +
  scale_x_date(labels = date_format("%b-%Y"), breaks = "1 month")
But getting the same error.
Any ideas?
UPDATE
Problem solved using [merged$patch %in% c("BVG1"),]
You could also do
data <- subset(merged, patch == "BVG1")
Since you're only conditioning on patch being a single value, you don't need %in%, you can just test for equality.
When you use subset(), R automatically interprets variables referenced in the context of the data frame, so merged$patch is unnecessary.
Try
data=merged[merged$patch %in% c("BVG1"),]
That should solve your problems. patch is defined in your dataframe, so you need to tell R where to find it.
Additionally, you may want to look at facet_wrap instead of subsetting. For instance, adding + facet_wrap(~ patch) to your plot command should show you all patches at once. I am not sure this is what you desire as output, but I thought I should point it out as an idea...
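A minimal sketch of that facet_wrap idea, reusing the plot from the question but written with the newer date_breaks/date_labels arguments of scale_x_date(), so that scales::date_format() is not needed:
library(ggplot2)

ggplot(merged, aes(x = Date, y = prod, group = patch)) +
  geom_line() +
  xlab("") +
  ylab("Weekly Prods") +
  scale_x_date(date_labels = "%b-%Y", date_breaks = "1 month") +
  facet_wrap(~ patch)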
