ggplot and geom_point function in R

Given the code below, I'm trying to visualise a layered bubble chart:
points of all start stations, with sizes varying by the total number of pickups;
points of all end stations, with sizes varying by the total number of returns. I need to end up with a ggplot object named p1, with alpha = 0.5 applied to both layers.
library(lubridate)
library(tidyverse)
nycbikes18 <- read_csv("data/2018-citibike-tripdata.csv",
locale = locale(tz = "America/New_York"))
nycbikes18
#> # A tibble: 333,687 x 15
#> tripduration starttime stoptime
#> <dbl> <dttm> <dttm>
#> 1 932 2018-01-01 02:06:17 2018-01-01 02:21:50
#> 2 550 2018-01-01 12:06:18 2018-01-01 12:15:28
#> 3 510 2018-01-01 12:06:56 2018-01-01 12:15:27
#> 4 354 2018-01-01 14:53:10 2018-01-01 14:59:05
#> 5 250 2018-01-01 17:34:30 2018-01-01 17:38:40
#> 6 613 2018-01-01 22:05:05 2018-01-01 22:15:19
#> 7 290 2018-01-02 12:13:51 2018-01-02 12:18:42
#> 8 381 2018-01-02 12:50:03 2018-01-02 12:56:24
#> 9 318 2018-01-02 13:55:58 2018-01-02 14:01:16
#> 10 1852 2018-01-02 16:55:29 2018-01-02 17:26:22
#> # … with 333,677 more rows, and 12 more variables:
#> # start_station_id <dbl>, start_station_name <chr>,
#> # start_station_latitude <dbl>, start_station_longitude <dbl>,
#> # end_station_id <dbl>, end_station_name <chr>,
#> # end_station_latitude <dbl>, end_station_longitude <dbl>,
#> # bikeid <dbl>, usertype <chr>, birth_year <dbl>, gender <dbl>
Expected output: (image in the original post)
I tried the code below, but I'm not sure how to fix the size = n mapping.
p1 <- nycbikes18
p1 <- ggplot(p1) +
  geom_point(aes(start_station_longitude, start_station_latitude,
                 size = n), alpha = 0.5) +
  geom_point(aes(end_station_longitude, end_station_latitude, size = n),
             alpha = 0.5)
p1

Your code maps size to a column n that does not exist in nycbikes18: you first need to compute the number of pickups per start station and the number of returns per end station. From your description, what you want is something like (where n_start and n_end are those per-station counts):
ggplot(p1) +
  geom_point(aes(start_station_longitude, start_station_latitude, size = n_start), alpha = 0.5) +
  geom_point(aes(end_station_longitude, end_station_latitude, size = n_end), alpha = 0.5)
You also improve your chances of getting help if you share a reproducible example and explain what error you are getting.
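Since n_start and n_end do not exist in nycbikes18 out of the box, here is one way to compute them with count() (a sketch; the names n_start, n_end, starts and ends are mine, and the station columns are taken from the printout above):

```r
library(tidyverse)

# Total pickups per start station and total returns per end station.
starts <- nycbikes18 %>%
  count(start_station_longitude, start_station_latitude, name = "n_start")
ends <- nycbikes18 %>%
  count(end_station_longitude, end_station_latitude, name = "n_end")

# One point per station in each layer, sized by its count.
p1 <- ggplot() +
  geom_point(data = starts,
             aes(start_station_longitude, start_station_latitude,
                 size = n_start), alpha = 0.5) +
  geom_point(data = ends,
             aes(end_station_longitude, end_station_latitude,
                 size = n_end), alpha = 0.5)
p1
```

Plotting the two summary tables rather than all 333,687 trips also keeps each station as a single point instead of thousands of overplotted ones.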

Related

Find percent difference between row below and row above in R and create new column

I want to find the percent difference between the row below and the row above and put the difference into a new column. My data frame (df) looks like this
Date_Time WC_30cm_neg
2018-01-01 05:50:01 0.3051
2018-01-01 06:00:01 0.3048
2018-01-01 06:10:01 0.3048
2018-01-01 06:20:01 0.3048
2018-01-01 06:30:01 0.3051
2018-01-01 06:40:01 0.3051
I've tried:
df_diff <- df %>%
  arrange(Date_Time) %>%
  group_by(WC_30cm_neg) %>%
  mutate(
    diff = WC_30cm_neg - lag(WC_30cm_neg),
    increase = scales::percent(diff / lag(WC_30cm_neg))
  ) %>%
  filter(row_number() != 1)
This returns me a new data frame, and gives me a percent column, but all of the percentages are 0. Any other suggestions will be greatly appreciated.
Do you just need to remove the group_by()? Grouping by WC_30cm_neg makes lag() operate within groups of rows that share the same value, so every within-group difference is necessarily zero.
library(dplyr, warn.conflicts = FALSE)
df <- tibble::tribble(
~Date_Time, ~WC_30cm_neg,
"2018-01-01 05:50:01", 0.3051,
"2018-01-01 06:00:01", 0.3048,
"2018-01-01 06:10:01", 0.3048,
"2018-01-01 06:20:01", 0.3048,
"2018-01-01 06:30:01", 0.3051,
"2018-01-01 06:40:01", 0.3051
)
df |>
  mutate(
    diff = WC_30cm_neg - lag(WC_30cm_neg),
    increase = scales::percent(diff / lag(WC_30cm_neg))
  )
#> # A tibble: 6 × 4
#> Date_Time WC_30cm_neg diff increase
#> <chr> <dbl> <dbl> <chr>
#> 1 2018-01-01 05:50:01 0.305 NA <NA>
#> 2 2018-01-01 06:00:01 0.305 -0.000300 -0.098%
#> 3 2018-01-01 06:10:01 0.305 0 0.000%
#> 4 2018-01-01 06:20:01 0.305 0 0.000%
#> 5 2018-01-01 06:30:01 0.305 0.000300 0.098%
#> 6 2018-01-01 06:40:01 0.305 0 0.000%
Created on 2022-10-14 with reprex v2.0.2
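As a cross-check, the same arithmetic in base R (my addition, not part of the original answer), using diff():

```r
# Percent change from each row to the next, as plain numbers.
wc <- c(0.3051, 0.3048, 0.3048, 0.3048, 0.3051, 0.3051)
pct_change <- c(NA, diff(wc) / head(wc, -1) * 100)
round(pct_change, 3)
#> [1]     NA -0.098  0.000  0.000  0.098  0.000
```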

Is there a way to group data according to time in R?

I'm working with trip ticket data and it includes a column with dates and times. I want to group trips into Morning (05:00-10:59), Lunch (11:00-12:59), Afternoon (13:00-17:59), Evening (18:00-23:59), and Dawn/Graveyard (00:00-04:59), and then count the number of trips in each category (by counting the unique values in the trip_id column).
I just don't know how to group/summarize according to time values. Is this possible in R?
trip_id start_time end_time day_of_week
1 CFA86D4455AA1030 2021-03-16 08:32:30 2021-03-16 08:36:34 Tuesday
2 30D9DC61227D1AF3 2021-03-28 01:26:28 2021-03-28 01:36:55 Sunday
3 846D87A15682A284 2021-03-11 21:17:29 2021-03-11 21:33:53 Thursday
4 994D05AA75A168F2 2021-03-11 13:26:42 2021-03-11 13:55:41 Thursday
5 DF7464FBE92D8308 2021-03-21 09:09:37 2021-03-21 09:27:33 Sunday
Here's a solution with hour() and case_when().
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
trip <- tibble(start_time = mdy_hm("1/1/2022 1:00") + minutes(seq(0, 700, 15)))
trip <- trip %>%
  mutate(
    hr = hour(start_time),
    time_of_day = case_when(
      hr >= 5 & hr < 11 ~ "morning",
      hr >= 11 & hr < 13 ~ "afternoon",
      TRUE ~ "fill in the rest yourself :)"
    )
  )
print(trip)
#> # A tibble: 47 x 3
#> start_time hr time_of_day
#> <dttm> <int> <chr>
#> 1 2022-01-01 01:00:00 1 fill in the rest yourself :)
#> 2 2022-01-01 01:15:00 1 fill in the rest yourself :)
#> 3 2022-01-01 01:30:00 1 fill in the rest yourself :)
#> 4 2022-01-01 01:45:00 1 fill in the rest yourself :)
#> 5 2022-01-01 02:00:00 2 fill in the rest yourself :)
#> 6 2022-01-01 02:15:00 2 fill in the rest yourself :)
#> 7 2022-01-01 02:30:00 2 fill in the rest yourself :)
#> 8 2022-01-01 02:45:00 2 fill in the rest yourself :)
#> 9 2022-01-01 03:00:00 3 fill in the rest yourself :)
#> 10 2022-01-01 03:15:00 3 fill in the rest yourself :)
#> # ... with 37 more rows
trips <- trip %>%
count(time_of_day)
print(trips)
#> # A tibble: 3 x 2
#> time_of_day n
#> <chr> <int>
#> 1 afternoon 7
#> 2 fill in the rest yourself :) 16
#> 3 morning 24
Created on 2022-03-21 by the reprex package (v2.0.1)
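The remaining branches follow the same pattern. A complete case_when() for the five ranges in the question (a sketch I've filled in; the labels are the question's own, and only the hour()/case_when() pattern comes from the answer above):

```r
library(dplyr)
library(lubridate)

trip <- tibble(start_time = mdy_hm("1/1/2022 1:00") + minutes(seq(0, 700, 15)))
trip <- trip %>%
  mutate(
    hr = hour(start_time),
    time_of_day = case_when(
      hr >= 5  & hr < 11 ~ "Morning",        # 05:00-10:59
      hr >= 11 & hr < 13 ~ "Lunch",          # 11:00-12:59
      hr >= 13 & hr < 18 ~ "Afternoon",      # 13:00-17:59
      hr >= 18           ~ "Evening",        # 18:00-23:59
      TRUE               ~ "Dawn/Graveyard"  # 00:00-04:59
    )
  )
count(trip, time_of_day)
```

If one trip can span several rows in the real data, count distinct IDs instead, e.g. trip %>% group_by(time_of_day) %>% summarise(n = n_distinct(trip_id)).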

Disaggregate daily time series into hourly values using R

I'm working with a dataset that contains daily data of water flow. The data runs from 1998-10-01 to 2020-03-30 and looks like this:
Date QA
1998-10-01 315
1998-10-02 245
1998-10-03 179
1998-10-04 186
1998-10-05 262
1998-10-06 199
1998-10-07 319
(...)
class(Date) is "Date" and class(QA) is "numeric".
My goal is to turn this daily data into hourly data. For this I used the td() function from the tempdisagg package:
library(tempdisagg)
td(QA ~ 1, to = "hour", method = "denton-cholette")
My problem is in the definition of QA as a time series variable. When I define it as 'ts' and apply the function to disaggregate the data, the following error appears:
QA_ts <- ts(QA, start = decimal_date(as.Date("1998-10-01")), frequency = 365)
td(QA_ts ~ 1, to = "hour",method="denton-cholette")
Error in td(QA_ts ~ 1, to = "hour",method="denton-cholette") :
use a time series class other than 'ts' to deal with 'hour'
And when I define QA as another format such as "xts" or "msts" I get the following error:
newQA <- xts(QA,Date)
td(newQA ~1, to="hour",method="denton-cholette")
Error in seq.Date(lf[1], lf.end, by = to) : 'to' must be a "Date" object
I think I'm doing something wrong when defining QA as time series but I can't solve this issue.
Can anybody help me out?
thanks,
Date needs to be of class POSIXct, rather than Date, to convert to hourly frequency. Here is a reproducible example:
x <- structure(list(time = structure(c(10227, 10258, 10286, 10317,
10347, 10378, 10408), class = "Date"), value = c(315, 245, 179,
186, 262, 199, 319)), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))
Disaggregate to days:
library(tempdisagg)
m0 <- td(x ~ 1, to = "day", method = "fast")
#> Loading required namespace: tsbox
predict(m0)
#> # A tibble: 212 x 2
#> time value
#> <date> <dbl>
#> 1 1998-01-01 10.4
#> 2 1998-01-02 10.3
#> 3 1998-01-03 10.3
#> 4 1998-01-04 10.3
#> 5 1998-01-05 10.3
#> 6 1998-01-06 10.3
#> 7 1998-01-07 10.3
#> 8 1998-01-08 10.3
#> 9 1998-01-09 10.3
#> 10 1998-01-10 10.3
#> # … with 202 more rows
If you want to disaggregate to hours, time needs to be POSIXct:
x$time <- as.POSIXct(x$time)
m1 <- td(x ~ 1, to = "hour", method = "fast")
predict(m1)
#> # A tibble: 5,087 x 2
#> time value
#> <dttm> <dbl>
#> 1 1998-01-01 01:00:00 0.431
#> 2 1998-01-01 02:00:00 0.431
#> 3 1998-01-01 03:00:00 0.431
#> 4 1998-01-01 04:00:00 0.431
#> 5 1998-01-01 05:00:00 0.431
#> 6 1998-01-01 06:00:00 0.431
#> 7 1998-01-01 07:00:00 0.431
#> 8 1998-01-01 08:00:00 0.431
#> 9 1998-01-01 09:00:00 0.431
#> 10 1998-01-01 10:00:00 0.431
#> # … with 5,077 more rows
Here is a slightly more complex example for hourly disaggregation.
This post explains conversion to high-frequency in more detail.

ggplot `geom_segment()` fails to recognize `group_by()` specification

library(tidyverse)
library(lubridate)
library(stringr)
df <-
  tibble(Date = as.Date(0:364, origin = "2017-07-01"), Value = rnorm(365)) %>%
  mutate(Year = str_sub(Date, 1, 4),
         MoFloor = floor_date(Date, unit = "month")) %>%
  group_by(Year, MoFloor) %>%
  mutate(MoAvgValue = mean(Value)) %>%
  ungroup() %>%
  group_by(Year) %>%
  mutate(MinMoFloor = min(MoFloor),
         MaxMoFloor = max(MoFloor),
         YearAvgValue = mean(MoAvgValue))
#> # A tibble: 365 x 8
#> # Groups: Year [2]
#> Date Value Year MoFloor
#> <date> <dbl> <chr> <date>
#> 1 2017-07-01 -1.83 2017 2017-07-01
#> 2 2017-07-02 -2.13 2017 2017-07-01
#> 3 2017-07-03 1.49 2017 2017-07-01
#> 4 2017-07-04 0.0753 2017 2017-07-01
#> 5 2017-07-05 -0.437 2017 2017-07-01
#> 6 2017-07-06 -0.327 2017 2017-07-01
#> 7 2017-07-07 -1.28 2017 2017-07-01
#> 8 2017-07-08 0.280 2017 2017-07-01
#> 9 2017-07-09 1.24 2017 2017-07-01
#> 10 2017-07-10 0.0921 2017 2017-07-01
#> # ... with 355 more rows, and 4 more
#> # variables: MoAvgValue <dbl>,
#> # MinMoFloor <date>,
#> # MaxMoFloor <date>,
#> # YearAvgValue <dbl>
Let's first plot the data frame above.
ggplot(df, aes(MoFloor, MoAvgValue, group = Year)) +
  facet_grid(~Year, scale = "free_x", space = "free_x") +
  geom_point()
In my call to the facet_grid() function I added the arguments scale = "free_x" and space = "free_x" to get rid of empty white space on the plots.
When I go ahead and add geom_segment()s based on group_by()d data, the scale = "free_x" and space = "free_x" arguments are negated. The empty white space reappears!
ggplot(df, aes(MoFloor, MoAvgValue, group = Year)) +
  facet_grid(~Year, scale = "free_x", space = "free_x") +
  geom_point() +
  geom_segment(data = df,
               aes(x = min(MinMoFloor),
                   y = YearAvgValue,
                   xend = max(MaxMoFloor),
                   yend = YearAvgValue))
My df data frame is grouped by Year. Why doesn't geom_segment() recognize this when I pass (for example) x = min(MinMoFloor)? geom_segment() is pulling min(MinMoFloor) from the whole column instead of the grouped column. How do I get geom_segment() to evaluate the MinMoFloor column as grouped data?
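One common workaround (a sketch, not from the original thread): expressions inside aes() are evaluated over the layer's entire data, and group = only controls how the resulting shapes are connected, so min() sees the full column regardless of grouping. Precompute one row per Year with dplyr and pass that summary as the layer's data:

```r
library(dplyr)
library(ggplot2)

# One row per Year, with the segment endpoints computed per group
# *before* the data reaches ggplot.
segs <- df %>%
  group_by(Year) %>%
  summarise(x    = first(MinMoFloor),
            xend = first(MaxMoFloor),
            y    = first(YearAvgValue))

ggplot(df, aes(MoFloor, MoAvgValue)) +
  facet_grid(~Year, scales = "free_x", space = "free_x") +
  geom_point() +
  geom_segment(data = segs,
               aes(x = x, y = y, xend = xend, yend = y),
               inherit.aes = FALSE)
```

Because segs carries a Year column, each segment still lands on the correct facet.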

How to cope with an embedded-nulls error when importing tables in R, from Boston BlueBikes data [duplicate]

This question already has answers here:
Using R to download zipped data file, extract, and import data
(10 answers)
Closed 4 years ago.
I am trying to read in a dataset from this zip file link within an R Markdown document: https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip. First I used the code called "code1" below, but the console spits out an error message:
line 1 appears to contain embedded nulls
Error in read.table("https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip", : more columns than column names
Then I made some adjustments; the other code is called "code2" as shown below, but the console still spits out an error message:
invalid input found on input connection 'https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip'
incomplete final line found by readTableHeader on 'https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip'
I have looked through all the possible solutions online and tried many other ways, but still could not make it to work. Could someone tell me a solution? Really appreciate it!
code1 <- read.table("https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip", header = TRUE, sep = ",")
code2 <- read.table("https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip", header = TRUE, sep = ",", fileEncoding = "utf-8", skipNul = TRUE)
You can wrap it all in one function:
library(tidyverse)
read_zip <- function(path_down, file_name = NULL) {
  if (is.null(file_name)) stop("please provide a file name")
  download.file(path_down,
                destfile = paste0(file_name, ".zip"))
  unzip(paste0(file_name, ".zip"))
  return(read_csv(paste0(file_name, ".csv")))
}
data <- read_zip(path_down = "https://s3.amazonaws.com/hubway-data/201901-bluebikes-tripdata.zip",
                 file_name = "201901-bluebikes-tripdata")
data
## A tibble: 69,872 x 15
# tripduration starttime stoptime
# <dbl> <dttm> <dttm>
# 1 371 2019-01-01 00:09:13 2019-01-01 00:15:25
# 2 264 2019-01-01 00:33:56 2019-01-01 00:38:20
# 3 458 2019-01-01 00:41:54 2019-01-01 00:49:33
# 4 364 2019-01-01 00:43:32 2019-01-01 00:49:37
# 5 681 2019-01-01 00:49:56 2019-01-01 01:01:17
# 6 549 2019-01-01 00:50:01 2019-01-01 00:59:10
# 7 304 2019-01-01 00:54:48 2019-01-01 00:59:53
# 8 425 2019-01-01 01:00:48 2019-01-01 01:07:53
# 9 1353 2019-01-01 01:03:34 2019-01-01 01:26:07
#10 454 2019-01-01 01:08:56 2019-01-01 01:16:30
## ... with 69,862 more rows, and 12 more variables: `start
## station id` <dbl>, `start station name` <chr>, `start
## station latitude` <dbl>, `start station longitude` <dbl>,
## `end station id` <dbl>, `end station name` <chr>, `end
## station latitude` <dbl>, `end station longitude` <dbl>,
## bikeid <dbl>, usertype <chr>, `birth year` <dbl>,
## gender <dbl>
