Handling dates in R, Google Capstone Project Cyclistic

I'm working on the Google Capstone Project (Cyclistic, an analysis of bike-share data), part of the Google Data Analytics course on Coursera.
I'm having issues converting two variables to date format.
The public data are located here
I've downloaded the data from 202109 to 202208.
I am trying to convert the "started_at" and "ended_at" variables to date format, so that I can calculate durations and other measures. This is what I have so far:
library(readxl)
library(ggplot2)
library(dplyr)
library(psych)
library(tidyr)
library(chron)
library(tidyverse)
library(lubridate) #for date functions , helps wrangle data attributes
library(data.table)
library(janitor)
###################################################################################################
### CLEAR WORKSPACE AND SET DIRECTORY
###################################################################################################
rm(list = ls())
rm(list = ls(all.names = TRUE)) # clears all objects, including hidden objects
setwd("[...folder path containing the .csv files...]")
getwd()
dir() # Show all files in the working directory
###################################################################################################
### Set some useful functions
###################################################################################################
detach_package = function(pkg, character.only = FALSE) {
  if (!character.only) {
    pkg = deparse(substitute(pkg))
  }
  search_item = paste("package", pkg, sep = ":")
  while (search_item %in% search()) {
    detach(search_item, unload = TRUE, character.only = TRUE)
  }
}
# Standard error function (uses only non-missing values)
stdErr = function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
###################################################################################################
### LOAD and combine DATA
###################################################################################################
# I am loading all the .csv files individually to separate data.frames
data_01 = read.csv("202109-divvy-tripdata.csv")
data_02 = read.csv("202110-divvy-tripdata.csv")
data_03 = read.csv("202111-divvy-tripdata.csv")
data_04 = read.csv("202112-divvy-tripdata.csv")
data_05 = read.csv("202201-divvy-tripdata.csv")
data_06 = read.csv("202202-divvy-tripdata.csv")
data_07 = read.csv("202203-divvy-tripdata.csv")
data_08 = read.csv("202204-divvy-tripdata.csv")
data_09 = read.csv("202205-divvy-tripdata.csv")
data_010 = read.csv("202206-divvy-tripdata.csv")
data_011 = read.csv("202207-divvy-tripdata.csv")
data_012 = read.csv("202208-divvy-tripdata.csv")
# Combine all in one data.frame
totDb = rbind(data_01, data_02, data_03, data_04, data_05,
data_06, data_07, data_08, data_09, data_010,
data_011, data_012)
# Remove the individual datasets
rm(data_01, data_02, data_03, data_04, data_05,
data_06, data_07, data_08, data_09, data_010,
data_011, data_012); print("I've removed the original files")
###################################################################################################
### Remove NAs and Duplicates
###################################################################################################
# Remove Missing Values
totDb_Clean = totDb %>%
filter(start_station_name!="") %>%
filter(end_station_name!="")
# remove duplicates
totDb_Clean = totDb_Clean[!duplicated(totDb_Clean$ride_id), ]
print(paste("Removed", nrow(totDb) - nrow(totDb_Clean), "duplicated rows"))
# Remove extra spaces, if any
totDb_Clean$started_at = str_squish(totDb_Clean$started_at)
totDb_Clean$ended_at = str_squish(totDb_Clean$ended_at)
###################################################################################################
### Convert to date and calculate the ride duration
###################################################################################################
This is where I am having trouble. I have tried to convert these variables to date format in several ways, but nothing seems to work. Here are some of my attempts:
1)
totDb_Clean$date = as.Date(totDb_Clean$started_at)
totDb_Clean$week_day = format(as.Date(totDb_Clean$date), "%A")
totDb_Clean$month = format(as.Date(totDb_Clean$date), "%b_%y")
totDb_Clean$year =format(totDb_Clean$date,"%Y")
totDb_Clean = totDb_Clean%>%
mutate(started_at = ymd_hms(as_datetime(started_at)),
ended_at = ymd_hms(as_datetime(ended_at)))
2) Convert to date
totDb_Clean2$started_at = as.POSIXct(totDb_Clean$started_at, tz = "GMT6", format = "%m/%d/%Y %H:%M")
totDb_Clean2$ended_at = as.POSIXct(totDb_Clean$ended_at, tz = "GMT6", format = "%m/%d/%Y %H:%M")
None of these approaches seems to be working. Any help is appreciated, thank you for your time!
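For reference, this is roughly what I expect the working conversion to look like — a minimal sketch assuming the raw started_at / ended_at values are plain "YYYY-MM-DD HH:MM:SS" character strings (my assumption from looking at the .csv files):
totDb_Clean = totDb_Clean %>%
  mutate(started_at = ymd_hms(started_at),   # lubridate; returns NA where parsing fails
         ended_at = ymd_hms(ended_at),
         ride_length_min = as.numeric(difftime(ended_at, started_at, units = "mins")),
         week_day = wday(started_at, label = TRUE),
         month = format(started_at, "%b_%y"))
# rows where ymd_hms() returned NA would show which timestamps do not match the expected format
sum(is.na(totDb_Clean$started_at))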

Related

R and lapply to import Excel documents but ignore documents that are not in the correct format

I am new to R as of today, so this may be simple, but I cannot find a solution anywhere.
I am trying to loop through .xlsx files, format them, and then bind them into one dataset. It works, except that a few files have different row names and numbers of rows, which breaks the formatting and ends the loop. I would like those files to be ignored when the formatting and binding happens, and ideally have their file names printed to the console.
This is what I have so far; feel free to ask questions if I don't make sense.
library(readxl)
library(tidyr)
library(tidyverse)
library(janitor)
setwd("~:/Users/sam/Desktop/Information_engineering/traffic")
my_files <- list.files(pattern = ".xlsx", recursive = TRUE)
traffic_congestion = lapply(my_files, function(i){
my_data = read_excel(i, sheet = 1, range = "A6:H1446")
my_data_location = read_excel(i, sheet = 1, range = "A1:A2")
my_data <- na.omit(my_data)
my_data <- pivot_longer(my_data, cols=2:8, names_to = "Date_day", values_to = "Amount")
colnames(my_data) <- c("times", "dates", "Amount")
my_data$dates <- excel_numeric_to_date(as.numeric(my_data$dates))
my_data$times <- strftime(as.Date(my_data$times), format = "%H:%M:%S")
my_data <- my_data %>% mutate(location = my_data_location[1])
my_data
})
traffic_congestion = do.call("rbind.data.frame", traffic_congestion)
This is the top of the spreadsheet that I want to bind
This is the top of the spreadsheet that I don't want to bind
Try with
library(data.table)
rbindlist(traffic_congestion, fill = T)
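To also skip the files with a different layout and print their names, as the question asks, one option is to wrap the body of the lapply in tryCatch so that a failing file returns NULL; rbindlist(fill = TRUE) then simply drops the NULL elements. A sketch, assuming the malformed files actually throw an error somewhere in the formatting steps (it reuses the same packages as above plus data.table):
traffic_congestion = lapply(my_files, function(i){
  tryCatch({
    my_data = read_excel(i, sheet = 1, range = "A6:H1446")
    my_data_location = read_excel(i, sheet = 1, range = "A1:A2")
    my_data <- na.omit(my_data)
    my_data <- pivot_longer(my_data, cols = 2:8, names_to = "Date_day", values_to = "Amount")
    colnames(my_data) <- c("times", "dates", "Amount")
    my_data$dates <- excel_numeric_to_date(as.numeric(my_data$dates))
    my_data$times <- strftime(as.Date(my_data$times), format = "%H:%M:%S")
    my_data %>% mutate(location = my_data_location[1])
  }, error = function(e) {
    message("Skipping file with unexpected format: ", i)  # report the bad file
    NULL                                                  # NULL elements are dropped by rbindlist()
  })
})
traffic_congestion = rbindlist(traffic_congestion, fill = TRUE)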

Opening .mol files and compiling information

I'm trying to create a program that opens a lot of files (.mol), copies specific information from those files, and saves it into a spreadsheet (TAB-delimited file, '\t').
I have 10,000 .mol files on my computer, named SN00000001, SN00000002, SN00000003 ... SN00010000.
(download link => http://bioinf-applied.charite.de/supernatural_new/src/download_mol.php?sn_id=SN00000001)
I have two questions:
I already tried the functions load.molecules (rcdk) and loadsdf (ChemmineR), but I did not succeed in opening a .mol file in R.
Is it possible to open each .mol file and save specific information such as "ID", "Name" and "Molecular Formula" into a single spreadsheet using R?
Ok, I will send you the code
library(tidyverse)   # tibble(), map(), write_delim()
library(ChemmineR)   # read.SDFset()
# get the full path of your mol files
mol_files <- list.files(path = file.path(getwd(), "/Users/189919604/Desktop/Download SuperNatural II/SN00000001"),  # specify your folder here
                        pattern = "*mol",
                        full.names = TRUE)
# create tibble, with filenames (incl. the full path)
df <- tibble(filenames = mol_files)
# create function to extract all the information
extract_info <- function(sdfset) {
  # function to extract information from an SDFset (ChemmineR)
  # this only works if there is one molecule in the sdfset
  ID <- sdfset@SDF[[1]]@datablock["SNID"]
  Name <- sdfset@SDF[[1]]@header["Molecule_Name"]
  Molecular_Formula <- sdfset@SDF[[1]]@datablock["Molecular Formula"]
  sdf_info <- tibble(SNID = ID,
                     Name = Name,
                     MolFormula = Molecular_Formula)
  return(sdf_info)
}
# read all files and extract info
df <- df %>%
  mutate(sdf_data = map(.x = filenames,
                        .f = ~ read.SDFset(sdfstr = .x)),
         info = map(.x = sdf_data,
                    .f = ~ extract_info(sdfset = .x)))
# make a nice tibble with only the info you want
all_info <- df %>%
  select(info) %>%
  unnest(info)
# write to file
write_delim(x = all_info,
path = file.path(getwd(), "test.tsv"),
delim = "\t")
I hope this works; I have only tested it with 2 mol files. I used read.SDFset from the ChemmineR package to read all the mol files. I use the tidyverse package to work with tibbles; tibbles are essentially data frames with some extra properties and functionality.
library(tidyverse)
library(ChemmineR)
# get the full path of your mol files
mol_files <- list.files(
  # specify your folder here; on Windows also add your drive letter, e.g. "c:/users/path/to/my/mol_files"
  path = "/home/rico/r-stuff/temp",
  pattern = "*mol",
  full.names = TRUE)
# create tibble, with filenames (incl. the full path)
df <- tibble(filenames = mol_files)
# create function to extract all the information
extract_info <- function(sdfset) {
  # function to extract information from an SDFset (ChemmineR)
  # this only works if there is one molecule in the sdfset
  ID <- sdfset@SDF[[1]]@datablock["SNID"]
  Name <- sdfset@SDF[[1]]@header["Molecule_Name"]
  Molecular_Formula <- sdfset@SDF[[1]]@datablock["Molecular Formula"]
  sdf_info <- tibble(SNID = ID,
                     Name = Name,
                     MolFormula = Molecular_Formula)
  return(sdf_info)
}
# read all files and extract info
df <- df %>%
mutate(sdf_data = map(.x = filenames,
.f = ~ read.SDFset(sdfstr = .x)),
info = map(.x = sdf_data,
.f = ~ extract_info(sdfset = .x)))
# make a nice tibble with only the info you want
all_info <- df %>%
select(info) %>%
unnest(info)
# write to file
write_delim(x = all_info,
path = file.path(getwd(), "temp", "test.tsv"),
delim = "\t")

How to summarize by Quarter in R

I am having some difficulty summarizing data from my database in R. I am looking to pull the data and have it summarized by quarter.
Below is the code I am using to get a .txt output, but I am getting errors.
What do I need to change in the code so that the data are summarized by quarter?
library(data.table, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
################
## PARAMETERS ##
################
# Set path of major source folder for raw transaction data
in_directory <- "C:/Users/name/Documents/Raw Data/"
# List names of sub-folders (currently grouped by first two characters of
CUST_ID)
in_subfolders <- list("AA-CA", "CB-HZ", "IA-IL", "IM-KZ", "LA-MI", "MJ-MS",
"MT-NV", "NW-OH", "OI-PZ", "QA-TN", "TO-UZ",
"VA-WA", "WB-ZZ")
# Set location for output
out_directory <- "C:/Users/name/Documents/YTD Master/"
out_filename <- "NEW.csv"
# Set beginning and end of date range to be collected - year-month-day format
date_range <- interval(as.Date("2018-01-01"), as.Date("2018-05-31"))
# Enable or disable filtering of raw files to only grab items bought within certain months, to save space.
# If FALSE, all files will be scanned for unique items, which will take longer and produce a larger file.
date_filter <- TRUE
##########
## CODE ##
##########
starttime <- Sys.time()
mastertable <- NULL
for (j in 1:length(in_subfolders)) {
subfolder <- in_subfolders[j]
sub_directory <- paste0(in_directory, subfolder, "/")
## IMPORT DATA
in_filenames <- dir(sub_directory, pattern =".txt")
for (i in 1:length(in_filenames)) {
# Default value provided for when fast filtering is disabled.
read_this_file <- TRUE
# To fast-filter the data, we choose to include or exclude an entire file based on the date of its first line.
# WARNING: This is only a valid method if filtering by entire months, since that is the amount of data housed in each file.
if (date_filter) {
temptable <- fread(paste0(sub_directory, in_filenames[i]),
colClasses=c(CUSTOMER_TIER = "character"),
na.strings = "", nrows = 1)
temptable[, INVOICE_DT := as.Date(INVOICE_DT)]
# If the date matches, set the read flag to TRUE. If it does not match, set the read flag to FALSE.
read_this_file <- temptable[, INVOICE_DT] %within% date_range
}
if (read_this_file) {
print(Sys.time()-starttime)
print(paste0("Reading in ", in_filenames[i]))
temptable <- fread(paste0(sub_directory, in_filenames[i]), colClasses=c(CUSTOMER_TIER = "character"),
na.strings = "")
temptable <- temptable[, lapply(.SD, sum), by = quarter(INVOICE_DT),
.SDcols = c("INV_ITEM_ID","Ext Sale", "Ext Total Cost", "CE100", "CE110","CE120","QTY_SOLD","PACKSLIP_WHSL")]
# Combine into full list
mastertable <- rbindlist(list(mastertable, temptable), use.names = TRUE)
# Release unneeded memory
rm(temptable)
}
}
}
# Save Final table
print("Saving master table")
fwrite(mastertable, paste0(out_directory, out_filename))
rm(mastertable)
print(Sys.time()-starttime)
After running this script, this is the error message I receive:
Error in gsum(INV_ITEM_ID) :
Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)
Here is the general approach with some generic data.
library(tidyverse)
library(lubridate)
data.frame(date = seq(as.Date('2010-01-12'), as.Date('2018-02-03'), by = 100),
var = runif(30)) %>%
group_by(quarter(date, with_year = T)) %>%
summarize(average_var = mean(var))
You can leave out "with_year = T" if you don't care about the differences between years.
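Applied to the data.table code in the question, the direct fix for the error is to keep the character column INV_ITEM_ID out of the sum: sum only the numeric measures by quarter and count line items separately. A sketch, assuming the remaining columns are all numeric:
# sum numeric measures by quarter; count rows instead of summing the character INV_ITEM_ID,
# since sum() on a character column is what triggers the GForce error
num_cols <- c("Ext Sale", "Ext Total Cost", "CE100", "CE110", "CE120",
              "QTY_SOLD", "PACKSLIP_WHSL")
temptable <- temptable[, c(lapply(.SD, sum), list(n_items = .N)),
                       by = .(qtr = quarter(INVOICE_DT)),
                       .SDcols = num_cols]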

Extract numerical value from PDF chart to a variable in R

I'm trying to pull a numerical value from a chart that has been embedded in a PDF.
I tried the two methods below; I was able to convert all of the other information to xlsx, but not the line-chart information.
Link to the pdf:
http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf
The value that I need to pull into a variable
1st Method
library(pdftools)
library(stringr)
library(xlsx)
set.seed(100)
tx <- pdf_text("flureport.pdf")
tx2 <- unlist(str_split(tx, "[\\r\\n]+"))
tx3 <- str_split_fixed(str_trim(tx2), "\\s{2,}", 5)
write.xlsx(tx3, file="ds.xlsx")
2nd Method
library('tm')
file <- 'flureport.pdf'
Rpdf <- readPDF(control = list(text = "-layout"))
corpus <- VCorpus(URISource(file),
readerControl = list(reader = Rpdf))
corpus.array <- content(content(corpus)[[1]])
c<-data.frame(corpus.array)
write.xlsx(c, file="x.xlsx")
Neither of the xlsx files I wrote contained any chart information from which I could fetch the value.
This is the solution that worked for me. I'm not sure whether it would work in all cases, but it did work in this particular one.
Thanks @user2554330 for mentioning OCR.
library(pdftools)
library(stringr)
library(tesseract)
library(magick)
library(magrittr)
list <- c('http://blog.mass.gov/publichealth/wp-content/uploads/sites/11/2018/01/Weekly-Flu-Report-01-19-2018.pdf')
sapply(list, function(x)
pdf_convert(x, format = "png", pages = NULL, filenames = NULL, dpi = 300, opw = "", upw = "", verbose = TRUE))
text <- image_read("Weekly-Flu-Report-01-19-2018_1.png") %>%
image_resize("2000") %>%
image_convert(colorspace = 'gray') %>%
image_trim() %>%
image_ocr()
a<-print(text)
massili<-regmatches(a, gregexpr("\\d+(\\.\\d+){0,1} %", a))[[1]]
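To turn the matched string of the form "number %" into an actual numeric variable, one more step (flu_value is just an illustrative name, assuming the OCR output keeps that pattern):
# drop the percent sign and surrounding whitespace, then coerce to numeric
flu_value <- as.numeric(sub("\\s*%", "", massili))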

Does anyone know how to download TRMM 3B42 time series data?

I'm trying to download TRMM 3B42 3-hour binary data for a given time span from this NASA FTP server.
There is excellent code by Florian Detsch to download the daily product (here is the link: https://github.com/environmentalinformatics-marburg/Rsenal/blob/master/R/downloadTRMM.R), included in the GitHub-only Rsenal package. Unfortunately, it does not work for the 3-hour data.
I changed the code:
downloadTRMM <- function(begin, end, dsn = ".", format = "%Y-%m-%d.%H") {
## transform 'begin' and 'end' to 'Date' object if necessary
if (!class(begin) == "Date")
begin <- as.Date(begin, format = format)
if (!class(end) == "Date")
end <- as.Date(end, format = format)
## trmm ftp server
ch_url <-"ftp://disc2.nascom.nasa.gov/data/TRMM/Gridded/3B42_V7/"
## loop over daily sequence
ls_fls_out <- lapply(seq(begin, end, 1), function(i) {
# year and julian day (name of the corresponding folder)
tmp_ch_yr <- strftime(i, format = "%Y%m")
#tmp_ch_dy <- strftime(i, format = "%j")
# trmm date format
tmp_dt <- strftime(i+1, format = "%Y%m%d.%H")
# list files available on server
tmp_ch_url <- paste(ch_url, tmp_ch_yr, "", sep = "/")
tmp_ch_fls <- tmp_ch_fls_out <- character(2L)
for (j in 1:2) {
tmp_ch_fls[j] <- paste0("3B42.", tmp_dt, "z.7.precipitation",
ifelse(j == 1, ".bin"))
tmp_ch_fls[j] <- paste(tmp_ch_url, tmp_ch_fls[j], sep = "/")
tmp_ch_fls_out[j] <- paste(dsn, basename(tmp_ch_fls[j]), sep = "/")
download.file(tmp_ch_fls[j], tmp_ch_fls_out[j], mode = "wb")
}
# return data frame with *.bin and *.xml filenames
tmp_id_xml <- grep("xml", tmp_ch_fls_out)
data.frame(bin = tmp_ch_fls_out[-tmp_id_xml],
xml = tmp_ch_fls_out[tmp_id_xml],
stringsAsFactors = FALSE)
})
## join and return names of processed files
ch_fls_out <- do.call("rbind",ls_fls_out)
return(ch_fls_out)
}
getwd()
setwd("C:/Users/joaoreis/Documents/Bases_Geograficas/trmm_3h/")
fls_trmm <- downloadTRMM(begin = "2008-01-01.00", end = "2008-01-05.00")
fls_trmm
But I get the following error:
trying URL
'ftp://disc2.nascom.nasa.gov/data/TRMM/Gridded/3B42_V7//200801//3B42.20080102.00z.7.precipitation.bin'
Error in download.file(tmp_ch_fls[j], tmp_ch_fls_out[j], mode = "wb")
: cannot open URL
'ftp://disc2.nascom.nasa.gov/data/TRMM/Gridded/3B42_V7//200801//3B42.20080102.00z.7.precipitation.bin'
In addition: Warning message: In download.file(tmp_ch_fls[j],
tmp_ch_fls_out[j], mode = "wb") : InternetOpenUrl failed: '' Called
from: download.file(tmp_ch_fls[j], tmp_ch_fls_out[j], mode = "wb")
Does anyone know how to fix it using R?
Thanks!
As of commit 909f98a, I have enabled the automated retrieval of 3-hourly data from ftp://disc3.nascom.nasa.gov/data/s4pa/TRMM_L3. Make sure you have the latest version of Rsenal installed using
devtools::install_github("environmentalinformatics-marburg/Rsenal")
and then have a look at the examples in ?downloadTRMM. For now, the function supports both character (requires 'format' argument passed on to strptime) and POSIXlt input. For example, something like
downloadTRMM(begin = "2015-01-01 12:00", end = "2015-01-03 12:00",
type = "3-hourly", format = "%Y-%m-%d %H:%M")
to download 3-hourly data from 1-3 January 2015 (noon to noon) should now work just fine.
Note that in contrast to the FTP server you mentioned, the data comes in .HDF format and a rasterize method has not been implemented so far, meaning that you have to deal with the container files yourself. I'll try to figure out something more convenient soon regarding the automated rasterization of the data.
