Trimming unwanted characters - r

I have a very large data set (CSV) with information about bicycle counts from a bike share system. The information I'm working with is the time at which bicycles were taken out of the racks (departure time) and the total travel time. I want to add the two to get the arrival time at the arrival station. The departure time variable is FECHA_HORA_RETIRO and the travel time variable is TIEMPO_USO. The former, which R reads as a factor, is in the following format: "23/01/2017 19:55:16". TIEMPO_USO, on the other hand, is read by R as character and is in the following format: "0:17:46".
> head(viajes_ecobici_2017$FECHA_HORA_RETIRO)
[1] 28/01/2017 13:51 17/01/2017 16:24 12/01/2017 16:38 25/01/2017 10:31
> head(viajes_ecobici_2017$TIEMPO_USO)
[1] "1:35:37" "0:11:17" "0:32:51" "0:31:29" "1:31:59" "0:21:43" "0:5:43"
I first used strptime to get everything in the desired format
> viajes_ecobici_2017$FECHA_HORA_RETIRO =format(strptime(viajes_ecobici_2017$FECHA_HORA_RETIRO,format = "%d/%m/%Y %H:%M"),format = "%d/%m/%Y %H:%M:%S")
> viajes_ecobici_2017$TIEMPO_USO = format(strptime(viajes_ecobici_2017$TIEMPO_USO, format="%H:%M:%S"), format="%H:%M:%S")
This works for most observations. However, several observations became NA after running this code. I went back to the original data to see why this was happening and created a variable with just the observations that became NA. When I looked closer at these observations, I saw they have the format "\t\t01/06/2017 00:01". How can I get rid of the "\t\t" while preserving the rest of the information?
Thanks in advance for your help.

trimws() trims white space (including tab characters, \t) from the ends of a character variable:
viajes_ecobici_2017$TIEMPO_USO <- trimws(viajes_ecobici_2017$TIEMPO_USO)
For what it's worth, readr::read_csv() has a built-in trim_ws argument (TRUE by default) that strips leading and trailing whitespace from each field as it is read.
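Since the stray tabs in the question appear in the departure-time column, the same call can be applied there before parsing. The following is a minimal sketch on made-up rows, not the real data: the data frame `viajes`, the helper `to_seconds()` and the choice of `tz = "UTC"` are assumptions made for illustration.

```r
# Hypothetical sample mimicking the two columns from the question
viajes <- data.frame(
  FECHA_HORA_RETIRO = c("28/01/2017 13:51", "\t\t01/06/2017 00:01"),
  TIEMPO_USO        = c("1:35:37", "0:5:43"),
  stringsAsFactors  = FALSE
)

# Strip leading/trailing whitespace (spaces, tabs, newlines)
viajes$FECHA_HORA_RETIRO <- trimws(viajes$FECHA_HORA_RETIRO)

# Parse the departure time; tz = "UTC" is an assumption made for reproducibility
departure <- as.POSIXct(viajes$FECHA_HORA_RETIRO,
                        format = "%d/%m/%Y %H:%M", tz = "UTC")

# Illustrative helper: turn "H:M:S" strings into a number of seconds
to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Adding seconds to a POSIXct value gives the arrival time
arrival <- departure + to_seconds(viajes$TIEMPO_USO)
print(format(arrival, "%d/%m/%Y %H:%M:%S"))
```

Keeping the columns as POSIXct and numeric seconds, rather than formatting them back to strings first, is what makes the addition possible; format() is only needed for display.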

Assuming that the variable with the problem is TIEMPO_USO, then a simple regex would take care of the tab characters ("\t")
viajes_ecobici_2017$TIEMPO_USO <- gsub("^\\t\\t","", viajes_ecobici_2017$TIEMPO_USO)
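If the number of leading tabs varies between rows, a pattern that accepts one or more tabs may be more forgiving. This is only a sketch on made-up strings; the vector `x` is hypothetical.

```r
# Made-up values: two leading tabs, none, and a single leading tab
x <- c("\t\t01/06/2017 00:01", "28/01/2017 13:51", "\t02/06/2017 08:15")

# Flag the entries that contain a tab anywhere
has_tab <- grepl("\t", x, fixed = TRUE)

# Remove one or more leading tabs; "\t" in an R string is a literal tab character
cleaned <- sub("^\t+", "", x)
print(cleaned)
```

The `has_tab` vector can also be used to count or inspect the affected rows before changing anything.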

Related

conversion from 12 hr to 24 hr in R and combine two tables

[Image: the Y table]
I want to roll join two tables, trial and trial2, with the timestamp as the key. The table 'trial' has a POSIXct timestamp as its key, while 'trial2' has its timestamp stored as character. I tried to convert the 'trial2' timestamp from 12-hour format to 24-hour format (POSIXct) so that I can apply a roll join on them, but everything I have tried so far gives me NULL in the resulting field rolli for trial2.
library(data.table)
library(dplyr)
library(lubridate)
library(readr)
library(hms)
trial <- read_csv("X.csv")
trial2 <- read_csv("Y.csv")
trial2$rolli<- as.POSIXct(trial2$date ,format = '%m/%d/%Y %I:%M:%S %p')
#######OR#########
trial2$rolli<-strptime(trial2$date, "%m/%d/%Y %I:%M:%S %p")
#######OR#########
trial2$rolli<-ymd_hms(trial2$date)
trial<-mutate(trial, rolli=ymd_hms(paste("2018-11-27", Time), tz='Asia/Kolkata'))
trial<-data.table(trial)
trial2<-data.table(trial2)
setkey(trial, rolli)
setkey(trial2, rolli)
try<-trial[trial2, roll = "nearest"]
class(trial$rolli)
#[1] "POSIXct" "POSIXt"
class(trial2$rolli)
#[1] "POSIXct" "POSIXt"
Debugging is always hard, so a general tip: try to reduce it as much as possible.
Looking at it, I'd think that the parsing of the character hits a problem. I'm not sure about lubridate and ymd_hms, but the as.POSIXct and strptime calls should work.
You can check by printing trial2$date and trial2$rolli. If the date looks fine, but rolli consists of all NA's, then that's the problem.
Probably the dates provided as characters are not in exactly the right format; these functions can be very picky.
In order to know exactly what is going wrong, I'd need to see a sample of Y.csv, but you can check if you've typed everything exactly right: spaces, or have you switched "\" and "/"? Also, I normally work with 24-hour-notation, so it could be that strptime is picky about a specification being "am" or "AM" or "a.m." or something else.
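One way to narrow this down is to parse the column and list only the inputs that came back as NA. The strings below are invented for illustration, and the sketch assumes an English or C locale, since AM/PM matching with %p is locale-dependent.

```r
# Invented inputs; the second one uses "-" instead of "/" on purpose
dates <- c("11/27/2018 11:44:04 AM",
           "11-27-2018 11:44:04 AM",
           "11/27/2018 01:05:09 PM")

# tz = "UTC" is an assumption made so the result does not depend on the machine
parsed <- as.POSIXct(dates, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

# Inputs that did not match the format come back as NA
bad <- dates[is.na(parsed)]
print(bad)
```

Looking at `bad` usually shows quickly whether the problem is a separator, extra whitespace, or an unexpected AM/PM spelling.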
EDIT: I've seen the format you're trying to supply, which has decimals in the seconds, which means %S doesn't do the trick.
Instead, you want %OS (it is in the help for ?strptime, but it's quite hidden). Also, I can't see it clearly in the image, but in your original code, there are 2 spaces between "%Y" and "%I". Are there 2 in your input as well?
Anyway:
strptime('11/27/2018 11:44:04.479 AM', format='%m/%d/%Y %I:%M:%OS %p')
# Works for me
trial2$rolli<-strptime(trial2$date, "%m/%d/%Y %I:%M:%OS %p")
# Should solve your problem.
Furthermore, when printing trial2$rolli, the fractional part is not shown, but it is stored. You can show it with as.numeric(trial2$rolli) %% 1, although there may be some small rounding differences.
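As a small illustration of the fractional part being stored but not printed, the sketch below parses a made-up timestamp in 24-hour notation (to keep it independent of AM/PM handling) and then shows the fraction in two ways. The value and `tz = "UTC"` are assumptions.

```r
# A made-up timestamp with milliseconds
x <- as.POSIXct("11/27/2018 11:44:04.479",
                format = "%m/%d/%Y %H:%M:%OS", tz = "UTC")

# Plain %S hides the fraction
print(format(x, "%H:%M:%S"))

# %OS3 asks for three decimals; the last digit can be off by one
# because the value is stored as a floating-point number
print(format(x, "%H:%M:%OS3"))

# The fractional part itself, as suggested in the answer
frac <- as.numeric(x) %% 1
print(frac)
```

Setting options(digits.secs = 3) is another way to make the default print method show the decimals.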
2nd EDIT:
To fix problems where you have times like 0:00 PM in your input (which is technically wrong, but you might not have control over your input), you can use:
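trial2$date <- sub('\\b0+(:..:..)', '12\\1', trial2$date, perl = TRUE)
It replaces an hour written as 0 or 00 (0:restoftime or 00:restoftime) with 12:restoftime. The word boundary \b keeps hours such as 10 or 20 from being matched, and inside an R string the backreference has to be written as '\\1'.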
Only be careful about what your source actually means by something like 0:00:00.000 AM: is this midnight or noon? I don't know how R-functions handle this generally (or even if it's guaranteed to always be the same), and I'm not going to burn my hands on that question. If you look on the internet there are a lot of people who have very strong opinions on what AM/PM means in those circumstances, in all variations.
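A quick check of this kind of replacement on made-up strings may help before running it on the real column. The pattern here is anchored at a word boundary so that hours such as 10 are left alone; the vector `x` is hypothetical.

```r
# Invented inputs: hour given as 0, as 00, and a legitimate 10
x <- c("11/27/2018 0:15:30.000 PM",
       "11/27/2018 00:15:30.000 AM",
       "11/27/2018 10:00:00.000 AM")

# Replace a zero hour with 12; perl = TRUE is used for the \b word boundary
fixed <- sub("\\b0+(:..:..)", "12\\1", x, perl = TRUE)
print(fixed)
```

Whether a zero hour with AM should really become 12 AM (midnight) depends on what the data source meant, as discussed above.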

Converting from fctr to date format.

I am attempting to convert a column in my data set from fctr to date format. The current column has data formatted as follows: "01/01/14. 01:00 Am." Ideally I would like to create a column for day and then a column for time as well. There are periods following the day and the time which is another issue I am facing. So far I have attempted to use lubridate to create a new column of data but I get the error "All formats failed to parse. No formats found." Any help would be greatly appreciated, thank you.
test <- fourteen %>%
mutate(When = mdy_hms(V3))
View(test)
If your date factor literally has levels that look like 01/01/14. 01:00 Am. including two periods and a space between the first period and the first hour digits and a space between the minutes and the am/pm designation, and all the dates are in this format, then the following should work:
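... mutate(When = as.POSIXct(V3, format="%m/%d/%y. %I:%M %p.")) ...
Note the %I (12-hour clock) rather than %H: according to ?strptime, %p is only used in conjunction with %I, so with %H a "Pm." time would not be shifted to the afternoon. In particular, the following standalone testcase works fine:
as.POSIXct(factor("01/01/14. 01:00 Am."), format="%m/%d/%y. %I:%M %p.")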
For more information on the format argument being used here, see the R help page for the function strftime.
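To get the separate day and time columns the question asks for, the parsed value can be split afterwards. This is a sketch on two invented values; it uses %I for the hour because ?strptime documents %p as working together with %I, and it assumes an English or C locale and `tz = "UTC"`.

```r
# Invented factor levels in the format described in the question
raw <- factor(c("01/01/14. 01:00 Am.", "01/01/14. 01:00 Pm."))

# Parse once into a date-time; the literal periods are part of the format string
when <- as.POSIXct(as.character(raw),
                   format = "%m/%d/%y. %I:%M %p.", tz = "UTC")

# Split into a Date column and a 24-hour "HH:MM" string column
day   <- as.Date(when)
clock <- format(when, "%H:%M")
print(data.frame(day = day, clock = clock))
```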

removing date from %d/%m/%Y %H:%M in R

The R code that I am working on is supposed to use data collected at five-minute intervals.
The data is saved in CSV format. However, due to inconsistencies in how the data was collected, the time column sometimes contains a full timestamp instead of just a time (dd/mm/yyyy HH:MM instead of HH:MM).
This causes an error in my system, because it reads the data as having multiple different values for the same time. Therefore, I would like to drop the date from the timestamp so that the code only reads the time value.
My failed attempt was:
as.Date(data[[1]],"%H:%M")
which gave me all NA values for the time column.
I have searched for similar questions on SO, but I did not manage to find a clear answer to my question. Can anyone suggest some functions I could use?
I appreciate your help.
You could just strip the date portion of the text and then use as.POSIXct to convert them all to a %H:%M timestamp, e.g.:
x <- c("10:25","01/01/2014 10:30")
x <- gsub("^.+(\\d{2}:\\d{2})$","\\1",x)
as.POSIXct(x,format="%H:%M",tz="UTC")
#[1] "2014-06-02 10:25:00 UTC" "2014-06-02 10:30:00 UTC"
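Note that as.POSIXct fills in the current date, so the date shown in the output depends on the day the code is run. If only the clock time is needed for matching intervals, it may be enough to keep the stripped string or convert it to minutes since midnight. A sketch on made-up values (the vector `x` and the variable names are hypothetical):

```r
# Made-up inputs: a bare time and two full timestamps
x <- c("10:25", "01/01/2014 10:30", "31/12/2014 23:55")

# Keep only the trailing HH:MM part
time_only <- sub("^.*([0-9]{2}:[0-9]{2})$", "\\1", x)

# Minutes since midnight, convenient for comparing or grouping by time of day
minutes <- as.integer(substr(time_only, 1, 2)) * 60L +
           as.integer(substr(time_only, 4, 5))
print(data.frame(time_only, minutes))
```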

Time series (xts) strptime; ONLY month and day

I've been trying to do a time series on my dataframe, and I need to strip times from my csv. This is what I've got:
campbell <-read.csv("campbell.csv")
campbell$date = strptime(campbell$date, "%m/%d")
campbell.ts <- xts(campbell[,-1],order.by=campbell[,1])
First, what I'm trying to do is just get xts to strip the dates as "xx/xx", meaning just the month and day. I have no year for my data. When I run that second line of code and look at the date column, it has been converted to "2013-xx-xx". These months and days have no year associated with them, and I can't figure out how to get rid of the 2013. (The csv file I'm reading has the dates in the format "9/30,10/1...etc.")
Secondly, once I try and make a time series (the third line), I am unsure what the "order.by" command is calling on. What am I indexing?
Any help??
Thanks!
strptime needs a full date, i.e. day, month and year. If any of these is missing from the input, it is filled in from the current system date, which is why the current year gets attached to your values. So, if you want to keep the dates in the format you read them in, first store a copy in a temporary variable, and then use strptime on campbell$date to convert it into a date class that R understands. Since the year does not matter to you, you can ignore the one that strptime adds automatically.
campbell <-read.csv("campbell.csv")
date <- campbell$date
campbell$date <- strptime(campbell$date, "%m/%d")
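The behaviour described above can be seen on a single made-up value. The sketch also shows one possible alternative, which is an assumption and not part of the answer: pasting on a fixed dummy year so that results do not change from one year to the next.

```r
# A month/day string without a year
d <- as.Date("9/30", format = "%m/%d")

# The missing year is taken from the system clock
print(format(d, "%Y") == format(Sys.Date(), "%Y"))

# Month and day can be recovered for display regardless of the year
print(format(d, "%m/%d"))

# Alternative (assumption): attach an arbitrary fixed year for reproducibility
d_fixed <- as.Date(paste0("9/30", "/2000"), format = "%m/%d/%Y")
print(d_fixed)
```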
Secondly, what the third line (xts(campbell[,-1],order.by=campbell[,1])) does is order all the data in campbell except the first column (campbell[,-1]) according to the index given by the time data in the first column (campbell[,1]). So it only works if the date is in the first column.
After ordering the data as a time series, you can put date back in place of the campbell$date column to recover the date format you wanted (although you first have to order date as well, as shown below):
date <- xts(date, order.by=campbell[,1]) # assuming campbell$date is campbell[,1]
campbell.ts <- xts(campbell[,-1], order.by=campbell[,1])
campbell.ts <- cbind(date, campbell.ts)
format(as.Date(campbell$date, "%m/%d/%Y"), "%m/%d")

Converting time format to numeric with R

In most cases, we convert numeric time to POSIXct format using R. However, if we want to compare two time points, then we would prefer the numeric time format. For example, I have a date format like "2001-03-13 10:31:00",
begin <- "2001-03-13 10:31:00"
Using R, I want to convert this into a numeric value (e.g., the Julian time), perhaps something like the number of seconds elapsed between 1970-01-01 00:00:00 and 2001-03-13 10:31:00.
Do you have any suggestions?
The Julian calendar began in 45 BC (709 AUC) as a reform of the Roman calendar by Julius Caesar. It was chosen after consultation with the astronomer Sosigenes of Alexandria and was probably designed to approximate the tropical year (known at least since Hipparchus). see http://en.wikipedia.org/wiki/Julian_calendar
If you just want to remove ":" , " ", and "-" from a character vector then this will suffice:
end <- gsub("[: -]", "" , begin, perl=TRUE)
#> end
#[1] "20010313103100"
You should read the section about 1/4 of the way down in ?regex about character classes. Since the "-" is special in that context as a range operator, it needs to be placed first or last.
After your edit, the answer is clearly what @joran wrote, except that you would first need to convert to a date-time class:
as.numeric(as.POSIXct(begin))
#[1] 984497460
The other point to make is that comparison operators do work for Date and DateTime classed variables, so the conversion may not be necessary at all. This compares 'begin' to a time one second later and correctly reports that begin is earlier:
as.POSIXct(begin) < as.POSIXct(begin) +1
#[1] TRUE
Based on the revised question this should do what you want:
begin <- "2001-03-13 10:31:00"
as.numeric(as.POSIXct(begin))
The result is a unix timestamp, the number of seconds since epoch, assuming the timestamp is in the local time zone.
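Because the result depends on the time zone used for parsing, it can help to state it explicitly. The sketch below fixes `tz = "UTC"` (an assumption; use whatever zone the data was recorded in) and also computes the gap to a second, made-up time point `end`.

```r
begin <- "2001-03-13 10:31:00"
end   <- "2001-03-13 12:01:30"   # hypothetical second time point

# Parse with an explicit time zone so the number is the same on every machine
t0 <- as.POSIXct(begin, tz = "UTC")
t1 <- as.POSIXct(end,   tz = "UTC")

# Seconds since 1970-01-01 00:00:00 UTC
secs <- as.numeric(t0)
print(secs)      # 984479460

# Elapsed seconds between the two time points
elapsed <- as.numeric(difftime(t1, t0, units = "secs"))
print(elapsed)   # 5430
```

The value 984497460 shown earlier differs from this by 18000 seconds, which is consistent with that answer having been run in a zone five hours behind UTC.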
Maybe this could also work:
library(lubridate)
...
df <- '24:00:00'
as.numeric(hms(df))
hms() will convert your data from one time format into another, this will let you convert it into seconds. See full documentation.
I tried this because I had trouble with data that was in that format but spanned more than 24 hours.
The example from the ?as.POSIXct help page gives
as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S"))
so for you it would be
as.numeric(as.POSIXct(strptime(begin, "%Y-%m-%d %H:%M:%S")))
