Split a variable into two columns(tried colsplit) - r

I have one variable called Date which is of the format 03/08/2015 09:00:00 AM.
I want to split the column into two columns, namely Date and Time. I used colsplit() command as:
> colsplit(crime$Date,"",names = c("Date","Time"))[1:5,]
Date Time
1 0 3/18/2015 07:44:00 PM
2 0 3/18/2015 11:00:00 PM
3 0 3/18/2015 10:45:00 PM
4 0 3/18/2015 10:30:00 PM
5 0 3/18/2015 09:00:00 PM
But it's not quite as expected. The Date variable has 0 and the Time variable has the other values. How do I rectify this?
Also, when I try to include these variables in the crime data set, the column names are date.date and date.time.

You can also use tidyr::separate to divide up character columns.
data %>% separate(date_time, c("date", "time"), sep = " ", remove = TRUE)

For your purposes, just use substr:
data.frame(Date = substr(crime$Date, 1, 10), Time = substr(crime$Date, 12, 19))
But I have to agree with the first comment about not changing this to two columns.

Related

Filling not observed observations

I want to make a time series with the frequency a date and time is observed. The raw data looked something like this:
dd-mm-yyyy hh:mm
28-2-2018 0:12
28-2-2018 11:16
28-2-2018 12:12
28-2-2018 13:22
28-2-2018 14:23
28-2-2018 14:14
28-2-2018 16:24
The date and time format is in the wrong way for R, so I had to adjust it:
extracted_times <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
I ordered the data with frequency in a table using the following code:
timeserieswithoutzeros <- table(extracted_times)
The data looks something like this now:
2018-02-28 00:11:00 2018-02-28 01:52:00 2018-02-28 03:38:00
1 2 5
2018-02-28 04:10:00 2018-02-28 04:40:00 2018-02-28 04:45:00
2 1 1
As you may see there are a lot of unobserved dates and times.
I want to add these unobserved dates and times with the frequency of 0.
I tried the complete function, but the error states that it can't best used, because I use as.POSIXct().
Any ideas?
As already mentinoned in the comments by #eric-lecoutre, you can combine your observations with a sequence begining at the earliest ending at the last date using seq and subtract 1 of the frequency table.
timeseriesWithzeros <- table(c(extracted_times, seq(min(extracted_times), max(extracted_times), "1 min")))-1
Maybe the following is what you want.
First, coerce the data to class "POSIXt" and create the sequence of all date/time between min and max by steps of 1 minute.
bedrijf.CSV$viewed_at <- as.POSIXct(bedrijf.CSV$viewed_at, format = "%d-%m-%Y %H:%M")
new <- seq(min(bedrijf.CSV$viewed_at),
max(bedrijf.CSV$viewed_at),
by = "1 mins")
tmp <- data.frame(viewed_at = new)
Now see if these values are in the original data.
tmp$viewed <- tmp$viewed_at %in% bedrijf.CSV$viewed_at
tbl <- xtabs(viewed ~ viewed_at, tmp)
sum(tbl != 0)
#[1] 7
Final clean up.
rm(new, tmp)

subset data by time interval if I have all data between time interval

I have a data frame that looks like this:
X id mat.1 mat.2 mat.3 times
1 1 1 Anne 1495206060 18.5639404 2017-05-19 11:01:00
2 2 1 Anne 1495209660 9.0160321 2017-05-19 12:01:00
3 3 1 Anne 1495211460 37.6559161 2017-05-19 12:31:00
4 4 1 Anne 1495213260 31.1218856 2017-05-19 13:01:00
....
164 164 1 Anne 1497825060 4.8098351 2017-06-18 18:31:00
165 165 1 Anne 1497826860 15.0678781 2017-06-18 19:01:00
166 166 1 Anne 1497828660 4.7636241 2017-06-18 19:31:00
What I would like is to subset the data set by time interval (all data between 11 AM and 4 PM) if there are data points for each hour at least (11 AM, 12, 1, 2, 3, 4 PM) within each day. I want to ultimately sum the values from mat.3 per time interval (11 AM to 4 PM) per day.
I did tried:
sub.1 <- subset(t,format(times,'%H')>='11' & format(times,'%H')<='16')
but this returns all the data from any of the times between 11 AM and 4 PM, but often I would only have data for e.g. 12 and 1 PM for a given day.
I only want the subset from days where I have data for each hour from 11 AM to 4 PM. Any ideas what I can try?
A complement to #Henry Navarro answer for solving an additional problem mentioned in the question.
If I understand in proper way, another concern of the question is to find the dates such that there are data points at least for each hour of the given interval within the day. A possible way following the style of #Henry Navarro solution is as follows:
library(lubridate)
your_data$hour_only <- as.numeric(format(your_data$times, format = "%H"))
your_data$days <- ymd(format(your_data$times, "%Y-%m-%d"))
your_data_by_days_list <- split(x = your_data, f = your_data$days)
# the interval is narrowed for demonstration purposes
hours_intervals <- 11:13
all_hours_flags <- data.frame(days = unique(your_data$days),
all_hours_present = sapply(function(Z) (sum(unique(Z$hour_only) %in% hours_intervals) >=
length(hours_intervals)), X = your_data_by_days_list), row.names = NULL)
your_data <- merge(your_data, all_hours_flags, by = "days")
There is now the column "all_hours_present" indicating that the data for a corresponding day contains at least one value for each hour in the given hours_intervals. And you may use this column to subset your data
subset(your_data, all_hours_present)
Try to create a new variable in your data frame with only the hour.
your_data$hour<-format(your_data$times, format="%H:%M:%S")
Then, using this new variable try to do the next:
#auxiliar variable with your interval of time
your_data$aux_var<-ifelse(your_data$hour >"11:00:00" || your_data$hour<"16:00:00" ,1,0)
So, the next step is filter your data when aux_var==1
your_data[which(your_data$aux_var ==1),]

Inserting Row in Missing Hourly Data in R using padr package - weird error

I am new to R and I am having some issues with the padr package described here.
I have a hourly data set that is missing hours and I would like to insert a row to input a value for the missing data. I am trying to use the pad function and the fill_by_value function from the padr package but I am getting an error when I use the pad function.
The data called Mendo is presented as:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I want the final data to look like:
Date.Local Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:03:00 999
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
I am under the impression the padr packaged wants a datetime POSIXct column so I use the command
Mendo$Time.Local <- as.POSIXct(paste(Mendo$Date.Local, Mendo$Time.Local), format = '%Y-%m-%d %H:%M')
to get:
Time.Local Sample.Measurement
2016-01-01 00:00:00 3
2016-01-01 00:01:00 4
2016-01-01 00:02:00 1
2016-01-01 00:04:00 4
2016-01-01 00:05:00 5
Now I try to use the pad function like instruction in the link provided above. My line of code is:
Mendo_padded <- Mendo %>% pad
and I get the error:
Error in if (total_invalid == nrow(x)) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (unique(nchar(x_char)) == 10) { :
the condition has length > 1 and only the first element will be used
If this were to work, I would then use the command
Mendo_padded %>% fill_by_value(Sample.Measurement, value = 999)
to get all the missing hours Sample.Measurement value to be 999.
I would love feedback, suggestions or comments on what I may be doing wrong and how I can go about getting this code to work! Thank you!
It seems that pad can automatically detect which column is of Date / POSIXct / POSIXlt type, so you do not need to supply Mendo$Time.Local to pad. The padding will be applied on hour intervals.
library(magrittr)
library(padr)
PM10 <- read.csv(file="../Downloads/hourly_81102_2016.csv",
stringsAsFactors = FALSE) # don't change the columns to factors
Mendo <- PM10[PM10$County.Name == "Mendocino",]
Mendo$Time.Local <-
as.POSIXct(paste(
Mendo$Date.Local, Mendo$Time.Local), format = '%Y-%m-%d %H:%M')
Mendo <- Mendo[,c("Time.Local", "Sample.Measurement")]
# remove Mendo$Time.Local
Mendo_padded <- Mendo %>% na.omit %>%
pad(interval = 'hour',
start_val = NULL, end_val = NULL, group = NULL,
break_above = 1)
You may also consider using the column Time.GMT and Date.GMT because date and time may depend on where you (your computer) are.
Edit: As suggested by OP, na.omit should be used before pad to avoid NA values in the Date column.

Split Date to Day, Month and Year for ffdf Data in R

I'm using R's ff package with ffdf objects named MyData, (dim=c(10819740,16)). I'm trying to split the variable Date into Day, Month and Year and add these 3 variables into ffdf existing data MyData.
For instance: My Date column named SalesReportDate with VirtualVmode and PhysicalVmode = double after I've changed SalesReportDate to as.date(,format="%m/%d/%Y").
Example of SalesReportDate are as follow:
> B
SalesReportDate
1 2013-02-01
2 2013-05-02
3 2013-05-04
4 2013-10-06
5 2013-15-10
6 2013-11-01
7 2013-11-03
8 2013-30-02
9 2013-12-12
10 2014-01-01
I've refer to Split date into different columns for year, month and day and try to apply it but keep getting error warning.
So, is there any way for me to do this? Thanks in advance.
Credit to #jwijffels for this great solution:
require(ffbase)
MyData$SalesReportDateYear <- with(MyData["SalesReportDate"], format(SalesReportDate, "%Y"), by = 250000)
MyData$SalesReportDateMonth <- with(MyData["SalesReportDate"], format(SalesReportDate, "%m"), by = 250000)
MyData$SalesReportDateDay <- with(MyData["SalesReportDate"], format(SalesReportDate, "%d"), by = 250000)

Adding missing rows

The format of my excel data file is:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:04:00 1
I open my file with this:
ts = read.csv(file=pathfile, header=TRUE, sep=",")
How can I add additional rows with zero number in column “value” into the data frame. Output example:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:03:00 0
01-01-2000 00:04:00 1
This is now completely automated in the padr package. Takes only one line of code.
original <- data.frame(
day = as.POSIXct(c("01-01-2000 00:00:00",
"01-01-2000 00:01:00",
"01-01-2000 00:02:00",
"01-01-2000 00:04:00"), format="%m-%d-%Y %H:%M:%S"),
value = c(4, 3, 1, 1))
library(padr)
library(dplyr) # for the pipe operator
original %>% pad %>% fill_by_value(value)
See vignette("padr") or this blog post for its working.
I think this is a more general solution, which relies on creating a sequence of all timestamps, using that as the basis for a new data frame, and then filling in your original values in that df where applicable.
# convert original `day` to POSIX
ts$day <- as.POSIXct(ts$day, format="%m-%d-%Y %H:%M:%S", tz="GMT")
# generate a sequence of all minutes in a day
minAsNumeric <- 946684860 + seq(0,60*60*24,by=60) # all minutes of your first day
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="1970-01-01", tz="GMT") # convert those minutes to POSIX
# build complete dataframe
newdata <- as.data.frame(minAsPOSIX)
newdata$value <- ts$value[pmatch(newdata$minAsPOSIX, ts$day)] # fill in original `value`s where present
newdata$value[is.na(newdata$value)] <- 0 # replace NAs with 0
Try:
ts = read.csv(file=pathfile, header=TRUE, sep=",", stringsAsFactors=F)
ts.tmp = rbind(ts,list("01-01-2000 00:03:00",0))
ts.out = ts.tmp[order(ts.tmp$day),]
Notice that you need to force load the strings in first column as character and not factors otherwise you will have issue with the rbind. To get the day column to be a factor after than just do:
ts.out$day = as.factor(ts.out$day)
Tidyr offers the nice complete function to generate rows for implicitly missing data. I use replace_na to turn NA values to 0 in second step.
ts%>%
tidyr::complete(day=seq.POSIXt(min(day), max(day), by="min"))%>%
dplyr::mutate(value=tidyr::replace_na(value,0))
Notice that I set the granularity of the dates to minutes since your dataset expects a row every minute.

Resources