Subsetting by year gives different results using ymd() vs. year() - r

I am getting different nrows subsetting by year using ymd() and year() in package lubridate and I am trying to figure out what might be causing that disparity.
A 331kb CSV file with 10k dates is here. URLs pointing to Google Drive and Dropbox kept throwing errors that were beyond my newbie skills to figure out.
require(data.table)
require(lubridate)
teaSet <- fread("../teaSet.csv", na.strings=c("NA","N/A", ""))
teaSet$opened <- ymd_hms(teaSet$opened, tz = "")
teaSet$year <- as.factor(teaSet$year)
ymd2010 <- teaSet[opened >= ymd("2010-01-01") & opened <= ymd("2010-12-31"),]
#1480 obs.
year2010 <- teaSet[year(opened)==2010,]
#1483 obs
summary(teaSet$year)
#2010 2011 2012 2013 2014 2015 2016
#1483 1408 1317 1414 1521 1701 1156
Can anyone explain what I am missing? I was subsetting by date range and then by year() and noticed the year() and ymd() counts were different. I created a factor column for years (and cleverly named it "year") to speed things up, since my dataset has 13 million rows, but that is not directly relevant to my question. Seemed like a good idea when I started. I tried different sample sizes and the disparity remains across sizes. Thanks!

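Looking over the problem some more, the cause appears to be that ymd("2010-12-31") is midnight (00:00:00) at the start of the 31st, not the end of that day, so rows time-stamped later on December 31 fail the opened <= ymd("2010-12-31") test.
I see two possible solutions: use the next day as an exclusive upper bound in the filter, or convert all of your date-times to plain dates in GMT.
If you change the upper bound to opened < ymd("2011-1-1"), it will work.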
require(lubridate)
library(data.table)
teaSet <- fread("teaSet.csv", na.strings=c("NA","N/A", ""))
teaSet$opened <- ymd_hms(teaSet$opened, tz = "")
teaSet$year <- as.factor(teaSet$year)
ymd2010 <- teaSet[opened >= ymd("2010-01-01") & opened < ymd("2011-1-1"),]
print(dim(ymd2010))
#a second possible option - not as clean as the prior one
teaSet$opened <- ymd_hms(teaSet$opened, tz = "GMT")
ymd2010_2 <-teaSet[as.Date(opened) >= ymd("2010-01-01") & as.Date(opened) <= ymd("2010-12-31")]
print(dim(ymd2010_2))
year2010 <- teaSet[year(opened)==2010,]
print( dim(year2010 ))
summary(teaSet$year)
I agree the timezone issue is unintuitive, but it is what it is. Nice job testing for and catching the inconsistency in your original solution.
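To make the boundary effect concrete, here is a minimal base R sketch. The four timestamps are made up for illustration (they are not taken from teaSet.csv), and everything is kept in UTC so that time zones do not interfere.

```r
# Hypothetical timestamps, all in UTC (not from the original data set)
opened <- as.POSIXct(c("2010-01-01 00:00:00", "2010-12-31 00:00:00",
                       "2010-12-31 15:30:00", "2011-01-01 00:00:00"),
                     tz = "UTC")

lo        <- as.POSIXct("2010-01-01", tz = "UTC")
hi_closed <- as.POSIXct("2010-12-31", tz = "UTC")  # midnight at the START of Dec 31
hi_open   <- as.POSIXct("2011-01-01", tz = "UTC")  # exclusive upper bound

# Closed upper bound: the 15:30 row on Dec 31 is lost
n_closed <- sum(opened >= lo & opened <= hi_closed)

# Half-open range [2010-01-01, 2011-01-01): every 2010 row is kept
n_half_open <- sum(opened >= lo & opened < hi_open)

# Extracting the year directly agrees with the half-open range
n_by_year <- sum(as.POSIXlt(opened)$year + 1900 == 2010)

print(c(closed = n_closed, half_open = n_half_open, by_year = n_by_year))
```

The half-open form avoids having to decide what "the last instant of December 31" is.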

Related

Is there a way to solve for a date in an ifelse statement?

I am trying to solve for a date so that the parameter I am looking for is true in an ifelse statement, so that the true input is the date. I'm having difficulty wording what I'm trying to do, but basically I have when someone is hired, and when they were born. I need to find the date that how long they've been working + how old they are adds up to 79.
I've gone through a couple different ways of writing the ifelse code, but the error is in getting that placeholder variable to be solved for.
library(lubridate)
hire_date <- ymd("1982-02-02")
birth_date <- ymd("1967-02-02")
retire <- 1
test <- data.frame(hire_date, birth_date, retire)
test$pension4 <- ifelse(test$retire == 1, ifelse(((
as.duration(ymd(x) %--% test$hire_date / dyears(1))) + (as.duration(ymd(x)
%--% test$birth_date / dyears(1)))) == 79, x, NA), NA)
x is what I want to solve for. In the example, if my person was hired on 1982-02-02 and born 1967-02-02, what I want input into pension4 is the date where their combined age and years of service is 79. In the example, that would be 2014-02-02. They started working at 15 (I just made random numbers up so yeah that's not super accurate), so in 2014 they will be 47 and have worked for 32 years, for a combined age and years of service of 79.
If I need to use something other than ifelse, that's fine, I just need to be able to get that date.
Thank you!
test <- data.frame(hire_date = as.Date("1982-02-02"), birth_date = as.Date("1967-02-02"))
transform(test, retire_date = hire_date + (79*365.25 - (hire_date - birth_date))/2)
# hire_date birth_date retire_date
# 1 1982-02-02 1967-02-02 2014-02-02
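The formula above follows from requiring (d - hire_date) + (d - birth_date) = 79 years and solving for d. As a rough cross-check, the same algebra can be done in whole calendar months with base R, which sidesteps the 365.25-day approximation. The helper months_between is a hypothetical name introduced here, and the sketch assumes both dates fall on the same day of the month and that the result is a whole number of months.

```r
hire_date  <- as.Date("1982-02-02")
birth_date <- as.Date("1967-02-02")

# Hypothetical helper: whole calendar months between two dates
# (ignores the day of the month)
months_between <- function(from, to) {
  f <- as.POSIXlt(from)
  t <- as.POSIXlt(to)
  (t$year - f$year) * 12 + (t$mon - f$mon)
}

age_at_hire <- months_between(birth_date, hire_date)  # age in months when hired

# age + service = 79 years  =>  2 * service + age_at_hire = 79 * 12 months
months_to_go <- (79 * 12 - age_at_hire) / 2

# Step forward from the hire date by that many calendar months
retire_date <- seq(hire_date, by = paste(months_to_go, "months"),
                   length.out = 2)[2]
print(retire_date)
```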

Subsetting data by multiple date ranges - R

I'll get straight to the point: I have been given some data sets in .csv format containing regularly logged sensor data from a machine. However, this data set also contains measurements taken when the machine is turned off, which I would like to separate from the data logged from when it is turned on. To subset the relevant data I also have a file containing start and end times of these shutdowns. This file is several hundred rows long.
Examples of the relevant files for this problem:
file: sensor_data.csv
sens_name,time,measurement
sens_A,17/12/11 06:45,32.3321
sens_A,17/12/11 08:01,36.1290
sens_B,17/12/11 05:32,17.1122
sens_B,18/12/11 03:43,12.3189
##################################################
file: shutdowns.csv
shutdown_start,shutdown_end
17/12/11 07:46,17/12/11 08:23
17/12/11 08:23,17/12/11 09:00
17/12/11 09:00,17/12/11 13:30
18/12/11 01:42,18/12/11 07:43
To subset data in R, I have previously used the subset() function with simple conditions which has worked fine, but I don't know how to go about subsetting sensor data which fall outside multiple shutdown date ranges. I've already formatted the date and time data using as.POSIXlt().
I'm suspecting some scripting may be involved to come up with a good solution, but I'm afraid I am not yet experienced enough to handle this type of data.
Any help, advice, or solutions will be greatly appreciated. Let me know if there's anything else needed for a solution.
I prefer POSIXct format for ranges within data frames. We create an index that is TRUE when a measurement falls outside every shutdown window, that is, when t < shutdown_start OR t > shutdown_end holds for all rows of shutdowns. With this index we can then subset the data as necessary:
posixct <- function(x) as.POSIXct(x, format="%d/%m/%y %H:%M")
sensor_data$time <- posixct(sensor_data$time)
shutdowns[] <- lapply(shutdowns, posixct)
ind1 <- sapply(sensor_data$time, function(t) {
sum(t < shutdowns[,1] | t > shutdowns[,2]) == nrow(shutdowns)})
#Measurements taken when not shutdown (outside every shutdown window)
sensor_data[ind1,]
# sens_name time measurement
# 1 sens_A 2011-12-17 06:45:00 32.3321
# 3 sens_B 2011-12-17 05:32:00 17.1122
#Measurements taken during a shutdown
sensor_data[!ind1,]
# sens_name time measurement
# 2 sens_A 2011-12-17 08:01:00 36.1290
# 4 sens_B 2011-12-18 03:43:00 12.3189
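For several hundred shutdown rows, the same membership test can also be written without an explicit per-row function. The following is only a sketch using base R's outer(); it rebuilds the sample data from the question inline, fixes the time zone to UTC as an assumption, and treats the window endpoints as inclusive.

```r
# Sample data copied from the question; tz = "UTC" is an assumption
to_time <- function(x) as.POSIXct(x, format = "%d/%m/%y %H:%M", tz = "UTC")

sensor_data <- data.frame(
  sens_name   = c("sens_A", "sens_A", "sens_B", "sens_B"),
  time        = to_time(c("17/12/11 06:45", "17/12/11 08:01",
                          "17/12/11 05:32", "18/12/11 03:43")),
  measurement = c(32.3321, 36.1290, 17.1122, 12.3189)
)
shutdowns <- data.frame(
  shutdown_start = to_time(c("17/12/11 07:46", "17/12/11 08:23",
                             "17/12/11 09:00", "18/12/11 01:42")),
  shutdown_end   = to_time(c("17/12/11 08:23", "17/12/11 09:00",
                             "17/12/11 13:30", "18/12/11 07:43"))
)

# Compare every measurement time with every window at once:
# rows = measurements, columns = shutdown windows
t_num <- as.numeric(sensor_data$time)
after_start <- outer(t_num, as.numeric(shutdowns$shutdown_start), ">=")
before_end  <- outer(t_num, as.numeric(shutdowns$shutdown_end), "<=")

# TRUE when the measurement falls inside at least one window
in_shutdown <- rowSums(after_start & before_end) > 0

running <- sensor_data[!in_shutdown, ]  # machine switched on
stopped <- sensor_data[in_shutdown, ]   # logged during a shutdown
print(running)
```

The intermediate matrices have one cell per measurement/window pair, so for very large inputs an interval join (for example data.table's foverlaps) may be a better fit.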

R - Aggregate Values Based on a Date Interval and 3 Factor Variables

I have ExpandedGrid 11760 obs of 4 variables:
Date - date format
Device - factor
Creative - factor
Partner - factor
I also have a MediaPlanDF 215 obs of 6 variables:
Interval - an interval of dates I created using lubridate
Partner - factor
Device - factor
Creative - factor
Daily Spend - num
Daily Impressions - num
Here is my trouble.
I need to sum daily spend and daily impressions in respective columns in MediaPlanDF, based on the following 2 criteria:
Criterion 1
- ExpandedGrid$Device matches MediaPlanDF$Device
- ExpandedGrid$Creative matches MediaPlanDF$Creative
- ExpandedGrid$Partner matches MediaPlanDF$Partner
Criterion 2
- ExpandedGrid$Date falls within MediaPlanDF$Interval
Now I can pull this off for each criterion on its own, but I am having the hardest time putting them together without getting errors, and my search for answers hasn't ended in very much success (a lot of great examples but nothing I have the skill to adapt to my context). I've tried a variety of methods but my mind is starting to wander towards overly complicated solutions and I need help.
I've tried indexing like so:
indexb <- as.character(ExpandedGrid$Device) == as.character(MediaPlanDF$Device);
indexc <- as.character(ExpandedGrid$Creative) == as.character(MediaPlanDF$Creative);
indexd <- as.character(ExpandedGrid$Partner) == as.character(MediaPlanDF$Partner);
index <- ExpandedGrid$Date %within% MediaPlanDF$Interval;
KEYDF <- data.frame(index, indexb, indexc, indexd)
KEYDF$Key <- apply(KEYDF, 1, function(x)(all(x) || all(!x)))
KEYDF$Key.cha <- as.character(KEYDF$Key)
outputbydim <- do.call(rbind, lapply(KEYDF$Key.cha, function(x){
index <- x == "TRUE";
list(impressions = sum(MediaPlanDF$Daily.Impressions[index]),
spend = sum(MediaPlanDF$Daily.Spend[index]))}))
Unfortunately, this does not sum the values correctly: some values are excluded from the sums, and the sums for the rows that do evaluate to TRUE are also incorrect.
Here is a data snippet:
ExpandedGrid:
Date Device Creative Partner
2015-08-31 "Desktop" "Standard" "ACCUEN"
MediaPlanDF
Interval Device Creative Partner Daily Spend Daily Impressions
2015-08-30 17:00:00 PDT--2015-10-03 17:00:00 PDT "Desktop" "Standard" "ACCUEN" 1696.27 1000339.17
Does anyone know where to go from here?
Thanks in advance!
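One possible direction, sketched in base R under stated assumptions: the element-wise == comparisons above line the two data frames up row by row (recycling the shorter one), which is not a match on keys. A join on the three factors followed by a date filter handles both criteria together. In this sketch the lubridate Interval is assumed to have been split into two Date columns, Start and End (hypothetical names), and the rows are toy values loosely based on the snippet.

```r
# Toy stand-ins for the real data frames; Start/End are hypothetical
# columns holding the two endpoints of MediaPlanDF$Interval as Dates
ExpandedGrid <- data.frame(
  Date     = as.Date(c("2015-08-31", "2015-09-01", "2015-10-10")),
  Device   = "Desktop",
  Creative = "Standard",
  Partner  = "ACCUEN",
  stringsAsFactors = FALSE
)
MediaPlanDF <- data.frame(
  Start    = as.Date("2015-08-31"),
  End      = as.Date("2015-10-03"),
  Device   = "Desktop",
  Creative = "Standard",
  Partner  = "ACCUEN",
  Daily.Spend       = 1696.27,
  Daily.Impressions = 1000339.17,
  stringsAsFactors = FALSE
)

# Criterion 1: match on Device, Creative and Partner
joined <- merge(ExpandedGrid, MediaPlanDF,
                by = c("Device", "Creative", "Partner"))

# Criterion 2: keep only rows whose Date lies inside the plan's interval
joined <- joined[joined$Date >= joined$Start & joined$Date <= joined$End, ]

# Sum spend and impressions per grid row (date + the three factors)
totals <- aggregate(cbind(Daily.Spend, Daily.Impressions) ~
                      Date + Device + Creative + Partner,
                    data = joined, FUN = sum)
print(totals)
```

Grid rows with no matching plan (such as 2015-10-10 here) drop out of totals; merging totals back onto ExpandedGrid with all.x = TRUE would keep them with NA sums.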

Plot a histogram of subset of a data

[Image: screenshot of the .txt data file]
The data consists of 2,075,259 rows and 9 columns
Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.
Only data from the dates 2007-02-01 and 2007-02-02 is needed.
I was trying to plot a histogram of "Global_active_power" in the above mentioned dates.
Note that in this dataset missing values are coded as "?".
This is the code I used to try to plot the histogram:
{
data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]
my_data <- gsub(";", " ", my_data) # replace ";" with " "
my_data <- gsub("?", "NA", my_data) # convert "?" to "NA"
my_data <- as.numeric(my_data) # turn into numbers
hist(my_data["Global_active_power"])
}
After running the code it is showing this error:
Error in hist.default(my_data["Global_active_power"]) :
invalid number of 'breaks'
Can you please help me spot the mistake in the code?
Link of the data file : https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip
You need to provide the separator (";") explicitly and your types aren't what you think they are, observe:
data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
my_data <- data[data$Date >= bottom.date & data$Date <= top.date,3] # inclusive bounds keep both days; column 3 is Global_active_power
hist(my_data)
This gives the histogram as the plot. Hope that helps.
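The effect of sep and na.strings can be checked on a few lines without downloading the full file. The sketch below feeds read.table a small made-up sample in the same layout (the values are illustrative, not copied from the data set) and computes the histogram without drawing it.

```r
# Made-up sample in the file's layout: ';' separated, '?' for missing values
raw <- "Date;Time;Global_active_power
31/1/2007;23:59:00;1.500
1/2/2007;00:00:00;0.326
1/2/2007;00:01:00;?
2/2/2007;23:59:00;3.680
3/2/2007;00:00:00;3.614"

toy <- read.table(text = raw, header = TRUE, sep = ";",
                  na.strings = "?", stringsAsFactors = FALSE)
toy$Date <- as.Date(toy$Date, format = "%d/%m/%Y")

# With '?' declared as NA, the column is read as numeric straight away
print(class(toy$Global_active_power))

# Inclusive comparison on real Date objects keeps both requested days
keep  <- toy$Date >= as.Date("2007-02-01") & toy$Date <= as.Date("2007-02-02")
power <- toy$Global_active_power[keep]

# hist() needs a numeric vector; NA values are dropped automatically
h <- hist(power, plot = FALSE)
print(sum(h$counts))
```

Passing a plain numeric vector, rather than a data frame subset such as my_data["Global_active_power"], is what hist() expects.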
Given you have 2m rows (though not too many columns), you're firmly in fread territory.
Here's how I would do what you want:
library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
na.strings=c("?","NA"),colClasses="character" #2
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% seq(from=as.Date("2007-02-01"), #3
to=as.Date("2007-02-02"),by="day")]
numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]
data[,hist(Global_active_power)] #5
A brief explanation of what's going on
1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)
2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 ? ? ? ? ? ? NA
Once we set na.strings, we sidestep having to replace ? as NA later:
Date Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00 NA NA NA NA NA NA
On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.
3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.
4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.
5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:
ok_dates<-seq(from=as.Date("2007-02-01"),
to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
select=c("Date","Global_active_power"),
na.strings=c("?","NA"),colClasses="character"
)[,Date:=as.Date(Date,format="%d/%m/%Y")
][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

Select a value from time series by date in R

How do I select the value from a time series that corresponds to a given date?
I create a monthly time series object with command:
producers.price <- ts(producers.price, start=2012+0/12, frequency=12)
Then I try the following:
value <- producers.price[as.Date("01.2015", "%m.%Y")]
But this doesn't do what I want, and value is
[1] NA
Instead of 10396.8212805739 if producers.price is:
producers.price <- structure(c(7481.52109434237, 6393.18959031561, 6416.63065650718,
5672.08354710121, 7606.24186413516, 5201.59247092013, 6488.18361474813,
8376.39182893415, 9199.50916585545, 8261.87133079494, 8293.8195347453,
8233.13630279516, 7883.17272003961, 7537.21001580393, 6566.60260432381,
7119.99345843556, 8086.40101607729, 9125.11104610046, 10134.0228610828,
10834.5732454454, 9410.35031874371, 9559.36933274129, 9952.38679679724,
10390.3628690951, 11134.8432864557, 11652.0075507499, 12626.9616107684,
12140.6698452193, 11336.8315981684, 10526.0309052316, 10632.1492109584,
8341.26367412737, 9338.95688558448, 9732.80173656971, 10724.5525831506,
11272.2273444623, 10396.8212805739, 10626.8428853062, 11701.0802817581,
NA), .Tsp = c(2012, 2015.25, 12), class = "ts")
So, I had/have a similar problem and was looking all over to solve it. My solution is not as great as I'd have wanted it to be, but it works. I tried it out with your data and it seems to give the right result.
Explanation
It turns out that in R, time series data is stored as a plain sequence indexed from 1, not by its time values. E.g., if you have a time series that starts in 1950 and ends in 1960 with one observation per year, the Y at 1950 will be ts[1] and the Y at 1960 will be ts[11].
Based on this logic, you need to subtract the start of the series from the target date, measured in sampling periods (months here), and add 1 to get the position of the value at that point.
This code in R gives you the result you expect.
producers.price[((as.yearmon("2015-01")- as.yearmon("2012-01"))*12)+1]
If you need help with the time calculations, check this answer: Get the difference between dates in terms of weeks, months, quarters, and years.
You will need the zoo and lubridate packages.
Hope it helps :)
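The position arithmetic described above can also be done in base R using start() and frequency(), with no extra packages. This sketch uses a toy monthly series whose values equal their positions, so the result is easy to verify; the variable names are illustrative.

```r
# Toy monthly series starting January 2012; each value equals its position
x <- ts(seq_len(40), start = c(2012, 1), frequency = 12)

target <- c(2015, 1)  # c(year, month) we want to look up

# Periods elapsed since the start of the series, plus 1 for 1-based indexing
idx <- (target[1] - start(x)[1]) * frequency(x) +
       (target[2] - start(x)[2]) + 1

by_index  <- x[idx]
by_window <- window(x, start = target, end = target)[[1]]

print(c(idx = idx, by_index = by_index, by_window = by_window))
```

For the series in the question, the same idx points at the January 2015 observation.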
1) window.ts
The window.ts function is used to subset a "ts" time series by a time window. The window command produces a time series with one data point and the [[1]] makes it a straight numeric value:
window(producers.price, start = 2015 + 0/12, end = 2015 + 0/12)[[1]]
## [1] 10396.82
2) zoo
We can alternatively convert it to zoo and subscript it by a yearmon class variable, then use [[1]] or coredata to convert it to a plain number, or we can use window.zoo much as we did with window.ts:
library(zoo)
as.zoo(producers.price)[as.yearmon("2015-01")][[1]]
## [1] 10396.82
coredata(as.zoo(producers.price)[as.yearmon("2015-01")])
## [1] 10396.82
window(as.zoo(producers.price), 2015 + 0/12 )[[1]]
## [1] 10396.82
coredata(window(as.zoo(producers.price), 2015 + 0/12 ))
## [1] 10396.82
3) xts
The four lines in (2) also work if library(zoo) is replaced with library(xts) and as.zoo is replaced with as.xts.
Looking for a simple command, one line and no library needed?
You might try this.
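as.numeric(window(producers.price, start = c(2015, 1), end = c(2015, 1))) # January 2015 sits at time 2015.0; a window from 2015.1 to 2015.2 would pick up March instead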

Resources