I have something like this within a function:
x <- as.POSIXct((substr((dataframe[z, ])$variable, 1, 8)), tz = "GMT",
format = "%H:%M:%S")
print(x)
if ( (x >= as.POSIXct("06:00:00", tz = "GMT", format = "%H:%M:%S")) &
(x < as.POSIXct("12:00:00", tz = "GMT", format = "%H:%M:%S")) ){
position <- "first"
}
but I get this output:
character(0)
Error in if ((as.numeric(departure) - as.numeric(arrival)) < 0) { : argument is of length zero
how can I fix this so my comparison works and it prints the correct thing?
some examples of the dataframe$variable column:
16:33:00
15:34:00
14:51:00
07:26:00
05:48:00
11:10:00
17:48:00
06:17:00
08:22:00
11:31:00
Welcome to Stack Overflow!
First, the reason you've gotten some down votes is most likely because you haven't given much in your question to go on. For one thing, you haven't shown us what
(dataframe[z, ])$variable
is, which makes it hard for us to formulate a complete answer. You seem to be trying to extract a single value from a dataframe, is that right? If so, I've never seen it done that way; try replacing the above with:
dataframe$variable[z]
My guess is what you're trying to achieve is a comparison of an entire column of the dataframe called "variable", since that's generally more useful...
Having said that, I often come up against issues with time data, and from what I've heard, my experiences are not uncommon. When I'm dealing with just times, as it appears you are here, I prefer the chron::times format over POSIXct. POSIXct is a date-time format, so a date is always included; it also tries to correct for timezone changes and daylight saving changes, which tends to get in my way more than it helps. If your data is in the format you specified in your first as.POSIXct call (%H:%M:%S), you won't even need to specify the format when calling the times function instead.
x <- chron::times( dataframe$variable )
print(x)
position <- ifelse ( x >= chron::times( "06:00:00" ) &
x < chron::times( "12:00:00" ),
"first", "not first"
)
This will output a vector "position", with a result for all values taken from dataframe$variable. Does that achieve what you're hoping for?
From here, if you did want to extract the comparison result for the particular row "z" in dataframe, you can still do that with
position[z]
EDIT to add:
It might be worth checking for missing values in "variable". This should return TRUE:
sum( is.na( dataframe$variable ) ) == 0
Also check for any that aren't correctly formatted. Again, this should return TRUE:
sum( is.na( chron::times( dataframe$variable ) ) ) == 0
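As a side note on the error message itself: if needs a condition of length one, and an index that selects nothing produces a zero-length value. The following is only a sketch of that situation, assuming a hypothetical index z that matches no rows; the vector values and the guard has_one_value are made up for illustration and are not taken from the question's data.

```r
# Hypothetical data standing in for dataframe$variable
values <- c("16:33:00", "15:34:00", "07:26:00")

# Assumption: z ends up selecting nothing (e.g. a failed lookup)
z <- integer(0)
picked <- substr(values[z], 1, 8)
print(picked)  # prints character(0), as in the question

# if() cannot decide on a zero-length condition, so it raises an error;
# tryCatch captures the message instead of stopping the script
msg <- tryCatch(
  if (picked >= "06:00:00") "first" else "not first",
  error = function(e) conditionMessage(e)
)
print(msg)

# A guard that makes the problem visible before the comparison
has_one_value <- length(picked) == 1 && !is.na(picked)
print(has_one_value)
```

If the guard is FALSE for some z, the thing to investigate is how z was computed rather than the comparison itself.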
EDIT to add:
As per the comments, it looks like some values in your "variable" column aren't converting properly. You should be able to find them with
subset( dataframe, is.na( chron::times( variable ) ) )
That should let you see what's wrong. It may be a single cell, or it may be a number of them. You'll need to tidy up that data, which you can do in a few ways. You could go through and fix them manually, you could add a function in your script to repair them before the conversion (this might be a good idea if there is a common issue between all of those values, or if you expect the same issue to happen again as new data comes in, if indeed you need to allow for that).
The other option is simply to exclude those rows from your analysis. If you go this route, make sure it's appropriate to the analysis you're running. If it is appropriate in your case, you can add a step to clean up the dataframe before running the steps in your question:
dataframe <- subset( dataframe, !is.na( chron::times( variable ) ) )
NOTE: there's a good chance this will come up with a warning. If you run the same line twice, and the warning goes away the second time (after the offending rows have been removed), you may need to look further into it.
That should drop the offending values, leaving only values that are properly converting to the times format, which should help with the steps you're trying to run. Check how your dataframe dimensions change before and after that step; that'll tell you how many rows you're dropping.
You could do the same thing with POSIXct if that's what you're comfortable with, I'm just personally more comfortable with times for what you're doing.
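For reference, a rough POSIXct version of the same steps (parse, drop the values that fail to parse, then compare the whole column at once) might look like the sketch below. The sample data frame, the helper parse_time and the deliberately bad values "oops" and NA are assumptions made for illustration.

```r
# Hypothetical stand-in for the question's data frame
dataframe <- data.frame(
  variable = c("16:33:00", "05:48:00", "11:10:00", "06:17:00", "oops", NA),
  stringsAsFactors = FALSE
)

# Parse the times; anything that does not match the format becomes NA
parse_time <- function(s) as.POSIXct(s, tz = "GMT", format = "%H:%M:%S")
x <- parse_time(dataframe$variable)

# Drop the rows that failed to parse
ok <- !is.na(x)
clean <- dataframe[ok, , drop = FALSE]

# Vectorised comparison: one result per remaining row
clean$position <- ifelse(
  x[ok] >= parse_time("06:00:00") & x[ok] < parse_time("12:00:00"),
  "first", "not first"
)
print(clean)
```

Because every value is parsed with the same format and no date, all of them get the current date attached, so only the time of day decides the comparison.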
I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day based on the hour. I executed the for loop below but am getting an error. Can somebody tell me what is wrong with the syntax below and how to correct it?
for(i in data$Requesthours) {
if(data$Requesthours>=0 & data$Requesthours<3) {
data$Partoftheday <- "Midnight"
} else if(data$Requesthours>=3 & data$Requesthours<6) {
data$Partoftheday <- "Early Morning"
} else if(data$Requesthours>=6 & data$Requesthours<12) {
data$Partoftheday <- "Morning"
} else if(data$Requesthours>=12 & data$Requesthours<16) {
data$Partoftheday <- "Afternoon"
} else if(data$Requesthours>=16 & data$Requesthours<20) {
data$Partoftheday <- "Evening"
} else if(data$Requesthours>=20 & data$Requesthours<=23) {
data$Partoftheday <- "Night"
}
}
Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
data$Partoftheday <- as.character(
cut(data$Requesthours,
breaks=c(-1,3,6,12,16,20,24),
labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night')
)
)
# see Notes on cut() at bottom to explain this
Now back to your bug: you're confused about how to iterate over a column in R. for(i in data$Requesthours) iterates over the values in the column, not over row indices. You also make i the loop variable but never refer to i anywhere inside the loop; instead you refer back to data$Requesthours, which is an entire column, not a single value (how do the loop contents know which value you're referring to? They don't). You could use an ugly explicit index loop like for (i in 1:nrow(data)) ... or for (i in seq_along(data$Requesthours)) ... and then access data$Requesthours[i], but please don't. Because...
One of the huge idiomatic things about learning R is that whenever you write a for-loop to iterate over a dataframe or a df column, you should stop to think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, ... are vectorized, as are all the arithmetic and logical operators (plain if is not: it takes a single condition). 'Vectorized' means you can feed them an entire (column) vector as input, and they produce an entire (column) vector as output which you can directly assign to your new column. Functions such as sum, mean, max, diff and sd also take a whole vector in one call, though they return a summary (or, for diff, a shorter vector). Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions apply, sapply, lapply, ... to apply any arbitrary function you want to a list/vector/dataframe/df column.
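To make the *apply route concrete, here is a small sketch that applies a scalar function to every element with sapply. The helper part_of_day and the sample hours are hypothetical; the thresholds simply mirror the if-else ladder from the question.

```r
# Hypothetical helper: classifies ONE hour value at a time,
# so plain if/else is fine inside it
part_of_day <- function(hr) {
  if (hr < 3) {
    "Midnight"
  } else if (hr < 6) {
    "Early Morning"
  } else if (hr < 12) {
    "Morning"
  } else if (hr < 16) {
    "Afternoon"
  } else if (hr < 20) {
    "Evening"
  } else {
    "Night"
  }
}

# Made-up sample of hours
hours <- c(0, 3, 7, 13, 18, 22)

# sapply calls part_of_day once per element and collects the results
labels_by_apply <- sapply(hours, part_of_day)
print(labels_by_apply)
```

This keeps the per-value logic readable, but it still calls the function once per row, so a truly vectorized function such as cut will usually be faster on large columns.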
Notes on cut()
cut(data, breaks, labels, ...) is a function where data is your input vector (e.g. your selected column data$Requesthours), breaks is a vector of integer or numeric, and labels is a vector to name the output. The length of labels is one less than the length of breaks, since 7 breaks divide your data into 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut()
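Since your first if-else comparison is (hr>=0 & hr<3), we have to fiddle the lowest cutoff_hour 0 to -1, otherwise hr==0 would wrongly give NA: by default cut() builds intervals that are open on the left and closed on the right, so the first one is (-1,3]. (There is a parameter include.lowest=TRUE/FALSE that would also keep hr==0, but it only affects the lowest break.) Be aware that the same default puts hr==3 in 'Midnight', hr==6 in 'Early Morning', etc., which differs from your if-else ladder at the boundaries; passing right=FALSE with breaks=c(0,3,6,12,16,20,24) gives [0,3), [3,6), ... and matches the ladder exactly.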
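The boundary behaviour of cut() is easy to check on a few values. The sketch below compares the default right-closed intervals with right = FALSE; the sample hours and the variable names are made up for illustration.

```r
# Example hours, including the boundary values 0, 3, 6, ...
hours <- c(0, 2, 3, 5, 6, 11, 12, 15, 16, 19, 20, 23)
part_labels <- c("Midnight", "Early Morning", "Morning",
                 "Afternoon", "Evening", "Night")

# Default (right = TRUE): intervals (-1,3], (3,6], ...
# so the value 3 lands in the first bin
default_cut <- as.character(
  cut(hours, breaks = c(-1, 3, 6, 12, 16, 20, 24), labels = part_labels)
)

# right = FALSE: intervals [0,3), [3,6), ..., [20,24),
# the same boundaries as the if-else ladder in the question
left_closed <- as.character(
  cut(hours, breaks = c(0, 3, 6, 12, 16, 20, 24), labels = part_labels,
      right = FALSE)
)

# Side-by-side view of the two labelings
print(data.frame(hours, default_cut, left_closed))
```

The two columns agree everywhere except at the break values themselves, which is where the choice of right matters.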
if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq(length(data$Requesthours))) {
if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
data$Partoftheday[i] <- "Midnight"
....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci
I am trying to make a new column in my data.table. I have two columns, one with a start date and one with an end date. The starting date always is 2016-02-28. The end date in some cases is 2014-12-31 and in others it is 2020-12-31 (all in YYYY-MM-DD format).
In the first case it's evident that I should get a negative difference in dates. In the second case it is positive.
I want to use the sapply function with an ifelse statement to determine the difference in dates. Any time, the difference is negative, I want R to replace this with the value 1.
I do this as follows.
sapply(df$end.date, function(x) { ifelse(df$end.date>start_date, as.integer(length(seq(from=start_date, to=as.POSIXct(x,format="%Y-%m-%d"), by ='month')) ), 1) } )
Unfortunately, I get the following error
Error in seq.POSIXt(from = start_date, to = as.POSIXct(df$end.date, :
'from' must be of length 1
How can I make this work?
PS: both start_date and df$end.date are in POSIXct format in a data.table.
ifelse is already vectorised, doubling up sapply and ifelse is redundant.
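Unfortunately ifelse won’t work here because seq cannot compute the month sequence when the end date precedes the start date (as per your comment), and ifelse evaluates its "yes" argument for every element, not only those where the test is TRUE. So we just use if in combination with mapply instead: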
months_between = function (start, end) {
if (end > start)
length(seq(start, end, by = 'month'))
else
1
}
df$new_column = mapply(months_between, df$start.date, df$end.date)
I’m also pretty sure that there’s a better way to write months_between but I’m not versed in the base R date manipulation functions since they are generally quite bad; I recommend using the ‹lubridate› package instead.
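As a quick check of the mapply approach, here is a sketch run on the two cases from the question. It assumes Date columns rather than POSIXct, and the vectors start_dates and end_dates are made up for illustration.

```r
# Same logic as above: count the months in the sequence when the end date
# is later than the start date, otherwise fall back to 1
months_between <- function(start, end) {
  if (end > start) {
    length(seq(start, end, by = "month"))
  } else {
    1
  }
}

# Hypothetical data: one end date before the start, one after
start_dates <- as.Date(c("2016-02-28", "2016-02-28"))
end_dates   <- as.Date(c("2014-12-31", "2020-12-31"))

# mapply walks both vectors in parallel, one pair of dates per call
result <- mapply(months_between, start_dates, end_dates)
print(result)
```

Note that length(seq(...)) counts the starting month as well, so the value is one more than the number of month steps between the two dates.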
I think your approach is overly complicated. If you're going to use sapply, you ought to be able to avoid ifelse since you will be able to focus on one value at a time (this assumes you are running a vector through sapply; it might not hold true if you are running a list through sapply). If you really want to use an apply function, however, you'd be better off using mapply with an if ... else clause.
But the apply function isn't necessary at all. In fact, the ifelse function isn't necessary. You can simplify the process a great deal with:
# Borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404
elapsed_months <- function(end_date, start_date) {
  # as.POSIXlt is vectorized, so whole columns can be passed in directly
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  # calendar months between the two dates (negative if end precedes start)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}
DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
end = as.Date(c("2014-12-31", "2020-12-31")))
DFrame$diff <- elapsed_months(DFrame$end, DFrame$start)
DFrame$diff[DFrame$diff < 0] <- 1
DFrame
All I did was calculate the difference for all of the rows, obtain an index for the negative values, and replace them with 1.
An alternative approach would be to do the indexing up front. This way you aren't calculating the difference in dates for any values you will eventually change. This might have a benefit if you have a few million rows, but I would guess the performance increase would be small.
DFrame$diff2 <- vector("numeric", nrow(DFrame))
end_first <- DFrame$end < DFrame$start
DFrame$diff2[!end_first] <- elapsed_months(DFrame$end[!end_first], DFrame$start[!end_first])
DFrame$diff2[end_first] <- 1
I'm having an issue, which I have found a solution for, but would like to understand what was going on in the original coding.
So I started with a table pulled from an SQL database and wanted information for 1 client, who is covered by 2 client numbers.
Originally I was running this to select those account numbers.
match <- c("C524",'5568')
gtc <- gtc[gtc$AccountNumber == match,]
However this was only returning about half of the desired results, and the results returned vary at different times (this was running as a weekly report), and depending on the PC running it.
Now, I've set up a loop which works fine and extracts all the results, but would really like to know what was going on with the original query.
match <- c("C524",'5568')
for (each in match) {
gtcLoop<- gtc[gtc$AccountNumber == each,]
result<-rbind(result,gtcLoop)
}
Also, long time lurker, first time poster so let me know if I've done anything wrong in this question.
You need to replace == by %in%:
gtc <- data.frame(AccountNumber = sample(c(match, "something"), 10, replace = TRUE))
gtc[gtc$AccountNumber %in% match,]
Just to tag onto Qaswed's answer (+1), you need to understand what is happening when you compute vector comparisons like ==. See:
?`==`
and
?`%in%`
then try something like 1 == c(1,2) and 1 %in% c(1,2).
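The reason you are getting about half the results is that == compares element by element and recycles the shorter vector: row 1 is compared with the first value, row 2 with the second, row 3 with the first again, and so on. A row is kept only if its account number happens to line up with the value at that position, which is also why the result changes whenever the row order changes. As in: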
df <- data.frame(id=c(1:5), acct_cd = letters[1:5])
df[df$acct_cd == c("a","c"),] # this is wrong, for demo only
df[df$acct_cd %in% c("a","c"),] # this is correct
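The recycling can be seen directly on a small vector. This is only an illustrative sketch; the account numbers reuse the two values from the question, but their order in acct is made up.

```r
# Hypothetical column of account numbers, all of which we want to keep
acct   <- c("C524", "5568", "C524", "5568", "5568", "C524")
wanted <- c("C524", "5568")

# == recycles 'wanted' along 'acct':
# positions 1, 3, 5 are compared with "C524", positions 2, 4, 6 with "5568"
by_position <- acct == wanted
print(by_position)

# %in% asks, for each element of acct, whether it occurs anywhere in 'wanted'
by_membership <- acct %in% wanted
print(by_membership)

# Rows that == silently drops although they belong to the client
print(acct[by_membership & !by_position])
```

When the longer length is a multiple of the shorter one, as here, R does not even warn about the recycling, which makes this mistake easy to miss.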
I created a flag called JF which I initialize to FALSE
JF <- F
I want to change the value to TRUE if the (date) value of SomeDate is in a list of dates that I retrieve from SQL Server. I would use == instead of %in% but there are a couple of different variables like SomeDate and any number of dates in JF.Date... and also because I want to master the use of %in%.
sql <- "select distinct Date_Values from SomeTable"
JF.Date <- sqlQuery(db, sql)
if (SomeDate %in% JF.Date) {JF <- T}
I checked the class() of both objects -- SomeDate is character, JF.Date is a data.frame, and JF.Date$Date_Values was originally factor, but I tried changing it to character and it didn't fix this issue. There are unrelated reasons that I'm storing SomeDate as character at this point in my code.
This returns no error but it doesn't change the value of JF to TRUE when it should. What am I doing wrong?
You can reproduce this with any arbitrary date assignment like <- "1970-01-01" to objects of the same type/classes.
This
for (i in 1:nrow(JF.Date)){
if (SomeDate == JF.Date[i]) {JF <- T}
}
does work, but again, I want to know how to use %in%.
@beginneR had a suggestion that didn't work, but which led me to the solution:
JF <- SomeDate %in% JF.Date
That makes total sense, yet didn't change the value of JF from FALSE to TRUE, just like my original attempt.
I realized that's because %in% treats a data frame as a list of columns, so SomeDate is compared against each column as a whole rather than against the values inside it, even when there's only 1 column in this "dataframe". This would be a lot more obvious if there were an appropriate Warning or Error message displayed, but nonetheless this works:
JF <- SomeDate %in% JF.Date$Date_Values
I can't accept my own answer for 2 days, so if anyone comes up with a more useful/comprehensive answer to this question, then I'll consider selecting your answer as a solution over my own. If I'm going to forgo those points you've got to make it really good though! ;)
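A small reproduction of the difference, as a sketch: the data frame below is a made-up stand-in for the sqlQuery result, with the column name Date_Values taken from the query in the question.

```r
# Hypothetical stand-in for the result of sqlQuery(db, sql)
JF.Date <- data.frame(
  Date_Values = c("1970-01-01", "1970-01-02"),
  stringsAsFactors = FALSE
)
SomeDate <- "1970-01-01"

# Against the whole data frame: the lookup is done per column, not per value
against_frame <- SomeDate %in% JF.Date
print(against_frame)

# Against the column vector: the lookup is done per value
against_column <- SomeDate %in% JF.Date$Date_Values
print(against_column)

# If the column arrives as a factor, converting it first keeps the types aligned
as_factor <- factor(JF.Date$Date_Values)
against_converted <- SomeDate %in% as.character(as_factor)
print(against_converted)
```

Since %in% already returns a logical, the result can be assigned to JF directly without wrapping it in if.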
I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=240min"),Result],
ifelse(grepl("hlm",inFile, ignore.case = TRUE),
dt[which(dt[,Time] == "t=60"),Result],
dt[which(dt[,Time] == "t=30"),Result]))
t0Value <- dt[which(dt[,Time] == "t=0"),Result]
return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by.
Thanks for the help.
You didn't provide a reproducible example, but using ifelse to pick one of several objects based on a single condition is typically the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: It produces a result with the attributes and length from test and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
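The length behaviour described above is easy to see in isolation. A minimal sketch with made-up values (the names candidates, via_ifelse and via_if are only for illustration):

```r
# A single condition and a multi-element candidate result
condition  <- TRUE
candidates <- c(10, 20, 30)

# ifelse shapes its result like 'condition' (length 1),
# so only the first candidate survives
via_ifelse <- ifelse(condition, candidates, 0)
print(via_ifelse)

# if ... else returns the chosen object unchanged, whatever its length
via_if <- if (condition) candidates else 0
print(via_if)
```

So when the test is a single TRUE/FALSE, as with grepl on one file name, if ... else keeps every matching value while ifelse silently keeps only the first.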