I am trying to make a new column in my data.table. I have two columns, one with a start date and one with an end date. The starting date always is 2016-02-28. The end date in some cases is 2014-12-31 and in others it is 2020-12-31 (all in YYYY-MM-DD format).
In the first case it's evident that I should get a negative difference in dates. In the second case it is positive.
I want to use the sapply function with an ifelse statement to determine the difference in dates. Whenever the difference is negative, I want R to replace it with the value 1.
I do this as follows.
sapply(df$end.date, function(x) { ifelse(df$end.date>start_date, as.integer(length(seq(from=start_date, to=as.POSIXct(x,format="%Y-%m-%d"), by ='month')) ), 1) } )
Unfortunately, I get the following error
Error in seq.POSIXt(from = start_date, to = as.POSIXct(df$end.date, :
'from' must be of length 1
How can I make this work?
PS: both start_date and df$end.date are in POSIXct format in a data.table.
ifelse is already vectorised, so doubling up sapply and ifelse is redundant.
Unfortunately ifelse won’t work here because we cannot get the month difference for negative dates (as per your comment). So we just use if in combination with mapply instead:
months_between = function (start, end) {
  if (end > start)
    length(seq(start, end, by = 'month'))
  else
    1
}
df$new_column = mapply(months_between, df$start.date, df$end.date)
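As a quick check of the helper, here is a minimal sketch on the two end dates from the question. It assumes Date (rather than POSIXct) values for brevity, and it swaps mapply for an index-based vapply so that each call visibly receives one start and one end value; the names start, end and new_column are only illustrative.

```r
# Helper from the answer above, repeated so the snippet runs on its own.
months_between <- function(start, end) {
  if (end > start) length(seq(start, end, by = "month")) else 1
}

# Toy data mirroring the question (Date class assumed here).
start <- rep(as.Date("2016-02-28"), 2)
end   <- as.Date(c("2014-12-31", "2020-12-31"))

# Index-based loop: start[i] and end[i] are always length-1 Date values,
# so seq() never sees a vector in 'from'.
new_column <- vapply(
  seq_along(end),
  function(i) months_between(start[i], end[i]),
  numeric(1)
)
new_column  # 1 59
```

The first row has an end date before the start date, so it gets the fallback value 1; the second counts the monthly steps from 2016-02-28 up to 2020-12-28.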
I’m also pretty sure that there’s a better way to write months_between but I’m not versed in the base R date manipulation functions since they are generally quite bad; I recommend using the ‹lubridate› package instead.
I think your approach is overly complicated. If you're going to use sapply, you ought to be able to avoid ifelse since you will be able to focus on one value at a time (this assumes you are running a vector through sapply; it might not hold true if you run a list through sapply). If you really want to use an apply function, however, you'd be better off using mapply with an if ... else clause.
But the apply function isn't necessary at all. In fact, the ifelse function isn't necessary. You can simplify the process a great deal with:
# Borrowed code from http://stackoverflow.com/questions/1995933/number-of-months-between-two-dates/1996404
elapsed_months <- function(end_date, start_date) {
  mapply(
    function(end_date, start_date){
      ed <- as.POSIXlt(end_date)
      sd <- as.POSIXlt(start_date)
      12 * (ed$year - sd$year) + (ed$mon - sd$mon)
    },
    end_date,
    start_date
    # SIMPLIFY defaults to TRUE, so a plain numeric vector (not a list) is returned
  )
}
DFrame <- data.frame(start = rep(as.Date("2016-02-28"), 2),
end = as.Date(c("2014-12-31", "2020-12-31")))
DFrame$diff <- elapsed_months(DFrame$end, DFrame$start)
DFrame$diff[DFrame$diff < 0] <- 1
DFrame
All I did was calculate the difference for all of the rows, obtain an index for the negative values, and replace them with 1.
An alternative approach would be to do the indexing up front. This way you aren't calculating the difference in dates for any values you will eventually change. This might have a benefit if you have a few million rows, but I would guess the performance increase would be small.
DFrame$diff2 <- vector("numeric", nrow(DFrame))
end_first <- DFrame$end < DFrame$start
DFrame$diff2[!end_first] <- elapsed_months(DFrame$end[!end_first], DFrame$start[!end_first])
DFrame$diff2[end_first] <- 1
EDIT: I implemented offered solutions so far, and the code looks way cleaner now. This was the key to finally finding my error. It was a logical condition that I didn't check within the while loop. It could happen that the iterator would exceed the number of elements in the vector and thus pass a "NA" to the while condition! Thx
I also changed the solution to use vector assignments to store the results and then recombine after the for loop, as vector indexing seems to be way faster than data.table indexing and value assignment within the loop.
Please let me apologize first for any errors or missing troubleshooting information, as this is my first post. I have already read that this can happen accidentally whenever there is an error in a computation and the value of a condition is not a valid TRUE/FALSE, such as
if (TRUE & sqrt(-1))
It's been days and I am still receiving this error. It really gives me a headache, as the logic behind the code is actually pretty straightforward and I still can't formalize it properly. It goes as follows: for each unique bond ID contained in a vector of size N (looped through with i), compare the static value of its corresponding maturity to the end dates of 7 periods, each with a distinct set of rules (looped through with k), to determine which periods the respective issue falls into. Then loop through the size thresholds of those periods (looped through with l) to find out whether a particular issue has violated these minimum size requirements. If a violation is found, I can assign the date of the violation. If (l == k), I can conclude that the issue has passed the size requirement checks for all periods that its maturity falls into and as such hasn't violated any rules. I then store the result of the conditional checks as binary values in a new data.table column, together with the violation date. So far, I really can't determine what is causing this error.
My data looks as follows. I have a pretty large data.table containing bond issue identifiers and various other column variables that describe those issues. It was initially imported with the read_dta() function and then converted to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I encode the eligibility criteria of an index as follows.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I write a loop that performs conditional checks to find, for each issue ID stored in issue_IDs.vec, whether it violates the index eligibility criterion of the minimum issuance size during its maturity. I do this by passing the value of the iterator variable i as a position to issue_IDs.vec.
# (3) Looping through a set of conditional check to find out if and if so when a particular issue violated the size requirement ---
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of issues.dt extracted as vector
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
  id <- issue_IDs.vec[i]
  maturity <- maturity.vec[i]
  offering_atm <- offerings_atm.vec[i]
  k <- 1
  maturity_comp <- size_req_val_per.vec[k]
  while (maturity >= maturity_comp) {
    if (k < 7) {
      k <- k + 1
      maturity_comp <- size_req_val_per.vec[k]
    } else {
      break
    }
  }
  l <- 1
  offering_size_comp <- size_req_thresh.vec[l]
  for (l in 1:k) {
    if (offering_atm >= offering_size_comp) {
      offering_size_comp <- size_req_thresh.vec[l]
      next
    } else {}
  }
  if (l == k) {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 0,
                   SIZE_REQ_VIOLATION_DATE = NA)]
  } else {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 1,
                   SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
  }
  i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (l in 1:7) {
  print(maturity >= maturity_comp)
  k <- k + 1
  maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smoothly and always prints TRUE or FALSE, depending on which ID I initially use to create the corresponding static maturity of the particular bond issue. At this stage, I have already exhausted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl<-as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
[1] "Date"
##Not your preferred format, but it is a Date variable##
Exmpl
"2020-08-25"
##Formatting changes it to a character
Exmpl2<-format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
[1] "character"
When you call them in the while() condition, R is trying to make a comparison to decide if the condition (i.e., maturity is greater than or equal to maturity_comp) is TRUE or FALSE (logical values). Because you have character variables, R cannot make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them in the YYYY-mm-dd format.
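To make both points concrete (the character conversion discussed here and the NA condition mentioned in the question's EDIT), here is a minimal sketch. The period end dates are copied from the question; the three maturities are made-up values for illustration, and using findInterval() in place of the while loop is my own suggestion, not something from the original code.

```r
# Period end dates from the question, kept as Date (never passed through format()).
period_ends <- as.Date(c("1992-01-01", "1994-01-01", "1999-07-01", "2003-10-01",
                         "2004-07-01", "2017-02-01", "2021-02-01"))

# format() returns character, which is what the answer above warns about.
class(format(period_ends[1], "%Y-%m-%d"))  # "character"

# Indexing past the end of the vector yields NA, and an NA condition
# makes if()/while() stop with an error.
is.na(period_ends[8])                      # TRUE

# Hypothetical maturities, for illustration only.
maturity <- as.Date(c("1990-06-30", "1995-03-15", "2030-01-01"))

# findInterval() counts how many period ends are <= each maturity; adding 1
# gives the first period whose end lies after the maturity, capped at 7.
k <- pmin(findInterval(as.numeric(maturity), as.numeric(period_ends)) + 1L,
          length(period_ends))
k  # 1 3 7
```

This gives the same k that the while loop arrives at for each issue, but for all issues at once and without ever indexing position 8.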
I'm trying to create a function to solve this puzzle:
An Arithmetic Progression is defined as one in which there is a constant difference between the consecutive terms of a given series of numbers. You are provided with consecutive elements of an Arithmetic Progression. There is however one hitch: exactly one term from the original series is missing from the set of numbers which have been given to you. The rest of the given series is the same as the original AP. Find the missing term.
You have to write the function findMissing(list), list will always be at least 3 numbers. The missing term will never be the first or last one.
The next section of code shows my attempt at this function. The site I'm on runs tests against the function, all of which passed, in that they output the correct missing integer.
The problem I'm facing is that it gives me a timeout error because it takes too long to run all the tests. There are 102 tests and it says it takes over 12 seconds to complete them. Taking more than 12 seconds means the function isn't efficient enough.
After running my own timing tests in RStudio, it seems the function takes considerably less than 12 seconds to run, but regardless I need to make it more efficient to be able to complete the puzzle.
I asked on the site forum and someone said "Sorting is expensive, think of another way of doing it without it." I took this to mean I shouldn't be using the sort() function. Is this what they mean?
I've since found a few different ways of getting my_diff, which is currently calculated using the sort() function. All of these ways are even less efficient than the original.
Can anyone give me a more efficient way of finding my_diff without the sort, or maybe make other parts of the code more efficient? The sort() call is apparently the inefficient part of the code, though.
find_missing <- function(sequence){
  len <- length(sequence)
  if(len > 3){
    my_diff <- as.integer(names(sort(table(diff(sequence)), decreasing = TRUE))[1])
    complete_seq <- seq(sequence[1], sequence[len], my_diff)
  }else{
    differences <- diff(sequence)
    complete_seq_1 <- seq(sequence[1],sequence[len],differences[1])
    complete_seq_2 <- seq(sequence[1],sequence[len],differences[2])
    if(length(complete_seq_1) == 4){
      complete_seq <- complete_seq_1
    }else{
      complete_seq <- complete_seq_2
    }
  }
  complete_seq[!complete_seq %in% sequence]
}
Here are a couple of sample sequences to check the code works:
find_missing(c(1,3,5,9,11))
find_missing(c(1,5,7))
Here are some of the other things I tried instead of sort:
1:
library(pracma)
Mode(diff(sequence))
2:
library(dplyr)
(data.frame(diff_1 = diff(sequence)) %>%
group_by(diff_1) %>%
summarise(count = n()) %>%
ungroup() %>%
filter(count==max(count)))[1]
3:
MaxTable <- function(sequence, mult = FALSE) {
  differences <- diff(sequence)
  if (!is.factor(differences)) differences <- factor(differences)
  A <- tabulate(differences)
  if (isTRUE(mult)) {
    as.integer(levels(differences)[A == max(A)])
  }
  else as.integer(levels(differences)[which.max(A)])
}
Here is one way to do this using seq. We can create a sequence from minimum value in sequence to maximum value in the sequence having length as length(x) + 1 as there is exactly one term missing in the sequence.
find_missing <- function(x) {
setdiff(seq(min(x), max(x), length.out = length(x) + 1), x)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3
This approach takes the diff() of the vector - there will always be one difference higher than the others.
find_missing <- function(x) {
diffs <- diff(x)
x[which.max(diffs)] + min(diffs)
}
find_missing(c(1,3,5,9,11))
[1] 7
find_missing(c(1,5,7))
[1] 3
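One caveat: the version above relies on the sequence being increasing. For a decreasing progression the regular step is the larger (less negative) difference, so which.max() and min() swap roles and the function can pick the wrong position. A minimal sketch of a sign-agnostic variant, assuming exactly one interior term is missing as the puzzle guarantees (find_missing_signed is a hypothetical name):

```r
# Variant of the diff() idea that does not assume an increasing sequence.
find_missing_signed <- function(x) {
  d <- diff(x)
  gap <- which.max(abs(d))         # position of the doubled step
  x[gap] + d[which.min(abs(d))]    # add one regular (signed) step
}

find_missing_signed(c(1, 3, 5, 9, 11))   # 7
find_missing_signed(c(11, 9, 7, 3, 1))   # 5
find_missing_signed(c(1, 5, 7))          # 3
```

Like the original, it only calls diff() once and never sorts or tabulates.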
There is actually a simple formula for this, which will work even if your vector is not sorted...
find_missing <- function(x) {
(length(x) + 1) * (min(x) + max(x))/2 - sum(x)
}
find_missing(c(1,5,7))
[1] 3
find_missing(c(1,3,5,9,11,13,15))
[1] 7
find_missing(c(2,8,6))
[1] 4
It is based on the fact that the sum of the full series should be the average value times the length.
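The identity behind the formula can be checked numerically. The sketch below builds a full progression (the start value 4, step 3 and length 10 are arbitrary choices of mine), drops one interior term and recovers it:

```r
# Sum-based formula from the answer above.
find_missing <- function(x) {
  (length(x) + 1) * (min(x) + max(x)) / 2 - sum(x)
}

# Build a full progression, drop one interior term, and recover it.
full     <- seq(from = 4, by = 3, length.out = 10)   # 4, 7, ..., 31
dropped  <- full[6]                                   # 19
observed <- full[-6]

# Sum of the full series = number of terms * average of first and last term.
sum(full) == length(full) * (min(full) + max(full)) / 2   # TRUE

find_missing(observed)        # 19
find_missing(rev(observed))   # 19, the order of the input does not matter
```

Because only min(), max() and sum() are involved, the work is a single pass over the vector with no sorting.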
I am trying to execute a code that takes way too much time (>6 days). Maybe there is a way of making it more efficient. Any ideas?
library(haven)
library(plyr)
AFILIAD1 <- read_sav("XXXX")
#this sav has around 6 million rows.
AFILIAD1$F_ALTA<- as.character(AFILIAD1$F_ALTA)
AFILIAD1$F_BAJA<- as.character(AFILIAD1$F_BAJA)
AFILIAD1$F_ALTA <- as.Date(AFILIAD1$F_ALTA, "%Y%m%d")
AFILIAD1$F_BAJA <- as.Date(AFILIAD1$F_BAJA, "%Y%m%d")
#starting and ending date
meses <- seq(as.Date("1900-01-01"), as.Date("2014-12-31"), by = "month")
#this is the function that needs to be more efficient
ocupados <- function(pruebas){
  previo <- c()
  total <- c()
  for( i in 1:length(meses)){
    for( j in 1:nrow(pruebas)){
      ifelse(pruebas$F_ALTA[j] <= meses[i] & pruebas$F_BAJA[j] >=
               meses[i], previo[j]<- pruebas$IPF[j],previo[j]<- NA)
    }
    total[i] <- (length(unique(previo))-1)
  }
  names(total)<-meses
  return(total)
}
#this takes >6 days to execute
afiliado1 <- ocupados(AFILIAD1)
There is a lot you can do to speed this up. Here's one example:
library(tidyverse) # adds pipes
ocupados <- function(pruebas) {
  total <- map_int(meses, function(x) {
    with(pruebas, {
      IPF[F_ALTA <= x & F_BAJA >= x] %>%
        n_distinct() # I'm assuming you subtract 1 to remove the NA effect - no longer needed
    })
  })
  names(total) <- meses
  return(total)
}
There are two big speed ups here. First, the inner loop is implemented in compiled code (so you don't see it here), which will be huge savings for you.
Second, we never define empty vectors. Those empty vectors have to be copied EVERY time you increase the length - which is very expensive. Instead, all I'm saving is the final result. The apply family of functions behave like loops, but implement the code in a function.
If you're not familiar with the pipe operator (%>%), all it does is call the next function with the result from the previous function as the first argument. So
length(unique(x))
is the same as
x %>%
unique() %>%
length()
The advantage is readability - it's easier to see that I apply unique, then length using the pipe.
One more comment - without a reproducible example, I cannot test this code. If you have trouble, you need to include a small reproducible data set so we can actually test what the code is doing.
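In the meantime, here is a minimal sketch of the same idea on a made-up four-row data set, so there is at least something to run. The values, the four-month range and the name count_active are hypothetical, and it uses base R's vapply() instead of map_int() so that it needs no packages.

```r
# Made-up data: IPF is a person ID, F_ALTA/F_BAJA are start and end dates.
pruebas <- data.frame(
  IPF    = c(1, 1, 2, 3),
  F_ALTA = as.Date(c("2014-01-15", "2014-03-10", "2014-02-01", "2014-02-20")),
  F_BAJA = as.Date(c("2014-02-15", "2014-04-30", "2014-03-01", "2014-02-25"))
)
meses <- seq(as.Date("2014-01-01"), as.Date("2014-04-01"), by = "month")

# For each month start, count the distinct IDs whose spell covers that date.
count_active <- function(pruebas, meses) {
  total <- vapply(seq_along(meses), function(i) {
    active <- pruebas$F_ALTA <= meses[i] & pruebas$F_BAJA >= meses[i]
    length(unique(pruebas$IPF[active]))
  }, integer(1))
  names(total) <- as.character(meses)
  total
}

total <- count_active(pruebas, meses)
total  # 0 2 1 1
```

The comparison inside the function is vectorised over all rows, and the result vector is allocated once by vapply(), which are the two speed-ups described above.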
I am working on a data frame and have extracted one of the columns, which holds hour data from 0 to 23. I am adding one more column for the part of the day based on the hour. I executed the for loop below but got an error. Can somebody tell me what is wrong with the syntax below and how to correct it?
for(i in data$Requesthours) {
  if(data$Requesthours>=0 & data$Requesthours<3) {
    data$Partoftheday <- "Midnight"
  } else if(data$Requesthours>=3 & data$Requesthours<6) {
    data$Partoftheday <- "Early Morning"
  } else if(data$Requesthours>=6 & data$Requesthours<12) {
    data$Partoftheday <- "Morning"
  } else if(data$Requesthours>=12 & data$Requesthours<16) {
    data$Partoftheday <- "Afternoon"
  } else if(data$Requesthours>=16 & data$Requesthours<20) {
    data$Partoftheday <- "Evening"
  } else if(data$Requesthours>=20 & data$Requesthours<=23) {
    data$Partoftheday <- "Night"
  }
}
Still waiting for you to post your bug, but here's an R coding tip which will reduce this to a one-liner (and bypass your bug). Also it'll be way faster (it's vectorized, unlike your for-loop and if-else-ladder).
data$Partoftheday <- as.character(
  cut(data$Requesthours,
      breaks=c(-1,3,6,12,16,20,24),
      labels=c('Midnight', 'Early Morning', 'Morning', 'Afternoon', 'Evening', 'Night')
  )
)
# see Notes on cut() at bottom to explain this
Now back to your bug: you're confused about how to iterate over a column in R. for(i in data$Requesthours) is trying to iterate over your df, but you're confusing indices with data values. Also, you try to make i an iterator, but then you don't refer to the value i anywhere inside the loop; you refer back to data$Requesthours, which is an entire column, not a single value (how do the loop contents know which value you're referring to? They don't). You could use an ugly explicit index-loop like for (i in 1:nrow(data)) ... or for (i in seq_along(data$Requesthours)) ... and then access data[i,]$Requesthours, but please don't. Because...
One of the huge idiomatic things about learning R is that, generally, when you write a for-loop to iterate over a dataframe or a df column, you should stop to think (or research) whether there is a vectorized function in R that does what you want. cut, ifelse, sum, mean, max, diff, sd, ... all take whole vectors, as do all the arithmetic and logical operators. 'Vectorized' means you can feed them an entire (column) vector as input; the element-wise ones (cut, ifelse, the operators) produce an entire (column) vector as output, which you can directly assign to your new column. Very simple, very fast, very powerful. Generally beats the pants off for-loops. Please read R-intro.html, esp. Section 2 about vector assignment.
And if you can't find or write a vectorized fn, there's also the *apply family of functions apply, sapply, lapply, ... to apply any arbitrary function you want to a list/vector/dataframe/df column.
Notes on cut()
cut(data, breaks, labels, ...) is a function where data is your input vector (e.g. your selected column data$Requesthours), breaks is a vector of integer or numeric, and labels is a vector to name the output. The length of labels is one less than the length of breaks, since 7 break points divide your data into 6 ranges.
We want the output vector to be string, not categorical, hence we apply as.character() to the output from cut()
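Since your first if-else comparison is (hr>=0 & hr<3), each range has to include its lower bound and exclude its upper bound. Lowering the first break from 0 to -1 keeps hr==0 from becoming NA, but cut() builds right-closed intervals by default (right=TRUE), so hr==3 still lands in 'Midnight', hr==6 in 'Early Morning', etc. To reproduce your comparisons exactly, use breaks=c(0,3,6,12,16,20,24) together with right=FALSE. (include.lowest is not needed in that case, because no hour equals the top break 24.)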
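The boundary behaviour is easy to verify on a few hours around each break. This is only a sketch: the hours vector and the variable names are made up, and the second call is a left-closed variant that I believe matches the original if-else ladder.

```r
hours     <- c(0, 2, 3, 5, 6, 11, 12, 15, 16, 19, 20, 23)
day_parts <- c("Midnight", "Early Morning", "Morning",
               "Afternoon", "Evening", "Night")

# Default right = TRUE: intervals are (-1,3], (3,6], ... so hour 3 is "Midnight".
right_closed <- as.character(
  cut(hours, breaks = c(-1, 3, 6, 12, 16, 20, 24), labels = day_parts)
)

# right = FALSE: intervals are [0,3), [3,6), ... like hr >= 0 & hr < 3.
left_closed <- as.character(
  cut(hours, breaks = c(0, 3, 6, 12, 16, 20, 24), labels = day_parts,
      right = FALSE)
)

data.frame(hours, right_closed, left_closed)
```

The two columns differ exactly at the break points 3, 6, 12, 16 and 20.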
if(data$Requesthours>=0 & data$Requesthours<3) (and other similar ifs) make no sense since data$Requesthours is a vector. You should try either of the following:
Solution 1:
for(i in seq(length(data$Requesthours))) {
  if(data$Requesthours[i]>=0 & data$Requesthours[i]<3)
    data$Partoftheday[i] <- "Midnight"
  ....
}
This solution is slow like hell and really ugly, but it would work.
Solution 2:
data$Partoftheday[data$Requesthours>=0 & data$Requesthours<3] <- "Midnight"
...
Solution 3 = what was proposed by smci
I have something like this within a function:
x <- as.POSIXct((substr((dataframe[z, ])$variable, 1, 8)), tz = "GMT",
format = "%H:%M:%S")
print(x)
if ( (x >= as.POSIXct("06:00:00", tz = "GMT", format = "%H:%M:%S")) &
(x < as.POSIXct("12:00:00", tz = "GMT", format = "%H:%M:%S")) ){
position <- "first"
}
but I get this output:
character(0)
Error in if ((as.numeric(departure) - as.numeric(arrival)) < 0) { : argument is of length zero
how can I fix this so my comparison works and it prints the correct thing?
some examples of the dataframe$variable column:
16:33:00
15:34:00
14:51:00
07:26:00
05:48:00
11:10:00
17:48:00
06:17:00
08:22:00
11:31:00
Welcome to Stack Overflow!
First, the reason you've gotten some down votes is most likely because you haven't given much in your question to go on. For one thing, you haven't shown us what
(dataframe[z, ])$variable
is, which makes it hard for us to formulate a complete answer. You seem to be trying to extract a single value from a dataframe, is that right? If so, I've never seen it done that way; try replacing the above with:
dataframe$variable[z]
My guess is what you're trying to achieve is a comparison of an entire column of the dataframe called "variable", since that's generally more useful...
Having said that, I often come up against issues with time data, and from what I've heard, my experiences are not uncommon. When I'm dealing with just times, as it appears you are here, I prefer the chron::times format over POSIXct (POSIX is a date-time format, so a date is always included, it also tries to correct for timezone changes, as well as daylight savings changes, which tends to get in my way more than help). If you've got your data in the format you've specified in your first as.POSIXct call, you won't even need to specify that in calling the times function instead.
x <- chron::times( dataframe$variable )
print(x)
position <- ifelse ( x >= chron::times( "06:00:00" ) &
x < chron::times( "12:00:00" ),
"first", "not first"
)
This will output a vector "position", with a result for all values taken from dataframe$variable. Does that achieve what you're hoping for?
From here, if you did want to extract the comparison result for the particular row "z" in dataframe, you can still do that with
position[z]
EDIT to add:
It might be worth checking for missing values in "variable". This should return TRUE:
sum( is.na( dataframe$variable ) ) == 0
Also check for any that aren't correctly formatted. Again, this should return TRUE:
sum( is.na( chron::times( dataframe$variable ) ) ) == 0
EDIT to add:
As per the comments, it looks like some values in your "variable" column aren't converting properly. You should be able to find them with
subset( dataframe, is.na( chron::times( variable ) ) )
That should let you see what's wrong. It may be a single cell, or it may be a number of them. You'll need to tidy up that data, which you can do in a few ways. You could go through and fix them manually, or you could add a function in your script to repair them before the conversion (this might be a good idea if there is a common issue between all of those values, or if you expect the same issue to happen again as new data comes in, if indeed you need to allow for that).
The other option is simply to exclude those rows from your analysis. If you go this route, make sure it's appropriate to the analysis you're running. If it is appropriate in your case, you can add a step to clean up the dataframe before running the steps in your question:
dataframe <- subset( dataframe, !is.na( chron::times( variable ) ) )
NOTE: there's a good chance this will come up with a warning. If you run the same line twice, and the warning goes away the second time (after the offending rows have been removed), you may need to look further into it.
That should drop the offending values, leaving only values that are properly converting to the times format, which should help with the steps you're trying to run. Check how your dataframe dimensions change before and after that step; that'll tell you how many rows you're dropping.
You could do the same thing with POSIXct if that's what you're comfortable with, I'm just personally more comfortable with times for what you're doing.
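For completeness, a minimal sketch of the POSIXct route. It uses a few of the sample values from the question; the to_time helper and the fixed "1970-01-01" date prefix are my own assumptions, added so that every value is parsed onto the same day regardless of when the code runs.

```r
# A few sample values from the question.
times_chr <- c("16:33:00", "07:26:00", "05:48:00", "11:10:00", "06:17:00")

# Hypothetical helper: parse "HH:MM:SS" onto one fixed day in GMT.
to_time <- function(hms) {
  as.POSIXct(paste("1970-01-01", hms), tz = "GMT",
             format = "%Y-%m-%d %H:%M:%S")
}

x <- to_time(times_chr)

# Vectorised comparison: one result per row, no if() on a whole vector.
position <- ifelse(x >= to_time("06:00:00") & x < to_time("12:00:00"),
                   "first", "not first")
position
# "not first" "first" "not first" "first" "first"
```

As with the chron version, position[z] then gives the result for a single row z.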