R: What's wrong with my use of %in%? - r

I created a flag called JF, which I initialize to FALSE:
JF <- FALSE
I want to change the value to TRUE if the (date) value of SomeDate is in a list of dates that I retrieve from SQL Server. I would use == instead of %in% but there are a couple of different variables like SomeDate and any number of dates in JF.Date... and also because I want to master the use of %in%.
sql <- "select distinct Date_Values from SomeTable"
JF.Date <- sqlQuery(db, sql)
if (SomeDate %in% JF.Date) {JF <- T}
I checked the class() of both objects -- SomeDate is character, JF.Date is a data.frame, and JF.Date$Date_Values was originally factor, but I tried changing it to character and it didn't fix this issue. There are unrelated reasons that I'm storing SomeDate as character at this point in my code.
This returns no error but it doesn't change the value of JF to TRUE when it should. What am I doing wrong?
You can reproduce this with any arbitrary date assignment like <- "1970-01-01" to objects of the same type/classes.
This
for (i in 1:nrow(JF.Date)) {
  if (SomeDate == JF.Date$Date_Values[i]) {JF <- T}
}
does work, but again, I want to know how to use %in%.

@beginneR had a suggestion that didn't work, but which led me to the solution:
JF <- SomeDate %in% JF.Date
That makes total sense, yet didn't change the value of JF from FALSE to TRUE, just like my original attempt.
I realized that's because -- while there's only 1 column in this "dataframe" -- there could be multiple columns and R doesn't know which one to look at. This would be a lot more obvious if there were an appropriate Warning or Error message displayed, but nonetheless this works:
JF <- SomeDate %in% JF.Date$Date_Values
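To see why the column has to be named, here is a minimal sketch with made-up dates. The data frame below is only a stand-in for the sqlQuery() result, not the real table. When the table is a data frame, %in% treats it as a list of columns and compares SomeDate against each whole column rather than against the values inside it:

```r
# Stand-in for the result of sqlQuery(); the dates are invented for illustration.
JF.Date <- data.frame(Date_Values = c("1970-01-01", "1970-01-02"),
                      stringsAsFactors = FALSE)
SomeDate <- "1970-01-01"

# A data frame is a list of columns, so this asks whether SomeDate equals
# an entire column, which it does not.
in_frame <- SomeDate %in% JF.Date

# Naming the column compares SomeDate against the individual values.
in_column <- SomeDate %in% JF.Date$Date_Values

print(in_frame)   # FALSE
print(in_column)  # TRUE
```

Because %in% already returns a single TRUE or FALSE here, its result can be assigned to JF directly, with no if statement needed.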
I can't accept my own answer for 2 days, so if anyone comes up with a more useful/comprehensive answer to this question, then I'll consider selecting your answer as a solution over my own. If I'm going to forgo those points you've got to make it really good though! ;)

Related

r - after filtering all my values become na

Please help me!
I have quite big data set containing bank accounts
It is organised in a following way:
V1 - register number of a bank
V2 - date of account value record
V3 - account number
all remaining V-s are for values themselves (in cur, metals, etc)
I need to filter by account number, keeping every column in the table but only the rows for specific account numbers. Here is the code I use:
filelist = list.files(pattern = ".txt")
datalist = lapply(filelist, function(x)read.table(x, header=FALSE, sep = ";"))
all_data = do.call("rbind", datalist)
r_d <- rename(all_data, c("V1"="Number", "V2"="Dates", "V3"="Account"))
r_d$Account <- as.character(r_d$Account)
f_d <- filter(all_data, r_d$Account >= 42301 & r_d$Account <= 42315 |
r_d$Account >= 20202 & r_d$Account <= 20210 |
r_d$Account == 98010 | r_d$Account == 98015)
The problem is that the output of this code is a table containing only NAs. Everything becomes NA, even though those account numbers exist, and I am absolutely sure of that.
If I use Account in filter instead of r_d$Account, R tells me that object Account does not exist, which I also do not understand.
Please, correct me.
There are several things wrong with your code. The reason you are getting NAs is that you are passing NULLs all over the place. Did you ever look at r_d$Account? When you see problems in your code, you should start by going through things piecemeal, step by step, and in this case you'll see that r_d$Account gives you NULL. Why? Because you did not rename the columns correctly. colnames(r_d) will be revealing.
First, rename either does non-standard evaluation with un-quoted arguments, or rename_ takes a vector of character=character pairs. These might work (I can't know for certain, since I'm not about to transcribe your image of data ... please provide copyable output from dput next time!):
# non-standard evaluation
rename(all_data, Number=V1, Dates=V2, Account=V3)
# standard-evaluation #1:
rename_(all_data, Number="V1", Dates="V2", Account="V3")
# standard-evaluation #2
rename_(all_data, .dots = c("Number"="V1", "Dates"="V2", "Account"="V3"))
From there, if you step through your code, you should see that r_d$Account is no longer NULL.
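If you want to check the renaming step without dplyr, here is a minimal base R sketch. The small data frame is hypothetical and only mimics the V1, V2, ... names that read.table(header = FALSE) produces. Assigning to names() is a swapped-in alternative to rename(), not what the question used:

```r
# Hypothetical stand-in for all_data; the values are invented.
all_data <- data.frame(V1 = c(1, 2),
                       V2 = c("2015-01-01", "2015-02-01"),
                       V3 = c(42301, 98010),
                       V4 = c(10.5, 20.25))

# A column that does not exist comes back as NULL, without any error.
print(is.null(all_data$Account))  # TRUE

# Base R alternative to rename(): assign the first three names directly.
r_d <- all_data
names(r_d)[1:3] <- c("Number", "Dates", "Account")

print(colnames(r_d))
print(is.null(r_d$Account))  # FALSE
```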
Second, is there a reason you create r_d but still reference all_data? There are definitely times when you need to do this kind of thing; this is not one of them, as it is too prone to problems (e.g., if the row order or dimensions of one of them changes).
Third, because you convert $Account to character, it is really inappropriate to use inequality comparisons. Though it is certainly legal to do so ("1" < "2" is TRUE), it will run into problems, such as "11" < "2" is also TRUE, and "3" < "22" is FALSE. I'm not saying that you should avoid conversion to string; I think it is appropriate. Your use of account ranges is perplexing: an account number should be categorical, not ordinal, so selecting a range of account numbers is illogical.
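The string comparisons mentioned above can be checked directly. This short sketch only restates those examples and shows that converting to numbers restores the expected ordering:

```r
# Character comparison goes character by character, not by numeric value.
print("11" < "2")   # TRUE
print("3" < "22")   # FALSE

# Converting to numbers restores the expected ordering.
print(as.numeric("11") < as.numeric("2"))   # FALSE
print(as.numeric("3") < as.numeric("22"))   # TRUE
```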
Fourth, even assuming that account numbers should be ordinal and ranges make sense, your use of filter can be improved, but only if you either (a) accept that comparing stringified numbers is acceptable ("3" > "22"), or (b) keep them as integers. First, you should not be referencing r_d$ within an NSE dplyr function. (Edit: you also need to group your logic with parentheses.) This is a literal translation of your code:
f_d <- filter(r_d, (Account >= 42301 & Account <= 42315) |
                (Account >= 20202 & Account <= 20210) |
                Account == 98010 | Account == 98015)
You can make this perhaps more readable with:
f_d <- filter(r_d,
Account %in% c(98010, 98015) |
between(Account, 42301, 42315) |
between(Account, 20202, 20210)
)
Perhaps a better way to do it, assuming $Account is character, would be to determine which accounts are appropriate based on some other criteria (open date, order date, something else from a different column), and once you have a vector of account numbers, do
filter(r_d,
Account %in% vector_of_interesting_account_numbers)
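As a rough sketch of that last approach, assuming Account is stored as character: the data below is invented, and the vector is built from the ranges in the question purely for illustration. Base R subsetting is used in place of filter() so the example has no package dependency:

```r
# Hypothetical data; Account is stored as character, as in the question.
r_d <- data.frame(Account = c("42301", "42316", "20205", "98015", "11111"),
                  Value = c(1, 2, 3, 4, 5),
                  stringsAsFactors = FALSE)

# Spell out the wanted accounts once, then convert to character so the
# type matches the column being searched.
wanted <- as.character(c(42301:42315, 20202:20210, 98010, 98015))

# Base R equivalent of filter(r_d, Account %in% wanted).
f_d <- r_d[r_d$Account %in% wanted, ]
print(f_d)
```

Matching exact values with %in% sidesteps the string ordering problem entirely, since no inequality is evaluated on the character column.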

trying to compare POSIXct objects in if statements

I have something like this within a function:
x <- as.POSIXct((substr((dataframe[z, ])$variable, 1, 8)), tz = "GMT",
                format = "%H:%M:%S")
print(x)
if ( (x >= as.POSIXct("06:00:00", tz = "GMT", format = "%H:%M:%S")) &
     (x < as.POSIXct("12:00:00", tz = "GMT", format = "%H:%M:%S")) ){
  position <- "first"
}
but I get this output:
character(0)
Error in if ((as.numeric(departure) - as.numeric(arrival)) < 0) { : argument is of length zero
how can I fix this so my comparison works and it prints the correct thing?
some examples of the dataframe$variable column:
16:33:00
15:34:00
14:51:00
07:26:00
05:48:00
11:10:00
17:48:00
06:17:00
08:22:00
11:31:00
Welcome to Stack Overflow!
First, the reason you've gotten some down votes is most likely because you haven't given much in your question to go on. For one thing, you haven't shown us what
(dataframe[z, ])$variable
is, which makes it hard for us to formulate a complete answer. You seem to be trying to extract a single value from a dataframe, is that right? If so, I've never seen it done that way. Try replacing the above with:
dataframe$variable[z]
My guess is what you're trying to achieve is a comparison of an entire column of the dataframe called "variable", since that's generally more useful...
Having said that, I often come up against issues with time data, and from what I've heard, my experiences are not uncommon. When I'm dealing with just times, as it appears you are here, I prefer the chron::times format over POSIXct. POSIXct is a date-time format, so a date is always included; it also tries to correct for time zone and daylight saving changes, which tends to get in my way more than it helps. If your data is in the format you specified in your first as.POSIXct call, you don't even need to specify a format when calling the times function:
x <- chron::times( dataframe$variable )
print(x)
position <- ifelse ( x >= chron::times( "06:00:00" ) &
x < chron::times( "12:00:00" ),
"first", "not first"
)
This will output a vector "position", with a result for all values taken from dataframe$variable. Does that achieve what you're hoping for?
From here, if you did want to extract the comparison result for the particular row "z" in dataframe, you can still do that with
position[z]
EDIT to add:
It might be worth checking for missing values in "variable". This should return TRUE:
sum( is.na( dataframe$variable ) ) == 0
Also check for any that aren't correctly formatted. Again, this should return TRUE:
sum( is.na( chron::times( dataframe$variable ) ) ) == 0
EDIT to add:
As per the comments, it looks like some values in your "variable" column aren't converting properly. You should be able to find them with
subset( dataframe, is.na( chron::times( variable ) ) )
That should let you see what's wrong. It may be a single cell, or it may be a number of them. You'll need to tidy up that data, which you can do in a few ways. You could go through and fix them manually, or you could add a function in your script to repair them before the conversion. The latter might be a good idea if all of those values share a common issue, or if you expect the same issue to recur as new data comes in (if indeed you need to allow for that).
The other option is simply to exclude those rows from your analysis. If you go this route, make sure it's appropriate to the analysis you're running. If it is appropriate in your case, you can add a step to clean up the dataframe before running the steps in your question:
dataframe <- subset( dataframe, !is.na( chron::times( variable ) ) )
NOTE: there's a good chance this will come up with a warning. If you run the same line twice, and the warning goes away the second time (after the offending rows have been removed), you may need to look further into it.
That should drop the offending values, leaving only values that are properly converting to the times format, which should help with the steps you're trying to run. Check how your dataframe dimensions change before and after that step; that'll tell you how many rows you're dropping.
You could do the same thing with POSIXct if that's what you're comfortable with, I'm just personally more comfortable with times for what you're doing.
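For completeness, here is a minimal sketch of the POSIXct version, using a few of the sample times from the question plus one NA to stand in for a bad value. The parse_time helper is a hypothetical name introduced only for this example:

```r
# Sample values from the question, plus an NA standing in for a bad entry.
times_chr <- c("16:33:00", "07:26:00", "05:48:00", "11:10:00", NA)

# Hypothetical helper: parse every value with the same format and time zone.
# The current date is attached to all of them, so they remain comparable.
parse_time <- function(x) as.POSIXct(x, format = "%H:%M:%S", tz = "GMT")

x <- parse_time(times_chr)
lower <- parse_time("06:00:00")
upper <- parse_time("12:00:00")

# ifelse() is vectorised, so it handles the whole column at once and
# yields NA for values that failed to parse.
position <- ifelse(x >= lower & x < upper, "first", "not first")
print(position)
```

Unlike if, which requires exactly one TRUE or FALSE and fails with "argument is of length zero" on an empty input, ifelse simply returns an empty result when given an empty vector.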

R row selection providing partial results

I'm having an issue, which I have found a solution for, but would like to understand what was going on in the original coding.
So I started with a table pulled from an SQL database and wanted information for 1 client, who is covered by 2 client numbers.
Originally I was running this to select those account numbers.
match <- c("C524",'5568')
gtc <- gtc[gtc$AccountNumber == match,]
However this was only returning about half of the desired results, and the results returned vary at different times (this was running as a weekly report), and depending on the PC running it.
Now, I've set up a loop which works fine and extracts all the results, but would really like to know what was going on with the original query.
match <- c("C524",'5568')
for (each in match) {
gtcLoop<- gtc[gtc$AccountNumber == each,]
result<-rbind(result,gtcLoop)
}
Also, long time lurker, first time poster so let me know if I've done anything wrong in this question.
You need to replace == by %in%:
gtc <- data.frame(AccountNumber = sample(c(match, "something"), 10, replace = TRUE))
gtc[gtc$AccountNumber %in% match,]
Just to tag onto Qaswed's answer (+1), you need to understand what is happening when you compute vector comparisons like ==. See:
?`==`
and
?`%in%`
then try something like 1 == c(1,2) and 1 %in% c(1,2).
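The reason you are getting about half the results is that == compares element by element and recycles the shorter vector: row 1 is compared with "C524", row 2 with "5568", row 3 with "C524" again, and so on. A row is kept only when its account number happens to line up with the recycled value, which would also explain why the result changes whenever the rows come back in a different order. For example: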
df <- data.frame(id=c(1:5), acct_cd = letters[1:5])
df[df$acct_cd == c("a","c"),] # this is wrong, for demo only
df[df$acct_cd %in% c("a","c"),] # this is correct
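To make the element-by-element behaviour of == visible, here is a small sketch on a plain vector. The account codes are invented, and the length is chosen as a multiple of 2 so that no recycling warning is raised:

```r
acct <- c("a", "c", "a", "c", "c", "a")
wanted <- c("a", "c")

# == recycles wanted to a, c, a, c, a, c and compares position by position.
by_position <- acct == wanted
print(by_position)   # TRUE TRUE TRUE TRUE FALSE FALSE

# %in% asks, for each element of acct, whether it appears anywhere in wanted.
by_membership <- acct %in% wanted
print(by_membership) # TRUE TRUE TRUE TRUE TRUE TRUE
```

The last two elements are valid accounts, but they sit in the "wrong" position relative to the recycled vector, so == drops them.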

Need an explanation for a particular R code snippet

The following is the code I need an explanation for:
for (i in id) {
  data <- read.csv(files[i] )
  c <- complete.cases(data)
  naRm <- data[c, ]
  completeCases <- rbind(completeCases, c(i, nrow(naRm)))
As I understand it, the variable c here stores multiple logical values. The line after that seems foreign to me. How does data[c, ] work?
FYI, I am an R newbie.
complete.cases looks for all rows that are "complete", i.e. have no missing values, and returns one logical value per row (see ?complete.cases). Indexing with data[c, ] then keeps only the rows where c is TRUE. Thus the completeCases object will tell you the number of "complete" rows in each file you have just read. You don't really need to store the value of i in the rbind call, though, as it is just the row number and therefore redundant. A vector would do just fine for this application.
Also, it looks like you are missing a closing bracket, or this isn't a complete chunk of code.
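A small sketch of how the logical index works, using an invented three-row data frame in place of the CSV files. The index is named keep here rather than c, only to avoid shadowing the c() function:

```r
# Invented data: row 2 has a missing x, row 3 has a missing y.
data <- data.frame(x = c(1, NA, 3),
                   y = c("a", "b", NA),
                   stringsAsFactors = FALSE)

# One logical value per row: TRUE when the row has no missing values.
keep <- complete.cases(data)
print(keep)        # TRUE FALSE FALSE

# A logical vector in the row position keeps only the TRUE rows;
# the empty column position after the comma means "all columns".
naRm <- data[keep, ]
print(nrow(naRm))  # 1
```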

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[,list(Result, Time), by=key(reviewDT)])
calculate_Ratio <- function(dt){
  tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
                      dt[which(dt[,Time] == "t=240min"),Result],
                      ifelse(grepl("hlm",inFile, ignore.case = TRUE),
                             dt[which(dt[,Time] == "t=60"),Result],
                             dt[which(dt[,Time] == "t=30"),Result]))
  t0Value <- dt[which(dt[,Time] == "t=0"),Result]
  return(dt[,Ratio:=tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) behaves in a surprising way: it produces a result with the attributes and length of test, filled with values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
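Here is a minimal sketch of that difference, with invented values. The is_hlm flag is a hypothetical stand-in for a single grepl() result on inFile:

```r
# Hypothetical stand-ins: one logical test and several candidate values.
is_hlm <- TRUE
values <- c(10, 20, 30)

# ifelse() returns something shaped like the test, which has length one,
# so only the first element of values survives.
from_ifelse <- ifelse(is_hlm, values, 0)
print(from_ifelse)   # 10

# if / else returns the chosen branch unchanged.
from_if <- if (is_hlm) values else 0
print(from_if)       # 10 20 30
```

This matches the symptom described in the question, where tmaxValue ends up as a single value (the first match) while t0Value keeps every match.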
