R row selection providing partial results - R

I'm having an issue which I have found a solution for, but I would like to understand what was going on in the original code.
So I started with a table pulled from an SQL database and wanted information for 1 client, who is covered by 2 client numbers.
Originally I was running this to select those account numbers.
match <- c("C524", "5568")
gtc <- gtc[gtc$AccountNumber == match, ]
However, this was only returning about half of the desired results, and the results returned varied at different times (this was running as a weekly report) and depending on the PC running it.
Now, I've set up a loop which works fine and extracts all the results, but would really like to know what was going on with the original query.
match <- c("C524", "5568")
result <- data.frame()  # initialise an empty data frame to collect the rows
for (each in match) {
  gtcLoop <- gtc[gtc$AccountNumber == each, ]
  result <- rbind(result, gtcLoop)
}
Also: long-time lurker, first-time poster, so let me know if I've done anything wrong in this question.

You need to replace == with %in%:
gtc <- data.frame(AccountNumber = sample(c(match, "something"), 10, replace = TRUE))
gtc[gtc$AccountNumber %in% match,]

Just to tag onto Qaswed's answer (+1), you need to understand what is happening when you compute vector comparisons like ==. See:
?`==`
and
?`%in%`
then try something like 1 == c(1,2) and 1 %in% c(1,2).
The reason you are getting about half the results is vector recycling: == recycles the shorter vector along the column, so each row is compared against only one of the two account numbers, alternating down the column, as in:
df <- data.frame(id=c(1:5), acct_cd = letters[1:5])
df[df$acct_cd == c("a","c"),] # this is wrong, for demo only
df[df$acct_cd %in% c("a","c"),] # this is correct
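For the record, the demo above prints the following (with a warning on the first comparison that the longer object length is not a multiple of the shorter):
df$acct_cd == c("a", "c")
# c("a", "c") is recycled to c("a", "c", "a", "c", "a"), so row 3 ("c")
# is compared against "a" and missed:
# [1]  TRUE FALSE FALSE FALSE FALSE
df$acct_cd %in% c("a", "c")
# each row is checked against both values:
# [1]  TRUE FALSE  TRUE FALSE FALSE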


Why after I use "subset", the filtered data is less than it should be?

I want to have "Blancas" and "Sultana" under the "Variete" column.
Figure 1 is the original data, figure 2 is the expected result, and figure 3 is the result I obtained with the code below:
library(readxl)  # for read_excel()

df <- read_excel("R_NLE_FTSW.xlsx")
options(scipen = 200)
BLANCAS <- subset(df, Variete == c("Blancas", "Sultana"))
View(BLANCAS)
It's obvious that some of the BLANCAS data is missing.
P.S. And if I try it in a sub-sheet, the final result is sometimes 5 times bigger!
path <- "R_NLE_FTSW.xlsx"
df <- map_dfr(excel_sheets(path),
              ~ read_xlsx(path, sheet = 4))
I don't understand why the result is sometimes more and sometimes less than expected. Can anyone help me? Thank you so much!
First of all, while you mention that you need both "Blancas" and "Sultana", your expected result shows only Blancas! So get that straight first.
For data like this coming from Excel:
Always clean the data after it is imported. Check the unique values to find whether there are any extra spaces, etc.
Trim the character data, ensure date fields are correct, and make sure numbers are numeric (not character).
Now, to subset the data, use df %>% filter(Variete %in% c('Blancas','Sultana')) - you can modify the c() vector to include the items of interest. If you wish to clean on the go:
df %>% filter(trimws(Variete) %in% c('Blancas','Sultana'))
As for your sub-sheet problem: we don't even know what data is in it; if it is similar, apply the same logic. Note, though, that your map_dfr call reads sheet 4 once for every sheet in the workbook (the formula ignores the sheet name it is passed), which would multiply the rows by the number of sheets; if you meant to read each sheet in turn, use sheet = .x.
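As a minimal sketch of the %in% filter (toy data; the FTSW column name is invented for illustration):
library(dplyr)
df <- data.frame(Variete = c(" Blancas", "Sultana", "Autre", "Blancas "),
                 FTSW = c(0.9, 0.8, 0.7, 0.6))
df %>% filter(trimws(Variete) %in% c("Blancas", "Sultana"))
# keeps rows 1, 2 and 4; a == comparison against the length-2 vector
# would have recycled and silently dropped some of these rows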

Removing a set of adjacent rows of a data frame meeting a specific pattern - R

I posted this question on 12/19. I received one response that was very helpful, but not quite what I was looking for. Then the question was closed by three folks with the specification that it needed more focus. The instructions indicated I could update the question or post a new one, but after editing it to make it more focused it remained closed. So, I am posting it again.
Here is the link to the edited question, including a more concise dataset (which had been one critical comment): Identifying a specific pattern in several adjacent rows of a single column - R
But, in case that link isn't allowed, here's the content:
I need to remove a specific set of rows from data when they occur. In our survey, an automated telephone survey, the survey tool will attempt three times during that call to prompt the respondent to enter a response. After three timeouts of the question the survey tool hangs up. This mostly happens when the call goes to someone's voicemail.
I would like to identify that pattern when it happens so I can remove it from calculating call time.
The pattern I am looking for looks like this in the Interaction column:
It doesn't HAVE to be Intro. It can be any part of the survey where the respondent is prompted for a response THREE times but no response is provided, so the call fails. But it does have to be sandwiched in between "Answer" (the phone picks up) and "Timeout. Call failed." (a failure).
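Roughly like this (a hypothetical reconstruction, since the original screenshot is not reproduced here):
Answer
Intro
Timeout
Intro
Timeout
Intro
Timeout
Timeout. Call failed.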
I did try to apply what I learned from yesterday's solution (about run length encoding) to my other indexing question but I couldn't make it work in the slightest. So, here I am.
Here's an example dataset: this is 4 respondents and every interaction between the survey tool and the respondent (or their phone, essentially).
Here's the code for the data frame (the link goes to a Google Drive text editor with the code):
The response I got from Rui Barradas was this:
removeRows <- function(X, col = "Interaction",
                       ans = "Answer",
                       fail = c("Timeout. Call failed.", "Partial", "Enqueueing call"))
{
  a <- grep(ans, X[[col]])        # rows where a call is answered
  f <- which(X[[col]] %in% fail)  # rows where a call ends in failure
  a <- a[findInterval(f, a)]      # pair each failure with the last answer before it
  for (i in seq_along(a)) {
    X[[col]][a[i]:f[i]] <- NA_character_  # flag every row in the answer-to-failure block
  }
  Y <- X[complete.cases(X), , drop = FALSE]  # keep only the unflagged rows
  Y
}
removeRows(survey_data)
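For example, with a small made-up snippet of survey data (hypothetical Interaction values):
survey_data <- data.frame(Interaction = c("Answer", "Intro", "Timeout",
                                          "Intro", "Timeout", "Intro",
                                          "Timeout. Call failed.",
                                          "Answer", "Intro", "Response recorded"))
removeRows(survey_data)
# drops rows 1 through 7 (the whole Answer-to-failure block) and keeps
# the second, successful call (rows 8 to 10)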
However, this solution is too broad. I need to remove only the rows where three attempts are made to prompt a response but no response is provided - so where the prompt is, say, Intro, there is no response, it times out, and eventually the call fails.
Thanks!
I would normally use the dplyr package. I'm sure this method can be modified to use base R if needed, but dplyr has pre-made functions that make it easier. Comments in the code explain what it's doing.
df2 <- df %>%
  # Find any entry where there were three timeouts evenly spaced afterwards and set TRUE.
  # You can add other conditions here if needed (to check leading values as well).
  mutate(triple_timeout = ifelse(
    lead(Interaction, n = 1) == "Timeout" &
    lead(Interaction, n = 3) == "Timeout" &
    lead(Interaction, n = 5) == "Timeout",
    TRUE,
    FALSE
  )) %>%
  # lead() produces some NA values at the end of the data, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout), FALSE, triple_timeout)) %>%
  # Every triple timeout has six entries that should be TRUE, but only the first is id'd.
  # Use `or` logic and lag statements to set the value to TRUE for the 5 entries after any TRUE
  mutate(triple_timeout = triple_timeout |
           lag(triple_timeout, n = 1) |
           lag(triple_timeout, n = 2) |
           lag(triple_timeout, n = 3) |
           lag(triple_timeout, n = 4) |
           lag(triple_timeout, n = 5)
  ) %>%
  # lag() produces some NA values at the start of the data, so fill those in
  mutate(triple_timeout = ifelse(is.na(triple_timeout), FALSE, triple_timeout)) %>%
  # Drop any row flagged as part of a triple timeout
  filter(!triple_timeout) %>%
  # Remove the helper column
  select(-triple_timeout)
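As a side note, dplyr's lead() and lag() take a default argument, so the two NA-fill mutate steps could be avoided by supplying a default up front, e.g.:
lead(Interaction, n = 1, default = "") == "Timeout"
lag(triple_timeout, n = 1, default = FALSE)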
I'll know for sure in the coming month when I have this kind of data for 5K respondents. But I have decent RAM. Thanks again!

After filtering, all my values become NA - R

Please help me!
I have quite a big data set containing bank accounts. It is organised in the following way:
V1 - register number of a bank
V2 - date of the account value record
V3 - account number
All the remaining V columns are for the values themselves (in currency, metals, etc.).
I need to filter by account number, keeping everything in the table but only for specific account numbers. Here is the code I use:
filelist = list.files(pattern = ".txt")
datalist = lapply(filelist, function(x) read.table(x, header = FALSE, sep = ";"))
all_data = do.call("rbind", datalist)
r_d <- rename(all_data, c("V1"="Number", "V2"="Dates", "V3"="Account"))
r_d$Account <- as.character(r_d$Account)
f_d <- filter(all_data, r_d$Account >= 42301 & r_d$Account <= 42315 |
                        r_d$Account >= 20202 & r_d$Account <= 20210 |
                        r_d$Account == 98010 | r_d$Account == 98015)
The problem is that the output of this code is a table containing only NAs: everything becomes NA, even though those account numbers exist, and I am absolutely sure of that.
If I use Account in filter instead of r_d$Account, R tells me that object Account does not exist, which I also do not understand.
Please correct me.
There are several things wrong with your code. The reason you are getting NAs is that you are passing NULLs all over the place. Did you ever look at r_d$Account? When you see problems in your code, you should start by working through things piecemeal, step by step, and in this case you'll see that r_d$Account gives you NULL. Why? Because you did not rename the columns correctly. colnames(r_d) will be revealing.
First, rename either does non-standard evaluation with un-quoted arguments, or rename_ takes a vector of character=character pairs. These might work (I can't know for certain, since I'm not about to transcribe your image of data ... please provide copyable output from dput next time!):
# non-standard evaluation
rename(all_data, Number=V1, Dates=V2, Account=V3)
# standard-evaluation #1:
rename_(all_data, Number="V1", Dates="V2", Account="V3")
# standard-evaluation #2
rename_(all_data, .dots = c("Number"="V1", "Dates"="V2", "Account"="V3"))
From there, if you step through your code, you should see that r_d$Account is no longer NULL.
Second, is there a reason you create r_d but still reference all_data? There are definitely times when you need to do this kind of stuff; here is not one of them, as it is too prone to problems (e.g., if the row order or dimensions of one of them changes).
Third, because you convert $Account to character, it is really inappropriate to use inequality comparisons. Though it is certainly legal to do so ("1" < "2" is TRUE), it will run into problems, such as "11" < "2" is also TRUE, and "3" < "22" is FALSE. I'm not saying that you should avoid conversion to string; I think it is appropriate. Your use of account ranges is perplexing: an account number should be categorical, not ordinal, so selecting a range of account numbers is illogical.
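A quick demonstration of how string comparison bites (the same examples as above, run in R):
"1" < "2"    # TRUE  -- looks fine
"11" < "2"   # TRUE  -- lexicographic: "1" sorts before "2"
"3" < "22"   # FALSE -- "3" sorts after "2"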
Fourth, even assuming that account numbers should be ordinal and ranges make sense, your use of filter can be improved, but only if you either (a) accept that comparisons of stringified-numbers is acceptable ("3" > "22"), or (b) keep them as integers. First, you should not be referencing r_d$ within a NSE dplyr function. (Edit: you also need to group your logic with parentheses.) This is a literal translation from your code:
f_d <- filter(r_d, (Account >= 42301 & Account <= 42315) |
                   (Account >= 20202 & Account <= 20210) |
                   Account == 98010 | Account == 98015)
You can make this perhaps more readable with:
f_d <- filter(r_d,
              Account %in% c(98010, 98015) |
              between(Account, 42301, 42315) |
              between(Account, 20202, 20210)
)
Perhaps a better way to do it, assuming $Account is character, would be to determine which accounts are appropriate based on some other criteria (open date, order date, something else from a different column), and once you have a vector of account numbers, do
filter(r_d, Account %in% vector_of_interesting_account_numbers)
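Putting the rename and the filter together, a minimal end-to-end sketch (the account values are invented for illustration, and Account is kept numeric here):
library(dplyr)
all_data <- data.frame(V1 = 1:4,
                       V2 = c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"),
                       V3 = c(42310, 98010, 11111, 20205))
r_d <- rename(all_data, Number = V1, Dates = V2, Account = V3)
filter(r_d,
       Account %in% c(98010, 98015) |
       between(Account, 42301, 42315) |
       between(Account, 20202, 20210))
# keeps the rows with accounts 42310, 98010 and 20205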

Need an explanation for a particular R code snippet

The following is the code I need an explanation for:
for (i in id) {
  data <- read.csv(files[i])
  c <- complete.cases(data)
  naRm <- data[c, ]
  completeCases <- rbind(completeCases, c(i, nrow(naRm)))
As I understand it, the variable c here stores multiple logical values. The line after that seems foreign to me. How does data[c, ] work?
FYI, I am an R newbie.
complete.cases looks for all rows that are "complete", i.e. have no missing values. Here is the man page. Thus the completeCases object will tell you the number of "complete" rows in each file you have just read. You really don't need to store the value of i in the rbind call, though, as it is just the row number, so it is redundant; a vector would do just fine for this application.
Also, it looks like you are missing a closing brace, or this isn't a complete chunk of code.
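To make the data[c, ] step concrete, a minimal sketch with toy data:
dat <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
keep <- complete.cases(dat)
keep
# [1]  TRUE FALSE FALSE
dat[keep, ]  # a logical row index: keeps only the rows where keep is TRUE
#   x y
# 1 1 a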

Problem with data.table ifelse behavior

I am trying to calculate a simple ratio using data.table. Different files have different tmax values, so that is why I need ifelse. When I debug this, the dt looks good. The tmaxValue is a single value (the first "t=60" encountered in this case), but t0Value is all of the "t=0" values in dt.
summaryDT <- calculate_Ratio(reviewDT[, list(Result, Time), by = key(reviewDT)])

calculate_Ratio <- function(dt){
  tmaxValue <- ifelse(grepl("hhep", inFile, ignore.case = TRUE),
                      dt[which(dt[, Time] == "t=240min"), Result],
                      ifelse(grepl("hlm", inFile, ignore.case = TRUE),
                             dt[which(dt[, Time] == "t=60"), Result],
                             dt[which(dt[, Time] == "t=30"), Result]))
  t0Value <- dt[which(dt[, Time] == "t=0"), Result]
  return(dt[, Ratio := tmaxValue/t0Value])
}
What I am getting out is the Result for tmaxValue divided by all of the Results for all of the t0Values, but what I want is a single ratio for each unique by group.
Thanks for the help.
You didn't provide a reproducible example, but typically using ifelse is the wrong thing to do.
Try using if(...) ... else ... instead.
ifelse(test, yes, no) acts very weird: it produces a result with the attributes and length of test, and the values from yes or no.
...so in your case you should get something without attributes and of length one - and that's probably not what you wanted, right?
[UPDATE] ...Hmm or maybe it is since you say that tmaxValue is a single value...
Then the problem isn't in calculating tmaxValue? Note that ifelse is still the wrong tool for the job...
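A quick illustration of the difference (toy values, not the question's data):
ifelse(TRUE, 1:5, 10)
# [1] 1          -- the result is truncated to the length of the test
if (TRUE) 1:5 else 10
# [1] 1 2 3 4 5  -- if/else returns the chosen branch unchanged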
