Shorten code to filter data

Take this code:
quite_long_data_frame_name <- data.frame(variable.name = rnorm(50, 3, 2))
quite_long_data_frame_name$variable.name[quite_long_data_frame_name$variable.name > 2 & quite_long_data_frame_name$variable.name < 3] <- NA
In the last line, quite_long_data_frame_name$variable.name needs to be repeated 3 times. Is there any way to achieve the same result while typing quite_long_data_frame_name$variable.name just once? Can dplyr or magrittr achieve this?

Use subset and the negation of that logical vector:
subset( quite_long_data_frame_name, !(variable.name > 2 & variable.name < 3) )
If you want to destructively modify the original, then just assign that value to the original.
If you really do want a result with the NAs:
within(quite_long_data_frame_name,
    is.na(variable.name) <- (variable.name > 2 & variable.name < 3))
You will need to assign back to quite_long_data_frame_name if you want this result to replace the original.

In dplyr, I suppose you would do
quite_long_data_frame_name %>%
    mutate(variable.name = ifelse(variable.name > 2 & variable.name < 3, NA, variable.name))
Now you only type the data frame name once, but you have to type the variable name 4 times instead of 3. This could help if the variable names are short compared to the data frame name. Unfortunately, no more terse dplyr solution comes to mind.
As an alternative to attaching the data frame, use with:
with(quite_long_data_frame_name, variable.name[variable.name > 2 & variable.name < 3] <- NA)
Still pretty long. I don't know of any way to do this without typing variable.name at least 3 times.
Give your variables shorter names? :)
Note that if you wanted to actually filter (in the dplyr sense), it's easier:
quite_long_data_frame_name %>%
    filter(variable.name > 2 & variable.name < 3)
but this is shorter in base R as well.
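For instance, using subset() from the first answer, this time without the negation:
subset(quite_long_data_frame_name, variable.name > 2 & variable.name < 3)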
===== Edit ==========
For this specific conditional, you can use the %between% operator from the data.table package. Shorter, but not very general. Aggregating everything here, we get
with(quite_long_data_frame_name, is.na(variable.name) <- variable.name %between% c(2, 3))
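One caveat on the with() one-liners above: with() evaluates in a temporary environment, so the assignment inside it does not change the data frame itself, whereas within() returns the modified copy for you to assign back. Note also that %between% is inclusive on both ends, unlike the strict > 2 & < 3 (harmless for continuous rnorm() data, but worth knowing). A sketch covering both points:
library(data.table)  # for %between%
# within() returns the modified copy; assign it back to keep the result
quite_long_data_frame_name <- within(quite_long_data_frame_name,
    is.na(variable.name) <- variable.name %between% c(2, 3))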

Related

Dynamically subsetting data.table in R

My data.table contains K columns of claims, among 30 other columns. I want to subset the data.table so that only rows remain which do not have 0 claims.
So, first I get all the column names I need for filtering. For the purpose of this example, I have chosen K = 2:
> claimsCols = c("claimsnext", paste0("claims" , 1:K))
> claimsCols
[1] "claimsnext" "claims1" "claims2"
I have tried subsetting like this:
for(i in claimsCols){
    BTplan <- BTplan[ claimsCols[i] == 0, ]
    i+1
}
This doesn't work:
Error in i + 1 : non-numeric argument to binary operator
I am sure there is a better way to do this?
I would basically do what akrun does
idx = BTplan[ , Reduce(`&`, .SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
The innovations are:
Use patterns() in .SDcols to specify the columns to include by name pattern
& automatically converts numeric to logical, i.e. 1.1 & 2.2 is TRUE, and the result becomes FALSE as soon as there's a 0 anywhere (hence filtering out the corresponding row); see the quick check below
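A quick check of that coercion at the console (hypothetical values):
1.1 & 2.2                                  # TRUE: both operands are nonzero
1.1 & 0                                    # FALSE: a zero anywhere gives FALSE
Reduce(`&`, list(c(1, 2, 0), c(3, 0, 5)))  # TRUE FALSE FALSE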
In a future version of data.table this will be slightly more efficient and concise (and hopefully more readable):
idx = BTplan[ , pall(.SD), .SDcols = patterns('claims')]
BTplan = BTplan[idx]
Keep an eye on this pull request:
https://github.com/Rdatatable/data.table/pull/4448
In the OP's code, i is each of the elements of 'claimsCols', which is character, so i + 1 won't work; and in fact it is not needed, because for() advances the loop on its own:
for(colnm in claimsCols) {
    BTplan <- BTplan[BTplan[[colnm]] != 0,]
}
Or using data.table syntax
library(data.table)
setDT(BTplan)
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)),.SDcols = claimsCols]]
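A minimal check with made-up data (a hypothetical three-row BTplan) confirms that only rows with no zero in any claims column survive:
library(data.table)
BTplan <- data.table(claimsnext = c(1, 0, 2),
                     claims1    = c(3, 4, 0),
                     claims2    = c(5, 6, 7))
claimsCols <- c("claimsnext", "claims1", "claims2")
BTplan[BTplan[, Reduce(`&`, lapply(.SD, `!=`, 0)), .SDcols = claimsCols]]
#    claimsnext claims1 claims2
# 1:          1       3       5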

Passing list of variable names to custom function with mutate

I am trying to perform a function over each row and create a new column that considers multiple columns, using the tidyverse. I was initially using rowwise() but that was very slow. I want the list of columns passed to my custom function to be a variable, but I can't get it to work unless I explicitly list the variable names. For example, this works:
low_risk_codes <- c(0,1,10)
vars <- c("V1", "V2")
m <- matrix(1:9, ncol=3)
classify_low_risk_drug <- function(...){
    t <- cbind(...)
    return(apply(t, 1, function(x) ifelse(any(x %in% low_risk_codes), 1, 0)))
}
as.data.frame(m) %>%
    mutate(val4 = classify_low_risk_drug(V1, V2))
But if I want it to evaluate using the column input as vars:
as.data.frame(m) %>%
    mutate(val4 = classify_low_risk_drug(vars))
I can't get it to work, even if I include !!. What am I missing?!
Also, any suggestions for how to do this with map instead are appreciated!
This sounds like it will do what you want, but I need to qualify it (a lot). First, FYI, I am still wrapping my mind around NSE in R, but I find this vignette very helpful.
As for the solution, I tried to speed up the function by avoiding rowwise() and apply(). It should be quicker with rapply()/rowSums(), but I did not benchmark it. It may run into issues with very large data because rowSums() converts the data frame into a matrix, but that probably won't be a problem. In theory, you should also be able to use select helpers / unquoted variable names / column positions (if you so dare).
Also, I find it a little quirky that you need to supply the data frame as the first argument (i.e., as .), but there may be a way around that. I am certainly open to anyone who wants to edit this / use it as the base for their solution. Hope this helps / gets you going in the right direction!
classify_low_risk_drug <- function(.data, vars, codes, na.rm = FALSE){
    # flag every value in the data frame that matches one of the codes
    df <- rapply(.data, function(x) x %in% codes, how = "replace")
    # a row scores 1 if any of the selected columns matched
    as.integer(rowSums(select(df, !!enquo(vars)), na.rm = na.rm) > 0)
}
as.data.frame(m) %>%
    mutate(val4 = classify_low_risk_drug(., vars = vars, codes = c(0, 1, 10)))
  V1 V2 V3 val4
1  1  4  7    1
2  2  5  8    0
3  3  6  9    0
EDIT: you could improve the speed a little by avoiding the matrix conversion and using lapply() with pmax():
classify_low_risk_drug2 <- function(.data, vars, codes, na.rm = FALSE){
    as.integer(do.call(pmax, lapply(select(.data, !!enquo(vars)), `%in%`, codes)))
}
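It is called the same way as the first version; a quick sketch reusing the m, vars, and low_risk_codes objects from the question:
as.data.frame(m) %>%
    mutate(val4 = classify_low_risk_drug2(., vars = vars, codes = low_risk_codes))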

R: Mutate using a variable name instead of its value

I'm trying to create a loop and for each iteration (the number of which can vary between source files) construct a mutate statement to add a column based on the value of another column.
Having my programming background in PHP, to my mind this should work:
for(i in number){
    a <- paste("a", i, sep="")
    filtercol <- paste("DateDiff_", i, sep="")
    dataset <- mutate(dataset, a = ifelse(filtercol >= 0 & filtercol <= 364, 1, NA))
}
But... as I've noticed a couple of times now with R functions, sometimes the function outright ignores that you have defined a variable with that name, as mutate() does here.
So instead of getting several columns titled "a1", "a2", "a3", etc., I get one column entitled "a" that gets overwritten each iteration.
Firstly, can somebody point out where I'm going wrong here? And secondly, could someone explain under what circumstances R ignores variable names? It's happened a couple of times now and it just seems wildly inconsistent at this point. I'm sure it's not, and there's logic there, but it's certainly well obfuscated.
It's also worth mentioning that originally I tried it this way:
just.dates <- just.dates %>%
    for(i in number){
        a <- paste("a", i, sep="")
        filtercol <- paste("DateDiff_", i, sep="")
        mutate(a = ifelse(filtercol >= 0 & filtercol <= 364), 1, NA)
    }
But that way decided I was passing the for() loop 4 arguments when it only wanted three (the pipe inserts just.dates as an extra first argument to for(), which takes exactly three: the variable, the sequence, and the body).
Something like this may work for you. The mutate_() function, as opposed to just mutate(), should help you out with this.
# Create dataframe for testing
dataset <- data.frame(date = as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
                                       "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"),
                      event = c(0,0,1,0,1, 1,0,1,0,1),
                      id = c(rep(1,5), rep(2,5)),
                      DateDiff_1 = c(-2,0,34,700, rep(5,6)),
                      DateDiff_2 = c(20,-12,360,900, rep(5,6))
                      )
# Set test number vector
number <- c(1:2)
# Begin loop through numbers
for(i in number){
    # Set the name of the new column to be created
    newcolumn <- paste("Column", i, sep="")
    # Set the name of the column to be filtered
    filtercolumn <- paste("DateDiff_", i, sep="")
    # Create the function to be passed into the mutate command
    mutate_function = lazyeval::interp(~ ifelse(fc >= 0 & fc <= 364, 1, NA), fc = as.name(filtercolumn))
    # Apply the mutate command to the dataframe
    dataset <- dataset %>%
        mutate_(.dots = setNames(list(mutate_function), newcolumn))
}
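For what it's worth, mutate_() and lazyeval have since been deprecated in favour of tidy evaluation. A rough modern equivalent (a sketch, assuming dplyr 0.7+ and rlang are available) builds the new column name as a string and the filter column as a symbol:
library(dplyr)
library(rlang)
for(i in number){
    # name of the new column, as a string
    newcolumn <- paste0("Column", i)
    # column to test, as a symbol so it can be unquoted with !!
    filtercolumn <- sym(paste0("DateDiff_", i))
    dataset <- dataset %>%
        mutate(!!newcolumn := ifelse(!!filtercolumn >= 0 & !!filtercolumn <= 364, 1, NA))
}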

How to remove duplicated (by name) column in data.tables in R?

While reading a data set using fread, I've noticed that I sometimes get duplicated column names, for example (fread doesn't have a check.names argument):
> data.table( x = 1, x = 2)
   x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
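On the two-column example from the question, this keeps the first of the duplicated columns (a quick sketch):
dt <- data.table(x = 1, x = 2)
dt[, .SD, .SDcols = unique(names(dt))]
#    x
# 1: 1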
As @DavidArenburg suggests in the comments above, you could use check.names = TRUE in data.table() (and, in newer versions of data.table, in fread() too).
.SDcols approaches would return a copy of the columns you're selecting. Instead just remove those duplicated columns using :=, by reference.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
Different approaches:
Indexing
my.data.table <- my.data.table[ ,-2]
Subsetting
my.data.table <- subset(my.data.table, select = -2)
Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
Optionally, systematize the deletion of variables whose names meet some criterion (here, we'll get rid of all variables having a name ending with ".X", X being a number; the second occurrence of a name gets suffix .1 when using make.names, and so on):
my.data.table <- subset(my.data.table,
                        select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))
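Applied to the small table from the question, the last two approaches look like this (a sketch, using the same "\.\d$" pattern):
dt <- data.table(x = 1, x = 2)
setnames(dt, make.names(names(dt), unique = TRUE))
names(dt)
# [1] "x"   "x.1"
subset(dt, select = !grepl("\\.\\d$", names(dt)))
#    x
# 1: 1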

Drop columns per row based on a separate column value

Given a dummy data frame that looks like this:
Data1<-rnorm(20, mean=20)
Data2<-rnorm(20, mean=21)
Data3<-rnorm(20, mean=22)
Data4<-rnorm(20, mean=19)
Data5<-rnorm(20, mean=20)
Data6<-rnorm(20, mean=23)
Data7<-rnorm(20, mean=21)
Data8<-rnorm(20, mean=25)
Index<-rnorm(20,mean=5)
DF<-data.frame(Data1,Data2,Data3,Data4,Data5,Data6,Data7,Data8,Index)
What I'd like to do is remove (make NA) certain columns per row based on the Index column. I took the long way and did this to give you an idea of what I'm trying to do:
DF[DF$Index>5.0,8]<-NA
DF[DF$Index>=4.5 & DF$Index<=5.0,7:8]<-NA
DF[DF$Index>=4.0 & DF$Index<=4.5,6:8]<-NA
DF[DF$Index>=3.5 & DF$Index<=4.0,5:8]<-NA
DF[DF$Index>=3.0 & DF$Index<=3.5,4:8]<-NA
DF[DF$Index>=2.5 & DF$Index<=3.0,3:8]<-NA
DF[DF$Index>=2.0 & DF$Index<=2.5,2:8]<-NA
DF[DF$Index<=2.0,1:8]<-NA
This works fine as is, but is not very adaptable. If the number of columns change, or I need to tweak the conditional statements, it's a pain to rewrite the entire code (the actual data set is much larger).
What I would like to do is be able to define a few variables, and then run some sort of loop or apply to do exactly what the lines of code above do.
As an example, in order to replicate my long code, something along the lines of this kind of logic:
NumCol<-8
Max<-5
Min<-2.0
if index > Max, then drop NumCol
if index >= (Max-0.5) & <= Max, then drop NumCol:(NumCol - 1)
repeat until Min is reached
I don't know if that's the most logical line of reasoning in R, and I'm pretty bad with looping and apply, so I'm open to any line of thought that can replicate the above long lines of code with the ability to adjust the above variables.
If you don't mind changing your data.frame to a matrix, here is a solution that uses indexing by a matrix. The building of the two-column matrix of indices to drop is a nice review of the apply family of functions:
# interval breakpoints for Index
Seq <- seq(Min, Max, by = 0.5)
# for each row, the range of column numbers to drop (findInterval() bins Index into Seq)
col.idx <- lapply(findInterval(DF$Index, Seq) + 1, seq, to = NumCol)
# repeat each row number once per column it loses
row.idx <- mapply(rep, seq_along(col.idx), sapply(col.idx, length))
# two-column (row, column) matrix of the cells to blank out
drop.idx <- as.matrix(data.frame(unlist(row.idx), unlist(col.idx)))
M <- as.matrix(DF)
M[drop.idx] <- NA
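If you need a data frame again afterwards, converting back is lossless here only because every column of DF is numeric (a matrix holds a single type):
DF <- as.data.frame(M)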
Here is a memory-efficient (but I can't claim elegant) data.table solution.
It uses the very useful function findInterval() to replace your chain of less-than / greater-than conditionals:
library(data.table)
DT <- data.table(DF)
# create an index column: bins 1:8 correspond to your greater-than / less-than conditions
DT[, IND := findInterval(Index, c(-Inf, seq(2, 5, by = 0.5), Inf))]
# the columns you want to change
changing <- names(DT)[1:8]
setkey(DT, IND)
# loop through the indexes and alter by reference
for(.ind in DT[, unique(IND)]){
    # the columns to blank for this bin: bin 1 (lowest Index) drops all of them,
    # bin 8 (highest Index) only the last
    .which <- tail(changing, length(changing) - .ind + 1)
    # create a call to `:=`(a = as(NA, class(a)), b = as(NA, class(b)), ...)
    pairlist <- mapply(sprintf, .which, .which, MoreArgs = list(fmt = '%s = as(NA,class(%s))'))
    char_exp <- sprintf('`:=`( %s )', paste(pairlist, collapse = ','))
    .e <- parse(text = char_exp)
    DT[J(.ind), eval(.e)]
}
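Since all the data columns here are numeric, a plainer loop with data.table::set() avoids building expressions as strings; a sketch under that assumption, reusing the IND binning above:
for(.ind in DT[, unique(IND)]){
    .rows <- which(DT$IND == .ind)
    # bin 1 (lowest Index) blanks every data column, bin 8 only the last one
    for(col in tail(changing, length(changing) - .ind + 1)){
        set(DT, i = .rows, j = col, value = NA_real_)
    }
}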
