How to efficiently chunk a data.frame into smaller ones and process them

I have a bigger data.frame which I want to cut into smaller ones, depending on some "unique keys" (in the MySQL sense). At the moment I am doing this with the loop below, but it takes awfully long: ~45 seconds for 10k rows.
for (i in 1:nrow(identifiers_test)) {
  data_test_offer = data_test[(identifiers_test[i, "m_id"] == data_test[, "m_id"] &
                               identifiers_test[i, "a_id"] == data_test[, "a_id"] &
                               identifiers_test[i, "condition"] == data_test[, "condition"] &
                               identifiers_test[i, "time_of_change"] == data_test[, "time_of_change"]), ]
  # Sort data by highest prediction
  data_test_offer = data_test_offer[order(-data_test_offer[, "prediction"]), ]
  if (data_test_offer[1, "is_v"] == 1) {
    true_counter <- true_counter + 1
  }
}
How can I refactor this to make it more "R"-like - and faster?

Before applying groups, you are filtering your data.frame using another data.frame. I would use merge, then by.
ID <- c("m_id","a_id","condition","time_of_change")
filter_data <- merge(data_test,identifiers_test,by=ID)
by(filter_data, do.call(paste,filter_data[,ID]),
FUN=function(x)x[order(-x[,"prediction"]),])
Of course, the same thing can be written more efficiently using data.table:
library(data.table)
setkeyv(setDT(identifiers_test), ID)
setkeyv(setDT(data_test), ID)
data_test[identifiers_test][rev(order(prediction)), .SD, by = ID]
NOTE: the answer above is untested since you didn't provide any data to test it with.
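For illustration, here is a minimal sketch with made-up data (column names taken from the question, values invented) showing the merge-then-by approach and how it reproduces the true_counter tally from the original loop:

# Hypothetical toy data; the real column types and values are assumptions.
identifiers_test <- data.frame(m_id = 1, a_id = 1:2,
                               condition = "new", time_of_change = 10)
data_test <- data.frame(m_id = 1, a_id = rep(1:3, each = 2),
                        condition = "new", time_of_change = 10,
                        prediction = runif(6), is_v = rbinom(6, 1, 0.5))

ID <- c("m_id", "a_id", "condition", "time_of_change")
filter_data <- merge(data_test, identifiers_test, by = ID)
chunks <- by(filter_data, do.call(paste, filter_data[ID]),
             FUN = function(x) x[order(-x[["prediction"]]), ])
# Count the chunks whose highest-prediction row has is_v == 1
true_counter <- sum(vapply(chunks, function(x) x[1, "is_v"] == 1, logical(1)))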

Related

Speed up R's grep during an if conditional %in% operation

I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns have the name "treatmentmedication_code_#", where # is a number 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for (i in 1:nTreatments) {
  ...lots of code...
  controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug = 0)
  for (n in 1:nControls) {
    if (treatment %in% controlsDrugStatusDF[n, grep(pattern = "^treatmentmedication_code*",
                                                    x = colnames(controlsDrugStatusDF))]) {
      controlsDrugStatusDF$Drug[n] <- 1
    } else {
      controlsDrugStatusDF$Drug[n] <- 0
    }
  }
}
treatment is some coded medication, e.g. 145374524. The condition inside the if statement is very slow: it checks whether the treatment value is present in any of the columns selected by the grep for row n. To make matters worse, this is done for every treatment, hence the i for-loop.
Short of launching multiple processes, or massacring my data.frames into lots of separate matrices, pasting them together and converting them back into a data.frame, are there any notable improvements one could make to the if statement?
As part of the optimization, the grep for selecting the columns can be done outside the loop. The treatments part is not entirely clear; assume it is a vector of values. We can use:
nm1 <- grep("^treatmentmedication_code*",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
                                    function(x) +(x %in% treatments))
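If instead, as in the original loop, you want a single 0/1 Drug column flagging rows where the code appears in any of the medication columns, a minimal sketch along the same lines (it assumes treatment is a single code, as in the question):

code_cols <- grep("^treatmentmedication_code", colnames(controlsDrugStatusDF),
                  value = TRUE)
# TRUE where the treatment code appears in any medication-code column of the row
has_treatment <- Reduce(`|`, lapply(controlsDrugStatusDF[code_cols],
                                    function(x) x %in% treatment))
controlsDrugStatusDF$Drug <- as.integer(has_treatment)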

Double "for loops" in a dataframe in R

I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA. I want to replace observations that are more than 3 standard deviations from the mean with NA. I managed it column by column:
height = ifelse(abs(height - mean(height, na.rm = TRUE)) <
                  3 * sd(height, na.rm = TRUE), height, NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height),
                     paste(data$age, data$mark, sep = ""), NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1 = names(data)
list = c("age", "height", "mark")
ntraits = length(list)
nrows = dim(data)[1]
for (i in 1:ntraits) {
  a = list[i]
  b = which(d1 == a)
  d2 = data[, b]
  for (j in 1:nrows) {
    d2[j] = ifelse(abs(d2[j] - mean(d2, na.rm = TRUE)) < 3 * sd(d2, na.rm = TRUE),
                   d2[j], NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops that apply the conditions I want? I know there are similar questions, but I haven't got it working yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  x_outside_3s <- !is.na(x) & abs(x - mean_x) > 3 * sd_x
  x[x_outside_3s] <- NA  # no need for ifelse here
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data)  # or, some subset of columns
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is a built-in R function and a fairly important one at that. You also shouldn't generally name variables data, because there is also a data function, for loading built-in data sets.
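To tie it back to the example data from the question (kept under the name data for continuity, although a different name would be better per the note above), a minimal usage sketch:

vars_to_check <- c("age", "height", "mark")  # instead of a variable called list
data[vars_to_check] <- lapply(data[vars_to_check], NA_outside_3s)
data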

Deleting a row from a data set

I am trying to create a function that deletes n rows from a data set in R. The rows that I want to delete are the minimum values from the column time in the data set my_data_set.
I currently have
delete_data <- function(n) {
  k = 1
  while (k <= n) {
    my_data_set = my_data_set[-(which.min(my_data_set$time)), ]
    k = k + 1
  }
}
When I input these lines manually (without the use of the while loop) it works perfectly but I am not able to get the loop to work.
I am calling the function by:
delete_data(n = 2)
Any help is appreciated!
Thanks
Try:
my_data_set[ ! my_data_set$time == min(my_data_set$time), ]
Or if you are using data.table and wish to use the more direct syntax that data.table provides:
library(data.table)
setDT( my_data_set )
my_data_set[ ! time == min(time) ]
Then review how R works. R is a vectorized language that pretty much does what you mean without having to resort to complicated loops.
Also try:
my_data_set <- my_data_set[which(my_data_set$time > min(my_data_set$time)),]
By the way, which.min() will only pick up the first record if there is more than one record matching the minimum value.
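Note also that the original function only modifies a local copy of my_data_set, so the caller never sees the change. If the goal really is to drop the n rows with the smallest time values, a minimal sketch that takes the data as an argument and returns the result:

delete_data <- function(df, n) {
  # Drop the n rows with the smallest `time` values
  df[-order(df$time)[seq_len(n)], ]
}
my_data_set <- delete_data(my_data_set, n = 2)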

Alternate to using loops to replace values for big datasets in R?

I have a large (~4.5 million records) data frame, and several of the columns have been anonymised by hashing, and I don't have the key, but I do wish to renumber them to something more legible to aid analysis.
To this end, for example, I've deduced that 'campaignID' has 161 unique elements over the 4.5 million records, and have created a vector to hold these. I've then tried writing a for/if loop to search through the full dataset using the unique element vector: for each value of 'campaignID', it is checked against the unique element vector, and when a match is found, the index of the unique element vector is returned as the new campaign ID.
campaigns_length <- length(unique_campaign)
dataset_length <- length(dataset$campaignId)
for (i in 1:dataset_length) {
  for (j in 1:campaigns_length) {
    if (dataset$campaignId[[i]] == unique_campaign[[j]]) {
      dataset$campaignId[[i]] <- j
    }
  }
}
The problem of course is that, while it works, it takes an enormously long time - I had to stop it after 12 hours! Can anyone think of a better approach that's much, much quicker and computationally less expensive?
You could use match.
dataset$campaignId <- match(dataset$campaignId, unique_campaign)
See Is there an R function for finding the index of an element in a vector?
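As a quick illustration of what match returns, with made-up hashed IDs:

unique_campaign <- c("8f2a", "c91d", "77be")           # hypothetical hashed IDs
campaignId      <- c("c91d", "8f2a", "c91d", "77be")
match(campaignId, unique_campaign)                     # 2 1 2 3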
You might benefit from using the data.table package in this case:
library(data.table)
n = 10000000
unique_campaign = sample(1:10000, 169)
dataset = data.table(
  campaignId = sample(unique_campaign, n, TRUE),
  profit = round(runif(n, 100, 1000))
)
dataset[, campaignId := match(campaignId, unique_campaign)]
This example with 10 million rows will only take you a few seconds to run.
You could avoid the inner loop with a dictionary-like structure:
id_dict = list()
for (id in seq_along(unique_campaign)) {
  # assumes the hashed IDs are character strings, so they serve as list names
  id_dict[[ unique_campaign[[id]] ]] = id
}
for (i in 1:dataset_length) {
  dataset$campaignId[[i]] = id_dict[[ dataset$campaignId[[i]] ]]
}
As pointed out in this post, lists do not have O(1) access, so this will not divide the time required by 161, but by a smaller factor that depends on how the IDs are distributed in your list.
Also, the main reason your code is so slow is that you are using these inefficient element-by-element lookups (dataset$campaignId[[i]] alone can take a lot of time when i is big). Take a look at the hash package, which provides O(1) access to its elements (see also this thread on hashed structures in R).
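For a base R alternative with near O(1) lookups, an environment can serve as the hash map. A sketch, assuming the campaign IDs can be used as character keys:

id_env <- new.env(hash = TRUE)
for (j in seq_along(unique_campaign)) {
  assign(as.character(unique_campaign[j]), j, envir = id_env)
}
# Look up the new ID for each row, one environment access per row
dataset$campaignId <- vapply(as.character(dataset$campaignId),
                             function(k) get(k, envir = id_env),
                             integer(1))

This still makes one R-level lookup per row, so for 4.5 million rows the vectorised match approach above remains the simpler and faster option.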

Avoiding redundancy when selecting rows in a data frame

My code is littered with statements of the following taste:
selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
                                 long_data_frame_name$col2 == "condition2" &
                                 !is.na(long_data_frame_name$col3),
                                 selected_columns]
The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?
You can use with. For instance:
sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]
Several ways come to mind.
If you think about it, you are subsetting your data. Hence use the subset function (base package):
your_subset <- subset(long_data_frame_name,
                      col1 == "cond1" & col2 == "cond2" & !is.na(col3),
                      select = selected_columns)
This is, in my opinion, the most self-explanatory code for the task.
Use data.table.
library(data.table)
long_data_table_name = data.table(long_data_frame_name, key = "col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" &
                                 col2 == "condition2" &
                                 !is.na(col3),
                                 list(col4, col5, col6, col7)]
You don't have to set the key in the data.table(...) call, but if you have a large dataset, this will be much faster. Either way it will be much faster than using data frames. Finally, using J(...), as below, does require a keyed data.table, but is even faster.
selected <- long_data_table_name[J("condition1", "condition2", NA),
                                 list(col4, col5, col6, col7)]
You have several possibilities:
attach, which adds the variables of the data.frame to the search path just below the global environment. Very useful for code demonstrations, but I warn you not to do that programmatically.
with, which creates a whole new environment temporarily.
In very limited cases you want to use other options, such as within.
df = data.frame(random=runif(100))
df1 = with(df,log(random))
df2 = within(df,logRandom <- log(random))
within will examine the created environment after evaluation and add the modifications to the data. Check the help page of with for more examples; with will just evaluate your expression.
