Better way to write this R data-cleaning function?

I am writing an R function that takes a dataframe column (preferably of type factor) and clumps together all the entries below a user-defined frequency as "Other." This is done for data cleaning.
Here is what I have written:
zcut <- function(column, threshold){
  # frequency table of the factor levels
  dft <- data.frame(table(column))
  # flag levels that occur fewer than `threshold` times
  dft_ind <- sapply(dft$Freq, function(x) x < threshold)
  dft_list <- dft[[1]][dft_ind]
  # collapse the rare levels into "Other"
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
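For example, a quick illustration with made-up data:
f <- factor(c("a", "a", "a", "b", "b", "c"))
zcut(f, threshold = 2)
# "c" occurs once (below the threshold of 2), so it becomes "Other"; "a" and "b" are kept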
I think this is pretty straightforward, but are there ways to make my code more concise or precise?
I would have asked this on the Code Review Stack Exchange, but it's not clear to me that many R experts lurk there.

You don't need sapply here; comparison operators are already vectorized in R. Try:
dft_ind <- dft$Freq < threshold
This should speed up the function for large data.frames.
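Putting that together, the whole function can be condensed to (a sketch, intended to behave the same as the original):
zcut <- function(column, threshold){
  # tabulate level frequencies and collect the rare levels directly
  dft <- data.frame(table(column))
  rare <- dft[[1]][dft$Freq < threshold]
  levels(column)[levels(column) %in% rare] <- "Other"
  column
}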

Related

How to make a more elegant and shorter for loop with multiple internal functions and results

These are the steps I am following:
1. Subset two matrices by a range of proportions (e.g. 80-85, 85-90).
2. Run two separate distance measure functions on each subset of the data.
3. Run a mantel test using the distance matrices produced from each subset.
4. Produce a list of the test results, each with a unique name.
5. Produce a data frame of all the mantel-r results and their corresponding p-values.
I have written code that completes this process, but I feel there is a more elegant and better way to do so. What I have works, but I would like to improve my R skills, so any advice or ideas would be welcome. I am not new to R, but I am far from where I would like to be.
Also, my code produces unnecessary objects (namely SS, HB, sp.dis, epa.dis, and nam in the code below). They are not a big deal, but it would be nice to have code that doesn't produce this side effect. A reproducible example (modeled after how my data is formatted) and the packages I'm using are below:
library(tidyverse)
library(betapart)
library(vegan)
set.seed(2)
spe2 <- data.frame(replicate(10, sample(0:100, 100, replace = T)))
spe2$Ag <- round(runif(100, min = 0.4, max = 1), 2)
epa2 <- data.frame(replicate(3, sample(1:20, 100, replace = T)))
epa2$Ag <- spe2$Ag
Mantel.List <- list()
List.names <- list()
for (i in seq(from = 0.85, to = 0.95, by = 0.05)){
  # subset both tables to the current proportion window
  SS <- spe2 %>%
    filter(Ag >= i & Ag < i + 0.05)
  HB <- epa2 %>%
    filter(Ag >= i & Ag < i + 0.05)
  # distance matrices; 1:(ncol(.) - 1) drops the Ag column
  sp.dis <- beta.pair(decostand(SS[, 1:(ncol(SS) - 1)], 'pa'))
  epa.dis <- vegdist(HB[, 1:(ncol(HB) - 1)],
                     method = 'euclidean')
  mnt <- mantel(sp.dis$beta.sor, epa.dis)
  Mantel.List[[length(Mantel.List) + 1]] <- mnt
  nam <- paste('M.tt', i * 100, sep = '')
  List.names[[length(List.names) + 1]] <- nam
}
names(Mantel.List) <- List.names
Mantel.Results <- cbind(sapply(Mantel.List, function(x) x$statistic),
                        sapply(Mantel.List, function(x) x$signif))
colnames(Mantel.Results) <- c('Mantel-r', 'p-value')
Mantel.Results
Thank you!
I've done two things to make this code a little better. First, I eliminated all the unnecessary objects by using the data.table package, which is usually the most efficient way to handle data.frames because it doesn't make copies of itself when subsetting.
Second, instead of using a for loop, I'm using an apply function. Note the superassignment operator <<- inside doit(), which replaces the objects defined outside that function.
Here's my suggestion:
library(data.table)
library(betapart)  # beta.pair
library(vegan)     # decostand, vegdist, mantel
set.seed(2)
spe2 <- as.data.table(data.frame(replicate(10, sample(0:100, 100, replace = T))))
spe2$Ag <- round(runif(100, min = 0.4, max = 1), 2)
epa2 <- as.data.table(data.frame(replicate(3, sample(1:20, 100, replace = T))))
epa2$Ag <- spe2$Ag
doitAll = function(dt1, dt2){
  Mantel.List <- list()
  List.names <- list()
  doit = function(x, dt1, dt2){
    # subset, compute both distance matrices, and run the mantel test in one go
    mnt <- mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x + 0.05, 1:(ncol(dt1) - 1), with = F], 'pa'))$beta.sor,
                  vegdist(dt2[Ag >= x & Ag < x + 0.05, 1:(ncol(dt2) - 1), with = F],
                          method = 'euclidean'))
    Mantel.List[[length(Mantel.List) + 1]] <<- mnt
    nam <- paste('M.tt', x * 100, sep = '')
    List.names[[length(List.names) + 1]] <<- nam
  }
  sapply(seq(from = 0.85, to = 0.95, by = 0.05), doit, dt1 = dt1, dt2 = dt2)
  names(Mantel.List) <- List.names
  Mantel.Results <- cbind(sapply(Mantel.List, function(x) x$statistic),
                          sapply(Mantel.List, function(x) x$signif))
  colnames(Mantel.Results) <- c('Mantel-r', 'p-value')
  return(Mantel.Results)
}
doitAll(dt1 = spe2, dt2 = epa2)
It might be a little hard to read, but it's surely more efficient.
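If you'd rather avoid <<- entirely, the same result can be built without side effects by having the helper return its value and collecting everything with lapply; a sketch under the same setup as above:
doitAll2 <- function(dt1, dt2){
  steps <- seq(from = 0.85, to = 0.95, by = 0.05)
  # each call returns its mantel object directly; nothing is assigned outside
  Mantel.List <- lapply(steps, function(x)
    mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x + 0.05, 1:(ncol(dt1) - 1), with = F], 'pa'))$beta.sor,
           vegdist(dt2[Ag >= x & Ag < x + 0.05, 1:(ncol(dt2) - 1), with = F], method = 'euclidean')))
  names(Mantel.List) <- paste('M.tt', steps * 100, sep = '')
  Mantel.Results <- cbind(sapply(Mantel.List, function(x) x$statistic),
                          sapply(Mantel.List, function(x) x$signif))
  colnames(Mantel.Results) <- c('Mantel-r', 'p-value')
  Mantel.Results
}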

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I am trying to do is the following:
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It's not so important if what I'm trying to do is still obscure; the point is that I want to check every row in my dataframe.
My instinctive (read: R-noob-ish) way of programming this led me to a piece of code that I know is far from optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        # rows of the same Trial whose Start falls in (timing, timing + some_time]
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my dataframe without using rbind, but I also know that, given the size of my output, it is not time-efficient at all.
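For reference, the usual way around this is to collect the pieces in a list and call rbind once at the end, instead of growing the data.frame inside the loop; a minimal sketch of the innermost loop rewritten this way:
pieces <- list()
for (timing in trial_tested$Start){
  interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
  pieces[[length(pieces) + 1]] <- as.data.frame(table(interesting$ele))
}
ref_count <- do.call(rbind, pieces)  # one rbind instead of one per iteration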
Also, I have programmed in other languages such as Python and C++ (a long time ago), and I know that three nested for loops usually mean you're doing it wrong. But then again, I did not find a way to avoid them in this particular case.
So, do you have any advice on how to use R, or one of its packages, to avoid such a situation?
Thank you in advance,
K.
EDIT:
Thank you for your first pieces of advice.
I tried the plyr package and was able to reduce the size of my code chunk; it does what is expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
library(plyr)
DF <- data.frame(c(sample(1:400, 500000, replace = TRUE)),
                 c(sample(1:25, 500000, replace = TRUE)),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ] #For testing purposes
Test <- ddply(.data = tempDF, "rn", function(x){
  # rows of the same Trial whose Start falls in the window after x$Start
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example, we only browse 100 observations, which takes ~4 s (measured with system.time()), but if I want to scan the 500,000 observations of DF, it would take more than 5 hours.
I have looked at data.table, but I am still trying to understand how to use it.
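For what it's worth, here is a sketch of how data.table's non-equi joins could replace the row-by-row subsetting entirely; some_time is a placeholder value here, and the output mirrors the plyr version above (matching rows tagged with the originating Ref):
library(data.table)
some_time <- 0.5                       # placeholder; substitute the real window
dt <- as.data.table(DF)
win <- dt[, .(Trial, Ref, lo = Start, hi = Start + some_time)]
# for each (Trial, Start) window, pull every row whose Start falls inside it
hits <- dt[win, on = .(Trial, Start > lo, Start <= hi),
           .(Elec = i.Ref, Trial = i.Trial, Start = x.Start),
           allow.cartesian = TRUE, nomatch = 0L]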

Expand dataframe in R with rbind (union)

I need to scale up a set of files for a proof of concept in my company. I essentially have several 1000-row files with around 200 columns each, and I want to rbind them until I reach the desired scale, which might be 1 million rows or more.
The output will essentially be a repetition of the data (sounds a bit silly), and I'm aware of that, but I just need to prove something.
I used a while loop in R similar to this:
while (nrow(df) < 1000000) { df <- rbind(df, df) }
This seems to work, but it looks a bit computationally heavy and can take 10-15 minutes.
I thought of creating a function (below) and using an "apply" family function on the df, but couldn't succeed:
scaleup_function <- function(x)
{
  while (nrow(df) < 1000)   # refers to the global df rather than the argument x
  {
    x <- rbind(df, df)      # and x is never returned, so the result is lost
  }
}
Is there a quicker and more efficient way of doing it (it doesn't need to be with rbind) ?
Many thanks,
Joao
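For reference, a corrected sketch of that attempted function would operate on its argument and return the result (target_rows is a parameter added here purely for illustration):
scaleup_function <- function(x, target_rows = 1000000)
{
  while (nrow(x) < target_rows)
  {
    x <- rbind(x, x)   # double the data until the target is reached
  }
  x
}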
This should do the trick:
df <- matrix(0, nrow = 1000, ncol = 200)
reps_needed <- ceiling(1000000 / nrow(df))   # copies of df needed to reach 1M rows
df_scaled <- df[rep(1:nrow(df), reps_needed), ]
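The same indexing trick works directly on a data.frame, and rep()'s length.out argument trims the result to exactly the target row count; a small variation on the above:
df <- as.data.frame(matrix(0, nrow = 1000, ncol = 200))
df_scaled <- df[rep(seq_len(nrow(df)), length.out = 1000000), ]
nrow(df_scaled)   # 1000000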

speeding up applying a function to unique values in R

I was hoping somebody could help: I'm trying to speed up an apply function, and I've tried a few tricks, but it is still quite slow, so I was wondering if anybody had any more suggestions.
I have data as follows:
myData = data.frame(ident = c(3,3,4,4,4,4,4,4,4,4,4,7,7,7,7,7,7,7),
                    group = c(7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8),
                    significant = c(1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0),
                    year = c(2003,2002,2001,2008,2010,2007,2007,2008,2006,2012,2008,
                             2012,2006,2001,2014,2012,2004,2007),
                    month = c(1,1,9,12,3,2,4,3,9,5,12,8,11,3,1,6,3,1),
                    subReport = c(0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),
                    prevReport = c(1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1))
and I want to end up with a dataframe like this:
results = data.frame(ident = c(3,4,7),
                     significant = c(1,0,1),
                     prevReports = c(2,6,7),
                     subReport = c(0,1,0),
                     group = c(7,7,8))
To do this I wrote the code below, and to make it fast I've tried converting to data.tables and using rbindlist instead of rbind, which I've found suggested in a few threads. I've also tried parLapply. I still find the process quite slow, however (I'm trying to do this on about 250,000 data points).
dt <- data.table(myData)
results <- NULL
ApplyModel <- function (id, data) {
  dtTemp <- dt[dt$ident == id, ]
  if (nrow(dtTemp) >= 1){
    prevReport = if (sum(dtTemp$prevReport) >= 1) sum(dtTemp$prevReport) else 0
    subsequentReport = if (sum(dtTemp$subReport) >= 1) 1 else 0
    significant = as.numeric(head(dtTemp$sig, 1))
    group = head(dtTemp$group, 1)
    id = as.numeric(head(dtTemp$id, 1))
    output <- cbind(id, significant, prevReport, subsequentReport, group)
    output <- output[!duplicated(output[, 1]), ]
    print(output)
    results <- rbindlist(list(as.list(output)))
  }
}
results <- lapply(unique(dt$ident), ApplyModel)
results <- as.data.frame(do.call(rbind, results))
Any suggestions on how this might be sped up would be most welcome! I think it may have to do with the subsetting: I want to apply the function to a subset based on a unique value, but lapply is really for applying a function to each value, so the subsetting is defeating the purpose somewhat...
Here, your code produces an error:
results<-lapply(unique(dt$ident), ApplyModel)
Error in dt$ident : object of type 'closure' is not subsettable
It appears to me that you are looking for tapply instead of lapply. Using tapply you could express the above much more concisely:
results2 <- data.frame(significant = tapply(myData$significant, myData$ident, function(x) return(x[1])),
                       prevreports = tapply(myData$prevReport, myData$ident, sum),
                       subReports = tapply(myData$subReport, myData$ident, function(x) as.numeric(any(x == 1))),
                       group = tapply(myData$group, myData$ident, function(x) return(x[1])))
It should do about the same job but be much more readable. It should also be fast except for huge datasets; in most cases it is faster to wait for R to finish the job than to spend more time programming. One way to make this even faster would be to use the power of the data.table package, but just loading it doesn't do the trick: you'll need to learn its very special syntax. Check first whether the code written this way really is too slow.
If it is really too slow, check this:
library(data.table)
first <- function(x) x[1]
myAny <- function(x) as.numeric(any(x == 1))
myData <- data.table(myData)
myData[, .(significant = first(significant),
           prevReports = sum(prevReport),
           subReports = myAny(subReport),
           group = first(group)), ident]
You could use dplyr:
require(dplyr)
new <- myData %>%
  group_by(ident) %>%
  summarise(significant = first(significant),
            prevReports = sum(prevReport),
            subReports = n_distinct(subReport) - 1,
            group = first(group)) %>%
  data.frame()

How to extract a parameter from a list of functions in a loop

I have a large dataset, and I want to perform several tests at once and extract a parameter from each.
The test dataset:
testdf <- data.frame(vy = rnorm(60), vx = rnorm(60), gvar = rep(c("a","b"), each = 30))
I first defined a list of functions:
require(fBasics)
normfuns <- list(jarqueberaTest=jarqueberaTest, shapiroTest=shapiroTest, lillieTest=lillieTest)
Then I wrote a function to perform the tests by the grouping variable:
mynormtest <- function(d) {
  norm_test <- res_reg <- list()
  for (i in c("a","b")){
    # residuals of the within-group regression
    res_reg[[i]] <- residuals(lm(vy ~ vx, data = d[d$gvar == i, ]))
    # apply every test in normfuns to those residuals
    norm_test[[i]] <- lapply(normfuns, function(f) f(res_reg[[i]]))
  }
  return(norm_test)
}
mynormtest(testdf)
I obtain a list of test summaries for each grouping variable. However, I am only interested in the parameter "STATISTIC", and I have not managed to find out how to extract it.
You can obtain the value stored as "STATISTIC" in the output of the various tests with
res_list <- mynormtest(testdf)
res_list$a$shapiroTest@test$statistic
res_list$a$jarqueberaTest@test$statistic
res_list$a$lillieTest@test$statistic
And correspondingly for set b:
res_list$b$shapiroTest@test$statistic
res_list$b$jarqueberaTest@test$statistic
res_list$b$lillieTest@test$statistic
Hope this helps.
Concerning your function fgetparam, I think it is a nice starting point. Here's my suggestion with a few minor modifications:
getparams2 <- function(myp) {
  m <- matrix(NA, nrow = length(myp), ncol = 3)
  for (i in 1:length(myp)){
    # extract the test statistic from each of the three test objects
    m[i, ] <- sapply(1:3, function(x) myp[[i]][[x]]@test$statistic)
  }
  return(m)
}
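For instance, applied to the output of mynormtest() above:
res_list <- mynormtest(testdf)
getparams2(res_list)   # a 2 x 3 matrix: one row per group, one column per test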
This function is a minor generalization in the sense that it allows for an arbitrary number of groups, whereas in your case this was fixed to two, a and b. The code can certainly be shortened further, but it might then become somewhat more cryptic. I believe that when developing code it helps to strike a compromise between efficiency and compactness on the one hand and readability on the other.
Edit
As pointed out by @akrun and @Roland, the function getparams2() can be written in a much more elegant and shorter form. One possibility is
getparams2 <- function(myp) {
  matrix(unname(rapply(myp, function(x) x@test$statistic)), ncol = 3)}
Another great alternative is
getparams2 <- function(myp){ t(sapply(myp, sapply, function(x) x@test$statistic)) }
