Expand dataframe in R with rbind (union)

I need to scale up a set of files for a proof of concept in my company. Essentially, I have several 1000-row files with around 200 columns each, and I want to rbind them until I reach the desired scale, which might be 1 million rows or more.
The output will essentially be a repetition of the data (sounds a bit silly), and I'm aware of that, but I just need to prove something.
I used a while loop in R similar to this:
while(nrow(df) < 1000000) {df <- rbind(df,df);}
This seems to work, but it looks a bit computationally heavy and might take 10-15 minutes.
I thought of creating a function (below) and using an "apply" family function on the df, but couldn't succeed:
scaleup_function <- function(x)
{
  while(nrow(df) < 1000)
  {
    x <- rbind(df, df)
  }
}
Is there a quicker and more efficient way of doing this (it doesn't need to be with rbind)?
Many thanks,
Joao

This should do the trick:
df <- matrix(0, nrow = 1000, ncol = 200)
# how many copies of the 1000 rows are needed to reach 1,000,000
reps_needed <- ceiling(1000000 / nrow(df))
# repeat the row indices and index back into df to stack it reps_needed times
df_scaled <- df[rep(1:nrow(df), reps_needed), ]
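The same row-replication trick works directly on a data.frame, so there is no need for the matrix stand-in if one of your real 1000-row files is already loaded (a minimal sketch; df and the file name are placeholders for your actual data):
# df <- read.csv("one_of_your_files.csv")   # hypothetical input file
reps_needed <- ceiling(1000000 / nrow(df))
df_scaled <- df[rep(seq_len(nrow(df)), reps_needed), ]
rownames(df_scaled) <- NULL   # drop the duplicated row names R creates when replicating rows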

Related

How to save output of my loop in a dataframe R studio

I have a dataframe (check the picture). I am creating periods of 30 values and calculating how many of these values are over 0.1. At the end, I want to save all the 336 outputs in a dataframe (as rows). How could I do that? My code is failing!
i <- 0
secos=as.data.frame(NULL)
for (i in c(0:336)){
  hola=as.data.frame(pp[c(1+i:29 + i)])
  secos[[i]]=sum(hola > 0.1)
  secos=rbind(secos[[i]])
}
1. Iteratively building (growing) data.frames in R is a bad thing. For good reading, see the R Inferno, chapter 2, on growing objects. Bottom line: it works, but as you add more rows it gets progressively slower and uses (at least) twice as much memory as you intend.
2. You explicitly overwrite secos with rbind(secos[[i]]), where the rbind call is a complete no-op (e.g., see identical(rbind(mtcars), mtcars)). Back to (1), it is better to do L <- lapply(0:336, function(i) ...) and then secos <- do.call(rbind, L).
3. R indexes are 1-based, but your first iteration assigns to secos[[0]], which fails.
A literal translation of this into a better start is something like the following. (Up front, your reference to pp only makes sense if you have an object pp that you used to create your data.frame above ... since pp[.] by itself will not reference the frame. If you're using attach(.) to be able to do that, then don't: too many risks and things can go wrong with it; it is one of the base functions I'd vote to remove.)
invec <- 0:336
L <- sapply(invec, function(i) {
  hola=as.data.frame(pp[c(1+i:29 + i)])
  sum(hola > 0.1)
})
secos <- data.frame(i = invec, secos = L)
An alternative:
L <- lapply(invec, function(i) {
  hola=as.data.frame(pp[c(1+i:29 + i)])
  data.frame(secos = sum(hola > 0.1))
})
out <- do.call(rbind, L)
I can't help but think there is a more efficient, R-idiomatic way to aggregate this data. My guess is that it's a moving window of sorts, perhaps a month wide (or similar). If that's the case, I recommend looking into zoo::rollapply(pp, 30, function(z) sum(z > 0.1)), perhaps with meaningful application of align=, partial=, and/or fill=.
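As a rough illustration of that idea, and only a sketch (it assumes pp is a plain numeric vector of the values being checked and that a left-aligned 30-value window is what is wanted):
library(zoo)
# for each 30-value window, count how many values exceed 0.1
secos_roll <- rollapply(pp, width = 30, FUN = function(z) sum(z > 0.1), align = "left")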

How to make a more elegant and shorter for loop with multiple internal functions and results

These are the steps I am following:
1. subset two matrices by a range of proportions (e.g. 80-85, 85-90)
2. run two separate distance measure functions for each subset of data
3. run a mantel using the distance matrix produced by each subset of data
4. produce a list of each test result, each with a unique name
5. produce a data frame of all the mantel-r results and their corresponding p-values
I have written code that will complete this process, but I feel there is a more elegant and better way to do so. What I have works, but I would like to improve my R-skills, so any advice/ideas would be welcomed. I am not new to R, but I am far from being where I would like to be.
Also, my code produces unnecessary objects (i.e. SS, HB, sp.dis, epa.dis, and nam in the code below). They are not a big deal, but it would be nice to have code that doesn’t produce this side effect. A reproducible example (modeled after how my data is formatted) and the packages I’m using are below:
library(tidyverse)
library(betapart)
library(vegan)
set.seed(2)
spe2<-data.frame(replicate(10,sample(0:100,100,replace=T)))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-data.frame(replicate(3,sample(1:20,100,replace=T)))
epa2$Ag<-spe2$Ag
Mantel.List<-list()
List.names <- list()
for(i in seq(from=0.85, to=0.95, by=0.05)){
  SS <- spe2 %>%
    filter(Ag >= i & Ag < i+0.05)
  HB <- epa2 %>%
    filter(Ag >= i & Ag < i+0.05)
  sp.dis <- beta.pair(decostand(SS[,1:ncol(SS)-1],'pa'))
  epa.dis <- vegdist(HB[,1:ncol(HB)-1],
                     method = 'euclidean')
  mnt <- mantel(sp.dis$beta.sor, epa.dis)
  Mantel.List[[length(Mantel.List)+1]] <- mnt
  nam <- paste('M.tt', i*100, sep='')
  List.names[[length(List.names)+1]] <- nam
}
names(Mantel.List)<-List.names
Mantel.Results<-cbind(sapply(Mantel.List, function(x) x$statistic),sapply(Mantel.List, function(x) x$signif))
colnames(Mantel.Results)<-c('Mantel-r', 'p-value')
Mantel.Results
Thank you!
I've done two things to try to make this code a little better. First, I eliminated all the unnecessary objects by using the data.table package, which is usually the most efficient way to handle data.frames because it avoids making copies when subsetting.
Secondly, instead of using a for loop, I'm using an apply function. Note the assignment operator <<- inside doit(), which modifies the objects outside the function.
Here's my suggestion:
library(data.table)
library(betapart)   # beta.pair
library(vegan)      # decostand, vegdist, mantel
set.seed(2)
spe2<-as.data.table(data.frame(replicate(10,sample(0:100,100,replace=T))))
spe2$Ag<-round(runif(100, min=0.4, max=1),2)
epa2<-as.data.table(data.frame(replicate(3,sample(1:20,100,replace=T))))
epa2$Ag<-spe2$Ag
doitAll = function(dt1, dt2){
  Mantel.List <- list()
  List.names <- list()
  doit = function(x, dt1, dt2){
    mnt <- mantel(beta.pair(decostand(dt1[Ag >= x & Ag < x+0.05, 1:(ncol(dt1)-1), with=F], 'pa'))$beta.sor,
                  vegdist(dt2[Ag >= x & Ag < x+0.05, 1:(ncol(dt2)-1), with=F],
                          method = 'euclidean'))
    Mantel.List[[length(Mantel.List)+1]] <<- mnt
    nam <- paste('M.tt', x*100, sep='')
    List.names[[length(List.names)+1]] <<- nam
  }
  sapply(seq(from=0.85, to=0.95, by=0.05), doit, dt1=dt1, dt2=dt2)
  names(Mantel.List) <- List.names
  Mantel.Results <- cbind(sapply(Mantel.List, function(x) x$statistic),
                          sapply(Mantel.List, function(x) x$signif))
  colnames(Mantel.Results) <- c('Mantel-r', 'p-value')
  return(Mantel.Results)
}
doitAll(dt1=spe2,dt2=epa2)
It might be a little hard to read, but it's surely more efficient.

Better way to write this R data-cleaning function?

I am writing an R function that takes a dataframe column (preferably of type factor) and clumps together all the entries below a user-defined frequency as "other". This is done for data cleaning.
Here is what I have written:
zcut <- function(column, threshold){
  dft <- data.frame(table(column))
  dft_ind <- sapply(dft$Freq, function(x) x < threshold)
  dft_list <- dft[[1]][dft_ind]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
I think this is pretty straightforward, but are there ways to make my code more concise or exact?
I would have asked this on the Code Review Stack Exchange, although it's not clear to me that many R experts lurk there.
You don't need sapply here. Try:
dft_ind <- dft$Freq < threshold
This should speed up the function in the case of large data.frames.
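Folding that into the original function gives something like this (a minimal sketch; everything apart from the vectorised comparison is unchanged from the question):
zcut <- function(column, threshold){
  dft <- data.frame(table(column))
  dft_ind <- dft$Freq < threshold                 # vectorised comparison replaces sapply
  dft_list <- dft[[1]][dft_ind]
  levels(column)[levels(column) %in% dft_list] <- "Other"
  return(column)
}
# e.g. zcut(factor(c("a", "a", "b", "c")), 2) collapses the singleton levels "b" and "c" into "Other"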

alternative to subsetting in R

I have a df, YearHT, 6.5M rows x 55 columns. There is specific information I want to extract and add, but only based on aggregate values. I am using a for loop to subset the large df and then performing the computations.
I have heard that for loops should be avoided, and I wonder if there is a way to avoid the for loop that I have used, as this query takes ~3 hrs to run.
Here is my code:
srt=NULL
for(i in doubletCounts$Var1){
  s=subset(YearHT,YearHT$berthlet==i)
  e=unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
  srt=rbind(srt,e)
}
srt=data.frame(srt)
s2=data.frame(srt$X2,srt$X1,srt$X3)
colnames(s2)=colnames(srt)
s=rbind(srt,s2)
doubletCounts is a 700 x 3 df, and each of its values is found within the large df.
I would be glad to hear any ideas to optimize/speed up this process.
Here is a fast solution using data.table, although it is not completely clear from your question what output you want to get.
# load library
library(data.table)
# convert your dataset into data.table
setDT(YearHT)
# subset YearHT keeping values that are present in doubletCounts$Var1
YearHT_df <- YearHT[ berthlet %in% doubletCounts$Var1]
# aggregate values
output <- YearHT_df[ , .( median= median(berthtime)) ]
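If the goal is one median per berthlet value, as the loop in the question computes, the same aggregation can be grouped (a hedged extension of the code above, reusing its column names):
# one median per berthlet value, matching the loop's per-value medians
output_by_berth <- YearHT_df[ , .(median_berthtime = median(berthtime)), by = berthlet]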
for loops aren't necessarily something to avoid, but there are certain ways of using for loops that should be avoided. You've committed the classic for loop blunder here.
srt = NULL
for (i in index)
{
  [stuff]
  srt = rbind(srt, [stuff])
}
is bound to be slower than you would like because each time you hit srt = rbind(...), you're asking R to do all sorts of things to figure out what kind of object srt needs to be and how much memory to allocate to it. When you know what the length of your output needs to be up front, it's better to do
srt <- vector("list", length = doubletCounts$Var1)
for(i in doubletCounts$Var1){
s=subset(YearHT,YearHT$berthlet==i)
srt[[i]] = unlist(c(strsplit(i,'\\|'),median(s$berthtime)))
}
srt=data.frame(srt)
Or the apply alternative of
srt = lapply(doubletCounts$Var1,
             function(i) {
               s = subset(YearHT, YearHT$berthlet == i)
               unlist(c(strsplit(i, '\\|'), median(s$berthtime)))
             })
Both of those should run at about the same speed.
(Note: both are untested, for lack of data, so they might be a little buggy.)
Something else you can try, which might have a smaller effect, would be dropping the subset call and using indexing instead. The content of your for loop could be boiled down to
unlist(c(strsplit(i, '\\|'),
median(YearHT[YearHT$berthlet == i, "berthtime"])))
But I'm not sure how much time that would save.
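Going one step further in the same direction, the per-group medians can be computed in a single pass with tapply and then looked up by name, avoiding a scan of YearHT for every value of Var1 (a hedged sketch, untested for the same reason as above, reusing the question's column names):
# one scan of YearHT instead of one subset() call per value of Var1
med_by_berth <- tapply(YearHT$berthtime, YearHT$berthlet, median)
srt <- lapply(doubletCounts$Var1,
              function(i) unlist(c(strsplit(i, '\\|'), med_by_berth[[i]])))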

Double "for loops" in a dataframe in R

I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA. I want to replace observations that are more than 3 standard deviations from the mean with NA. I got it working, column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
  a=list[i]
  b=which(d1==a)
  d2=data[,b]
  for (j in 1:nrows){
    d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops to apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  x_outside_3s <- abs(x - mean_x) >= 3 * sd_x   # TRUE for values more than 3 SDs from the mean
  x[which(x_outside_3s)] <- NA                  # no need for ifelse here; which() skips NAs safely
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
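For instance, applied to the numeric columns of the example data from the question (a quick illustration only; the column names come from that example):
num_cols <- c("mark", "age", "height")   # the numeric columns in the example data
data[num_cols] <- lapply(data[num_cols], NA_outside_3s)
data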
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
