Subtract each col in a df from every other col - r

I would like to try out a normalisation method a friend recommended, in which all the other cols of a df are subtracted from each col in turn: first from the first col, then from the second, and so on for every col of that df.
eg:
df <- data.frame(replicate(9,1:4))
x_df_1 <- df[,1] - df[2:ncol(df)]
x_df_2 <- df[,2] - df[c(1, 3:ncol(df))]
x_df_3 <- df[,3] - df[c(1:2, 4:ncol(df))]
...
x_df_n <- df[,ncol(df)] - df[1:(ncol(df)-1)]   # where n = ncol(df)
As the df has 90 cols, doing this by hand would be terrible (and very bad coding). I am sure there must be an elegant way to solve this and to end up with a list containing all the resulting dfs, but I am totally stuck on how to get there. I would appreciate a dplyr solution (for familiarity), but any working solution would be fine.
Thanks a lot for your help!
Sebastian

I may have found a solution that I am sharing here.
Please correct me if I'm wrong.
This is a task of forming all pairs of distinct columns (combinations without replacement).
The original df has 90 cols.
Let's check how many pairs are possible first
(from: https://davetang.org/muse/2013/09/09/combinations-and-permutations-in-r/):
choose(90, 2) # 4005 pairs of distinct columns
Now using a modified answer from here: https://stackoverflow.com/a/16921442/10342689
(df has 90 cols; I don't know how to create a proper example df for it here.)
cc_90 <- combn(colnames(df), 2)
result <- apply(cc_90, 2, function(x) df[[x[1]]] - df[[x[2]]])
ncol(result) # 4005, one column of differences per pair of columns
That should work.
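To keep track of which pair each column of result corresponds to, the columns can also be labelled (a small optional addition to the code above):
# name each difference column after the pair it was computed from, e.g. "X1-X2"
colnames(result) <- apply(cc_90, 2, paste, collapse = "-")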

In R one can index using negative indices to represent "all except this index".
So we can rewrite the first of your normalisation steps:
x_df_1 <- df[,1] - df[2:ncol(df)]
# rewrite as:
x_df_1 <- df[,1] - df[,-1]
From this, it's a pretty easy next step to write a loop to generate the 90 new dataframes that you generated 'by hand':
list_of_dfs <- lapply(seq_len(ncol(df)), function(x) df[, x] - df[, -x])
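If it helps to keep the 90 results organised, the list can also be named after the columns (a small optional addition):
# label each resulting data frame by the column the others were subtracted from
names(list_of_dfs) <- colnames(df)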
This seems to be somewhat different to what you're proposing in your own answer to your question, though...

Related

Is there a way to make a for-loop faster?

I am working on some code but it has a step that is just super slow. Basically I just need to compare 2 columns and, if their values are the same (at their respective row), mark a 1 in a third column, like the code below:
#FLAG_REPETIDOS
df1$FLAG_REPETIDOS <- ""
j <- 1
for (j in 1:nrow(df1)) {
  df1$FLAG_REPETIDOS[[j]] <- ifelse(df1$DATO[[j]] == df1$DATO_ANT[[j]], 1, df1$FLAG_REPETIDOS[[j]])
  df1$FLAG_REPETIDOS[[j]] <- ifelse(is.na(df1$FLAG_REPETIDOS[[j]]) == TRUE, "", df1$FLAG_REPETIDOS[[j]])
  x <- j / 100
  if ((x == round(x)) == TRUE) {
    print(paste(j, "/", nrow(df1)))
  }
}
print(paste("Check 11:", Sys.time(), sep = " "))
Some more information: I am using data.table, not a data frame. My computer is not the best one, only 8 GB of RAM, and the data I am using has roughly 1M rows. According to my estimate it would take around 72h to finish just this step of the code, which is unreasonable.
Is my code doing something that could be done more easily and faster? Is there any way to optimize it? I am new to R so I don't know a lot about optimization.
Thanks in advance.
I already changed from data frame to data.table; I researched optimization on Google and converting was one of the things suggested to try.
The way to make R code go fast is to vectorize your code.
Assuming df is a dataframe, you could probably replace all your included code with something like:
library(dplyr)
df %>%
  mutate(
    FLAG_REPETIDOS = case_when(
      is.na(DATO) | is.na(DATO_ANT) ~ "",
      DATO == DATO_ANT ~ "1",
      TRUE ~ ""
    )
  )
However, I'm not able to check since you did not include any data with your question.
Your loop is equivalent to this much simpler and faster code.
df1$FLAG_REPETIDOS <- ""
df1$FLAG_REPETIDOS[which(df1$DATO == df1$DATO_ANT)] <- "1"
Note that which() avoids the danger of NA values ending up in the index used on the second line.
Hard to know without sample data, but this should work using data.table
library(data.table)
dt <- data.table(x=c(1,3,5,7,9), y=c(1,2,5,6,7)) # example
dt[, z:='']
dt[x==y, z:='1']
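Given that the question mentions ~1M rows and that data.table is already being used, the whole loop can also be replaced by one vectorised assignment. A sketch using data.table's fifelse, assuming df1 has already been converted to a data.table and has the DATO and DATO_ANT columns from the question:
library(data.table)
# flag matching rows with "1", leave everything else (including NAs) as ""
df1[, FLAG_REPETIDOS := fifelse(!is.na(DATO) & !is.na(DATO_ANT) & DATO == DATO_ANT, "1", "")]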

what is the most efficient way to find the most common value in a vector?

I'm trying to create a function to solve this puzzle:
An Arithmetic Progression is defined as one in which there is a constant difference between the consecutive terms of a given series of numbers. You are provided with consecutive elements of an Arithmetic Progression. There is however one hitch: exactly one term from the original series is missing from the set of numbers which have been given to you. The rest of the given series is the same as the original AP. Find the missing term.
You have to write the function findMissing(list); list will always contain at least 3 numbers. The missing term will never be the first or last one.
The next section of code shows my attempt at this function. The site I'm on runs tests against the function, all of which pass, as in they output the correct missing integer.
The problem I'm facing is that it's giving me a timeout error, because it takes too long to run all the tests. There are 102 tests and it's saying they take over 12 seconds to complete. Taking more than 12 seconds means the function isn't efficient enough.
After running my own timing tests in RStudio it seems the function takes considerably less than 12 seconds to run, but regardless I need to make it more efficient to be able to complete the puzzle.
I asked on the site forum and someone said "Sorting is expensive, think of another way of doing it without it." I took this to mean I shouldn't be using the sort() function. Is this what they mean?
I've since found a few different ways of getting my_diff, which is calculated using the sort() function. All of these ways are even less efficient than the original.
Can anyone give me a more efficient way of doing the sort to find my_diff, or maybe make other parts of the code more efficient? It's the sort() part which is apparently the inefficient part of the code, though.
find_missing <- function(sequence){
  len <- length(sequence)
  if (len > 3) {
    my_diff <- as.integer(names(sort(table(diff(sequence)), decreasing = TRUE))[1])
    complete_seq <- seq(sequence[1], sequence[len], my_diff)
  } else {
    differences <- diff(sequence)
    complete_seq_1 <- seq(sequence[1], sequence[len], differences[1])
    complete_seq_2 <- seq(sequence[1], sequence[len], differences[2])
    if (length(complete_seq_1) == 4) {
      complete_seq <- complete_seq_1
    } else {
      complete_seq <- complete_seq_2
    }
  }
  complete_seq[!complete_seq %in% sequence]
}
Here are a couple of sample sequences to check the code works:
find_missing(c(1,3,5,9,11))
find_missing(c(1,5,7))
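To put a rough number on the timing tests mentioned above, one quick sketch (the repetition count is arbitrary):
# time many repeated calls on a small input
system.time(replicate(1e5, find_missing(c(1, 3, 5, 9, 11))))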
Here are some of the other things I tried instead of sort:
1:
library(pracma)
Mode(diff(sequence))
2:
library(dplyr)
(data.frame(diff_1 = diff(sequence)) %>%
  group_by(diff_1) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  filter(count == max(count)))[1]
3:
MaxTable <- function(sequence, mult = FALSE) {
  differences <- diff(sequence)
  if (!is.factor(differences)) differences <- factor(differences)
  A <- tabulate(differences)
  if (isTRUE(mult)) {
    as.integer(levels(differences)[A == max(A)])
  } else {
    as.integer(levels(differences)[which.max(A)])
  }
}
Here is one way to do this using seq. We can create a sequence from the minimum to the maximum value of the input with length length(x) + 1, since exactly one term is missing from the sequence.
find_missing <- function(x) {
  setdiff(seq(min(x), max(x), length.out = length(x) + 1), x)
}
find_missing(c(1,3,5,9,11))
#[1] 7
find_missing(c(1,5,7))
#[1] 3
This approach takes the diff() of the vector - there will always be one difference higher than the others.
find_missing <- function(x) {
  diffs <- diff(x)
  x[which.max(diffs)] + min(diffs)
}
find_missing(c(1,3,5,9,11))
[1] 7
find_missing(c(1,5,7))
[1] 3
There is actually a simple formula for this, which will work even if your vector is not sorted...
find_missing <- function(x) {
  (length(x) + 1) * (min(x) + max(x)) / 2 - sum(x)
}
find_missing(c(1,5,7))
[1] 3
find_missing(c(1,3,5,9,11,13,15))
[1] 7
find_missing(c(2,8,6))
[1] 4
It is based on the fact that the sum of the full series is its average value, (min(x) + max(x)) / 2, times its length, which is length(x) + 1 once the missing term is put back.
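For example, for the second call above the two pieces of the formula can be checked by hand:
x <- c(1, 3, 5, 9, 11, 13, 15)
(length(x) + 1) * (min(x) + max(x)) / 2  # 64, the sum the complete 8-term series would have
sum(x)                                   # 57, so the missing term is 64 - 57 = 7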

R Matching closest number from columns

I have a list of responses to 7 questions from a survey, each in its own column, and am trying to find the response among the first 6 that is numerically closest to the 7th. Some won't be exactly the same, so I want to create a new variable that contains the difference between the closest number in the first 6 and the 7th. The example below would produce 0.
s <- c(1,2,3,4,5,6,3)
s <- t(s)
s <- as.data.frame(s)
s
Any help is deeply appreciated. I apologize for not including attempted code, as nothing I have tried has actually gotten close.
How about this?
which.min( abs(s[1, 1:6] - s[1, 7]))
I'm assuming you want it generalized somehow, but you'd need to provide more info for that. Or just run it through a loop :-)
EDIT: added the loop from the comment and changed exactly 2 tiny things.
s <- c(1,2,3,4,5,6,3)
t <- c(1,2,3,4,5,6,7)
p <- c(1,2,3,4,5,6,2)
s <- data.frame(s,t,p)
k <- t(s)
k <- as.data.frame(k)
k$t <- NA ### need to initialize the column
for (i in 1:3) {
  ## need to refer to each line of k when populating the t column
  k[i,]$t <- which.min(abs(k[i, 1:6] - k[i, 7]))
}
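If the goal is the numeric difference itself rather than the position of the closest response (the example in the question should give 0), a small variation on the same idea, assuming the one-row data frame s built in the original question:
vals <- unlist(s[1, 1:6])                        # responses to the first six questions
closest <- vals[which.min(abs(vals - s[1, 7]))]  # response closest to the 7th
closest - s[1, 7]                                # 0 for the example data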

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M of entries) using R - a language I am not yet completely familiar with.
Basically, what I try to do is the following :
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during this Trial.
It's not so important if what I'm trying to do is still obscure; the thing is that I want to check every data point in my dataframe.
My instinctive (understand: R-noob-ish) way of programming this led me to a piece of code which I know is far from being optimized and takes a LOT of time to run.
My_Function <- function(DataFrame){
  counts <- data.frame()
  for (reference in DataFrame$Ref){
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial){
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start){
        interesting <- subset(DataFrame, Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have different lengths, and thus so can ref_count.
I failed to find a way to grow my dataframe without using rbind, but I also know that given the size of my output it is not time-efficient at all.
Also, I have already programmed in other languages such as Python or C++ (a long time ago) and also know that having 3 consecutive for loops usually means that you're doing it wrong. But then again, I did not find a way to avoid doing that in this particular case.
So, do you have any advice on how to use R, or one of its package, to avoid such a situation?
Thank you in advance,
K.
EDIT :
Thank you for your first advices.
I tried the 'plyr' package and was able to reduce the size of my code chunk - it does what is expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
#Example Input
DF <- data.frame(sample(1:400, 500000, replace = TRUE),
                 sample(1:25, 500000, replace = TRUE),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ]  # For testing purposes
library(plyr)
Test <- ddply(.data = tempDF, "rn", function(x){
  interesting <- subset(DF,
                        Trial == x$Trial &
                        Start > x$Start &
                        Start < x$Start + some_time)
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example we only browse 100 observations, which takes ~4 sec (I used system.time()), but if I want to scan the 500000 observations of DF, it'd take more than 5 hours.
I have checked data.table but I am still trying to understand how to use it for now.
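Since data.table was mentioned as the next thing to try, here is a rough sketch of the same lookup written as a non-equi join (using the example DF from above; some_time is given an arbitrary value here just so the code runs):
library(data.table)
some_time <- 0.5                                   # arbitrary window length, for illustration only
DT <- as.data.table(DF)
# one lookup row per observation: its Trial, its window (Start, Start + some_time) and its Ref
lookup <- DT[, .(Trial, lower = Start, upper = Start + some_time, Elec = Ref)]
# for each lookup row, pull the rows of the same Trial whose Start falls inside the window
Test_dt <- DT[lookup,
              .(Trial, Ref, Start = x.Start, Elec),
              on = .(Trial, Start > lower, Start < upper),
              nomatch = NULL,
              allow.cartesian = TRUE]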

How to extract a parameter from a list of functions in a loop

I have a large data set and I want to perform several functions at once and extract a parameter from each.
The test dataset:
testdf <- data.frame(vy = rnorm(60), vx = rnorm(60) , gvar = rep(c("a","b"), each=30))
I first defined a list of functions:
require(fBasics)
normfuns <- list(jarqueberaTest=jarqueberaTest, shapiroTest=shapiroTest, lillieTest=lillieTest)
Then a function to perform the tests by the grouping variable
mynormtest <- function(d) {
  norm_test <- res_reg <- list()
  for (i in c("a","b")){
    res_reg[[i]] <- residuals(lm(vy ~ vx, data = d[d$gvar == i,]))
    norm_test[[i]] <- lapply(normfuns, function(f) f(res_reg[[i]]))
  }
  return(norm_test)
}
mynormtest(testdf)
I obtain a list of test summaries for each grouping variable.
However, I am interested in getting only the parameter "STATISTIC" and I did not manage to find out how to extract it.
You can obtain the value stored as "STATISTIC" in the output of the various tests with
res_list <- mynormtest(testdf)
res_list$a$shapiroTest@test$statistic
res_list$a$jarqueberaTest@test$statistic
res_list$a$lillieTest@test$statistic
And correspondingly for set b:
res_list$b$shapiroTest@test$statistic
res_list$b$jarqueberaTest@test$statistic
res_list$b$lillieTest@test$statistic
Hope this helps.
Concerning your function fgetparam I think that it is a nice starting point. Here's my suggestion with a few minor modifications:
getparams2 <- function(myp) {
  m <- matrix(NA, nrow = length(myp), ncol = 3)
  for (i in 1:length(myp)){
    m[i,] <- sapply(1:3, function(x) myp[[i]][[x]]@test$statistic)
  }
  return(m)
}
This function represents a minor generalization in the sense that it allows for an arbitrary number of groups, while in your case this was fixed to two, a and b. The code can certainly be shortened further, but it might then also become somewhat more cryptic. I believe that when developing code it is helpful to keep a compromise between efficiency and compactness on the one hand and readability on the other.
Edit
As pointed out by @akrun and @Roland, the function getparams2() can be written in a much more elegant and shorter form. One possibility is
getparams2 <- function(myp) {
  matrix(unname(rapply(myp, function(x) x@test$statistic)), ncol = 3)
}
Another great alternative is
getparams2 <- function(myp) {t(sapply(myp, sapply, function(x) x@test$statistic))}
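For example, combining it with the function defined in the question, using the res_list computed above:
getparams2(res_list)  # a 2 x 3 matrix of test statistics: one row per group (a, b), one column per test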
