Optimization of a for loop in R

I'm trying to optimize a for loop in my R code.
Summary:
I have a data.table with 47 million rows and 4 columns (the number of columns is designated by nvars in the code).
I want to compare the values within each row, column by column: if any two are equal, set a delete flag to 1, else 0.
I then need to delete all rows in which at least two of the 4 values are equal (the values are numeric in all columns, e.g. 1, 2, 3, ...).
I tried optimising using vectorisation, but it still takes roughly 1.5 hours.
Can this be optimised further?
test2 <- as.data.table(test2)
delete_output <- numeric(nrow(test2))
for (i in 1:nrow(test2)) {
  for (j in 1:(nvars - 1)) {      # nvars = number of columns (4 here)
    for (k in (j + 1):nvars) {    # compare every pair of columns, not just adjacent ones
      if (test2[i, ..j] == test2[i, ..k]) {
        delete_output[i] <- 1
        break                     # duplicate found; exit the inner loop
      }
    }
  }
}
If any two values in a particular row are equal, the code should set the delete flag to 1.
My file should look like the one in the image: an example with 3 input variables and a corresponding output variable (delete), where the delete flag is 0 if V1, V2 and V3 are all unique for a particular row, and 1 otherwise.

We can use apply (but I fear it might not be fast enough) and check each row for any duplicated value:
df$delete <- +(apply(df, 1, function(x) any(duplicated(x))))
df
# V1 V2 V3 V4 delete
#1 3 3 3 1 1
#2 1 4 4 3 1
#3 2 2 1 4 1
#4 2 2 3 3 1
#5 2 4 4 2 1
#6 1 3 2 4 0
#7 1 1 1 3 1
#8 4 2 1 1 1
#9 3 4 2 2 1
#10 1 2 2 4 1
Data
set.seed(1432)
df <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
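At 47 million rows, the per-row function call inside apply is likely the bottleneck. As a fully vectorised alternative, here is a sketch (untested at that scale) that compares every pair of columns across all rows at once and ORs the results together:
# Compare every pair of value columns in one vectorised pass per pair;
# a row is flagged when at least one pair matches.
vars <- setdiff(names(df), "delete")   # value columns only, in case delete already exists
pairs <- combn(vars, 2, simplify = FALSE)
eq_any <- Reduce(`|`, lapply(pairs, function(p) df[[p[1]]] == df[[p[2]]]))
df$delete <- as.integer(eq_any)
With 4 columns this is only 6 vectorised comparisons over the whole table, instead of one function call per row.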

You can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2[apply(test2, 1, function(x) all(table(x)==1)), ]
This will select only those rows in which all elements are unique.
If you need the extra column you can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2$delete <- !apply(test2, 1, function(x) all(table(x)==1))
test2
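As a smaller tweak to the per-row check, base R's anyDuplicated() stops at the first duplicate it finds, so it can replace the table(x) == 1 test (a sketch with the same semantics):
test2$delete <- +(apply(test2, 1, anyDuplicated) > 0)
This still pays for one function call per row, though, so at 47 million rows a pairwise column comparison should scale better.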

Related

Compute differences between all variable pairs in R

I have a dataframe with 4 columns.
set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
B = rpois(1000, lambda = 3),
C = round(rnorm(1000, mean = -1)),
D = round(rnorm(1000, mean = 0)))
I would like to compute the differences for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) at every row of my dataframe.
This would be the equivalent of doing df$A - df$B for every combination.
Can we use the dist() function to compute this efficiently as I have a very large dataset? I would like to then convert the dist object into a data.frame to plot the results with ggplot2.
Unless there is a good tidy way of doing the above.
Many Thanks
The closest I got was the below, but I am not sure what the column names refer to.
d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)
dist will compare every value in a vector to every other value in the same vector and return absolute distances, so if you are looking for signed column-by-column differences, this is not what you are looking for.
If you just want to calculate the difference between all columns pairwise, you can do:
df <- cbind(df,
            do.call(cbind, lapply(asplit(combn(names(df), 2), 2), function(x) {
              setNames(data.frame(df[x[1]] - df[x[2]]), paste(x, collapse = ""))
            })))
head(df)
#> A B C D AB AC AD BC BD CD
#> 1 0 1 -2 -1 -1 2 1 3 2 -1
#> 2 1 1 -1 1 0 2 0 2 0 -2
#> 3 3 1 -2 -1 2 5 4 3 2 -1
#> 4 1 3 0 -1 -2 1 2 3 4 1
#> 5 1 3 0 1 -2 1 0 3 2 -1
#> 6 3 3 1 0 0 2 3 2 3 1
Using base R:
df_dist <- t(apply(df, 1, dist))
colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
If you really want a tidy approach, you could go with c_across, but this also removes the names, and it is much slower if your data is huge.
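For completeness, a minimal sketch of that c_across route (like the dist answer above, it yields absolute rather than signed differences, and the spread columns need renaming afterwards):
library(dplyr)
library(tidyr)
# One dist() call per row; unnest_wider spreads the six pairwise
# distances into columns diff_1 .. diff_6.
df %>%
  rowwise() %>%
  mutate(diff = list(as.vector(dist(c_across(A:D))))) %>%
  ungroup() %>%
  unnest_wider(diff, names_sep = "_")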

How can I merge dataframes of unequal length but known chunk length?

I have a model where individuals can die and reproduce. I record information from the model at set intervals. I know the identity of the individuals and the iteration number I sampled from:
df1 <- data.frame(
  who = c(1, 2, 3, 4, 1, 2, 3, 3, 5),
  iteration = c(1, 1, 1, 1, 2, 2, 2, 3, 3)
)
df1
But each of the individuals has a list of numbers associated with it that I want to track. Because each individual has more than one number associated with it, I get two data frames of unequal sample size.
df2 <- data.frame(values = c(1, 1,  # id = 1
                             1, 2,  # id = 2
                             2, 1,  # id = 3
                             0, 0,  # id = 4
                             1, 1,  # id = 1
                             1, 2,  # id = 2
                             2, 1,  # id = 3
                             2, 1,  # id = 3
                             0, 0)) # id = 5
df2
I want to bind them so the 'who' variable is matched up with its value. I did the following to split the values up into the right sized chunks but now I'm stuck.
df3 <- split(df2$values, ceiling(seq_along(df2$values)/2))
I should get something out that looks like this:
who iteration value1 value2
1 1 1 1
2 1 1 2
3 1 2 1
4 1 0 0
1 2 1 1
2 2 1 2
3 2 2 1
3 3 2 1
5 3 0 0
Here, we split the 'values' column into a list of vectors using a grouping index created with %% (odd positions vs. even positions), then pad each list element with NA at the end (in case one has fewer elements) by assigning length<- with the maximum element length:
lst <- split(df2$values, (seq_along(df2$values)-1) %% 2 +1)
m1 <- do.call(cbind, lapply(lst, "length<-", max(lengths(lst))))
cbind(df1, m1)
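If every individual is guaranteed to contribute exactly two consecutive values, as in the example, the same reshaping can be done in one step with matrix() (a sketch; it assumes the chunk length is always 2):
# Each consecutive pair of values becomes one row of a two-column matrix.
m1 <- matrix(df2$values, ncol = 2, byrow = TRUE)
colnames(m1) <- c("value1", "value2")
cbind(df1, m1)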

Replicate rows with missing values and replace missing values by vector

I have a dataframe in which a column has some missing values.
I would like to replicate the rows with the missing values N times, where N is the length of a vector which contains replacements for the missing values.
I first define a replacement vector, then my starting data.frame, then my desired result and finally my attempt to solve it. Unfortunately that didn't work...
> replace_values <- c('A', 'B', 'C')
> data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
value result
1 3 5
2 4 3
3 NA 1
4 NA 2
> data.frame(value = c(3, 4, replace_values, replace_values), result = c(5, 3, rep(1, 3),rep(2, 3)))
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
> t <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
> mutate(t, value = ifelse(is.na(value), replace_values, value))
value result
1 3 5
2 4 3
3 C 1
4 A 2
Your attempt fails because ifelse() returns a result the same length as its condition, so replace_values is recycled element-wise instead of expanding the rows. You can try a tidyverse solution:
d %>%
  mutate(value = ifelse(is.na(value), paste0(replace_values, collapse = ","), value)) %>%
  separate_rows(value, sep = ",") %>%
  select(value, everything())
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
The idea is to replace the NAs with the comma-collapsed 'replace_values', then split the collapsed values back into rows with tidyr's separate_rows function, and finally reorder the columns to match your expected output. Note that 'value' becomes a character column, since it now mixes numbers and letters.
We can do an rbind here using base R. Create a logical vector marking where 'value' is NA ('i1') and get the number of NA elements by taking its sum ('n'). Then build a data.frame that replicates 'replace_values' n times and repeats each 'result' element corresponding to an NA 'value' once per replacement value. Finally, rbind this to the subset of the dataset with non-NA 'value' rows:
i1 <- is.na(df1$value)
n <- sum(i1)
rbind(df1[!i1, ],
      data.frame(value = rep(replace_values, n),
                 result = rep(df1$result[i1], each = length(replace_values))))
# value result
#1 3 5
#2 4 3
#3 A 1
#4 B 1
#5 C 1
#6 A 2
#7 B 2
#8 C 2
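Another option is a sketch using a list column and tidyr's unnest() (assuming the input data frame is df1, as in the previous answer; 'value' is converted to character so the two types can be combined):
library(dplyr)
library(tidyr)
# NA values expand to one row per replacement; everything else stays one row.
df1 %>%
  mutate(value = lapply(value, function(v)
    if (is.na(v)) replace_values else as.character(v))) %>%
  unnest(value)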

For each column, calculate the difference between it and the max of the others

Let's say I have a dataframe:
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
# a b c
#1 1 2 4
#2 2 3 5
#3 3 2 1
For each column, I would like to calculate the difference between that and the max of the other columns:
# Desired result:
# a b c
#1 -3 -2 2
#2 -3 -2 2
#3 1 -1 -2
For example, the (1,1) entry is -3 because in the first row a = 1 and max(b, c) = 4, so 1 - 4 = -3.
Note that I don't necessarily know the number of columns in the dataframe up front, so there could be arbitrarily many columns.
This should work on any number of columns:
sapply(1:ncol(x), function(i) {
  x[, i] - do.call(pmax, x[, -i])
})
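The sapply call returns an unnamed matrix; to match the desired output you can wrap it back into a named data frame (a small sketch):
out <- as.data.frame(sapply(1:ncol(x), function(i) x[, i] - do.call(pmax, x[, -i])))
names(out) <- names(x)
out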
If you want a dplyr solution with a bit of column indexing, you can use transmute to generate a new data frame, or mutate to add to your existing one. Note that the row-wise maximum of the other columns needs pmax; a plain max() would collapse them to a single number:
x <- data.frame(a = c(1, 2, 3), b = c(2, 3, 2), c = c(4, 5, 1))
x %>% transmute(a = a - do.call(pmax, x[, -1]),
                b = b - do.call(pmax, x[, -2]),
                c = c - do.call(pmax, x[, -3]))
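Since the number of columns is not known up front, here is a sketch of a tidyverse-style variant for arbitrarily many columns (it always indexes the original x, so columns computed earlier cannot leak into later ones):
library(purrr)
# For each column, subtract the row-wise max of all the other columns.
imap_dfc(x, function(col, nm) col - do.call(pmax, x[setdiff(names(x), nm)]))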

Count occurrence of values over cases

In my dataset, I have 6 variables with four possible values each (1, 10, 100 or NA).
set.seed(1)
x <- setNames(
as.data.frame(replicate(6, sample(c(1,10,100,NA), 10, replace = TRUE))),
letters[c(1:5,7)])
I would like to count how often each value appears per case over all six variables, resulting in three scales (No_of_1s, No_of_10s, No_of_100s) all ranging from 0 to 6.
So far, I used this
All<-data.frame(a,b,c,d,e,g)
All_table<-apply(All,MARGIN=1,table)
which gives me the counts of 1s,10s and 100s for each case in a table.
I was thinking now of using
No_of_1s<-All_table[,1]
to create the variable I need. However, All_table does not contain zero counts for values that are absent from a row; it simply omits them, so the per-case tables have different lengths, resulting in a gigantic mess.
Does anyone know how to adjust this?
The solution to this problem is probably pretty straightforward, but I can't seem to figure it out myself.
I would do (thanks to @akrun)...
table(id = seq(nrow(x))[row(x)], unlist(x), useNA= "ifany")
Or with the reshape2 package
library(reshape2)
x$id = seq(nrow(x))
table(melt(x, id="id")[, c("id","value")], useNA="ifany")
value
id 1 10 100 <NA>
1 1 3 0 2
2 2 1 2 1
3 0 2 3 1
4 3 1 1 1
5 2 1 1 2
6 1 2 1 2
7 2 1 1 2
8 1 2 2 1
9 0 1 4 1
10 1 3 1 1
You might also want to look into log10() if your data follows this powers-of-ten pattern up to higher numbers.
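To get the three separate scales the question asks for, you can store this table and index it by column name (a sketch, assuming it is saved as tab):
tab <- table(id = seq(nrow(x))[row(x)], unlist(x), useNA = "ifany")
No_of_1s   <- tab[, "1"]
No_of_10s  <- tab[, "10"]
No_of_100s <- tab[, "100"]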
You could use something like the following (na.rm = TRUE is needed because rows containing NA would otherwise produce NA counts):
No_of_10s <- rowSums(All == 10, na.rm = TRUE)
No_of_100s <- rowSums(All == 100, na.rm = TRUE)
I tested this in a data.frame like this:
x <- data.frame(a = sample(c(1, 10, 100), 10, replace = TRUE),
                b = sample(c(1, 10, 100), 10, replace = TRUE),
                c = sample(c(1, 10, 100), 10, replace = TRUE),
                d = sample(c(1, 10, 100), 10, replace = TRUE),
                e = sample(c(1, 10, 100), 10, replace = TRUE),
                g = sample(c(1, 10, 100), 10, replace = TRUE))
rowSums(x == 10)
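To build all three scales at once, the same idea can be wrapped in a sapply over the possible values (a sketch):
vals <- c(1, 10, 100)
counts <- sapply(vals, function(v) rowSums(All == v, na.rm = TRUE))
colnames(counts) <- paste0("No_of_", vals, "s")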
