I'm trying to optimize a for-loop in my R code.
Summary:
I have a data.table with 47 million rows and 4 columns (the number of columns is 'nvars' in the code below).
I want to compare the values in each row across columns and, if any two are equal, set a delete flag to 1, else 0.
I need to delete all rows in which at least two of the 4 column values are equal. (The values are numeric in all columns, e.g. 1, 2, 3, ...)
I tried optimising with vectorisation, but it still takes about 1.5 hours.
Can this be optimised further?
library(data.table)
test2 <- as.data.table(test2)          # test2 holds the 47 million rows
nvars <- ncol(test2)                   # number of columns to compare (4 here)
delete_output <- numeric(nrow(test2))
for (i in 1:nrow(test2)) {
  for (j in 1:(nvars - 1)) {
    for (k in (j + 1):nvars) {         # compare every pair of columns
      if (test2[i, ..j] == test2[i, ..k]) {
        delete_output[i] <- 1
        break                          # row already flagged; stop comparing
      }
    }
  }
}
If any two values in a particular row are equal, it should assign the delete flag as 1.
For example, with 3 input variables V1, V2, V3 and a corresponding output variable (delete): if all of V1, V2, V3 are unique for a particular row, the delete flag is 0, else 1.
We can use apply (but I fear it might not be fast enough for this many rows) and check each row for any duplicated value; the unary + converts the logical result to 0/1.
df$delete <- +(apply(df, 1, function(x) any(duplicated(x))))
df
# V1 V2 V3 V4 delete
#1 3 3 3 1 1
#2 1 4 4 3 1
#3 2 2 1 4 1
#4 2 2 3 3 1
#5 2 4 4 2 1
#6 1 3 2 4 0
#7 1 1 1 3 1
#8 4 2 1 1 1
#9 3 4 2 2 1
#10 1 2 2 4 1
data
set.seed(1432)
df <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
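With 47 million rows, apply() over rows can still be slow because it loops in R. Here is a fully vectorized sketch (the helper names m, pairs and dup are illustrative) that compares whole columns pairwise instead, so only choose(nvars, 2) = 6 vector comparisons are needed:
m <- as.matrix(df)                      # numeric matrix of the input columns
pairs <- combn(ncol(m), 2)              # all pairs of column indices
dup <- Reduce(`|`, lapply(seq_len(ncol(pairs)), function(p) {
  m[, pairs[1, p]] == m[, pairs[2, p]]  # row-wise equality for one column pair
}))
df$delete <- as.integer(dup)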
You can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2[apply(test2, 1, function(x) all(table(x)==1)), ]
This will select only those rows in which all elements are unique.
If you need the extra column you can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2$delete <- +(!apply(test2, 1, function(x) all(table(x) == 1)))  # 1 = row contains a duplicated value
test2
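On the 47-million-row table from the question, the flagged rows can then be dropped with data.table's filtering; a sketch, assuming the 0/1 delete column created above:
library(data.table)
setDT(test2)                       # convert to data.table by reference
test2_clean <- test2[delete == 0]  # keep only rows where all values are unique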
I have a dataframe with 4 columns.
set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
                 B = rpois(1000, lambda = 3),
                 C = round(rnorm(1000, mean = -1)),
                 D = round(rnorm(1000, mean = 0)))
I would like to compute the differences for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) at every row of my dataframe.
This would be the equivalent of doing df$A - df$B for every combination.
Can we use the dist() function to compute this efficiently as I have a very large dataset? I would like to then convert the dist object into a data.frame to plot the results with ggplot2.
Unless there is a good tidy version of doing the above.
Many Thanks
The closest I got was the following, but I am not sure what the column names refer to.
d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)
dist() compares every value in a vector with every other value in the same vector and returns (absolute) distances, so if you are looking for signed, row-by-row differences between columns, it is not what you are looking for.
If you just want to calculate the difference between all columns pairwise, you can do:
df <- cbind(df,
            do.call(cbind, lapply(asplit(combn(names(df), 2), 2), function(x) {
              setNames(data.frame(df[x[1]] - df[x[2]]), paste(x, collapse = ""))
            })))
head(df)
#> A B C D AB AC AD BC BD CD
#> 1 0 1 -2 -1 -1 2 1 3 2 -1
#> 2 1 1 -1 1 0 2 0 2 0 -2
#> 3 3 1 -2 -1 2 5 4 3 2 -1
#> 4 1 3 0 -1 -2 1 2 3 4 1
#> 5 1 3 0 1 -2 1 0 3 2 -1
#> 6 3 3 1 0 0 2 3 2 3 1
Created on 2022-06-14 by the reprex package (v2.0.1)
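Since the stated goal is to plot the results with ggplot2, here is a sketch that reshapes the new difference columns to long format (it assumes the AB to CD columns created above; the histogram is just one possible plot):
library(tidyr)
library(ggplot2)
long <- pivot_longer(df, cols = AB:CD,
                     names_to = "pair", values_to = "difference")
ggplot(long, aes(difference)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~ pair)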
Using base R:
df_dist <- t(apply(df, 1, dist))
colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
If you really want a tidy approach, you could go with c_across, but this also drops the pair names, and it is much slower if your data is huge; a sketch follows below.
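A minimal sketch of that c_across route, assuming the original columns A to D (the signed differences and pair names mirror the answers above):
library(dplyr)
library(tidyr)
pair_names <- combn(c("A", "B", "C", "D"), 2, paste, collapse = "_")
df %>%
  rowwise() %>%
  mutate(d = list(setNames(
    combn(c_across(A:D), 2, function(p) p[1] - p[2]),  # signed difference per pair
    pair_names
  ))) %>%
  unnest_wider(d) %>%
  ungroup()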
I have a model where individuals can die and reproduce. I record information from the model at set intervals. I know the identity of the individuals and the iteration number I sampled from:
df1 <- data.frame(
  who = c(1, 2, 3, 4, 1, 2, 3, 3, 5),
  iteration = c(1, 1, 1, 1, 2, 2, 2, 3, 3)
)
df1
But each of the individuals has a list of numbers associated with it that I want to track. Because each individual has more than one number associated with it, I get two data frames of unequal sample size.
df2 <- data.frame(values = c(1, 1,  # id = 1
                             1, 2,  # id = 2
                             2, 1,  # id = 3
                             0, 0,  # id = 4
                             1, 1,  # id = 1
                             1, 2,  # id = 2
                             2, 1,  # id = 3
                             2, 1,  # id = 3
                             0, 0)) # id = 5
df2
I want to bind them so that the 'who' variable is matched up with its values. I did the following to split the values into chunks of the right size, but now I'm stuck.
df3 <- split(df2$values, ceiling(seq_along(df2$values)/2))
I should get something out that looks like this:
who iteration value1 value2
1 1 1 1
2 1 1 2
3 1 2 1
4 1 0 0
1 2 1 1
2 2 1 2
3 2 2 1
3 3 2 1
5 3 0 0
Here, we split the 'values' column into a list of vectors based on a grouping index created with %%, then pad each list element with NA at the end (in case some have fewer elements) by assigning length<- with the maximum of the list element lengths.
lst <- split(df2$values, (seq_along(df2$values) - 1) %% 2 + 1)    # odd vs. even positions
m1 <- do.call(cbind, lapply(lst, "length<-", max(lengths(lst))))  # pad with NA to equal length
colnames(m1) <- paste0("value", seq_len(ncol(m1)))                # name the columns value1, value2
cbind(df1, m1)
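Since the values arrive in consecutive pairs per individual, an even simpler sketch (assuming every 'who' always contributes exactly two values) folds the vector into a two-column matrix by row:
m <- matrix(df2$values, ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("value1", "value2")))
cbind(df1, m)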
I have a dataframe in which a column has some missing values.
I would like to replicate the rows with the missing values N times, where N is the length of a vector which contains replacements for the missing values.
I first define a replacement vector, then my starting data.frame, then my desired result and finally my attempt to solve it. Unfortunately that didn't work...
> replace_values <- c('A', 'B', 'C')
> data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
value result
1 3 5
2 4 3
3 NA 1
4 NA 2
> data.frame(value = c(3, 4, replace_values, replace_values), result = c(5, 3, rep(1, 3),rep(2, 3)))
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
> t <- data.frame(value = c(3, 4, NA, NA), result = c(5, 3, 1,2))
> mutate(t, value = ifelse(is.na(value), replace_values, value))
value result
1 3 5
2 4 3
3 C 1
4 A 2
You can try a tidyverse solution:
library(dplyr)
library(tidyr)
t %>%
  mutate(value = ifelse(is.na(value), paste0(replace_values, collapse = ","), value)) %>%
  separate_rows(value, sep = ",") %>%
  select(value, everything())
value result
1 3 5
2 4 3
3 A 1
4 B 1
5 C 1
6 A 2
7 B 2
8 C 2
The idea is to replace the NAs with the comma-collapsed 'replace_values'. Then split the collapsed values into separate rows using tidyr's separate_rows function. Finally, reorder the columns to match your expected output.
We can do an rbind here using base R. Create a logical vector marking where 'value' is NA ('i1') and take its sum to get the number of NA elements ('n'). Then build a data.frame by replicating 'replace_values' 'n' times and repeating each NA row's 'result' element length(replace_values) times, and rbind it with the subset of the dataset corresponding to the non-NA 'value' rows.
i1 <- is.na(t$value)
n <- sum(i1)
rbind(t[!i1, ],
      data.frame(value = rep(replace_values, n),
                 result = rep(t$result[i1], each = length(replace_values))))
# value result
#1 3 5
#2 4 3
#3 A 1
#4 B 1
#5 C 1
#6 A 2
#7 B 2
#8 C 2
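Another tidyverse sketch keeps the original row order by expanding a list-column; values are converted to character so that the letter replacements and the numbers can coexist:
library(dplyr)
library(tidyr)
library(purrr)
t %>%
  mutate(value = map(value, ~ if (is.na(.x)) replace_values else as.character(.x))) %>%
  unnest(value)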
Let's say I have a dataframe:
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
# a b c
#1 1 2 4
#2 2 3 5
#3 3 2 1
For each column, I would like to calculate the difference between that and the max of the other columns:
# Desired result:
# a b c
#1 -3 -2 2
#2 -3 -2 2
#3 1 -1 -2
For example, the (1,1) entry is -3 because for the first row, a = 1 and max(b, c) = 4, so 1 - 4 = -3.
Note that I don't necessarily know the number of columns in the dataframe up front, so there could be arbitrarily many columns.
This should work on any number of columns:
res <- sapply(seq_len(ncol(x)), function(i) {
  x[, i] - do.call(pmax, x[, -i, drop = FALSE])  # row-wise max of all other columns
})
colnames(res) <- names(x)
res
If you want a dplyr solution with explicit column references, you can use transmute to generate a new data frame, or mutate to add to your existing dataframe. Note that the row-wise maximum needs pmax (max would collapse everything to a single number), and referencing x$ ensures each column is compared against the original values:
library(dplyr)
x <- data.frame(a = c(1, 2, 3), b = c(2, 3, 2), c = c(4, 5, 1))
x %>% transmute(a = a - pmax(x$b, x$c),
                b = b - pmax(x$a, x$c),
                c = c - pmax(x$a, x$b))
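For an arbitrary number of columns, the same idea can be written generically; a sketch using across() and cur_column(), where referencing the original x avoids comparing against already-updated columns:
library(dplyr)
x %>%
  mutate(across(everything(),
                ~ .x - do.call(pmax, x[setdiff(names(x), cur_column())])))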
In my dataset, I have 6 variables with four possible values each (1, 10, 100 or NA):
set.seed(1)
x <- setNames(
as.data.frame(replicate(6, sample(c(1,10,100,NA), 10, replace = TRUE))),
letters[c(1:5,7)])
I would like to count how often each value appears per case over all six variables, resulting in three scales (No_of_1s, No_of_10s, No_of_100s) all ranging from 0 to 6.
So far, I used this
All <- x  # the six variables a, b, c, d, e, g defined above
All_table <- apply(All, MARGIN = 1, table)
which gives me the counts of 1s,10s and 100s for each case in a table.
I was thinking now of using
No_of_1s<-All_table[,1]
to create the variable I need. However, it appears that All_table does not create zeros for empty rows but instead just omits them for that case, resulting in a gigantic mess.
Does anyone know how to adjust this?
The solution to this problem is probably pretty straightforward, but I can't seem to figure it out myself.
I would do (thanks to #akrun)...
table(id = seq(nrow(x))[row(x)], unlist(x), useNA = "ifany")
Or with the reshape2 package
library(reshape2)
x$id = seq(nrow(x))
table(melt(x, id="id")[, c("id","value")], useNA="ifany")
value
id 1 10 100 <NA>
1 1 3 0 2
2 2 1 2 1
3 0 2 3 1
4 3 1 1 1
5 2 1 1 2
6 1 2 1 2
7 2 1 1 2
8 1 2 2 1
9 0 1 4 1
10 1 3 1 1
If your data follows this pattern up to higher numbers, you might also want to look into log10() to work with the exponents (0, 1, 2, ...) directly.
You could use something like
No_of_1s   <- rowSums(All == 1,   na.rm = TRUE)
No_of_10s  <- rowSums(All == 10,  na.rm = TRUE)
No_of_100s <- rowSums(All == 100, na.rm = TRUE)
The na.rm = TRUE is needed because the NA cells would otherwise turn the row sums into NA.
I tested this in a data.frame like this:
x <- data.frame(a = sample(c(1, 10, 100), 10, replace = TRUE),
                b = sample(c(1, 10, 100), 10, replace = TRUE),
                c = sample(c(1, 10, 100), 10, replace = TRUE),
                d = sample(c(1, 10, 100), 10, replace = TRUE),
                e = sample(c(1, 10, 100), 10, replace = TRUE),
                g = sample(c(1, 10, 100), 10, replace = TRUE))
rowSums(x == 10)
# answer
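All three scales can then be assembled in one step; a sketch reusing the rowSums idea on the question's data (the counts object is illustrative):
counts <- sapply(c(1, 10, 100), function(v) rowSums(All == v, na.rm = TRUE))
colnames(counts) <- c("No_of_1s", "No_of_10s", "No_of_100s")
cbind(All, counts)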