Count occurance of value over cases - r

In my dataset, I have 6 Variables with four possible values each (1,10,100 or NA)
set.seed(1)
x <- setNames(
as.data.frame(replicate(6, sample(c(1,10,100,NA), 10, replace = TRUE))),
letters[c(1:5,7)])
I would like to count how often each value appears per case over all six variables, resulting in three scales (No_of_1s, No_of_10s, No_of_100s) all ranging from 0 to 6.
So far, I used this
All<-data.frame(a,b,c,d,e,g)
All_table<-apply(All,MARGIN=1,table)
which gives me the counts of 1s,10s and 100s for each case in a table.
I was thinking now of using
No_of_1s<-All_table[,1]
to create the variable I need. However, it appears that All_table does not create zeros for empty rows but instead just omits them for that case, resulting in a gigantic mess.
Does anyone know how to adjust this?
The solution to this problem is probably pretty straightforward but I canĀ“t seem to figure it out myself.

I would do (thanks to #akrun)...
table(id = seq(nrow(x))[row(x)], unlist(x), useNA= "ifany")
Or with the reshape2 package
library(reshape2)
x$id = seq(nrow(x))
table(melt(x, id="id")[, c("id","value")], useNA="ifany")
value
id 1 10 100 <NA>
1 1 3 0 2
2 2 1 2 1
3 0 2 3 1
4 3 1 1 1
5 2 1 1 2
6 1 2 1 2
7 2 1 1 2
8 1 2 2 1
9 0 1 4 1
10 1 3 1 1
You might also want to look into the log10 if your data follows this pattern to higher numbers.

You could use something like
No_of_10s <- rowSums(All == 10)
No_of_100s <- rowSums(All == 100)
I tested this in a data.frame like this:
x <- data.frame(a = sample(c(1,10,100), 10, replace = TRUE), b = sample(c(1,10,100), 10, replace = TRUE), c=sample(c(1,10,100), 10, replace = TRUE), d=sample(c(1,10,100), 10, replace = TRUE), e=sample(c(1,10,100), 10, replace = TRUE), g=sample(c(1,10,100), 10, replace = TRUE))
rowSums(x == 10)
# answer

Related

Compute differences between all variable pairs in R

I have a dataframe with 4 columns.
set.seed(123)
df <- data.frame(A = round(rnorm(1000, mean = 1)),
B = rpois(1000, lambda = 3),
C = round(rnorm(1000, mean = -1)),
D = round(rnorm(1000, mean = 0)))
I would like to compute the differences for every possible combination of my columns (A-B, A-C, A-D, B-C, B-D, C-D) at every row of my dataframe.
This would be the equivalent of doing df$A - df$B for every combination.
Can we use the dist() function to compute this efficiently as I have a very large dataset? I would like to then convert the dist object into a data.frame to plot the results with ggplot2.
Unless there is a good tidy version of doing the above.
Many Thanks
The closest I got was doing the below, but I am not sure to what the column names refer to.
d <- apply(as.matrix(df), 1, function(e) as.vector(dist(e)))
t(d)
dist will compare every value in a vector to every other value in the same vector, so if you are looking to compare columns row-by-row, this is not what you are looking for.
If you just want to calculate the difference between all columns pairwise, you can do:
df <- cbind(df,
do.call(cbind, lapply(asplit(combn(names(df), 2), 2), function(x) {
setNames(data.frame(df[x[1]] - df[x[2]]), paste(x, collapse = ""))
})))
head(df)
#> A B C D AB AC AD BC BD CD
#> 1 0 1 -2 -1 -1 2 1 3 2 -1
#> 2 1 1 -1 1 0 2 0 2 0 -2
#> 3 3 1 -2 -1 2 5 4 3 2 -1
#> 4 1 3 0 -1 -2 1 2 3 4 1
#> 5 1 3 0 1 -2 1 0 3 2 -1
#> 6 3 3 1 0 0 2 3 2 3 1
Created on 2022-06-14 by the reprex package (v2.0.1)
Using base r:
df_dist <- t(apply(df, 1, dist))
colnames(df_dist) <- apply(combn(names(df), 2), 2, paste0, collapse = "_")
If you really want to use a tidy-approach, you could go with c_across, but this also removes the names, and is much slower if your data is huge

How to sort a dataframe in decreasing order lapply and sort in r

Not sure if this is a duplicate but I couldn't find anything that either solves my original problem or the issue I'm running into with the partial I did find.
The goal is to sort a dataframe independently by column.
Reproducible example
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a
name date1 date2 date3
1 a 2 0 0
2 a 3 2 2
3 a 1 3 0
4 b 3 1 3
5 b 1 2 2
6 b 2 0 1
b <- ddply(a, "name", function(x) { as.data.frame(lapply(x, sort))
b
name date1 date2 date3
1 a 1 0 0
2 a 2 2 0
3 a 3 3 2
4 b 1 0 1
5 b 2 1 2
6 b 3 2 3
Now this works as expected, but is the opposite of what I'm looking to do.
Desired output
b
name date1 date2 date3
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
I've tried to add in the decreasing=T parameter but haven't had any luck with the variations I've tried and usually end up with an error about missing arguments or undefined columns being selected. How does one correctly implement a decreasing sort with this syntax and/or otherwise achieve the end result without relying on explicitly naming the columns (they names are dates so change often)
Bonus
How could this code be adapted to account for NA's with na.last
Thank you!
I think you nuked the data.frame rows with your code, not a very good practice standard dplyr use the arrange() function like this
library(tidyverse)
a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
a %>%
arrange(name,-date1)
If you want to live a dangerous life here is the code for it
a %>%
group_by(name) %>%
mutate_all(sort,decreasing = TRUE)
name date1 date2 date3
<fct> <dbl> <dbl> <dbl>
1 a 3 3 2
2 a 2 2 0
3 a 1 0 0
4 b 3 2 3
5 b 2 1 2
6 b 1 0 1
A solution with the data.table package is the following
library(data.table)
a <- data.table(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# alternatively:
# a <- data.frame(name = c("a","a","a","b","b","b"),date1 = c(2,3,1,3,1,2),date2 = c(0,2,3,1,2,0),date3 = c(0,2,0,3,2,1))
# setDT(a)
b <- a[, lapply(.SD, sort, decreasing = TRUE), by = name]
.SD returns the subset of data, in this case created with the by = name. It splits the original data.table by the values in the given column.
This also fulfills your bonus requirement, the na.last can be supplied.
aa <- data.table(name = c("a","a","a","b","b","b"),date1 = c(NA,3,1,3,1,NA),date2 = c(0,2,NA,1,2,0),date3 = c(0,2,0,3,2,NA))
bb <- aa[, lapply(.SD, sort, decreasing = TRUE, na.last = TRUE), by = name]

Optimization of for loop in R

I'm trying to optimize for-loop in my R-code.
Summary:
I've a data table with 47 million rows and 4 columns( designated by 'nvars' in code).
I want to compare row-wise values in each column and if any two are equal, set delete flag as 1, else 0.
I need to delete all those rows in which at least two values in any of 4 columns are equal. (values are numeric in all columns, e.g. 1,2,3... )
I tried optimising using vectorisation but it's still taking ~1.5 hours (approx.)
Can this be optimised further?
test2 <- as.data.table(test2)
delete_output <- numeric(nrow(test2))
for (i in 1:nrow(test2)){
for (j in 1:(nvars-1)){
k=j+1
if (test2[i,..j] == test2[i,..k]){
delete_output[i] <- 1
next
}
}
}
If any two values in a particular row are equal, it should assign delete flag as 1.
My file should look like the one in the image. This is an example of 3 input variable and corresponding output variable (delete). Check that if all V1, V2, V3 are unique for a particular row, delete flag is equal to 0, else 1.
We can use apply (but I fear it might not be fast enough) and check for any duplicated value.
df$delete <- +(apply(df, 1, function(x) any(duplicated(x))))
df
# V1 V2 V3 V4 delete
#1 3 3 3 1 1
#2 1 4 4 3 1
#3 2 2 1 4 1
#4 2 2 3 3 1
#5 2 4 4 2 1
#6 1 3 2 4 0
#7 1 1 1 3 1
#8 4 2 1 1 1
#9 3 4 2 2 1
#10 1 2 2 4 1
data
set.seed(1432)
df <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
You can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2[apply(test2, 1, function(x) all(table(x)==1)), ]
This will select only those rows, in which all elements are unique.
If you need the extra column you can do:
set.seed(1432)
test2 <- as.data.frame(matrix(sample(1:4, 40, replace = TRUE), ncol = 4))
test2
test2$delete <- !apply(test2, 1, function(x) all(table(x)==1))
test2

for each column, calculate the difference between it and the max of the others

Let's say I have a dataframe:
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
# a b c
#1 1 2 4
#2 2 3 5
#3 3 2 1
For each column, I would like to calculate the difference between that and the max of the other columns:
# Desired result:
# a b c
#1 -3 -2 2
#2 -3 -2 2
#3 1 -1 -2
For example, for the (1,1) entry, it's 1 because for the first row, a = 1, and max(b,c) = 4, so 1 - 4 = -3.
Note that I don't necessarily know the number of columns in the dataframe up front, so there could be arbitrarily many columns.
This should work on any number of columns:
sapply(1:ncol(x), function (i) {
x[,i] - do.call(pmax, x[,-i])
})
If you want a dplyr solution with a bit of RC indexing, you can use transmute to generate a new data frame, or mutate to add to your existing dataframe.
x <- data.frame(a=c(1,2,3), b=c(2,3,2), c=c(4,5,1))
x %>% transmute(a = a-max(x[,-1]),
b = b-max(x[,-2]),
c = c-max(x[,-3]))

Finding maximum value across multiple columns (pmax gives NAs)

Using reshape function, I have data that looks something like this:
df1
ID Score1 Score2 Score3
1 2 3 1
2 3 2 1
3 2 1 NA
4 1 NA NA
As you can see, some of my score variables have missing values.
I'm interested in finding the maximum score variable for all ID values. When I tried using pmax(df1$Score1,df1$Score2,df1$Score3), my resulting vector contains NAs. I'm not sure why this is, as I know that my Score1 variable doesn't contain any NAs.
This is what I'd like my output to accomplish:
ID MaxScore
1 3
2 3
3 2
4 1
Thanks
You could use apply on each row (MARGIN = 1)
apply(X = df1[,-1], MARGIN = 1, FUN = max, na.rm = TRUE)
#[1] 3 3 2 1
We can use the vectorized pmax to get this done
cbind(df1['ID'], MaxScore = do.call(pmax, c(df1[-1], na.rm = TRUE)))
# ID MaxScore
#1 1 3
#2 2 3
#3 3 2
#4 4 1

Resources