Number of maximums in each row and more - R

My dataset contains four numerical variables X1, X2, X3, X4 and an ID column.
ID <- c(1,2,3,4,5,6,7,8,9,10)
X1 <- c(3,1,1,1,2,1,2,1,3,4)
X2 <- c(1,2,1,3,2,2,4,1,2,4)
X3 <- c(1,1,1,3,2,3,3,2,1,4)
X4 <- c(1,4,1,1,1,4,3,1,4,4)
Mydata <- data.frame(ID, X1,X2,X3,X4)
I need to create two more columns: 1) Max, and 2) Var
1) Max column: for each row that has ONLY ONE maximum, save that maximum value in Max. If the row has more than one maximum, Max should be 999.
2) Var column: for the rows with only one maximum, I need to know whether it came from X1, X2, X3, or X4.
For the above dataset, here is the output:
ID X1 X2 X3 X4 Max Var
1 3 1 1 1 3 X1
2 1 2 1 4 4 X4
3 1 1 1 1 999 NA
4 1 3 3 1 999 NA
5 2 2 2 1 999 NA
6 1 2 3 4 4 X4
7 2 4 3 3 4 X2
8 1 1 2 1 2 X3
9 3 2 1 4 4 X4
10 4 4 4 4 999 NA

We could get the column names of 'Mydata' corresponding to the maximum value in each row (excluding the 'ID' column) using max.col ('Var'), and the maximum value per row with pmax ('Max'). Then create a logical index for rows that have more than one maximum value ('indx') and use it with ifelse to get the expected output.
Var <- names(Mydata[-1])[max.col(Mydata[-1])]
Max <- do.call(pmax,Mydata[-1])
indx <- rowSums(Mydata[-1]==Max)>1
transform(Mydata, Var= ifelse(indx, NA, Var), Max=ifelse(indx, 999, Max))
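If you prefer dplyr, here is a rowwise() sketch of the same logic (not part of the answer above; it assumes dplyr >= 1.0 for c_across()):
library(dplyr)
Mydata %>%
  rowwise() %>%
  # count the ties per row first, then decide Max and Var
  mutate(m = max(c_across(X1:X4)),
         ties = sum(c_across(X1:X4) == m),
         Max = if (ties > 1) 999 else m,
         Var = if (ties > 1) NA_character_ else c("X1", "X2", "X3", "X4")[which.max(c_across(X1:X4))]) %>%
  ungroup() %>%
  select(-m, -ties)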

Here's another possible apply solution
MyFunc <- function(x) {
  Max <- max(x)
  if (sum(x == Max) > 1L) {
    Max <- 999
    Var <- NA
  } else {
    Var <- which.max(x)
  }
  c(Max, Var)
}
Mydata[c("Max", "Var")] <- t(apply(Mydata[-1], 1, MyFunc))
# ID X1 X2 X3 X4 Max Var
# 1 1 3 1 1 1 3 1
# 2 2 1 2 1 4 4 4
# 3 3 1 1 1 1 999 NA
# 4 4 1 3 3 1 999 NA
# 5 5 2 2 2 1 999 NA
# 6 6 1 2 3 4 4 4
# 7 7 2 4 3 3 4 2
# 8 8 1 1 2 1 2 3
# 9 9 3 2 1 4 4 4
# 10 10 4 4 4 4 999 NA
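This stores Var as a column index (1-4) rather than a column name. If you want the names instead, one extra line on top of the answer above (a small addition, assuming the columns are in the order shown) converts it:
# map the stored index back to the X column names; NA stays NA
Mydata$Var <- names(Mydata)[2:5][Mydata$Var]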

I would break this down into some small steps, which may not be the most efficient approach but would at least give you a starting point to work from if efficiency were an issue for your real problem.
First, compute the row maxes:
maxs <- apply(Mydata[, -1], 1, max)
> maxs
[1] 3 4 1 3 2 4 4 2 4 4
Next, compute which values in each row equal the row maximum:
wMax <- apply(Mydata[, -1], 1, function(x) which(x == max(x)))
This gives a list, which we can sapply() over to get the number of values equalling the maximum:
nMax <- sapply(wMax, length)
> nMax
[1] 1 1 4 2 3 1 1 1 1 4
Now add the Max & Var columns:
Mydata$Max <- ifelse(nMax > 1L, 999, maxs)
Mydata$Var <- ifelse(nMax > 1L, NA, sapply(wMax, `[[`, 1))
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
This isn't going to win any prizes for elegant use of the language, but it works and you can build off of it.
(That last line creating Var needs a little explanation: wMax is actually a list. We want the first element of each component of that list (because those will be the only maximums), and the sapply() call produces that.)
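As a tiny standalone illustration of that sapply() idiom (the list values here are made up, just to show the extraction):
wExample <- list(2L, c(1L, 3L), 4L)
sapply(wExample, `[[`, 1)
# [1] 2 1 4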
Now we can write a function that incorporates all the steps for you:
MaxVar <- function(x, na.rm = FALSE) {
  ## compute the max
  maxx <- max(x, na.rm = na.rm)
  ## which elements equal the max
  wmax <- which(x == maxx)
  ## how many equal the max
  nmax <- length(wmax)
  ## return
  out <- if (nmax > 1L) {
    c(999, NA)
  } else {
    c(maxx, wmax)
  }
  out
}
And use it like this:
> new <- apply(Mydata[, -1], 1, MaxVar)
> new
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 3 4 999 999 999 4 4 2 4 999
[2,] 1 4 NA NA NA 4 2 3 4 NA
> Mydata <- cbind(Mydata, Max = new[1, ], Var = new[2, ])
> Mydata
ID X1 X2 X3 X4 Max Var
1 1 3 1 1 1 3 1
2 2 1 2 1 4 4 4
3 3 1 1 1 1 999 NA
4 4 1 3 3 1 999 NA
5 5 2 2 2 1 999 NA
6 6 1 2 3 4 4 4
7 7 2 4 3 3 4 2
8 8 1 1 2 1 2 3
9 9 3 2 1 4 4 4
10 10 4 4 4 4 999 NA
Again, not the most elegant or efficient of code, but it works and it's easy to see what it is doing.

Yet another way to do this using apply
Mydata$Max = apply(Mydata[, -1], 1,
                   function(x) { m = max(x); if (sum(x == m) > 1) 999 else m })
Mydata$Var = apply(Mydata[, -1], 1,
                   function(x) { if (sum(x == max(x)) > 1) NA else names(x)[which.max(x)] })
#> Mydata
#ID X1 X2 X3 X4 Max Var
#1 1 3 1 1 1 3 X1
#2 2 1 2 1 4 4 X4
#3 3 1 1 1 1 999 <NA>
#4 4 1 3 3 1 999 <NA>
#5 5 2 2 2 1 999 <NA>
#6 6 1 2 3 4 4 X4
#7 7 2 4 3 3 4 X2
#8 8 1 1 2 1 2 X3
#9 9 3 2 1 4 4 X4
#10 10 4 4 4 4 999 <NA>

Related

Issue of generating conditional numbers to a set frequency in R

I am having an issue generating conditional numbers. The repeat frequency of each number is given in "size": for example, 1 should be repeated 3 times, 2 should be repeated 2 times, and so on.
My desired output is shown below, but I am unable to achieve it. Can somebody correct me, please?
Desired output
x1
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 5
10 5
data <- data.frame(x1= rep(c(1),each=10))
data
size <- as.array(c(3,2,1,2,2))
for(i in 1:5) {
x_val <- size[i]
new <- rep(c(x_val), each=x_val)
data[nrow(size[i]) + 1, ] <- new
}
print(data)
x1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
We could use rep with times
data.frame(x1 = rep(seq_along(size), size))
-output
x1
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 5
10 5
If we need a for loop
x1 <- c()
for(i in seq_along(size)) x1 <- c(x1, rep(i, each = size[i]))
x1
#[1] 1 1 1 2 2 3 4 4 5 5
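The same result can also be built without growing a vector inside the loop, by creating a list and flattening it (a small sketch using the same size vector):
data.frame(x1 = unlist(lapply(seq_along(size), function(i) rep(i, size[i]))))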

How to get consecutive rank for multiple variables [duplicate]

This question already has answers here:
Create a ranking variable with dplyr?
(3 answers)
Closed 3 years ago.
I have a data set with 5 varieties (var) and 3 variables (x, y, z). I need to rank the varieties for each of the 3 variables. When there is a tie, the ranking leaves a gap before starting the following rank, so I cannot get consecutive ranks. Here is my data:
x<-c(3,3,4,5,5)
y<-c(5,6,4,4,5)
z<-c(2,3,4,3,5)
df<-cbind(x,y,z)
rownames(df) <- paste0("G", 1:nrow(df))
df <- data.frame(var = row.names(df), df)
I tried the following code for my result
res <- sapply(df, rank,ties.method='min')
res
var x y z
[1,] 1 1 3 1
[2,] 2 1 5 2
[3,] 3 3 1 4
[4,] 4 4 1 2
[5,] 5 4 3 5
For the x variable I got ranks 1 1 3 4 4 instead of 1 1 2 3 3, and the same happens for y and z.
My desired result is
>res
var x y z
[1,] 1 1 2 1
[2,] 2 1 3 2
[3,] 3 2 1 3
[4,] 4 3 1 2
[5,] 5 3 2 4
I will be grateful if anyone helps me.
Well, an easy way would be to convert to factor and then integer
df[] <- lapply(df, function(x) as.integer(factor(x)))
df
# var x y z
#G1 1 1 2 1
#G2 2 1 3 2
#G3 3 2 1 3
#G4 4 3 1 2
#G5 5 3 2 4
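Note that this converts the var column to integers as well (as the output above shows). If you want to keep var as it is, restrict the conversion to the numeric columns, e.g.:
df[-1] <- lapply(df[-1], function(x) as.integer(factor(x)))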
One dplyr possibility could be:
df %>%
mutate_at(2:4, list(~ dense_rank(.)))
var x y z
1 G1 1 2 1
2 G2 1 3 2
3 G3 2 1 3
4 G4 3 1 2
5 G5 3 2 4
Or a base R possibility:
df[2:4] <- lapply(df[2:4], function(x) match(x, sort(unique(x))))
We can use data.table
library(data.table)
library(dplyr) # dense_rank() comes from dplyr
setDT(df)[, (2:4) := lapply(.SD, dense_rank), .SDcols = 2:4]
df
# var x y z
#1: G1 1 2 1
#2: G2 1 3 2
#3: G3 2 1 3
#4: G4 3 1 2
#5: G5 3 2 4

Applying custom function to each row uses only first value of argument

I am trying to recode NA values to 0 in a subset of columns using the following dataset:
set.seed(1)
df <- data.frame(
  id = c(1:10),
  trials = sample(1:3, 10, replace = T),
  t1 = c(sample(c(1:9, NA), 10)),
  t2 = c(sample(c(1:7, rep(NA, 3)), 10)),
  t3 = c(sample(c(1:5, rep(NA, 5)), 10))
)
Each row has a certain number of trials associated with it (between 1-3), specified by the trials column. columns t1-t3 represent scores for each trial.
The number of trials indicates the subset of columns in which NAs should be recoded to 0: NAs that are within the number of trials represent missing data, and should be recoded as 0, while NAs outside the number of trials are not meaningful, and should remain NAs. So, for a row where trials == 3, an NA in column t3 would be recoded as 0, but in a row where trials == 2, an NA in t3 would remain an NA.
So, I tried using this function:
replace0 <- function(x, num.sun) {
  x[which(is.na(x[1:(num.sun + 2)]))] <- 0
  return(x)
}
This works well for single vectors. When I try applying the same function to a data frame with apply(), though:
apply(df, 1, replace0, num.sun = df$trials)
I get a warning saying:
In 1:(num.sun + 2) :
numerical expression has 10 elements: only the first used
The result is that instead of having the value of num.sun change every row according to the value in trials, apply() simply uses the first value in the trials column for every single row. How could I apply the function so that the num.sun argument changes according to the value of df$trials?
Thanks!
Edit: as some have commented, the original example data had some non-NA scores that didn't make sense according to the trials column. Here's a corrected dataset:
df <- data.frame(
  id = c(1:5),
  trials = c(rep(1, 2), rep(2, 1), rep(3, 2)),
  t1 = c(NA, 7, NA, 6, NA),
  t2 = c(NA, NA, 3, 7, 12),
  t3 = c(NA, NA, NA, 4, NA)
)
Another approach:
# create an index of the NA values
w <- which(is.na(df), arr.ind = TRUE)
# create an index with the max column by row where an NA is allowed to be replaced by a zero
m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
# subset 'w' such that only the NA's which fall in the scope of 'm' remain
i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
# use 'i' to replace the allowed NA's with a zero
df[i] <- 0
which gives:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
You could easily wrap this in a function:
replace.NA.with.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  i <- w[w[,2] <= m[,2][match(w[,1], m[,1])],]
  df[i] <- 0
  return(df)
}
Now, using replace.NA.with.0(df) will produce the above result.
As noted by others, some rows (1, 3 & 10) have more non-NA values than their trials allow. You could tackle that problem by rewriting the above function as:
replace.with.NA.or.0 <- function(df) {
  w <- which(is.na(df), arr.ind = TRUE)
  df[w] <- 0
  m <- matrix(c(1:nrow(df), (df$trials + 2)), ncol = 2)
  v <- tapply(m[,2], m[,1], FUN = function(x) tail(x:5, -1))
  ina <- matrix(as.integer(unlist(stack(v)[2:1])), ncol = 2)
  df[ina] <- NA
  return(df)
}
Now, using replace.with.NA.or.0(df) produces the following result:
id trials t1 t2 t3
1 1 1 3 NA NA
2 2 2 2 2 NA
3 3 2 6 6 NA
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 NA
9 9 2 1 3 NA
10 10 1 9 NA NA
Here I just rewrite your function using double subsetting, x[paste0('t', x['trials'])], which overcomes the problem the other two solutions have with row 6:
replace0 <- function(x) {
  x_na <- x[paste0('t', x['trials'])]
  if (is.na(x_na)) { x[paste0('t', x['trials'])] <- 0 }
  return(x)
}
t(apply(df, 1, replace0))
id trials t1 t2 t3
[1,] 1 1 3 NA 5
[2,] 2 2 2 2 NA
[3,] 3 2 6 6 4
[4,] 4 3 NA 1 2
[5,] 5 1 5 NA NA
[6,] 6 3 7 NA 0
[7,] 7 3 8 7 0
[8,] 8 2 4 5 1
[9,] 9 2 1 3 NA
[10,] 10 1 9 4 3
Here is a way to do it:
x <- is.na(df)
df[x & t(apply(x, 1, cumsum)) > 3 - df$trials] <- 0
The output looks like this:
> df
id trials t1 t2 t3
1 1 1 3 NA 5
2 2 2 2 2 NA
3 3 2 6 6 4
4 4 3 0 1 2
5 5 1 5 NA NA
6 6 3 7 0 0
7 7 3 8 7 0
8 8 2 4 5 1
9 9 2 1 3 NA
10 10 1 9 4 3
Note: rows 1, 3 and 10 are problematic, since they have more non-NA values than their trials allow.
Here's a tidyverse way; note that it doesn't give the same output as the other solutions.
Your example data shows results for trials that "didn't happen"; I assumed your real data doesn't.
library(tidyverse)
df %>%
  nest(matches("^t\\d")) %>%
  mutate(data = map2(data, trials, ~ mutate_all(., replace_na, 0) %>% select(., 1:.y))) %>%
  unnest
# id trials t1 t2 t3
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
Using the more commonly used gather strategy this would be:
df %>%
  gather(k, v, matches("^t\\d")) %>%
  arrange(id) %>%
  group_by(id) %>%
  slice(1:first(trials)) %>%
  mutate_at("v", ~ replace(., is.na(.), 0)) %>%
  spread(k, v)
# # A tibble: 10 x 5
# # Groups: id [10]
# id trials t1 t2 t3
# <int> <int> <dbl> <dbl> <dbl>
# 1 1 1 3 NA NA
# 2 2 2 2 2 NA
# 3 3 2 6 6 NA
# 4 4 3 0 1 2
# 5 5 1 5 NA NA
# 6 6 3 7 0 0
# 7 7 3 8 7 0
# 8 8 2 4 5 NA
# 9 9 2 1 3 NA
# 10 10 1 9 NA NA
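gather() and spread() are superseded in current tidyr; the same pipeline written with pivot_longer()/pivot_wider() (a sketch, assuming tidyr >= 1.0) would be:
df %>%
  pivot_longer(matches("^t\\d"), names_to = "k", values_to = "v") %>%
  group_by(id) %>%
  slice(1:first(trials)) %>%
  mutate(v = replace(v, is.na(v), 0)) %>%
  ungroup() %>%
  pivot_wider(names_from = k, values_from = v)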

How to filter multiple columns using a single criterion

I have
4 5 6 7
1 3 3 3 3
2 1 2 2 1
3 2 1 1 NA
4 2 7 1 NA
5 1 1 1 1
I want to filter rows with either 2 or 3 in columns 1 to 4 so I only get rows 1,2,4
I tried
df1%>%filter_at(vars(4:7), all_vars(c(2,3)) -> df2
which returns
Error in filter_impl(.data, quo) : Result must have length 413, not 2
and
filter(d1[4:7]%in%c(1,3))
which returns
Error in filter_impl(.data, quo) : Result must have length 413, not 4
I want to avoid using
df1%>%filter(rowname1%in%c(1,3)|rowname1%in%c(1,3)| ...)
I don't get the syntax. Thanks
We can use any_vars and %in% to achieve this task.
library(dplyr)
df1 %>% filter_at(vars(1:4), any_vars(. %in% c(2, 3)))
# X4 X5 X6 X7
# 1 3 3 3 3
# 2 1 2 2 1
# 3 2 1 1 NA
# 4 2 7 1 NA
Or use == with |.
df1 %>% filter_at(vars(1:4), any_vars(. == 2 | . == 3))
# X4 X5 X6 X7
# 1 3 3 3 3
# 2 1 2 2 1
# 3 2 1 1 NA
# 4 2 7 1 NA
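As a side note, filter_at() is superseded in more recent dplyr versions; the same filter can be written with if_any() (a sketch, assuming dplyr >= 1.0.4):
df1 %>% filter(if_any(1:4, ~ .x %in% c(2, 3)))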
DATA
df1 <- read.table(text = " 4 5 6 7
1 3 3 3 3
2 1 2 2 1
3 2 1 1 NA
4 2 7 1 NA
5 1 1 1 1",
header = TRUE, stringsAsFactors = FALSE)

Exclude a Specific Value from a Unique Value Counter

I am trying to count how many different responses a person gives during a trial of an experiment, but there is a catch.
There are supposed to be 6 possible responses (1,2,3,4,5,6) BUT sometimes 0 is recorded as a response (it's a glitch / flaw in design).
I need to count the number of different responses they give, BUT ONLY counting unique values within the range 1-6. This helps us calculate their accuracy.
Is there a way to exclude the value 0 from contributing to a unique value counter? Any other work-arounds?
Currently I am trying the method below, but it counts 0, NA, and I think any other cell entry toward the unique value counter column (which I have named "Span6"), which makes me sad.
# My Span6 calculator:
ASixImageTrials <- data.frame(eSOPT_831$T8.RESP, eSOPT_831$T9.RESP, eSOPT_831$T10.RESP, eSOPT_831$T11.RESP, eSOPT_831$T12.RESP, eSOPT_831$T13.RESP)
ASixImageTrials$Span6 = apply(ASixImageTrials, 1, function(x) length(unique(x)))
Use na.omit inside unique and sum the logical vector, as below:
df$res = apply(df, 1, function(x) sum(unique(na.omit(x)) > 0))
df
Output:
X1 X2 X3 X4 X5 res
1 2 1 1 2 1 2
2 3 0 1 1 2 3
3 3 NA 1 1 3 2
4 3 3 3 4 NA 2
5 1 1 0 NA 3 2
6 3 NA NA 1 1 2
7 2 0 2 3 0 2
8 0 2 2 2 1 2
9 3 2 3 0 NA 2
10 0 2 3 2 2 2
11 2 2 1 2 1 2
12 0 2 2 2 NA 1
13 0 1 4 3 2 4
14 2 2 1 1 NA 2
15 3 NA 2 2 NA 2
16 2 2 NA 3 NA 2
17 2 3 2 2 2 2
18 2 NA 3 2 2 2
19 NA 4 5 1 3 4
20 3 1 2 1 NA 3
Data:
set.seed(752)
mat <- matrix(rbinom(100, 10, .2), nrow = 20)
mat[sample(1:100, 15)] = NA
data.frame(mat) -> df
Could you edit your question and clarify why this doesn't solve your problem?
# here is a numeric vector with a bunch of numbers
mtcars$carb
# here is how to limit that vector to only 1-6
mtcars$carb[ mtcars$carb %in% 1:6 ]
# here is how to tabulate that result
table( mtcars$carb[ mtcars$carb %in% 1:6 ] )
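Putting the two ideas together for the row-wise counter in your question, an alternative to the Span6 line above (a sketch using the ASixImageTrials frame as defined there) would be:
# count distinct responses per row, keeping only values in 1:6 (drops 0 and NA)
ASixImageTrials$Span6 <- apply(ASixImageTrials, 1, function(x) length(unique(x[x %in% 1:6])))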
