How to get consecutive rank for multiple variables [duplicate] - r

This question already has answers here:
Create a ranking variable with dplyr?
(3 answers)
Closed 3 years ago.
I have a data set with 5 varieties (var) and 3 variables (x, y, z). I need to rank the varieties on each of the 3 variables. When there is a tie, rank() leaves a gap before the next rank, so I cannot get consecutive ranks. Here is my data
x<-c(3,3,4,5,5)
y<-c(5,6,4,4,5)
z<-c(2,3,4,3,5)
df<-cbind(x,y,z)
rownames(df) <- paste0("G", 1:nrow(df))
df <- data.frame(var = row.names(df), df)
I tried the following code for my result
res <- sapply(df, rank,ties.method='min')
res
var x y z
[1,] 1 1 3 1
[2,] 2 1 5 2
[3,] 3 3 1 4
[4,] 4 4 1 2
[5,] 5 4 3 5
For x I got the ranks 1 1 3 4 4 instead of 1 1 2 3 3. The same thing happened for y and z.
My desired result is
>res
var x y z
[1,] 1 1 2 1
[2,] 2 1 3 2
[3,] 3 2 1 3
[4,] 4 3 1 2
[5,] 5 3 2 4
I will be grateful if anyone helps me.

Well, an easy way would be to convert to factor and then integer
df[] <- lapply(df, function(x) as.integer(factor(x)))
df
# var x y z
#G1 1 1 2 1
#G2 2 1 3 2
#G3 3 2 1 3
#G4 4 3 1 2
#G5 5 3 2 4
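To make the difference concrete, here is a minimal sketch on the question's x column only. It relies on factor() using the sorted unique values as levels, so the integer codes behave like consecutive ranks. The variable names are just for illustration.

```r
# The question's x column
x <- c(3, 3, 4, 5, 5)

# "min" ranking: tied values share the lowest rank, and a gap follows the tie
gap_rank <- rank(x, ties.method = "min")

# factor() sorts the unique values into levels; the integer codes have no gaps
consecutive_rank <- as.integer(factor(x))

print(gap_rank)          # 1 1 3 4 4
print(consecutive_rank)  # 1 1 2 3 3
```

Note that a character column such as var gets its levels in alphabetical order, which is why G1 to G5 become 1 to 5 in the output above.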

One dplyr possibility could be:
library(dplyr)
df %>%
  mutate_at(2:4, list(~ dense_rank(.)))
var x y z
1 G1 1 2 1
2 G2 1 3 2
3 G3 2 1 3
4 G4 3 1 2
5 G5 3 2 4
Or a base R possibility:
df[2:4] <- lapply(df[2:4], function(x) match(x, sort(unique(x))))
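If the largest value should get rank 1 (an assumption, since the question ranks in ascending order), the same match() idea might be adapted by sorting the unique values in decreasing order. The helper name dense_rank_desc is hypothetical, not something from base R or dplyr.

```r
# Hypothetical helper: consecutive ranks where the largest value gets rank 1
dense_rank_desc <- function(v) {
  # sort() drops NA by default, so NA inputs come back as NA ranks
  match(v, sort(unique(v), decreasing = TRUE))
}

# The question's y column
y <- c(5, 6, 4, 4, 5)
ranks_y <- dense_rank_desc(y)
print(ranks_y)  # 2 1 3 3 2
```

It could be applied to several columns the same way as above, e.g. with lapply(df[2:4], dense_rank_desc).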

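We can use data.table; its frank() with ties.method = "dense" gives consecutive ranks (dense_rank() is a dplyr function, so it is not available with data.table alone)
library(data.table)
setDT(df)[, (2:4) := lapply(.SD, frank, ties.method = "dense"), .SDcols = 2:4]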
df
# var x y z
#1: G1 1 2 1
#2: G2 1 3 2
#3: G3 2 1 3
#4: G4 3 1 2
#5: G5 3 2 4


Issue of generating conditional numbers to a set frequency in R

I am having an issue generating conditional numbers. The number of times each value should be repeated is given in "size". For example, 1 should be repeated 3 times, 2 should be repeated 2 times, and so on.
My desired output is shown below but I am unable to achieve this. Can somebody correct me please?
Desired output
x1
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 5
10 5
data <- data.frame(x1= rep(c(1),each=10))
data
size <- as.array(c(3,2,1,2,2))
for(i in 1:5) {
x_val <- size[i]
new <- rep(c(x_val), each=x_val)
data[nrow(size[i]) + 1, ] <- new
}
print(data)
x1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
We could use rep with times
data.frame(x1 = rep(seq_along(size), size))
Output:
x1
1 1
2 1
3 1
4 2
5 2
6 3
7 4
8 4
9 5
10 5
If we need a for loop
x1 <- c()
for(i in seq_along(size)) x1 <- c(x1, rep(i, each = size[i]))
x1
#[1] 1 1 1 2 2 3 4 4 5 5
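If a loop is really wanted, one possible variation is to allocate the result once and fill each block by position instead of growing the vector with c(). This is only a sketch; the names ends and starts are made up for the example.

```r
size <- c(3, 2, 1, 2, 2)

# Last and first position of each block in the result
ends   <- cumsum(size)
starts <- ends - size + 1

# Allocate the full result once, then fill block i with the value i
x1 <- integer(sum(size))
for (i in seq_along(size)) {
  # seq_len() yields an empty index when size[i] is 0, so that block is skipped
  x1[starts[i] + seq_len(size[i]) - 1] <- i
}

print(x1)  # 1 1 1 2 2 3 4 4 5 5
```

The result is the same as rep(seq_along(size), size), which remains the simpler choice.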

R: how to obtain unique pairwise combinations of 2 vectors [duplicate]

This question already has answers here:
How to generate permutations or combinations of object in R?
(3 answers)
Closed 2 years ago.
x = 1:3
y = 1:3
> expand.grid(x = 1:3, y = 1:3)
x y
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
Using expand.grid gives me all of the combinations. However, I want only pairwise comparisons, that is, I don't want a comparison of 1 vs 1, 2 vs 2, or 3 vs 3. Moreover, I want to keep only the unique pairs, i.e., I want to keep 1 vs 2 (and not 2 vs 1).
In summary, for the above x and y, I want the following 3 pairwise combinations:
x y
1 1 2
2 1 3
3 2 3
Similarly, for x = y = 1:4, I want the following pairwise combinations:
x y
1 1 2
2 1 3
3 1 4
4 2 3
5 2 4
6 3 4
We can use combn
f1 <- function(x) setNames(as.data.frame(t(combn(x, 2))), c("x", "y"))
f1(1:3)
# x y
#1 1 2
#2 1 3
#3 2 3
f1(1:4)
# x y
#1 1 2
#2 1 3
#3 1 4
#4 2 3
#5 2 4
#6 3 4
Using data.table,
library(data.table)
x <- 1:4
y <- 1:4
CJ(x, y)[x < y]
x y
1: 1 2
2: 1 3
3: 1 4
4: 2 3
5: 2 4
6: 3 4
Actually you are already very close to the desired output. You may need subset as well
> subset(expand.grid(x = x, y = y), x < y)
x y
4 1 2
7 1 3
8 2 3
Here is another option but with longer code
v <- letters[1:5] # dummy data vector
mat <- diag(length(v))
inds <- upper.tri(mat)
data.frame(
x = v[row(mat)[inds]],
y = v[col(mat)[inds]]
)
which gives
x y
1 a b
2 a c
3 b c
4 a d
5 b d
6 c d
7 a e
8 b e
9 c e
10 d e
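As a quick sanity check, the combn and expand.grid approaches can be compared directly in base R. The sketch below assumes x = 1:4 and sorts the expand.grid result first, because the two approaches return the pairs in different orders.

```r
x <- 1:4

# Each row of the transposed combn() result is one unordered pair
from_combn <- t(combn(x, 2))

# Keep only the grid rows where the first element is smaller than the second
from_grid <- subset(expand.grid(x = x, y = x), x < y)
from_grid <- from_grid[order(from_grid$x, from_grid$y), ]

# Both should contain choose(n, 2) pairs, in the same order after sorting
n_pairs <- nrow(from_combn)
same_pairs <- all(from_combn[, 1] == from_grid$x) &&
  all(from_combn[, 2] == from_grid$y)

print(n_pairs)     # 6
print(same_pairs)  # TRUE
```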

Shifting rows up in columns and flush remaining ones

I have a problem with shifting rows up by one row. When a row becomes completely NA, I would like to drop it (see the expected output below). My current approach, however, still keeps the second row of each group.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see, I still have the second row in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
library(dplyr)
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
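The same idea can be sketched in base R with aggregate(), assuming (as above) that each column has at most one non-NA value per gr. The helper name first_non_na is made up for this example.

```r
data <- data.frame(gr = rep(1:3, each = 2),
                   A = c(1, NA, 2, NA, 4, NA),
                   B = c(NA, 1, NA, 3, NA, 7),
                   C = c(1, NA, 4, NA, 5, NA))

# Hypothetical helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(v) v[!is.na(v)][1]

# Collapse each group to a single row, column by column
out <- aggregate(data[c("A", "B", "C")], by = data["gr"], FUN = first_non_na)
print(out)
```

As for the attempt in the question: apply() there seems to return vectors of different lengths (6 values for gr, 3 for each of the other columns), and data.frame() then recycles the shorter ones, which would explain the repeated rows.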
You can do it with dplyr and tidyr like this (fill() comes from tidyr):
library(dplyr)
library(tidyr)
data$ind <- rep(c(1, 2), length.out = nrow(data))
data %>% fill(A, B, C) %>% filter(ind == 2) %>% mutate(ind = NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution, using data.table together with na.locf() from zoo:
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5

Select rows of data frame based on a vector with duplicated values

What I want can be described as follows: I am given a data frame that contains all the case-control pairs. In the following example, y is the id of the case-control pair. There are 3 pairs in my data set. I'm resampling with respect to the values of y (both members of a pair are selected, or neither is).
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
> sample_df
x y
1 1 1
2 2 1
3 3 2
4 4 2
5 5 3
6 6 3
select_y = c(1,3,3)
select_y
> select_y
[1] 1 3 3
Now, I have computed a vector containing the pairs I want to resample, which is select_y above. It means that case-control pair number 1 will be in my new sample, and pair number 3 will also be in my new sample, but it will occur 2 times since 3 appears twice in select_y. The desired output is:
x y
1 1
2 1
5 3
6 3
5 3
6 3
I can't find out an efficient way other than writing a for loop...
Solution:
Based on @HubertL's answer, with some modifications, a 'vectorized' approach looks like:
sel_y <- as.data.frame(table(select_y))
> sel_y
select_y Freq
1 1 1
2 3 2
sub_sample_df = sample_df[sample_df$y%in%select_y,]
> sub_sample_df
x y
1 1 1
2 2 1
5 5 3
6 6 3
match_freq = sel_y[match(sub_sample_df$y, sel_y$select_y),]
> match_freq
select_y Freq
1 1 1
1.1 1 1
2 3 2
2.1 3 2
sub_sample_df$Freq = match_freq$Freq
rownames(sub_sample_df) = NULL
sub_sample_df
> sub_sample_df
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
4 6 3 2
selected_rows = rep(1:nrow(sub_sample_df), sub_sample_df$Freq)
> selected_rows
[1] 1 2 3 3 4 4
sub_sample_df[selected_rows,]
x y Freq
1 1 1 1
2 2 1 1
3 5 3 2
3.1 5 3 2
4 6 3 2
4.1 6 3 2
Another method of doing the same without a loop:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
row_names <- split(1:nrow(sample_df),sample_df$y)
select_y = c(1,3,3)
row_num <- unlist(row_names[as.character(select_y)])
ans <- sample_df[row_num,]
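For reference, here is a sketch of what the split-based lookup returns for this data. It assumes the pair ids in y convert cleanly with as.character() (true for small whole numbers like these, not necessarily for arbitrary doubles). The names rows_by_pair and resampled are illustrative.

```r
sample_df <- data.frame(x = 1:6, y = c(1, 1, 2, 2, 3, 3))
select_y  <- c(1, 3, 3)

# Row indices of each pair, keyed by the pair id as a character string
rows_by_pair <- split(seq_len(nrow(sample_df)), sample_df$y)

# Look up each selected pair; a pair selected twice contributes its rows twice
row_num <- unlist(rows_by_pair[as.character(select_y)], use.names = FALSE)

resampled <- sample_df[row_num, ]
rownames(resampled) <- NULL
print(resampled)
```

The x column of the result is 1 2 5 6 5 6, which matches the desired output in the question.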
I can't find a way without a loop, but at least it's not a for loop, and there is only one iteration per frequency:
sample_df = data.frame(x=1:6, y=c(1,1,2,2,3,3))
select_y = c(1,3,3)
sel_y <- as.data.frame(table(select_y))
do.call(rbind,
lapply(1:max(sel_y$Freq),
function(freq) sample_df[sample_df$y %in%
sel_y[sel_y$Freq>=freq, "select_y"],]))
x y
1 1 1
2 2 1
5 5 3
6 6 3
51 5 3
61 6 3

columnwise sum matching values to another column

Seems, I am missing some link here.
I have data frame
df<-data.frame(w=sample(1:3,10, replace=T), x=sample(1:3,10, replace=T), y=sample(1:3,10, replace=T), z=sample(1:3,10, replace=T))
> df
w x y z
1 3 1 1 3
2 2 1 1 3
3 1 3 2 2
4 3 1 3 1
5 2 2 1 1
6 1 2 2 3
7 1 2 2 2
8 2 2 2 3
9 1 3 3 3
10 2 2 1 1
For each column, I want to count the rows whose value matches the value in the 1st column.
sum(df$w==df$x)
[1] 3
sum(df$w==df$y)
[1] 2
sum(df$w==df$z)
[1] 1
I know using apply, I can do rowwise or colwise operations.
apply(df,2,length)
w x y z
10 10 10 10
How do I combine these two functions?
Try colSums
colSums(df[-1] == df[, 1])
# x y z
# 3 2 1
Or, if you are into *apply loops, you could try
vapply(df[-1], function(x) sum(x == df[, 1]), double(1))
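Because the question builds df with sample() and no seed, rerunning it gives different numbers. The sketch below types in the values printed in the question instead, so the counts can be reproduced.

```r
# Values copied from the data frame printed in the question
df <- data.frame(w = c(3, 2, 1, 3, 2, 1, 1, 2, 1, 2),
                 x = c(1, 1, 3, 1, 2, 2, 2, 2, 3, 2),
                 y = c(1, 1, 2, 3, 1, 2, 2, 2, 3, 1),
                 z = c(3, 3, 2, 1, 1, 3, 2, 3, 3, 1))

# Compare every other column with w, then count the TRUE values per column
counts <- colSums(df[-1] == df$w)
print(counts)
# x y z
# 3 2 1
```

Calling set.seed() before sample() would be another way to make the original example reproducible.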
