Using `rank` across columns to create new variable - r

I have a question I can't figure out, which I'm almost certain involves rank. Let's say that I have a df in wide form with 3 variables with integer values.
id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60
I'd like to create three new variables with the rank of the values for var1, var2, and var3 from largest to smallest. For example,
id var1 var2 var3 var1_rank var2_rank var3_rank
1 23 8 30 2 3 1
2 1 2 3 3 2 1
3 4 5 1 2 1 3
4 100 80 60 1 2 3
How would I go about doing this? Thanks!

Get the example data:
test <- read.table(text="id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60",header=TRUE)
Get the ranks part and rename appropriately (notice the -x to reverse the rank so it relates to decreasing instead of increasing size - this will be generalisable to any size of data.frame used as input):
ranks <- t(apply(test[,-1], 1, function(x) rank(-x) ))
colnames(ranks) <- paste(colnames(ranks), "_rank", sep="")
Join with the old data frame.
data.frame(test, ranks)
Result:
> data.frame(test,ranks)
id var1 var2 var3 var1_rank var2_rank var3_rank
1 1 23 8 30 2 3 1
2 2 1 2 3 3 2 1
3 3 4 5 1 2 1 3
4 4 100 80 60 1 2 3
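If the real data can contain ties, keep in mind that rank() defaults to ties.method = "average", which can produce fractional ranks. Below is a minimal sketch of the same apply approach wrapped in a helper with the tie rule made explicit. The name row_ranks and the choice of "min" are assumptions for illustration, not part of the answer above.

```r
test <- read.table(text = "id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60", header = TRUE)

# row_ranks() is a hypothetical helper, not part of base R
row_ranks <- function(df, cols, ties = "min") {
  # rank each row in decreasing order; apply() returns one column per row,
  # so transpose back to one row per input row
  r <- t(apply(df[cols], 1, function(x) rank(-x, ties.method = ties)))
  colnames(r) <- paste0(cols, "_rank")
  cbind(df, r)
}

res <- row_ranks(test, c("var1", "var2", "var3"))
res
```

With ties = "min", tied values share the best rank among them; swap in "first" or "average" if that suits the data better.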
To get to @mnel's answer using base R, you could also do something like:
testres <- data.frame(test["id"],stack(test[2:4]))
testres$rank <- ave(testres$values,testres$id,FUN=function(x) rank(-x) )
> testres
id values ind rank
1 1 23 var1 2
2 2 1 var1 3
3 3 4 var1 2
4 4 100 var1 1
5 1 8 var2 3
6 2 2 var2 2
7 3 5 var2 1
8 4 80 var2 2
9 1 30 var3 1
10 2 3 var3 1
11 3 1 var3 3
12 4 60 var3 3
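If you want the long result back in wide form, base reshape() can do it. This is only a sketch built on the long-format code above; the object names long and wide are my own, and the resulting column names (values.var1, rank.var1, ...) come from reshape's default naming.

```r
test <- read.table(text = "id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60", header = TRUE)

# long format: one row per (id, variable) pair
long <- data.frame(test["id"], stack(test[2:4]))
long$ind <- as.character(long$ind)  # plain character labels for the time variable

# rank within each id, largest value gets rank 1
long$rank <- ave(long$values, long$id, FUN = function(x) rank(-x))

# back to wide: one values.* and one rank.* column per original variable
wide <- reshape(long, idvar = "id", timevar = "ind", direction = "wide")
wide
```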

I think it is easier to work in long format (and more memory efficient, as apply will coerce to a matrix). Here is an approach using reshape and data.table:
library(data.table)
tlong <- reshape(data.table(test), direction ='long', varying = list(2:4),
times = paste0('var',1:3), v.names = 'value')
# calculate the rank within each `id`
tlong[, rank := rank(-value), by = id]
tlong
## id time value rank
## 1: 1 var1 23 2
## 2: 2 var1 1 3
## 3: 3 var1 4 2
## 4: 4 var1 100 1
## 5: 1 var2 8 3
## 6: 2 var2 2 2
## 7: 3 var2 5 1
## 8: 4 var2 80 2
## 9: 1 var3 30 1
## 10: 2 var3 3 1
## 11: 3 var3 1 3
## 12: 4 var3 60 3
# reshape to wide (if you want)
# original variable names: var1, var2, var3
oldname <- paste0('var',1:3)
twide <- reshape(tlong, direction = 'wide', timevar = 'time', idvar = 'id')
# reorder from value.var1, rank.var1,... to value.var1, value.var2,....rank.var1, rank.var2
setcolorder(twide, c('id', paste('value', oldname, sep ='.'), paste('rank', oldname, sep = '.')))

Here's one approach:
data.frame(dat, 4 - t(apply(dat[, -1], 1, rank)))
## > data.frame(dat, 4 - t(apply(dat[, -1], 1, rank)))
## id var1 var2 var3 var1.1 var2.1 var3.1
## 1 1 23 8 30 2 3 1
## 2 2 1 2 3 3 2 1
## 3 3 4 5 1 2 1 3
## 4 4 100 80 60 1 2 3
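The 4 in that one-liner is the number of ranked columns plus one, so it only works for exactly three columns. A small sketch of the same trick without the hard-coded constant follows (vals, k, ranks_a and ranks_b are names I made up for illustration); it also checks that the result agrees with the rank(-x) approach.

```r
dat <- read.table(text = "id var1 var2 var3
1 23 8 30
2 1 2 3
3 4 5 1
4 100 80 60", header = TRUE)

vals <- dat[, -1]   # columns to rank within each row
k <- ncol(vals)     # number of ranked columns (3 here)

# reversing an increasing rank: (k + 1) - rank
ranks_a <- (k + 1) - t(apply(vals, 1, rank))

# ranking the negated values directly
ranks_b <- t(apply(vals, 1, function(x) rank(-x)))

all.equal(ranks_a, ranks_b)
```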

Related

Is there a way in R to subtract two rows within a group by specifying another grouping var?

Say I have something like this:
ID = c("a","a","a","a","a", "b","b","b","b","b")
Group = c("1","2","3","4","5", "1","2","3","4","5")
Value = c(3, 4,2,4,3, 6, 1, 8, 9, 10)
df<-data.frame(ID,Group,Value)
I want to subtract the Group 3 value from the Group 5 value within each ID, with an output column that holds this difference for each ID, like so:
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
Also, if that calculation cannot be done (i.e. group 5 is missing), NA values for the 'want' column would be ideal.
As there is only one unique 'Group' per 'ID', we can do subsetting
library(dplyr)
df %>%
group_by(ID) %>%
mutate(want = Value[Group == 5] - Value[Group == 3])
# A tibble: 10 x 4
# Groups: ID [2]
# ID Group Value want
# <fct> <fct> <dbl> <dbl>
# 1 a 1 3 1
# 2 a 2 4 1
# 3 a 3 2 1
# 4 a 4 4 1
# 5 a 5 3 1
# 6 b 1 6 2
# 7 b 2 1 2
# 8 b 3 8 2
# 9 b 4 9 2
#10 b 5 10 2
The above can be made more robust by converting the logical test to a numeric index with which() and taking its first element. When there is no TRUE, which(...)[1] is NA, so the lookup returns NA:
df %>%
slice(-10) %>%
group_by(ID) %>%
mutate(want = Value[which(Group == 5)[1]] - Value[which(Group == 3)[1]])
Or use match, which returns an NA index if there are no matches. Indexing with NA returns NA, which in turn makes the subtraction NA (e.g. NA - 3):
df %>%
slice(-10) %>% # removing the last row (ID b, Group 5)
group_by(ID) %>%
mutate(want = Value[match(5, Group)] - Value[match(3, Group)])
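To see why the match version yields NA when a group is missing, here is a base R sketch of the same lookup, run after dropping the (b, 5) row. The names diff_by_id and want are illustrative only; the dplyr code above is the answer's own.

```r
df <- data.frame(
  ID    = rep(c("a", "b"), each = 5),
  Group = rep(as.character(1:5), times = 2),
  Value = c(3, 4, 2, 4, 3, 6, 1, 8, 9, 10)
)
df <- df[-10, ]  # drop the row with ID b, Group 5

# per ID: Value at Group 5 minus Value at Group 3;
# match() gives NA when the group is absent, and Value[NA] is NA
diff_by_id <- sapply(split(df, df$ID), function(d) {
  d$Value[match("5", d$Group)] - d$Value[match("3", d$Group)]
})

# spread the per-ID result back onto every row
df$want <- unname(diff_by_id[as.character(df$ID)])
df
```

ID a keeps its difference of 1, while every row of ID b gets NA because its Group 5 row is gone.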
Here is a base R solution
dfout <- Reduce(rbind,
lapply(split(df,df$ID),
function(x) within(x, Want <-diff(subset(Value, Group %in% c("3","5"))))))
such that
> dfout
ID Group Value Want
1 a 1 3 1
2 a 2 4 1
3 a 3 2 1
4 a 4 4 1
5 a 5 3 1
6 b 1 6 2
7 b 2 1 2
8 b 3 8 2
9 b 4 9 2
10 b 5 10 2
A data.table method:
library(data.table)
setDT(df)[, want := (Value[Group == 5] - Value[Group == 3]), by = .(ID)]
df
# ID Group Value want
# 1: a 1 3 1
# 2: a 2 4 1
# 3: a 3 2 1
# 4: a 4 4 1
# 5: a 5 3 1
# 6: b 1 6 2
# 7: b 2 1 2
# 8: b 3 8 2
# 9: b 4 9 2
# 10: b 5 10 2
Here is a solution using base R.
unsplit(
lapply(
split(df, df$ID),
function(d) {
x5 = d$Value[d$Group == "5"]
x5 = ifelse(length(x5) == 1, x5, NA)
x3 = d$Value[d$Group == "3"]
x3 = ifelse(length(x3) == 1, x3, NA)
d$Want = x5 - x3
d
}),
df$ID)

Count distinct values that are not the same as the current row's values

Suppose I have a data frame:
df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))
which comes out to something like this:
SID Var1 Var2
1 4 A 2
2 3 A 2
3 4 A 3
4 3 A 3
5 1 A 4
6 1 B 2
7 3 B 2
8 4 B 4
9 4 B 4
10 3 B 2
11 2 C 2
12 2 C 2
13 4 C 4
14 2 C 4
15 3 C 3
What I hope to accomplish is to find the count of unique SIDs (see below under update, this should have said count of unique (SID, Var1) combinations) where the given row's Var1 is excluded from this count and the count is grouped on Var2. So for the example above, I would like to output:
SID Var1 Var2 Count.Excluding.Var1
1 4 A 2 3
2 3 A 2 3
3 4 A 3 1
4 3 A 3 1
5 1 A 4 3
6 1 B 2 3
7 3 B 2 3
8 4 B 4 3
9 4 B 4 3
10 3 B 2 3
11 2 C 2 4
12 2 C 2 4
13 4 C 4 2
14 2 C 4 2
15 3 C 3 2
For the 1st observation, we have a count of 3 because there are 3 unique combinations of (SID, Var1) for the given Var2 value (2, in this case) where Var1 != A (Var1 value of 1st observation) -- specifically, the count includes observation 6, 7 and 11, but not 12 because we already accounted for a (SID, Var1)=(2,C) and not row 2 because we do not want Var1 to be "A". All of these rows have the same Var2 value.
I'd preferably like to use dplyr functions and the %>% operator.
UPDATE
I apologize for the confusion and my incorrect explanation above. I have corrected what I intended to ask for in the parentheses, but I am leaving my original phrasing as well because the majority of answers seem to interpret it this way.
As for the example, I apologize for not setting the seed. There seems to have been some confusion with regard to the Count.Excluding.Var1 for rows 11 and 12. With unique (SID, Var1) combinations, rows 11 and 12 should make sense, as these count rows 1, 2, 6, and 7 xor 10.
A simple mapply can do the trick. But as OP requested for %>% based solution, an option could be as:
df %>% mutate(Count.Excluding.Var1 =
mapply(function(x,y)nrow(unique(df[df$Var1 != x & df$Var2 == y,1:2])),.$Var1,.$Var2))
# SID Var1 Var2 Count.Excluding.Var1
# 1 4 A 2 3
# 2 2 A 3 3
# 3 4 A 4 3
# 4 4 A 4 3
# 5 3 A 4 3
# 6 4 B 3 1
# 7 3 B 3 1
# 8 3 B 3 1
# 9 4 B 2 3
# 10 2 B 3 1
# 11 2 C 2 2
# 12 4 C 4 2
# 13 1 C 4 2
# 14 1 C 2 2
# 15 3 C 4 2
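Data:
The above results are based on the original data-generating code provided by the OP (no seed was set, so the sampled values differ from the table in the question):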
df <- data.frame(SID=sample(1:4,15,replace=T), Var1=c(rep("A",5),rep("B",5),rep("C",5)), Var2=sample(2:4,15,replace=T))
I could not think of a dplyr solution, but here's one with apply:
df$Count <- apply(df, 1, function(x) length(unique(df$SID[(df$Var1 != x['Var1']) & (df$Var2 == x['Var2'])])))
# SID Var1 Var2 Count
# 1 4 A 2 3
# 2 3 A 2 3
# 3 4 A 3 1
# 4 3 A 3 1
# 5 1 A 4 2
# 6 1 B 2 3
# 7 3 B 2 3
# 8 4 B 4 3
# 9 4 B 4 3
# 10 3 B 2 3
# 11 2 C 2 3
# 12 2 C 2 3
# 13 4 C 4 2
# 14 2 C 4 2
# 15 3 C 3 2
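Note that rows 5, 11 and 12 above differ from the desired output in the question, because this counts unique SID values rather than unique (SID, Var1) pairs. Here is a base R sketch computing both side by side on the hand-entered data from the question; sid_count and pair_count are names I chose for illustration.

```r
df <- data.frame(
  SID  = c(4L, 3L, 4L, 3L, 1L, 1L, 3L, 4L, 4L, 3L, 2L, 2L, 4L, 2L, 3L),
  Var1 = rep(c("A", "B", "C"), each = 5),
  Var2 = c(2L, 2L, 3L, 3L, 4L, 2L, 2L, 4L, 4L, 2L, 2L, 2L, 4L, 4L, 3L),
  stringsAsFactors = FALSE
)

# rows sharing Var2 with row i but having a different Var1
others <- function(i) df$Var2 == df$Var2[i] & df$Var1 != df$Var1[i]

# interpretation 1: distinct SID values among those rows
sid_count <- vapply(seq_len(nrow(df)),
                    function(i) length(unique(df$SID[others(i)])),
                    integer(1))

# interpretation 2: distinct (SID, Var1) pairs among those rows
pair_count <- vapply(seq_len(nrow(df)),
                     function(i) nrow(unique(df[others(i), c("SID", "Var1")])),
                     integer(1))

cbind(df, sid_count, pair_count)
```

The two columns agree everywhere except rows 5, 11 and 12, where the same SID appears under two different Var1 values.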
Here is a dplyr solution, as requested. For future reference, please use set.seed so we can reproduce your desired output with sample, else I have to enter data by hand...
I think this is your logic? You want the n_distinct(SID) for each Var2, but for each row, you want to exclude rows which have the same Var1 as the current row. So a key observation here is row 3, where a simple grouped summarise would yield a count of 2. Of the rows with Var2 = 3, row 3 has SID = 4, row 4 has SID = 3, row 15 has SID = 3, but we don't count row 3 or row 4, so final count is one unique SID.
Here we get first the count of unique SID for each Var2, then the count of unique SID for each Var1, Var2 combo. First count is too large by the amount of additional unique SID for each combo, so we subtract it and add one. There is an edge case where for a Var1, there is only one corresponding Var2. This should return 0 since you exclude all the possible values of SID. I added two rows to illustrate this.
library(tidyverse)
df <- read_table2(
"SID Var1 Var2
4 A 2
3 A 2
4 A 3
3 A 3
1 A 4
1 B 2
3 B 2
4 B 4
4 B 4
3 B 2
2 C 2
2 C 2
4 C 4
2 C 4
3 C 3
1 D 5
2 D 5"
)
df %>%
group_by(Var2) %>%
mutate(SID_per_Var2 = n_distinct(SID)) %>%
group_by(Var1, Var2) %>%
mutate(SID_per_Var1Var2 = n_distinct(SID)) %>%
ungroup() %>%
add_count(Var1) %>%
add_count(Var1, Var2) %>%
mutate(
Count.Excluding.Var1 = if_else(
n > nn,
SID_per_Var2 - SID_per_Var1Var2 + 1,
0
)
) %>%
select(SID, Var1, Var2, Count.Excluding.Var1)
#> # A tibble: 17 x 4
#> SID Var1 Var2 Count.Excluding.Var1
#> <int> <chr> <int> <dbl>
#> 1 4 A 2 3.
#> 2 3 A 2 3.
#> 3 4 A 3 1.
#> 4 3 A 3 1.
#> 5 1 A 4 3.
#> 6 1 B 2 3.
#> 7 3 B 2 3.
#> 8 4 B 4 3.
#> 9 4 B 4 3.
#> 10 3 B 2 3.
#> 11 2 C 2 4.
#> 12 2 C 2 4.
#> 13 4 C 4 2.
#> 14 2 C 4 2.
#> 15 3 C 3 2.
#> 16 1 D 5 0.
#> 17 2 D 5 0.
Created on 2018-04-12 by the reprex package (v0.2.0).
Here's a solution using purrr - you can wrap this in a mutate statement if you want, but I don't know that it adds much in this particular case.
library(dplyr) # provides %>%, filter() and distinct()
library(purrr)
df$Count.Excluding.Var1 = map_int(1:nrow(df), function(n) {
df %>% filter(Var2 == Var2[n], Var1 != Var1[n]) %>% distinct() %>% nrow()
})
(Updated with input from comments by Calum You. Thanks!)
A 100% tidyverse solution:
library(tidyverse) # dplyr + purrr
df %>%
group_by(Var2) %>%
mutate(count = map_int(Var1,~n_distinct(SID[.x!=Var1],Var1[.x!=Var1])))
# # A tibble: 15 x 4
# # Groups: Var2 [3]
# SID Var1 Var2 count
# <int> <chr> <int> <int>
# 1 4 A 2 3
# 2 3 A 2 3
# 3 4 A 3 1
# 4 3 A 3 1
# 5 1 A 4 3
# 6 1 B 2 3
# 7 3 B 2 3
# 8 4 B 4 3
# 9 4 B 4 3
# 10 3 B 2 3
# 11 2 C 2 4
# 12 2 C 2 4
# 13 4 C 4 2
# 14 2 C 4 2
# 15 3 C 3 2

How do I add column values based on matching IDs in R?

I have two data frames:
A:
ID Var1 Var2 Var3
1 0 3 4
2 1 5 0
3 1 6 7
B:
ID Var1 Var2 Var3
1 2 4 2
2 2 1 1
3 0 2 1
4 1 0 3
I want to add the columns from A and B based on matching ID's to get data frame C, and keep row 4 from B (even though it does not have a matching ID from A):
ID Var1 Var2 Var3
1 2 7 6
2 3 6 1
3 1 8 8
4 1 0 3
rbind and aggregate by ID:
aggregate(. ~ ID, data=rbind(A,B), sum)
# ID Var1 Var2 Var3
#1 1 2 7 6
#2 2 3 6 1
#3 3 1 8 8
#4 4 1 0 3
In data.table you can similarly do:
library(data.table)
setDT(rbind(A,B))[, lapply(.SD, sum), by=ID]
And there would be analogous solutions in dplyr and sql or whatever else. Bind the rows, group by ID, sum.
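For completeness, a self-contained sketch of the bind, group, sum recipe with the example data typed in (the data frames are reconstructed from the question; AB is a name I introduced). One thing to be aware of: aggregate's formula interface drops rows containing NA by default.

```r
A <- data.frame(ID = 1:3,
                Var1 = c(0, 1, 1), Var2 = c(3, 5, 6), Var3 = c(4, 0, 7))
B <- data.frame(ID = 1:4,
                Var1 = c(2, 2, 0, 1), Var2 = c(4, 1, 2, 0), Var3 = c(2, 1, 1, 3))

AB <- rbind(A, B)                             # stack the two tables
C  <- aggregate(. ~ ID, data = AB, FUN = sum) # sum every other column per ID
C
```

ID 4 only occurs in B, so its row passes through unchanged.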

R - Loop through a data table with combination of dcast of sum

I have a table similar to this, with more columns. What I am trying to do is create a new table that shows, for each ID, the sum of Counts and the sum of Value for each Type.
df
ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3
I am able to do it for one single column by using
dcast(df[,j=list(sum(Counts,na.rm = TRUE)),by = c("ID","Type")],ID ~ paste(Type,"Counts",sep="_"))
However, I want to loop through each column of the data table, and so far without success: it always adds up all the rows. I have tried
sum(df[[i]],na.rm = TRUE)
sum(names(df)[[i]] == "",na.rm = TRUE)
sum(df[[names(df)[i]]],na.rm = TRUE)
j = list(apply(df[,c(3:4),with=FALSE],2,function(x) sum(x,na.rm = TRUE)))
I want a new table like this:
ID A_Counts B_Counts A_Value B_Value
1 1 2 5 4
2 5 3 5 6
My own table has more columns, but the idea is the same. Am I over-complicating it, or is there an easy trick I am not aware of? Please help me. Thank you!
You have to melt your data first, and then dcast it:
library(reshape2)
df2 <- melt(df,id.vars = c("ID","Type"))
# ID Type variable value
# 1 1 A Counts 1
# 2 1 B Counts 2
# 3 2 A Counts 2
# 4 2 A Counts 3
# 5 2 B Counts 1
# 6 2 B Counts 2
# 7 1 A Value 5
# 8 1 B Value 4
# 9 2 A Value 1
# 10 2 A Value 4
# 11 2 B Value 3
# 12 2 B Value 3
dcast(df2,ID ~ Type + variable,fun.aggregate=sum)
# ID A_Counts A_Value B_Counts B_Value
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Another solution with base functions only:
df3 <- aggregate(cbind(Counts,Value) ~ ID + Type,df,sum)
# ID Type Counts Value
# 1 1 A 1 5
# 2 2 A 5 5
# 3 1 B 2 4
# 4 2 B 3 6
reshape(df3, idvar='ID', timevar='Type',direction="wide")
# ID Counts.A Value.A Counts.B Value.B
# 1 1 1 5 2 4
# 2 2 5 5 3 6
Data
df <- read.table(text ="ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3",stringsAsFactors=FALSE,header=TRUE)
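Since the real table has more value columns, the base approach can be written without naming each one. A sketch, assuming every column other than ID and Type is numeric and should be summed (sums and wide are names I made up; sep = "_" is just my choice of separator):

```r
df <- read.table(text = "ID Type Counts Value
1 A 1 5
1 B 2 4
2 A 2 1
2 A 3 4
2 B 1 3
2 B 2 3", stringsAsFactors = FALSE, header = TRUE)

# "." on the left-hand side means: every column not used for grouping
sums <- aggregate(. ~ ID + Type, data = df, FUN = sum)

# one column per (value column, Type) pair, named like Counts_A, Value_B
wide <- reshape(sums, idvar = "ID", timevar = "Type",
                direction = "wide", sep = "_")
wide
```

The column names come out as variable_Type rather than Type_variable; rename afterwards if the other order matters.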

R - how to add cases of one variable to other variable (stack variables)

var1 var2 var3
1 2 3
1 2 3
1 2 3
I want to stack var2 and var3 underneath var1 to get:
var1
1
1
1
2
2
2
3
3
3
I tried:
data$var <- append(data$var1,data$var2)
Then I get an error that my replacement has more rows. How do I solve this?
df <- data.frame(var1=1:3,var2=4:6,var3=7:9)
df2 <- stack(df)
print(df2)
values ind
1 1 var1
2 2 var1
3 3 var1
4 4 var2
5 5 var2
6 6 var2
7 7 var3
8 8 var3
9 9 var3
You may want to try unlist:
dtf <- data.frame(a = 1:3, b = 1:3, c = 1:3)
unlist(dtf)
a1 a2 a3 b1 b2 b3 c1 c2 c3
1 2 3 1 2 3 1 2 3
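Applied to the data from the question, a short sketch could look like this (stacked is a name I chose). use.names = FALSE drops the a1, a2, ... style names, and the result goes into a new data frame because it has more rows than the original.

```r
data <- data.frame(var1 = rep(1, 3), var2 = rep(2, 3), var3 = rep(3, 3))

# unlist() concatenates the columns in order: var1, then var2, then var3
stacked <- data.frame(var1 = unlist(data, use.names = FALSE))
stacked
```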
Your output has a different number of rows to your input, so trying to turn the latter into the former is going to cause problems. Just make a new data frame:
df <- data.frame(x = c(df$var1, df$var2, df$var3))
You can also get fancy with do.call, taking advantage of the fact that a data frame is a list under the hood:
df <- data.frame(x = do.call("c", df))
I'm guessing you're getting this error because each column/variable in a dataframe needs to be the same length. You could make a new longer variable and join it with the old dataframe but it is going to repeat the data in the other variables.
> df <- data.frame(var1=1:3,var2=4:6,var3=7:9)
> df
var1 var2 var3
1 1 4 7
2 2 5 8
3 3 6 9
# join combination of var1/var2 and 'df' dataframe
> data.frame(newvar=c(df$var1,df$var2),df)
newvar var1 var2 var3
1 1 1 4 7
2 2 2 5 8
3 3 3 6 9
4 4 1 4 7
5 5 2 5 8
6 6 3 6 9
stack seems like the obvious answer here, but melt in the reshape package works in an equivalent fashion and may offer some flexibility for more complicated situations. Assuming you are working with an object named dat:
library(reshape)
melt(dat)
variable value
1 var1 1
2 var1 1
3 var1 1
4 var2 2
5 var2 2
6 var2 2
7 var3 3
8 var3 3
9 var3 3
If you need to preserve one of the columns as an ID variable:
> melt(dat, id.vars = "var1")
var1 variable value
1 1 var2 2
2 1 var2 2
3 1 var2 2
4 1 var3 3
5 1 var3 3
6 1 var3 3
