How to make a true ranking by giving same ranking to same score while the rest follow the true order? - r

Here is a toy sample of 8 students with grades from A to D. I would like to assign a ranking that reflects the true order, while students with the same grade share the same rank.
.GRP seems close to the right approach, but it numbers the groups consecutively. How can I skip the positions occupied by students with the same grade, using data.table? Thanks.
library(data.table)
DT <- data.table(GRADE = c("A","B","B","C",rep("D",4)))
DT[, GRP:=.GRP, by = GRADE][, RANK:= c(1,2,2,4,5,5,5,5)]
# GRADE GRP RANK
#1: A 1 1
#2: B 2 2
#3: B 2 2
#4: C 3 4
#5: D 4 5
#6: D 4 5
#7: D 4 5
#8: D 4 5

One option is frank with ties.method = 'min':
DT[, RANK := frank(GRADE, ties.method = 'min')]
DT$RANK
#[1] 1 2 2 4 5 5 5 5
Or in dplyr with min_rank
library(dplyr)
DT %>%
  mutate(RANK = min_rank(GRADE))
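To see how the tie-handling choices differ, here is a small base R sketch (my own illustration, not part of the answers above). It assumes the grades sort alphabetically in the intended order. rank() with ties.method = "min" gives the ranking the question asks for, while the dense numbering is what .GRP produces here, since the grades already appear in sorted order.

```r
grade <- c("A", "B", "B", "C", rep("D", 4))

# "min" ranking: tied grades share the lowest position they occupy,
# and the positions they use up are skipped afterwards
rank_min <- rank(grade, ties.method = "min")

# dense ranking: consecutive integers, one per distinct grade
rank_dense <- as.integer(factor(grade))

print(rank_min)    # 1 2 2 4 5 5 5 5
print(rank_dense)  # 1 2 2 3 4 4 4 4
```

In data.table, frank(GRADE, ties.method = "dense") should give the dense variant if that is ever what you need.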

Related

How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' on column a does not have a unique value in b, I want to drop both observations to get a new data.frame as this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a has a single value in b.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
> ind <- apply(sapply(with(df, split(b,a)), diff), 2, function(x) all(x==0) )
> out <- df[!duplicated(df),]
> out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
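For comparison, a base R sketch of the same idea using ave (an alternative I am adding for illustration; the names n_distinct_b and res are made up). It counts the distinct b values per level of a, then keeps the first row of each level where that count is 1.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 3))
b <- c(1, 1, 2, 4, 4, 4, 5, 5, 5)
df <- data.frame(a, b)

# number of distinct b values within each level of a, repeated on every row
n_distinct_b <- ave(df$b, df$a, FUN = function(x) length(unique(x)))

# keep the first row of each level of a whose b is constant
res <- df[n_distinct_b == 1 & !duplicated(df$a), ]
print(res)
```

This keeps the original row names (4 and 7), like the output requested in the question.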

Reshaping 2 column data.table from long to wide

This is my data.frame:
library(data.table)
df<- fread('
predictions Label
3 A
4 B
5 C
1 A
2 B
3 C
')
Desired Output:
A B C
3 4 5
1 2 3
I am trying DesiredOutput<-dcast(df, Label+predictions ~ Label, value.var = "predictions") with no success. Your help is appreciated!
df[, idx := 1:.N, by = Label]
dcast(df, idx ~ Label, value.var = 'predictions')
# idx A B C
#1: 1 3 4 5
#2: 2 1 2 3
Maybe the base R function unstack is the cleanest solution:
unstack(df)
A B C
1 3 4 5
2 1 2 3
Note that this returns a data.frame rather than a data.table, so if you want a data.table at the end:
df2 <- setDT(unstack(df))
will return a data.table.
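The key point in the dcast answer is that the wide table needs a column saying which output row each value belongs to; the long data has no such column, so one has to be created. A minimal sketch of that step, assuming a data.table version that provides rowid() (it is a shorthand for the 1:.N by group idiom; the name wide is just for illustration):

```r
library(data.table)

df <- data.table(predictions = c(3, 4, 5, 1, 2, 3),
                 Label = c("A", "B", "C", "A", "B", "C"))

# running counter within each Label: 1 for its first value, 2 for its second
df[, idx := rowid(Label)]

# one row per idx, one column per Label
wide <- dcast(df, idx ~ Label, value.var = "predictions")
print(wide)
```

Drop the helper column afterwards with wide[, idx := NULL] if you do not need it.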

R sort summarise ddply by group sum

I have a data.frame like this
x <- data.frame(Category=factor(c("One", "One", "Four", "Two","Two",
"Three", "Two", "Four","Three")),
City=factor(c("D","A","B","B","A","D","A","C","C")),
Frequency=c(10,1,5,2,14,8,20,3,5))
Category City Frequency
1 One D 10
2 One A 1
3 Four B 5
4 Two B 2
5 Two A 14
6 Three D 8
7 Two A 20
8 Four C 3
9 Three C 5
I want to make a pivot table with sum(Frequency) and used the ddply function like this:
library(plyr)
ddply(x,.(Category,City),summarize,Total=sum(Frequency))
Category City Total
1 Four B 5
2 Four C 3
3 One A 1
4 One D 10
5 Three C 5
6 Three D 8
7 Two A 34
8 Two B 2
But I need these results sorted by the total of each Category group, and by Total within each group. Something like this:
Category City Frequency
1 Two A 34
2 Two B 2
3 Three D 8
4 Three C 5
5 One D 10
6 One A 1
7 Four B 5
8 Four C 3
I have looked and tried sort, order, arrange, but nothing seems to do what I need. How can I do this in R?
Here is a base R version, where DF is the result of your ddply call:
with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])
produces:
Category City Total
7 Two A 34
8 Two B 2
6 Three D 8
5 Three C 5
4 One D 10
3 One A 1
1 Four B 5
2 Four C 3
The logic is basically the same as David's: calculate the sum of Total for each Category, use that number for all rows in that Category (ave(..., FUN=sum) does this), and then sort by it plus some tie-breakers so the rows come out in the expected order.
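To make the ave step concrete, here is the same ordering broken into two steps. DF is typed in by hand from the ddply output above, and group_sum and sorted are names introduced only for this sketch.

```r
DF <- data.frame(
  Category = c("Four", "Four", "One", "One", "Three", "Three", "Two", "Two"),
  City     = c("B", "C", "A", "D", "C", "D", "A", "B"),
  Total    = c(5, 3, 1, 10, 5, 8, 34, 2)
)

# per-Category sum of Total, repeated on every row of that Category
group_sum <- ave(DF$Total, DF$Category, FUN = sum)
print(group_sum)  # 8 8 11 11 13 13 36 36

# largest group first, then Category as a tie-breaker, then largest Total first
sorted <- DF[order(-group_sum, DF$Category, -DF$Total), ]
print(sorted)
```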
This is a nice question, and I can't think of a straightforward way of doing this other than creating a total size index and then sorting by it. Here's a possible data.table approach that uses the setorder function, which orders your data by reference:
library(data.table)
Res <- setDT(x)[, .(Total = sum(Frequency)), by = .(Category, City)]
setorder(Res[, size := sum(Total), by = Category], -size, -Total, Category)[]
# Category City Total size
# 1: Two A 34 36
# 2: Two B 2 36
# 3: Three D 8 13
# 4: Three C 5 13
# 5: One D 10 11
# 6: One A 1 11
# 7: Four B 5 8
# 8: Four C 3 8
Or, if you are deep in the Hadleyverse, we can reach a similar result using the newer dplyr package (as suggested by @akrun):
library(dplyr)
x %>%
  group_by(Category, City) %>%
  summarise(Total = sum(Frequency)) %>%
  mutate(size = sum(Total)) %>%
  ungroup() %>%
  arrange(-size, -Total, Category)

Dplyr: subtracting within uneven factor levels

I am trying to learn dplyr, and I cannot find an answer to a relatively simple question on Stack Overflow or in the documentation, so I thought I'd ask it here.
I have a data.frame that looks like this:
set.seed(1)
dat<-data.frame(rnorm(10,20,20),rep(seq(5),2),rep(c("a","b"),5))
names(dat)<-c("number","factor_1","factor_2")
dat<-dat[order(dat$factor_1,dat$factor_2),]
dat<-dat[c(-3,-7),]
number factor_1 factor_2
1 7.470924 1 a
6 3.590632 1 b
2 23.672866 2 b
3 3.287428 3 a
8 34.766494 3 b
4 51.905616 4 b
5 26.590155 5 a
10 13.892232 5 b
I would like to use dplyr to subtract the value in the number column associated with factor_2=="b" from the value associated with factor_2=="a", within each level of factor_1.
The first line of the resulting data.frame would look like:
diff factor_1
1 3.880291 1
A caveat is that there are not always values for each level of factor_2 within each level of factor_1. Should this be the case, I would like to assign 0 to the number associated with the missing factor level.
Thank you for your help.
Here is one approach:
set.seed(1)
dat<-data.frame(rnorm(10,20,20),rep(seq(5),2),rep(c("a","b"),5))
names(dat)<-c("number","factor_1","factor_2")
dat<-dat[order(dat$factor_1,dat$factor_2),]
dat<-dat[c(-3,-7),]
# number factor_1 factor_2
#1 7.470924 1 a
#6 3.590632 1 b
#2 23.672866 2 b
#3 3.287428 3 a
#8 34.766494 3 b
#4 51.905616 4 b
#5 26.590155 5 a
#10 13.892232 5 b
library(dplyr)
dat %>%
  group_by(factor_1) %>%
  summarize(diff = number[match('a', factor_2)] - number[match('b', factor_2)]) ->
  d2
d2$diff[is.na(d2$diff)] <- 0
d2
# Source: local data frame [5 x 2]
#
# factor_1 diff
# 1 1 3.880291
# 2 2 0.000000
# 3 3 -31.479066
# 4 4 0.000000
# 5 5 12.697923
Here's a quick data.table solution. The output below comes from a different random draw than the seeded data shown above, so the numbers do not match (please use set.seed when producing a data set with rnorm):
library(data.table)
setDT(dat)[order(-factor_2), if(.N == 1L) 0 else diff(number), by = factor_1]
# factor_1 V1
# 1: 1 18.20020
# 2: 2 0.00000
# 3: 3 -51.88444
# 4: 4 0.00000
# 5: 5 61.90332
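Both answers above return 0 for a level of factor_1 that has only one of "a" or "b". If you instead read the question literally, so that a missing value counts as 0 and a level with only "b" gives 0 - b, a cross-tabulation does that directly. This is a base R sketch under that reading. The numbers are the question's values rounded to two decimals and typed in by hand, and tab and res are names made up for the example.

```r
dat <- data.frame(
  number   = c(7.47, 3.59, 23.67, 3.29, 34.77, 51.91, 26.59, 13.89),
  factor_1 = c(1, 1, 2, 3, 3, 4, 5, 5),
  factor_2 = c("a", "b", "b", "a", "b", "b", "a", "b")
)

# sum of number per (factor_1, factor_2) cell; absent cells come out as 0
tab <- xtabs(number ~ factor_1 + factor_2, data = dat)

# a minus b within each level of factor_1
res <- data.frame(factor_1 = rownames(tab),
                  diff = as.vector(tab[, "a"] - tab[, "b"]))
print(res)
```

With this reading, levels 2 and 4 give -23.67 and -51.91 rather than 0, so pick whichever interpretation matches what you need.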

R data.table not preserving factor when applying function by group [duplicate]

The data comes from another question I was playing around with:
library(data.table)
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
country=c(rep(1,4),rep(2,6)),
event=1:10, key="user")
# user country event
#1: 3 1 1
#2: 3 1 2
#3: 3 1 3
#4: 3 1 4
#5: 3 2 5
#6: 4 2 6
#7: 4 2 7
#8: 4 2 8
#9: 4 2 9
#10: 4 2 10
And here's the surprising behavior:
dt[user == 3, as.data.frame(table(country))]
# country Freq
#1 1 4
#2 2 1
dt[user == 4, as.data.frame(table(country))]
# country Freq
#1 2 5
dt[, as.data.frame(table(country)), by = user]
# user country Freq
#1: 3 1 4
#2: 3 2 1
#3: 4 1 5
# ^^^ - why is this 1 instead of 2?!
Thanks mnel and Victor K. The natural follow-up is - shouldn't it be 2, i.e. is this a bug? I expected
dt[, blah, by = user]
to return identical result to
rbind(dt[user == 3, blah], dt[user == 4, blah])
Is that expectation incorrect?
The idiomatic data.table approach is to use .N
dt[ , .N, by = list(user, country)]
This will be far quicker and it will also retain country as the same class as in the original.
As mnel noted in comments, as.data.frame(table(...)) produces a data frame where the first variable is a factor. For user == 4, there is only one level in the factor, which is stored internally as 1.
What you want is factor levels, but what you get is how factors are stored internally (as integers, starting from 1). The following provides the expected result:
> dt[, lapply(as.data.frame(table(country)), as.character), by = user]
user country Freq
1: 3 1 4
2: 3 2 1
3: 4 2 5
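The difference between a factor's internal codes and its labels can be seen in isolation with a few lines of base R. This is only an illustration of the point above, with made-up variable names, mimicking the user == 4 case where country is always 2.

```r
# a one-level table, like table(country) for user == 4
tab <- as.data.frame(table(country = c(2, 2, 2, 2, 2)))

codes  <- as.integer(tab$country)                  # internal code of the only level
labels <- as.character(tab$country)                # the label you actually want
values <- as.numeric(as.character(tab$country))    # label converted back to a number

print(codes)   # 1
print(labels)  # "2"
print(values)  # 2
```

Going through as.character first is the usual way to get the original values back from a factor.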
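Update. Regarding your second question: no, I think the data.table behaviour is correct. The same thing happens in plain R when you combine two factors with different levels (in R versions before 4.1.0; from 4.1.0 on, c() on factors returns a factor with the combined levels):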
> a <- factor(3:5)
> b <- factor(6:8)
> a
[1] 3 4 5
Levels: 3 4 5
> b
[1] 6 7 8
Levels: 6 7 8
> c(a,b)
[1] 1 2 3 1 2 3
