Ordering alphabetically after ordering once numerically [duplicate] - r

Supose I have a data frame with 3 columns (name, y, sex) where name is character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
name y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score BUT how could I reorder it to have M and F levels not mixed. I need to order and at the same time keep factor levels separated.
Finally I would like to take a step further to involve character, the example doesn't help, but what if there were tied y values and I would have to order again within factor (e.g. TIM and TOM got 8.4 and I have to assign alphabetical order).
I was thinking about by function but it creates a list and doesn't help really. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?

order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M

Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key works sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is Another question that deals with the same

I think there must be some function like it to apply on data frames
and get data frames as return
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))

It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M

I believe that the person asked how to sort it by the orders in the case of say 20.
I know how to do that if I have 2 or 3 factors. But what if I had
serious levels of factors, say 20, should I write a for loop?
I have one where there are 9 orders with various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed using alphabetical order of stage_name, but stage_name is actually an ordered factor that has a very different order.
This code orders the factor is a much different order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can just apply the factor as the method of ordering this is a dplyr function, but you might need forcats too. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222

Related

Order a variable and separate by another variable R

I want to order() a variable by the number of characters it has and then separate that variable based on gender, f and m. How can I do that using R?
Calling your data d,
d[order(d$sex, nchar(as.character(d$name))), ]
# name sex
# 2 Alex Berniugn female
# 5 Aaron Poss male
# 1 Alex Josh Adams male
# 4 Potter Allbinore male
# 3 Abbigol Alexander male

How to convert a n x 3 data frame into a square (ordered) matrix?

I need to reshape a table or (data frame) to be able to use an R package (NetworkRiskMetrics). Suppose I have a data frame of lenders, borrowers and loan values:
lender borrower loan_USD
John Mark 100
Mark Paul 45
Joe Paul 30
Dan Mark 120
How do I convert this data frame into:
John Mark Joe Dan Paul
John
Mark
Joe
Dan
Paul
(placing zeros in empty cells)?
Thanks.
Use reshape function
d <- data.frame(lander=c('a','b','c', 'a'), borower=c('m','p','m','p'), loan=c(10,20,15,12))
d
loan lander borower
10.1 1 a m
20.1 1 b p
15.1 1 c m
12.1 1 a p
reshape(data=d, direction='long', varying=list('lander','borower'), idvar='loan', timevar='loan')
lander borower loan
1 a m 10
2 b p 20
3 c m 15
4 a p 12

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
libary(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
This should work based on your sample.
test <- test[order(test$values),]
test <- test[!(duplicated(test$names) & duplicated(test$dates) & is.na(test$values)),]

Sort data frame column by factor

Supose I have a data frame with 3 columns (name, y, sex) where name is character, y is a numeric value and sex is a factor.
sex<-c("M","M","F","M","F","M","M","M","F")
x<-c("MARK","TOM","SUSAN","LARRY","EMMA","LEONARD","TIM","MATT","VIOLET")
name<-as.character(x)
y<-rnorm(9,8,1)
score<-data.frame(x,y,sex)
score
name y sex
1 MARK 6.767086 M
2 TOM 7.613928 M
3 SUSAN 7.447405 F
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
7 TIM 10.385221 M
8 MATT 7.497702 M
9 VIOLET 10.177969 F
If I wanted to order it by y I would use:
score[order(score$y),]
x y sex
1 MARK 6.767086 M
3 SUSAN 7.447405 F
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
5 EMMA 8.306875 F
6 LEONARD 8.697268 M
9 VIOLET 10.177969 F
7 TIM 10.385221 M
So far, so good... The names keep the correct score BUT how could I reorder it to have M and F levels not mixed. I need to order and at the same time keep factor levels separated.
Finally I would like to take a step further to involve character, the example doesn't help, but what if there were tied y values and I would have to order again within factor (e.g. TIM and TOM got 8.4 and I have to assign alphabetical order).
I was thinking about by function but it creates a list and doesn't help really. I think there must be some function like it to apply on data frames and get data frames as return.
TO MAKE CLEAR THE POINT:
sep<-split(score,score$sex)
sep$M<-sep$M[order(sep$M[,2]),]
sep$M
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
sep$F<-sep$F[order(sep$F[,2]),]
sep$F
x y sex
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
merged<-rbind(sep$M,sep$F)
merged
x y sex
1 MARK 6.767086 M
8 MATT 7.497702 M
2 TOM 7.613928 M
4 LARRY 8.040069 M
6 LEONARD 8.697268 M
7 TIM 10.385221 M
3 SUSAN 7.447405 F
5 EMMA 8.306875 F
9 VIOLET 10.177969 F
I know how to do that if I have 2 or 3 factors. But what if I had serious levels of factors, say 20, should I write a for loop?
order takes multiple arguments, and it does just what you want:
with(score, score[order(sex, y, x),])
## x y sex
## 3 SUSAN 6.636370 F
## 5 EMMA 6.873445 F
## 9 VIOLET 8.539329 F
## 6 LEONARD 6.082038 M
## 2 TOM 7.812380 M
## 8 MATT 8.248374 M
## 4 LARRY 8.424665 M
## 7 TIM 8.754023 M
## 1 MARK 8.956372 M
Here is a summary of all methods mentioned in other answers/comments (to serve future searchers). I've added a data.table way of sorting.
# Base R
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
with(score, score[order(sex, y, x),])
score[order(score$sex,score$x),]
# Using plyr
arrange(score, sex,y)
ddply(score, c('sex', 'y'))
# Using `data.table`
library("data.table")
score_dt <- setDT(score)
# setting a key works sorts the data.table
setkey(score_dt,sex,x)
print(score_dt)
Here is Another question that deals with the same
I think there must be some function like it to apply on data frames
and get data frames as return
Yes there is:
library(plyr)
ddply(score, c('y', 'sex'))
It sounds to me like you're trying to order by score within the males and females and return a combined data frame of sorted males and sorted females.
You are right that by(score, score$sex, function(x) x[order(x$y),]) returns a list of sorted data frames, one for male and one for female. You can use do.call with the rbind function to combine these data frames into a single final data frame:
do.call(rbind, by(score, score$sex, function(x) x[order(x$y),]))
# x y sex
# F.5 EMMA 7.526866 F
# F.9 VIOLET 8.182407 F
# F.3 SUSAN 9.677511 F
# M.4 LARRY 6.929395 M
# M.8 MATT 7.970015 M
# M.7 TIM 8.297137 M
# M.6 LEONARD 8.845588 M
# M.2 TOM 9.035948 M
# M.1 MARK 10.082314 M
I believe that the person asked how to sort it by the orders in the case of say 20.
I know how to do that if I have 2 or 3 factors. But what if I had
serious levels of factors, say 20, should I write a for loop?
I have one where there are 9 orders with various counts.
stage_name count
<ord> <int>
1 Closed Lost 957
2 Closed Won 1413
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Nurture 1222
6 Opportunity Disqualified 805
7 Order Submitted 1673
8 Qualifying 5138
9 Quoted 4976
In this case you can see that it is displayed using alphabetical order of stage_name, but stage_name is actually an ordered factor that has a very different order.
This code orders the factor is a much different order:
# Make categoricals ----
check_stage$stage_name = ordered(check_stage$stage_name, levels=c(
'Opportunity Disqualified',
'Qualifying',
'Evaluation',
'Meeting Scheduled',
'Quoted',
'Order Submitted',
'Closed Won',
'Closed Lost',
'Nurture'))
Now we can just apply the factor as the method of ordering this is a dplyr function, but you might need forcats too. I have both libraries installed:
check_stage <- check_stage %>%
arrange(factor(stage_name))
This now gives the output in the factor order as desired:
check_stage
# A tibble: 9 × 2
stage_name count
<ord> <int>
1 Opportunity Disqualified 805
2 Qualifying 5138
3 Evaluation 1773
4 Meeting Scheduled 4104
5 Quoted 4976
6 Order Submitted 1673
7 Closed Won 1413
8 Closed Lost 957
9 Nurture 1222

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see the value "Jane" appears 3 times. What I want to do is to deduplicate the list based on the variable "Name" but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example if I was to deduplicate the above file in excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no3) has got missing information in the other columns.
So in other words I want to deduplicate the list by "Name" but add a criteria to keep the rows that have any other value different from "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like excel it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows based on column "Age" in a Z-A order so that anything that has "0" will be on the bottom and will be removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
Country !=0 &
Gender %in% c('F', 'M') &
Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
So far it works well and produces your desired output
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select by indexing the correct rows meeting your conditions, I really think the first part is better than these two lasts.
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18

Resources