How to sort alphabetically rows of a data frame? [duplicate] - r

I am trying to sort c alphabetically whenever x[i] == x[i+1]. I used the order() function, but it changes the x column as well. I want to order entire rows:
best <- function(state){
  HospitalName<-vector()
  StateName<-vector()
  HeartAttack<-vector()
  k<-1
  outcome<-read.csv("outcome-of-care-measures.csv",colClasses= "character")
  temp<-(outcome[,c(2,7,11,17,23)])
  for (i in 1:nrow(temp)){
    if(identical(state,temp[i,2])==TRUE){
      HospitalName[k]<-temp[i,1]
      StateName[k]<-temp[i,2]
      HeartAttack[k]<-as.numeric(temp[i,4])
      k<-k+1
    }
  }
  frame<-data.frame(cbind(HospitalName,StateName,HeartAttack))
  library(dplyr)
  frame %>%
    group_by(as.numeric(as.character(frame[,3]))) %>%
    arrange(frame[,1])
}
Output:
HospitalName StateName HeartAttack
1 FORT DUNCAN MEDICAL CENTER TX 8.1
2 TOMBALL REGIONAL MEDICAL CENTER TX 8.5
3 CYPRESS FAIRBANKS MEDICAL CENTER TX 8.7
4 DETAR HOSPITAL NAVARRO TX 8.7
5 METHODIST HOSPITAL,THE TX 8.8
6 MISSION REGIONAL MEDICAL CENTER TX 8.8
7 BAYLOR ALL SAINTS MEDICAL CENTER AT FW TX 8.9
8 SCOTT & WHITE HOSPITAL-ROUND ROCK TX 8.9
9 THE HEART HOSPITAL BAYLOR PLANO TX 9
10 UT SOUTHWESTERN UNIVERSITY HOSPITAL TX 9
.. ... ... ...
Variables not shown: as.numeric(as.character(frame[, 3])) (dbl)
The output does not contain the HeartAttack column, and I do not understand why.

One solution with dplyr:
library(dplyr)
df %>%
group_by(x) %>%
arrange(c)
Or, as @Akrun mentions in the comments below, just
df %>%
arrange(x,c)
if you are not interested in grouping. Depends on what you want.
Output:
Source: local data frame [5 x 2]
Groups: x
x c
1 2 A
2 2 D
3 3 B
4 3 C
5 5 E
There is another solution in base R but it will only work if your x column is ordered as is, or if you don't mind changing the order it has:
> df[order(df$x, df$c), , drop = FALSE]
x c
2 2 A
1 2 D
4 3 B
3 3 C
5 5 E
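For anyone who wants to run this, here is a minimal base R sketch. The original df is never shown, so the construction below is an assumption, with values inferred from the printed output above:

```r
# Hypothetical reconstruction of df, inferred from the printed output
df <- data.frame(
  x = c(2, 2, 3, 3, 5),
  c = c("D", "A", "C", "B", "E"),
  stringsAsFactors = FALSE
)

# order() returns a permutation of row indices; indexing the rows with it
# moves each row as a unit, so x and c stay paired
sorted <- df[order(df$x, df$c), , drop = FALSE]
print(sorted)
```

Sorting a single column with df$c[order(df$c)] would reorder only that column, which is the usual cause of rows getting mixed up.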

Related

Should I use for loop? OR apply? [duplicate]

this is my first post.
I have this dataframe of the Nhl draft.
What I would like to do is to use some sort of recursive function to create 10 objects.
So, I want to create these 10 objects by subsetting the Nhl dataframe by Year.
Here are the first 6 rows of the data set (nhl_draft)
Year Overall Team
1 2000 1 New York Islanders
2 2000 2 Atlanta Thrashers
3 2000 3 Minnesota Wild
4 2000 4 Columbus Blue Jackets
5 2000 5 New York Islanders
6 2000 6 Nashville Predators
Player PS
1 Rick DiPietro 49.3
2 Dany Heatley 95.2
3 Marian Gaborik 103.6
4 Rostislav Klesla 34.5
5 Raffi Torres 28.4
6 Scott Hartnell 74.5
I want to create 10 objects by subsetting out the Years, 2000 ~ 2009.
I tried,
for (i in 2000:2009) {
nhl_draft.i <- subset(nhl_draft, Year == "i")
}
But this doesn't do anything. What's the problem with this for loop? Can you suggest any other ways?
Please tell me if this is confusing; after all, this is my first post.
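The loop fails for two reasons: Year == "i" compares Year with the literal string "i" rather than with the loop variable i, and nhl_draft.i is one fixed name that is overwritten on every iteration. The following code fixes both by storing each subset in a list:
# Create an empty list
nhl_list <- list()
for (i in 2000:2009) {
  # Subset the data frame based on Year
  nhl_draft_temp <- subset(nhl_draft, Year == i)
  # Assign the subset to the list
  nhl_list[[as.character(i)]] <- nhl_draft_temp
}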
But you can consider split, which is more concise.
nhl_list <- split(nhl_draft, f = nhl_draft$Year)
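Once the data is split, each year is an element of the list. A small sketch of how to get at the pieces, using a made-up stand-in for nhl_draft (only two columns, invented values):

```r
# Hypothetical stand-in for nhl_draft, with only two of its columns
nhl_draft <- data.frame(
  Year = c(2000, 2000, 2001, 2001, 2002),
  Overall = c(1, 2, 1, 2, 1)
)

nhl_list <- split(nhl_draft, f = nhl_draft$Year)

# One element per year, named by the year (as character strings)
print(names(nhl_list))

# Pull a single year out by name
draft_2001 <- nhl_list[["2001"]]

# Apply a function to every year at once instead of creating 10 objects
picks_per_year <- sapply(nhl_list, nrow)
print(picks_per_year)
```

Keeping the subsets in one list usually makes later steps easier than having separate objects, because lapply/sapply can loop over them.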

R: split data based on a factor, add a ranking column and extract

I still haven't been able to figure out how to access the different elements of split data. Here is my problem:
I have a data set, that I want to split based on a column (State). I want to have a ranking column added to my data for each subset. This is part of a function I'm writing.
My data set has 3 columns: Hospital, State, Outcome. For each state, I want to add a 'Rank' column that ranks the data based on Outcome; the lowest Outcome will be ranked 1 and the highest Outcome will be ranked last.
How can I use split, sapply/lapply to do this? Is there a better way, like using "arrange"?
My main problem is that when I use either of these methods, I do not know how to access each element of the split or arranged data.
Here's what my data set looks like (the row numbers are not important here):
Hospital State Outcome
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3
2 MARSHALL MEDICAL CENTER SOUTH AL 18.5
3 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1
7 ST VINCENT'S EAST TX 17.7
8 DEKALB REGIONAL MEDICAL CENTER AL 18.0
9 SHELBY BAPTIST MEDICAL CENTER AL 15.9
The desired outcome would be
Hospital State Outcome Rank
1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
5 ST VINCENT'S EAST TX 17.7 1
6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
Thanks in advance.
The dplyr package provides a very elegant solution for this type of problem. I'm using the mtcars data as an example:
library(dplyr)
mtcars %>%
group_by(cyl) %>%
mutate(rank = row_number(mpg))
The OP's example is hard to read into R because of all the spaces in the string variable.
Here's a simpler example:
set.seed(1)
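# group sizes depend on the R version: sample() changed in R 3.6.0, and the output below is from an older version
DF <- data.frame(id=rep(c("A","B"),sample(5,2))); DF$v <- runif(nrow(DF))*100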
# id v
# 1 A 57.28534
# 2 A 90.82078
# 3 B 20.16819
# 4 B 89.83897
# 5 B 94.46753
# 6 B 66.07978
# 7 B 62.91140
Here's a solution without using any packages:
DF$r <- ave(DF$v,DF$id,FUN=rank)
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 4 B 89.83897 4
# 5 B 94.46753 5
# 6 B 66.07978 3
# 7 B 62.91140 2
Finally, to order by ranking within state:
DF[order(DF$id,DF$r),]
# id v r
# 1 A 57.28534 1
# 2 A 90.82078 2
# 3 B 20.16819 1
# 7 B 62.91140 2
# 6 B 66.07978 3
# 4 B 89.83897 4
# 5 B 94.46753 5
If you have ties in the column you're ranking, read the documentation for rank and decide how you want the ties treated. The dplyr and data.table packages (mentioned in the other answers) also have nice functionality for dealing with ties, like the notion of a "dense rank."
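To make the tie handling concrete, here is a short base R sketch on a made-up vector with one tie. The last line is one common base R idiom for a dense rank; it is an illustration, not something taken from the other answers:

```r
x <- c(8.1, 8.7, 8.7, 8.8)

r_average <- rank(x)                         # default: tied values share the mean rank
r_min     <- rank(x, ties.method = "min")    # tied values share the lowest rank
r_first   <- rank(x, ties.method = "first")  # ties broken by position in x
r_dense   <- match(x, sort(unique(x)))       # dense rank: no gap after a tie

print(rbind(r_average, r_min, r_first, r_dense))
```

Any of these can be passed through ave() the same way as above, e.g. FUN = function(v) rank(v, ties.method = "min").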
You could try this
library(data.table)
setDT(dat)[, myrank := rank(Outcome), by = State]
dat[,.SD[order(myrank)], by=State]
# State Hospital Outcome myrank
#1: AL SOUTHEAST ALABAMA MEDICAL CENTER 14.3 1
#2: AL SHELBY BAPTIST MEDICAL CENTER 15.9 2
#3: AL DEKALB REGIONAL MEDICAL CENTER 18.0 3
#4: AL MARSHALL MEDICAL CENTER SOUTH 18.5 4
#5: TX ST VINCENT EAST 17.7 1
#6: TX ELIZA COFFEE MEMORIAL HOSPITAL 18.1 2
Or using ddply
library(plyr)
ddply(dat, .(State), function(x){x$myrank = rank(x$Outcome); x[order(x$myrank),]})
# Hospital State Outcome myrank
#1 SOUTHEAST ALABAMA MEDICAL CENTER AL 14.3 1
#2 SHELBY BAPTIST MEDICAL CENTER AL 15.9 2
#3 DEKALB REGIONAL MEDICAL CENTER AL 18.0 3
#4 MARSHALL MEDICAL CENTER SOUTH AL 18.5 4
#5 ST VINCENT EAST TX 17.7 1
#6 ELIZA COFFEE MEMORIAL HOSPITAL TX 18.1 2
You can use by:
do.call(
  rbind,
  by(d, list(State = d$State), function(x) {
    x$Rank <- rank(x$Outcome)  # rank(), not order(): order() returns row positions, not ranks
    x[order(x$Rank), ]
  }))
where d is your raw data.
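Since the question was also about how to access each piece after splitting, here is a base R sketch using the six rows from the question. The choice of ties.method = "first" is an assumption; pick whatever tie rule you need:

```r
d <- data.frame(
  Hospital = c("SOUTHEAST ALABAMA MEDICAL CENTER", "MARSHALL MEDICAL CENTER SOUTH",
               "ELIZA COFFEE MEMORIAL HOSPITAL", "ST VINCENT'S EAST",
               "DEKALB REGIONAL MEDICAL CENTER", "SHELBY BAPTIST MEDICAL CENTER"),
  State = c("AL", "AL", "TX", "TX", "AL", "AL"),
  Outcome = c(14.3, 18.5, 18.1, 17.7, 18.0, 15.9),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per state;
# lapply() ranks and sorts each of them
pieces <- lapply(split(d, d$State), function(x) {
  x$Rank <- rank(x$Outcome, ties.method = "first")
  x[order(x$Rank), ]
})

# Access a single state by name
print(pieces[["TX"]])

# Or stack the pieces back into one data frame
ranked <- do.call(rbind, pieces)
rownames(ranked) <- NULL
print(ranked)
```

pieces[["AL"]] and pieces[["TX"]] are ordinary data frames, so anything you can do with d you can do with them.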

Issue with sorting one column after rank is assigned

*****This is to deal with the question asked in Coursera and hence I may not be able to reveal the complete code*****
hi,
below is my data frame (outcome_H)
Hospital_Name H_A H_F PN
ABC 4.5 5 6
CDE 4.5 1 3
EFG 5 2 1
1) I need to rank the column named in the function call (it could be one of H_A, H_F, PN).
2) A rank will also be provided in the call. I need to match that rank against the ranks calculated above and return the corresponding Hospital_Name.
I used ties.method="first" to resolve ties. However, when I look at the final output, the hospital names are not sorted.
Example: if I give rank = 2, I expect CDE to be printed, but due to some problem (which I am not aware of) ABC gets printed for rank = 2 and CDE is printed for rank = 1.
Below are some parts of code for better understanding:
H_A <- as.numeric(outcome_H$H_A)
HA <- H_A[order(H_A)] # newly added piece to order the value
df <- data.frame(HA,round(rank(HA,ties.method="first")),outcome_H$Hospital_Name)
rowss <- df[order(df$round.rank.HA..),]
Before ordering Output:
HA round.rank.HA.. outcome_H.Hospital.Name
42 8.1 1 FORT DUNCAN MEDICAL CENTER
192 8.5 2 TOMBALL REGIONAL MEDICAL CENTER
61 8.7 4 DETAR HOSPITAL NAVARRO
210 8.7 4 CYPRESS FAIRBANKS MEDICAL CENTER
69 8.8 6 MISSION REGIONAL MEDICAL CENTER
117 8.8 6 METHODIST HOSPITAL,THE
After Ordering output:
HA round.rank.HA..ties.method....first... outcome_H.Hospital.Name
1 8.1 1 PROVIDENCE MEMORIAL HOSPITAL
2 8.5 2 MEMORIAL HERMANN BAPTIST ORANGE HOSPITAL
3 8.7 3 PETERSON REGIONAL MEDICAL CENTER
4 8.7 4 CHILDREN'S HOSPITAL -SCOTT & WHITE HEALTHCARE
5 8.8 5 UNITED REGIONAL HEALTH CARE SYSTEM
6 8.8 6 ST JOSEPH REGIONAL HEALTH CENTER
As you can see, the data with hospital names are completely incorrect.
Any help is very much appreciated.
Thanks,
Pravellika J
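You could try H_A <- as.numeric(as.character(outcome_H$H_A)). If H_A was read in as a factor, as.numeric() alone returns the internal level codes rather than the values you see printed.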
Output
HA round.rank.HA..ties.method....first... outcome_H.Hospital_Name
1 4.5 1 ABC
2 4.5 2 CDE
3 5.0 3 EFG
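A quick illustration of why the as.character() step matters when a column is a factor (the three values here are made up):

```r
f <- factor(c("8.1", "10.5", "9"))

# Levels are sorted as strings: "10.5" "8.1" "9"
print(levels(f))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers you actually want

print(codes)
print(values)
```

Ranking or ordering the codes instead of the values gives results that look plausible but are wrong.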
I figured it out myself. I had initially built HA from only one of the three columns (H_A, H_F, PN). Now I combine it with Hospital_Name and order on both attributes.
Thanks,
Pravellika J
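The fix described above can be sketched roughly like this in base R. The helper name hospital_at_rank and its arguments are hypothetical, and the data is the three-row example from the question with the values stored as character strings:

```r
outcome_H <- data.frame(
  Hospital_Name = c("ABC", "CDE", "EFG"),
  H_A = c("4.5", "4.5", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the hospital at a given rank for one column.
# The values are never sorted on their own; order() is computed once and
# applied to the names, so names and values stay aligned.
hospital_at_rank <- function(data, column, rank) {
  value <- as.numeric(as.character(data[[column]]))
  ord <- order(value, data$Hospital_Name)  # ties broken alphabetically by name
  data$Hospital_Name[ord][rank]
}

print(hospital_at_rank(outcome_H, "H_A", 2))
```

The original problem came from sorting HA by itself and then pairing it with the unsorted outcome_H$Hospital_Name, which attaches each value to the wrong hospital.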

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on country, and then so we don't get each row twice (Dempsey, Tim and Tim, Dempsey---not to mention Dempsey, Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
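If you would rather avoid dplyr entirely, the same self-join can be sketched in base R. The data frame below is typed in from the question; player_pairs is just a name chosen for this example:

```r
df <- data.frame(
  Country = c("USA", "USA", "USA", "USA", "Brasil", "Brasil", "Brasil",
              "Brasil", "England", "England"),
  Player = c("Tim", "Tim", "Dempsey", "Dempsey", "Neymar", "Neymar", "Hulk",
             "Luiz", "Rooney", "Stewart"),
  Goals = c(0, 0, 3, 5, 6, 2, 5, 2, 4, 2),
  stringsAsFactors = FALSE
)

# Total goals per player
totals <- aggregate(Goals ~ Player + Country, data = df, FUN = sum)

# Self-join on Country; merge() adds the .x / .y suffixes automatically
player_pairs <- merge(totals, totals, by = "Country")

# Keep each unordered pair once and drop self-pairs
player_pairs <- subset(player_pairs, Player.x < Player.y)
print(player_pairs)
```

From here cor(player_pairs$Goals.x, player_pairs$Goals.y) gives the pairwise correlation the question asks about.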
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
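# total goals per player (columns: df$Player, df$Country, df$Goals)
A<-aggregate(df$Goals~df$Player+df$Country,data=df,sum)
# number of players per country
players_in_c<-table(A[,2])
dat<-NULL
# names() works whether Country is a factor or a character column
for(i in names(players_in_c)) {
  count<-players_in_c[i]
  # all index pairs among this country's players
  pair<-combn(count,m=2)
  B<-A[A[,2]==i,]
  dat<-rbind(dat, cbind(B[pair[1,],],B[pair[2,],]) )
}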
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

Rank a sorted dataset using apply function

My dataframe looks like this:
head(temp$HName)
[1] "UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER"
[2] "METHODIST HOSPITAL,THE"
[3] "TOMBALL REGIONAL MEDICAL CENTER"
[4] "METHODIST SUGAR LAND HOSPITAL"
[5] "GULF COAST MEDICAL CENTER"
[6] "VHS HARLINGEN HOSPITAL COMPANY LLC"
head(temp$Rate)
[1] 7.3 8.3 8.7 8.7 8.8 8.9
76 Levels: 7.3 8.3 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 ... 17.1
> head(temp$Rank)
[1] NA NA NA NA NA NA
The temp$Rate is sorted. I am trying to write a function assignRank which gives me a new column temp$Rank which has values as 1, 2, 3, 3, 4, 5
My code is as below:
tapply(temp$Rank,temp$Rate, assignRank)
where :
assignRank<- function(r=1){
temp$Rank <- r
r <- r + 1
return(r)
}
I get following error when running tapply
tapply(temp$Rank,temp$Rate, assignRank)
Show Traceback
Rerun with Debug
Error in `$<-.data.frame`(`*tmp*`, "Rank", value = c(NA, NA)) :
replacement has 2 rows, data has 301
Please advise where I am going wrong?
I use data.table for stuff like this, because both sorting and ranking are very efficient and the syntax is simple:
library(data.table)
setkey(setDT(temp), Rate) # This will sort your data set by Rate in case it's not yet sorted
temp[, Rank := .GRP, by = Rate]
temp
# HName Rate Rank
# 1: UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2: METHODIST HOSPITAL,THE 8.3 2
# 3: TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4: METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5: GULF COAST MEDICAL CENTER 8.8 4
# 6: VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5
Or you could easily do the same using base R (assuming your data is sorted by Rate):
as.numeric(factor(temp$Rate))
## [1] 1 2 3 3 4 5
Or you could use the dense_rank function from the dplyr package (which does not require sorting the data set):
library(dplyr)
temp %>%
mutate(Rank = dense_rank(Rate))
# HName Rate Rank
# 1 UNIVERSITY OF TEXAS HEALTH SCIENCE CENTER AT TYLER 7.3 1
# 2 METHODIST HOSPITAL,THE 8.3 2
# 3 TOMBALL REGIONAL MEDICAL CENTER 8.7 3
# 4 METHODIST SUGAR LAND HOSPITAL 8.7 3
# 5 GULF COAST MEDICAL CENTER 8.8 4
# 6 VHS HARLINGEN HOSPITAL COMPANY LLC 8.9 5
Other options (if the data is ordered)
with(temp, cumsum(ave(Rate, Rate, FUN=function(x) c(1,x[-1]!=x[-length(x)]))))
#[1] 1 2 3 3 4 5
with(temp, match(Rate, unique(Rate)) )
#[1] 1 2 3 3 4 5
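If the data is not ordered, a small change to the match() idea still works in base R. This is a sketch on a made-up, deliberately unsorted numeric vector; if your Rate column is a factor, convert it with as.numeric(as.character(...)) first:

```r
rate <- c(8.7, 7.3, 8.9, 8.3, 8.7, 8.8)  # deliberately unsorted

# sort(unique()) lists the distinct values in increasing order;
# match() returns each value's position in that list, i.e. its dense rank
dense <- match(rate, sort(unique(rate)))
print(dense)
```

Equal rates get the same rank and the next distinct rate gets the next integer, which is the 1, 2, 3, 3, 4, 5 pattern asked for once the rows are sorted.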
