Ordering rows when there is a tie between several of them - r

I have a data.frame (corresponding to a leaderboard) like this one:
structure(list(PJ = c(4, 4, 4, 4, 4, 4), V = c(4, 2, 2, 2, 1,
1), E = c(0, 0, 0, 0, 0, 0), D = c(0, 2, 2, 2, 3, 3), GF = c(182,
91, 92, 185, 126, 119), GC = c(84, 143, 144, 115, 141, 168),
Dif = c(98, -52, -52, 70, -15, -49), Pts = c(12, 6, 6, 6,
3, 3)), class = "data.frame", row.names = c("Player1", "Player2",
"Player3", "Player4", "Player5", "Player6"))
I would like to order the rows by the number of points, Pts. This can be done with df[order(df$Pts, decreasing=T),]. The problem arises when several players are tied on Pts: in that case I want to order the tied rows by Dif.
How can this be done?

The order function which you are already using can take multiple arguments, each used sequentially to break ties in the previous one; see ?order
So you simply have to add Dif to your existing call:
df[order(df$Pts, df$Dif, decreasing=T),]
You can add further terms to break any remaining ties, e.g. Player2 and Player3 who have identical Pts and Dif.
If you want to specify the direction (increasing or decreasing) for each argument separately, you can either pass the decreasing argument as a vector (which requires method = "radix"), as in @r.user.05apr's comment, or use my preferred lazy solution of adding - to any term that should be ordered in a decreasing direction:
df[order(-df$Pts, df$Dif),]
(this will order by Pts decreasing and Dif increasing; it won't work if e.g. one of the ordering columns is character)
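As a rough illustration of the two options above, here is a minimal sketch that uses only the Pts and Dif columns from the question (the other columns are left out for brevity, and the object names by_both and mixed are made up for this example):

```r
# Only the two ordering columns from the question's leaderboard
df <- data.frame(
  Pts = c(12, 6, 6, 6, 3, 3),
  Dif = c(98, -52, -52, 70, -15, -49),
  row.names = paste0("Player", 1:6)
)

# Negate both numeric keys: Pts decreasing, ties broken by Dif decreasing
by_both <- df[order(-df$Pts, -df$Dif), ]

# Per-key directions through a decreasing vector (needs method = "radix"):
# Pts decreasing, Dif increasing
mixed <- df[order(df$Pts, df$Dif, decreasing = c(TRUE, FALSE), method = "radix"), ]

print(by_both)
print(mixed)
```

Player2 and Player3 remain tied on both keys, so a further term would be needed to separate them.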

You can use the sqldf or dplyr library:
library (sqldf)
sqldf('select *
from "df"
order by "Pts" desc, "Dif" desc ')
Output
  PJ V E D  GF  GC Dif Pts
1  4 4 0 0 182  84  98  12
2  4 2 0 2 185 115  70   6
3  4 2 0 2  91 143 -52   6
4  4 2 0 2  92 144 -52   6
5  4 1 0 3 126 141 -15   3
6  4 1 0 3 119 168 -49   3
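The answer mentions dplyr without showing it. A possible dplyr version, sketched here under the assumption that the player names are stored in an ordinary column (called Player for this example) rather than in row names, could look like this:

```r
library(dplyr)

# Player is a hypothetical column holding what the question keeps in row names
df <- data.frame(
  Player = paste0("Player", 1:6),
  Dif = c(98, -52, -52, 70, -15, -49),
  Pts = c(12, 6, 6, 6, 3, 3)
)

# desc() reverses the direction of each key; later keys break ties in earlier ones
ranked <- df %>% arrange(desc(Pts), desc(Dif))
print(ranked)
```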

Related

R: How to repeatedly subtract specific columns from different series of columns, and output to a new dataframe?

I have a dataframe in wide format, and I want to subtract specific columns from different series of columns. Ideally I'd like the results to be in a new dataframe.
For example:
From this sample dataframe (dfOld), I would like columns A, B and C to each subtract D, and columns E, F and G to each subtract column H. In the real dataset, this keeps going and needs to be iterated.
image of dfOld as table
Sample Data:
dfOld <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10), A = c(2, 3, 4,5,4,6,7,1,9,12), B = c(3, 4, 5,2,4,5,1,7,0,8), C = c(5, 6, 7,2,4,1,5,4,6,13), D = c(68, 7, 8,2,1,5,7,9,78,7), E = c(2, 3, 42,5,4,6,7,1,9,12), F = c(37, 4, 5,2,48,5,1,7,60,8), G = c(5, 6, 7,2,4,1,5,4,6,13), H = c(35, 7, 8,2,1,5,7,9,78,7))
The results would ideally be in a new dataframe, with columns that have values and names for A-D, B-D, C-D, E-H, F-H, G-H, and look like this:
image of dfNew as table
In Excel, the formula would be "=B2-$E2" dragged down the rows, and across 3 columns, and then repeated again for "F2-$I2" etc, using the "$" sign to lock the column
In R, I've only been able to do this manually, kind of like the answer previously posted for a similar question (Subtracting two columns to give a new column in R)
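# Column names containing "-" are non-syntactic and must be wrapped in backticks
dfOld$`A-D` <- dfOld$A - dfOld$D
dfOld$`B-D` <- dfOld$B - dfOld$D
dfOld$`C-D` <- dfOld$C - dfOld$D
dfOld$`E-H` <- dfOld$E - dfOld$H
dfOld$`F-H` <- dfOld$F - dfOld$H
dfOld$`G-H` <- dfOld$G - dfOld$H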
And then separated the new columns out into a new dataset.
However, this obviously isn't scalable for my much larger dataset, and I'd really like to learn how else to do this kind of operation that's so easy in Excel (although still not scalable for large datasets).
Part of the answer may already be here: Subtract a column in a dataframe from many columns in R
But this answer (and several other similar ones) changes the values in the same dataframe, and the columns keep the same names.
I haven't been able to adapt it so that the new values have new columns, with new names (and ideally in a new dataframe)
Another part of the answer may be here:
Iterative function to subtract columns from a specific column in a dataframe and have the values appear in a new column
These answers put the subtracted results in new columns with new names, but every column in this dataframe subtracts values of every other column (A,B,C,D,E,F,G,H each minus C). And I can't seem to adapt it so that it works over specific series of columns (A, B, C each minus D, then E, F, G each minus H, etc.)
Thanks in advance for your help.
Probably others have better ways - but here is one possibility.
Load two libraries and convert dfOld to a data.table:
library(data.table)
library(magrittr)
setDT(dfOld)
Get information about the columns and collect it in a list:
lv = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4>0]
lv = split(lv, ceiling(seq_along(lv)/3))
names(lv) = names(dfOld)[-1][seq(1,ncol(dfOld)-1)%%4==0]
lv looks like this:
> lv
$D
[1] "A" "B" "C"
$H
[1] "E" "F" "G"
This is a bit convoluted, but basically, I'm taking each of the elements of the lv list, and I'm reshaping columns from dfOld, so I can do all subtractions at once. Then I'm retaining only the variables I need, and binding each of the resulting list of data.tables into a single datatable using rbindlist
res =rbindlist(lapply(names(lv), function(x) {
melt(dfOld,id=c("ID", x),measure.vars = lv[[x]]) %>%
.[,`:=`(nc=value-get(x),variable=paste0(variable,"-",x))] %>%
.[,.(ID,variable,nc)]
}))
Last step is simple - just dcast back
dcast(res,ID~variable, value.var="nc")
Output
ID A-D B-D C-D E-H F-H G-H
1: 1 -66 -65 -63 -33 2 -30
2: 2 -4 -3 -1 -4 -3 -1
3: 3 -4 -3 -1 34 -3 -1
4: 4 3 0 0 3 0 0
5: 5 3 3 3 3 47 3
6: 6 1 0 -4 1 0 -4
7: 7 0 -6 -2 0 -6 -2
8: 8 -8 -2 -5 -8 -2 -5
9: 9 -69 -78 -72 -69 -18 -72
10: 10 5 1 6 5 1 6
First, I create a function that does the simple calculation, taking the data frame and the two column names as inputs. Then I use purrr's map2 to pass the function (replicated as many times as needed, which in this case is 6) together with the list of parameters for each column pair, and use invoke to apply the function to those parameters. This leaves a list of data frames (each holding the ID and a single result column). Finally, I use reduce to join them back into one data frame and update the column names.
library(tidyverse)
subtract <- function(x, a, b){
x %>%
mutate(!! a := !!rlang::parse_expr(a) - !!rlang::parse_expr(b)) %>%
dplyr::select(ID, which(colnames(x)==a))
}
col_names <- c("ID", "A-D", "B-D", "C-D", "E-H", "F-H", "G-H")
map2(
flatten(list(rep(list(
subtract
), 6))),
list(
expression(a = "A", b = "D"),
expression(a = "B", b = "D"),
expression(a = "C", b = "D"),
expression(a = "E", b = "H"),
expression(a = "F", b = "H"),
expression(a = "G", b = "H")
),
~ invoke(.x, c(list(dfOld), as.list(.y)))
) %>%
reduce(left_join, by = "ID") %>%
set_names(col_names)
Output
ID A-D B-D C-D E-H F-H G-H
1 1 -66 -65 -63 -33 2 -30
2 2 -4 -3 -1 -4 -3 -1
3 3 -4 -3 -1 34 -3 -1
4 4 3 0 0 3 0 0
5 5 3 3 3 3 47 3
6 6 1 0 -4 1 0 -4
7 7 0 -6 -2 0 -6 -2
8 8 -8 -2 -5 -8 -2 -5
9 9 -69 -78 -72 -69 -18 -72
10 10 5 1 6 5 1 6
Data
dfOld <- structure(
list(
ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
A = c(2,
3, 4, 5, 4, 6, 7, 1, 9, 12),
B = c(3, 4, 5, 2, 4, 5, 1, 7, 0,
8),
C = c(5, 6, 7, 2, 4, 1, 5, 4, 6, 13),
D = c(68, 7, 8, 2,
1, 5, 7, 9, 78, 7),
E = c(2, 3, 42, 5, 4, 6, 7, 1, 9, 12),
F = c(37,
4, 5, 2, 48, 5, 1, 7, 60, 8),
G = c(5, 6, 7, 2, 4, 1, 5, 4, 6,
13),
H = c(35, 7, 8, 2, 1, 5, 7, 9, 78, 7)
),
class = "data.frame",
row.names = c(NA,-10L)
)
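For comparison, the same grouped subtraction can also be sketched in base R. This is not taken from either answer above; it assumes the mapping from each reference column to its series is written out by hand in a list (called groups here), and it uses only the first three rows of dfOld to stay short:

```r
# First three rows of the sample data from the question
dfOld <- data.frame(
  ID = 1:3,
  A = c(2, 3, 4),  B = c(3, 4, 5), C = c(5, 6, 7), D = c(68, 7, 8),
  E = c(2, 3, 42), F = c(37, 4, 5), G = c(5, 6, 7), H = c(35, 7, 8)
)

# Hypothetical mapping: name = column to subtract, value = columns to subtract it from
groups <- list(D = c("A", "B", "C"), H = c("E", "F", "G"))

pieces <- lapply(names(groups), function(ref) {
  # A data frame minus a vector of length nrow subtracts it from every column
  out <- dfOld[groups[[ref]]] - dfOld[[ref]]
  names(out) <- paste0(groups[[ref]], "-", ref)
  out
})

# cbind on data frames keeps the non-syntactic names such as "A-D"
dfNew <- cbind(dfOld["ID"], do.call(cbind, pieces))
print(dfNew)
```

For a larger dataset, the groups list could be generated from the column positions instead of being typed in, much like lv in the data.table answer.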

How to filter and count in R based on variables with a type of name

I have educational data in R that looks like this:
df <- data.frame(
"StudentID" = c(101, 102, 103, 104, 105, 106, 111, 112, 113, 114, 115, 116, 121, 122, 123, 124, 125, 126),
"FedEthn" = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3),
"HIST.11.LEV" = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 3, 3),
"HIST.11.SCORE" = c(96, 95, 95, 97, 88, 99, 89, 96, 79, 83, 72, 95, 96, 93, 97, 98, 96, 87),
"HIST.12.LEV" = c(2, 2, 1, 2, 1, 1, 2, 3, 2, 2, 2, 2, 4, 3, 3, 3, 3, 3),
"SCI.9.LEV" = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
"SCI.9.SCORE" = c(91, 99, 82, 95, 65, 83, 96, 97, 99, 94, 95, 96, 89, 78, 96, 95, 97, 90),
"SCI.10.LEV" = c(1, 2, 1, 2, 1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 4, 3, 4, 3)
)
## StudentID FedEthn HIST.11.LEV HIST.11.SCORE HIST.12.LEV SCI.9.LEV SCI.9.SCORE SCI.10.LEV
## 1 101 1 1 96 2 1 91 1
## 2 102 1 1 95 2 1 99 2
## 3 103 2 1 95 1 1 82 1
## 4 104 2 1 97 2 1 95 2
## 5 105 3 1 88 1 1 65 1
## 6 106 3 1 99 1 1 83 1
## 7 111 1 2 89 2 2 96 3
## 8 112 1 2 96 3 2 97 3
## 9 113 2 2 79 2 2 99 2
## 10 114 2 2 83 2 2 94 2
## 11 115 3 2 72 2 2 95 2
## 12 116 3 2 95 2 2 96 3
## 13 121 1 3 96 4 3 89 3
## 14 122 1 3 93 3 3 78 3
## 15 123 2 3 97 3 3 96 4
## 16 124 2 3 98 3 3 95 3
## 17 125 3 3 96 3 3 97 4
## 18 126 3 3 87 3 3 90 3
HIST.11.LEV stands for the student's academic level in their 11th grade history course. (5 = highest academic level, 1 = lowest academic level. For example, 5 might be an AP or IB course.) HIST.11.SCORE indicates the student's score in the course.
When a student scores 95 or higher in a course, they're eligible to move up to a higher academic level in the following year (such that HIST.12.LEV = 1 + HIST.11.LEV). However, only some of these eligible students actually move up, and the teacher must agree to it. What I'm analyzing is whether these move-up rates for eligible students differ by reported federal ethnicity.
Here's how I'm achieving this so far:
var.level <- 1
var.ethn <- 1
actual.move.ups <-
(df %>% filter(FedEthn==var.ethn,
HIST.11.LEV==var.level,
HIST.11.SCORE>94,
HIST.12.LEV==var.level+1) %>%
count) +
(df %>% filter(FedEthn==var.ethn,
SCI.9.LEV==var.level,
SCI.9.SCORE>94,
SCI.10.LEV==var.level+1) %>%
count)
eligible.move.ups <-
(df %>% filter(FedEthn==var.ethn,
HIST.11.LEV==var.level,
HIST.11.SCORE>94) %>%
count) +
(df %>% filter(FedEthn==var.ethn,
SCI.9.LEV==var.level,
SCI.9.SCORE>94) %>%
count)
This works, and I could iterate var.level from 1:5 and var.ethn from 1:7 and store the results in a data frame. But in my actual data, this approach would require 15 iterations of df %>% filter(...) %>% count (and I'd sum them all). The reason is that, in my actual data, there are 15 opportunities to move up across 5 subjects (HIST, SCI, MATH, ENG, WL) and 4 grade levels (9, 10, 11, 12).
My question is whether there's a more compact way to filter and count all instances where COURSE.GRADE.LEV==i, COURSE.GRADE+1.LEV==i+1, and COURSE.GRADE.SCORE>94 without typing/hard-coding each course name (HIST, SCI, MATH, ENG, WL) and each grade level (9, 10, 11, 12). And, what's the best way to store the results in a data frame?
For my sample data above, here's the ideal output. The data frame doesn't need to have this exact structure, though.
## FedEthn L1.Actual L1.Eligible L2.Actual L2.Eligible L3.Actual L3.Eligible
## 1 1 3 3 3 3 1 1
## 2 2 2 3 0 1 1 3
## 3 3 0 1 1 3 1 2
*Note: I've read this helpful answer, but for my variable names, the grade level (9, 10, 11, 12) doesn't have a consistent string location (e.g., SCI.9 vs. HIST.11). Also, in some instances, I need to count a single row multiple times, since a single student could move up in multiple classes. Maybe the solution is to reshape the data from wide to long before performing the count?
Using this great answer from @akrun, I was able to come up with a solution. I think I'm still making it unnecessarily complicated, though, and I hope to accept someone else's more compact answer.
course.names <- c("HIST.","SCI.")
grade.levels <- 9:11
tally.actual <- function(var.ethn, var.level){
total.tally.actual <- NULL
for(i in course.names){
course.tally.actual <- NULL
for(j in grade.levels){
new.tally.actual <- df %>% filter(
FedEthn == var.ethn,
!!(rlang::sym(paste0(i,j,".LEV"))) == var.level,
!!(rlang::sym(paste0(i,(j+1),".LEV"))) == (var.level+1),
!!(rlang::sym(paste0(i,j,".SCORE"))) > 94
) %>% count
course.tally.actual <- c(new.tally.actual, course.tally.actual)
}
total.tally.actual <- c(total.tally.actual, course.tally.actual)
}
return(sum(unlist(total.tally.actual)))
}
tally.eligible <- function(var.ethn, var.level){
total.tally.eligible <- NULL
for(i in course.names){
course.tally.eligible <- NULL
for(j in grade.levels){
new.tally.eligible <- df %>% filter(
FedEthn == var.ethn,
!!(rlang::sym(paste0(i,j,".LEV"))) == var.level,
!!(rlang::sym(paste0(i,j,".SCORE"))) > 94
) %>% count
course.tally.eligible <- c(new.tally.eligible, course.tally.eligible)
}
total.tally.eligible <- c(total.tally.eligible, course.tally.eligible)
}
return(sum(unlist(total.tally.eligible)))
}
results <- data.frame("FedEthn" = 1:3,
"L1.Actual" = NA, "L1.Eligible" = NA,
"L2.Actual" = NA, "L2.Eligible" = NA,
"L3.Actual" = NA, "L3.Eligible" = NA)
for(var.ethn in 1:3){
for(var.level in 1:3){
results[var.ethn,(var.level*2)] <- tally.actual(var.ethn,var.level)
results[var.ethn,(var.level*2+1)] <- tally.eligible(var.ethn,var.level)
}
}
This approach works, but it requires df to contain every combination of course (SCI, MATH, HIST, ENG, WL) and year (9, 10, 11, 12). See below for how I added to the original df. Including all possible combinations isn't a problem for my actual data, but I'm hoping there's a solution that doesn't require adding a bunch of columns filled with NA:
df$HIST.9.LEV = NA
df$HIST.9.SCORE = NA
df$HIST.10.LEV = NA
df$HIST.10.SCORE = NA
df$HIST.12.SCORE = NA
df$SCI.10.SCORE = NA
df$SCI.11.LEV = NA
df$SCI.11.SCORE = NA
df$SCI.12.LEV = NA
df$SCI.12.SCORE = NA
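The question raises the idea of reshaping from wide to long before counting. One way that might look in base R is sketched below. It is not part of the posted solution; it assumes every column follows the COURSE.GRADE.LEV / COURSE.GRADE.SCORE naming pattern and simply skips course/grade pairs whose score or following-year level column is missing, so no NA columns have to be added. The names long and counts are made up for this example.

```r
# Sample data from the question
df <- data.frame(
  StudentID = c(101, 102, 103, 104, 105, 106, 111, 112, 113, 114, 115, 116, 121, 122, 123, 124, 125, 126),
  FedEthn = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3),
  HIST.11.LEV = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 3, 3),
  HIST.11.SCORE = c(96, 95, 95, 97, 88, 99, 89, 96, 79, 83, 72, 95, 96, 93, 97, 98, 96, 87),
  HIST.12.LEV = c(2, 2, 1, 2, 1, 1, 2, 3, 2, 2, 2, 2, 4, 3, 3, 3, 3, 3),
  SCI.9.LEV = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  SCI.9.SCORE = c(91, 99, 82, 95, 65, 83, 96, 97, 99, 94, 95, 96, 89, 78, 96, 95, 97, 90),
  SCI.10.LEV = c(1, 2, 1, 2, 1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 4, 3, 4, 3)
)

# Every level column, e.g. "HIST.11.LEV", is a candidate starting year
lev_cols <- grep("\\.LEV$", names(df), value = TRUE)

# One block of rows per usable course/grade pair, stacked into long format
long <- do.call(rbind, lapply(lev_cols, function(col) {
  parts <- strsplit(col, ".", fixed = TRUE)[[1]]  # e.g. "HIST" "11" "LEV"
  course <- parts[1]
  grade <- as.integer(parts[2])
  score_col <- paste(course, grade, "SCORE", sep = ".")
  next_col <- paste(course, grade + 1L, "LEV", sep = ".")
  # Skip pairs without a score or without a level for the following year
  if (!all(c(score_col, next_col) %in% names(df))) return(NULL)
  eligible <- df[[score_col]] > 94
  data.frame(
    FedEthn = df$FedEthn,
    Level = df[[col]],
    Eligible = eligible,
    Actual = eligible & df[[next_col]] == df[[col]] + 1
  )
}))

# Count eligible and actual move-ups per ethnicity and starting level
counts <- aggregate(cbind(Actual, Eligible) ~ FedEthn + Level, data = long, FUN = sum)
print(counts)
```

The result is in long form (one row per FedEthn and Level) rather than the wide layout shown in the question, and it also reports any level that occurs in the data, such as level 5.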

Formula to substitute dataframe column names with categories defined in a second dataframe

Let's say I have data in wide format (samples in row and species in columns).
species <- data.frame(
Sample = 1:10,
Lobvar = c(21, 15, 12, 11, 32, 42, 54, 10, 1, 2),
Limtru = c(2, 5, 1, 0, 2, 22, 3, 0, 1, 2),
Pocele = c(3, 52, 11, 30, 22, 22, 23, 10, 21, 32),
Genmes = c(1, 0, 22, 1, 2,32, 2, 0, 1, 2)
)
And I want to automatically change the species names, based on a reference of functional groups that I have for all of the species (so it works even if I have more references than actual species in the dataset), for example:
reference <- data.frame(
Species_name = c("Lobvar", "Ampmis", "Pocele", "Genmes", "Limtru", "Secgio", "Nasval", "Letgos", "Salnes", "Verbes"),
Functional_group = c("Crustose", "Geniculate", "Erect", "CCA", "CCA", "CCA", "Geniculate", "Turf","Turf", "Crustose"),
stringsAsFactors = FALSE
)
EDIT
Thanks to @Dan Y's suggestions, I can now change the species names to their functional group names:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
However, in my actual data.frame I have more species, and this creates many functional groups with the same name in different columns. I now would like to sum the columns that have the same names. I updated the example to give a results in which there is more than one functional group with the same name.
So I get this:
Sample Crustose CCA Erect CCA Crustose
1 21 2 3 1 2
2 15 5 52 0 3
3 12 1 11 22 4
4 11 0 30 1 1
5 32 2 22 2 0
6 42 22 22 32 0
and the final result I am looking for is this:
Sample Crustose CCA Erect
1 23 3 3
2 18 5 52
3 16 22 11
4 12 1 30
5 32 4 22
6 42 54 22
How do you advise on approaching this? Thanks for your help and the amazing suggestions I already received.
Re Q1) We can use match to do the name lookup:
names(species)[2:ncol(species)] <- reference$Functional_group[match(names(species), reference$Species_name)][-1]
Re Q2) Then we can mapply the rowSums function after some regular expression work on the colnames:
# species is the renamed data frame from the question
namevec <- gsub("\\.[[:digit:]]", "", names(species))
mapply(function(x) rowSums(species[which(namevec == x)]), unique(namevec))
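To make the two steps concrete, here is a small self-contained sketch that looks up the group of each species and sums the columns per group without renaming them first. It uses only the first three samples and the four reference rows that match a species column; the names groups, summed and result are invented for the example.

```r
# First three samples from the question
species <- data.frame(
  Sample = 1:3,
  Lobvar = c(21, 15, 12),
  Limtru = c(2, 5, 1),
  Pocele = c(3, 52, 11),
  Genmes = c(1, 0, 22)
)

# Subset of the reference table from the question
reference <- data.frame(
  Species_name = c("Lobvar", "Pocele", "Genmes", "Limtru"),
  Functional_group = c("Crustose", "Erect", "CCA", "CCA"),
  stringsAsFactors = FALSE
)

# Functional group of each species column (Sample is excluded)
groups <- reference$Functional_group[match(names(species)[-1], reference$Species_name)]

# Sum the columns belonging to each group, one result column per group
summed <- sapply(unique(groups), function(g) rowSums(species[-1][groups == g]))

result <- data.frame(Sample = species$Sample, summed)
print(result)
```

Keeping the groups in a separate vector avoids having duplicate column names in the data frame at any point.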

Create several pie charts in R from lists

I want to create several pie charts at once, I have a list of the names:
[1] 361 456 745 858 1294 1297 2360 2872 3034 5118 5189...
So the first pie chart should be labeled '361', and so on.
Then I have several lists with values for each pie chart
[1] 102 99 107 30 2 8 24 16 57 117 ...
[1] 1 1 2 1 0 0 0 1 1 2 ...
[1] 4 2 2 1 3 0 0 1 1 2 ...
So for '361', the first element is 102, the second is 1 and the third is 4. The total is 107.
I want to do all of the charts at once.
One way to get that is by setting par("mfrow"). I also adjusted the margins a bit to eliminate some unwanted whitespace around the charts.
par(mfrow=c(2,5), mar=rep(0, 4), oma=rep(0,4))
for (i in seq_len(nrow(df))) {  # loop over the rows of df; names has one more entry than df has rows
  pie(df[i, ][df[i, ] > 0], labels = (1:3)[df[i, ] > 0])
  title(names[i], line = -3)
}
Data
## data
names = c(361, 456, 745, 858, 1294, 1297, 2360, 2872, 3034, 5118, 5189)
x = c(102, 99, 107, 30, 2, 8, 24, 16, 57, 117)
y = c(1, 1, 2, 1, 0, 0, 0, 1, 1, 2)
z = c(4, 2, 2, 1, 3, 0, 0, 1, 1, 2)
df = data.frame(x,y,z)

R: 3D visualisation of a 6-dimensional matrix

I am doing some computation as part of a scientific research project, and I am stuck on a problem that has to do with data visualization.
I have a list of sublists of different lengths. Each sublist is a vector of numeric values of the main variable for a single situation. The problem is this:
is there a way to display it in a 3D plot in the following way:
Let's say the x-axis stands for one factor of the experiment, the y-axis stands for another factor, and the z-axis holds the values of our numeric variable. I need to display the data as vertical lines (parallel to the z-axis). The number of those vertical lines equals the number of factor combinations (along the x-axis and y-axis). Here is how it looked before with a smaller number of values (when the lists were all the same size):
https://www.dropbox.com/s/wdcgihjcqzobsqs/sample0.jpeg
I would like to keep the same layout, only with a larger number of points. Each of these sublists stands for one of the 6 factor combinations.
Or maybe there is a different way, a better way of 3D visualization of this kind of data.
And here is the list of sublists I need to make my visualization for (I do not know if this is relevant here):
> temp
[[1]]
[1] 395 310 235 290 240 490 270 225 430 385 170 55 295 320 270 130 300 285 130 200 225 90 205
[24] 340
[[2]]
[1] 3 8
[[3]]
[1] 1 0 0 0 3 2 5 2 3 5 2 3
[[4]]
[1] 1 0 0 0 3 2 5 2 3 5 2 3
[[5]]
[1] 1 1 1 2 3 5 2 5 3 3 3 2 3 2 3
[[6]]
[1] 0 0 195 150 2 2 0 2 1 1 2 1 2 1 1 1 3 2 2 1 2 2 1
[24] 1 2 3 2 2 1 3 1 1
Any help/suggestions will be appreciated.
Here is an alternate visualization. Note that you don't have a 6D problem, it's really a 3D problem with 2 factor dimensions and one continuous one. There are 6 possible factor combinations. Note I had to make assumptions about what factor combination corresponds to what item in your list:
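# f1 and f2 (the levels of the two factors) are not defined in this answer;
# they must exist beforehand, and their lengths must multiply to 6
facs <- cbind(f1=rep(f1, length(f2)), f2=rep(f2, each=length(f1))) # create factor combos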
lst <- list(c(395, 310, 235, 290, 240, 490, 270, 225, 430, 385, 170, 55, 295, 320, 270, 130, 300, 285, 130, 200, 225, 90, 205, 340 ), c(3, 8), c(1, 0, 0, 0, 3, 2, 5, 2, 3, 5, 2, 3), c(1, 0, 0, 0, 3, 2, 5, 2, 3, 5, 2, 3), c(1, 1, 1, 2, 3, 5, 2, 5, 3, 3, 3, 2, 3, 2, 3), c(0, 0, 195, 150, 2, 2, 0, 2, 1, 1, 2, 1, 2, 1, 1, 1, 3, 2, 2, 1, 2, 2, 1, 1, 2, 3, 2, 2, 1, 3, 1, 1))
library(data.table)
facs.dt <- as.data.table(facs)[,list(time=sort(lst[[.GRP]])), by=list(f1, f2)]
facs.dt[, id:=seq_along(time), by=list(f1, f2)]
library(ggplot2)
ggplot(facs.dt, aes(x=id, y=time)) +
geom_bar(stat="identity", position="dodge") +
scale_y_log10() + facet_grid(f1 ~ f2)
The resulting plot displays, for each of the 6 factor combinations, the log of all the time values. This makes it much easier to read the continuous variable than a 3D cube.
And an alternate view with free scales:
ggplot(facs.dt, aes(x=id, y=time)) +
geom_bar(stat="identity", position="dodge") +
facet_wrap(~ f1 + f2, scales="free") +
theme(axis.text.x=element_blank(), axis.ticks.x=element_blank()) # theme() replaces the defunct opts()
