After quite some googling, I hope somebody can help me with a problem that appears simple to me but may be more complicated than I thought:
I have a data.frame with three columns. The first two contain all possible combinations of five values (1-5); the last holds the "strength" of each combination. I am looking for the five rows that together cover all values of Var1 and all values of Var2 (so each value 1-5 appears exactly once in each column) and that have the highest sum in the strength column. In the example below, these are the five rows shown under "Desired outcome", whose strengths sum to 958.
How do I best approach this problem? Is there a package that implements this task? I came across the constrOptim() function; can I do it with that?
Code to create an example dataframe:
a <- cbind(expand.grid(seq(1,5,1), seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
Starting dataset:
Var1 Var2 strength
3 2 306
3 1 230
1 3 207
2 2 169
3 5 165
5 1 156
4 4 153
2 5 136
4 3 128
4 1 118
5 3 101
1 2 98
4 5 73
1 5 63
2 1 61
5 5 35
3 4 32
3 3 27
1 4 19
5 4 14
4 2 6
1 1 -11
2 4 -18
2 3 -32
5 2 -54
Not desired outcome (simply the five rows with the largest strengths; note that Var1 = 3 appears three times):
Var1 Var2 strength
3 2 306
3 1 230
1 3 207
2 2 169
3 5 165
Desired outcome:
Var1 Var2 strength
3 2 306
1 3 207
5 1 156
4 4 153
2 5 136
I am not sure the solution presented here is the most efficient one, but I believe we must go over the entire dataset to find the unique pairs. (For example, if the row (Var1 = 2, Var2 = 5, strength = 136) had strength = 1 instead, it would only be reached at the very end of the ordered data.) To find the unique pairs I use the apply function. First let's recreate the input:
a <- cbind(expand.grid(seq(1,5,1), seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
Now I prepare an empty matrix in which I will keep Var1 in the first column, Var2 in the second and strength in the third column.
V <- matrix(nrow = 5, ncol = 3)
Next I write a function that takes one row of the ordered dataset a, checks whether its Var1 and Var2 values are not yet used, and if so stores the row.
mf <- function(x){
  # x is one row of a: x[1] = Var1, x[2] = Var2, x[3] = strength
  if( !(x[1] %in% V[,1]) & !(x[2] %in% V[,2])) {
    i <- x[1]        # use the Var1 value as the row index of V
    V[i,1] <<- x[1]  # <<- assigns into V in the enclosing environment
    V[i,2] <<- x[2]
    V[i,3] <<- x[3]
  }
}
Now I apply the function on each row of a:
apply(a, 1, mf)
The needed values are stored in the matrix V:
V
[,1] [,2] [,3]
[1,] 1 3 207
[2,] 2 5 136
[3,] 3 2 306
[4,] 4 4 153
[5,] 5 1 156
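If you prefer the result as a data.frame shaped like the desired outcome, a small follow-up:
res <- data.frame(Var1 = V[,1], Var2 = V[,2], strength = V[,3])
res[order(res$strength, decreasing = TRUE),]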
Sometimes, though, going over the full dataset is not necessary (as in the example given), and we would like to break out of the loop once all unique pairs have been found. For that we can use a for loop. Here is the code:
a <- cbind(expand.grid(seq(1,5,1), seq(1,5,1)),
strength = c(-11, 61, 230, 118, 156, 98, 169, 306, 6, -54,
207, -32, 27, 128, 101, 19, -18, 32, 153, 14,
63, 136, 165, 73, 35))
a <- a[order(a$strength, decreasing=T),]
V <- matrix(nrow=5,ncol=3)
for (i in 1:nrow(a)) {
  # stop as soon as all five rows of V have been filled
  if( sum(is.na(V[,1])) == 0)
    break
  if( !(a[i,1] %in% V[,1]) & !(a[i,2] %in% V[,2])) {
    j <- a[i,1]
    V[j,1] <- a[i,1]
    V[j,2] <- a[i,2]
    V[j,3] <- a[i,3]
  }
}
Hope this helps, or at least will lead to improvements.
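A caveat on the greedy scan above: picking the largest remaining strength row by row is a heuristic, and in general it is not guaranteed to maximize the total sum (here it happens to). The task is exactly the linear assignment problem, for which a dedicated solver guarantees the optimum. A minimal sketch, assuming the clue package is installed (solve_LSAP() requires a nonnegative matrix, hence the shift by min(m)):
library(clue)
m <- matrix(0, 5, 5)
m[cbind(a$Var1, a$Var2)] <- a$strength  # m[i, j] = strength of pairing Var1 = i with Var2 = j
perm <- as.integer(solve_LSAP(m - min(m), maximum = TRUE))
data.frame(Var1 = 1:5, Var2 = perm, strength = m[cbind(1:5, perm)])
#   Var1 Var2 strength
# 1    1    3      207
# 2    2    5      136
# 3    3    2      306
# 4    4    4      153
# 5    5    1      156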
Consider a series of aggregations and merges on the Var1 and Var2 columns:
# MERGE MAX AGGREGATES WHERE Var COL ARE EQUAL AND NOT EQUAL
mergedf1 <- merge(aggregate(strength ~ Var1, data=a[a$Var1==a$Var2,], FUN=max),
a, by=c("Var1", "strength"))
mergedf2 <- merge(aggregate(strength ~ Var1, data=a[a$Var1!=a$Var2,], FUN=max),
a, by=c("Var1", "strength"))
# STACK RESULTS
mergedf <- rbind(mergedf1, mergedf2)
# FINAL MAX AGGREGATION AND MERGE
final <- merge(aggregate(strength ~ Var2, data=mergedf, FUN=max),
mergedf, by=c("Var2", "strength"))
final <- final[,c("Var1", "Var2", "strength")] # SORT COLUMNS
final <- final[with(final, order(-strength)),] # SORT ROWS
# REMOVE TEMP OBJECTS
rm(mergedf1, mergedf2, mergedf)
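For the example data, final reproduces the desired rows (row names may vary after the merges):
final
#   Var1 Var2 strength
# 2    3    2      306
# 3    1    3      207
# 1    5    1      156
# 4    4    4      153
# 5    2    5      136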
I have a data.frame (corresponding to a leaderboard) like this one:
structure(list(PJ = c(4, 4, 4, 4, 4, 4), V = c(4, 2, 2, 2, 1,
1), E = c(0, 0, 0, 0, 0, 0), D = c(0, 2, 2, 2, 3, 3), GF = c(182,
91, 92, 185, 126, 119), GC = c(84, 143, 144, 115, 141, 168),
Dif = c(98, -52, -52, 70, -15, -49), Pts = c(12, 6, 6, 6,
3, 3)), class = "data.frame", row.names = c("Player1", "Player2",
"Player3", "Player4", "Player5", "Player6"))
I would like to order the rows according to the number of points Pts. This can be done with df[order(df$Pts, decreasing=T),]. The issue appears when several players are tied on Pts; in that case I want to break the tie by ordering on Dif.
How can this be done?
The order function which you are already using can take multiple arguments, each used sequentially to break ties in the previous one; see ?order.
So you simply have to add Dif to your existing call:
df[order(df$Pts, df$Dif, decreasing=T),]
You can add further terms to break any remaining ties, e.g. Player2 and Player3 who have identical Pts and Dif.
If you want to specify the direction (increasing or decreasing) for each argument separately, you can either specify the decreasing argument as a vector, as in @r.user.05apr's comment, or use my preferred lazy solution of adding - to any term that should be ordered in the decreasing direction:
df[order(-df$Pts, df$Dif),]
(this will order by Pts decreasing and Dif increasing; it won't work if, e.g., one of the ordering columns is character)
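For completeness, the vector form of decreasing requires the radix sort method; this is a base R detail, and the call below orders Pts decreasing and Dif increasing, matching the one above:
df[order(df$Pts, df$Dif, decreasing = c(TRUE, FALSE), method = "radix"),]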
You can use the sqldf or dplyr library:
library(sqldf)
sqldf('select *
from "df"
order by "Pts" desc, "Dif" desc ')
Output
PJ V E D GF GC Dif Pts
1 4 4 0 0 182 84 98 12
2 4 2 0 2 185 115 70 6
3 4 2 0 2 91 143 -52 6
4 4 2 0 2 92 144 -52 6
5 4 1 0 3 126 141 -15 3
6 4 1 0 3 119 168 -49 3
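The dplyr equivalent is a one-liner (a brief sketch using the same df):
library(dplyr)
df %>% arrange(desc(Pts), desc(Dif))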
I have educational data in R that looks like this:
df <- data.frame(
"StudentID" = c(101, 102, 103, 104, 105, 106, 111, 112, 113, 114, 115, 116, 121, 122, 123, 124, 125, 126),
"FedEthn" = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3),
"HIST.11.LEV" = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 5, 3, 3),
"HIST.11.SCORE" = c(96, 95, 95, 97, 88, 99, 89, 96, 79, 83, 72, 95, 96, 93, 97, 98, 96, 87),
"HIST.12.LEV" = c(2, 2, 1, 2, 1, 1, 2, 3, 2, 2, 2, 2, 4, 3, 3, 3, 3, 3),
"SCI.9.LEV" = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3),
"SCI.9.SCORE" = c(91, 99, 82, 95, 65, 83, 96, 97, 99, 94, 95, 96, 89, 78, 96, 95, 97, 90),
"SCI.10.LEV" = c(1, 2, 1, 2, 1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 4, 3, 4, 3)
)
## StudentID FedEthn HIST.11.LEV HIST.11.SCORE HIST.12.LEV SCI.9.LEV SCI.9.SCORE SCI.10.LEV
## 1 101 1 1 96 2 1 91 1
## 2 102 1 1 95 2 1 99 2
## 3 103 2 1 95 1 1 82 1
## 4 104 2 1 97 2 1 95 2
## 5 105 3 1 88 1 1 65 1
## 6 106 3 1 99 1 1 83 1
## 7 111 1 2 89 2 2 96 3
## 8 112 1 2 96 3 2 97 3
## 9 113 2 2 79 2 2 99 2
## 10 114 2 2 83 2 2 94 2
## 11 115 3 2 72 2 2 95 2
## 12 116 3 2 95 2 2 96 3
## 13 121 1 3 96 4 3 89 3
## 14 122 1 3 93 3 3 78 3
## 15 123 2 3 97 3 3 96 4
## 16 124 2 5 98 3 3 95 3
## 17 125 3 3 96 3 3 97 4
## 18 126 3 3 87 3 3 90 3
HIST.11.LEV stands for the student's academic level in their 11th grade history course. (5 = highest academic level, 1 = lowest academic level. For example, 5 might be an AP or IB course.) HIST.11.SCORE indicates the student's score in the course.
When a student scores 95 or higher in a course, they're eligible to move up to a higher academic level in the following year (such that HIST.12.LEV = 1 + HIST.11.LEV). However, only some of these eligible students actually move up, and the teacher must agree to it. What I'm analyzing is whether these move-up rates for eligible students differ by reported federal ethnicity.
Here's how I'm achieving this so far:
library(dplyr)  # for %>%, filter(), and count()
var.level <- 1
var.ethn <- 1
actual.move.ups <-
(df %>% filter(FedEthn==var.ethn,
HIST.11.LEV==var.level,
HIST.11.SCORE>94,
HIST.12.LEV==var.level+1) %>%
count) +
(df %>% filter(FedEthn==var.ethn,
SCI.9.LEV==var.level,
SCI.9.SCORE>94,
SCI.10.LEV==var.level+1) %>%
count)
eligible.move.ups <-
(df %>% filter(FedEthn==var.ethn,
HIST.11.LEV==var.level,
HIST.11.SCORE>94) %>%
count) +
(df %>% filter(FedEthn==var.ethn,
SCI.9.LEV==var.level,
SCI.9.SCORE>94) %>%
count)
This works, and I could iterate var.level from 1:5 and var.ethnicity from 1:7 and store the results in a data frame. But in my actual data, this approach would require 15 iterations of df %>% filter(...) %>% count (and I'd sum them all). The reason is that, in my actual data, there are 15 opportunities to move up across 5 subjects (HIST, SCI, MATH, ENG, WL) and 4 grade levels (9, 10, 11, 12).
My question is whether there's a more compact way to filter and count all instances where COURSE.GRADE.LEV==i, COURSE.GRADE+1.LEV==i+1, and COURSE.GRADE.SCORE>94 without typing/hard-coding each course name (HIST, SCI, MATH, ENG, WL) and each grade level (9, 10, 11, 12). And, what's the best way to store the results in a data frame?
For my sample data above, here's the ideal output. The data frame doesn't need to have this exact structure, though.
## FedEthn L1.Actual L1.Eligible L2.Actual L2.Eligible L3.Actual L3.Eligible
## 1 1 3 3 3 3 1 1
## 2 2 2 3 0 1 1 3
## 3 3 0 1 1 3 1 2
Note: I've read this helpful answer, but for my variable names the grade level (9, 10, 11, 12) doesn't have a consistent string location (e.g., SCI.9 vs. HIST.11). Also, in some instances I need to count a single row multiple times, since a single student could move up in multiple classes. Maybe the solution is to reshape the data from wide to long before performing the count?
Using this great answer from @akrun, I was able to come up with a solution. I think I'm still making it unnecessarily complicated, though, and I hope to accept someone else's more compact answer.
course.names <- c("HIST.","SCI.")
grade.levels <- 9:11
tally.actual <- function(var.ethn, var.level){
  total.tally.actual <- NULL
  for(i in course.names){
    course.tally.actual <- NULL
    for(j in grade.levels){
      new.tally.actual <- df %>% filter(
        FedEthn == var.ethn,
        !!(rlang::sym(paste0(i,j,".LEV"))) == var.level,
        !!(rlang::sym(paste0(i,(j+1),".LEV"))) == (var.level+1),
        !!(rlang::sym(paste0(i,j,".SCORE"))) > 94
      ) %>% count
      course.tally.actual <- c(new.tally.actual, course.tally.actual)
    }
    total.tally.actual <- c(total.tally.actual, course.tally.actual)
  }
  return(sum(unlist(total.tally.actual)))
}
tally.eligible <- function(var.ethn, var.level){
  total.tally.eligible <- NULL
  for(i in course.names){
    course.tally.eligible <- NULL
    for(j in grade.levels){
      new.tally.eligible <- df %>% filter(
        FedEthn == var.ethn,
        !!(rlang::sym(paste0(i,j,".LEV"))) == var.level,
        !!(rlang::sym(paste0(i,j,".SCORE"))) > 94
      ) %>% count
      course.tally.eligible <- c(new.tally.eligible, course.tally.eligible)
    }
    total.tally.eligible <- c(total.tally.eligible, course.tally.eligible)
  }
  return(sum(unlist(total.tally.eligible)))
}
results <- data.frame("FedEthn" = 1:3,
"L1.Actual" = NA, "L1.Eligible" = NA,
"L2.Actual" = NA, "L2.Eligible" = NA,
"L3.Actual" = NA, "L3.Eligible" = NA)
for(var.ethn in 1:3){
  for(var.level in 1:3){
    results[var.ethn, (var.level*2)] <- tally.actual(var.ethn, var.level)
    results[var.ethn, (var.level*2+1)] <- tally.eligible(var.ethn, var.level)
  }
}
This approach works, but it requires df to contain every combination of course (SCI, MATH, HIST, ENG, WL) and year (9, 10, 11, 12). See below for how I added to the original df. Including all possible combinations isn't a problem for my actual data, but I'm hoping there's a solution that doesn't require adding a bunch of columns filled with NA:
df$HIST.9.LEV = NA
df$HIST.9.SCORE = NA
df$HIST.10.LEV = NA
df$HIST.10.SCORE = NA
df$HIST.12.SCORE = NA
df$SCI.10.SCORE = NA
df$SCI.11.LEV = NA
df$SCI.11.SCORE = NA
df$SCI.12.LEV = NA
df$SCI.12.SCORE = NA
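For what it's worth, here is a sketch of the reshape-first idea mentioned in the question's note. It is only an illustration, not a tested drop-in: it assumes tidyr (>= 1.0) for pivot_longer() with a ".value" sentinel, that every course column follows the COURSE.GRADE.SUFFIX naming, and that grades within a course are consecutive so lead() points at the following year; the names long, tallies, and next.lev are made up.
library(dplyr)
library(tidyr)

# one row per student/course/grade; a missing SCORE column becomes NA
long <- df %>%
  pivot_longer(-c(StudentID, FedEthn),
               names_to = c("course", "grade", ".value"),
               names_pattern = "([A-Z]+)\\.(\\d+)\\.(LEV|SCORE)") %>%
  mutate(grade = as.integer(grade))

tallies <- long %>%
  group_by(StudentID, course) %>%
  arrange(grade, .by_group = TRUE) %>%
  mutate(next.lev = lead(LEV)) %>%   # the level in the following grade
  ungroup() %>%
  filter(SCORE > 94) %>%             # eligible course-years
  group_by(FedEthn, LEV) %>%
  summarise(Eligible = n(),
            Actual = sum(next.lev == LEV + 1, na.rm = TRUE),
            .groups = "drop")
tallies holds one row per FedEthn/level combination and matches the counts in the ideal output above (plus a row for the lone level-5 eligibility of student 124, which the wide table omits); pivot_wider() could reshape it to the L1.Actual/L1.Eligible layout if preferred.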
Problem Explanation
Given two vectors of variable length, and no duplicates between or within the vectors, how can one efficiently combine the vectors in ascending and alternating order?
Here is an example:
a <- c(98, 101, 104, 136, 154, 193)
b <- c(31, 37, 41, 44, 48, 55, 80, 118, 179)
The expected output is
c(31, 98, 118, 136, 179, 193)
# b, a, b, a, b, a
You can see we start with 31, from b, the smallest value across both vectors.
This is followed by 98 from a. Then from b, the next number larger than 98 is 118. And so on, resulting in:
A = 98, 136, 193
B = 31, 118, 179
My Attempt:
x <- c(min(a,b))
lastwas <- startedwithA <- ifelse(x %in% a, 1, 2)
for(i in 1:(length(a)+length(b))){
  if(lastwas == 2){
    x <- c(x, a[which(a > x[i])[1]])
    lastwas <- 1
  } else if(lastwas == 1){
    x <- c(x, b[which(b > x[i])[1]])
    lastwas <- 2
  }
}
(x <- x[!is.na(x)])
# [1] 31 98 118 136 179 193
if(startedwithA == 1){
  evenodd <- c(T,F)
} else {
  evenodd <- c(F,T)
}
(A = x[evenodd])
# [1] 98 136 193
(B = x[!evenodd])
# [1] 31 118 179
Is there a better way to approach this problem?
You could use data.frame, order and diff like this:
a = c(98, 101, 104, 136, 154, 193)
b = c(31, 37, 41, 44, 48, 55, 80, 118, 179)
# Create a data frame for each vector with a different number in column 'set'
dfa = data.frame(val = a, set = 1)
dfb = data.frame(val = b, set = 2)
# Bind both together and order them by val
df = rbind(dfa,dfb)
df = df[order(df$val),]
# Only keep those that have a diff different to 0
keep = c(1, diff(df$set)) != 0
result = df[keep,]
Result:
val set
7 31 2
1 98 1
14 118 2
4 136 1
15 179 2
6 193 1
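To recover the two alternating vectors (the A and B from the question), you can then split on the set column:
split(result$val, result$set)
# $`1`
# [1]  98 136 193
# $`2`
# [1]  31 118 179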
You could set name prefixes according to the source vector,
v <- sort(c(setNames(a, paste0("a", a)), setNames(b, paste0("b", b))))
and use diff of factorized initial characters.
res <- v[!!c(1, diff(as.numeric(as.factor(substr(names(v), 1, 1)))))]
res
# b31 a98 b118 a136 b179 a193
# 31 98 118 136 179 193
Then split into a data frame:
d <- as.data.frame(split(res, substr(names(res), 1, 1)))
d
# a b
# a98 98 31
# a136 136 118
# a193 193 179
Data
a <- c(98, 101, 104, 136, 154, 193)
b <- c(31, 37, 41, 44, 48, 55, 80, 118, 179)
I am trying to iterate over the rows of a dataframe (data) to check whether one of the columns (data$ID) has roughly the same difference (e.g., 3) between consecutive elements. If yes, keep the row; otherwise remove it. The tricky part is that I need to re-compare consecutive elements after a row is removed.
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data
ID score
1 3.1 70
2 6 80
3 6.9 90
4 9 65
5 10.5 43
6 12 78
7 14.2 44
8 15 92
for (i in (length(data$ID)-1)) {   # bug: loops once with i = 7; should be 1:(length(data$ID)-1)
  first <- data$ID[i]
  second <- data$ID[i+1]
  if ((second-first) == 3){        # bug: removes rows whose gap IS 3 (the opposite of the goal),
    data <- data[-(i+1),]          # and exact equality misses gaps such as 2.9
  }
}
The expected output should be:
ID score
1 3.1 70
2 6 80
3 9 65
4 12 78
5 15 92
The original rows 3, 5, and 7 are excluded because their differences deviate from 3. But my code failed.
I also tried the diff function,
DF <- diff(data$ID)
But it doesn't account for the fact that after a row is removed, the differences change. Should I use diff in a loop, even though the dataframe changes dynamically?
Using a recursive function (a function that calls itself)
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
# use recursive function to trim the remainder of the list
trim_ids <- function (ids) {
  # if only one element, return it
  if (length(ids) <= 1) {
    return(ids)
  }
  # if the gap between element 2 and element 1 is small enough
  if ((ids[2] - ids[1]) < 2.9) {
    # trim after dropping the second element
    return(trim_ids(ids[-2]))
  } else {
    # keep the first element and trim from the second element
    return(c(ids[1], trim_ids(ids[2:length(ids)])))
  }
}
# find the ids to keep
keep_ids <- trim_ids(data$ID)
# select the matching rows
data[data$ID %in% keep_ids,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92
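On very long vectors deep recursion can hit R's nesting limit; an iterative sketch of the same trimming logic (same 2.9 threshold) avoids that:
ids <- data$ID
i <- 1
while (i < length(ids)) {
  # drop the next element while its gap is too small, otherwise advance
  if (ids[i + 1] - ids[i] < 2.9) ids <- ids[-(i + 1)] else i <- i + 1
}
data[data$ID %in% ids, ]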
Another option uses cumsum and diff:
#data
data <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
data[c(0, cumsum(diff(round(data$ID))) %% 3 ) == 0,]
# ID score
# 1 3.1 70
# 2 6.0 80
# 4 9.0 65
# 6 12.0 78
# 8 15.0 92
If you define the rule as keeping all rows whose ID, rounded to zero digits, is a multiple of 3, you could try:
df1 <- data.frame(ID=c(3.1, 6, 6.9, 9, 10.5, 12, 14.2, 15),
score = c(70, 80, 90, 65, 43, 78, 44, 92))
df1[round(df1$ID) %% 3 == 0,]
ID score
1 3.1 70
2 6.0 80
4 9.0 65
6 12.0 78
8 15.0 92
I want to check whether the positions of the NA values in a vector are uniformly distributed (length = 30, fewer than 6 NAs).
The example below has length 30 and contains 4 NAs. It's easy to see that these NAs are not uniform: they sit mainly on the left.
vector_x <- c(NA,3, NA, 1, NA, 5, 6, 7, 7, 9, 0, 2, 12, 324, 54,23, 12, 324, 122, 23, 324, 332, 45, 78, 32, 12, 342, 95, 67, NA)
But I have no idea which statistic or test to use to describe this, so that I can screen quantitatively against a cutoff.
For now I have two preliminary thoughts:
To simplify the problem, treat every NA as 0 and every number as 1, and look at the distribution of that binary vector.
Or take the indices of the NAs and analyze the variance of c(1, 3, 5, 30).
Thanks for any suggestions!
You want to perform a Mann-Whitney U test, also known as the Wilcoxon rank-sum test (a name more descriptive of what it does).
This is easy to do with your data:
which(is.na(vector_x))
# [1]  1  3  5 30
which(!is.na(vector_x))
# [1]  2  4  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
wilcox.test(which(is.na(vector_x)), which(!is.na(vector_x)))
# Wilcoxon rank sum test
# data:  which(is.na(vector_x)) and which(!is.na(vector_x))
# W = 29, p-value = 0.1766
# alternative hypothesis: true location shift is not equal to 0
Check that wilcox.test behaves the way we expect:
wilcox.test(1:5, 6:10) # low p value
wilcox.test(seq(1,10,2), seq(2,10,2)) # high p value
If we need the indices of the NA elements, use is.na to get a logical vector; which then returns the numeric indices where it is TRUE:
which(is.na(vector_x))
#[1] 1 3 5 30
Or, to convert to a binary vector where 0 represents NA and 1 represents any other value:
as.integer(!is.na(vector_x))
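For the example vector (NAs at positions 1, 3, 5, and 30) this gives:
# [1] 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0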