Let's say I've got a dataframe with multiple columns, some of which I want to transform. The column names define what transformation needs to be used.
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
Ideally, I would like to use an across call where the column names are matched up with the trans function names and the transformations are performed on the fly.
The desired output is the following:
df_trans <- df %>%
  dplyr::mutate(log10 = trans$log10(log10),
                log2 = trans$log2(log2),
                log1p = trans$log1p(log1p),
                sqrt = trans$sqrt(sqrt))
df_trans
However, I don't want to manually specify each transformation separately. In the representative example I only have 4, but this number could vary and be significantly higher, making manual specification cumbersome and error-prone.
I have managed to match up the column names with the functions by turning the trans list into a data frame and left-joining, but am then unable to call the function in the trans_function column.
trans_df <- enframe(trans, value = "trans_function")
df %>%
  pivot_longer(cols = everything()) %>%
  left_join(trans_df) %>%
  dplyr::mutate(value = trans_function(value))
Error: Problem with `mutate()` column `value`.
i `value = trans_function(value)`.
x could not find function "trans_function"
I think I either need to find a way of calling the functions from the list columns or another way of matching up the function names with the column names. All ideas are welcome.
We can use cur_column() in across to get the column name and use it to subset trans.
library(dplyr)
df %>%
  mutate(across(names(trans), ~ trans[[cur_column()]](.x))) %>%
  head()
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
Comparing it with the output of df_trans:
head(df_trans)
# A B log10 log2 log1p sqrt
#1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
One way is to use lapply:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
trans <- list()
trans$log10 <- log10
trans$log2 <- log2
trans$log1p <- log1p
trans$sqrt <- sqrt
df_trans <- setNames(lapply(names(df), function(x) {
  if (x %in% names(trans)) trans[[x]](df[, x]) else df[, x]
}), names(df)) %>%
  bind_cols() %>%
  as.data.frame()
head(df_trans)
which gives:
A B log10 log2 log1p sqrt
1 1 0.1365052 1.739051 6.301896 4.530600 4.318942
2 2 0.1771364 1.549601 5.793220 4.521715 3.649834
3 3 0.5195605 1.902438 4.819125 3.343266 6.788565
4 4 0.8111208 1.572253 6.219991 4.075945 3.322401
5 5 0.1153620 1.751276 6.306097 4.060292 7.817301
6 6 0.8934218 1.724403 6.201123 3.235938 9.749128
The original dataframe being:
head(df)
A B log10 log2 log1p sqrt
1 1 0.1365052 54.83409 78.89684 91.81428 18.65326
2 2 0.1771364 35.44878 55.45401 90.99323 13.32129
3 3 0.5195605 79.88006 28.22936 27.31143 46.08461
4 4 0.8111208 37.34675 74.54249 57.90612 11.03835
5 5 0.1153620 56.39961 79.12693 56.99123 61.11019
6 6 0.8934218 53.01557 73.57393 24.43022 95.04549
In base R, we may use Map
df[names(trans)] <- Map(function(x, y) x(y), trans, df[names(trans)])
Checking:
> identical(df, df_trans)
[1] TRUE
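An equivalent purrr phrasing of the same idea, for those already in the tidyverse (a sketch; run it on the original, untransformed df — imap passes each column together with its name):
library(purrr)
df[names(trans)] <- imap(df[names(trans)], ~ trans[[.y]](.x))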
Another possibility is the following:
library(tidyverse)
set.seed(42)
df <- data.frame(A = 1:100,
                 B = runif(n = 100, 0, 1),
                 log10 = runif(n = 100, 10, 100),
                 log2 = runif(n = 100, 10, 100),
                 log1p = runif(n = 100, 10, 100),
                 sqrt = runif(n = 100, 10, 100))
df %>%
  mutate(across(-(A:B), ~ getFunction(cur_column())(.x))) %>%
  head()
#> A B log10 log2 log1p sqrt
#> 1 1 0.9148060 1.821920 6.486402 3.998918 3.470303
#> 2 2 0.9370754 1.470472 5.821200 3.932046 7.496103
#> 3 3 0.2861395 1.469690 6.437524 2.799395 8.171007
#> 4 4 0.8304476 1.653261 5.639570 3.700698 6.905755
#> 5 5 0.6417455 1.976905 4.597484 4.500461 9.441077
#> 6 6 0.5190959 1.985133 5.638341 4.551289 4.440590
Weirdly for this one, I think it's easier to start by viewing the df.
#reproducible data
quantiles <- c("50", "90")
var <- c("w", "d")
df <- data.frame(a = runif(20, 0.01, .5), b = runif(20, 0.02, .5),
                 c = runif(20, 0.03, .5), e = runif(20, 0.04, .5),
                 q50 = runif(20, 1, 5), q90 = runif(20, 10, 50))
head(df)
I want to automate a function that I've created (below) to calculate vars using different combinations of values from my df.
For example, the calculation of w needs to use a and b, and d needs to use c and e, such that w = a * q ^ b and d = c * q ^ e. Further, q is a quantile, so I actually want w50, w90, etc., which will correspond to q50, q90, etc. from the df.
The tricky part as I see it is setting the condition to use a & b vs. c & e without using nested loops.
I have a function to calculate vars using the appropriate columns, however I can't get all the pieces together efficiently.
#function to calculate the w, d
calc_wd <- function(df, col_name, col1, col2, col3) {
  # Calculate and create new column col_name for each combo of var and quantile, e.g. "w_50", "d_50", etc.
  df[[col_name]] <- df[[col1]] * (df[[col2]] ^ df[[col3]])
  df
}
I can get this to work for a single case, but not by automating the coefficient swap... you'll see I specify "a" and "b" below.
wd <- c("w_", "d_")
make_wd_list <- apply(expand.grid(wd, quantiles), 1, paste, collapse = "")
calc_wd(df, make_wd_list[1], "a", paste0("q", sapply(strsplit(make_wd_list[1], "_"), tail, 1)), "b")
Alternatively, I have tried to make this work using nested for loops, but can't seem to append the data correctly. And it's ugly.
var <- c("w", "d")
dataf <- data.frame()
for (j in unique(var)) {
  if (j == "w") {
    coeff1 <- "a"
    coeff2 <- "b"
  } else if (j == "d") {
    coeff1 <- "c"
    coeff2 <- "e"
  }
  print(coeff1)
  print(coeff2)
  for (k in unique(quantiles)) {
    dataf <- calc_wd(df, paste0(j, k), coeff1, paste0("q", k), coeff2)
    dataf[k, j] <- rbind(df, dataf) # this ain't right; tried do.call outside, etc.
  }
}
In the end, I'm looking to have new columns with w_50, w_90, etc., which use q50, q90 and the corresponding coefficients as defined originally.
One approach I find easy to type is using purrr::pmap. I like this because when you use with(list(...), ), you can access the columns of your data.frame by name. Additionally, you can supply extra arguments.
library(purrr)
pmap_df(df, quant = "q90", ~ with(list(...), {
  list(w = a * get(quant) ^ b, d = c * get(quant) ^ e)
}))
## A tibble: 20 x 2
# w d
# <dbl> <dbl>
# 1 0.239 0.295
# 2 0.152 0.392
# 3 0.476 0.828
# 4 0.344 0.236
# 5 0.439 1.00
You could combine this with, for example, a second map call to iterate over the quantiles.
library(dplyr)
map(setNames(quantiles, quantiles),
    ~ pmap_df(df, quant = paste0("q", .x),
              ~ with(list(...), {
                list(w = a * get(quant) ^ b, d = c * get(quant) ^ e)
              }))
) %>% do.call(cbind, .)
# 50.w 50.d 90.w 90.d
#1 0.63585897 0.11045837 1.7276019 0.1784987
#2 0.17286184 0.22033649 0.2333682 0.5200265
#3 0.32437528 0.72502654 0.5722203 1.4490065
#4 0.68020897 0.33797621 0.8749206 0.6179557
#5 0.73516886 0.38481785 1.2782923 0.4870877
Then wrapping this in a custom function is trivial.
calcwd <- function(df, quantiles) {
  map(setNames(quantiles, quantiles),
      ~ pmap_df(df, quant = paste0("q", .x),
                ~ with(list(...), {
                  list(w = a * get(quant) ^ b, d = c * get(quant) ^ e)
                }))
  ) %>% do.call(cbind, .)
}
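Called with the question's objects, for example:
calcwd(df, quantiles)  # quantiles <- c("50", "90") as defined in the question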
I love @Ian's answer for its completeness and the use of classics like with and do.call. I'm late to the scene with my solution, but since I have been trying to get better with rowwise operations (including the use of rowwise), I thought I would offer up a less elegant but simpler and faster solution using just mutate, formula.tools and map_dfc.
library(dplyr)
library(purrr)
require(formula.tools)
# same type example data plus a much larger version in df2 for
# performance testing
df <- data.frame(a = runif(20, 0.01, .5),
b = runif(20, 0.02, .5),
c = runif(20, 0.03, .5),
e = runif(20, 0.04, .5),
q50 = runif(20,1,5),
q90 = runif(20,10,50)
)
df2 <- data.frame(a = runif(20000, 0.01, .5),
b = runif(20000, 0.02, .5),
c = runif(20000, 0.03, .5),
e = runif(20000, 0.04, .5),
q50 = runif(20000,1,5),
q90 = runif(20000,10,50)
)
# from your original post
quantiles <- c("q50", "q90")
wd <- c("w_", "d_")
make_wd_list <- apply(expand.grid(wd, quantiles),
                      1,
                      paste, collapse = "")
make_wd_list
#> [1] "w_q50" "d_q50" "w_q90" "d_q90"
# an empty list to hold our formulas
eqn_list <- vector(mode = "list",
                   length = length(make_wd_list))
# populate the list makes it very extensible to more outcomes
# or to more quantile levels
for (i in seq_along(make_wd_list)) {
  if (substr(make_wd_list[[i]], 1, 1) == "w") {
    eqn_list[[i]] <- as.formula(paste(make_wd_list[[i]], "~ a * ", substr(make_wd_list[[i]], 3, 5), " ^ b"))
  } else if (substr(make_wd_list[[i]], 1, 1) == "d") {
    eqn_list[[i]] <- as.formula(paste(make_wd_list[[i]], "~ c * ", substr(make_wd_list[[i]], 3, 5), " ^ e"))
  }
}
# formula.tools helps us grab both left and right sides
add_column <- function(df, equation) {
  df <- transmute_(df, rhs(equation))
  colnames(df)[ncol(df)] <- as.character(lhs(equation))
  return(df)
}
result <- map_dfc(eqn_list, ~ add_column(df = df, equation = .x))
#> w_q50 d_q50 w_q90 d_q90
#> 1 0.10580863 0.29136904 0.37839737 0.9014040
#> 2 0.34798729 0.35185585 0.64196417 0.4257495
#> 3 0.79714122 0.37242915 1.57594506 0.6198531
#> 4 0.56446922 0.43432160 1.07458217 1.1082825
#> 5 0.26896574 0.07374273 0.28557366 0.1678035
#> 6 0.36840408 0.72458466 0.72741030 1.2480547
#> 7 0.64484009 0.69464045 1.93290705 2.1663690
#> 8 0.43336109 0.21265672 0.46187366 0.4365486
#> 9 0.61340404 0.47528697 0.89286358 0.5383290
#> 10 0.36983212 0.53292900 0.53996112 0.8488402
#> 11 0.11278412 0.12532491 0.12486156 0.2413191
#> 12 0.03599639 0.25578020 0.04084221 0.3284659
#> 13 0.26308183 0.05322304 0.87057854 0.1817630
#> 14 0.06533586 0.22458880 0.09085436 0.3391683
#> 15 0.11625845 0.32995233 0.12749040 0.4730407
#> 16 0.81584442 0.07733376 2.15108243 0.1041342
#> 17 0.38198254 0.60263861 0.68082354 0.8502999
#> 18 0.51756058 0.43398089 1.06683204 1.3397900
#> 19 0.34490492 0.13790601 0.69168711 0.1580659
#> 20 0.39771037 0.33286225 1.32578056 0.4141457
microbenchmark::microbenchmark(result <- map_dfc(eqn_list, ~ add_column(df = df2, equation = .x)), times = 10)
#> Unit: milliseconds
#>                                                                expr      min       lq     mean  median       uq      max neval
#>  result <- map_dfc(eqn_list, ~add_column(df = df2, equation = .x)) 10.58004 11.34603 12.56774 11.6257 13.24273 16.91417    10
The mutate and formula solution is about fifty times faster, although both rip through 20,000 rows in less than a second.
Created on 2020-04-30 by the reprex package (v0.3.0)
Here's an example of how the group label from cut() doesn't seem accurate. The observation with x1=200 is classified in the [0,200) group of x2, which is wrong. The label can be fixed by increasing dig.lab, but I still think the default rounding should give a result for x2 with face validity. Is this a bug?
df <- data.frame(x1 = c(100, 100.5, 200, 200.5))
df$x2 <- cut(df$x1, breaks = c(0,200.1,999), right = FALSE)
df$x3 <- cut(df$x1, breaks = c(0,200.1,999), right = FALSE, dig.lab = 4)
df
# x1 x2 x3
# 1 100.0 [0,200) [0,200.1)
# 2 100.5 [0,200) [0,200.1)
# 3 200.0 [0,200) [0,200.1)
# 4 200.5 [200,999) [200.1,999)
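For reference, one way to sidestep the default label rounding entirely is to supply explicit labels (a sketch; the label strings are written out by hand):
df$x4 <- cut(df$x1, breaks = c(0, 200.1, 999), right = FALSE,
             labels = c("[0,200.1)", "[200.1,999)"))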
I'm working to implement a lpSolve solution to optimizing a hypothetical daily fantasy baseball problem. I'm having trouble applying my last constraint:
position - Exactly 3 outfielders (OF), 2 pitchers (P), and 1 of everything else
cost - Cost less than 200
team - Max number from any one team is 6
team - Minimum number of teams on a roster is 3
Say for example you have a dataframe of 1000 players with points, cost, position, and team and you're trying to maximize average points:
library(tidyverse)
library(lpSolve)
set.seed(123)
df <- tibble(avg_points = sample(5:45, 1000, replace = T),
             cost = sample(3:45, 1000, replace = T),
             position = sample(c("P","C","1B","2B","3B","SS","OF"), 1000, replace = T),
             team = sample(LETTERS, 1000, replace = T)) %>%
  mutate(id = row_number())
head(df)
# A tibble: 6 x 5
# avg_points cost position team id
# <int> <int> <chr> <chr> <int>
#1 17 13 2B Y 1
#2 39 45 1B P 2
#3 29 33 1B C 3
#4 38 31 2B V 4
#5 17 13 P A 5
#6 10 6 SS V 6
I've implemented the first 3 constraints with the following code, but I'm having trouble figuring out how to implement the minimum number of teams on a roster. I think I need to add additional variables to the model, but I'm not sure how to do that.
#set the objective function (what we want to maximize)
obj <- df$avg_points
# set the constraint rows.
con <- rbind(t(model.matrix(~ position + 0, df)),
             cost = df$cost,
             t(model.matrix(~ team + 0, df)))
#set the constraint values
rhs <- c(1, 1, 1, 1, 3, 2, 1, # 1. exactly 3 outfielders, 2 pitchers and 1 of everything else
         200,                 # 2. at a cost less than 200
         rep(6, 26))          # 3. max number from any team is 6
#set the direction of the constraints
dir <- c("=","=","=","=","=","=","=","<=",rep("<=",26))
result <- lp("max",obj,con,dir,rhs,all.bin = TRUE)
If it helps, I'm trying to replicate this paper (with minor tweaks), which has corresponding Julia code here.
This might be a solution for your problem.
This is the data I have used (identical to yours):
library(tidyverse)
library(lpSolve)
N <- 1000
set.seed(123)
df <- tibble(avg_points = sample(5:45, N, replace = T),
             cost = sample(3:45, N, replace = T),
             position = sample(c("P","C","1B","2B","3B","SS","OF"), N, replace = T),
             team = sample(LETTERS, N, replace = T)) %>%
  mutate(id = row_number())
You want to find x1, ..., xn that maximise the objective function below:
x1 * average_points1 + x2 * average_points2 + ... + xn * average_pointsn
With the way lpSolve works, you will need to express every LHS as the sum over x1, ..., xn times the constant vector you provide.
Since you cannot express the number of teams with your current variables, you can introduce new ones (I will call them y1..yn_teams and z1..zn_teams):
# number of teams:
n_teams = length(unique(df$team))
Your new objective function (the ys and zs will not influence the overall objective function, since their coefficients are set to 0):
obj <- c(df$avg_points, rep(0, 2 * n_teams))
The first 3 constraints are the same, but with the added constants for y and z:
c1 <- t(model.matrix(~ position + 0, df))
c1 <- cbind(c1, matrix(0, ncol = 2 * n_teams, nrow = nrow(c1)))
c2 <- df$cost
c2 <- c(c2, rep(0, 2 * n_teams))
c3 <- t(model.matrix(~ team + 0, df))
c3 <- cbind(c3, matrix(0, ncol = 2 * n_teams, nrow = nrow(c3)))
Since you want to have at least 3 teams, you will first use y to count the number of players per team:
This constraint sums up all picked players of a team and subtracts the corresponding y variable, so the difference must equal 0. (diag() creates the identity matrix; we do not worry about z at this point.)
# should be x1 + ... + xn - y1 ... yn == 0
c4_1 <- cbind(t(model.matrix(~ team + 0, df)),           # x
              -diag(n_teams),                            # y
              matrix(0, ncol = n_teams, nrow = n_teams)) # z; == 0
Since each y is now the number of players in a team, you can now make sure that z is binary with this constraint:
c4_2 <- cbind(t(model.matrix(~ team + 0, df)), # x1 + ... + xn
              -diag(n_teams),                  # - (y1 + ... + yn)
              diag(n_teams))                   # + z; <= 1, so z stays binary
This is the constraint that ensures that at least 3 teams are picked:
c4_3 <- c(rep(0, nrow(df) + n_teams), # x and y
          rep(1, n_teams))            # z; sum(z) >= 3
You need to make sure that z_i = 1 whenever team i has at least one picked player, i.e. whenever y_i >= 1.
You can use the big-M method for that to create a constraint, which is: y_i <= M * z_i.
Or, in a more lpSolve friendly version: y_i - M * z_i <= 0.
In this case you can use 6 as a value for M, because it is the largest value any y can take:
c4_4 <- cbind(matrix(0, nrow = n_teams, ncol = nrow(df)), # x
              diag(n_teams),                              # y
              -diag(n_teams) * 6)                         # - M * z; <= 0
This constraint is added to make sure all x are binary:
#all x binary
c5 <- cbind(diag(nrow(df)),                                 # x
            matrix(0, ncol = 2 * n_teams, nrow = nrow(df))) # y and z
Create the new constraint matrix:
con <- rbind(c1,
             c2,
             c3,
             c4_1,
             c4_2,
             c4_3,
             c4_4,
             c5)
#set the constraint values
rhs <- c(1, 1, 1, 1, 3, 2, 1, # 1. exactly 3 outfielders, 2 pitchers and 1 of everything else
         200,                 # 2. at a cost less than 200
         rep(6, n_teams),     # 3. max number from any team is 6
         rep(0, n_teams),     # c4_1
         rep(1, n_teams),     # c4_2
         3,                   # c4_3
         rep(0, n_teams),     # c4_4
         rep(1, nrow(df)))    # c5 binary
#set the direction of the constraints
dir <- c(rep("==", 7),        # c1
         "<=",                # c2
         rep("<=", n_teams),  # c3
         rep("==", n_teams),  # c4_1
         rep("<=", n_teams),  # c4_2
         ">=",                # c4_3
         rep("<=", n_teams),  # c4_4
         rep("<=", nrow(df))) # c5
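Before solving, a quick dimension check helps catch construction mistakes (a sketch; every constraint row needs one coefficient per decision variable):
dim(con)     # ncol(con) should be nrow(df) + 2 * n_teams
length(rhs)  # should equal nrow(con)
length(dir)  # should equal nrow(con)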
The problem is almost the same, but I am using all.int instead of all.bin to make sure the counts work for the players in the team:
result <- lp("max",obj,con,dir,rhs,all.int = TRUE)
Success: the objective function is 450
roster <- df[result$solution[1:nrow(df)] == 1, ]
roster
# A tibble: 10 x 5
avg_points cost position team id
<int> <int> <chr> <chr> <int>
1 45 19 C I 24
2 45 5 P X 126
3 45 25 OF N 139
4 45 22 3B J 193
5 45 24 2B B 327
6 45 25 OF P 340
7 45 23 P Q 356
8 45 13 OF N 400
9 45 13 SS L 401
10 45 45 1B G 614
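A quick sanity check that this roster satisfies the stated constraints (a sketch using the roster object above):
sum(roster$cost)             # <= 200
table(roster$position)       # 3 OF, 2 P, and 1 of everything else
length(unique(roster$team))  # >= 3
max(table(roster$team))      # <= 6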
If you change your data to
N <- 1000
set.seed(123)
df <- tibble(avg_points = sample(5:45, N, replace = T),
             cost = sample(3:45, N, replace = T),
             position = sample(c("P","C","1B","2B","3B","SS","OF"), N, replace = T),
             team = sample(c("A", "B"), N, replace = T)) %>%
  mutate(id = row_number())
It will now be infeasible, because the number of teams in the data is less than 3.
Going back to the original data, you can check that the z part of the solution (positions 1027:1052) picks out exactly the teams on the roster:
sort(unique(df$team))[result$solution[1027:1052]==1]
[1] "B" "E" "I" "J" "N" "P" "Q" "X"
sort(unique(roster$team))
[1] "B" "E" "I" "J" "N" "P" "Q" "X"
I am new to R and need to do pairwise comparison formulas across a set of variables. The number of elements to be compared will be dynamic, but here is a hardcoded example with 4 elements, each compared against the other:
#there are 4 choices A, B, C, D -
#they are compared against each other and comparisons are stored:
df1 <- data.frame("A" = c(80),"B" = c(20))
df2 <- data.frame("A" = c(90),"C" = c(10))
df3 <- data.frame("A" = c(95), "D" = c(5))
df4 <- data.frame("B" = c(80), "C" = c(20))
df5 <- data.frame("B" = c(90), "D" = c(10))
df6 <- data.frame("C" = c(80), "D" = c(20))
#show the different comparisons in a matrix
matrixA <- matrix(c("",        df1$B[1], df2$C[1], df3$D[1],
                    df1$A[1], "",        df4$C[1], df5$D[1],
                    df2$A[1], df4$B[1], "",        df6$D[1],
                    df3$A[1], df5$B[1], df6$C[1], ""),
                  nrow = 4, ncol = 4, byrow = TRUE)
dimnames(matrixA) = list(c("A","B","C","D"),c("A","B","C","D"))
#perform calculations on the comparisons
matrixB <- matrix(
c(1, df1$B[1]/df1$A[1], df2$C[1]/df2$A[1], df3$D[1]/df3$A[1],
df1$A[1]/df1$B[1], 1, df4$C[1]/df4$B[1], df5$D[1]/df5$B[1],
df2$A[1]/df2$C[1], df4$B[1]/df4$C[1], 1, df6$D[1]/df6$C[1],
df3$A[1]/df3$D[1], df5$B[1]/df5$D[1], df6$C[1]/df6$D[1], 1),
nrow = 4, ncol = 4, byrow = TRUE)
matrixB <- rbind(matrixB, colSums(matrixB)) #add the sum of the columns
dimnames(matrixB) = list(c("A","B","C","D","Sum"),c("A","B","C","D"))
#do some more calculations that I'll use later on
dfC <- data.frame("AB" = c(matrixB["A","A"] / matrixB["A","B"],
matrixB["B","A"] / matrixB["B","B"],
matrixB["C","A"] / matrixB["C","B"],
matrixB["D","A"] / matrixB["D","B"]),
"BC" = c(matrixB["A","B"] / matrixB["A","C"],
matrixB["B","B"] / matrixB["B","C"],
matrixB["C","B"] / matrixB["C","C"],
matrixB["D","B"] / matrixB["D","C"]
),
"CD" = c(matrixB["A","C"] / matrixB["A","D"],
matrixB["B","C"] / matrixB["B","D"],
matrixB["C","C"] / matrixB["C","D"],
matrixB["D","C"] / matrixB["D","D"]))
dfCMeans <- colMeans(dfC)
#create the normalization matrix
matrixN <- matrix(c(
matrixB["A","A"] / matrixB["Sum","A"], matrixB["A","B"] / matrixB["Sum","B"], matrixB["A","C"] / matrixB["Sum","C"], matrixB["A","D"] / matrixB["Sum","D"],
matrixB["B","A"] / matrixB["Sum","A"], matrixB["B","B"] / matrixB["Sum","B"], matrixB["B","C"] / matrixB["Sum","C"], matrixB["B","D"] / matrixB["Sum","D"],
matrixB["C","A"] / matrixB["Sum","A"], matrixB["C","B"] / matrixB["Sum","B"], matrixB["C","C"] / matrixB["Sum","C"], matrixB["C","D"] / matrixB["Sum","D"],
matrixB["D","A"] / matrixB["Sum","A"], matrixB["D","B"] / matrixB["Sum","B"], matrixB["D","C"] / matrixB["Sum","C"], matrixB["D","D"] / matrixB["Sum","D"]
), nrow = 4, ncol = 4, byrow = TRUE)
Since R is so concise, it seems like there should be a much better way to do this. I would like to know an easier way to handle these types of calculations in R.
OK, I might be starting to piece together something here.
We start with a matrix like so:
A <- structure(
c(NA, 20, 10, 5, 80, NA, 20, 10, 90, 80, NA, 20, 95, 90, 80, NA),
.Dim = c(4, 4),
.Dimnames = list(LETTERS[1:4], LETTERS[1:4]))
A
# A B C D
# A NA 80 90 95
# B 20 NA 80 90
# C 10 20 NA 80
# D 5 10 20 NA
This matrix is the result of a pairwise comparison on a vector of length 4. We know nothing of this vector, and the only thing we know about the function used in the comparison is that it is binary non-commutative, or more precisely: f(x, y) = 100 - f(y, x) and the result is ∈ [0, 100].
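Incidentally, this matrix does not have to be typed in by hand: a sketch that assembles the same A from the pairwise data frames df1..df6 defined in the question:
pairs <- list(df1, df2, df3, df4, df5, df6)
A <- matrix(NA_real_, 4, 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
for (p in pairs) {
  nm <- names(p)
  A[nm[1], nm[2]] <- p[[1]]  # e.g. df1 fills A["A","B"] with 80
  A[nm[2], nm[1]] <- p[[2]]  # ... and A["B","A"] with 20
}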
matrixB appears to be simply the transpose of matrixA divided (elementwise) by matrixA:
B = t(A) / A
or, if you prefer:
B = (100 - A) / A
Potato, potahto, due to the above-mentioned property.
B <- (100 - A) / A  # one way
B <- t(A) / A       # or, equivalently, the transpose over A
# fill in the diagonal with 1s
diag(B) <- 1
round(B, 2)
# A B C D
# A 1 0.25 0.11 0.05
# B 4 1.00 0.25 0.11
# C 9 4.00 1.00 0.25
# D 19 9.00 4.00 1.00
The 'normalized' matrix, as you call it, seems to be simply each column divided by its sum.
B.norm <- t(t(B) / colSums(B))
round(B.norm, 3)
# A B C D
# A 0.030 0.018 0.021 0.037
# B 0.121 0.070 0.047 0.079
# C 0.273 0.281 0.187 0.177
# D 0.576 0.632 0.746 0.707
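The dfC block from the question (ratios of adjacent columns) collapses in the same way; a sketch using the reconstructed B:
dfC <- data.frame(AB = B[, "A"] / B[, "B"],
                  BC = B[, "B"] / B[, "C"],
                  CD = B[, "C"] / B[, "D"])
colMeans(dfC)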
I am trying to use approx() and dplyr to interpolate values in an existing array. My initial code looks like this ...
p <- c(1, 1, 1, 2, 2, 2)
q <- c(1, 2, 3, 1, 2, 3)
r <- c(1, 2, 3, 4, 5, 6)
Inputs <- data.frame(p, q, r)
new.inputs <- c(1.5, 2.5)
library(dplyr)
Interpolated <- Inputs %>%
  group_by(p) %>%
  arrange(p, q) %>%
  mutate(new.output = approx(x = q, y = r, xout = new.inputs)$y)
I expect to see 1.5, 2.5, 4.5, 5.5 but instead I get
Error: incompatible size (2), expecting 3 (the group size) or 1
Can anyone tell me where I am going wrong?
You can get the values you expect using dplyr.
library(dplyr)
Inputs %>%
  group_by(p) %>%
  arrange(p, q, .by_group = TRUE) %>%
  summarise(new.outputs = approx(x = q, y = r, xout = new.inputs)$y)
# p new.outputs
# <dbl> <dbl>
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5
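As an aside, returning more than one row per group from summarise() is deprecated in recent dplyr; reframe() is the intended replacement there (a sketch, assuming dplyr >= 1.1.0):
Inputs %>%
  group_by(p) %>%
  arrange(p, q, .by_group = TRUE) %>%
  reframe(new.outputs = approx(x = q, y = r, xout = new.inputs)$y)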
You can also get the values you expect using the ddply function from plyr.
library(plyr)
# Output as coordinates
ddply(Inputs, .(p), summarise,
      new.output = paste(approx(x = q, y = r, xout = new.inputs)$y, collapse = ","))
# p new.output
# 1 1.5,2.5
# 2 4.5,5.5
#######################################
# Output as flattened per group p
ddply(Inputs,
      .(p),
      summarise,
      new.output = approx(x = q, y = r, xout = new.inputs)$y)
# p new.output
# 1 1.5
# 1 2.5
# 2 4.5
# 2 5.5