Add rows with specific values in between existing rows - r

I have hockey data, called df
structure(list(event_index = 1:57, coords_x = c(80, 53, 31, -56,
-34, -33, -40, 30, -66, -36, 45, 17, -6, 47, -51, -31, -69, -86,
-70, 80, 65, -76, -71, 81, -57, 80, 75, 77, -71, -40, -83, 62,
77, 76, NA, -61, 69, -45, 68, 31, 58, 61, 80, 34, 80, -85, -37,
-57, 76, 14, 49, -82, -34, -36, -83, -84, -55), coords_y = c(-1,
14, -30, 17, 26, -23, -37, 17, -32, -18, 25, 17, -38, 21, 28,
22, 17, 13, 10, -37, -17, 9, 18, -11, 21, -7, 3, 3, -38, 31,
8, -30, -2, 4, NA, -5, 15, 10, -30, -34, 20, 27, -4, 8, -18,
19, 32, -21, 0, 40, -4, -30, -24, -28, -2, -3, 34), event_rinkside = c("R",
"R", "R", "L", "L", "L", "L", "R", "L", "L", "R", "N", "N", "R",
"L", "L", "L", "L", "L", "R", "R", "L", "L", "R", "L", "R", "R",
"R", "L", "L", "L", "R", "R", "R", NA, "L", "R", "L", "R", "R",
"R", "R", "R", "R", "R", "L", "L", "L", "R", "N", "R", "L", "L",
"L", "L", "L", "L")), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -57L))
How do I insert a new row after every existing row, giving 57 * 2 = 114 rows, where the values in each new row depend on the event_rinkside column?
If event_rinkside equals R, then, I want to insert 82 into coords_x and 0 into coords_y.
If event_rinkside equals L, then, I want to insert -82 into coords_x and 0 into coords_y.
I feel like the solution to this SO question is a good starting point, but I don't know how to incorporate my own conditions.
Here is the solution I'm talking about:
library(purrr)
df %>%
group_by(id) %>%
map_dfr(rbind, NA) %>%
mutate(id = rep(df$id, each = 2))

Here's a solution with dplyr:
library(dplyr)
df %>%
mutate(coords_x = 82 * ifelse(event_rinkside == "L", -1, 1),
coords_y = 0) %>%
rbind(df, .) %>%
arrange(event_index)
How it works:
In the first step, mutate is used to modify an unassigned copy of df. The column coords_x is set to 82 multiplied by -1 if event_rinkside == "L" and by 1 otherwise. The column coords_y is set to 0.
In the next step, the unchanged original data frame df and the current unassigned and modified copy of it are combined with rbind. Here, . represents the result of the mutate step above. The result of rbind has the rows of the original version above the rows of the modified version.
In the last step, arrange is used to sort the rows along the values of event_index. In this way, each original row is directly followed by the corresponding modified row.
The result:
# A tibble: 114 x 4
   event_index coords_x coords_y event_rinkside
         <int>    <dbl>    <dbl> <chr>
 1           1       80       -1 R
 2           1       82        0 R
 3           2       53       14 R
 4           2       82        0 R
 5           3       31      -30 R
 6           3       82        0 R
 7           4      -56       17 L
 8           4      -82        0 L
 9           5      -34       26 L
10           5      -82        0 L
# … with 104 more rows
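One caveat with the ifelse() above: rows where event_rinkside is "N" also get 82, and the row where it is NA gets NA. Below is a minimal base-R sketch, assuming (the question does not say this) that only "R" and "L" rows should receive a new row. The names toy, extra and out are illustrative.

```r
# Toy data covering the four kinds of rinkside values found in df
toy <- data.frame(
  event_index    = 1:4,
  coords_x       = c(80, -56, 17, NA),
  coords_y       = c(-1, 17, 17, NA),
  event_rinkside = c("R", "L", "N", NA)
)

# Keep only rows with a defined side (%in% treats NA as "no match")
extra <- toy[toy$event_rinkside %in% c("R", "L"), ]
extra$coords_x <- ifelse(extra$event_rinkside == "R", 82, -82)
extra$coords_y <- 0

# Stack the originals above the new rows, then sort by event_index.
# order() keeps ties in their existing order, so each original row
# stays directly above its new row.
out <- rbind(toy, extra)
out <- out[order(out$event_index), ]
rownames(out) <- NULL
out
```

With this variant the result has fewer than 2 * nrow(df) rows whenever "N" or NA values are present.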

I'm not too familiar with R, but my algorithm should work regardless. Each original row n (counting from 0) moves to index 2n of an output twice as long, and its new row goes right after it at index 2n+1. I would create a second array and place the rows at those indexes.
Some pseudocode for you (I usually write in Python, so my pseudocode looks like it):
def reinsert(rows):
    array_out = [None] * (len(rows) * 2)  # initialize to the desired length
    for n in range(len(rows)):
        array_out[2 * n] = rows[n]  # original row
        array_out[2 * n + 1] = new_row_for(rows[n])  # placeholder: a function call or if/else on event_rinkside
    return array_out
You can insert the newly created rows in the loop, or add them after the fact, knowing they will all sit at every second position (the even-numbered rows when counting from 1).
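The same index-based idea can be sketched in R without a loop. Counting rows from 1 as R does, original row i lands at position 2i - 1 and its new row at position 2i. The function name interleave_rows and the toy data frames are made up for this illustration.

```r
# Interleave two data frames row by row: df[1, ], new_rows[1, ], df[2, ], ...
interleave_rows <- function(df, new_rows) {
  stopifnot(nrow(df) == nrow(new_rows))
  n <- nrow(df)
  # rbind() on two index vectors builds a 2 x n matrix; reading it
  # column-wise yields 1, n + 1, 2, n + 2, ...
  idx <- as.vector(rbind(seq_len(n), n + seq_len(n)))
  out <- rbind(df, new_rows)[idx, ]
  rownames(out) <- NULL
  out
}

orig <- data.frame(id = 1:3, x = c(10, 20, 30))
new  <- data.frame(id = 1:3, x = c(-1, -2, -3))
res  <- interleave_rows(orig, new)
res
```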

This is similar to Sven's answer, using case_when to distinguish between the possibilities within event_rinkside:
new_df <- df %>% bind_rows(
df %>% mutate(
coords_x = case_when(
event_rinkside == 'R' ~ 82,
event_rinkside == 'L' ~ -82,
TRUE ~ coords_x
),
coords_y = case_when(
event_rinkside == 'R' ~ 0,
event_rinkside == 'L' ~ 0,
TRUE ~ coords_y
)
)
) %>% arrange(
event_index
)
If you know the ranges of your variables, it could be simplified into if_elses.
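As a rough illustration of that simplification, here is a sketch with dplyr's if_else(), assuming event_rinkside only ever contains "R" or "L". With "N" present, those rows would fall into the -82 branch, and NA rows would get NA. The names toy, goal_rows and new_toy are made up.

```r
library(dplyr)

toy <- tibble(
  event_index    = 1:2,
  coords_x       = c(80, -56),
  coords_y       = c(-1, 17),
  event_rinkside = c("R", "L")
)

# One condition is enough when only two values can occur
goal_rows <- toy %>%
  mutate(
    coords_x = if_else(event_rinkside == "R", 82, -82),
    coords_y = 0
  )

new_toy <- bind_rows(toy, goal_rows) %>%
  arrange(event_index)
new_toy
```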

My attempt, which is pretty similar to the other answers:
library(dplyr)
df <- df[rep(1:nrow(df), each = 2),] ## Create a duplicate row after each row
df[seq(2,nrow(df),2),] <- df[seq(2,nrow(df),2),] %>% mutate(coords_x = case_when(event_rinkside == "R" ~ 82,
event_rinkside == "L" ~ -82,
TRUE ~ coords_x),
coords_y = case_when(event_rinkside == "R" ~ 0,
event_rinkside == "L" ~ 0,
TRUE ~ coords_y)
)

Related

Transferring name of column to a function in R

I'm trying to write a function which returns specific details about outliers (only sex, age, education, and the outlying value). I need to do it with many parameters, so I would like to transfer name of column to the function. Is there a way to do it?
For example, this code should return: f, 27, 12, 110.
my_data= data.frame( sex= c("f", "m", "f", "f", "m"),
age= c(22, 30, 24, 27, 30),
eduyears= c(12,16, 15, 12, 17),
weight= c(53, 70, 60, 110, 75),
height= c(160, 183, 157, 168, 180))
find_outliers= function (my_data, colname) {
out_values= boxplot.stats(my_data$colname)$out
out_ind= which(my_data$colname %in% out_values) #find outliers indices
outliers= my_data[out_ind ,c("sex","age","eduyears", colname)]
return (outliers)
}
find_outliers(weight)
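Your function takes two arguments, so you need to pass both in the call; you are only passing one, weight. Also, my_data$colname looks for a column literally named colname, so the column has to be accessed with [[ instead. And because weight is passed as an unquoted name, the function must first convert it to a character string before it can use it to access the column.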
Finally, see the famous question on how to Dynamically select data frame columns using $ and a vector of column names.
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
age = c(22, 30, 24, 27, 30),
eduyears = c(12,16, 15, 12, 17),
weight = c(53, 70, 60, 110, 75),
height = c(160, 183, 157, 168, 180))
find_outliers <- function (my_data, colname) {
# get the colname as a character string
colname <- as.character(substitute(colname))
out_values <- boxplot.stats(my_data[[colname]])$out
out_ind <- which(my_data[[colname]] %in% out_values) #find outliers indices
outliers <- my_data[out_ind, c("sex","age","eduyears", colname)]
outliers
}
find_outliers(my_data, weight)
#> sex age eduyears weight
#> 4 f 27 12 110
my_data |> find_outliers(weight)
#> sex age eduyears weight
#> 4 f 27 12 110
Created on 2022-11-05 with reprex v2.0.2
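If quoting the column name is acceptable, a simpler variant (a sketch, not part of the answer above) skips substitute() and indexes with [[ directly. The function name find_outliers_chr is made up for this example.

```r
my_data <- data.frame(sex = c("f", "m", "f", "f", "m"),
                      age = c(22, 30, 24, 27, 30),
                      eduyears = c(12, 16, 15, 12, 17),
                      weight = c(53, 70, 60, 110, 75),
                      height = c(160, 183, 157, 168, 180))

find_outliers_chr <- function(data, colname) {
  # colname must be a character string naming an existing column
  stopifnot(is.character(colname), colname %in% names(data))
  out_values <- boxplot.stats(data[[colname]])$out
  data[data[[colname]] %in% out_values, c("sex", "age", "eduyears", colname)]
}

res <- find_outliers_chr(my_data, "weight")
res

# Because the name is an ordinary string, looping over columns is easy
all_res <- lapply(c("weight", "height"),
                  function(col) find_outliers_chr(my_data, col))
```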

Error in PCA(Table, quanti.sup =, graph = FALSE) : The following variables are not quantitative

I'm trying to do PCA, and I use the following command:
res.table=PCA(table,quanti.sup=1:5,graph=FALSE)
But I got the following error :
Error in PCA(EAUX, quanti.sup = 1:5, graph = FALSE) :
The following variables are not quantitative: NOM
The following variables are not quantitative: ACRO
The following variables are not quantitative: PAYS
The following variables are not quantitative: TYPE
The following variables are not quantitative: PG
Here's the table :
dput(head(table)) :
structure(list(NOM = c("Evian", "Montagne des Pyrenees", "Cristaline-St-Cyr",
"Fiee des Lois", "Volcania", "Saint Diery"), ACRO = c("EVIAN",
"MTPYR", "CRIST", "FIEE", "VOLCA", "STDIE"), PAYS = c("F", "F",
"F", "F", "F", "F"), TYPE = c("M", "S", "S", "S", "S", "M"),
PG = c("P", "P", "P", "P", "P", "G"), CA = c(78, 48, 71,
89, 4.1, 85), MG = c(24, 11, 5.5, 31, 1.7, 80), `NA` = c(5,
34, 11.2, 17, 2.7, 385), K = c(1, 1, 3.2, 2, 0.9, 65), SUL = c(10,
16, 5, 47, 1.1, 25), NO3 = c(3.8, 4, 1, 0, 0.8, 1.9), HCO3 = c(357,
183, 250, 360, 25.8, 1350), CL = c(4.5, 50, 20, 28, 0.9,
285), MOY = c(60.41, 43.38, 45.86, 71.75, 4.75, 284.61)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Thank you
This code is applied to the sample dataset you have sent:
library(fastDummies)
table <- data.frame(NOM = c("Evian", "Montagne des Pyrenees", "Cristaline-St-Cyr",
"Fiee des Lois", "Volcania", "Saint Diery"),
ACRO = c("EVIAN", "MTPYR", "CRIST", "FIEE", "VOLCA", "STDIE"),
PAYS = c("F", "F","F", "F", "F", "F"),
TYPE = c("M", "S", "S", "S", "S", "M"),
PG = c("P", "P", "P", "P", "P", "G"),
CA = c(78, 48, 71, 89, 4.1, 85),
MG = c(24, 11, 5.5, 31, 1.7, 80),
`NA` = c(5,34, 11.2, 17, 2.7, 385),
K = c(1, 1, 3.2, 2, 0.9, 65),
SUL = c(10,16, 5, 47, 1.1, 25),
NO3 = c(3.8, 4, 1, 0, 0.8, 1.9),
HCO3 = c(357,183, 250, 360, 25.8, 1350),
CL = c(4.5, 50, 20, 28, 0.9,285),
MOY = c(60.41, 43.38, 45.86, 71.75, 4.75, 284.61),
stringsAsFactors = TRUE)
table_conv <- dummy_cols(table,
select_columns = c("NOM","ACRO","PAYS","TYPE","PG"),
remove_selected_columns = TRUE)
pca <- prcomp(as.matrix(table_conv))
pca
First, an easy way to convert categorical variables to numbers is one-hot encoding. You can use the fastDummies library for this.
After converting, you can apply PCA using the prcomp function in R. Since prcomp works on a numeric matrix, convert the data frame with as.matrix first.
Standard deviations (1, .., p=6):
[1] 5.168512e+02 5.385374e+01 1.796098e+01 1.134412e+01 3.412325e+00 3.841269e-14
Rotation (n x k) = (26 x 6):
PC1 PC2 PC3 PC4
CA 3.406007e-02 0.4583258631 -0.449804772 0.583419015
MG 5.452648e-02 0.0567007399 -0.120337337 -0.431460022
NA. 2.857387e-01 -0.6581371706 -0.015607808 -0.028124352
K 4.881587e-02 -0.1004287315 0.078457844 -0.025483654
SUL 1.165818e-02 0.1358452663 -0.676545259 -0.582029743
NO3 6.980208e-05 -0.0009273267 0.013738946 0.056474181
HCO3 9.122393e-01 0.3046381925 0.184652450 -0.053887132
CL 2.046765e-01 -0.4804253636 -0.498672171 0.351834600
MOY 1.939708e-01 -0.0355753827 -0.185620957 -0.016150352
NOM_Cristaline-St-Cyr -1.441857e-04 0.0010557813 0.003263708 0.026324720
NOM_Evian -7.179722e-05 0.0044191573 0.013208312 -0.001548583
NOM_Fiee des Lois -6.100869e-05 0.0038456894 -0.014279448 -0.018364975
NOM_Montagne des Pyrenees -1.811771e-04 -0.0029622537 -0.012327470 0.013222722
NOM_Saint Diery 7.681733e-04 -0.0016729014 0.001350556 -0.001045186
NOM_Volcania -3.100046e-04 -0.0046854729 0.008784342 -0.018588698
ACRO_CRIST -1.441857e-04 0.0010557813 0.003263708 0.026324720
ACRO_EVIAN -7.179722e-05 0.0044191573 0.013208312 -0.001548583
ACRO_FIEE -6.100869e-05 0.0038456894 -0.014279448 -0.018364975
ACRO_MTPYR -1.811771e-04 -0.0029622537 -0.012327470 0.013222722
ACRO_STDIE 7.681733e-04 -0.0016729014 0.001350556 -0.001045186
ACRO_VOLCA -3.100046e-04 -0.0046854729 0.008784342 -0.018588698
PAYS_F 0.000000e+00 0.0000000000 0.000000000 0.000000000
TYPE_M 6.963761e-04 0.0027462560 0.014558868 -0.002593769
TYPE_S -6.963761e-04 -0.0027462560 -0.014558868 0.002593769
PG_G 7.681733e-04 -0.0016729014 0.001350556 -0.001045186
PG_P -7.681733e-04 0.0016729014 -0.001350556 0.001045186
PC5 PC6
CA -0.117348632 0.135395809
MG 0.564075505 0.597488041
NA. -0.240227688 -0.095918767
K -0.452908211 0.420830725
SUL -0.283494587 -0.290154729
NO3 0.440026805 -0.480971856
HCO3 0.001376105 -0.077751617
CL 0.297037810 0.102927429
MOY 0.026458834 0.119000634
NOM_Cristaline-St-Cyr -0.074866123 0.150502225
NOM_Evian 0.066867097 -0.097960878
NOM_Fiee des Lois -0.034173834 0.021559689
NOM_Montagne des Pyrenees 0.072334607 -0.062871374
NOM_Saint Diery -0.004010064 0.001954674
NOM_Volcania -0.026151683 -0.009820343
ACRO_CRIST -0.074866123 0.150502225
ACRO_EVIAN 0.066867097 -0.097960878
ACRO_FIEE -0.034173834 0.021559689
ACRO_MTPYR 0.072334607 -0.062871374
ACRO_STDIE -0.004010064 0.001954674
ACRO_VOLCA -0.026151683 -0.009820343
PAYS_F 0.000000000 0.000000000
TYPE_M 0.062857033 -0.094621030
TYPE_S -0.062857033 0.097787141
PG_G -0.004010064 0.001954674
PG_P 0.004010064 -0.001822753
Here you have the dummy conversion and the PCA; adapt it to your data. It is also advisable not to name a column NA, since NA is a reserved word in R (note that data.frame() renamed it to NA. in the output above).
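If the categorical columns are only labels rather than inputs to the analysis, another option is to run the PCA on the numeric columns alone. The sketch below assumes that is acceptable for your data. It uses a subset of the sample columns, and the names water and num_cols are illustrative. (FactoMineR's PCA() also has a quali.sup argument for supplementary categorical columns, which may be what the original call intended with quanti.sup = 1:5.)

```r
water <- data.frame(
  NOM  = c("Evian", "Montagne des Pyrenees", "Cristaline-St-Cyr",
           "Fiee des Lois", "Volcania", "Saint Diery"),
  TYPE = c("M", "S", "S", "S", "S", "M"),
  CA   = c(78, 48, 71, 89, 4.1, 85),
  MG   = c(24, 11, 5.5, 31, 1.7, 80),
  K    = c(1, 1, 3.2, 2, 0.9, 65),
  SUL  = c(10, 16, 5, 47, 1.1, 25)
)

# Keep only the numeric columns; character columns are dropped
num_cols <- water[sapply(water, is.numeric)]

# scale. = TRUE standardises each column so that large-valued
# variables do not dominate the components
pca <- prcomp(num_cols, scale. = TRUE)
summary(pca)
```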

mutate and case_when with multiple cases

I would like to write syntax that converts general scale scores to T-scores.
The norms depend on two conditions, gender and age, and each combination requires its own T-score.
So my data looks something like this:
w <- factor(c("m", "w", "w", "m", "m", "w", "w", "w", "m", "m"))
x <- c(28, 18, 25, 29, 21, 19, 27, 26, 31, 22)
y <- c(80, 55, 74, 101, 84, 74, 65, 56, 88, 78)
z <- c(170, 174, 183, 190, 185, 178, 169, 163, 189, 184)
bsp1 <- data.frame(w, x, y, z)
colnames(bsp1) <- c("Geschlecht", "Alter", "xx", "yy")
rm(w, x, y, z)
bsp1
So far, I've created something like this, even though in this example it's not complete.
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht = "m" & Alter > 18 & xx == 55 ~ "1",
Geschlecht = "m" & Alter > 18 & xx == 56 ~ "2",
Geschlecht = "m" & Alter > 18 & xx == TRUE ~ "3",
))
I can't seem to figure out how to combine these multiple conditions in the case_when function. Also, if there needs to be a TRUE statement at the end, where does it go?
I hope it's kind of understandable, what I want to do here.
Thank you in advance.
You probably meant to write the following. Comparisons need == rather than =, and the TRUE fallback goes last:
library(dplyr)
bsp1 <- bsp1 %>%
mutate(xxx =
case_when(
Geschlecht == "m" & Alter > 18 & xx == 55 ~ 1,
Geschlecht == "m" & Alter > 18 & xx == 56 ~ 2,
TRUE ~ 3
))
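case_when() checks its conditions from top to bottom and uses the first one that matches, which is why the catch-all TRUE goes last. Here is a small sketch on made-up rows. The NA fallback is an assumption for this example, chosen so that unmatched combinations stand out rather than silently receiving a score.

```r
library(dplyr)

toy <- data.frame(
  Geschlecht = c("m", "m", "w"),
  Alter      = c(28, 29, 18),
  xx         = c(55, 56, 55)
)

toy <- toy %>%
  mutate(xxx = case_when(
    Geschlecht == "m" & Alter > 18 & xx == 55 ~ 1,
    Geschlecht == "m" & Alter > 18 & xx == 56 ~ 2,
    TRUE ~ NA_real_  # reached only when no rule above matched
  ))
toy
```

The third row matches neither rule, so it ends up with NA.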

Assigning a range of values to a descriptive variable in R

Apologies in advance - I am relatively new to R/RStudio and am trying to figure out how to assign a range of values to each letter grade. In the project I am working on, I am trying to predict a hidden value, and one portion of it derives from three semi-revealed values represented by letter grades. For example, I may know that the three traits revealed are an A, B+, and B, but not the exact numbers. However, from the data I have pulled previously, I know the following ranges are correct for each letter grade:
A+: 90 or greater
A: 86-89
A-: 82-85
B+: 82-77
B: 75-78
B-: 72-74
C+:69-71
C: 68-66
C-: 63-65
D: 60-62
F: 0-59
Is there a way for me to link these associated values to the letter grades to use later on in a multiple regression model?
Appreciate it.
I think you just want to organize all these values in a dataframe. Like this:
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 77, 74, 71, 68, 65, 62, 59)
mean_value <- (max_value+min_value)/2
df <- data.frame(grades, min_value, max_value, mean_value)
df
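To plug the letter grades into a regression, one option is to look up a numeric stand-in such as the range midpoint. This is a sketch, assuming the midpoint is an acceptable proxy for the hidden value. The observed vector is made up.

```r
grades <- c("A+", "A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D", "F")
min_value <- c(90, 86, 82, 79, 75, 72, 69, 66, 63, 60, 0)
max_value <- c(100, 89, 85, 81, 77, 74, 71, 68, 65, 62, 59)
mean_value <- (max_value + min_value) / 2

# Letter grades revealed for one observation (illustrative)
observed <- c("A", "B+", "B")

# match() gives each letter's position in grades;
# use it to pull the corresponding midpoint
midpoints <- mean_value[match(observed, grades)]
midpoints
#> [1] 87.5 80.0 76.0
```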
Edit: I'm no longer sure I understood your goal correctly. Here are two options to convert numeric grades to letter grades.
First, use the data.table package and perform a "rolling join". You can learn about rolling joins in this blog post:
https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
library(data.table)
grades =
"let, num
A+, 90
A, 86
A-, 82
B+, 80
B, 75
B-, 72
C+, 69
C, 68
C-, 63
D, 60
F, 0"
grades = read.csv(text=grades)
students =
"name, result
john, 60
mary, 86
anish, 79"
students = read.csv(text=students)
setDT(students)
setDT(grades)
grades[students, roll=TRUE, on=c("num"="result")]
#> let num name
#> 1: D 60 john
#> 2: A 86 mary
#> 3: B 79 anish
Alternatively, you could write a conversion function with a for loop. That would look something like this:
pct2let = function(grades, slack_fail = 2, slack_pass = 1){
bareme = structure(list(lb = c(90, 85, 80, 77, 73, 70, 65, 60, 57, 54,
50, 35, 0.01, 0), ub = c(100, 89.99, 84.99, 79.99, 76.99, 72.99,
69.99, 64.99, 59.99, 56.99, 53.99, 49.99, 34.99, 0.01), let = c("A+",
"A", "A-", "B+", "B", "B-", "C+", "C", "C-", "D+", "D", "E",
"F", "F*")), .Names = c("lb", "ub", "let"), class = "data.frame", row.names = c(NA,
-14L))
grades = ifelse(grades < 50, grades + slack_fail, grades + slack_pass)
out = grades
for(i in nrow(bareme):1){
out[grades >= bareme$lb[i]] = bareme$let[i]
}
return(out)
}
pct2let(c(45, 65, 78))
#> [1] "E" "C+" "B+"
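A vectorised alternative to the loop is base R's findInterval(), which returns, for each score, the index of the last lower bound it reaches. This sketch reuses the lower bounds from min_value in the first code block (so B+ starts at 79) and assumes all scores are 0 or higher. The function name score_to_letter is made up.

```r
# Lower bounds in increasing order, with the matching letters
lower_bound <- c(0, 60, 63, 66, 69, 72, 75, 79, 82, 86, 90)
letter <- c("F", "D", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+")

score_to_letter <- function(score) {
  letter[findInterval(score, lower_bound)]
}

score_to_letter(c(60, 86, 79))
#> [1] "D"  "A"  "B+"
```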

R Team Roster constraint with lpsolve - must pick at least x players from Team

I'm having trouble building on the foundation of a previous question.
I'd like to optimize so there are at least 3 players from the same team, but I don't care which team it is.
In the code below I can brute-force it to pick 3 players from the Bears (or another team I specify). How would you go about picking the optimal roster with 3 players from the same team, any team?
library(Rglpk)
DF <- data.frame(Team=c(rep("Bears",5), rep("Jets",5), rep("49ers", 5)), Player=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"), Role=c(rep(c("WR", "RB", "TE"),5)), Avgpts=c(22, 19, 30, 25, 20, 21, 26, 14, 21, 13, 11, 8, 4, 3, 5), Salary=c(930, 900, 1300, 970, 910, 920, 980, 720, 650, 589, 111, 1239, 145, 560, 780))
obj = DF$Avgpts
con = rbind(as.numeric(DF$Role=="WR"), as.numeric(DF$Role=="RB"), as.numeric(DF$Role=="TE"), as.numeric(DF$Team == "Bears"), DF$Salary)
dir = c("==","==","==","==","<=")
rhs = c(1,1,1,3,100000)
sol <- Rglpk_solve_LP(obj = obj
, mat = con
, dir = dir
, rhs = rhs
, types = rep("B", length(DF$Team))
, max=TRUE)
solution <- DF[sol$solution==1,]
Forgive me if I get some of the terms wrong, but here's the solution I ended up with. Each player is treated as a column, and I also have a column for each team. For each team I add a dummy variable whose coefficient equals the minimum number of players I want on a single team.
library("lpSolveAPI")
DF <- data.frame(Team=c(rep("Bears",5), rep("Jets",5), rep("49ers", 5)), Player=c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O"), Role=c(rep(c("WR", "RB", "TE"),5)), Avgpts=c(22, 19, 30, 25, 20, 21, 26, 14, 21, 13, 11, 8, 4, 3, 5), Salary=c(930, 900, 1300, 970, 910, 920, 980, 720, 650, 589, 111, 1239, 145, 560, 780))
ncol <- nrow(DF) # of players in DF
nteams <- length(unique(DF$Team))
teams <- unique(DF$Team)
lp_rowpicker <- make.lp(ncol=(ncol+nteams))
obj_vals <- DF[, "Avgpts"]
set.objfn(lp_rowpicker, c(obj_vals, rep(0, nteams))) #dummy 0s for team variable
lp.control(lp_rowpicker,sense='max')
set.type(lp_rowpicker, columns=1:(ncol+nteams), type = "binary")
add.constraint(lp_rowpicker, xt=c(DF$Salary, rep(0, nteams)), type="<=", rhs=35000)
add.constraint(lp_rowpicker, xt=c(as.numeric(DF$Role=="WR"), rep(0, nteams)), type="=", rhs=1)
add.constraint(lp_rowpicker, xt=c(as.numeric(DF$Role=="RB"), rep(0, nteams)), type="=", rhs=1)
add.constraint(lp_rowpicker, xt=c(as.numeric(DF$Role=="TE"), rep(0, nteams)), type="=", rhs=1)
I then set a constraint that the number of team columns set to one is equal to the total number of teams minus the number of teams I want in the optimal solution. In this case, since I'm looking for 1 team out of the 3 in the data frame, 2 team columns will be set to 1, and the team whose column is set to 0 will need at least 3 players to meet the minimum constraint at the row level.
#3 players total
add.constraint(lp_rowpicker, xt=c(rep(1, ncol), rep(0, nteams)), type="=", rhs=3)
# add a constraint that every team must have between 3 and 6 players.
# put a dummy value of 3 in for each team
# if the flag for the team column is 0 then 3 players must be selected (each with a value of 1 in that team's column.
for (i in 1:nteams){
team <- teams[i]
add.constraint(lp_rowpicker, lhs=3, xt=c(as.numeric(DF$Team==team), rep(0, i-1), 3, rep(0, nteams-i)), type="<=", rhs=7)
}
# one team will not have the dummy value in the team column, forcing at least 3 players picked from the same team to meet the lhs of the above constraint
add.constraint(lp_rowpicker, xt=c(rep(0, ncol), rep(1, nteams)), type="=", rhs=(nteams-1))
solve(lp_rowpicker)
get.objective(lp_rowpicker)
soln <- get.variables(lp_rowpicker)>0
solution <- DF[soln[seq_len(ncol)],]
print(solution[order(solution$Team),])
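Because the role constraints already fix the roster at three players (one WR, one RB, one TE), "at least 3 from one team" here means all three share a team. Under that assumption, and ignoring the salary cap (no three salaries in this data come near 35000), a brute-force cross-check of the LP result can be sketched in base R. The names best_per_team, totals and best_team are illustrative.

```r
DF <- data.frame(
  Team   = c(rep("Bears", 5), rep("Jets", 5), rep("49ers", 5)),
  Player = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
             "K", "L", "M", "N", "O"),
  Role   = rep(c("WR", "RB", "TE"), 5),
  Avgpts = c(22, 19, 30, 25, 20, 21, 26, 14, 21, 13, 11, 8, 4, 3, 5),
  Salary = c(930, 900, 1300, 970, 910, 920, 980, 720, 650, 589,
             111, 1239, 145, 560, 780)
)

# For every team, take the highest-scoring player in each role
best_per_team <- lapply(split(DF, DF$Team), function(team_df) {
  do.call(rbind, lapply(split(team_df, team_df$Role), function(role_df) {
    role_df[which.max(role_df$Avgpts), ]
  }))
})

# Total points of each single-team roster, then the best team overall
totals <- sapply(best_per_team, function(x) sum(x$Avgpts))
best_team <- names(which.max(totals))
best_per_team[[best_team]]
```

This only works as a check for this small case; it does not replace the LP formulation once the roster is larger than the same-team minimum.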
