Obtaining Probabilities in KNN Classifier in R

I have the following data set:
TRAIN dataset
Sr A B C XX
1 0.09 0.52 11.1 high
2 0.13 0.25 11.1 low
3 0.20 0.28 11.1 high
4 0.29 0.50 11.1 low
5 0.31 0.58 11.1 high
6 0.32 0.37 11.1 high
7 0.37 0.58 11.1 low
8 0.38 0.40 11.1 low
9 0.42 0.65 11.1 high
10 0.42 0.79 11.1 low
11 0.44 0.34 11.1 high
12 0.45 0.89 11.1 low
13 0.57 0.72 11.1 low
TEST dataset
Sr A B C XX
1 0.54 1.36 9.80 low
2 0.72 0.82 9.80 low
3 0.19 0.38 9.90 high
4 0.25 0.44 9.90 high
5 0.29 0.54 9.90 high
6 0.30 0.54 9.90 high
7 0.42 0.86 9.90 low
8 0.44 0.86 9.90 low
9 0.49 0.66 9.90 low
10 0.54 0.76 9.90 low
11 0.54 0.76 9.90 low
12 0.68 1.08 9.90 low
13 0.88 0.51 9.90 high
Sr : Serial Number
A-C : Parameters
XX : Output Binary Parameter
I am trying to use the KNN classifier to develop a predictor model with 5 nearest neighbors. Following is the code that I have written:
train_input <- as.matrix(train[,-ncol(train)])
train_output <- as.factor(train[,ncol(train)])
test_input <- as.matrix(test[,-ncol(test)])
prediction <- knn(train_input, test_input, train_output, k=5, prob=TRUE)
resultdf <- as.data.frame(cbind(test[,ncol(test)], prediction))
colnames(resultdf) <- c("Actual","Predicted")
RESULT dataset
A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
I have the following concerns:
What should I do to obtain probability values? Is this a probability of getting high or low i.e. P(high) or P(low)?
The levels are set to 1 (high) and 2 (low), which I believe is based on the order of first appearance: if low appeared before high in the train dataset, it would have the value 1. I feel this is not good practice. Is there any way I can avoid this?
If there were more classes (more than 2) in the classifier, how would I handle this in the classifier?
I am using the class and e1071 libraries.
Thanks.
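On the first two concerns, here is a minimal sketch rather than the asker's exact setup: it uses only columns A and B from a few of the training rows above, and the names win_prob and p_high are made up for illustration. With prob = TRUE, knn() from the class package attaches a "prob" attribute holding the proportion of votes for the winning class, so the value is P(high) for some rows and P(low) for others. Factor levels default to sorted order, and cbind() on a factor keeps only its integer codes; setting the levels explicitly and converting with as.character() avoids depending on either behaviour.

```r
library(class)  # knn() lives in the 'class' package

# A few training rows from the question (columns A and B only; an assumption
# made to keep the sketch short)
train <- data.frame(
  A  = c(0.09, 0.13, 0.20, 0.29, 0.31, 0.32, 0.37),
  B  = c(0.52, 0.25, 0.28, 0.50, 0.58, 0.37, 0.58),
  XX = c("high", "low", "high", "low", "high", "high", "low")
)
test <- data.frame(A = c(0.19, 0.25), B = c(0.38, 0.44))

# Fix the level order explicitly instead of relying on the default
train_output <- factor(train$XX, levels = c("low", "high"))

prediction <- knn(train[, c("A", "B")], test, train_output, k = 5, prob = TRUE)

# Vote share of the winning class for each test row
win_prob <- attr(prediction, "prob")

# Two-class case only: turn the winning-class share into P(high)
p_high <- ifelse(prediction == "high", win_prob, 1 - win_prob)

# as.character() keeps the labels rather than the integer codes
result <- data.frame(Predicted = as.character(prediction), P_high = p_high)
print(result)
```

With more than two classes the attribute still reports only the winning class's vote share, so the 1 - win_prob step applies to the binary case only.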

Utility function built before the "text" argument to scan was introduced:
rd.txt <- function(txt, header = TRUE, ...) {
  tconn <- textConnection(txt)
  rd <- read.table(tconn, header = header, ...)
  close(tconn)
  rd
}
RESULT <- rd.txt(" A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
")
> prop.table(table(RESULT))
   P
A         1       2
  1 0.15385 0.23077
  2 0.07692 0.53846
You can also set up prop.table to deliver row or column proportions (AKA probabilities).
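As a short sketch of that last point, the margin argument of prop.table selects row or column proportions. The data frame below re-enters the A and P codes from the RESULT table by hand; the names row_props and col_props are illustrative.

```r
# Actual (A) vs. predicted (P) codes copied from the RESULT table above
RESULT <- data.frame(
  A = c(2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1),
  P = c(2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2)
)

tab <- table(RESULT)

# margin = 1: proportions within each row (each actual class sums to 1)
row_props <- prop.table(tab, margin = 1)

# margin = 2: proportions within each column (each predicted class sums to 1)
col_props <- prop.table(tab, margin = 2)

print(round(row_props, 3))
print(round(col_props, 3))
```

Note that these are proportions of the cross-tabulated outcomes, not the per-observation vote shares that knn() computes.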

Related

Method in R to find difference between rows with varying row spacing

I want to add an extra column in a dataframe which displays the difference between certain rows, where the distance between the rows also depends on values in the table.
I found out that:
mutate(Col_new = Col_1 - lead(Col_1, n = x))
can find the difference for a fixed n, but only an integer can be used as input. How would you find the difference between rows when the distance between them varies?
I am trying to get the output in Col_new, which is the difference between rows i and i+n, where n takes the value in column Count. (The data is rounded, so there might be 0.01 discrepancies in Col_new.)
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
Data:
df <- data.frame(Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
Count = sort(rep(1:4, 5)))
Some code that generates the intended output, but can undoubtedly be made more efficient.
library(dplyr)
df %>%
mutate(col_2 = sapply(1:4, function(s){lead(Col_1, n = s)})) %>%
rowwise() %>%
mutate(Col_new = Col_1 - col_2[Count]) %>%
select(-col_2)
Output:
# A tibble: 20 × 3
# Rowwise:
Col_1 Count Col_new
<dbl> <int> <dbl>
1 0.9 1 -0.68
2 1.58 1 -0.310
3 1.89 1 0.0500
4 1.84 1 0.27
5 1.57 1 0.27
6 1.3 2 -0.26
7 1.35 2 -0.89
8 1.56 2 -1.58
9 2.24 2 -1.8
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.0100
13 5.04 3 0.600
14 4.99 3 0.600
15 4.71 3 0.0100
16 4.44 4 -1.84
17 4.39 4 NA
18 4.7 4 NA
19 5.38 4 NA
20 6.28 4 NA
library(dplyr)
df %>% mutate(Col_new = case_when(
  count == 1 ~ col_1 - lead(col_1, n = 1),
  count == 2 ~ col_1 - lead(col_1, n = 2),
  count == 3 ~ col_1 - lead(col_1, n = 3),
  count == 4 ~ col_1 - lead(col_1, n = 4),
  count == 5 ~ col_1 - lead(col_1, n = 5)
))
col_1 count Col_new
1 0.90 1 -0.68
2 1.58 1 -0.31
3 1.89 1 0.05
4 1.84 1 0.27
5 1.57 1 0.27
6 1.30 2 -0.26
7 1.25 2 -0.99
8 1.56 2 -1.58
9 2.24 2 -1.80
10 3.14 2 -1.58
11 4.04 3 -0.95
12 4.72 3 0.01
13 5.04 3 0.60
14 4.99 3 0.60
15 4.71 3 0.01
16 4.44 4 -1.84
17 4.39 4 NA
18 4.70 4 NA
19 5.38 4 NA
20 6.28 4 NA
This gives the desired result, but it does not scale well: with 10 or more different counts you would need one case_when branch per count, so another solution is required.
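One way to avoid a branch per count is plain vector indexing, sketched here in base R using the df from the "Data:" block above (the helper name target is made up for illustration). Each row looks up the value Count rows ahead; indexing past the end of a vector yields NA, which produces the trailing NAs without special handling.

```r
df <- data.frame(
  Col_1 = c(0.90, 1.58, 1.89, 1.84, 1.57, 1.30, 1.35,
            1.56, 2.24, 3.14, 4.04, 4.72, 5.04, 4.99,
            4.71, 4.44, 4.39, 4.70, 5.38, 6.28),
  Count = sort(rep(1:4, 5))
)

# Row index of the value to subtract: the current row plus its own Count
target <- seq_len(nrow(df)) + df$Count

# Out-of-range indices return NA, so the last rows become NA automatically
df$Col_new <- df$Col_1 - df$Col_1[target]

print(df)
```

This assumes Count holds positive whole numbers; zero, negative, or NA values would need separate handling.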

Scrape Data into R

I'm currently trying to scrape the Player Standard Stats table into R but am having trouble getting the right table.
html_link <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats#stats_standard::1"
df <- html_link %>%
xml2::read_html() %>%
rvest::html_nodes("table") %>%
rvest::html_table(fill = TRUE)
The page offers a "copy link to clipboard" option for the table, so I tried scraping from that link, but I'm not getting the right table. Does anyone know how to do this automatically in R without having to download the CSV file?
Thanks.
You can use the "embed link" on the table...
url <- "https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fstats%2FPremier-League-Stats&div=div_stats_standard"
f <- url %>%
xml2::read_html() %>%
rvest::html_nodes('table') %>%
rvest::html_table() %>%
.[[1]]
> head(f)
1 Rk Player Nation Pos Squad Age Born
2 1 Patrick van Aanholt nl NED DF Crystal Palace 30-170 1990
3 2 Tammy Abraham eng ENG FW Chelsea 23-136 1997
4 3 Che Adams eng ENG FW Southampton 24-217 1996
5 4 Tosin Adarabioyo eng ENG DF Fulham 23-144 1997
6 5 Adrián es ESP GK Liverpool 34-043 1987
Playing Time Playing Time Playing Time Playing Time Performance
1 MP Starts Min 90s Gls
2 14 13 1,144 12.7 0
3 18 10 957 10.6 6
4 22 20 1,735 19.3 4
5 19 19 1,710 19.0 0
6 2 2 180 2.0 0
Performance Performance Performance Performance Performance
1 Ast G-PK PK PKatt CrdY
2 1 0 0 0 1
3 1 6 0 0 0
4 4 4 0 0 1
5 0 0 0 0 1
6 0 0 0 0 0
Performance Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 CrdR Gls Ast G+A
2 0 0.00 0.08 0.08
3 0 0.56 0.09 0.66
4 0 0.21 0.21 0.41
5 0 0.00 0.00 0.00
6 0 0.00 0.00 0.00
Per 90 Minutes Per 90 Minutes Expected Expected Expected Expected
1 G-PK G+A-PK xG npxG xA npxG+xA
2 0.00 0.08 0.8 0.8 0.8 1.6
3 0.56 0.66 5.5 5.5 0.9 6.3
4 0.21 0.41 5.1 5.1 4.3 9.4
5 0.00 0.00 0.8 0.8 0.1 0.9
6 0.00 0.00 0.0 0.0 0.0 0.0
Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 xG xA xG+xA npxG
2 0.06 0.06 0.12 0.06
3 0.51 0.08 0.60 0.51
4 0.26 0.22 0.49 0.26
5 0.04 0.01 0.05 0.04
6 0.00 0.00 0.00 0.00
Per 90 Minutes
1 npxG+xA Matches
2 0.12 Matches
3 0.60 Matches
4 0.49 Matches
5 0.05 Matches
6 0.00 Matches

Finding the mean of a subset

I have made a subset from the dataframe 'Indometh' called 'indo':
indo
Subject time conc
1 1 0.25 1.50
13 2 0.50 1.63
24 3 0.50 1.49
25 3 0.75 1.16
34 4 0.25 1.85
35 4 0.50 1.39
36 4 0.75 1.02
46 5 0.50 1.04
57 6 0.50 1.44
58 6 0.75 1.03
I want to find the average concentration for the subset. I tried the following code, but to no avail:
mean(subset(indo, conc >1 & conc <2))
I know summary(indo) will show the mean of the concentration but wanted to know if there was another way I could do this just for conc.
You can try subsetting via bracket notation:
mean(indo$conc[indo$conc > 1 & indo$conc < 2])
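The original call fails because subset() returns a data frame, and mean() on a data frame returns NA with a warning; it needs a numeric vector. The sketch below rebuilds indo by hand from the values printed above (Subject is entered as plain numbers for simplicity) and shows two equivalent ways to pull out the column; avg_conc and avg_alt are illustrative names.

```r
# 'indo' re-entered from the printout in the question
indo <- data.frame(
  Subject = c(1, 2, 3, 3, 4, 4, 4, 5, 6, 6),
  time    = c(0.25, 0.50, 0.50, 0.75, 0.25, 0.50, 0.75, 0.50, 0.50, 0.75),
  conc    = c(1.50, 1.63, 1.49, 1.16, 1.85, 1.39, 1.02, 1.04, 1.44, 1.03)
)

# Option 1: keep subset(), then extract the column before averaging
in_range <- subset(indo, conc > 1 & conc < 2)
avg_conc <- mean(in_range$conc)

# Option 2: with() avoids repeating 'indo$'
avg_alt <- with(indo, mean(conc[conc > 1 & conc < 2]))

print(avg_conc)
```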

Replace values in data frame based on a table in R

Data Frame:
set.seed(90)
df <- data.frame(id = 1:10, values = round(rnorm(10),1))
id values
1 1 0.1
2 2 -0.2
3 3 -0.9
4 4 -0.7
5 5 0.7
6 6 0.4
7 7 1.0
8 8 0.9
9 9 -0.6
10 10 2.4
Table:
table <- data.frame(values = c(-2.0001,1.0023,0.0005,1.0002,2.00009), final_values = round(rnorm(5),2))
values final_values
1 -2.00010 -0.81
2 1.00230 -0.08
3 0.00050 0.87
4 1.00020 1.66
5 2.00009 -0.24
I need to replace the values in data frame based on the closest match of the values in table.
Final Output:
id final_values
1 1 0.87
2 2 0.87
3 3 -0.08
4 4 -0.08
5 5 1.66
6 6 0.87
7 7 1.66
8 8 1.66
9 9 -0.08
10 10 -0.24
What is the best way to do this with base R?
Here is one way; you can assign the result back to df:
sapply(df$values, function(x) table$final_values[which.min(abs(x - table$values))])
[1] 0.87 0.87 -0.08 -0.08 1.66 0.87 1.66 1.66 -0.08 -0.24
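To make the assign-back step concrete, here is a small sketch on made-up data (the three values and the round lookup keys are purely illustrative, not the question's data). The lookup table is called lookup rather than table to avoid masking the base function table().

```r
# Illustrative data, not the values from the question
df <- data.frame(id = 1:3, values = c(0.1, -0.9, 2.4))
lookup <- data.frame(
  values       = c(-2, -1, 0, 1, 2),
  final_values = c(10, 20, 30, 40, 50)
)

# For each value, find the row of 'lookup' whose key is closest
nearest <- sapply(df$values, function(x) which.min(abs(x - lookup$values)))

# Write the matched values back and drop the original column
df$final_values <- lookup$final_values[nearest]
df$values <- NULL

print(df)
```

which.min() returns the first minimum, so a value exactly halfway between two keys is matched to whichever key comes first in the lookup table.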

How to merge three tables by inserting to each other in R?

I have a data frame as follows. I want to see the evolution from RIK_T1 to RIK_T2 through their frequencies, row % and column %. How can I show all three at once?
ID<-c('1','2','3','4','5','6','7','8','9','10')
RIK_T1<-c('20','15','20','20','97','20','20','20','15','15')
RIK_T2<-c('20','15','15','20','97','97','20','20','20','20')
df<-data.frame(ID,RIK_T1,RIK_T2)
df
TAB=table(df$RIK_T1,df$RIK_T2)
t1<-addmargins(TAB) #TABLE-01
TAB_row=prop.table(TAB,1)#row
t2<-round(addmargins(TAB_row),digits=2)#TABLE-01-1
TAB_col=prop.table(TAB,2)#column
t3<-round(addmargins(TAB_col),digits=2)#TABLE-01-2
I get three tables, as follows: counts, row % and column %:
15 20 97 Sum
15 1 2 0 3
20 1 4 1 6
97 0 0 1 1
Sum 2 6 2 10
15 20 97 Sum
15 0.33 0.67 0.00 1.00
20 0.17 0.67 0.17 1.00
97 0.00 0.00 1.00 1.00
Sum 0.50 1.33 1.17 3.00
15 20 97 Sum
15 0.50 0.33 0.00 0.83
20 0.50 0.67 0.50 1.67
97 0.00 0.00 0.50 0.50
Sum 1.00 1.00 1.00 3.00
Is it possible to merge them into one table like the following?
15 20 97 Sum
R%/C% R%/C% R%/C% R%/C%
15 1 2 0 3
0.33/0.50 0.67/0.33 0.00/0.00 1.00/0.83
20 1 4 1 6
0.17/0.50 0.67/0.67 0.17/0.50 1.00/1.67
97 0 0 1 1
0.00/0.00 0.00/0.00 1.00/0.50 1.00/0.50
Sum 2 6 2 10
0.50/1.00 1.33/1.00 1.17/1.00 3.00/3.00
Thanks in advance.
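One possible approach, sketched in base R under the assumption that a character matrix is an acceptable result: format the row and column proportions as "R%/C%" strings, then interleave them with the counts, one count row followed by one percentage row. The names pct and combined are made up for illustration.

```r
ID     <- as.character(1:10)
RIK_T1 <- c("20", "15", "20", "20", "97", "20", "20", "20", "15", "15")
RIK_T2 <- c("20", "15", "15", "20", "97", "97", "20", "20", "20", "20")
df <- data.frame(ID, RIK_T1, RIK_T2)

TAB <- table(df$RIK_T1, df$RIK_T2)
t1 <- addmargins(TAB)                 # counts
t2 <- addmargins(prop.table(TAB, 1))  # row proportions
t3 <- addmargins(prop.table(TAB, 2))  # column proportions

# "row%/col%" strings, in the same column-major order as the tables
pct <- sprintf("%.2f/%.2f", as.vector(t2), as.vector(t3))

# Character matrix with two rows per category: counts, then percentages
n <- nrow(t1)
combined <- matrix(
  "", nrow = 2 * n, ncol = ncol(t1),
  dimnames = list(c(rbind(rownames(t1), "R%/C%")), colnames(t1))
)
combined[seq(1, by = 2, length.out = n), ] <- as.character(as.vector(t1))
combined[seq(2, by = 2, length.out = n), ] <- pct

print(combined, quote = FALSE)
```

The result is for display only; since every cell is text, any further arithmetic should be done on t1, t2 and t3 instead.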
