I am trying to plot a k-means cluster plot in R

I am trying to plot a k means cluster for the following data set:
Elo Rank Elo Score
Man City 1208
Man United 1123
Tottenham 1121
Liverpool 1107
Chelsea 1064
Arsenal 1032
Crystal Palace 996
Burnley 992
Everton 988
Bournemouth 978
West Ham 976
Newcastle 970
Leicester 965
Brighton 955
Southampton 938
Watford 927
Huddersfield 926
West Brom 919
Stoke 914
Swansea 901
I am trying to run the kmeans code but I'm getting the error:
'Error in colMeans(x, na.rm = TRUE): 'x' must be numeric'
I'm assuming this is because of the first column. However, I want to label each point on the plot with the team name so I know which point is which.
An example of what I want is the first diagram in this link:
https://www.geeksforgeeks.org/clustering-in-r-programming/
How do I go about plotting this?

Maybe you want something like this. I created 4 clusters of your data using kmeans. You can use this code:
First your data:
# A tibble: 20 × 2
`Elo Rank` `Elo Score`
<chr> <dbl>
1 Man City 1208
2 Man United 1123
3 Tottenham 1121
4 Liverpool 1107
5 Chelsea 1064
6 Arsenal 1032
7 Crystal Palace 996
8 Burnley 992
9 Everton 988
10 Bournemouth 978
11 West Ham 976
12 Newcastle 970
13 Leicester 965
14 Brighton 955
15 Southampton 938
16 Watford 927
17 Huddersfield 926
18 West Brom 919
19 Stoke 914
20 Swansea 901
Next, scale your data (scaling matters for kmeans clustering when variables are on different scales; with a single variable it only changes the units):
df$`Elo Score` <- scale(df$`Elo Score`)
Next create 4 clusters:
library(factoextra)
# Compute k-means with k = 4
km.res <- kmeans(df$`Elo Score`[!is.na(df$`Elo Score`)], 4)
Results:
print(km.res)
K-means clustering with 4 clusters of sizes 9, 1, 4, 6
Cluster means:
[,1]
1 -0.1962215
2 2.4819362
3 1.2379850
4 -0.9446472
Clustering vector:
[1] 2 3 3 3 3 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4
Within cluster sum of squares by cluster:
[1] 0.5729761 0.0000000 0.3216049 0.1143089
(between_SS / total_SS = 94.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Finally add your clusters to data:
# add clusters to data
df$cluster <- km.res$cluster
Result:
# A tibble: 20 × 3
`Elo Rank` `Elo Score`[,1] cluster
<chr> <dbl> <int>
1 Man City 2.48 2
2 Man United 1.47 3
3 Tottenham 1.44 3
4 Liverpool 1.28 3
5 Chelsea 0.764 3
6 Arsenal 0.382 1
7 Crystal Palace -0.0477 1
8 Burnley -0.0955 1
9 Everton -0.143 1
10 Bournemouth -0.263 1
11 West Ham -0.286 1
12 Newcastle -0.358 1
13 Leicester -0.418 1
14 Brighton -0.537 1
15 Southampton -0.740 4
16 Watford -0.871 4
17 Huddersfield -0.883 4
18 West Brom -0.967 4
19 Stoke -1.03 4
20 Swansea -1.18 4
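The steps above assign a cluster to each team but do not draw the labelled plot the question asks for. A minimal sketch of one way to do that with base R graphics follows. The names elo and km, the seed, and nstart = 25 are assumptions made for this illustration, and the cluster numbering may differ from the output above because kmeans starts from random centers.

```r
# Elo scores from the question, rebuilt as a plain data frame
elo <- data.frame(
  team = c("Man City", "Man United", "Tottenham", "Liverpool", "Chelsea",
           "Arsenal", "Crystal Palace", "Burnley", "Everton", "Bournemouth",
           "West Ham", "Newcastle", "Leicester", "Brighton", "Southampton",
           "Watford", "Huddersfield", "West Brom", "Stoke", "Swansea"),
  score = c(1208, 1123, 1121, 1107, 1064, 1032, 996, 992, 988, 978,
            976, 970, 965, 955, 938, 927, 926, 919, 914, 901),
  stringsAsFactors = FALSE
)

# Cluster on the numeric column only; the team names are kept for labels
set.seed(42)  # assumption: fixed seed so the run is reproducible
km <- kmeans(scale(elo$score), centers = 4, nstart = 25)
elo$cluster <- km$cluster

# One point per team, coloured by cluster and labelled with the team name
plot(seq_along(elo$score), elo$score, col = elo$cluster, pch = 19,
     xlab = "Rank", ylab = "Elo score", main = "k-means clusters (k = 4)")
text(seq_along(elo$score), elo$score, labels = elo$team, pos = 4, cex = 0.7)
```

Because the team names never enter kmeans, the "'x' must be numeric" error does not arise, and the names are still available for text().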

Related

Combining grouping and filtering on a dataframe to plot in ggplot and shiny

I am creating a shiny app that tracks various stats of 6 teams in a competition over 6 years. The df is as follows:
Year Pos Team P W L D GF GA GD G. BP Pts
1 2017 1 Southern Steel 15 15 0 0 1062 812 250 130.8 0 30
2 2017 2 Central Pulse 15 9 6 0 783 756 27 103.6 2 20
3 2017 3 Northern Mystics 15 8 7 0 878 851 27 111.3 3 19
4 2017 4 Waikato Bay of Plenty Magic 15 7 8 0 873 848 25 103.0 5 19
5 2017 5 Northern Stars 15 4 11 0 738 868 -130 85.0 1 9
6 2017 6 Mainland Tactix 15 2 13 0 676 875 -199 77.3 2 6
7 2018 1 Central Pulse 15 12 3 0 850 679 171 125.2 3 27
8 2018 2 Southern Steel 15 10 5 0 874 866 8 100.9 2 22
9 2018 3 Mainland Tactix 15 7 8 0 746 761 -15 98.0 5 19
10 2018 4 Northern Mystics 15 7 8 0 783 796 -13 98.4 3 17
11 2018 5 Waikato Bay of Plenty Magic 15 5 10 0 804 878 -74 91.6 3 13
12 2018 6 Northern Stars 15 4 11 0 832 909 -77 91.5 5 13
13 2019 1 Central Pulse 15 13 2 0 856 676 180 126.6 0 39
14 2019 2 Southern Steel 15 12 3 0 946 809 137 116.9 2 38
15 2019 3 Northern Stars 15 6 9 0 785 840 -55 93.5 3 21
16 2019 4 Waikato Bay of Plenty Magic 15 5 10 0 713 793 -80 89.9 0 15
17 2019 5 Mainland Tactix 15 5 10 0 740 849 -109 87.2 0 15
18 2019 6 Northern Mystics 15 4 11 0 786 859 -73 91.5 2 14
19 2020 1 Central Pulse 15 11 2 2 594 474 120 125.3 1 49
20 2020 2 Mainland Tactix 15 9 4 2 606 566 40 107.1 2 42
21 2020 3 Northern Mystics 15 7 6 2 582 475 7 101.2 3 35
22 2020 4 Northern Stars 15 5 7 3 590 626 -36 94.2 3 29
23 2020 5 Southern Steel 15 4 10 1 578 637 -59 90.7 3 21
24 2020 6 Waikato Bay of Plenty Magic 15 2 9 4 520 592 -72 87.8 3 19
25 2021 1 Northern Mystics 15 11 4 0 924 878 46 105.2 4 37
26 2021 2 Southern Steel 15 11 4 0 813 801 12 101.5 2 35
27 2021 3 Mainland Tactix 15 9 6 0 801 775 26 103.4 4 31
28 2021 4 Northern Stars 15 9 6 0 825 791 34 104.3 2 29
29 2021 5 Central Pulse 15 4 11 0 789 810 -21 97.4 8 20
30 2021 6 Waikato Bay of Plenty Magic 15 1 15 0 807 904 -97 89.3 6 9
31 2022 1 Central Pulse 15 10 5 0 828 732 96 113.1 4 34
32 2022 2 Northern Stars 15 11 4 0 836 783 53 106.8 1 34
33 2022 3 Northern Mystics 15 9 6 0 858 807 51 106.3 4 31
34 2022 4 Southern Steel 15 6 9 0 853 898 -45 95.0 2 20
35 2022 5 Waikato Bay of Plenty Magic 15 4 11 0 733 803 -70 91.3 4 16
36 2022 6 Mainland Tactix 15 5 0 0 788 873 -85 90.3 1 16
I need 3 graphs:
A stacked bar chart showing wins/draws/losses for each team across the 6 years.
A line chart showing the position of each team at the end of each of the 6 years.
A bubble chart showing total goals for/ goals against for each team across all 6 years, with total wins dictating size of the plots.
I also need to be able to filter the data for these graphs with a checkbox for choosing teams and a slider to select the year range.
I have got a stacked bar chart which can not be filtered - I can't figure out how to group the original df by team AND have it connected to the reactive filter I have. Currently the graph is connected to a melted df which is no good as I need the reactive filtered one defined in the function. The graph is also a bit ugly - how can I flip the chart so that wins are on bottom and draws are on top?
The second chart is all good.
The third chart again I need to group the data so that I have total stats across the 6 years- currently there are 36 bubbles but I only want 6.
Screenshots of shiny app output: https://imgur.com/a/qzqlUob
Code:
library(ggplot2)
library(shiny)
library(dplyr)
library(reshape2)
library(scales)
df <- read.csv("ANZ_Premiership_2017_2022.csv")
teams <- c("Central Pulse", "Northern Stars", "Northern Mystics",
           "Southern Steel", "Waikato Bay of Plenty Magic", "Mainland Tactix")
mdf <- melt(df %>%
              group_by(Team) %>%
              summarise(Wins = sum(W),
                        Losses = sum(L),
                        Draws = sum(D)),
            id.vars = "Team")
ui <- fluidPage(
  titlePanel("ANZ Premiership Analysis"),
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput("teams",
                         "Choose teams",
                         choices = teams,
                         selected = teams),
      sliderInput("years",
                  "Choose years",
                  sep = "",
                  min = 2017, max = 2022, value = c(2017, 2022))
    ),
    mainPanel(
      h2("Chart Tabs"),
      tabsetPanel(
        tabPanel("Wins/ Losses/ Draws", plotOutput("winLoss")),
        tabPanel("Standings", plotOutput("standings")),
        tabPanel("Goals", plotOutput("goals"))
      )
    )
  )
)
server <- function(input, output){
  filterTeams <- reactive({
    df.selection <- filter(df, Team %in% input$teams, Year %in% (input$years[1]:input$years[2]))
  })
  output$winLoss <- renderPlot({
    ggplot(mdf, mapping = aes(Team, value, fill = variable)) +
      geom_bar(stat = "identity", position = "stack") +
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
      ylab("Wins") +
      xlab("Team")
  })
  output$standings <- renderPlot({
    filterTeams() %>%
      ggplot(aes(x = Year, y = Pos, group = Team, color = Team)) +
      geom_line(size = 1.25) +
      geom_point(size = 2.5) +
      ggtitle("Premiership Positions") +
      ylab("Position")
  })
  output$goals <- renderPlot({
    filterTeams() %>%
      ggplot(aes(GF, GA, size = W, color = Team)) +
      geom_point(alpha = 0.7) +
      scale_size(range = c(5, 15), name = "Wins") +
      xlab("Goals for") +
      ylab("Goals against")
  })
}
shinyApp(ui = ui, server = server)
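The grouping step the question asks about can be tried outside Shiny first. The sketch below uses base R's aggregate on four rows copied from the table above. The object names sub and totals are made up for this example, and calling the same function on filterTeams() inside renderPlot is a suggestion that has not been tested in the app.

```r
# Four rows from the table above (2017 and 2018, two teams)
sub <- data.frame(
  Year = c(2017, 2017, 2018, 2018),
  Team = c("Southern Steel", "Central Pulse", "Central Pulse", "Southern Steel"),
  W  = c(15, 9, 12, 10),
  L  = c(0, 6, 3, 5),
  D  = c(0, 0, 0, 0),
  GF = c(1062, 783, 850, 874),
  GA = c(812, 756, 679, 866),
  stringsAsFactors = FALSE
)

# Sum every stat per team: one row per team instead of one per team-year
totals <- aggregate(cbind(W, L, D, GF, GA) ~ Team, data = sub, FUN = sum)
totals
```

In the app, replacing sub with filterTeams() inside each renderPlot would keep the totals tied to the checkbox and slider, so the bubble chart would get one bubble per selected team.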

How can I write a command in R that groups by multiple criteria?

I am looking for a function where I can classify my data into five different industries given their SIC code
Permno SIC Industry
1 854
2 977
3 549
4 1231
5 3295
6 2000
7 1539
8 2549
9 3950
10 4758
11 4290
12 5498
13 5248
14 142
15 3209
16 2759
17 4859
18 2569
19 739
20 4529
It could be that all SICS between 100-200 and 400-700 should be in Industry 1, all SICs between 300-350 and 980-1020 should be in Industry 2 etc.
So in short - an 'If = or' function where I could list all the SICs that could match a given industry
Thank you!
You can add a new column and fill it by filtering on the SIC value.
For example:
data$Group <- 0
data$Group[data$SIC < 1000] <- 1
data$Group[data$SIC >= 1000] <- 2
You can floor the value after dividing the SIC value by 1000:
df$Industry <- floor(df$SIC/1000) + 1
df
# Permno SIC Industry
#1 1 854 1
#2 2 977 1
#3 3 549 1
#4 4 1231 2
#5 5 3295 4
#6 6 2000 3
#7 7 1539 2
#8 8 2549 3
#9 9 3950 4
#10 10 4758 5
#11 11 4290 5
#12 12 5498 6
#13 13 5248 6
#14 14 142 1
#15 15 3209 4
#16 16 2759 3
#17 17 4859 5
#18 18 2569 3
#19 19 739 1
#20 20 4529 5
If there is no way to programmatically define groups you may need to individually define the ranges. It is convenient to do this with case_when in dplyr.
library(dplyr)
df %>%
  mutate(Industry = case_when(
    between(SIC, 100, 200) | between(SIC, 400, 700) ~ 'Industry 1',
    between(SIC, 300, 350) | between(SIC, 980, 1020) ~ 'Industry 2'))
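The same range-by-range idea can be sketched in base R with a small lookup table, which may be easier to maintain when there are many ranges. The ranges table and the classify_sic helper below are hypothetical names made up for this sketch. Only the two industries spelled out in the question are listed, so every other SIC stays NA.

```r
# SIC codes from the question
sic <- data.frame(
  Permno = 1:20,
  SIC = c(854, 977, 549, 1231, 3295, 2000, 1539, 2549, 3950, 4758,
          4290, 5498, 5248, 142, 3209, 2759, 4859, 2569, 739, 4529)
)

# Hypothetical lookup table: each row maps an inclusive SIC range to an industry
ranges <- data.frame(
  lo = c(100, 400, 300, 980),
  hi = c(200, 700, 350, 1020),
  industry = c("Industry 1", "Industry 1", "Industry 2", "Industry 2"),
  stringsAsFactors = FALSE
)

# Assign the first matching range; codes outside every range stay NA
classify_sic <- function(x, ranges) {
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(ranges))) {
    hit <- is.na(out) & x >= ranges$lo[i] & x <= ranges$hi[i]
    out[hit] <- ranges$industry[i]
  }
  out
}

sic$Industry <- classify_sic(sic$SIC, ranges)
sic
```

With the sample data only SIC 549 and 142 fall inside a listed range, so most rows stay NA until the remaining industries are added to the table.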

Not getting the correct degrees of freedom in R

I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for litter. It should be 5 (as there are 6 litter blocks) but it is 1. Am I doing something wrong?
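A likely cause, offered as a reading of the code above and not as a confirmed diagnosis: read.table stores Litter as an integer column, so aov fits it as a single numeric slope, which uses 1 degree of freedom. Converting it to a factor treats each litter as its own block. The sketch below rebuilds the same data with rep() instead of read.table, which is a shortcut taken here to keep it short.

```r
# Same data as above: 6 litters within each of 4 treatments
dtf <- data.frame(
  Litter = rep(1:6, times = 4),
  Treatment = rep(c("Control", "GH", "FSH", "GH+FSH"), each = 6),
  Tube.L = c(1641, 1290, 2411, 2527, 1930, 2158,
             1829, 1811, 1897, 1506, 2060, 1207,
             3395, 3113, 2219, 2667, 2210, 2625,
             1537, 1991, 3639, 2246, 1840, 2217)
)

# Treat litter as a blocking factor, not as a numeric covariate
dtf$Litter <- factor(dtf$Litter)

BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
tab <- anova(BoarsMod1)
tab
```

With Litter as a factor the table should report 5 degrees of freedom for Litter, 3 for Treatment and 15 for the residuals.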

How can I extract California county locations from given latitude and longitude information? [closed]

I have the following dataset for California housing data:
head(calif_cluster,15)
MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms Population
1 190300 4.20510 16 2697.00 490.00 1462
2 150800 2.54810 33 2821.00 652.00 1206
3 252600 6.08290 17 6213.20 1276.05 3288
4 269700 4.03680 52 919.00 213.00 413
5 91200 1.63680 28 3072.00 790.00 1375
6 66200 2.18980 30 744.00 156.00 410
7 148800 2.63640 39 620.95 136.00 348
8 384800 4.46150 20 2270.00 498.00 1070
9 153200 2.75000 22 1931.00 445.00 1009
10 66200 1.60057 36 973.00 219.00 613
11 461500 3.78130 43 3070.00 668.00 1240
12 144600 2.85000 22 5175.00 1213.00 2804
13 143700 5.09410 8 6213.20 1276.05 3288
14 195500 5.30620 16 2918.00 444.00 1697
15 268800 2.42110 22 620.95 136.00 348
Households Latitude Longitude cluster_kmeans gender_dom marital race edu_level rental
1 515 38.48 -122.47 1 M other black jrcollege rented
2 640 38.00 -122.13 1 F other hispanic doctorate owned
3 1162 33.88 -117.79 3 M other white jrcollege owned
4 193 37.85 -122.25 1 M single others jrcollege owned
5 705 38.13 -122.26 1 F single white doctorate rented
6 165 38.96 -122.21 1 F single others jrcollege owned
7 125 34.01 -118.18 2 M married others postgrad owned
8 521 33.83 -118.38 2 F single white graduate rented
9 407 38.95 -121.04 1 M married others postgrad leased
10 187 35.34 -119.01 2 M single hispanic doctorate owned
11 646 33.76 -118.12 2 F other others highschl leased
12 1091 37.95 -122.05 3 M other white graduate rented
13 1162 36.87 -119.75 3 M other others postgrad leased
14 444 32.93 -117.13 2 M other asian jrcollege owned
15 125 37.71 -120.98 1 F single asian postgrad leased
As I have latitude and longitude information in the dataset, I would like to extract the corresponding county for each location using R. Also, is it possible to get the capital city (or largest city) for each of the extracted counties? These could make my stratified analysis more insightful; I intend to do some clustering/mapping exercises.
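Take a look at ggmap::revgeocode.
code
library(ggmap)
# recent ggmap versions need a Google API key registered with register_google() first
revgeocode(c(-122.47,38.48)) # longitude then latitude
# [1] "2233 Sulphur Springs Ave, St Helena, CA 94574, USA"
library(dplyr)
library(tidyr)    # provides separate()
library(magrittr) # provides %<>%
# add the full address using the Google API through ggmap
df12 %<>% rowwise %>% mutate(address = revgeocode(c(Longitude,Latitude))) %>% ungroup
# split the address on commas; note that the third piece, named "county" here, actually holds the state and ZIP code
df12 %<>% separate(address,c("street_address", "city","county","country"),remove=F,sep=",")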
result
df12 %>% select(Longitude,Latitude,address,county)
# A tibble: 15 x 4
# Longitude Latitude address county
# * <dbl> <dbl> <chr> <chr>
# 1 -122.47 38.48 2233 Sulphur Springs Ave, St Helena, CA 94574, USA CA 94574
# 2 -122.13 38.00 3400-3410 Brookside Dr, Martinez, CA 94553, USA CA 94553
# 3 -117.79 33.88 19721 Bluefield Plaza, Yorba Linda, CA 92886, USA CA 92886
# 4 -122.25 37.85 6365 Florio St, Oakland, CA 94618, USA CA 94618
# 5 -122.26 38.13 119 Mimosa Ct, Vallejo, CA 94589, USA CA 94589
# 6 -122.21 38.96 Unnamed Road, Arbuckle, CA 95912, USA CA 95912
# 7 -118.18 34.01 4360-4414 Noakes St, Los Angeles, CA 90023, USA CA 90023
# 8 -118.38 33.83 903 Serpentine St, Redondo Beach, CA 90277, USA CA 90277
# 9 -121.04 38.95 14666-14690 Musso Rd, Auburn, CA 95603, USA CA 95603
# 10 -119.01 35.34 800 Ming Ave, Bakersfield, CA 93307, USA CA 93307
# 11 -118.12 33.76 6211-6295 E Marina Dr, Long Beach, CA 90803, USA CA 90803
# 12 -122.05 37.95 1120 Carey Dr, Concord, CA 94520, USA CA 94520
# 13 -119.75 36.87 1815-1899 E Pryor Dr, Fresno, CA 93720, USA CA 93720
# 14 -117.13 32.93 9010-9016 Danube Ln, San Diego, CA 92126, USA CA 92126
# 15 -120.98 37.71 748-1298 Claribel Rd, Modesto, CA 95356, USA CA 95356
data
df1 <- read.table(text = "MedianHouseValue MedianIncome MedianHouseAge TotalRooms TotalBedrooms Population
1 190300 4.20510 16 2697.00 490.00 1462
2 150800 2.54810 33 2821.00 652.00 1206
3 252600 6.08290 17 6213.20 1276.05 3288
4 269700 4.03680 52 919.00 213.00 413
5 91200 1.63680 28 3072.00 790.00 1375
6 66200 2.18980 30 744.00 156.00 410
7 148800 2.63640 39 620.95 136.00 348
8 384800 4.46150 20 2270.00 498.00 1070
9 153200 2.75000 22 1931.00 445.00 1009
10 66200 1.60057 36 973.00 219.00 613
11 461500 3.78130 43 3070.00 668.00 1240
12 144600 2.85000 22 5175.00 1213.00 2804
13 143700 5.09410 8 6213.20 1276.05 3288
14 195500 5.30620 16 2918.00 444.00 1697
15 268800 2.42110 22 620.95 136.00 348",header=T,stringsAsFactors=F)
df2 <- read.table(text = "Households Latitude Longitude cluster_kmeans gender_dom marital race edu_level rental
1 515 38.48 -122.47 1 M other black jrcollege rented
2 640 38.00 -122.13 1 F other hispanic doctorate owned
3 1162 33.88 -117.79 3 M other white jrcollege owned
4 193 37.85 -122.25 1 M single others jrcollege owned
5 705 38.13 -122.26 1 F single white doctorate rented
6 165 38.96 -122.21 1 F single others jrcollege owned
7 125 34.01 -118.18 2 M married others postgrad owned
8 521 33.83 -118.38 2 F single white graduate rented
9 407 38.95 -121.04 1 M married others postgrad leased
10 187 35.34 -119.01 2 M single hispanic doctorate owned
11 646 33.76 -118.12 2 F other others highschl leased
12 1091 37.95 -122.05 3 M other white graduate rented
13 1162 36.87 -119.75 3 M other others postgrad leased
14 444 32.93 -117.13 2 M other asian jrcollege owned
15 125 37.71 -120.98 1 F single asian postgrad leased",header=T,stringsAsFactors=F)
df12 <- cbind(df1,df2)
I don't think the library offers an option to get the capital or largest city in the county but I think you won't have too much trouble building a lookup table from online info.

R grep with an OR statement

I've been working on an R function to filter a large data frame of baseball team batting stats by game id (e.g. "2016/10/11/chnmlb-sfnmlb-1") to create a list of past team matchups by season.
With some combinations of teams the output is correct, but with others it is not (the output contains a variety of ids).
I'm not very familiar with grep and assume that is the problem. I patched my grep line and list output together by searching Stack Overflow and thought I had it until testing proved otherwise.
matchup.func <- function (home, away, df) {
  matchups <- grep(paste('[0-9]{4}/[0-9]{2}/[0-9]{2}/[', home, '|', away, 'mlb]{6}-[', away, '|', home, 'mlb]{6}-[0-9]{1}', sep = ''), df$game.id, value = TRUE)
  df <- df[df$game.id %in% matchups, c(1, 3:ncol(df))]
  out <- list()
  for (n in 1:length(unique(df$season))) {
    for (s in unique(df$season)[n]) {
      out[[s]] <- subset(df, season == s)
    }
  }
  return(out)
}
sample of data frame:
bat.stats[sample(nrow(bat.stats), 3), ]
date game.id team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
1192 2016-04-11 2016/04/11/texmlb-seamlb-1 sea 2 5 away 38 7 14 3 0 0 7 2 27 8 11 15 0.226 0.303 0.336 0.639 0.286 R
764 2016-03-26 2016/03/26/wasmlb-slnmlb-1 sln 8 12 away 38 7 9 2 1 1 5 2 27 8 11 19 0.289 0.354 0.474 0.828 0.400 S
5705 2016-09-26 2016/09/26/oakmlb-anamlb-1 oak 67 89 home 29 2 6 1 0 1 2 2 27 13 4 12 0.260 0.322 0.404 0.726 0.429 R
sample of errant output:
matchup.func('tex', 'sea', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
21 2016-03-02 atl 1 0 home 32 4 7 0 0 2 3 2 27 19 2 11 0.203 0.222 0.406 0.628 1.000 S
22 2016-03-02 bal 0 1 away 40 11 14 3 2 2 11 10 27 13 4 28 0.316 0.415 0.532 0.947 0.000 S
47 2016-03-03 bal 0 2 home 41 10 17 7 0 2 10 0 27 9 3 13 0.329 0.354 0.519 0.873 0.000 S
48 2016-03-03 tba 1 1 away 33 3 5 0 1 0 3 2 24 10 8 13 0.186 0.213 0.343 0.556 0.500 S
141 2016-03-05 tba 2 2 home 35 6 6 2 0 0 5 3 27 11 5 15 0.199 0.266 0.318 0.584 0.500 S
142 2016-03-05 bal 0 5 away 41 10 17 5 1 0 10 4 27 9 10 13 0.331 0.371 0.497 0.868 0.000 S
sample of good:
matchup.func('bos', 'bal', bat.stats)
$S
date team wins losses flag ab r h d t hr rbi bb po da so lob avg obp slg ops roi season
143 2016-03-06 bal 0 6 home 34 8 14 4 0 0 8 5 27 5 8 22 0.284 0.330 0.420 0.750 0.000 S
144 2016-03-06 bos 3 2 away 38 7 10 3 0 0 7 7 24 7 13 25 0.209 0.285 0.322 0.607 0.600 S
209 2016-03-08 bos 4 3 home 37 1 12 1 1 0 1 4 27 15 8 26 0.222 0.292 0.320 0.612 0.571 S
210 2016-03-08 bal 0 8 away 36 5 12 5 0 1 4 4 27 9 4 27 0.283 0.345 0.429 0.774 0.000 S
In the good case it gives a list of matchups as it should (i.e. S, R, F, D); in the bad case it outputs by season, but seems to only give matchups by date and not team. Not sure what to think.
I think the issue is that a regex inside [] behaves differently than you might expect. Square brackets define a character class, so the pattern matches any of those characters, in any order, and not the team codes themselves. Instead, you might try
matchups <- grep(paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")
, df$game.id, value = TRUE)
That should give you either the home or the away team, followed by either the home or away team. Without more sample data though, I am not sure if this will catch edge cases.
You should also note that you don't have to match the entire string, so the date-finding regex at the beginning is likely superfluous.
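The difference between the character class and the grouped alternation can be checked on two ids. In the sketch below the first id comes from the sample data, while the second id is hypothetical, made up to resemble the unwanted rows in the errant output. The date prefix is left out of both patterns because grepl does not need to match the whole string.

```r
ids <- c("2016/04/11/texmlb-seamlb-1",   # from the sample data
         "2016/03/02/balmlb-atlmlb-1")   # hypothetical id for illustration

home <- "tex"
away <- "sea"

# Original approach: [...] is a character class, so any six characters
# drawn from "tex|seamlb" match, including "balmlb" and "atlmlb"
bad <- paste0("[", home, "|", away, "mlb]{6}-[", away, "|", home, "mlb]{6}")

# Grouped alternation: matches only the literal team codes
good <- paste0("(", home, "|", away, ")mlb-(", home, "|", away, ")mlb")

bad_hits <- grepl(bad, ids)
good_hits <- grepl(good, ids)
bad_hits
good_hits
```

The character-class pattern accepts both ids because every letter of "balmlb" and "atlmlb" appears somewhere in "tex|seamlb", while the grouped pattern accepts only the tex/sea game. That also explains why bos/bal looked correct: its smaller letter set happens to match few other team codes.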

Resources