Not getting the correct degrees of freedom in R

I'm unsure what I'm doing wrong. This is the data that I'm using:
dtf <- read.table(text=
"Litter Treatment Tube.L
1 Control 1641
2 Control 1290
3 Control 2411
4 Control 2527
5 Control 1930
6 Control 2158
1 GH 1829
2 GH 1811
3 GH 1897
4 GH 1506
5 GH 2060
6 GH 1207
1 FSH 3395
2 FSH 3113
3 FSH 2219
4 FSH 2667
5 FSH 2210
6 FSH 2625
1 GH+FSH 1537
2 GH+FSH 1991
3 GH+FSH 3639
4 GH+FSH 2246
5 GH+FSH 1840
6 GH+FSH 2217", header=TRUE)
What I did was:
BoarsMod1 <- aov(Tube.L ~ Litter + Treatment, data=dtf)
anova(BoarsMod1)
I'm getting an incorrect number of degrees of freedom for litter. It should be 5 (as there are 6 litter blocks) but it is 1. Am I doing something wrong?
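A likely cause, sketched here as an assumption rather than a confirmed diagnosis: read.table reads the Litter column as integers, so aov treats it as a numeric covariate with a single slope (1 degree of freedom) instead of a six-level blocking factor. Converting it to a factor before fitting should give the expected 5. The name BoarsMod2 is only illustrative, and the data frame is rebuilt from the values in the question so the snippet runs on its own.

```r
# Same values as in the question, rebuilt compactly
dtf <- data.frame(
  Litter = rep(1:6, times = 4),
  Treatment = rep(c("Control", "GH", "FSH", "GH+FSH"), each = 6),
  Tube.L = c(1641, 1290, 2411, 2527, 1930, 2158,
             1829, 1811, 1897, 1506, 2060, 1207,
             3395, 3113, 2219, 2667, 2210, 2625,
             1537, 1991, 3639, 2246, 1840, 2217)
)

# Treat the litter number as a label for a block, not as a quantity
dtf$Litter <- factor(dtf$Litter)

BoarsMod2 <- aov(Tube.L ~ Litter + Treatment, data = dtf)
tab <- anova(BoarsMod2)
tab  # the Df column should now show 5 for Litter and 3 for Treatment
```

Running str(dtf) before and after the conversion is a quick way to check how each column is stored.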

Combining grouping and filtering on a dataframe to plot in ggplot and shiny

I am creating a shiny app that tracks various stats of 6 teams in a competition over 6 years. The df is as follows:
Year Pos Team P W L D GF GA GD G. BP Pts
1 2017 1 Southern Steel 15 15 0 0 1062 812 250 130.8 0 30
2 2017 2 Central Pulse 15 9 6 0 783 756 27 103.6 2 20
3 2017 3 Northern Mystics 15 8 7 0 878 851 27 111.3 3 19
4 2017 4 Waikato Bay of Plenty Magic 15 7 8 0 873 848 25 103.0 5 19
5 2017 5 Northern Stars 15 4 11 0 738 868 -130 85.0 1 9
6 2017 6 Mainland Tactix 15 2 13 0 676 875 -199 77.3 2 6
7 2018 1 Central Pulse 15 12 3 0 850 679 171 125.2 3 27
8 2018 2 Southern Steel 15 10 5 0 874 866 8 100.9 2 22
9 2018 3 Mainland Tactix 15 7 8 0 746 761 -15 98.0 5 19
10 2018 4 Northern Mystics 15 7 8 0 783 796 -13 98.4 3 17
11 2018 5 Waikato Bay of Plenty Magic 15 5 10 0 804 878 -74 91.6 3 13
12 2018 6 Northern Stars 15 4 11 0 832 909 -77 91.5 5 13
13 2019 1 Central Pulse 15 13 2 0 856 676 180 126.6 0 39
14 2019 2 Southern Steel 15 12 3 0 946 809 137 116.9 2 38
15 2019 3 Northern Stars 15 6 9 0 785 840 -55 93.5 3 21
16 2019 4 Waikato Bay of Plenty Magic 15 5 10 0 713 793 -80 89.9 0 15
17 2019 5 Mainland Tactix 15 5 10 0 740 849 -109 87.2 0 15
18 2019 6 Northern Mystics 15 4 11 0 786 859 -73 91.5 2 14
19 2020 1 Central Pulse 15 11 2 2 594 474 120 125.3 1 49
20 2020 2 Mainland Tactix 15 9 4 2 606 566 40 107.1 2 42
21 2020 3 Northern Mystics 15 7 6 2 582 475 7 101.2 3 35
22 2020 4 Northern Stars 15 5 7 3 590 626 -36 94.2 3 29
23 2020 5 Southern Steel 15 4 10 1 578 637 -59 90.7 3 21
24 2020 6 Waikato Bay of Plenty Magic 15 2 9 4 520 592 -72 87.8 3 19
25 2021 1 Northern Mystics 15 11 4 0 924 878 46 105.2 4 37
26 2021 2 Southern Steel 15 11 4 0 813 801 12 101.5 2 35
27 2021 3 Mainland Tactix 15 9 6 0 801 775 26 103.4 4 31
28 2021 4 Northern Stars 15 9 6 0 825 791 34 104.3 2 29
29 2021 5 Central Pulse 15 4 11 0 789 810 -21 97.4 8 20
30 2021 6 Waikato Bay of Plenty Magic 15 1 15 0 807 904 -97 89.3 6 9
31 2022 1 Central Pulse 15 10 5 0 828 732 96 113.1 4 34
32 2022 2 Northern Stars 15 11 4 0 836 783 53 106.8 1 34
33 2022 3 Northern Mystics 15 9 6 0 858 807 51 106.3 4 31
34 2022 4 Southern Steel 15 6 9 0 853 898 -45 95.0 2 20
35 2022 5 Waikato Bay of Plenty Magic 15 4 11 0 733 803 -70 91.3 4 16
36 2022 6 Mainland Tactix 15 5 0 0 788 873 -85 90.3 1 16
I need 3 graphs:
A stacked bar chart showing wins/draws/losses for each team across the 6 years.
A line chart showing the position of each team at the end of each of the 6 years.
A bubble chart showing total goals for/ goals against for each team across all 6 years, with total wins dictating size of the plots.
I also need to be able to filter the data for these graphs with a checkbox for choosing teams and a slider to select the year range.
I have a stacked bar chart, but it cannot be filtered: I can't figure out how to group the original df by team AND keep it connected to the reactive filter I have. Currently the graph is built from a melted df, which is no good because I need it to use the reactive filtered data defined in the server function. The graph is also a bit ugly. How can I flip the stacking so that wins are on the bottom and draws are on top?
The second chart is all good.
For the third chart I again need to group the data so that I have total stats across the 6 years. Currently there are 36 bubbles but I only want 6.
Screenshots of shiny app output: https://imgur.com/a/qzqlUob
Code:
library(ggplot2)
library(shiny)
library(dplyr)
library(reshape2)
library(scales)
df <- read.csv("ANZ_Premiership_2017_2022.csv")
teams <- c("Central Pulse", "Northern Stars", "Northern Mystics",
           "Southern Steel", "Waikato Bay of Plenty Magic", "Mainland Tactix")
mdf <- melt(df %>%
              group_by(Team) %>% summarise(Wins = sum(W),
                                           Losses = sum(L),
                                           Draws = sum(D)),
            id.vars = "Team")
ui <- fluidPage(
  titlePanel("ANZ Premiership Analysis"),
  sidebarLayout(
    sidebarPanel(
      checkboxGroupInput("teams",
                         "Choose teams",
                         choices = teams,
                         selected = teams),
      sliderInput("years",
                  "Choose years",
                  sep="",
                  min=2017, max=2022, value=c(2017,2022))
    ),
    mainPanel(
      h2("Chart Tabs"),
      tabsetPanel(
        tabPanel("Wins/ Losses/ Draws", plotOutput("winLoss")),
        tabPanel("Standings", plotOutput("standings")),
        tabPanel("Goals", plotOutput ("goals"))
      )
    )
  )
)
server <- function(input, output){
  filterTeams <- reactive({
    df.selection <- filter(df, Team %in% input$teams, Year %in% (input$years[1]:input$years[2]))
  })
  output$winLoss <- renderPlot({
    ggplot(mdf, mapping=aes(Team, value, fill=variable))+
      geom_bar(stat = "identity", position = "stack")+
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
      ylab("Wins")+
      xlab("Team")
  })
  output$standings <- renderPlot({
    filterTeams() %>%
      ggplot(aes(x=Year, y=Pos, group=Team, color=Team)) +
      geom_line(size=1.25) +
      geom_point(size=2.5)+
      ggtitle("Premiership Positions") +
      ylab("Position")
  })
  output$goals <- renderPlot({
    filterTeams()%>%
      ggplot(aes(GF, GA, size=W, color=Team))+
      geom_point(alpha=0.7)+
      scale_size(range=c(5,15),name = "Wins")+
      xlab("Goals for")+
      ylab("Goals against")
  })
}
shinyApp(ui = ui, server = server)
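One possible approach, offered as a sketch and not as the only way to do it: move the grouping into a small function and call it on the reactive result inside each renderPlot, so the totals are recomputed whenever the checkbox or slider changes. The helper name summarise_teams is hypothetical. The four rows below are copied from the question's table just to keep the snippet self-contained.

```r
library(dplyr)

# Four rows copied from the question's table (2017 and 2018, two teams)
df <- data.frame(
  Year = c(2017, 2017, 2018, 2018),
  Team = c("Southern Steel", "Central Pulse", "Central Pulse", "Southern Steel"),
  W = c(15, 9, 12, 10), L = c(0, 6, 3, 5), D = c(0, 0, 0, 0),
  GF = c(1062, 783, 850, 874), GA = c(812, 756, 679, 866)
)

# Hypothetical helper: one row per team, totalled over whatever rows are passed in
summarise_teams <- function(data) {
  data %>%
    group_by(Team) %>%
    summarise(Wins = sum(W), Losses = sum(L), Draws = sum(D),
              GF = sum(GF), GA = sum(GA), .groups = "drop")
}

totals <- summarise_teams(df)
totals
```

In the app, the bubble chart could then start from filterTeams() %>% summarise_teams() and map size to Wins, which gives one bubble per selected team. For the stacked bars, the same summary can be melted inside renderPlot. As far as I know, ggplot2 stacks the first factor level on top by default, so setting the levels of variable to c("Draws", "Losses", "Wins") should put wins at the bottom; position_stack(reverse = TRUE) is another option.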

I am trying to plot a k means cluster plot in R

I am trying to plot a k means cluster for the following data set:
Elo Rank Elo Score
Man City 1208
Man United 1123
Tottenham 1121
Liverpool 1107
Chelsea 1064
Arsenal 1032
Crystal Palace 996
Burnley 992
Everton 988
Bournemouth 978
West Ham 976
Newcastle 970
Leicester 965
Brighton 955
Southampton 938
Watford 927
Huddersfield 926
West Brom 919
Stoke 914
Swansea 901
I am trying to run the kmeans code but I'm getting the error:
'Error in colMeans(x, na.rm = TRUE): 'x' must be numeric'
I'm assuming this is because of the first column. However, I want to label each point on the plot with the team name so I know which point is which.
An example of what I want is the first diagram at this link:
https://www.geeksforgeeks.org/clustering-in-r-programming/
How do I go about plotting this?
Maybe you want something like this. I created 4 clusters of your data using kmeans. You can use this code:
First your data:
# A tibble: 20 × 2
`Elo Rank` `Elo Score`
<chr> <dbl>
1 Man City 1208
2 Man United 1123
3 Tottenham 1121
4 Liverpool 1107
5 Chelsea 1064
6 Arsenal 1032
7 Crystal Palace 996
8 Burnley 992
9 Everton 988
10 Bournemouth 978
11 West Ham 976
12 Newcastle 970
13 Leicester 965
14 Brighton 955
15 Southampton 938
16 Watford 927
17 Huddersfield 926
18 West Brom 919
19 Stoke 914
20 Swansea 901
Next scale your data (really important for kmeans clustering):
df$`Elo Score` <- scale(df$`Elo Score`)
Next create 4 clusters:
library(factoextra)
# Compute k-means with k = 4
km.res <- kmeans(df$`Elo Score`[!is.na(df$`Elo Score`)], 4)
Results:
print(km.res)
K-means clustering with 4 clusters of sizes 9, 1, 4, 6
Cluster means:
[,1]
1 -0.1962215
2 2.4819362
3 1.2379850
4 -0.9446472
Clustering vector:
[1] 2 3 3 3 3 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4
Within cluster sum of squares by cluster:
[1] 0.5729761 0.0000000 0.3216049 0.1143089
(between_SS / total_SS = 94.7 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
Finally add your clusters to data:
# add clusters to data
df$cluster <- km.res$cluster
Result:
# A tibble: 20 × 3
`Elo Rank` `Elo Score`[,1] cluster
<chr> <dbl> <int>
1 Man City 2.48 2
2 Man United 1.47 3
3 Tottenham 1.44 3
4 Liverpool 1.28 3
5 Chelsea 0.764 3
6 Arsenal 0.382 1
7 Crystal Palace -0.0477 1
8 Burnley -0.0955 1
9 Everton -0.143 1
10 Bournemouth -0.263 1
11 West Ham -0.286 1
12 Newcastle -0.358 1
13 Leicester -0.418 1
14 Brighton -0.537 1
15 Southampton -0.740 4
16 Watford -0.871 4
17 Huddersfield -0.883 4
18 West Brom -0.967 4
19 Stoke -1.03 4
20 Swansea -1.18 4
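To actually draw the labelled plot the question asks for, something along these lines might work. It is a minimal base-graphics sketch, not the method from the linked page: the column names team and score, the seed value and nstart = 25 are all my own choices. With a single numeric variable there is no second dimension to plot, so the x-axis here is simply the team's rank.

```r
# Data copied from the question
elo <- data.frame(
  team = c("Man City", "Man United", "Tottenham", "Liverpool", "Chelsea",
           "Arsenal", "Crystal Palace", "Burnley", "Everton", "Bournemouth",
           "West Ham", "Newcastle", "Leicester", "Brighton", "Southampton",
           "Watford", "Huddersfield", "West Brom", "Stoke", "Swansea"),
  score = c(1208, 1123, 1121, 1107, 1064, 1032, 996, 992, 988, 978,
            976, 970, 965, 955, 938, 927, 926, 919, 914, 901)
)

set.seed(1)  # k-means starts from random centres; the seed value is arbitrary

# Cluster on the numeric column only; as.numeric() drops the matrix that scale() returns
elo$scaled <- as.numeric(scale(elo$score))
km <- kmeans(elo$scaled, centers = 4, nstart = 25)
elo$cluster <- km$cluster

# One point per team, coloured by cluster and labelled with the team name
plot(seq_along(elo$score), elo$score, col = elo$cluster, pch = 19,
     xlab = "Rank", ylab = "Elo score", xlim = c(0, 26))
text(seq_along(elo$score), elo$score, labels = elo$team, pos = 4, cex = 0.7)
```

Keeping the team names in their own column and passing only the numeric column to kmeans also avoids the "'x' must be numeric" error.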

How can I write a command in R that groups by multiple criteria?

I am looking for a function that classifies my data into five different industries based on their SIC code:
Permno SIC Industry
1 854
2 977
3 549
4 1231
5 3295
6 2000
7 1539
8 2549
9 3950
10 4758
11 4290
12 5498
13 5248
14 142
15 3209
16 2759
17 4859
18 2569
19 739
20 4529
It could be that all SICs between 100-200 and 400-700 should be in Industry 1, all SICs between 300-350 and 980-1020 should be in Industry 2, etc.
So in short: an 'if ... or' function where I could list all the SIC ranges that match a given industry.
Thank you!
You can add a new column with the filters by number:
For example:
data$Group <- 0
# assign the group number to the rows whose SIC falls in each range
data$Group[data$SIC < 1000] <- 1
data$Group[data$SIC >= 1000] <- 2
Floor the value after dividing the SIC value by 1000:
df$Industry <- floor(df$SIC/1000) + 1
df
# Permno SIC Industry
#1 1 854 1
#2 2 977 1
#3 3 549 1
#4 4 1231 2
#5 5 3295 4
#6 6 2000 3
#7 7 1539 2
#8 8 2549 3
#9 9 3950 4
#10 10 4758 5
#11 11 4290 5
#12 12 5498 6
#13 13 5248 6
#14 14 142 1
#15 15 3209 4
#16 16 2759 3
#17 17 4859 5
#18 18 2569 3
#19 19 739 1
#20 20 4529 5
If there is no way to programmatically define groups you may need to individually define the ranges. It is convenient to do this with case_when in dplyr.
library(dplyr)
df %>%
  mutate(Industry = case_when(between(SIC, 100, 200) | between(SIC, 400, 700) ~ 'Industry 1',
                              between(SIC, 300, 350) | between(SIC, 980, 1020) ~ 'Industry 2'))
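If the list of ranges gets long, it may be easier to keep them in a small lookup table instead of inside the case_when call. Below is a base R sketch of that idea. The function name classify_sic and the table layout (lo, hi, Industry) are assumptions of mine, and only the two example industries from the question are filled in. SICs that fall in no range come back as NA.

```r
# Lookup table: each row is one inclusive SIC range and the industry it maps to
ranges <- data.frame(
  lo = c(100, 400, 300, 980),
  hi = c(200, 700, 350, 1020),
  Industry = c(1L, 1L, 2L, 2L)
)

# Hypothetical helper: returns the industry of the first matching range, or NA
classify_sic <- function(sic, ranges) {
  out <- rep(NA_integer_, length(sic))
  for (i in seq_len(nrow(ranges))) {
    hit <- sic >= ranges$lo[i] & sic <= ranges$hi[i]
    out[hit & is.na(out)] <- ranges$Industry[i]
  }
  out
}

# A few SIC values from the question, plus 1000 to exercise the second industry
sic <- c(854, 977, 549, 1231, 142, 739, 1000)
industry <- classify_sic(sic, ranges)
industry
```

Adding an industry then only means adding rows to ranges, and the same table can be reused for other data sets.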

How to dynamically select columns

Let's assume I ran a random forest model and got the variable importance info as below:
set.seed(121)
ImpMeasure<-data.frame(mod.varImp$importance)
ImpMeasure$Vars<-row.names(ImpMeasure)
ImpMeasure.df<-ImpMeasure[order(-ImpMeasure$Overall),]
row.names(ImpMeasure.df)<-NULL
class(ImpMeasure.df)
ImpMeasure.df<-ImpMeasure.df[,c(2,1)] # so now we have the importance variable info in a data frame
ImpMeasure.df
Vars Overall
1 num_voted_users 100.000000
2 num_critic_for_reviews 58.961441
3 num_user_for_reviews 56.500707
4 movie_facebook_likes 50.680318
5 cast_total_facebook_likes 30.012205
6 gross 27.652559
7 actor_3_facebook_likes 24.094213
8 actor_2_facebook_likes 19.633290
9 imdb_score 16.063007
10 actor_1_facebook_likes 15.848972
11 duration 11.886036
12 budget 11.853066
13 title_year 7.804387
14 director_facebook_likes 7.318787
15 facenumber_in_poster 1.868376
16 aspect_ratio 0.000000
Now if I decide that I want only the top 5 variables for further analysis, then I do this:
library(dplyr)
top.var<-ImpMeasure.df[1:5,] %>% select(Vars)
top.var
Vars
1 num_voted_users
2 num_critic_for_reviews
3 num_user_for_reviews
4 movie_facebook_likes
5 cast_total_facebook_likes
How can I use this info to select only these variables from the original dataset (given below) without spelling out the actual variable names, using, say, the output of top.var? How do I use the dplyr select function for this?
My original dataset is like this:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes
1 723 178 0 855
2 302 169 563 1000
3 602 148 0 161
4 813 164 22000 23000
5 255 95 131 782
6 462 132 475 530
actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes
1 1000 760505847 886204 4834
2 40000 309404152 471220 48350
3 11000 200074175 275868 11700
4 27000 448130642 1144337 106759
5 131 228830 8 143
6 640 73058679 212204 1873
facenumber_in_poster num_user_for_reviews budget title_year
1 0 3054 237000000 2009
2 0 1238 300000000 2007
3 1 994 245000000 2015
4 0 2701 250000000 2012
5 0 97 26000000 2002
6 1 738 263700000 2012
actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes cluster
1 936 7.9 1.78 33000 2
2 5000 7.1 2.35 0 3
3 393 6.8 2.35 85000 2
4 23000 8.5 2.35 164000 3
5 12 7.1 1.85 0 1
6 632 6.6 2.35 24000 2
movies.imp<-moviesdf.cluster%>% select(one_of(top.vars),cluster)
head(movies.imp)
## num_voted_users num_user_for_reviews num_critic_for_reviews
## 1 886204 3054 723
## 2 471220 1238 302
## 3 275868 994 602
## 4 1144337 2701 813
## 5 8 127 37
## 6 212204 738 462
## movie_facebook_likes cast_total_facebook_likes cluster
## 1 33000 4834 1
## 2 0 48350 1
## 3 85000 11700 1
## 4 164000 106759 1
## 5 0 143 2
## 6 24000 1873 1
That's done!
Hadley provided the answer to that here:
select_(df, .dots = top.var)
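In current dplyr the underscore verbs such as select_ are deprecated, and the tidyselect helper all_of() is the usual way to select by a character vector of names. A minimal sketch follows, using a three-row, four-column cut-down of the question's data (the object name movies and the two-variable top.var are made up for the example). Note that top.var is a one-column data frame, so the Vars column has to be pulled out first.

```r
library(dplyr)

# Cut-down version of the question's data: first three rows, four of the columns
movies <- data.frame(
  num_critic_for_reviews = c(723, 302, 602),
  duration = c(178, 169, 148),
  num_voted_users = c(886204, 471220, 275868),
  cluster = c(2, 3, 2)
)

# Stand-in for the importance table: just the names of the variables to keep
top.var <- data.frame(Vars = c("num_voted_users", "num_critic_for_reviews"))

# all_of() takes a character vector and errors if any name is missing
movies.imp <- movies %>%
  select(all_of(as.character(top.var$Vars)), cluster)

names(movies.imp)
```

If some of the names might not exist in the data, any_of() can be used in place of all_of() to skip them silently.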

geom_bar : There are extra x-axis appear in my bar plot

My data follows this sequence:
deptime .count
1 4.5 6285
2 14.5 5901
3 24.5 6002
4 34.5 5401
5 44.5 5080
6 54.5 4567
7 104.5 3162
8 114.5 2784
9 124.5 1950
10 134.5 1800
11 144.5 1630
12 154.5 1076
13 204.5 738
14 214.5 556
15 224.5 544
16 234.5 650
17 244.5 392
18 254.5 309
19 304.5 356
20 314.5 364
My ggplot code:
ggplot(pplot, aes(x=deptime, y=.count)) + geom_bar(stat="identity",fill='#FF9966',width = 5) + labs(x="time", y="count")
output figure
There is a gap at each multiple of 100. Does anyone know how to fix it?
Thank You
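A guess at the cause, which I cannot verify from the post alone: deptime looks like a clock time written as hours * 100 + minutes (so 104.5 would mean 1 hour and 4.5 minutes). On a continuous axis the values 60 to 99 never occur, which would leave an empty stretch before every multiple of 100. Under that assumption, converting to minutes since midnight makes the bars evenly spaced. The column name minutes is my own, and only the first eight rows of the data are used here.

```r
library(ggplot2)

# First eight rows from the question
pplot <- data.frame(
  deptime = c(4.5, 14.5, 24.5, 34.5, 44.5, 54.5, 104.5, 114.5),
  .count  = c(6285, 5901, 6002, 5401, 5080, 4567, 3162, 2784)
)

# Assumption: deptime = hours * 100 + minutes, so split it and convert to minutes
pplot$minutes <- (pplot$deptime %/% 100) * 60 + pplot$deptime %% 100

p <- ggplot(pplot, aes(x = minutes, y = .count)) +
  geom_bar(stat = "identity", fill = "#FF9966", width = 5) +
  labs(x = "time (minutes)", y = "count")
p
```

If the axis should still show clock times, scale_x_continuous with custom breaks and labels could be added. Alternatively, aes(x = factor(deptime)) treats each value as a category, which also removes the gaps but gives up the numeric axis.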
