I am having trouble with the legend in an LDA analysis. Here is some toy data:
> d_e_a.train
Lymphoprol. CORT Testo FDL RBC.Ab. ifn.g il.4 Profile
52 0.00 0.58 1.94 2.54 6 98 40 Med
81 22.23 0.58 0.05 1.56 4 203 45 Med
66 5.31 1.75 0.30 2.73 3 49 74 High
62 35.00 0.81 0.95 4.30 6 322 60 Low
9 6.52 2.36 0.03 0.92 4 51 75 High
70 13.27 0.47 1.67 2.57 3 278 75 Med
56 18.23 0.46 1.89 2.99 4 54 60 High
72 31.25 0.31 1.52 3.37 5 305 57 Low
90 22.09 0.40 0.06 1.62 5 254 58 Med
37 4.32 1.34 0.05 0.71 3 41 73 High
3 15.65 0.50 0.07 0.97 5 280 67 Med
17 39.32 1.71 0.30 2.06 2 93 53 High
57 19.25 1.15 0.05 1.75 5 95 73 Med
24 17.03 0.14 1.28 3.22 4 79 77 Med
85 13.73 0.52 1.59 2.20 3 62 75 Med
41 23.16 0.89 0.09 1.48 2 99 57 Med
65 29.25 0.28 0.04 2.56 5 298 55 Low
75 0.00 0.86 0.11 1.65 3 110 47 Med
22 14.25 1.09 1.46 1.96 5 76 69 Med
20 35.14 0.26 1.12 5.16 6 282 47 Low
83 36.94 0.55 1.62 2.15 4 298 60 Low
45 28.58 1.50 0.21 1.41 5 201 65 Med
2 13.91 0.65 1.34 2.27 6 195 58 Med
73 0.00 0.99 0.09 0.92 3 133 77 Med
29 35.80 0.12 0.01 1.80 7 307 65 Low
and this is the model:
model_a <- lda(Profile ~ ., data = d_e_a.train)
When I try to plot it using the following code, I get two legends, as can be seen in the plot:
library(ggplot2)
library(ggfortify)
library(devtools)
install_github('fawda123/ggord')
library(ggord)
plota<-ggord(model_a, d_e_a.train$Profile)+
theme_classic()+
scale_fill_manual(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
theme(text = element_text(size = 20 ),
axis.line.x = element_line(color="black", size = 1),
axis.line.y = element_line(color="black", size = 1),
axis.text.x=element_text(colour="black",angle = 360,vjust = 0.6),
axis.text.y=element_text(colour="black"))
plota
I would like to get only the legend that is shown at the top.
Regards
You need to give both the fill scale and the colour scale the same name and labels, so that ggplot2 merges them into a single legend. You also need to remove the shape guide that this function adds, even though the shape of the points appears constant.
ggord(model_a, d_e_a.train$Profile)+
theme_classic()+
scale_fill_discrete(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
scale_color_discrete(name = "Profile",
labels = c("Fischer - like", "Lewis - like", "Medium"))+
theme(text = element_text(size = 20 ),
axis.line.x = element_line(color="black", size = 1),
axis.line.y = element_line(color="black", size = 1),
axis.text.x=element_text(colour="black",angle = 360,vjust = 0.6),
axis.text.y=element_text(colour="black")) +
guides(shape = guide_none())
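More generally, ggplot2 collapses guides into a single legend only when the corresponding scales match in name, breaks, and labels. A minimal sketch of that behaviour on made-up data (df and grp are hypothetical names, not the LDA output):

```r
library(ggplot2)

# Hypothetical toy data, not the LDA scores
df <- data.frame(x = rnorm(30), y = rnorm(30),
                 grp = rep(c("A", "B", "C"), each = 10))

# Both the colour and fill scales get the same name and labels,
# so ggplot2 merges them into one legend instead of two
ggplot(df, aes(x, y, colour = grp, fill = grp)) +
  geom_point(shape = 21, size = 3) +
  scale_colour_discrete(name = "Group", labels = c("a", "b", "c")) +
  scale_fill_discrete(name = "Group", labels = c("a", "b", "c"))
```

If either the name or the labels differ between the two scales, ggplot2 draws two separate legends again.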
I'm attempting to scrape the second table shown at the URL below, and I'm running into issues which may be related to the interactive nature of the table.
div_stats_standard appears to refer to the table of interest.
The code runs with no errors but returns an empty list.
library(rvest)

url <- 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
data <- url %>%
read_html() %>%
html_nodes(xpath = '//*[(@id = "div_stats_standard")]') %>%
html_table()
Can anyone tell me where I'm going wrong?
Look for the table.
library(rvest)
url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
page <- read_html(url)
nodes <- html_nodes(page, "table") # you can use Selectorgadget to identify the node
table <- html_table(nodes[[1]]) # each element of the nodes list is one table that can be extracted
head(table)
Result:
head(table)
Playing Time Playing Time Playing Time Performance Performance
1 Squad # Pl MP Starts Min Gls Ast
2 Arsenal 26 27 297 2,430 39 26
3 Aston Villa 28 27 297 2,430 33 27
4 Bournemouth 25 28 308 2,520 27 17
5 Brighton 23 28 308 2,520 28 19
6 Burnley 21 28 308 2,520 32 23
Performance Performance Performance Performance Per 90 Minutes Per 90 Minutes
1 PK PKatt CrdY CrdR Gls Ast
2 2 2 64 3 1.44 0.96
3 1 3 54 1 1.22 1.00
4 1 1 60 3 0.96 0.61
5 1 1 44 2 1.00 0.68
6 2 2 53 0 1.14 0.82
Per 90 Minutes Per 90 Minutes Per 90 Minutes Expected Expected Expected Per 90 Minutes
1 G+A G-PK G+A-PK xG npxG xA xG
2 2.41 1.37 2.33 35.0 33.5 21.3 1.30
3 2.22 1.19 2.19 30.6 28.2 22.0 1.13
4 1.57 0.93 1.54 31.2 30.5 20.8 1.12
5 1.68 0.96 1.64 33.8 33.1 22.4 1.21
6 1.96 1.07 1.89 30.9 29.4 18.9 1.10
Per 90 Minutes Per 90 Minutes Per 90 Minutes Per 90 Minutes
1 xA xG+xA npxG npxG+xA
2 0.79 2.09 1.24 2.03
3 0.81 1.95 1.04 1.86
4 0.74 1.86 1.09 1.83
5 0.80 2.01 1.18 1.98
6 0.68 1.78 1.05 1.73
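A side note on why the id-based XPath returns nothing: on fbref.com, every table after the first is embedded inside an HTML comment, so it is invisible to normal node selection; only the first (squad) table is in the live DOM. One common workaround, assuming the site still serves this structure, is to re-parse the comments themselves (an untested sketch, since it depends on the live page):

```r
library(rvest)

url <- "https://fbref.com/en/comps/9/stats/Premier-League-Stats"
page <- read_html(url)

# Tables after the first are wrapped in <!-- ... --> comments;
# pull out the comment text and re-parse it as HTML
commented <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "") %>%
  read_html()

tables <- html_table(html_nodes(commented, "table"))
length(tables)  # the player-level table should be among these
```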
I have two large data frames; one is called Dates_only and the other Values.
**Dates_only:**
ID Quart_y Quart
1 1118 2017Q3 0.25
2 1118 2017Q4 0.50
3 1118 2018Q1 0.75
4 1118 2018Q2 1.00
5 1118 2018Q3 1.25
6 1118 2018Q4 1.50
7 1118 2019Q1 1.75
8 1118 2019Q2 2.00
9 1119 2017Q3 0.25
10 1119 2017Q4 0.50
11 1119 2018Q1 0.75
12 1119 2018Q2 1.00
13 1119 2018Q3 1.25
14 1119 2018Q4 1.50
15 1119 2019Q1 1.75
16 1119 2019Q2 2.00
17 13PP 2017Q3 0.25
18 13PP 2017Q4 0.50
19 13PP 2018Q1 0.75
20 13PP 2018Q2 1.00
21 13PP 2018Q3 1.25
22 13PP 2018Q4 1.50
23 13PP 2019Q1 1.75
24 13PP 2019Q2 2.00
And the second dataset:
**Values**
ID Day Value
1 1118 0 7.6
2 1119 0 6.2
3 13PP 0 6.8
4 1118 0.14 7.1
5 1119 0.13 6.2
6 13PP 0.13 5.9
7 1118 0.20 6.8
8 1119 0.23 5.8
9 13PP 0.24 4.6
10 1118 0.27 6.5
11 1119 0.28 5.4
12 13PP 0.32 4.2
13 1118 0.32 6.3
14 1119 0.32 4.8
15 13PP 0.44 4.0
16 1118 0.47 6.0
17 1119 0.49 4.3
18 13PP 0.49 3.8
19 1118 0.59 5.9
20 1119 0.64 4.0
21 13PP 0.61 3.6
22 1118 0.72 5.6
23 1119 0.71 3.8
24 13PP 0.73 3.4
25 1118 0.95 5.4
26 1119 0.86 3.2
27 13PP 0.78 3.0
28 1118 1.10 5.0
29 1119 0.93 2.9
30 13PP 1.15 2.9
What I want to do is create another (fourth) column in Dates_only, called Value_average, containing averages of the scores in the Values$Value column of the Values data frame.
Specifically, as you can see in Dates_only, Quart_y represents quarter/year, and Quart quantifies this with a number from 0.25 to 2.
The pattern goes: Q3 - x.25, Q4 - x.50, Q1 - x.75, Q2 - x.00.
In the second data frame, Values, the Day scores represent fractions of a year. The idea is that days with 0<Day<=0.25 belong to 2017Q3, days with 0.25<Day<=0.50 belong to 2017Q4, and so on, so that days with 1.00<Day<=1.25 belong to 2018Q3.
I want, for each ID in the Dates_only data frame, to find the average of the Values$Value numbers that belong to the appropriate time frame:
For ID=1118 and 2017Q3, the Values$Day elements with 0<Day<=0.25 are (0, 0.14, 0.20) and the corresponding Values$Value entries are (7.6, 7.1, 6.8), so Dates_only$Value_average will be 7.17. The next row will average the values for days with 0.25<Day<=0.50, and so on.
**Dates_only:**
ID Quart_y Quart Value_average
1 1118 2017Q3 0.25 7.17
2 1118 2017Q4 0.50 6.27
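The bucketing rule described above is just arithmetic: round each Day up to the next multiple of 0.25, treating Day = 0 as part of the first quarter. A small base-R sketch of that rule (to_quart is a hypothetical helper name):

```r
# Map a Day score to its quarter bucket: round up to the next 0.25,
# with Day = 0 assigned to the first bucket (0.25)
to_quart <- function(day) {
  q <- ceiling(day / 0.25) * 0.25
  ifelse(day == 0, 0.25, q)
}

to_quart(c(0, 0.14, 0.20, 0.25, 0.27, 1.10))
# -> 0.25 0.25 0.25 0.25 0.50 1.25
```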
The code that I have used is:
Dates_only$Value_average <- 0
for (i in 1:length(Dates_only$ID)) {
  id <- as.character(Dates_only$ID[i])
  quart <- as.numeric(Dates_only$Quart[i])
  quart_prev <- quart - 0.25
  count_d <- 0
  sum_val <- 0
  for (k in 1:length(Values$ID)) {
    if (id == as.character(Values$ID[k])
        && quart >= as.numeric(Values$Day[k])
        && as.numeric(Values$Day[k]) > quart_prev) {
      sum_val <- as.numeric(Values$Value[k]) + sum_val
      count_d <- count_d + 1
    }
  }
  av_value <- sum_val / count_d
  Dates_only$Value_average[i] <- av_value
}
Is there more efficient code to do this on very large datasets (over 300K observations)? I am pretty sure there is, but my novice R skills are not helping.
To replicate the two dataframes:
Dates_only <- data.frame(ID=c('1118','1118','1118','1118','1118',
'1118','1118','1118','1119','1119',
'1119','1119','1119','1119','1119',
'1119','13PP','13PP','13PP','13PP',
'13PP','13PP','13PP','13PP'),
Quart_y=c('2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2',
'2017Q3','2017Q4','2018Q1','2018Q2',
'2018Q3','2018Q4','2019Q1','2019Q2'),
Quart=c(0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00,
0.25,0.50,0.75,1.00,1.25,1.50,1.75,2.00))
Values <- data.frame(ID=c('1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP',
'1118','1119','13PP','1118','1119','13PP'),
Day=c(0,0,0,0.14,0.13,0.13,0.2,0.23,0.24,0.27,0.28,
0.32,0.32,0.32,0.44,0.47,0.49,0.49,0.59,0.64,
0.61,0.72,0.71,0.73,0.95,0.86,0.78,1.1,0.93,1.15),
Value=c(7.6,6.2,6.8,7.1,6.2,5.9,6.8,5.8,4.6,6.5,5.4,
4.2,6.3,4.8,4,6,4.3,3.8,5.9,4,3.6,5.6,3.8,
3.4,5.4,3.2,3,5,2.9,2.9))
We can accomplish almost all of this using the dplyr package:
library(dplyr)
Values %>%
mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
full_join(., Dates_only, by = c("ID", "Quart")) %>%
group_by(ID, Quart, Quart_y) %>%
summarise(Value_average = mean(Value, na.rm = TRUE))
Which gives you:
ID Quart Quart_y Value_average
<fctr> <dbl> <fctr> <dbl>
1 1118 0.25 2017Q3 7.166667
2 1118 0.50 2017Q4 6.266667
3 1118 0.75 2018Q1 5.750000
4 1118 1.00 2018Q2 5.400000
5 1118 1.25 2018Q3 5.000000
6 1118 1.50 2018Q4 NaN
7 1118 1.75 2019Q1 NaN
8 1118 2.00 2019Q2 NaN
9 1119 0.25 2017Q3 6.066667
10 1119 0.50 2017Q4 4.833333
# ... with 14 more rows
Here is a line-by-line breakdown of the code, in case you have questions:
# Start with your `Values` data frame
Values %>%
# Recode `Day` that are '0.00', as they currently will be excluded from
# the rule 2017Q3: 0<Day<=0.25
# I picked 0.01 arbitrarily to fit this rule
mutate(Day = ifelse(Day == 0, 0.01, Day)) %>%
# Now round all `Day` values up to the nearest 0.25
mutate(Quart = ceiling(Day / 0.25) * 0.25) %>%
# Now join the two data frames using a `full_join`
# A left_join may also be used if you are uninterested in NA's
full_join(., Dates_only, by = c("ID", "Quart")) %>%
# Finally, designate groupings to calculate the mean values
# for each ID for each quarter
group_by(ID, Quart, Quart_y) %>%
summarise(Value_average = mean(Value, na.rm = TRUE))
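If dplyr is not an option, the same idea also runs quickly in base R: bucket each Day into its quarter, aggregate once, and merge back. A hedged sketch, wrapped in a hypothetical helper quarter_averages() (not a name from the question) so it can be applied to the replication data above:

```r
# Sketch: bucket Day into quarter buckets, average once, merge back
quarter_averages <- function(Dates_only, Values) {
  # Day = 0 is assigned to the first bucket (0.25)
  Values$Quart <- pmax(ceiling(Values$Day / 0.25) * 0.25, 0.25)
  # One pass over Values to get the per-ID, per-quarter means
  avg <- aggregate(Value ~ ID + Quart, data = Values, FUN = mean)
  names(avg)[names(avg) == "Value"] <- "Value_average"
  # Quarters with no observations get NA
  merge(Dates_only, avg, by = c("ID", "Quart"), all.x = TRUE)
}

# With the replication data from the question:
# quarter_averages(Dates_only, Values)
```

Both aggregate() and merge() are vectorized over the whole data frame, so this avoids the 300K x 30K pass of the nested loops.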
I am producing a plot with 4 facets.
I thought I would attempt to produce just a plot of one part of the data first, and then facet it.
But I am having issues setting up the plot I want. I think this is primarily because my x-axis was a factor, and there are issues I cannot get around even after converting it to numeric.
The data has a placeholder name right now, HOLD (the columns transformation and replicate are factors):
transformation replicate calibration validation difference X1 X2 X3 X4 x1min x1max x2min x2max x3min x3max x4min x4max
1 NSE 1 0.847 0.794 0.053 185.67 0.53 1063.31 1.02 100 1200 -5 3 20 300 1.1 2.9
2 NSE 2 0.758 0.760 -0.002 552.53 0.95 235.70 1.05 100 1200 -5 3 20 300 1.1 2.9
3 NSE 3 0.813 0.817 -0.004 953.37 0.65 225.88 1.01 100 1200 -5 3 20 300 1.1 2.9
4 NSE 4 0.916 0.802 0.114 1232.67 0.86 141.11 1.01 100 1200 -5 3 20 300 1.1 2.9
5 NSE 5 0.787 0.799 -0.012 888.91 1.29 239.85 0.99 100 1200 -5 3 20 300 1.1 2.9
6 NSE 6 0.846 0.760 0.086 996.63 1.93 201.67 0.95 100 1200 -5 3 20 300 1.1 2.9
7 sqrt 1 0.864 0.817 0.047 190.57 0.57 1064.22 1.00 100 1200 -5 3 20 300 1.1 2.9
8 sqrt 2 0.793 0.763 0.030 482.99 1.07 284.29 1.03 100 1200 -5 3 20 300 1.1 2.9
9 sqrt 3 0.820 0.829 -0.009 862.64 0.71 244.69 1.01 100 1200 -5 3 20 300 1.1 2.9
10 sqrt 4 0.922 0.805 0.117 1195.74 0.88 146.52 1.02 100 1200 -5 3 20 300 1.1 2.9
11 sqrt 5 0.805 0.807 -0.002 862.64 1.49 270.43 0.96 100 1200 -5 3 20 300 1.1 2.9
12 sqrt 6 0.855 0.751 0.104 915.67 2.40 248.72 0.93 100 1200 -5 3 20 300 1.1 2.9
13 log 1 0.870 0.802 0.068 192.48 0.49 1085.72 0.99 100 1200 -5 3 20 300 1.1 2.9
14 log 2 0.817 0.734 0.083 186.41 -1.19 746.40 1.03 100 1200 -5 3 20 300 1.1 2.9
15 log 3 0.808 0.812 -0.004 820.57 0.70 247.15 1.02 100 1200 -5 3 20 300 1.1 2.9
16 log 4 0.912 0.780 0.132 1224.15 0.77 130.32 1.03 100 1200 -5 3 20 300 1.1 2.9
17 log 5 0.812 0.793 0.019 828.82 1.66 298.87 0.95 100 1200 -5 3 20 300 1.1 2.9
18 log 6 0.857 0.718 0.139 787.60 2.86 296.08 0.92 100 1200 -5 3 20 300 1.1 2.9
19 inv 1 0.854 0.659 0.195 202.73 0.24 1135.53 0.98 100 1200 -5 3 20 300 1.1 2.9
20 inv 2 0.765 0.622 0.143 186.83 -0.03 689.33 0.97 100 1200 -5 3 20 300 1.1 2.9
21 inv 3 0.689 0.684 0.005 962.95 0.27 175.91 0.98 100 1200 -5 3 20 300 1.1 2.9
22 inv 4 0.867 0.670 0.197 1436.55 0.44 91.84 0.92 100 1200 -5 3 20 300 1.1 2.9
23 inv 5 0.781 0.683 0.098 743.07 1.78 364.78 0.94 100 1200 -5 3 20 300 1.1 2.9
24 inv 6 0.773 0.626 0.147 711.62 2.78 285.22 0.92 100 1200 -5 3 20 300 1.1 2.9
Code for plots:
ggplot(data = HOLD, aes(x = as.numeric(replicate))) +
geom_ribbon(aes(ymin = x1min-1, ymax = x1max+1), alpha = 0.25) +
geom_jitter(aes(y = X1, color = transformation), size = 3, width = 0.125, height = 0) +
scale_x_continuous(breaks = 1:6) +
theme(panel.grid.minor = element_blank())
The plots are essentially x = replicate and y = X#. I'm representing this using geom_jitter, with the colouring taken from the factor transformation. This all works fine.
However, I need to plot the 80% confidence interval range of these X values over the points; these are in the columns labelled with min and max. I was told geom_hline() isn't clear enough, so I opted for geom_ribbon(). I'm aware that ribbon only works with a continuous variable, so I have converted my replicate factor to numeric.
This does work, but there are gaps at the sides. I know I can get rid of them using expand, but then the jittered values will sit at the edges. Is there some way to have the ribbon go to the edges of the plot, but not the jitter? Or is there an alternative to geom_ribbon? I have added some images below to illustrate.
You can use geom_rect instead and set xmin and xmax to -Inf/Inf, but since many rectangles will be plotted on top of each other (one per row), you need to decrease alpha to keep the transparency.
ggplot(data = HOLD, aes(x = as.numeric(replicate))) +
geom_rect(aes(ymin = x1min-1, ymax = x1max+1, xmin = -Inf, xmax = Inf), alpha = 0.01) +
geom_jitter(aes(y = X1, color = transformation), size = 3, width = 0.125, height = 0) +
scale_x_continuous(breaks = 1:6) +
theme(panel.grid.minor = element_blank())
You could probably get geom_ribbon to work with some transformation of the x-axis coordinates, but the easiest way to achieve your result is geom_rect, because it understands the xmin and xmax aesthetics. Setting xmin = -Inf and xmax = Inf ensures that the rectangle spans the whole x-axis.
Since your x1min and x1max variables are equal in all rows of the dataset, you only need to draw a single rectangle, so it's better to add annotate("rect", ...) than geom_rect(...) to your plot.
So all you have to do is change the geom_ribbon line to
annotate("rect", ymin = HOLD$x1min[1]-1, ymax = HOLD$x1max[1]+1,
xmin = -Inf, xmax = Inf, alpha = .25)
Result:
I'm trying to use "stat_sum_single" with a factor variable but I get the error:
Error: could not find function "stat_sum_single"
I tried converting the factor variable to a numeric but it doesn't seem to work - any ideas?
Full code:
ggplot(sn, aes(x = person,y = X, group=Plan, colour = Plan)) +
geom_line(size=0.5) +
scale_y_continuous(limits = c(0, 1.5)) +
scale_x_discrete(breaks = c(0,50,100), labels= c(0,50,100)) +
labs(x = "X",y = "%") +
stat_sum_single(mean, geom = 'line', aes(x = as.numeric(as.character(person))), size = 3, colour = 'red')
Data:
Plan person X m mad mmad
1 1 95 0.323000 0.400303 0.12
1 2 275 0.341818 0.400303 0.12
1 3 2 0.618000 0.400303 0.12
1 4 75 0.320000 0.400303 0.12
1 5 13 0.399000 0.400303 0.12
1 6 20 0.400000 0.400303 0.12
2 7 219 0.393000 0.353350 0.45
2 8 50 0.060000 0.353350 0.45
2 9 213 0.390000 0.353350 0.45
2 15 204 0.496100 0.353350 0.45
2 19 19 0.393000 0.353350 0.45
2 24 201 0.388000 0.353350 0.45
3 30 219 0.567 0.1254 0.89
3 14 50 0.679 0.1254 0.89
3 55 213 0.1234 0.1254 0.89
3 18 204 0.6135 0.1254 0.89
3 59 19 0.39356 0.1254 0.89
3 101 201 0.300 0.1254 0.89
Person is a factor variable.
The function stat_sum_single() isn't implemented in the ggplot2 package; it has to be defined before use, as shown in the help file for stat_summary():
stat_sum_single <- function(fun, geom = "point", ...) {
  stat_summary(fun.y = fun, colour = "red", geom = geom, size = 3, ...)
}
Here is the ggplot2 CRAN reference manual:
http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
On page 185 there is an example of using stat_sum_single. I believe you need to define it yourself first, as a wrapper around stat_summary.
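For reference, the wrapper above simply forwards to stat_summary(). In ggplot2 >= 3.3.0 the fun.y argument was renamed to fun, so a present-day version of the summary layer might look roughly like this (an untested sketch against the sn data from the question):

```r
library(ggplot2)

# sn is the data frame from the question; person is a factor,
# so convert via character to recover the numeric values
ggplot(sn, aes(x = as.numeric(as.character(person)), y = X,
               group = Plan, colour = Plan)) +
  geom_line(size = 0.5) +
  # group = 1 makes the summary run across all plans at each x;
  # fun replaces the older fun.y argument
  stat_summary(aes(group = 1), fun = mean, geom = "line",
               colour = "red", size = 3)
```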