I am trying to scrape Pro Football Reference and need a function that scrapes multiple pages. My code looks functional, but I keep getting an error...
library(XML)  # provides readHTMLTable()
scrapeData = function(urlprefix, urlend, startyr, endyr) {
  master = data.frame()
  for (i in startyr:endyr) {
    cat('Loading Year', i, '\n')
    # Build the page URL for this year
    URL = paste(urlprefix, as.character(i), urlend, sep = "")
    table = readHTMLTable(URL, stringsAsFactors = FALSE)[[1]]
    table$Year = i
    master = rbind(table, master)
  }
  return(master)
}
drafts = scrapeData('http://www.pro-football-reference.com/years/', '/draft.htm', 2010, 2010)
Running it returns:
Error: failed to load external entity "http://www.pro-football-reference.com/years/2010/draft.htm"
Any advice would be helpful. Thank you.
The most likely cause is that the site redirects to HTTPS, which XML::readHTMLTable() cannot fetch on its own. Reading the page with rvest over https works:
library(tidyverse)
library(rvest)
get_football <- function(year) {
  str_c("https://www.pro-football-reference.com/years/",
        year,
        "/draft.htm") %>%
    read_html() %>%
    html_table() %>%
    pluck(1) %>%
    janitor::row_to_names(1) %>%
    janitor::clean_names() %>%
    mutate(year = year)
}
map_dfr(2010:2015, get_football)
# A tibble: 1,564 × 30
rnd pick tm player pos age to ap1 pb st w_av dr_av g cmp att
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 STL Sam B… QB 22 2018 0 0 5 44 25 83 1855 2967
2 1 2 DET Ndamu… DT 23 2022 3 5 12 100 59 196 0 0
3 1 3 TAM Geral… DT 22 2021 1 6 10 69 65 140 0 0
4 1 4 WAS Trent… T 22 2022 1 9 11 78 58 160 0 0
5 1 5 KAN Eric … DB 21 2018 3 5 5 50 50 89 0 0
6 1 6 SEA Russe… T 21 2020 0 2 9 56 31 131 0 0
7 1 7 CLE Joe H… DB 21 2021 0 3 10 62 39 158 0 0
8 1 8 OAK Rolan… LB 21 2015 0 0 5 25 15 65 0 0
9 1 9 BUF C.J. … RB 23 2017 0 1 3 34 32 90 0 0
10 1 10 JAX Tyson… DT 23 2022 0 0 7 44 33 188 0 0
# … with 1,554 more rows, and 15 more variables
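When looping over many years, one failed page should not abort the whole run. The sketch below is one way to do that in base R, under stated assumptions: draft_url(), fetch_years() and fake_fetch() are hypothetical helper names, and `fetch` stands in for whatever function actually downloads and parses a page (for example the get_football() body above). The demo uses a stand-in fetcher so it runs without network access.

```r
# Build the draft page URL for one year (same URL pattern as in the answer above)
draft_url <- function(year) {
  sprintf("https://www.pro-football-reference.com/years/%d/draft.htm",
          as.integer(year))
}

# Call `fetch` on each year's URL; skip years that raise an error.
# `pause` (seconds) is an optional delay between requests.
fetch_years <- function(years, fetch, pause = 0) {
  results <- lapply(years, function(yr) {
    out <- tryCatch(
      fetch(draft_url(yr)),
      error = function(e) {
        message("Skipping ", yr, ": ", conditionMessage(e))
        NULL
      }
    )
    Sys.sleep(pause)
    out
  })
  names(results) <- years
  # Drop the years that failed
  Filter(Negate(is.null), results)
}

# Stand-in fetcher (hypothetical): fails for 2011, returns a tiny data frame otherwise
fake_fetch <- function(url) {
  if (grepl("2011", url)) stop("simulated failure")
  data.frame(url = url, stringsAsFactors = FALSE)
}

demo <- fetch_years(2010:2012, fake_fetch)
print(names(demo))
```

The surviving list elements can then be combined with do.call(rbind, demo) or dplyr::bind_rows(demo).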
I have a big data-set with over 1000 subjects, a small piece of the data-set looks like:
mydata <- read.table(header=TRUE, text="
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")
I would like to get the row number of the observation 28 or more days prior to the last observation. E.g. for id=01, the last observation is 85; minus 28 gives 57, which is row number 2. For id=02, the last observation is 106; minus 28 gives 78, and because 78 does not exist we use the row number of 70, which is 1, i.e. the first observation for id=02 (I will be getting the row number for each observation separately).
This should work:
library(dplyr)
mydata %>%
  group_by(Id) %>%
  mutate(max = max(DAYS) - 28,  # cutoff: last observation minus 28 days
         row_number = last(which(DAYS <= max(DAYS) - 28)))
# A tibble: 10 x 6
# Groups: Id [2]
Id DAYS QS Event max row_number
<int> <int> <int> <int> <dbl> <int>
1 1 50 1 1 57 2
2 1 57 4 1 57 2
3 1 70 1 1 57 2
4 1 78 2 1 57 2
5 1 85 3 1 57 2
6 2 70 2 1 78 1
7 2 92 4 1 78 1
8 2 98 5 1 78 1
9 2 105 6 1 78 1
10 2 106 7 0 78 1
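If a subject has no observation at least 28 days before the last one, which() returns an empty vector, so that case is worth handling explicitly. Below is a base R sketch of the same per-Id lookup; the helper name row_28_before_last is hypothetical, and returning NA for the empty case is an assumption about what is wanted.

```r
mydata <- read.table(header = TRUE, text = "
Id DAYS QS Event
01 50 1 1
01 57 4 1
01 70 1 1
01 78 2 1
01 85 3 1
02 70 2 1
02 92 4 1
02 98 5 1
02 105 6 1
02 106 7 0
")

# Within-group row number of the latest observation that lies
# 28 or more days before the group's last observation
row_28_before_last <- function(days) {
  idx <- which(days <= max(days) - 28)
  if (length(idx) == 0) NA_integer_ else idx[length(idx)]
}

# One result per Id
res <- sapply(split(mydata$DAYS, mydata$Id), row_28_before_last)
print(res)
```

For the sample data this gives 2 for Id 1 and 1 for Id 2, matching the table above.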
I am looking for a method to find the associations between words in a table (or list). Each cell of the table holds several words separated by ";".
Let's say I have the table below; words such as 'af' and 'aa' belong to the same cell.
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
I want to find associations between all words in the entire dataset, between cells (I am not interested in within-cell associations). For example, how many times do aa and dd appear in the same row, or which words have the highest association (e.g. aa with bb, aa with dd, ...)?
Expected output (the numbers may be inaccurate, and the association does not have to be shown with '--'):
2 pairs association (numbers can be counts, probabilities or a normalized association)
association number of associations
aa--dd 3
aa--c 3
bb--dd 2
...
3 pairs association
aa--bb--dd 3
aa--bb--c 3
...
4 pairs association
aa--bb--c--dd 2
aa--bf--c--dd 2
...
Can you help me implement this in R?
Thanks
I am not sure if you have something like the approach below in mind. It is basically a custom function used in a nested purrr::map call. The outer call loops over the number of columns to combine (2, 3, 4); the inner call uses combn to create all possible column combinations and passes each one to the custom function to create the desired output.
library(tidyverse)
count_pairs <- function(x) {
  s <- seq(x)
  df[, x] %>%
    # split each selected column on ";" so every word gets its own row
    reduce(s, separate_rows, .init = ., sep = ";") %>%
    group_by(across()) %>%
    count() %>%
    rename(set_names(s))
}
map(2:4,
    ~ map_dfr(combn(1:4, .x, simplify = FALSE),
              count_pairs) %>% arrange(-n))
#> [[1]]
#> # A tibble: 50 x 3
#> # Groups: 1, 2 [50]
#> `1` `2` n
#> <chr> <chr> <int>
#> 1 aa dd 4
#> 2 aa bf 3
#> 3 aa c 3
#> 4 bf dd 3
#> 5 c dd 3
#> 6 aa bb 2
#> 7 af bf 2
#> 8 az bf 2
#> 9 aa cc 2
#> 10 af cc 2
#> # ... with 40 more rows
#>
#> [[2]]
#> # A tibble: 70 x 4
#> # Groups: 1, 2, 3 [70]
#> `1` `2` `3` n
#> <chr> <chr> <chr> <int>
#> 1 aa bf dd 3
#> 2 aa c dd 3
#> 3 aa bb c 2
#> 4 aa bf c 2
#> 5 aa bf cc 2
#> 6 af bf cc 2
#> 7 az bf c 2
#> 8 aa bb dd 2
#> 9 af bf dd 2
#> 10 az bf dd 2
#> # ... with 60 more rows
#>
#> [[3]]
#> # A tibble: 35 x 5
#> # Groups: 1, 2, 3, 4 [35]
#> `1` `2` `3` `4` n
#> <chr> <chr> <chr> <chr> <int>
#> 1 aa bb c dd 2
#> 2 aa bf c dd 2
#> 3 aa bf cc dd 2
#> 4 af bf cc dd 2
#> 5 az bf c dd 2
#> 6 aa bb c df 1
#> 7 aa bb cc dd 1
#> 8 aa bb cc df 1
#> 9 aa bb cd dd 1
#> 10 aa bc c dc 1
#> # ... with 25 more rows
# the data
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
Created on 2021-08-11 by the reprex package (v2.0.1)
You can try the base R code below
# one sorted character vector of words per row
lst <- apply(df, 1, function(x) sort(unlist(strsplit(x, ";"))))
res <- lapply(
  2:4,
  function(k) {
    setNames(
      data.frame(
        table(
          unlist(
            sapply(
              lst,
              # all k-word combinations within a row, joined by "--"
              function(v) combn(v, k, paste0, collapse = "--")
            )
          )
        )
      ), c("association", "count")
    )
  }
)
and you will see that
> res
[[1]]
association count
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--az 1
15 af--bb 1
16 af--bf 2
17 af--c 1
18 af--cc 2
19 af--dd 2
20 af--df 1
21 ah--al 1
22 ah--bb 1
23 ah--c 1
24 ah--cd 1
25 ah--dd 1
26 al--bb 1
27 al--c 1
28 al--cd 1
29 al--dd 1
30 az--bb 1
31 az--bc 1
32 az--bf 2
33 az--c 2
34 az--cc 1
35 az--dc 1
36 az--dd 2
37 az--df 1
38 bb--bf 1
39 bb--c 2
40 bb--cc 1
41 bb--cd 1
42 bb--dd 2
43 bb--df 1
44 bc--bf 1
45 bc--c 1
46 bc--dc 1
47 bc--dd 1
48 bf--c 2
49 bf--cc 2
50 bf--dc 1
51 bf--dd 3
52 bf--df 1
53 c--cc 1
54 c--cd 1
55 c--dc 1
56 c--dd 3
57 c--df 1
58 cc--dd 2
59 cc--df 1
60 cd--dd 1
61 dc--dd 1
62 dd--df 1
[[2]]
association count
1 aa--af--az 1
2 aa--af--bb 1
3 aa--af--bf 2
4 aa--af--c 1
5 aa--af--cc 2
6 aa--af--dd 2
7 aa--af--df 1
8 aa--ah--al 1
9 aa--ah--bb 1
10 aa--ah--c 1
11 aa--ah--cd 1
12 aa--ah--dd 1
13 aa--al--bb 1
14 aa--al--c 1
15 aa--al--cd 1
16 aa--al--dd 1
17 aa--az--bb 1
18 aa--az--bc 1
19 aa--az--bf 2
20 aa--az--c 2
21 aa--az--cc 1
22 aa--az--dc 1
23 aa--az--dd 2
24 aa--az--df 1
25 aa--bb--bf 1
26 aa--bb--c 2
27 aa--bb--cc 1
28 aa--bb--cd 1
29 aa--bb--dd 2
30 aa--bb--df 1
31 aa--bc--bf 1
32 aa--bc--c 1
33 aa--bc--dc 1
34 aa--bc--dd 1
35 aa--bf--c 2
36 aa--bf--cc 2
37 aa--bf--dc 1
38 aa--bf--dd 3
39 aa--bf--df 1
40 aa--c--cc 1
41 aa--c--cd 1
42 aa--c--dc 1
43 aa--c--dd 3
44 aa--c--df 1
45 aa--cc--dd 2
46 aa--cc--df 1
47 aa--cd--dd 1
48 aa--dc--dd 1
49 aa--dd--df 1
50 af--az--bb 1
51 af--az--bf 1
52 af--az--c 1
53 af--az--cc 1
54 af--az--dd 1
55 af--az--df 1
56 af--bb--bf 1
57 af--bb--c 1
58 af--bb--cc 1
59 af--bb--dd 1
60 af--bb--df 1
61 af--bf--c 1
62 af--bf--cc 2
63 af--bf--dd 2
64 af--bf--df 1
65 af--c--cc 1
66 af--c--dd 1
67 af--c--df 1
68 af--cc--dd 2
69 af--cc--df 1
70 af--dd--df 1
71 ah--al--bb 1
72 ah--al--c 1
73 ah--al--cd 1
74 ah--al--dd 1
75 ah--bb--c 1
76 ah--bb--cd 1
77 ah--bb--dd 1
78 ah--c--cd 1
79 ah--c--dd 1
80 ah--cd--dd 1
81 al--bb--c 1
82 al--bb--cd 1
83 al--bb--dd 1
84 al--c--cd 1
85 al--c--dd 1
86 al--cd--dd 1
87 az--bb--bf 1
88 az--bb--c 1
89 az--bb--cc 1
90 az--bb--dd 1
91 az--bb--df 1
92 az--bc--bf 1
93 az--bc--c 1
94 az--bc--dc 1
95 az--bc--dd 1
96 az--bf--c 2
97 az--bf--cc 1
98 az--bf--dc 1
99 az--bf--dd 2
100 az--bf--df 1
101 az--c--cc 1
102 az--c--dc 1
103 az--c--dd 2
104 az--c--df 1
105 az--cc--dd 1
106 az--cc--df 1
107 az--dc--dd 1
108 az--dd--df 1
109 bb--bf--c 1
110 bb--bf--cc 1
111 bb--bf--dd 1
112 bb--bf--df 1
113 bb--c--cc 1
114 bb--c--cd 1
115 bb--c--dd 2
116 bb--c--df 1
117 bb--cc--dd 1
118 bb--cc--df 1
119 bb--cd--dd 1
120 bb--dd--df 1
121 bc--bf--c 1
122 bc--bf--dc 1
123 bc--bf--dd 1
124 bc--c--dc 1
125 bc--c--dd 1
126 bc--dc--dd 1
127 bf--c--cc 1
128 bf--c--dc 1
129 bf--c--dd 2
130 bf--c--df 1
131 bf--cc--dd 2
132 bf--cc--df 1
133 bf--dc--dd 1
134 bf--dd--df 1
135 c--cc--dd 1
136 c--cc--df 1
137 c--cd--dd 1
138 c--dc--dd 1
139 c--dd--df 1
140 cc--dd--df 1
[[3]]
association count
1 aa--af--az--bb 1
2 aa--af--az--bf 1
3 aa--af--az--c 1
4 aa--af--az--cc 1
5 aa--af--az--dd 1
6 aa--af--az--df 1
7 aa--af--bb--bf 1
8 aa--af--bb--c 1
9 aa--af--bb--cc 1
10 aa--af--bb--dd 1
11 aa--af--bb--df 1
12 aa--af--bf--c 1
13 aa--af--bf--cc 2
14 aa--af--bf--dd 2
15 aa--af--bf--df 1
16 aa--af--c--cc 1
17 aa--af--c--dd 1
18 aa--af--c--df 1
19 aa--af--cc--dd 2
20 aa--af--cc--df 1
21 aa--af--dd--df 1
22 aa--ah--al--bb 1
23 aa--ah--al--c 1
24 aa--ah--al--cd 1
25 aa--ah--al--dd 1
26 aa--ah--bb--c 1
27 aa--ah--bb--cd 1
28 aa--ah--bb--dd 1
29 aa--ah--c--cd 1
30 aa--ah--c--dd 1
31 aa--ah--cd--dd 1
32 aa--al--bb--c 1
33 aa--al--bb--cd 1
34 aa--al--bb--dd 1
35 aa--al--c--cd 1
36 aa--al--c--dd 1
37 aa--al--cd--dd 1
38 aa--az--bb--bf 1
39 aa--az--bb--c 1
40 aa--az--bb--cc 1
41 aa--az--bb--dd 1
42 aa--az--bb--df 1
43 aa--az--bc--bf 1
44 aa--az--bc--c 1
45 aa--az--bc--dc 1
46 aa--az--bc--dd 1
47 aa--az--bf--c 2
48 aa--az--bf--cc 1
49 aa--az--bf--dc 1
50 aa--az--bf--dd 2
51 aa--az--bf--df 1
52 aa--az--c--cc 1
53 aa--az--c--dc 1
54 aa--az--c--dd 2
55 aa--az--c--df 1
56 aa--az--cc--dd 1
57 aa--az--cc--df 1
58 aa--az--dc--dd 1
59 aa--az--dd--df 1
60 aa--bb--bf--c 1
61 aa--bb--bf--cc 1
62 aa--bb--bf--dd 1
63 aa--bb--bf--df 1
64 aa--bb--c--cc 1
65 aa--bb--c--cd 1
66 aa--bb--c--dd 2
67 aa--bb--c--df 1
68 aa--bb--cc--dd 1
69 aa--bb--cc--df 1
70 aa--bb--cd--dd 1
71 aa--bb--dd--df 1
72 aa--bc--bf--c 1
73 aa--bc--bf--dc 1
74 aa--bc--bf--dd 1
75 aa--bc--c--dc 1
76 aa--bc--c--dd 1
77 aa--bc--dc--dd 1
78 aa--bf--c--cc 1
79 aa--bf--c--dc 1
80 aa--bf--c--dd 2
81 aa--bf--c--df 1
82 aa--bf--cc--dd 2
83 aa--bf--cc--df 1
84 aa--bf--dc--dd 1
85 aa--bf--dd--df 1
86 aa--c--cc--dd 1
87 aa--c--cc--df 1
88 aa--c--cd--dd 1
89 aa--c--dc--dd 1
90 aa--c--dd--df 1
91 aa--cc--dd--df 1
92 af--az--bb--bf 1
93 af--az--bb--c 1
94 af--az--bb--cc 1
95 af--az--bb--dd 1
96 af--az--bb--df 1
97 af--az--bf--c 1
98 af--az--bf--cc 1
99 af--az--bf--dd 1
100 af--az--bf--df 1
101 af--az--c--cc 1
102 af--az--c--dd 1
103 af--az--c--df 1
104 af--az--cc--dd 1
105 af--az--cc--df 1
106 af--az--dd--df 1
107 af--bb--bf--c 1
108 af--bb--bf--cc 1
109 af--bb--bf--dd 1
110 af--bb--bf--df 1
111 af--bb--c--cc 1
112 af--bb--c--dd 1
113 af--bb--c--df 1
114 af--bb--cc--dd 1
115 af--bb--cc--df 1
116 af--bb--dd--df 1
117 af--bf--c--cc 1
118 af--bf--c--dd 1
119 af--bf--c--df 1
120 af--bf--cc--dd 2
121 af--bf--cc--df 1
122 af--bf--dd--df 1
123 af--c--cc--dd 1
124 af--c--cc--df 1
125 af--c--dd--df 1
126 af--cc--dd--df 1
127 ah--al--bb--c 1
128 ah--al--bb--cd 1
129 ah--al--bb--dd 1
130 ah--al--c--cd 1
131 ah--al--c--dd 1
132 ah--al--cd--dd 1
133 ah--bb--c--cd 1
134 ah--bb--c--dd 1
135 ah--bb--cd--dd 1
136 ah--c--cd--dd 1
137 al--bb--c--cd 1
138 al--bb--c--dd 1
139 al--bb--cd--dd 1
140 al--c--cd--dd 1
141 az--bb--bf--c 1
142 az--bb--bf--cc 1
143 az--bb--bf--dd 1
144 az--bb--bf--df 1
145 az--bb--c--cc 1
146 az--bb--c--dd 1
147 az--bb--c--df 1
148 az--bb--cc--dd 1
149 az--bb--cc--df 1
150 az--bb--dd--df 1
151 az--bc--bf--c 1
152 az--bc--bf--dc 1
153 az--bc--bf--dd 1
154 az--bc--c--dc 1
155 az--bc--c--dd 1
156 az--bc--dc--dd 1
157 az--bf--c--cc 1
158 az--bf--c--dc 1
159 az--bf--c--dd 2
160 az--bf--c--df 1
161 az--bf--cc--dd 1
162 az--bf--cc--df 1
163 az--bf--dc--dd 1
164 az--bf--dd--df 1
165 az--c--cc--dd 1
166 az--c--cc--df 1
167 az--c--dc--dd 1
168 az--c--dd--df 1
169 az--cc--dd--df 1
170 bb--bf--c--cc 1
171 bb--bf--c--dd 1
172 bb--bf--c--df 1
173 bb--bf--cc--dd 1
174 bb--bf--cc--df 1
175 bb--bf--dd--df 1
176 bb--c--cc--dd 1
177 bb--c--cc--df 1
178 bb--c--cd--dd 1
179 bb--c--dd--df 1
180 bb--cc--dd--df 1
181 bc--bf--c--dc 1
182 bc--bf--c--dd 1
183 bc--bf--dc--dd 1
184 bc--c--dc--dd 1
185 bf--c--cc--dd 1
186 bf--c--cc--df 1
187 bf--c--dc--dd 1
188 bf--c--dd--df 1
189 bf--cc--dd--df 1
190 c--cc--dd--df 1
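To sanity-check any single count in the tables above, one can count directly how many rows contain all of a given set of words. This is a minimal base R sketch, not part of either answer; row_words and count_together are hypothetical names.

```r
df <- read.table(text = "A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd", header = TRUE, stringsAsFactors = FALSE)

# All words appearing in each row, regardless of column
row_words <- lapply(seq_len(nrow(df)), function(i) {
  unlist(strsplit(unlist(df[i, ]), ";"))
})

# Number of rows that contain every word in `words`
count_together <- function(words) {
  sum(vapply(row_words, function(w) all(words %in% w), logical(1)))
}

print(count_together(c("aa", "dd")))             # rows containing both aa and dd
print(count_together(c("aa", "bb", "c", "dd")))  # rows containing all four words
```

On the sample data these calls give 4 and 2, in line with the aa--dd and aa--bb--c--dd entries above.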
library(dplyr)  # needed for bind_rows() below
df<-read.table(text="
A B C D
af;aa;az bf;bb c;cc df;dd
aa;az bf;bc c dc;dd
ah;al;aa bb c;cd dd
af;aa bf cc dd",header=T,stringsAsFactors = F)
df<-apply(df, 2, function(x) gsub(";", " ", x))
df2<-as.data.frame(apply(df, 2,function(x) paste(x[1:4],sep = "," )))
df3<-splitstackshape::cSplit(df2, c("A", "B", "C","D"), " ")
df4<-as.data.frame(t(df3))
A<-expand.grid(df4[,1],df4[,1])
B<-expand.grid(df4[,2],df4[,2])
C<-expand.grid(df4[,3],df4[,3])
D<-expand.grid(df4[,4],df4[,4])
DF_list<-bind_rows(A,B,C,D)
DF_list$Var1 == DF_list$Var2 -> DF_list$filter
DF_list[DF_list$filter == "FALSE", ]-> DF_list
DF_list$Var1<-as.character(DF_list$Var1)
DF_list$Var2<-as.character(DF_list$Var2)
DF_list[nchar(DF_list$Var1) > 0, ]-> DF_list
DF_list[nchar(DF_list$Var2) > 0, ]-> DF_list
pair_df<-data.frame("pairs" = paste(DF_list[["Var1"]], "--", DF_list[["Var2"]], sep = ""))
pairs<-data.frame("combos"= table(pair_df$pairs))
####output###
combos.Var1 combos.Freq
1 aa--af 2
2 aa--ah 1
3 aa--al 1
4 aa--az 2
5 aa--bb 2
6 aa--bc 1
7 aa--bf 3
8 aa--c 3
9 aa--cc 2
10 aa--cd 1
11 aa--dc 1
12 aa--dd 4
13 aa--df 1
14 af--aa 2
15 af--az 1
16 af--bb 1
17 af--bf 2
18 af--c 1
19 af--cc 2
20 af--dd 2
21 af--df 1
22 ah--aa 1
23 ah--al 1
24 ah--bb 1
25 ah--c 1
26 ah--cd 1
27 ah--dd 1
28 al--aa 1
29 al--ah 1
30 al--bb 1
31 al--c 1
32 al--cd 1
33 al--dd 1
34 az--aa 2
35 az--af 1
36 az--bb 1
37 az--bc 1
38 az--bf 2
39 az--c 2
40 az--cc 1
41 az--dc 1
42 az--dd 2
43 az--df 1
44 bb--aa 2
45 bb--af 1
46 bb--ah 1
47 bb--al 1
48 bb--az 1
49 bb--bf 1
50 bb--c 2
51 bb--cc 1
52 bb--cd 1
53 bb--dd 2
54 bb--df 1
55 bc--aa 1
56 bc--az 1
57 bc--bf 1
58 bc--c 1
59 bc--dc 1
60 bc--dd 1
61 bf--aa 3
62 bf--af 2
63 bf--az 2
64 bf--bb 1
65 bf--bc 1
66 bf--c 2
67 bf--cc 2
68 bf--dc 1
69 bf--dd 3
70 bf--df 1
71 c--aa 3
72 c--af 1
73 c--ah 1
74 c--al 1
75 c--az 2
76 c--bb 2
77 c--bc 1
78 c--bf 2
79 c--cc 1
80 c--cd 1
81 c--dc 1
82 c--dd 3
83 c--df 1
84 cc--aa 2
85 cc--af 2
86 cc--az 1
87 cc--bb 1
88 cc--bf 2
89 cc--c 1
90 cc--dd 2
91 cc--df 1
92 cd--aa 1
93 cd--ah 1
94 cd--al 1
95 cd--bb 1
96 cd--c 1
97 cd--dd 1
98 dc--aa 1
99 dc--az 1
100 dc--bc 1
101 dc--bf 1
102 dc--c 1
103 dc--dd 1
104 dd--aa 4
105 dd--af 2
106 dd--ah 1
107 dd--al 1
108 dd--az 2
109 dd--bb 2
110 dd--bc 1
111 dd--bf 3
112 dd--c 3
113 dd--cc 2
114 dd--cd 1
115 dd--dc 1
116 dd--df 1
117 df--aa 1
118 df--af 1
119 df--az 1
120 df--bb 1
121 df--bf 1
122 df--c 1
123 df--cc 1
124 df--dd 1
Say I have a dataset called wage that looks like this:
wage
# A tibble: 935 x 17
wage hours iq kww educ exper tenure age married black south urban sibs brthord meduc
<int> <int> <int> <int> <int> <int> <int> <int> <fctr> <fctr> <fctr> <fctr> <int> <int> <int>
1 769 40 93 35 12 11 2 31 1 0 0 1 1 2 8
2 808 50 119 41 18 11 16 37 1 0 0 1 1 NA 14
3 825 40 108 46 14 11 9 33 1 0 0 1 1 2 14
4 650 40 96 32 12 13 7 32 1 0 0 1 4 3 12
5 562 40 74 27 11 14 5 34 1 0 0 1 10 6 6
6 1400 40 116 43 16 14 2 35 1 1 0 1 1 2 8
7 600 40 91 24 10 13 0 30 0 0 0 1 1 2 8
8 1081 40 114 50 18 8 14 38 1 0 0 1 2 3 8
9 1154 45 111 37 15 13 1 36 1 0 0 0 2 3 14
10 1000 40 95 44 12 16 16 36 1 0 0 1 1 1 12
# ... with 925 more rows, and 2 more variables: feduc <int>, lwage <dbl>
Say I then look at a simple linear regression between wage and IQ:
m_wage_iq = lm(wage ~ iq, data = wage)
m_wage_iq$coefficients
which gives me:
## (Intercept) iq
## 116.991565 8.303064
I want to check that the errors satisfy:
ε_i ~ N(0, σ^2)
How do I check this using R?
There are a number of ways you can try.
One way is shapiro.test(), applied to the model residuals, to test for normality. A p-value greater than your alpha level (typically up to 10%) means that the null hypothesis (i.e. the errors are normally distributed) cannot be rejected. However, the test is sensitive to sample size: with many observations, even negligible departures from normality can produce a small p-value, so you might want to back up the result by looking at a Q-Q plot.
You can see one by plotting the model (plot(m_wage_iq)) and looking at the second graph. If the points lie approximately on the reference line, that suggests the errors follow a normal distribution.
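Both checks can be sketched in a few lines. The wage data is not available here, so the example below simulates a stand-in data set (the simulated values and the names sim, m, res and sw are assumptions for illustration only); with the real data, replace m with m_wage_iq.

```r
set.seed(1)

# Simulated stand-in for the wage data (NOT the real data set)
sim <- data.frame(iq = rnorm(200, mean = 100, sd = 15))
sim$wage <- 117 + 8.3 * sim$iq + rnorm(200, sd = 400)

m <- lm(wage ~ iq, data = sim)
res <- residuals(m)

# Shapiro-Wilk test on the residuals; H0: residuals are normally distributed
sw <- shapiro.test(res)
print(sw$p.value)

# Normal Q-Q plot of the residuals with a reference line
qqnorm(res)
qqline(res)

# The same kind of plot straight from the model object (second diagnostic plot)
plot(m, which = 2)
```

Note that shapiro.test() only accepts between 3 and 5000 values, so for very large data sets the Q-Q plot is the more practical check.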
I am struggling to create a Sankey diagram using the package "riverplot". I did not manage to create a minimal toy example, so I have to include the riverplot object created by makeRiver() here. makeRiver() did not throw any errors, so I thought it would work, but it does not. I hope someone has an idea.
This is the riverplot object I am trying to plot:
$edges
ID N1 N2 Value
102 102 2 10 3
106 106 6 10 2
111 111 2 11 7
115 115 6 11 2
119 119 1 12 1
120 120 2 12 72
121 121 3 12 4
125 125 7 12 7
127 127 9 12 4
129 129 2 13 14
134 134 7 13 2
136 136 9 13 1
145 145 9 14 1
147 147 2 15 4
152 152 7 15 1
154 154 9 15 1
156 156 2 16 1
165 165 2 17 69
166 166 3 17 3
167 167 4 17 1
168 168 5 17 1
169 169 6 17 2
170 170 7 17 7
171 171 8 17 1
172 172 9 17 8
$nodes
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
$styles
list()
attr(,"class")
[1] "list" "riverplot"
Calling riverplot(river) ("river" being the name of the variable I saved the object in), I get the following output (sorry that the error message is in German, it says "Index(ing) out of bounds"):
[1] "calculating positions"
[1] 21.9
ID labels x
1 1 Albanisch 1
2 2 Arabisch 1
3 3 Arabisch;Englisch 1
4 4 Arabisch;Türkisch 1
5 5 Englisch;Kurdisch;Arabisch 1
6 6 Kurdisch 1
7 7 Kurdisch;Arabisch 1
8 8 Syrisch;Arabisch 1
9 9 keine 1
10 10 Arabisch 2
11 11 Arabisch;Englisch 2
12 12 Englisch 2
13 13 Englisch;Französisch 2
14 14 Englisch;Französisch;Arabisch 2
15 15 Französisch 2
16 16 Französisch;Englisch 2
17 17 keine 2
[1] "done"
[1] "drawing edges"
Fehler in styles[[id]] : Indizierung außerhalb der Grenzen
I THINK I traced the problem to the function riverplot:::getattr, but I am not sure about that. Any help?
In case anyone is interested in the solution to the problem I described above: I used numeric IDs for nodes (1, 2, 3, ...) and edges (101, 102, ...).
makeRiver() checks if IDs are duplicated among nodes and edges and throws an error if that happens. However, it does NOT check if the IDs are purely numeric, which is apparently the source of the error.
I now added an "E" at the beginning of the edge IDs (E1, E2, ...) and an "N" at the beginning of node IDs (N1, N2, ...). It works now!
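The renaming can be sketched on toy node and edge tables like this. The tables below are made up for illustration (they are not the data above), and the final makeRiver() call is left as a comment because it assumes the riverplot package is installed.

```r
# Toy tables with purely numeric IDs, mirroring the layout in the question
nodes <- data.frame(
  ID     = 1:4,
  labels = c("Arabisch", "Kurdisch", "Englisch", "keine"),
  x      = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  ID    = 101:103,
  N1    = c(1, 2, 1),
  N2    = c(3, 3, 4),
  Value = c(72, 2, 69)
)

# Prefix node IDs with "N" and edge IDs with "E" so no ID is purely numeric.
# The node references in N1/N2 must be renamed consistently with nodes$ID.
nodes$ID <- paste0("N", nodes$ID)
edges$N1 <- paste0("N", edges$N1)
edges$N2 <- paste0("N", edges$N2)
edges$ID <- paste0("E", seq_len(nrow(edges)))

# Every edge endpoint should still refer to an existing node
stopifnot(all(edges$N1 %in% nodes$ID), all(edges$N2 %in% nodes$ID))

print(nodes)
print(edges)

# With riverplot installed, the tables could then be passed on, e.g.:
# river <- riverplot::makeRiver(nodes, edges)
# riverplot::riverplot(river)
```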