Correct syntax for xpathSApply in R

I'm struggling to get the statistics table on a website into a data frame so that I can analyse it. The table can be found here:
http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/
My code so far:
library(XML)
url <- "http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/"
doc <- htmlParse(url)
xpathSApply(doc, "//tr[@*]/td/child::node()", xmlValue)
But this returns the data in an unworkable form. What is the correct xpathSApply code?

The table with the data has id='page_team_1_block_team_squad_3-table', and you can use this in an XPath. The XPath
"//table[@id='page_team_1_block_team_squad_3-table']/tbody" finds the table with that id and returns the table body. You can then use readHTMLTable with the argument header = FALSE to return the data:
library(XML)
url <- "http://nl.soccerway.com/teams/netherlands/afc-ajax/1515/squad/"
doc <- htmlParse(url)
res <- readHTMLTable(doc["//table[@id='page_team_1_block_team_squad_3-table']/tbody"][[1]], header = FALSE)
head(res)
> head(res)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 1 K. Vermeer 28 K 856 10 10 0 1 24 0 0 0 0
2 22 J. Cillessen 25 K 2204 25 24 1 0 8 0 0 0 0
3 30 M. van der Hart 20 K 0 0 0 0 0 2 0 0 0 0
4 2 R. van Rhijn 23 V 2786 32 31 1 1 1 2 3 6 0
5 3 T. Alderweireld 25 V 360 4 4 0 0 0 0 0 0 0
6 4 N. Moisander 28 V 1985 23 22 1 0 3 1 2 0 0
V17
1 0
2 0
3 0
4 1
5 0
6 0

You don't need xpathSApply. This one-liner can do it given the url:
readHTMLTable(url, header = "")[[1]]
giving:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1 1 K. Vermeer 28 K 856 10 10 0 1 24 0 0 0 0 0
2 22 J. Cillessen 25 K 2204 25 24 1 0 8 0 0 0 0 0
3 30 M. van der Hart 20 K 0 0 0 0 0 2 0 0 0 0 0
4 2 R. van Rhijn 23 V 2786 32 31 1 1 1 2 3 6 0 1
5 3 T. Alderweireld 25 V 360 4 4 0 0 0 0 0 0 0 0
6 4 N. Moisander 28 V 1985 23 22 1 0 3 1 2 0 0 0
7 6 M. van der Hoorn 21 V 166 3 2 1 1 21 0 0 0 0 0
8 12 J. Veltman 22 V 2158 25 24 1 1 2 2 2 2 0 0
9 15 N. Boilesen 22 V 1445 20 17 3 6 6 1 2 3 0 0
10 17 D. Blind 24 V 2531 29 29 0 5 3 1 1 4 0 0
11 24 S. Denswil 21 V 1350 17 15 2 1 14 1 0 1 0 0
12 27 R. Ligeon 22 V 350 5 4 1 3 8 0 1 0 0 0
13 42 J. Riedewald 17 V 222 5 3 2 3 10 2 0 1 0 0
14 44 K. Tete 18 V 0 0 0 0 0 1 0 0 0 0 0
15 5 C. Poulsen 34 M 1523 29 14 15 3 20 1 3 2 0 0
16 8 L. Duarte 23 M 655 14 6 8 2 14 3 0 1 0 0
17 8 C. Eriksen 22 M 360 4 4 0 0 0 2 3 1 0 0
18 10 S. de Jong 25 M 1257 19 16 3 8 3 7 1 1 0 0
19 18 D. Klaassen 21 M 2102 26 23 3 2 5 10 3 1 0 0
20 20 L. Schöne 28 M 2149 29 25 4 6 6 9 8 1 0 0
21 25 T. Serero 24 M 2276 29 25 4 6 6 3 3 3 0 0
22 34 L. de Sa 21 M 512 12 5 7 5 12 1 1 1 0 0
23 7 V. Fischer 20 A 1636 24 19 5 6 6 3 2 1 0 0
24 9 K. Sigþórsson 24 A 1928 30 20 10 16 11 10 2 0 0 0
25 11 Bojan 23 A 1357 24 17 7 12 11 4 3 2 0 0
26 16 L. Andersen 19 A 405 9 4 5 3 14 0 0 0 0 0
27 19 T. Sana 24 A 223 4 2 2 1 7 0 0 0 0 0
28 23 D. Hoesen 23 A 450 14 4 10 2 15 2 1 0 0 0
29 43 R. Kishna 19 A 389 8 5 3 5 5 1 2 0 0 0

Related

R: table frequencies of letters in string based on Alphabet

I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to find how many times each letter (a-z) appears in the list and then tabulate the data by word position.
For example, if my word list is: words <- c("swims", "seems", "gills", "draws", "which", "water")
then the result table should look like this:
letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...continued until z
All words are of same length (5).
What I have so far is:
library(dplyr)
alphabet <- letters[1:26]
words.df <- data.frame("Words" = words)
# the column is named "Words" (capital W), so refer to it by that name
words.df <- words.df %>% mutate("First_place" = substr(Words, 1, 1))
words.df <- words.df %>% mutate("Second_place" = substr(Words, 2, 2))
words.df <- words.df %>% mutate("Third_place" = substr(Words, 3, 3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words, 4, 4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words, 5, 5))
x1 <- words.df$First_place
x1 <- table(factor(x1,alphabet))
x2 <- words.df$Second_place
x2 <- table(factor(x2,alphabet))
x3 <- words.df$Third_place
x3 <- table(factor(x3,alphabet))
x4 <- words.df$Fourth_place
x4 <- table(factor(x4,alphabet))
x5 <- words.df$Fifth_place
x5 <- table(factor(x5,alphabet))
My code is not efficient and gives a separate table for each letter position. All help will be appreciated.
In base R, use table:
table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
pos
let 1 2 3 4 5
a 0 1 1 0 0
c 0 0 0 1 0
d 1 0 0 0 0
e 0 1 1 1 0
g 1 0 0 0 0
h 0 1 0 0 1
i 0 1 2 0 0
l 0 0 1 1 0
m 0 0 0 2 0
r 0 1 0 0 1
s 2 0 0 0 4
t 0 0 1 0 0
w 2 1 0 1 0
Note that if you need all the values from a-z then use
table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
Also to get a dataframe you could do:
d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
cbind(letters = rownames(d), as.data.frame.matrix(d))
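The asker's per-position code can also be collapsed into a single loop over positions. The following is only a sketch in base R, assuming lowercase words as in the six-word example; the names positions and freq are illustrative and not from the original posts.

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

# one column per character position, up to the longest word
positions <- seq_len(max(nchar(words)))

# for each position, count how often each of a-z occurs there;
# match() maps a letter to its index in `letters`, tabulate() counts the indices
freq <- vapply(
  positions,
  function(i) tabulate(match(substr(words, i, i), letters), nbins = 26L),
  integer(26L)
)
dimnames(freq) <- list(letter = letters, position = paste0("pos", positions))

freq["s", ]  # counts of "s" at each position
```

The result is a 26 x 5 integer matrix, so as.data.frame(freq) gives a data frame with one row per letter if that form is preferred.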
Here is a tidyverse solution using dplyr, purrr, and tidyr:
library(dplyr)
library(purrr)
library(tidyr)
strsplit(words.df$Words, "") %>%
  map_dfr(~setNames(.x, seq_along(.x))) %>%
  pivot_longer(everything(),
               values_drop_na = TRUE,
               names_to = "pos",
               values_to = "letter") %>%
  count(pos, letter) %>%
  pivot_wider(names_from = pos,
              names_glue = "pos{pos}",
              id_cols = letter,
              values_from = n,
              values_fill = 0L)
Output (evidently produced from a larger word list than the six-word example, since it has 11 positions):
letter pos1 pos2 pos3 pos4 pos5 pos6 pos7 pos8 pos9 pos10 pos11
1 a 65 127 88 38 28 17 14 5 3 0 0
2 b 58 4 7 9 2 4 2 0 1 0 0
3 c 83 14 45 37 20 19 8 3 2 0 0
4 C 2 0 0 0 0 0 0 0 0 0 0
5 d 43 8 33 47 21 22 9 3 1 1 0
6 e 45 156 81 132 114 69 48 23 14 2 2
7 f 54 11 18 10 5 2 1 0 0 0 0
8 g 23 7 27 21 15 8 7 1 0 0 0
9 h 38 56 6 28 21 10 3 3 1 1 0
10 i 25 106 51 58 38 28 8 4 1 0 0
11 j 6 0 2 2 0 0 0 0 0 0 0
12 k 9 1 6 22 12 0 0 0 0 0 0
13 l 45 41 54 54 36 9 7 6 0 2 0
14 m 45 8 31 19 8 8 4 2 0 0 0
15 n 23 42 75 53 34 41 16 16 4 2 0
16 o 28 167 76 41 38 9 11 2 1 0 0
17 p 72 20 34 30 8 3 1 1 1 0 0
18 q 7 2 1 0 0 0 0 0 0 0 0
19 r 46 74 92 59 56 45 12 9 1 1 0
20 s 119 8 67 35 31 22 18 4 1 0 0
21 t 65 30 73 83 57 42 31 9 6 3 1
22 u 12 66 39 36 20 7 7 2 0 0 0
23 v 8 7 20 12 5 5 1 0 0 0 0
24 w 53 8 13 10 2 3 0 1 0 0 0
25 y 6 4 16 15 17 15 10 5 6 1 1
26 x 0 12 5 0 0 0 0 0 0 0 0
27 z 0 0 1 0 0 0 1 1 0 0 0

Could anyone explain the error below in the R germinationmetrics package?

I would like to compute cumulative germination counts, compute germination indices, and plot FPHF curves.
My data structure is the following:
concentration temp rep Day01 Day02 Day03 Day04 Day05 Day06 Day07
1 0.0 10 1 0 0 0 0 0 0 0
2 0.5 10 1 0 0 0 0 6 6 6
3 0.3 10 1 0 0 0 0 8 8 8
4 0.1 10 1 0 0 0 0 6 6 6
5 0.0 10 2 0 0 0 0 0 0 0
6 0.5 10 2 0 0 0 0 9 9 9
7 0.3 10 2 0 0 0 0 8 8 8
8 0.1 10 2 0 0 0 0 6 6 6
9 0.0 10 3 0 0 0 0 0 0 0
10 0.5 10 3 0 0 0 0 5 5 5
11 0.3 10 3 0 0 0 0 8 8 8
12 0.1 10 3 0 0 0 0 2 2 2
13 0.0 20 1 0 0 0 0 0 7 7
14 0.5 20 1 0 0 0 0 17 17 17
15 0.3 20 1 0 0 0 0 21 21 21
16 0.1 20 1 0 0 0 0 20 20 20
17 0.0 20 2 0 0 0 0 0 7 10
18 0.5 20 2 0 0 0 0 13 13 13
19 0.3 20 2 0 0 0 0 18 18 18
20 0.1 20 2 0 0 0 0 22 22 22
21 0.0 20 3 0 0 0 0 0 14 14
22 0.5 20 3 0 0 0 0 15 15 15
23 0.3 20 3 0 0 0 0 15 15 15
24 0.1 20 3 0 0 0 0 14 14 14
25 0.0 30 1 0 0 0 0 0 0 0
26 0.5 30 1 0 0 0 0 0 0 0
27 0.3 30 1 0 0 0 0 0 0 0
28 0.1 30 1 0 0 0 0 0 0 0
29 0.0 30 2 0 0 0 0 0 0 0
30 0.5 30 2 0 0 0 0 0 0 0
31 0.3 30 2 0 0 0 0 0 0 0
32 0.1 30 2 0 0 0 0 0 0 0
33 0.0 30 3 0 0 0 0 0 0 0
34 0.5 30 3 0 0 0 0 0 0 0
35 0.3 30 3 0 0 0 0 0 0 0
36 0.1 30 3 0 0 0 0 0 0 0
Day08 Day09 Day10 Day11 Day12 Day13 Day14 Day15 Day16 Day17 Day18
1 0 0 1 1 1 1 1 1 1 1 1
2 18 18 18 18 20 20 20 20 20 20 20
3 18 18 18 18 20 20 20 20 20 20 20
4 16 16 16 16 18 18 18 19 19 19 19
5 0 0 1 1 1 1 1 1 1 1 1
6 22 22 22 22 23 23 23 23 23 23 23
7 22 22 22 22 23 23 23 23 23 23 23
8 18 18 18 18 19 19 19 19 19 19 19
9 0 0 2 2 2 4 4 4 4 4 4
10 20 20 20 20 21 21 21 21 21 21 21
11 17 17 17 17 20 20 20 20 20 20 20
12 22 22 22 22 23 23 23 23 23 23 23
13 7 7 7 7 7 7 7 7 7 7 7
14 23 23 23 23 23 23 23 23 23 23 23
15 24 24 24 24 24 24 24 24 24 24 24
16 24 24 24 24 24 24 24 24 24 24 24
17 10 10 10 10 10 10 10 10 10 10 10
18 25 25 25 25 25 25 25 25 25 25 25
19 23 23 23 23 23 23 23 23 23 23 23
20 23 23 23 23 23 23 23 23 23 23 23
21 14 14 14 14 14 14 14 14 14 14 14
22 23 23 23 23 23 23 23 23 23 23 23
23 21 21 21 21 21 21 21 21 21 21 21
24 20 20 20 20 20 20 20 20 20 20 20
25 0 0 0 0 0 0 0 0 0 0 0
26 0 0 0 0 0 0 0 0 0 0 0
27 0 0 0 0 0 0 0 0 0 0 0
28 0 0 0 0 0 0 0 0 0 0 0
29 0 0 0 0 0 0 0 0 0 0 0
30 0 0 0 0 0 0 0 0 0 0 0
31 0 0 0 0 0 0 0 0 0 0 0
32 0 0 0 0 0 0 0 0 0 0 0
33 0 0 0 0 0 0 0 0 0 0 0
34 0 0 0 0 0 0 0 0 0 0 0
35 0 0 0 0 0 0 0 0 0 0 0
36 0 0 0 0 0 0 0 0 0 0 0
Day19 Total.Seeds
1 1 25
2 20 25
3 20 25
4 19 25
5 1 25
6 23 25
7 23 25
8 19 25
9 4 25
10 21 25
11 20 25
12 23 25
13 7 25
14 23 25
15 24 25
16 24 25
17 10 25
18 25 25
19 23 25
20 23 25
21 14 25
22 23 25
23 21 25
24 20 25
25 0 25
26 0 25
27 0 25
28 0 25
29 0 25
30 0 25
31 0 25
32 0 25
33 0 25
34 0 25
35 0 25
36 0 25
I receive the following warning:
data(gcdata1)
Warning message:
In data(gcdata1) : data set ‘gcdata1’ not found
I created the variable below for counts.per.intervals:
counts.per.intervals <- c("Day01", "Day02", "Day03", "Day04", "Day05",
                          "Day06", "Day07", "Day08", "Day09", "Day10",
                          "Day11", "Day12", "Day13", "Day14", "Day15", "Day16", "Day17", "Day18", "Day19")
and the following variable for indices:
indices <- germination.indices(gcdata1, total.seeds.col = "Total.Seeds",
                               counts.intervals.cols = counts.per.intervals,
                               intervals = 1:19, partial = FALSE, max.int = 5)
I received the error below:
Error in if (nearest[2] == nearest[1]) { :
missing value where TRUE/FALSE needed
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Summing up different elements in a matrix in R

I'm trying to perform calculations on different blocks of a matrix in R. My matrix is 18x18 and I would like to get, e.g., the mean of each 6x6 array (which makes 9 arrays in total). My desired arrays would be:
A1 <- df[1:6,1:6]
A2 <- df[1:6,7:12]
A3 <- df[1:6,13:18]
B1 <- df[7:12,1:6]
B2 <- df[7:12,7:12]
B3 <- df[7:12,13:18]
C1 <- df[13:18,1:6]
C2 <- df[13:18,7:12]
C3 <- df[13:18,13:18]
The matrix looks like this:
5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
5 14 17 9 10 8 4 10 12 18 9 13 14 NA NA 19 15 10 10
10 30 32 23 27 17 28 25 12 28 29 28 26 19 25 34 24 11 17
15 16 16 16 9 17 27 17 16 30 13 18 13 15 13 19 8 7 9
20 15 12 18 18 18 6 4 6 9 11 10 10 13 11 8 10 15 15
25 7 13 21 7 3 5 2 5 5 4 3 2 3 5 2 1 5 6
30 5 9 1 7 7 4 4 12 8 9 2 0 5 2 1 0 2 6
35 3 0 2 0 0 4 4 7 4 4 5 2 0 0 1 0 0 0
40 0 4 0 0 0 1 3 9 10 10 1 0 0 0 1 0 1 0
45 0 0 0 0 0 3 10 9 17 9 1 0 0 0 0 0 0 0
50 0 0 2 0 0 0 2 8 20 0 0 0 0 0 1 0 0 0
55 0 0 0 0 0 0 7 3 21 0 0 0 0 0 0 0 0 0
60 0 0 0 0 3 4 10 2 2 0 0 1 0 0 0 0 0 0
65 0 0 0 0 0 4 8 4 8 11 0 0 0 0 0 0 0 0
70 0 0 0 0 0 6 2 5 14 0 0 0 0 0 0 0 0 0
75 0 0 0 0 0 4 0 5 9 0 0 0 0 0 0 0 0 0
80 0 0 0 0 0 4 4 0 4 2 0 0 0 0 0 0 0 0
85 0 0 0 0 0 0 0 4 1 1 0 0 0 0 0 0 0 0
90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Is there a clean way to solve this issue with a loop?
Thanks a lot in advance,
Paul
Given your matrix, e.g.
x <- matrix(1:(18*18), ncol=18)
Try the following, for example for 6x6 submatrices:
step <- 6
nx <- nrow(x)
if ((nx %% step) != 0) stop("nx %% step should be 0")
indI <- seq(1, nx, by = step)  # first index of each block
nbStep <- length(indI)
for (Row in 1:nbStep) {
  for (Col in 1:nbStep) {
    # letter = block row, digit = block column (A1, A2, A3, B1, ...)
    name <- paste0(LETTERS[Row], Col)
    theRow <- indI[Row]:(indI[Row] + step - 1)
    theCol <- indI[Col]:(indI[Col] + step - 1)
    # replace sum() with mean() to get block means instead
    assign(name, sum(x[theRow, theCol]))
  }
}
You'll get your results in A1, A2, A3...
This is the idea. Tweak the code for non-square matrices, different sizes of submatrices, ...
Here's one way:
# generate fake data
set.seed(47)
n = 18
m = matrix(rpois(n * n, lambda = 5), nrow = n)
# generate starting indices
n_array = 6
start_i = seq(1, n, by = n_array)
arr_starts = expand.grid(row = start_i, col = start_i)
# calculate sums
with(arr_starts, mapply(function(x, y) sum(m[(x + 1:n_array) - 1, (y + 1:n_array) - 1]), row, col))
# [1] 158 188 176 201 188 201 197 206 204
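Since the question asks for means and its matrix contains NA values, a loop-free variant may also be useful. This is a sketch only, using a stand-in matrix rather than the asker's data; the names block, row_block, col_block and block_means are illustrative. It labels every cell with its block row and block column and lets tapply aggregate by those labels.

```r
# stand-in 18x18 matrix with two NAs in the first row, as in the question
m <- matrix(1:(18 * 18), nrow = 18)
m[1, 13:14] <- NA

block <- 6

# block index (1, 2, 3) of every cell, by row and by column
row_block <- (row(m) - 1) %/% block + 1
col_block <- (col(m) - 1) %/% block + 1

# mean of each 6x6 block, ignoring NAs
block_means <- tapply(
  as.vector(m),
  list(as.vector(row_block), as.vector(col_block)),
  mean,
  na.rm = TRUE
)

# name the result like the question: letter = block row, digit = block column
dimnames(block_means) <- list(LETTERS[1:3], as.character(1:3))
block_means
```

Here block_means["A", "3"] corresponds to the asker's A3 <- df[1:6, 13:18]; swapping mean for sum gives the block totals from the answers above.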

Shortest path function returns a wrong path in R igraph

I use the get.shortest.paths function to find the shortest path between two vertices. However, something odd is happening. After the comment that I received, I am changing the entire question body. I produced my graph with g <- sample_smallworld(1, 20, 5, 0.1) and here is the edge list.
*Vertices 20
*Edges
1 2 0
2 3 0
3 4 0
4 5 0
5 6 0
6 7 0
7 8 0
8 9 0
9 10 0
10 11 0
11 12 0
12 13 0
13 14 0
14 15 0
6 15 0
16 17 0
17 18 0
18 19 0
19 20 0
1 20 0
1 11 0
1 19 0
1 4 0
1 18 0
1 5 0
1 17 0
6 17 0
15 16 0
2 20 0
2 4 0
2 19 0
2 5 0
2 18 0
2 9 0
2 17 0
2 13 0
3 5 0
3 20 0
3 6 0
3 19 0
3 7 0
3 18 0
3 8 0
4 6 0
4 7 0
4 20 0
4 8 0
5 19 0
4 9 0
5 7 0
5 8 0
5 9 0
5 20 0
5 10 0
6 8 0
6 9 0
6 10 0
6 11 0
7 9 0
7 10 0
7 11 0
7 12 0
1 10 0
8 11 0
1 12 0
8 13 0
9 11 0
9 12 0
9 13 0
7 14 0
12 19 0
10 13 0
10 14 0
10 15 0
11 13 0
11 14 0
11 15 0
4 16 0
12 14 0
9 15 0
12 16 0
12 17 0
13 15 0
13 16 0
13 17 0
13 18 0
14 16 0
14 17 0
14 18 0
14 19 0
15 17 0
15 18 0
15 19 0
1 15 0
16 18 0
16 19 0
9 20 0
17 19 0
17 20 0
10 18 0
The shortest path reported between 7 and 2 is:
> get.shortest.paths(g,7,2)
$vpath
$vpath[[1]]
+ 4/20 vertices, from c915453:
[1] 7 14 19 2
Here are the nodes adjacent to node 7 and node 2:
> unlist(neighborhood(g, 1, 7, mode="out"))
[1] 7 3 4 5 6 8 9 10 11 12 14
> unlist(neighborhood(g, 1, 2, mode="out"))
[1] 2 1 3 4 5 9 13 17 18 19 20
As you can see, I can go from 7 to 3 and from 3 to 2. It looks like there is a shorter path. What could I be missing?
Yes, the problem is your edge weights of zero. Looking at the help page ?shortest_paths
weights
Possibly a numeric vector giving edge weights. If this is
NULL and the graph has a weight edge attribute, then the attribute is
used. If this is NA then no weights are used (even if the graph has a
weight attribute).
Note that weights=NULL is the default, so weights will be used. Therefore the weight of the path that was returned is zero - the same as the path that you wanted to get. The weighted distance is the same. If you want to find the path with the smallest number of hops, turn off the use of the weights like this:
get.shortest.paths(g,7,2, weights=NA)$vpath
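The tie can be checked by hand without igraph. The sketch below uses only the five edges from the question's edge list that the two paths touch, and assumes (as the answer does) that the third column is the weight attribute. path_weight is a hypothetical helper written for this illustration, not an igraph function.

```r
# edges copied from the question's edge list; every weight is 0
edges <- data.frame(
  from   = c(7, 14, 2, 3, 2),
  to     = c(14, 19, 19, 7, 3),
  weight = 0
)

# sum the weights along an undirected path given as a vector of vertex ids
path_weight <- function(path, edges) {
  total <- 0
  for (i in seq_len(length(path) - 1)) {
    a <- path[i]
    b <- path[i + 1]
    hit <- (edges$from == a & edges$to == b) | (edges$from == b & edges$to == a)
    stopifnot(any(hit))  # the path must only use listed edges
    total <- total + edges$weight[hit][1]
  }
  total
}

reported <- c(7, 14, 19, 2)  # path returned by get.shortest.paths
expected <- c(7, 3, 2)       # path the asker expected

weights <- c(reported = path_weight(reported, edges),
             expected = path_weight(expected, edges))
hops <- c(reported = length(reported) - 1,
          expected = length(expected) - 1)
```

Both entries of weights are 0, so the two paths are equally short in the weighted sense, while hops differs (3 versus 2). Only when the weights are ignored does the two-hop path win.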

validating model in new dataset

I have a dataset (d) in which I am looking at the prediction of hospital mortality (0 or 1) using tn1 and 3 other variables (rms package). I have built the model and now I would like to validate it in a second dataset using the same model coefficients. The variables have the same names etc, but I don’t know how to keep the coefficients from f1, rather than letting the model generate new coefficients for the second dataset.
I would be grateful for your expertise, many thanks, Annemarie
f1 <- lrm(outcomehosp ~ I(log2((tn1+0.001))) + apscore_ad + emsurg +
corrapiidiag, data = d)
record_id corrapiidiag tn1 emsurg apscore_ad outcomehosp
7 3 0.27 1 24 1
8 9 0 1 21 0
9 7 0.11 0 22 0
11 9 0 0 13 0
12 9 13.9 0 17 0
13 22 5.02 0 37 0
21 9 9.6 0 34 0
25 9 0 0 10 0
27 9 0 0 33 1
28 25 0 0 18 1
30 9 0 0 19 0
31 9 0.16 0 26 1
32 9 0 0 13 1
34 7 0 0 18 0
35 9 0 0 20 0
36 9 3.03 0 41 1
37 9 0 0 18 0
38 9 0 0 18 0
39 9 0 0 17 0
40 9 0.14 0 23 0
41 9 0 0 10 0
42 9 0 0 8 0
43 9 2.45 0 11 0
45 9 0 1 12 0
46 9 0.16 1 17 0
49 9 0 1 22 0
50 9 0 0 15 0
51 9 0.05 1 16 0
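One common way to keep the original coefficients is to call predict on the fitted object with the second dataset as newdata, rather than refitting. The sketch below illustrates the idea with base R's glm instead of rms::lrm so that it runs without extra packages; this is a swapped-in model function, not the asker's. The data are simulated, make_data is a hypothetical helper, d2 stands for the second dataset, and corrapiidiag is left out for brevity.

```r
set.seed(1)

# simulate a dataset with the same variable names as the question (made-up values)
make_data <- function(n) {
  d <- data.frame(
    tn1        = rexp(n),
    apscore_ad = round(runif(n, 5, 40)),
    emsurg     = rbinom(n, 1, 0.2)
  )
  lp <- -3 + 0.2 * log2(d$tn1 + 0.001) + 0.08 * d$apscore_ad + 0.5 * d$emsurg
  d$outcomehosp <- rbinom(n, 1, plogis(lp))
  d
}

d  <- make_data(300)  # development data
d2 <- make_data(200)  # validation data, same column names

# fit once, on the development data only
f1 <- glm(outcomehosp ~ I(log2(tn1 + 0.001)) + apscore_ad + emsurg,
          family = binomial, data = d)
coef_before <- coef(f1)

# apply the stored coefficients to the new data; nothing is re-estimated
p2 <- predict(f1, newdata = d2, type = "response")

# one simple validation summary: mean squared error of the predicted risks
brier <- mean((p2 - d2$outcomehosp)^2)
```

The fitted object f1 is unchanged by predict, so the coefficients applied to d2 are exactly those estimated on d. With an lrm fit the same pattern should apply through that package's own predict method, but check its documentation for the argument that selects fitted probabilities.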
