Multiple tables from different pages into one data frame - r

I am trying to extract the tables from different pages into one data frame. However, I am only able to get it as a list and I am unable to convert to one table. Could you please help me out?
Code we are using so far:
Tables_recent <- lapply(paste0("http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page=",
1:50,
";template=results;type=batting"),
function(url){
url %>% read_html() %>%
html_nodes(xpath= '//*[#id="ciHomeContentlhs"]/div[3]/table[3]') %>%
html_table()
})

It's nested within a list, so you need to get out the first element, and also remove entries that are "No records available to match this query"
library(dplyr)
library(textreadr)
library(rvest)
library(dplyr)
LINK = "http://stats.espncricinfo.com/ci/engine/stats/index.html?class=3;home_or_away=1;home_or_away=2;home_or_away=3;page="
Tables_recent <- lapply(paste0(LINK, 1:50,";template=results;type=batting"), function(url){
url %>% read_html() %>%
html_nodes(xpath= '//*[#id="ciHomeContentlhs"]/div[3]/table[3]') %>%
html_table()
})
we check the number of columns for each page entry:
> sapply(Tables_recent,function(i)ncol(i[[1]]))
[1] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
[26] 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 1 1 1 1 1 1 1 1 1 1
Those with ncol == 1 are empty:
> Tables_recent[[50]]
[[1]]
X1
1 No records available to match this query
We loop through non-empty and then rbind
wh = which(sapply(Tables_recent,function(i)ncol(i[[1]]))==16)
Table = do.call(rbind,lapply(Tables_recent[wh],"[[",1))
head(Table)
Player Span Mat Inns NO Runs HS Ave BF SR 100
1 RG Sharma (INDIA) 2007-2019 101 93 14 2539 118 32.13 1843 137.76 4
2 V Kohli (INDIA) 2010-2019 72 67 18 2450 90* 50 1811 135.28 0
3 MJ Guptill (NZ) 2009-2019 83 80 7 2436 105 33.36 1810 134.58 2
4 Shoaib Malik (ICC/PAK) 2006-2019 111 104 30 2263 75 30.58 1824 124.06 0
5 BB McCullum (NZ) 2005-2015 71 70 10 2140 123 35.66 1571 136.21 2
6 DA Warner (AUS) 2009-2019 76 76 8 2079 100* 30.57 1476 140.85 1
50 0 4s 6s
1 18 6 225 115 NA
2 22 2 235 58 NA
3 15 2 215 113 NA
4 7 1 186 61 NA
5 13 3 199 91 NA
6 15 5 203 86 NA

What you have is a list. try
do.call(rbind,lapply(Tables_recent,function(x){x<-as.data.frame(x); if(length(x)>1)x}))
or you could do
do.call(rbind,Filter(function(x)length(x)>1,lapply(Tables_recent,as.data.frame)))

Related

Join data frame into one in r

I have 4 data frames that all look like this:
Product 2018
Number
Minimum
Maximum
1
56
1
5
2
42
12
16
3
6523
23
56
4
123
23
102
5
56
23
64
6
245623
56
87
7
546
25
540
8
54566
253
560
Product 2019
Number
Minimum
Maximum
1
56
32
53
2
642
423
620
3
56423
432
560
4
3
431
802
5
2
2
6
6
4523
43
68
7
555
23
54
8
55646
3
6
Product 2020
Number
Minimum
Maximum
1
23
2
5
2
342
4
16
3
223
3
5
4
13
4
12
5
2
4
7
6
223
7
8
7
5
34
50
8
46
3
6
Product 2021
Number
Minimum
Maximum
1
234
3
5
2
3242
4
16
3
2423
43
56
4
123
43
102
5
24
4
6
6
2423
4
18
7
565
234
540
8
5646
23
56
I want to join all the tables so I get a table that looks like this:
Products
Number 2021
Min-Max 2021
Number 2020
Min-Max 2020
Number 2019
Min-Max 2019
Number 2018
Min-Max 2018
1
234
3 to 5
23
2 to 5
...
...
...
...
2
3242
4 to 16
342
4 to 16
...
...
...
...
3
2423
43 to 56
223
3 to 5
...
...
...
...
4
123
43 to 102
13
4 to 12
...
...
...
...
5
24
4 to 6
2
4 to 7
...
...
...
...
6
2423
4 to 18
223
7 to 8
...
...
...
...
7
565
234 to 540
5
34 to 50
...
...
...
...
8
5646
23 to 56
46
3 to 6
...
...
...
...
The Product for all years are the same so I would like to have a data frame that contains the number for each year as a column and joins the column for minimum and maximum as one.
Any help is welcome!
How about something like this. You are trying to join several dataframes by a single column, which is relatively straight forward using full_join. The difficulty is that you are trying to extract information from the column names and combine several columns at the same time. I would map out everying you want to do and then reduce the list of dataframes at the end. Here is an example with two dataframes, but you could add as many as you want to the list at the begining.
library(tidyverse)
#test data
set.seed(23)
df1 <- tibble("Product 2018" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
set.seed(46)
df2 <- tibble("Product 2019" = seq(1:8),
Number = sample(1:100, 8),
Minimum = sample(1:100, 8),
Maximum = map_dbl(Minimum, ~sample(.x:1000, 1)))
list(df1, df2) |>
map(\(x){
year <- str_extract(colnames(x)[1], "\\d+?$")
mutate(x, !!quo_name(paste0("Min-Max ", year)) := paste(Minimum, "to", Maximum))|>
rename(!!quo_name(paste0("Number ", year)) := Number)|>
rename_with(~gsub("\\s\\d+?$", "", .), 1) |>
select(-c(Minimum, Maximum))
}) |>
reduce(full_join, by = "Product")
#> # A tibble: 8 x 5
#> Product `Number 2018` `Min-Max 2018` `Number 2019` `Min-Max 2019`
#> <int> <int> <chr> <int> <chr>
#> 1 1 29 21 to 481 50 93 to 416
#> 2 2 28 17 to 314 78 7 to 313
#> 3 3 72 40 to 787 1 91 to 205
#> 4 4 43 36 to 557 47 55 to 542
#> 5 5 45 70 to 926 52 76 to 830
#> 6 6 34 96 to 645 70 20 to 922
#> 7 7 48 31 to 197 84 6 to 716
#> 8 8 17 86 to 951 99 75 to 768
This is a similar answer, but includes bind_rows to combine the data.frames, then pivot_wider to end in a wide format.
The first steps strip the year from the Product XXXX column name, as this carries relevant information on year for that data.frame. If that column is renamed as Product they are easily combined (with a separate column containing the Year). If this step can be taken earlier in the data collection or processing timeline, it is helpful.
library(tidyverse)
list(df1, df2, df3, df4) %>%
map(~.x %>%
mutate(Year = gsub("Product", "", names(.x)[1])) %>%
rename(Product = !!names(.[1]))) %>%
bind_rows() %>%
mutate(Min_Max = paste(Minimum, Maximum, sep = " to ")) %>%
pivot_wider(id_cols = Product, names_from = Year, values_from = c(Number, Min_Max), names_vary = "slowest")
Output
Product Number_2018 Min_Max_2018 Number_2019 Min_Max_2019 Number_2020 Min_Max_2020 Number_2021 Min_Max_2021
<int> <int> <chr> <int> <chr> <int> <chr> <int> <chr>
1 1 56 1 to 5 56 32 to 53 23 2 to 5 234 3 to 5
2 2 42 12 to 16 642 423 to 620 342 4 to 16 3242 4 to 16
3 3 6523 23 to 56 56423 432 to 560 223 3 to 5 2423 43 to 56
4 4 123 23 to 102 3 431 to 802 13 4 to 12 123 43 to 102
5 5 56 23 to 64 2 2 to 6 2 4 to 7 24 4 to 6
6 6 245623 56 to 87 4523 43 to 68 223 7 to 8 2423 4 to 18
7 7 546 25 to 540 555 23 to 54 5 34 to 50 565 234 to 540
8 8 54566 253 to 560 55646 3 to 6 46 3 to 6 5646 23 to 56

Script out of bounds in R

I am using a code based on Deseq2. One of my goals is to plot a heatmap of data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
the first few line sof my counts(dds) function looks like this.
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure all the elements of the top genes are present in it?
When I try to see 20 top genes in the dataset. it looks like a list of genes
6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072" with the column head V1.
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see if the out of bounds error is removed. However it was of no use. I guess the V1 head is causing the issue.
The top genes function has been generated using the above code snippet.
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]
Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.

find max column value in r conditional on another column

I have a data frame of baseball player information:
playerID nameFirst nameLast bats throws yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
81955 rolliji01 Jimmy Rollins B R 2007 1 PHI NL 162 716 139 212 38 20 30 94 41 6 49 85 5
103358 wilsowi02 Willie Wilson B R 1980 1 KCA AL 161 705 133 230 28 15 3 49 79 10 28 81 3
93082 suzukic01 Ichiro Suzuki L R 2004 1 SEA AL 161 704 101 262 24 5 8 60 36 11 49 63 19
83973 samueju01 Juan Samuel R R 1984 1 PHI NL 160 701 105 191 36 19 15 69 72 15 28 168 2
15201 cashda01 Dave Cash R R 1975 1 PHI NL 162 699 111 213 40 3 4 57 13 6 56 34 5
75531 pierrju01 Juan Pierre L L 2006 1 CHN NL 162 699 87 204 32 13 3 40 58 20 32 38 0
HBP SH SF GIDP average
81955 7 0 6 11 0.2960894
103358 6 5 1 4 0.3262411
93082 4 2 3 6 0.3721591
83973 7 0 1 6 0.2724679
15201 4 0 7 8 0.3047210
75531 8 10 1 6 0.2918455
I want to return a maximum value of the batting average ('average') column where the at-bats ('AB') are greater than 100. There are also 'NaN' in the average column.
If you want to return the entire row for which the two conditions are TRUE, you can do something like this.
library(tidyverse)
data <- tibble(
AB = sample(seq(50, 150, 10), 10),
avg = c(runif(9), NaN)
)
data %>%
filter(AB >= 100) %>%
filter(avg == max(avg, na.rm = TRUE))
Where the first filter is to only keep rows where AB is greater than or equal to 100 and the second filter is to select the entire row where it is max. If you want to to only get the maximum value, you can do something like this:
data %>%
filter(AB >= 100) %>%
summarise(max = max(avg, na.rm = TRUE))

r - How to create multiple new columns when conditions met in original columns

I'm pretty fresh to r (like 2 days old). I have a set of data that is a time series taken every 200 msecs over a few hours. Here's the
head(dat):
Date Time MSec Sample Pat1 Pat2 Pat3
1 8/7/~ 14:34 411 0 100 13 13
2 8/7/~ 14:34 615 1 13 13 143
3 8/7/~ 14:34 814 2 13 13 13
4 8/7/~ 14:34 12 3 130 13 13
5 8/7/~ 14:34 216 4 13 13 130
6 8/7/~ 14:34 417 5 139 13 13
It goes down for 2 hours, so several thousands points and over for several hundred patients. The 13 is our baseline and what we are interested in spikes in activity over say 100. I have been trying to create a new column for each Patient column for every time a signal is over 100. I've worked out the follow code:
dat$Pat1exc <- as.numeric(dat$Pat1 >=100)
This works and gives me the new column and my data looks like below:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1exc
1 8/7/~ 14:34 411 0 100 13 13 1
2 8/7/~ 14:34 615 1 13 13 143 0
3 8/7/~ 14:34 814 2 13 13 13 0
4 8/7/~ 14:34 12 3 130 13 13 1
5 8/7/~ 14:34 216 4 13 13 130 0
6 8/7/~ 14:34 417 5 139 13 13 1
This is exactly what I want, but I don't know how to iterate through each column to create Pat2exc, Pat3exc, etc. I figured I could use sapply or vapply after I create a function. However, I can't get the function to work.
excite <- function(x, y) {y <- as.numeric(x >=100)}
excite(x=dat$Pat2, y=dat$Pat2exc)
This gives me no errors, but doesn't modify the dat data frame. Essentially, in the end I just want to sum up all the excited columns (>=100). If there is an easier way to count the samples over 100 for each patient then I'd be happy to learn how to do that as well.
Sorry if this is unclear. Thanks in advance.
P.S.: I am also looking for a good way to combine the Time and Msec columns.
Edit: Added in unabbreviated data:
Date Time Msecs
8/7/2018 14:34:07 411
8/7/2018 14:34:07 615
8/7/2018 14:34:07 814
8/7/2018 14:34:08 12
8/7/2018 14:34:08 216
8/7/2018 14:34:08 417
8/7/2018 14:34:08 619
8/7/2018 14:34:08 816
8/7/2018 14:34:09 15
We can use mutate_at from dplyr to create the binary variables and mutate + rowSums to add them all up:
library(dplyr)
df %>%
mutate_at(vars(starts_with("Pat")), funs(exc = (. >= 100)*1)) %>%
mutate(exc_total = rowSums(.[grepl('_exc', names(.))]))
Result:
Date Time MSec Sample Pat1 Pat2 Pat3 Pat1_exc Pat2_exc Pat3_exc exc_total
1 8/7/~ 14:34 411 0 100 13 13 1 0 0 1
2 8/7/~ 14:34 615 1 13 13 143 0 0 1 1
3 8/7/~ 14:34 814 2 13 13 13 0 0 0 0
4 8/7/~ 14:34 12 3 130 13 13 1 0 0 1
5 8/7/~ 14:34 216 4 13 13 130 0 0 1 1
6 8/7/~ 14:34 417 5 139 13 13 1 0 0 1

how to fix "undefined columns selected" for network meta-analysis in R?

I am conducting a network meta-analysis on R with two packages, gemtc and rjags. However, when I type
Model <- mtc.model (network, linearmodel=’fixed’).
R always returns “
Error in [.data.frame(data, sel1 | sel2, columns, drop = FALSE) :
undefined columns selected In addition: Warning messages: 1: In
mtc.model(network, linearModel = "fixed") : Likelihood can not be
inferred. Defaulting to normal. 2: In mtc.model(network, linearModel =
"fixed") : Link can not be inferred. Defaulting to identity “
How to fix this problem? Thanks!
I am attaching my codes and data here:
SAE <- read.csv(file.choose(),head=T, sep=",")
head(SAE)
network <- mtc.network(data.ab=SAE)
summary(network)
plot(network)
model.fe <- mtc.model (network, linearModel="fixed")
plot(model.fe)
summary(model.fe)
cat(model.fe$code)
model.fe$data
# run this model
result.fe <- mtc.run(model.fe, n.adapt=0, n.iter=50)
plot(result.fe)
gelman.diag(result.fe)
result.fe <- mtc.run(model.fe, n.adapt=1000, n.iter=5000)
plot(result.fe)
gelman.diag(result.fe)
following is my data: SAE
study treatment responder sample.size
1 1 3 0 76
2 1 30 2 72
3 2 3 99 1389
4 2 23 132 1383
5 3 1 6 352
6 3 30 2 178
7 4 2 6 106
8 4 30 3 95
9 5 3 49 393
10 5 25 18 198
11 6 1 20 65
12 6 22 10 26
13 7 1 1 76
14 7 30 3 76
15 8 3 7 441
16 8 26 1 220
17 9 2 1 47
18 9 30 0 41
19 10 3 10 156
20 10 30 9 150
21 11 1 4 85
22 11 25 5 85
23 11 30 4 84
24 12 3 6 152
25 12 30 5 160
26 13 18 4 158
27 13 21 8 158
28 14 1 3 110
29 14 30 2 111
30 15 3 3 83
31 15 30 1 92
32 16 1 3 124
33 16 22 6 123
34 16 30 4 125
35 17 3 236 1553
36 17 23 254 1546
37 18 6 5 398
38 18 7 6 403
39 19 1 64 588
40 19 22 73 584
How about reading the manual ?mtc.model. It clearly states the following:
Required columns [responders, sampleSize]
So your responder variable should be responders and your sample.size variable should be sampleSize.
Next, your plot(network) should help you determine that some comparisons can not be made. In your data, there are 2 subgroups of trials that were compared. Treatment 18 and 21 were not compared with any of the others. Therefore you can only do a meta-analysis of 21 and 18 or a network meta-analysis of the rest.
network <- mtc.network(data.ab=SAE[!SAE$treatment %in% c(21, 18), ])
model.fe <- mtc.model(network, linearModel="fixed")

Resources