web scraping a table with rvest - r

I am trying to extract a table from the url: http://gnomad.broadinstitute.org/variant/9-34647855-C-T
I do the following:
library(rvest)
url<-"http://gnomad.broadinstitute.org/variant/9-34647855-C-T"
frq_table <- read_html(url) %>% html_nodes("#frequency_table") %>% html_table()
I got the "#frequency_table" selector by using Inspect Element in Chrome and
copying the selector corresponding to the table. However, the table I get does not contain any values, just NAs.
frq_table
[[1]]
Population Allele Count Allele Number Number of Homozygotes Allele Frequency
1 European (Non-Finnish) NA NA NA NA
2 Ashkenazi Jewish* NA NA NA NA
3 East Asian NA NA NA NA
4 Other NA NA NA NA
5 African NA NA NA NA
6 Latino NA NA NA NA
7 South Asian NA NA NA NA
8 European (Finnish) NA NA NA NA
9 Total NA NA NA NA
I must be assigning the wrong path .... can't figure out how to extract the values.
Any help is much appreciated!

Related

Correlation analysis for more than two binary (categorical) variables in R

I have a 996x12 dataset of categorical variables, all of them dummy variables (1/0). One variable indicates whether or not they disclose on the environment, and the other eleven indicate whether or not they belong to a given sector.
I would like R to return a table with the correlation between whether they disclose and the sector they belong to. In other words, I want to compare one variable with the other eleven.
How can this be done?
I have tried the cor() function, but I get missing values (NA).
DISCL Energy Materials Industrials Consumer.discretionary Consumer.staples .......
DISCL 1
Energy NA 1
Materials NA NA 1
Industrials NA NA NA 1
Consumer.discretionary NA NA NA NA 1
Consumer.staples NA NA NA NA NA 1
Health.care NA NA NA NA NA NA
Financials NA NA NA NA NA NA
Information.technology NA NA NA NA NA NA
Communication.services NA NA NA NA NA NA
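A minimal sketch of one possible cause and workaround, assuming the NAs come from missing values in the data (the data frame `firms`, its simulated contents, and the object `discl_cor` are made up for illustration). By default cor() uses use = "everything", so a single NA in a column makes every correlation involving that column NA; use = "pairwise.complete.obs" drops the incomplete pairs instead. For two 0/1 variables the Pearson correlation is the phi coefficient.

```r
# Simulated stand-in for the real 996x12 data (hypothetical values).
set.seed(1)
n <- 200
firms <- data.frame(
  DISCL     = rbinom(n, 1, 0.5),
  Energy    = rbinom(n, 1, 0.2),
  Materials = rbinom(n, 1, 0.3)
)
firms$Energy[c(3, 10)] <- NA  # simulate a few missing entries

# Default behaviour: any NA in a column propagates to the result.
default_cor <- cor(firms$DISCL, firms$Energy)

# Correlate DISCL with every other column, ignoring incomplete pairs.
discl_cor <- cor(firms$DISCL, firms[, -1], use = "pairwise.complete.obs")
discl_cor
```

If the NAs remain, another thing to check is whether any column is constant (all 0 or all 1): cor() returns NA with a warning when a standard deviation is zero.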

Finding regional annual maximum values in R

I am working with globally gridded data of annual maximum precipitation. However, I want to isolate the maximum value over land areas only for each of my 145 years by using a mask (so 145 maximum values based on all land areas). That said, I am receiving only NA values when I apply the mask, and I cannot understand why (when the mask is not applied, the procedure below works just fine). Here is what I have done so far:
Model66 <- brick("MaxPrecNOAA-GFDLGFDL-ESM2Ghistorical.nc", var="onedaymax")
#Applying the mask to isolate land areas only:
data("wrld_simpl")
b <- wrld_simpl
land <- mask(Model66,b)
#To derive highest maximum value for each layer/year for land only (145 years = 145 maximum values)
Gmax <- sapply(unstack(land), function(r){max(values(r))})
Gmax
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
[40] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
[79] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA
[118] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
Why would this be happening? I isolated land only, and my plots correctly show that the mask worked, as only land has values on the plots for each layer/year (and the idea would be take the highest value among these for each layer/year, as I attempted to do with object "Gmax"). Again, when a mask is not applied, NAs don't show up, so I wonder if it is just a small detail causing this when using the mask?
Any help with this would be greatly appreciated!
Thanks!
Try:
Gmax <- sapply(unstack(land), function(r){max(values(r), na.rm=TRUE)})
By default, max() returns NA if any of its input values is NA, and the mask sets every non-land cell to NA. Setting na.rm=TRUE drops the NAs before the maximum is computed.
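A minimal illustration in base R, with a made-up vector `vals` standing in for the cell values of one masked layer:

```r
# Hypothetical cell values; NA marks cells removed by the mask.
vals <- c(12.5, NA, 48.1, NA, 30.2)

with_na    <- max(vals)                # NA, because NAs propagate by default
without_na <- max(vals, na.rm = TRUE)  # 48.1, NAs are dropped first

with_na
without_na
```

Note that if a layer were entirely NA, max(..., na.rm = TRUE) would return -Inf with a warning rather than NA.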

Parsing HPO obo file to extract xrefs

I need to extract information from an OBO file.
What I need is the information from the xref rows for each term id. The information inside the file looks as follows, for approximately 13,000 terms:
[Term]
id: HP:0011540
name: Congenitally corrected transposition of the great arteries
def: "The essence of the lesion is the combination of discordant atrioventricular and ventriculo-arterial connections. Thus, the morphologically right atrium is connected to a morphologically left ventricle across the mitral valve, with the left ventricle then connected to the pulmonary trunk. The morphologically left atrium is connected to the morphologically right ventricle across the tricuspid valve, with the morphologically right ventricle connected to the aorta." [DDD:dbrown, pmid:21569592]
synonym: "L-transposition" RELATED []
synonym: "Ventricular inversion" RELATED []
xref: EPCC:01.01.03
xref: ICD-10:Q20.5
xref: MSH:C535426
xref: SNOMEDCT_US:56743000
xref: SNOMEDCT_US:83799000
xref: UMLS:C0232301
xref: UMLS:C0344616
is_a: HP:0011534 ! Abnormal spatial orientation of the cardiac segments
is_a: HP:0011603 ! Congenital malformation of the great arteries
created_by: peter
creation_date: 2012-04-07T10:48:56Z
[Term]
id: HP:0011555
name: Double inlet left ventricle
def: "The condition in which both atria are joined to the left ventricle each by its own atrioventricular valve. Usually there is a hypoplastic right ventricle, which may be on the opposite side of the heart as usual." [DDD:dbrown, HPO:probinson]
xref: EPCC:01.04.04
xref: ICD-10:Q20.4
xref: SNOMEDCT_US:253283000
xref: UMLS:C0344622
is_a: HP:0001750 ! Single ventricle
is_a: HP:0011554 ! Double inlet atrioventricular connection
created_by: peter
creation_date: 2012-04-07T11:53:33Z
[Term]
id: HP:0011589
name: Common origin of the right brachiocephalic artery and left common carotid artery
def: "The left common carotid artery has a common origin with the innominate artery." [DDD:dbrown, HPO:probinson, pmid:17138027]
comment: Commonly the three great vessels (innominate artery, left common carotid artery, and the left subclavian artery) originate from the arch of the aorta. The second most common variant of aortic arch branching occurs when the left common carotid artery has a common origin with the innominate artery.
synonym: "Bovine arch" RELATED []
synonym: "Common brachiocephalic trunk" EXACT []
synonym: "Ovine arch" RELATED []
xref: SNOMEDCT_US:460890003
xref: UMLS:C3532020
xref: UMLS:C4020746
xref: UMLS:C4021141
is_a: HP:0011587 ! Abnormal branching pattern of the aortic arch
created_by: peter
creation_date: 2012-04-08T01:38:36Z
The result should look like this in txt or xlsx format:
id UMLS SNOMEDCT_US MSH EPCC ICD-10 ICD-9 ICD-O Fyler MEDDRA
HP:0011540 C0232301;C0344616 56743000;83799000 C535426 01.01.03 Q20.5
HP:0011555 C0344622 253283000 01.04.04 Q20.4
HP:0011589 C3532020;C4020746;C4021141 460890003
The headers (UMLS, SNOMEDCT_US, MSH, MEDDRA,...) are all possible xrefs.
Here is an approach with ontologyIndex and tidyverse:
library(tidyverse)
library(ontologyIndex)
hpo <- get_ontology("https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/master/hp.obo",
                    extract_tags = "everything") # Download HPO file from GitHub and import
simplify2array(hpo) %>% # Convert to array
  as_tibble() %>% # Convert to tibble
  select(id, xref) %>% # Select HPO ID and xref
  unnest(c(id, xref)) %>% # Unnest list columns
  separate(xref, into = c("Ontology", "Term"), sep = ":") %>% # Separate ontology from code
  pivot_wider(id_cols = id, names_from = "Ontology",
              values_from = Term,
              values_fn = \(x) paste(x, collapse = ";")) # Pivot wider and combine terms with paste
## A tibble: 11,652 x 22
# id UMLS MSH SNOMEDCT_US MEDDRA Fyler NCIT COHD EFO ICD10 ICD9 `ICD-10` EPCC DOID MONDO `ICD-O` MP MPATH PMID ORPHA SNOMED_CT `ICD-9`
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 HP:0000001 C0444868 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 HP:0000002 C4025901 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 HP:0000003 C3714581 D021782 204962002;82525005 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 HP:0000005 C1708511 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 HP:0000006 C0443147 NA 263681008 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 HP:0000007 C0441748;C4020899 NA 258211005 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 7 HP:0000008 C4025900 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 8 HP:0000009 C3806583 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 9 HP:0000010 C0262655 NA 197927001 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#10 HP:0000011 C0005697 D001750 397732007;398064005 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
From here you could write the results out with write.table() or write_delim().
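As a rough sketch of that last step, assuming a tab-delimited text file is wanted (the data frame `xrefs` is a small hand-made stand-in for the pivoted result, and the file name `hpo_xrefs.txt` is arbitrary):

```r
# Small stand-in for the pivoted table, using values from the question.
xrefs <- data.frame(
  id   = c("HP:0011540", "HP:0011555"),
  UMLS = c("C0232301;C0344616", "C0344622"),
  MSH  = c("C535426", NA),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file; na = "" leaves missing xrefs as empty cells.
out_file <- file.path(tempdir(), "hpo_xrefs.txt")
write.table(xrefs, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, na = "")

readLines(out_file)
```

Using na = "" matches the blank cells in the desired output; leaving it at the default would write the string NA instead.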

How do you collapse multiple rows based on multiple columns in r?

So basically I have a dataframe that kinda looks like this:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA NA NA NA
Alcan Border NA NA NA NA NA 2 NA
Alcan Border NA NA NA NA 5 NA NA
Ambler City 224 NA NA NA NA NA NA
Ambler City NA NA NA 17 NA NA NA
Is there a simple way to combine multiple rows based on the data in multiple columns? I've seen a few scripts that combine duplicates of one variable based on one or two data columns, but I need to do it on a larger scale (I have ~400 rows with duplicates and ~30 columns, and each column has a long name).
Ideally it would look like:
Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
Akutan city NA NA NA NA NA NA 71
Alcan Border NA NA 2 NA 5 2 NA
Ambler City 224 NA NA 17 NA NA NA
I'm very new at R. Thank you!
Edit - I used the following code; however, a lot of column data went missing when I collapsed it. The data in rows after the first duplicate community name disappeared (e.g. the Alcan Border values for 10-14 and 15-19 became NA). Ideas?
library(dplyr)
census8 <- census7 %>%
group_by(Community) %>%
summarise_each(funs(sum))
To keep the NAs in there the way you want you could use data.table:
library(data.table)
setDT(df)[, lapply(.SD, function(x) ifelse(all(is.na(x)), NA_integer_, sum(x, na.rm = TRUE))),
          by = Community]
# Community Pop_Total Median_Age Under_5 5-9 10-14 15-19 20-24
#1: Akutan_city NA NA NA NA NA NA 71
#2: Alcan_Border NA NA 2 NA 5 2 NA
#3: Ambler_City 224 NA NA 17 NA NA NA
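The rule used here (return NA only when every value in the group is missing, otherwise sum the non-missing values) can also be sketched in base R with aggregate(). The values in the question most likely vanished because sum() without na.rm = TRUE returns NA as soon as a group contains one NA. In this sketch the data frame `census`, the helper `sum_or_na`, and the simplified column names are assumptions made for illustration.

```r
# Cut-down version of the question's data with syntactic column names.
census <- data.frame(
  Community = c("Akutan city", "Alcan Border", "Alcan Border",
                "Alcan Border", "Ambler City", "Ambler City"),
  Pop_Total = c(NA, NA, NA, NA, 224, NA),
  Under_5   = c(NA, 2, NA, NA, NA, NA),
  Age_10_14 = c(NA, NA, NA, 5, NA, NA)
)

# NA if the whole group is missing, otherwise the sum of the known values.
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
}

# Collapse every data column by Community in one call.
collapsed <- aggregate(census[-1],
                       by = list(Community = census$Community),
                       FUN = sum_or_na)
collapsed
```

Because the helper is applied to every column except the grouping one, the same call should scale to ~30 columns without listing them by name.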

Create a new data frame

I have a data frame with only one column, which contains some names. I need to change this data frame.
I created a list with some places:
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
How can I add columns to this data frame, one for each name in the list?
Is your one-column data frame really a vector? You can convert a vector to a data.frame and add columns. I usually add the columns filled with NA and fill in the values later. Check this example:
vtr <-c(1:6)
df <- as.data.frame(vtr)
voos_inter <- c("PUJ","SCL","EZE","MVD","ASU","VVI")
df[,2:(length(voos_inter)+1)] <- NA
names(df)[2:(length(voos_inter)+1)] <- voos_inter
df
vtr PUJ SCL EZE MVD ASU VVI
1 1 NA NA NA NA NA NA
2 2 NA NA NA NA NA NA
3 3 NA NA NA NA NA NA
4 4 NA NA NA NA NA NA
5 5 NA NA NA NA NA NA
6 6 NA NA NA NA NA NA
