R xml2 : How to query only corresponding xml nodes - r

I'm trying to read and transform many XML files into R data frames (or preferably Tibbles).
Unfortunately, all the R packages I've tried (XML, flatxml, xmlconvert) failed when I tried to convert the files using their built-in functions (e.g. xmlToDataFrame from the XML package and xml_to_df from the xmlconvert package), so I have to do it manually with xml2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)
interimxml <- read_xml("<Subdivision>
<Name>Charles</Name>
<Salary>100</Salary>
<Name>Laura</Name>
<Name>Steve</Name>
<Salary>200</Salary>
</Subdivision>")
names <- xml_text(xml_find_all(interimxml ,"//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml ,"//Subdivision/Salary"))
names
salary
# combine into a tibble (doesn't work because of unequal vector lengths)
result <- tibble(names=names,
salary = salary)
result
rbind(names, salary)
From the (made-up) XML file you can see that Charles earns 100 dollars, Laura earns nothing (because of the missing entry, which is the problem) and Steve earns 200 dollars.
What I want xml2 to do when querying the name and salary nodes is to return NA (or zero, which would also be okay) whenever it finds a name but no corresponding salary entry, so that I end up with a nice table like this:
Name    Salary
Charles 100
Laura   NA
Steve   200
I know that I could modify the "xpath" to only pick up the last value (for Steve), which wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.
[I'm aware that the salary numbers are pulled as character values from the XML file. I would apply mutate(across(salary, as.double)) to the columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.

You need to be a bit more careful to match up the names and salaries. Basically, first find all the <Name> nodes, then check whether each one's next sibling is a <Salary> node. If it is not, return NA.
nameNodes <- xml_find_all(interimxml ,"//Subdivision/Name")
names <- xml_text(nameNodes)
salary <- map_chr(nameNodes, ~xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
# names salary
# <chr> <chr>
# 1 Charles 100
# 2 Laura NA
# 3 Steve 200
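As a variation on the above: xml_find_first() is documented as returning an output of the same size as its input when given a node set, with a missing node wherever nothing matches, so the map_chr() call can probably be dropped. A minimal sketch, assuming that behaviour and a reasonably recent xml2. The as.double() step is the conversion the question mentions doing afterwards.

```r
library(xml2)

interimxml <- read_xml("<Subdivision>
<Name>Charles</Name>
<Salary>100</Salary>
<Name>Laura</Name>
<Name>Steve</Name>
<Salary>200</Salary>
</Subdivision>")

nameNodes <- xml_find_all(interimxml, "//Subdivision/Name")

# For each <Name>, take the immediately following sibling,
# but only if that sibling is a <Salary> node
salaryNodes <- xml_find_first(nameNodes, "./following-sibling::*[1][self::Salary]")

result <- data.frame(
  names  = xml_text(nameNodes),
  salary = as.double(xml_text(salaryNodes)),  # missing nodes become NA
  stringsAsFactors = FALSE
)
result
```

Testing only the first following sibling is what keeps Laura from picking up Steve's salary; a plain following-sibling::Salary would return the next salary anywhere after the name.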

Related

Is there an easy way of text searching using lookup tables in R? (Version 2 - multiple word searching)

I've previously asked a very similar question which was superbly answered but I have since slightly changed the search terms to multiple words so I am posting a fresh question with updated code/example.
I have a use case where I have lots of 'lookup tables', i.e. dataframes containing strings I am searching for in rows within a large second dataframe. I need to extract rows where a string exists within the dataframe but there may be other strings in the dataframe. I also need to extract the whole row and that of the lookup table when a match is found.
I've successfully achieved what I need via a nested for loop, but my actual dataset is massive and the lookup table will be circa 50,000 rows. So a for loop is going to be very inefficient. I have had success using dplyr::semi_join but that only works when the entries match exactly, whereas I am searching for a single word in a longer string:
fruit_lookup <- data.frame(fruit=c("banana drop","apple juice","pear","plum"), rating=c(3,4,3,5))
products <- data.frame(product_code=c("535A","535B","283G","786X","765G"), product_name=c("banana drop syrup","apple juice concentrate","melon juice","coconut oil","strawberry jelly"))
results <- data.frame(product_code=NA, product_name=NA, fruit=NA, rating=NA)
for(i in 1:nrow(products)) {
  for(j in 1:nrow(fruit_lookup)){
    if(stringr::str_detect(products$product_name[i], fruit_lookup$fruit[j])) {
      results <- tibble::add_row(results)
      results$product_code[i] <- products$product_code[i]
      results$product_name[i] <- products$product_name[i]
      results$fruit[i] <- fruit_lookup$fruit[j]
      results$rating[i] <- fruit_lookup$rating[j]
      break
    }
  }
}
results <- stats::na.omit(results)
print(results)
This yields the result I am wanting:
product_code product_name fruit rating
535A banana drop syrup banana drop 3
535B apple juice concentrate apple juice 4
Any advice gratefully received and I won't be hurt if I have missed something obvious. Please feel free to critique my other coding practices, which may not be ideal!
This seems like a regex-join. Up-front, I'm not certain how well this scales with any of the offerings:
fuzzyjoin::regex_inner_join(products, fruit_lookup, by = c("product_name" = "fruit"))
# product_code product_name fruit rating
# 1 535A banana drop syrup banana drop 3
# 2 535B apple juice concentrate apple juice 4
Similarly, sqldf:
sqldf::sqldf("
select p.*, f.*
from fruit_lookup f
inner join products p on p.product_name like '%'||f.fruit||'%'
")
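For comparison, here is a rough base R sketch of the same "first lookup entry contained in the product name" logic, with no extra packages. It is only an illustration of what the join is doing, not a claim about speed. The names first_hit and keep are mine, and fixed = TRUE is an assumption that the lookup strings should be matched literally rather than as regular expressions.

```r
fruit_lookup <- data.frame(fruit = c("banana drop", "apple juice", "pear", "plum"),
                           rating = c(3, 4, 3, 5),
                           stringsAsFactors = FALSE)
products <- data.frame(product_code = c("535A", "535B", "283G", "786X", "765G"),
                       product_name = c("banana drop syrup", "apple juice concentrate",
                                        "melon juice", "coconut oil", "strawberry jelly"),
                       stringsAsFactors = FALSE)

# For each product, the index of the first lookup entry found in its name (NA if none)
first_hit <- vapply(products$product_name, function(nm) {
  hit <- which(vapply(fruit_lookup$fruit, grepl, logical(1),
                      x = nm, fixed = TRUE, USE.NAMES = FALSE))
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Keep only products with a match and attach the matching lookup row
keep <- !is.na(first_hit)
results <- cbind(products[keep, ], fruit_lookup[first_hit[keep], ])
rownames(results) <- NULL
results
```

Like the break in the original loop, this keeps only the first matching lookup entry per product, whereas the regex join returns one row per match.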

Create a new row to assign M/F to a column based on heading, referencing second table?

I am new to R (and coding in general) and am really stuck on how to approach this problem.
I have a very large data set; columns are sample ID# (~7000 samples) and rows are gene expression (~20,000 genes). Column headings are BIOPSY1-A, BIOPSY1-B, BIOPSY1-C, ..., BIOPSY200-Z. Each number (1-200) is a different patient, and each sample for that patient is a different letter (-A through -Z).
I would like to do some comparisons between samples that came from men and women. Gender is not included in this gene expression table. I have a separate file with patient numbers (BIOPSY1-200) and their gender M/F.
I would like to code something that will look at the column ID (ex: BIOPSY7-A), recognize that it includes "BIOPSY7" (but not == BIOPSY7 because there is BIOPSY7-A through BIOPSY7-Z), find "BIOPSY7" in the reference file, extrapolate M/F, and create a new row with M/F designation.
Honestly, I am so overwhelmed with coding this that I tried to open the file in Excel to manually input M/F, for the 7000 columns as it would probably be faster. However, the file is so large that Excel crashes when it opens.
Any input or resources that would put me on the right path would be extremely appreciated!!
I don't know exactly what your data looks like, so I made up my own based on your description. I'm sure you can adapt this answer to your needs and the structure of your dataset:
library(data.table)
genderfile <-data.frame("ID"=c("BIOPSY1", "BIOPSY2", "BIOPSY3", "BIOPSY4", "BIOPSY5"),"Gender"=c("F","M","M","F","M"))
#you can just read in your gender file to r with the line below
#genderfile <- read.csv("~/gender file.csv")
View(genderfile)
df<-matrix(rnorm(45, mean=10, sd=5),nrow=3)
colnames(df)<-c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY1-C", "BIOPSY2-A", "BIOPSY2-B", "BIOPSY2-C","BIOPSY3-A", "BIOPSY3-B", "BIOPSY3-C","BIOPSY4-A", "BIOPSY4-B", "BIOPSY4-C","BIOPSY5-A", "BIOPSY5-B", "BIOPSY5-C")
df<-cbind(Gene=seq(1:3),df)
df<-as.data.frame(df)
#you can just read your main df into R with the line below; fread keeps the dashes in the column names from being turned into periods (the data.table package must be installed and loaded)
#df<-fread("~/first file.csv")
View(df)
Note that the following line of code removes the dash and letter from the column names of df (I removed the first column by df[,-c(1)] because it is the Gene id):
substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2)
#[1] "BIOPSY1" "BIOPSY1" "BIOPSY1" "BIOPSY2" "BIOPSY2" "BIOPSY2" "BIOPSY3" "BIOPSY3" "BIOPSY3" "BIOPSY4" "BIOPSY4"
#[12] "BIOPSY4" "BIOPSY5" "BIOPSY5" "BIOPSY5"
Now, we are ready to match the columns of df with the ID in genderfile to get the Gender column:
Gender<-genderfile[, "Gender"][match(substr(x=names(df[,-c(1)]),start=1,stop=nchar(names(df[,-c(1)]))-2), genderfile[,"ID"])]
Gender
#[1] F F F M M M M M M F F F M M M
Last step is to add the Gender defined above as a row to the df:
df_withGender<-rbind(c("Gender", as.character(Gender)), df)
View(df_withGender)
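One caveat with the last step: binding a character row onto a numeric data frame turns every column into character. A possible alternative, sketched here with made-up sample IDs, is to keep the gender in a separate annotation table and use it to pick columns. The names sample_ids, patient, annotation and female_samples are mine, and the sub() pattern assumes every sample ID ends in a dash followed by capital letters.

```r
genderfile <- data.frame(ID = c("BIOPSY1", "BIOPSY2", "BIOPSY12"),
                         Gender = c("F", "M", "M"),
                         stringsAsFactors = FALSE)

# In practice this would be names(df)[-1], i.e. the sample columns of the expression table
sample_ids <- c("BIOPSY1-A", "BIOPSY1-B", "BIOPSY2-A", "BIOPSY12-C")

# Drop the trailing "-<letters>" suffix to recover the patient ID
patient <- sub("-[A-Z]+$", "", sample_ids)

# One row per sample: which patient it came from and that patient's gender
annotation <- data.frame(
  sample  = sample_ids,
  patient = patient,
  gender  = genderfile$Gender[match(patient, genderfile$ID)],
  stringsAsFactors = FALSE
)
annotation

# Names of the expression-table columns that belong to female patients
female_samples <- annotation$sample[annotation$gender == "F"]
female_samples
```

Something like df[, female_samples] would then select the female columns while leaving the expression values numeric.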

R: How to merge two similar rows into one within the same dataframe?

I have a dataframe structured like this
path clicks
/a/b/index.html 1000
/a/b/index.html#1 500
/a/index.html#1 250
R Code:
path <- c('/a/b/index.html','/a/b/index.html#1','/a/index.html#1')
clicks <- c(1000, 500, 250)
d.f <- data.frame(path,clicks)
The first two rows are basically the same URL path. Hence, I would like to merge these two rows into one by path adding clicks and reducing the path name of the result to simply '#1' while getting rid of the old names. The result would look something like this:
path clicks
#1 1500
/a/index.html#1 250
From what I read this can be achieved by using aggregate(), but I can't quite find a decent introduction thoroughly explaining how this function works.
Anyways, I'd be thankful if you could either provide me with a solution or point me to a beginner-friendly source to educate myself with the relevant material.
This is really what you want I think (I will explain why at the end).
path <- c('/a/b/index.html','/a/b/index.html#1','/a/index.html#1')
clicks <- c(1000, 500, 250)
d.f <- data.frame(path,clicks)
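# strip a trailing "#<digits>" fragment so both variants share one path
d.f$path <- gsub("#\\d+$", "", d.f$path)
d.f
# sum the clicks within each distinct path
aggregate(clicks ~ path, data = d.f, FUN = sum)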
Reducing the link to "#1" would be next to impossible since that would make rows 1 and 3 identical, which is not what you want. Plus I assume if you had "/a/b/index.html#2" you would want that aggregated with rows 1 and 2 and not kept separately.
The other option would be to append a "#1" to all links that do not have one and then aggregate
d.f$path[grep("html$", d.f$path)]<-paste0(d.f$path[grep("html$", d.f$path)], "#1")
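Since the question also asks how aggregate() works, here is a small self-contained sketch of its formula interface. The formula reads as "value ~ grouping column": FUN is applied to the left-hand side within each distinct value of the right-hand side. The column name base_path is my own, and stripping everything from the "#" onwards is an assumption about what counts as the same page.

```r
d.f <- data.frame(path = c("/a/b/index.html", "/a/b/index.html#1", "/a/index.html#1"),
                  clicks = c(1000, 500, 250),
                  stringsAsFactors = FALSE)

# Grouping key: the path with any "#..." fragment removed
d.f$base_path <- sub("#.*$", "", d.f$path)

# Sum `clicks` within each distinct value of `base_path`
totals <- aggregate(clicks ~ base_path, data = d.f, FUN = sum)
totals
```

Keeping the original path column and grouping on a derived one means nothing is overwritten, so a different grouping rule can be tried later.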
A possible dplyr solution with #1 and #2:
library(dplyr)
df <- data.frame(path = c("/a/b/index.html", "/a/b/index.html#1", "/a/b/index.html#2", "/a/index.html"),
                 clicks = c(1000, 500, 150, 250))
path clicks
1 /a/b/index.html 1000
2 /a/b/index.html#1 500
3 /a/b/index.html#2 150
4 /a/index.html 250
df %>%
  mutate(path_simp = gsub("#.*", "", path)) %>%
  transform(path = gsub("^[^#]*", "", path, perl = TRUE)) %>%
  group_by(path_simp) %>%
  mutate(path = ifelse(any(!path == ""), path[path != ""][length(path[path != ""])], path_simp)) %>%
  summarise(clicks = sum(clicks), path = last(path)) %>%
  select(path, clicks)
Which gives:
path clicks
<chr> <dbl>
1 #2 1650
2 /a/index.html 250
The idea is to create a new column path_simp which contains the path without any # afterwards and replace in path any path containing #number with just #number.
path_simp is used for grouping, and path is changed to only have the #number if there is one.
Summary of clicks and path are computed with sum() and last() for path.

Gene ontology (GO) analysis for a list of Genes (with ENTREZID) in R?

I am very new to GO analysis and I am a bit confused about how to do it for my list of genes.
I have a list of genes (n=10):
gene_list
SYMBOL ENTREZID GENENAME
1 AFAP1 60312 actin filament associated protein 1
2 ANAPC11 51529 anaphase promoting complex subunit 11
3 ANAPC5 51433 anaphase promoting complex subunit 5
4 ATL2 64225 atlastin GTPase 2
5 AURKA 6790 aurora kinase A
6 CCNB2 9133 cyclin B2
7 CCND2 894 cyclin D2
8 CDCA2 157313 cell division cycle associated 2
9 CDCA7 83879 cell division cycle associated 7
10 CDCA7L 55536 cell division cycle associated 7-like
I simply want to find their functions, and it has been suggested that I use GO analysis tools.
I am not sure if that is the correct way to do it.
Here is my attempt:
library(org.Hs.eg.db)
x <- org.Hs.egGO
# Get the entrez gene identifiers that are mapped to a GO ID
# (the keys must be character strings)
xx <- as.list(x[as.character(gene_list$ENTREZID)])
So I now have a list keyed by Entrez ID, in which each gene is assigned several GO terms.
For example:
> xx$`60312`
$`GO:0009966`
$`GO:0009966`$GOID
[1] "GO:0009966"
$`GO:0009966`$Evidence
[1] "IEA"
$`GO:0009966`$Ontology
[1] "BP"
$`GO:0051493`
$`GO:0051493`$GOID
[1] "GO:0051493"
$`GO:0051493`$Evidence
[1] "IEA"
$`GO:0051493`$Ontology
[1] "BP"
My question is :
How can I find the function of each of these genes in a simpler way, and am I doing this right at all?
I ask because I want to add the function to gene_list as a function/GO column.
Thanks in advance,
EDIT: There is a new Bioinformatics SE (currently in beta mode).
I hope I understand what you are aiming for here.
BTW, for bioinformatics-related topics, you can also have a look at Biostars, which serves the same purpose as SO but for bioinformatics.
If you just want a list of the functions related to each gene, you can query a database such as Ensembl through the biomaRt Bioconductor package, which is an API for querying BioMart databases.
You will need an internet connection to run the query, though.
Bioconductor provides packages for bioinformatics studies, and these packages generally come with good vignettes that walk you through the different steps of the analysis (and even highlight how you should design your data and what some of the pitfalls are).
In your case, directly from the biomaRt vignette (task 2 in particular):
Note: there are slightly quicker ways than the one I report below:
# load the library
library("biomaRt")
# I prefer Ensembl, so that is the one I will query, but you can
# query other databases; try out: listMarts()
ensembl=useMart("ensembl")
# as it seems that you are looking for human genes:
ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)
# if you want other model organisms have a look at:
#listDatasets(ensembl)
You need to create your query (your list of ENTREZ ids). To see which filters you can query:
filters = listFilters(ensembl)
Then you want to retrieve attributes: your GO ID and description. To see the list of available attributes:
attributes = listAttributes(ensembl)
For you, the query would look something like:
goids = getBM(
#you want entrezgene so you know which is what, the GO ID and
# name_1006 is actually the identifier of 'Go term name'
attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',
values=gene_list$ENTREZID,
mart=ensembl)
The query itself can take a while.
Then you can always collapse the information into two columns (but I wouldn't recommend it for anything other than reporting purposes).
Go.collapsed <- Reduce(rbind, lapply(gene_list$ENTREZID, function(x) {
  # all GO rows returned for this gene
  tempo <- goids[goids$entrezgene == x, ]
  data.frame('ENTREZGENE' = x,
             'Go.ID' = paste(tempo$go_id, collapse = ' ; '),
             'GO.term' = paste(tempo$name_1006, collapse = ' ; '))
}))
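To make the collapsing step easier to try without running the query, here is a sketch on a toy stand-in for the getBM() result. The two GO IDs for gene 60312 are the ones shown in the question; the third ID and all the term names are placeholders I made up, and collapse_by_gene is a hypothetical helper name.

```r
# Toy stand-in for the data frame returned by getBM()
# ("GO:0000000" and the term names are placeholders, not real annotations)
goids <- data.frame(
  entrezgene = c(60312, 60312, 6790),
  go_id      = c("GO:0009966", "GO:0051493", "GO:0000000"),
  name_1006  = c("term A", "term B", "term C"),
  stringsAsFactors = FALSE
)

# Paste together all values that belong to the same gene
collapse_by_gene <- function(values) {
  vapply(split(values, goids$entrezgene), paste, character(1), collapse = " ; ")
}

go_ids   <- collapse_by_gene(goids$go_id)
go_terms <- collapse_by_gene(goids$name_1006)

# One row per gene, with its GO IDs and term names collapsed into single strings
Go.collapsed <- data.frame(
  ENTREZGENE = names(go_ids),
  Go.ID      = unname(go_ids),
  GO.term    = unname(go_terms),
  stringsAsFactors = FALSE
)
Go.collapsed
```

One difference in behaviour: split() only produces rows for genes that actually appear in goids, so a gene with no GO hit is dropped here rather than kept with empty strings.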
Edit:
If you want to query a past version of the ensembl database:
ens82<-useMart(host='sep2015.archive.ensembl.org',
biomart='ENSEMBL_MART_ENSEMBL',
dataset='hsapiens_gene_ensembl')
and then the query would be:
goids = getBM(attributes=c('entrezgene','go_id', 'name_1006'),
filters='entrezgene',values=gene_list$ENTREZID,
mart=ens82)
However, if you had in mind to do a GO enrichment analysis, your list of genes is too short.

Simple lookup to insert values in an R data frame

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the existing market column in alldata, you can first drop it by selecting only the other columns with [, and then join alldata and zipcodes on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
select(-market) %>%
left_join(zipcodes, by="zip")
which, on my machine, is roughly the same performance as lookup.
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
library(lookup)
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2
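If you would rather avoid extra packages, a named vector can also serve as the lookup table. A minimal sketch, assuming each zip appears only once in zipcodes; the name market_by_zip and the unmatched zip 99999 are made up for illustration.

```r
alldata <- data.frame(Case = 1:4, zip = c(44485, 44488, 43210, 99999), market = NA)
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2),
                       zip = c(44485, 44486, 44488, 43210, 43211))

# Named vector: names are the zips, values are the markets
market_by_zip <- setNames(zipcodes$market, zipcodes$zip)

# Index by name; zips that are not in the table come back as NA
alldata$market <- unname(market_by_zip[as.character(alldata$zip)])
alldata
```

This does the same job as the match() one-liner above, only phrased as a dictionary-style lookup.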
