R: How to best extract two XML attributes from a node?

The following code extracts one attribute (or all) from an XML file:
library(xml2); library(magrittr); library(readr); library(tibble); library(knitr)
fname <- 'https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml'
fname
x <- read_xml(fname)
xpath <- "//d1:ItemDef"
itemsNames <- x %>% xml_find_all(xpath, ns = xml_ns(x)) %>% xml_attr('Name')
items <- x %>% xml_find_all(xpath, ns = xml_ns(x))
An item looks like this:
<ItemDef OID="IT.ABNORM" Name="Normal/Abnormal/Not Done" DataType="integer" Length="1" ...
Sample file can be viewed here: https://raw.githubusercontent.com/wardblonde/ODM-to-i2b2/master/odm/examples/CDISC_ODM_example_3.xml
Using pipes and xml_attr, what is the best way to extract both the Name and DataType attributes and bind them together row-wise?
Ideally it would be a single line of efficient piped code. I can extract the names and the types separately and then call data.frame(name = names, type = types), but that seems neither the best nor the most modern approach.
The result should be a tibble with columns name and data type.

library(purrr)
library(dplyr) # for select()

map(items, xml_attrs) %>%
  map_df(as.list) %>%
  select(Name, DataType)
## # A tibble: 94 × 2
## Name DataType
## <chr> <chr>
## 1 Normal/Abnormal/Not Done integer
## 2 Actions taken re study drug text
## 3 Actions taken, other text
## 4 Stop Day - Enter Two Digits 01-31 text
## 5 Derived Stop Date text
## 6 Stop Month - Enter Two Digits 01-12 text
## 7 Stop Year - Enter Four Digit Year text
## 8 Outcome text
## 9 Relationship to study drug text
## 10 Severity text
## # ... with 84 more rows
One "base" version:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, c("Name", "DataType")]) %>%
  do.call(rbind, .) %>%
  tbl_df()
NOTE: an issue with ^^ is that if Name or DataType is missing from any node, the column subsetting errors out and you're out of luck. You can mitigate that (subsetting only the attributes each node actually has, and letting the row-binder fill the rest with NA) with:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, intersect(c("Name", "DataType"), names(x)), drop = FALSE]) %>%
  data.table::rbindlist(fill = TRUE) %>%
  tbl_df()
or:
lapply(items, xml_attrs) %>%
  lapply(function(x) as.data.frame(as.list(x))[, intersect(c("Name", "DataType"), names(x)), drop = FALSE]) %>%
  bind_rows() %>%
  tbl_df()
if you don't like purrr.
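Since xml_attr() is vectorized over a whole nodeset (returning NA where an attribute is absent), a minimal sketch that skips the list juggling entirely, assuming items from the question:
library(xml2)
library(tibble)

# one value per node; missing attributes come back as NA
tibble(
  name = xml_attr(items, "Name"),
  type = xml_attr(items, "DataType")
)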

Related

extracting string patterns that repeat to a separate column or columns

Hi, I have not seen a similar solution to this problem I am having. I am trying to write a regex pattern to extract the characters following the word major within { } and place them in a major column. However, major repeats in row 2, and I need to extract and combine all characters within both { } following major. Ideally I would do this for the minor and incidental attributes as well. Not sure what I am getting wrong here. Thanks!
test <- data.frame(lith = c("major{basalt} minor{andesite} incidental{dacite rhyolite}",
                            "major {andesite flows} major {dacite flows}",
                            "major{andesite} minor{dacite}",
                            "major{basaltic andesitebasalt}"))
test %>%
  mutate(major = str_extract_all(test$lith, "[major].*[{](\\D[a-z]*)[}]") %>%
           map_chr(toString))
What I am looking for:
major minor incidental
1 basalt andesite dacite rhyolite
2 andesite flows, dacite flows <NA> <NA>
3 basaltic andesitebasalt <NA> <NA>
First, (almost) never use test$ within a dplyr pipe starting with test %>%. At best it's just a little inefficient; if there are any intermediate steps that re-order, alter, or filter the data, then the results will be either (a) an error, preferred; or (b) silently just wrong. The reason: let's say you do
test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(test$lith, ...))
In this case, the filter reduced the data from 4 rows to just 2 rows. However, since you're using test$lith, that's taken from the contents of test before the pipe started, so here test$lith is length-4 where we need it to be length-2.
Alternatively (and preferred),
test %>%
  filter(grepl("[wy]", lith)) %>%
  mutate(major = str_extract_all(lith, ...))
Here, the str_extract_all(lith, ...) sees only two values, not the original four.
On to the regularly-scheduled answer ...
I'll add a row-number column rn as a reference to the original rows (an id tying each result back to its source). This is both functional (the grouping below needs it) and useful in case you need to tie things back to the original data somehow. I'm inferring that you want the values grouped together as strings instead of list-columns, though it's easy enough to switch to the latter if desired (see the sketch after the output below).
library(dplyr)
library(stringr) # str_extract_all
library(tidyr) # unnest, pivot_wider
test %>%
  mutate(
    rn = row_number(),
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ = "", val = "")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  group_by(rn, categ) %>%
  summarize(val = paste(val, collapse = ", ")) %>%
  pivot_wider(rn, names_from = "categ", values_from = "val") %>%
  ungroup()
# # A tibble: 4 x 4
# rn incidental major minor
# <int> <chr> <chr> <chr>
# 1 1 dacite rhyolite basalt andesite
# 2 2 NA andesite flows, dacite flows NA
# 3 3 NA andesite dacite
# 4 4 NA basaltic andesitebasalt NA
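As promised above, the list-column variant is a one-line change: collect the values with list() instead of pasting them together. A minimal sketch of the same pipeline:
test %>%
  mutate(
    rn = row_number(),
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ = "", val = "")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  group_by(rn, categ) %>%
  summarize(val = list(val)) %>%    # list-column instead of a pasted string
  pivot_wider(rn, names_from = "categ", values_from = "val") %>%
  ungroup()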

Count unique words in a string using dplyr (R)

Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
Count how many times each word appears
Return a list of words that appear at least two times
The final result should be a list featuring both "the" and "home", in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. Was attempting to use tidytext as well but have been struggling.
We can split the string by space, get the table, and subset based on frequency:
out <- table(strsplit(string, "\\s+")[[1]])
out[out >=2]
home the
2 3
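If the order should follow frequency ("the" before "home", as requested), sorting the subset does it; a small addition, assuming out from above:
sort(out[out >= 2], decreasing = TRUE)
##  the home
##    3    2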
Yet another possible solution:
library(tidyverse)
data.frame(x = str_split(string, "\\s+", simplify = TRUE) %>% t) %>%
  add_count(x) %>%
  filter(n >= 2) %>%
  distinct %>%
  pull(x)
#> [1] "the" "home"
library(tidyverse)
data.frame(string) %>%
  separate_rows(string) %>%
  count(string, sort = TRUE) %>%
  filter(n >= 2)
Result
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2
Here's an approach using quanteda that prints "the" before "home" as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"
Here is another option using tidytext and tidyverse: first split out each word (unnest_tokens), then count each word and sort by frequency, keep only words that have at least two observations, and finally use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE) %>%
  filter(n >= 2) %>%
  deframe()
Output
the home
3 2
Or if you want to keep it as a data frame, just skip the last step with deframe.

How do I find the most common words in a character vector in R?

I am analysing some fMRI data – in particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
  result = content(GET(neurosynth_link), "parsed")$data
  names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
  df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
  df$z_score = as.numeric(df$z_score)
  df = df[order(-df$z_score), ]
  df = df[-which(df$z_score < 3), ]
  df = na.omit(df)
  return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords come up most often, and ideally to construct a list of the most common words. I tried the following:
sort(table(RO4$Name), decreasing = TRUE)
But this clearly won't work. The problem is that the names (for example: "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and record individual words like that. Any ideas?
using packages {jsonlite}, {dplyr} and {tidyr} (for separate_rows), plus the pipe operator %>% for legibility:
store response as dataframe df
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
reshape and aggregate
df %>%
  ## keep first column only and name it 'keywords':
  select('keywords' = 1) %>%
  ## split multiple cell values (separated by a blank)
  ## into separate rows:
  separate_rows(keywords, sep = " ") %>%
  group_by(keywords) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
result:
# A tibble: 965 x 2
keywords count
<chr> <int>
1 cortex 53
2 gyrus 26
3 temporal 26
4 parietal 23
5 task 22
6 anterior 19
7 frontal 18
8 visual 17
9 memory 16
10 motor 16
# ... with 955 more rows
edit: or, if you want to proceed from your dataframe
RO4 %>%
  select(Name) %>%
  ## select(everything())
  ## select(Name:func_con)
  separate_rows(Name, sep = ' ') %>%
  ## do remaining stuff
You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated as many times as rows are needed to accommodate any multivalue in column "Name", so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores (the original function keeps z_score >= 3 and drops NAs) would read like this:
RO4 %>%
  filter(!is.na(z_score), z_score >= 3) %>%
  arrange(desc(z_score))
Not sure I understand. Can't you proceed like this:
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"

Using R to determine author position in journals?

I'm looking to analyze the order of authors in academic papers, and have a dataset of journals, authors, publication titles, publication dates, etc. that I'm working with. The data comes with each publication title as a row, and the author(s) of the piece listed in a semi-colon-delimited list. For example:
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018
I've come up with a pretty inefficient way to determine author order, but I'm wondering about suggestions to improve this. Right now, the general workflow looks like this:
data_name_listed <- readxl::read_xlsx("data-raw/data.xlsx")
data_name_listed <- data_name_listed %>%
  rename(author = "Author",
         title = "Title",
         pubtitle = "Publication Title",
         publisher = "Publisher",
         date = "Date")
# Select just the author column
data_name_order <- data_name_listed %>% select(author)
data_name_order$author <- str_trim(data_name_order$author)
# Separate lists of names into columns according to the order they appear in the comma-separated list
# This is really inelegant.
data_name_order <- data_name_order %>%
separate(col = author, into = c("1","2","3","4","5","6","7","8","9","10","11",
"12","13","14","15", "16","17","18","19","20",
"21","22","23","24","25","26","27","28","29",
"30","31","32","33","34","35"), sep = ";")
# Gather the data into a tidy df
data_name_order <- data_name_order %>%
gather(position, name)
# Strip trailing single-letter initials (e.g. "Smith J.") from names
data_name_order$name <- gsub("(.*)\\s+[A-Z]\\.?$", "\\1", data_name_order$name)
# Get rid of missing data
data_name_order <- data_name_order %>% drop_na()
# Convert position number to numeric
data_name_order$position <- as.numeric(data_name_order$position)
# Ensure no whitespace
data_name_order$name <- str_trim(data_name_order$name)
# Then merge this data with tidy journal data
# ... code ...
In particular, the separate() function is particularly messy, even though it seems to achieve what I hoped it would. I'd love any advice to make this a bit more clean and more reproducible/applicable to other datasets. Thanks!
Here's a suggestion without separate:
library(dplyr)
library(tidyr)
x %>%
  select(authors) %>%
  transmute(
    id = row_number(),
    author = strsplit(authors, ";")
  ) %>%
  unnest() %>%
  group_by(id) %>%
  mutate(
    position = row_number(),
    author = trimws(author)
  ) %>%
  ungroup()
# # A tibble: 10 x 3
# id author position
# <int> <chr> <int>
# 1 1 Name 1 1
# 2 1 Name 2 2
# 3 1 Name 3 3
# 4 2 Name 1 1
# 5 2 Name 2 2
# 6 3 Name 1 1
# 7 3 Name 2 2
# 8 3 Name 3 3
# 9 3 Name 4 4
# 10 3 Name 5 5
The id column gives each publication a key: the group_by(id) is what makes row_number() produce within-publication positions, and it also (for your case) serves as a way to re-merge authors back with the original data. If there is a better column that uniquely identifies each row/publication, use that instead. If you have no better fields, then it might be better to add the id before you start this process, to ensure the original data and this lengthened data have identical ids, perhaps with:
x <- mutate(x, id = row_number())
# or with base
x$id <- seq_len(nrow(x))
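For instance, the re-merge could look like this (a sketch; positions is a hypothetical name for the id/author/position frame produced by the pipeline above):
# `positions` (hypothetical name) holds the id/author/position result above
x <- mutate(x, id = row_number())
left_join(x, positions, by = "id")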
Data:
x <- read.csv(header=TRUE, stringsAsFactors=FALSE, text="
authors, pubtitle, title, date
Name 1; Name 2; Name 3, Journal Title, Article Title, 2018
Name 1; Name 2, Journal Title, Article Title, 2019
Name 1; Name 2; Name 3; Name 4; Name 5, Journal Title, Article Title, 2018")

Can I combine pairwise_cor and pairwise_count to get the phi coefficient AND number of occurrences for each pair of words?

I'm new to R, and I'm using widyr to do text mining. I successfully used the methods found here to get a list of co-occurring words within each section of text and their phi coefficient.
Code as follows:
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)
I understand that I can also generate a data frame with co-occurring words and the number of times they appear, using code like:
word_pairs <- review_words %>%
  pairwise_count(word, title, sort = TRUE)
What I need is a table that has both the phi coefficient and the number of occurrences for each pair of words. I've been digging into pairwise_cor and pairwise_count but still can't figure out how to combine them. If I understand correctly, joins only take one column into account for matching, so I couldn't use a regular join reliably since there may be multiple pairs that have the same word in the item1 column.
Is this possible using widyr? If not, is there another package that will allow me to do this?
Here is the full code:
#Load packages
pacman::p_load(XML, dplyr, stringr, rvest, httr, xml2, tidytext, tidyverse, widyr)

#Load source material
prod_reviews_df <- read_csv("SOURCE SPREADSHEET.csv")

#Split into one word per row
review_words <- prod_reviews_df %>%
  unnest_tokens(word, comments, token = "words", format = "text", drop = FALSE) %>%
  anti_join(stop_words, by = c("word" = "word"))

#Find phi coefficient
word_cors <- review_words %>%
  group_by(word) %>%
  pairwise_cor(word, title, sort = TRUE) %>%
  filter(correlation > .15)

#Write data to CSV
write.csv(word_cors, "WORD CORRELATIONS.csv")
I want to add in pairwise_count, but I need it alongside the phi coefficient.
Thank you!
If you are getting into using tidy data principles and tidyverse tools, I would suggest GOING ALL THE WAY :) and using dplyr to do the joins you are interested in. You can use left_join to connect the calculations from pairwise_cor() and pairwise_count(), and you can just pipe from one to the other, if you like.
library(dplyr)
library(tidytext)
library(janeaustenr)
library(widyr)
austen_section_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word)

austen_section_words %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE) %>%
  left_join(austen_section_words %>%
              pairwise_count(word, section, sort = TRUE),
            by = c("item1", "item2"))
#> # A tibble: 154,842 x 4
#> item1 item2 correlation n
#> <chr> <chr> <dbl> <dbl>
#> 1 bourgh de 0.9508501 29
#> 2 de bourgh 0.9508501 29
#> 3 pounds thousand 0.7005808 17
#> 4 thousand pounds 0.7005808 17
#> 5 william sir 0.6644719 31
#> 6 sir william 0.6644719 31
#> 7 catherine lady 0.6633048 82
#> 8 lady catherine 0.6633048 82
#> 9 forster colonel 0.6220950 27
#> 10 colonel forster 0.6220950 27
#> # ... with 154,832 more rows
I discovered and used merge today, and it appears to have used both relevant columns to join the data (base R's merge() joins on all shared column names by default). I'm not sure how to check for accuracy, but I think it worked.
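For reference, a minimal sketch of that base-R route, assuming word_cors and word_pairs from the question:
# merge() joins on all shared column names by default,
# i.e. both item1 and item2 here
combined <- merge(word_cors, word_pairs, by = c("item1", "item2"))
head(combined)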
