Efficiently transform XML to data frame - r

I need to transform some vanilla XML into a data frame. The XML is a simple representation of rectangular data (see example below). I can achieve this fairly straightforwardly in R with xml2 and a couple of for loops. However, I'm sure there is a much better/faster way (purrr?). The XML files I will ultimately be working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.
library(tidyverse)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
df <- data.frame()
for (i in seq_along(episodes)) {
  for (j in seq_along(dx_names)) {
    df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
  }
}
names(df) <- dx_names
df
#> item1 item2
#> 1 A 1
#> 2 B 2
Created on 2019-09-19 by the reprex package (v0.3.0)
Thank you in advance.

This is a general solution that handles a varying number of sub-nodes under each parent node; each EPISODE node may have different sub-nodes.
The strategy parses the child nodes, identifying the name and value of each sub-node. It then converts this into a long-format data frame and reshapes that into the desired wide format:
library(tidyr)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
#find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
#extract the node names and values from all of the episodes
nodenames <- xml_name(xml_children(episodes))
contents <- trimws(xml_text(xml_children(episodes)))
#identify the number of sub-nodes under each episode, for labelling
IDlist <- rep(1:length(episodes), xml_length(episodes))
#make a long dataframe
df <- data.frame(episodes = IDlist, nodenames, contents, stringsAsFactors = FALSE)
#make the dataframe wide, remove unused blank nodes:
answer <- spread(df[df$contents != "", ], nodenames, contents)
#tidyr 1.0.0 version
#answer <- pivot_wider(df, names_from = nodenames, values_from = contents)
# A tibble: 2 x 3
episodes item1 item2
<int> <chr> <chr>
1 1 A 1
2 2 B 2
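Since the question mentions purrr, here is a minimal sketch of the same long-to-wide idea with purrr::map_dfr, using the episodes node set from above (this assumes dplyr is also installed, since map_dfr row-binds with it):
library(purrr)
# one row per EPISODE; columns are named after each episode's child nodes,
# so episodes with differing sub-nodes are filled with NA when row-binding
map_dfr(episodes, function(ep) {
  kids <- xml_children(ep)
  as.list(set_names(xml_text(kids), xml_name(kids)))
})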

This may be an option without using a for loop,
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
# You can get all values between the tags with xml_text()
values <- xml_children(episodes) %>% xml_text()
as.data.frame(matrix(values,
                     ncol = length(dx_names),
                     dimnames = list(seq_along(episodes), dx_names),
                     byrow = TRUE))
gives,
item1 item2
1 A 1
2 B 2
Note that you may need to change the item2 column to numeric with as.numeric(), since it is assigned as a factor by this solution.
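For example, assuming the result above has been assigned to a (hypothetical) object df2:
# go via as.character() first, so a factor is not converted to its level codes
df2$item2 <- as.numeric(as.character(df2$item2))
str(df2)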

Related

How to produce an if statement to compare rows in R

I need to compare two adjacent rows in a column of a data frame; if the data in both rows matches, then save the most recent row, e.g.
# Animals
# 1 dog
# 2 cat
# 3 cat
It should compare dog and cat, find they differ, and not save any data, so rows 1 and 2 are not saved.
But when it moves on to compare cat and cat, it should realise they are the same and save those rows, i.e. rows 2 and 3. There are several other columns, but the Animals column is the only one I need to use to decide whether a row is saved; however, I want to keep all the data in the columns of the saved rows.
I need to do this for lots of rows, iterating through to compare a big set of data (~68,000 rows).
I've tried to produce an if statement in which:
# results <- list()
#
# for (i in 1:(nrow(data) - 1)) {
#   if (isTRUE(data$Animals[i + 1] == data$Animals[i])) {
#     output <- print(data$Animals[i + 1])
#     results[[i + 1]] <- output
#     output <- print(data$Animals[i])
#     results[[i]] <- output
#   }
# }
I then converted this results list into a dataframe for further manipulation. However this method only provides me with the animal name, I would prefer it the entire row was saved. I'm not too sure how to achieve this, I've been trying to edit the statement but I can't seem to get it working.
I'm new to R and learning, please help anyway you can, I'd appreciate it :)
To "prove" that we're saving the "most recent row", I'll add a row-number column. The data:
dat <- structure(list(Animals = c("dog", "cat", "cat"), row = 1:3), row.names = c(NA, -3L), class = "data.frame")
dat
# Animals row
# 1 dog 1
# 2 cat 2
# 3 cat 3
base R
dat[c(with(dat, Animals[-nrow(dat)] != Animals[-1]), TRUE), , drop = FALSE]
# Animals row
# 1 dog 1
# 3 cat 3
dplyr
library(dplyr)
dat %>%
  filter(Animals != lead(Animals, default = ''))
# Animals row
# 1 dog 1
# 2 cat 3
The only caution I have with this is that if package-loading is at all out of order, there exist both stats::filter and stats::lag, which behave completely differently. If you see odd results, try prepending dplyr:: to make sure it isn't a which-function-am-I-using problem.
dat %>%
  dplyr::filter(Animals != dplyr::lead(Animals, default = ''))
We could use lead and filter
library(dplyr)
df %>%
  mutate(helper = lead(animals)) %>%
  filter(animals == helper) %>%
  select(animals)
Output:
animals
<chr>
1 cat
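If the goal is instead to keep both rows of each matching pair, with all their columns (as the question describes), a minimal sketch along the same lines would be (assuming the data frame is df with a lower-case animals column, as above):
library(dplyr)
df %>%
  filter(animals == lead(animals, default = '') |
         animals == lag(animals, default = ''))
# for the dog/cat/cat example this keeps rows 2 and 3 and drops row 1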

Using regular expressions to change element names for dataframes in a list

I have a list of many dataframes, in which I'd like to change certain elements within the dataframes using regular expressions. Here is a shortened mock-up of my data:
df1 <- data.frame(ID = c("KBS_2015_08_25_A1_P1", "KBS_2015_08_25_A2_P10", "KBS_2015_09_04_A2_P2"),
                  Site = c("KBS", "KBS", "KBS"))
df2 <- data.frame(ID = c("UMBS_2015_08_12_A1_P1", "UMBS_2015_08_29_D3_P3", "UMBS_2015_08_29_D5_P5"),
                  Site = c("UMBS", "UMBS", "UMBS"))
df_list <- list(df1=df1,df2=df2)
I attempted to make a function that takes the information in the ID column and changes it to a character string of a date.
change_id <- function(df){
  df$ID[df$ID == "^KBS_2015_08_25*P\\d"] <- "8/25/2015"
  df$ID[df$ID == "^KBS_2015_09_04*P\\d"] <- "9/4/2015"
  df$ID[df$ID == "^UMBS_2015_08_12*P\\d"] <- "8/12/2015"
  df$ID[df$ID == "^UMBS_2015_08_29*P\\d"] <- "8/29/2015"
  return(df)
}
df_list <- lapply(df_list, change_id)
I don't get any errors, but this function doesn't change anything in the dataframes. I must be missing something for my attempt at character matching.
Using R version 4.0.2, Mac OS X 10.13.6
We can use sub. (The == in the attempt tests for exact string equality rather than a regex match, so those patterns never match anything and nothing gets replaced.) Here sub() captures the date portion of the ID and keeps only that:
lapply(df_list, transform, ID = sub(".*_(\\d{4}_\\d{2}_\\d{2})_.*", "\\1", ID))
If it needs to be in a specific format, convert to Date class and then use format:
df_list1 <- lapply(df_list, transform,
                   ID = format(as.Date(sub(".*_(\\d{4}_\\d{2}_\\d{2})_.*",
                                           "\\1", ID), "%Y_%m_%d"), "%m/%d/%Y"))
-output
df_list1
#$df1
# ID Site
#1 08/25/2015 KBS
#2 08/25/2015 KBS
#3 09/04/2015 KBS
#$df2
# ID Site
#1 08/12/2015 UMBS
#2 08/29/2015 UMBS
#3 08/29/2015 UMBS
An alternative to @akrun's excellent answer is a "join" methodology. The reason this can be good is that the pattern/replacement list can be kept as a single frame/table, making maintenance a bit easier.
It operates by using fuzzyjoin::regex_left_join, which is similar to merge and dplyr::left_join but joins on pattern matches.
ptns <- data.frame(
ID_ptn = c("^KBS_2015_08_25.*P\\d", "^KBS_2015_09_04.*P\\d",
"^UMBS_2015_08_12.*P\\d", "^UMBS_2015_08_29.*P\\d"),
ID_new = c("8/25/2015", "9/4/2015", "8/12/2015", "8/29/2015")
)
fuzzyjoin::regex_left_join(df1, ptns, by = c("ID" = "ID_ptn"))
# ID Site ID_ptn ID_new
# 1 KBS_2015_08_25_A1_P1 KBS ^KBS_2015_08_25.*P\\d 8/25/2015
# 2 KBS_2015_08_25_A2_P10 KBS ^KBS_2015_08_25.*P\\d 8/25/2015
# 3 KBS_2015_09_04_A2_P2 KBS ^KBS_2015_09_04.*P\\d 9/4/2015
Expanding this to the larger list can be done with:
lapply(df_list, function(df) {
  tmp <- fuzzyjoin::regex_left_join(df, ptns, by = c("ID" = "ID_ptn"))
  tmp$ID <- replace(tmp$ID, !is.na(tmp$ID_new), tmp$ID_new)
  tmp[names(ptns)] <- NULL
  tmp
})
# $df1
# ID Site
# 1 8/25/2015 KBS
# 2 8/25/2015 KBS
# 3 9/4/2015 KBS
# $df2
# ID Site
# 1 8/12/2015 UMBS
# 2 8/29/2015 UMBS
# 3 8/29/2015 UMBS
This is an alternative to the more straightforward (and perhaps easier-to-understand) answer by @akrun. I offer it as a different way of looking at the problem.
(I will offer one caution: if it is possible that patterns overlap, where two or more patterns could match a single ID, then some more steps need to be taken to determine which one to use. This will show up as some rows repeating and the number of rows increasing through the join. This is not likely given the current patterns, but ... caveat emptor.)
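For example, one could compare the row counts before and after the join, and use duplicated() to keep the first match as a crude tie-break (a sketch reusing the ptns and df_list objects above):
lapply(df_list, function(df) {
  tmp <- fuzzyjoin::regex_left_join(df, ptns, by = c("ID" = "ID_ptn"))
  # more rows after the join than before means at least one ID matched several patterns
  if (nrow(tmp) > nrow(df)) {
    warning("some IDs matched more than one pattern; keeping the first match only")
    tmp <- tmp[!duplicated(tmp$ID), ]
  }
  tmp
})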

Saving the output of a str_which loop in R

I work with a sheet of data that lists a variety of scientific publications. Rows are publications,
columns are a variety of metrics describing each publication (author name and position, Pubmed IDs, Date etc...)
I want to filter for publications for each author and extract parts of them. The caveat is the format:
all author names (5-80 per cell) are lumped together in one cell for each row.
I managed to solve this with the use of str_which, saving the coordinates for each author and extracting them later. This works only for manual use. When I try to automate this process using a loop to draw on a list of authors, I fail to save the output.
I am at a bit of a loss on how to store the results without overwriting previous ones.
sampleDat <-
  data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
             var2 = c(1:3),
             var3 = c("2016-01", "2016-03", "2017-05"))
A list of names that I want the coordinates for:
namesOfInterest <-
  list(c("Doe J", "Theobald U"))
The manual extraction, requiring me to type the exact name and output object:
Doe <- str_which(sampleDat$var1, "Doe J")
Theobald <- str_which(sampleDat$var1, "Theobald U")
One of many attempts that does not replicate the manual version:
results <- c()
for (i in namesOfInterest) {
  results[i] <- str_which(sampleDat$var1, i)
}
The for loop is set up incorrectly (it needs to be something like for(i in 1:n){do something}). Also, even if you fix that, you'll get an error related to the fact that str_which returns a vector of varying length, indicating the position of each of the matches it makes (and it can make multiple matches). Thus, indexing a vector in a loop won't work here, because whenever an author has multiple matches, more than one entry will be saved to a single element, throwing an error.
Solve this by working with lists, because lists can hold vectors of arbitrary length. Index the list with double bracket notation: [[.
library(stringr)
sampleDat <-
  data.frame(var1 = c("Doe J, Maxwell M, Kim HE", "Cronauer R, Carst W, Theobald U", "Theobald U, Hey B, Joff S"),
             var2 = c(1:3),
             var3 = c("2016-01", "2016-03", "2017-05"))
# no need for list here. a simple vector will do
namesOfInterest <- c("Doe J", "Theobald U")
# initialize list
results <- vector("list", length = length(namesOfInterest))
# loop over the names, saving the output of `str_which` in each list element.
# seq_along(x) is similar to 1:length(x)
for (i in seq_along(namesOfInterest)) {
  results[[i]] <- str_which(sampleDat$var1, namesOfInterest[i])
}
which returns:
> results
[[1]]
[1] 1
[[2]]
[1] 2 3
The way to understand the output above is that the ith element of the list, results[[i]] contains the output of str_which(sampleDat$var1, namesOfInterest[i]), where namesOfInterest[i] is always exactly one author. However, the length of results[[i]] can be longer than one:
> sapply(results, length)
[1] 1 2
indicating that a single author can be mentioned multiple times. In the example above, sapply counts the length of each vector along the list results, showing that namesOfInterest[1] has one paper and namesOfInterest[2] has two.
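As a small follow-up, naming the list elements after the authors makes the output easier to read, for example:
names(results) <- namesOfInterest
results
# $`Doe J`
# [1] 1
#
# $`Theobald U`
# [1] 2 3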
Here is another approach for you. If you want to know which scholar is in which publication, you can do the following as well. First, assign unique IDs to publications. Then, split authors and create a long-format data frame. Define groups by authors and aggregate publication ID (pub_id) as string (character). If you need to extract some authors, you can use this data frame (foo) and subset rows.
library(tidyverse)
mutate(sampleDat, pub_id = 1:n()) %>%
  separate_rows(var1, sep = ",\\s") %>%
  group_by(var1) %>%
  summarize(pub_id = toString(pub_id)) -> foo
var1 pub_id
<chr> <chr>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2, 3
filter(foo, var1 %in% c("Doe J", "Theobald U"))
var1 pub_id
<chr> <chr>
1 Doe J 1
2 Theobald U 2, 3
If you want to have index as numeric, you can twist the idea above and do the following. You can subset rows with targeted names with filter().
mutate(sampleDat, pub_id = 1:n()) %>%
  separate_rows(var1, sep = ",\\s") %>%
  group_by(var1) %>%
  summarize(pub_id = list(pub_id)) %>%
  unnest(pub_id)
var1 pub_id
<chr> <int>
1 Carst W 2
2 Cronauer R 2
3 Doe J 1
4 Hey B 3
5 Joff S 3
6 Kim HE 1
7 Maxwell M 1
8 Theobald U 2
9 Theobald U 3

I subsetted a list of words from a larger list of 72 items. How do I determine what list number (1-72) those words came from?

I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists (each list contains 10 sentences), and I saved them in an appropriate structure. I did those steps in R; the code is immediately below.
#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()
#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1
for (sentence in sentences) {
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if (length(sentenceList) == 10) { # if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList # append those 10 sentences under the respective list header
    sentenceList <- list() # empty the temporary list the 10 sentences were collected into
    n <- n + 1 # move on to the next list name
  }
}
#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
THEN, in PYTHON, I computed a list of all the words ending in "ing" and what their frequency was, aka, how many times they appeared across all 72 lists.
path="/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)
import pandas as pd
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
df['Sentences'] = df['Sentences'].str.replace(".", "")
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)

ingWordCountDictionary = {}
for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1
print(ingWordCountDictionary)

f = open("ingWordCountDictionary.txt", "w")
for key, value in ingWordCountDictionary.items():
    keyValuePairToWrite = "%s, %s\n" % (key, value)
    f.write(keyValuePairToWrite)
f.close()
Now, I am being asked to create a dataset which shows what list (1 of 72) each "ing" word is derived from. THIS IS WHAT I DON'T KNOW HOW TO DO. I obviously know they are a subset of the huge 72-item list, but how do I figure out which list those words came from?
The expected output should look something like this:
[List Number] [-ing Word]
List 1 swing, ring, etc.,
List 2 moving
so and so forth
Here is one way for you. Judging from the expected result, you seem to want verbs in progressive form (V-ing). (I do not understand why you have king in your result. If you have king, you should have spring here as well, for example.) If you need to consider lexical classes, I think you want to use the koRpus package. If not, you can use the textstem package, for example.
First, I scraped the link and created a data frame. Then, I split sentences into words using unnest_tokens() in the tidytext package and subsetted the words ending with 'ing'. Then, I used treetag() in the koRpus package. You need to install TreeTagger yourself before you use the package. Finally, I counted how many times these verbs in progressive form appear in the data set. I hope this will help you.
library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("h2") %>%
html_text() -> so_list
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("li") %>%
html_text() -> so_text
# Create a data frame
sodf <- tibble(list_name = rep(so_list, each = 10),
text = so_text)
# Split senteces into words and get words ending with ING.
unnest_tokens(sodf, input = text, output = word) %>%
filter(grepl(x = word, pattern = "ing$")) -> sowords
# Use koRpus package to lemmatize the words in sowords$word.
treetag(sowords$word, treetagger = "manual", format = "obj",
TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
preset = "en")) -> out
# Access to the data frame and filter the words. It seems that you are looking
# for verbs. So I did that here.
filter(out#TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>%
count(token)
# A tibble: 16 x 2
# token n
# <chr> <int>
# 1 adding 1
# 2 bring 4
# 3 changing 1
# 4 drenching 1
# 5 dying 1
# 6 lodging 1
# 7 making 1
# 8 raging 1
# 9 shipping 1
#10 sing 1
#11 sleeping 2
#12 wading 1
#13 waiting 1
#14 wearing 1
#15 winding 2
#16 working 1
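If the part-of-speech filtering is not essential and you only need the list-to-word mapping asked for in the question, the sowords object above can be summarised directly (note this keeps every word ending in ing, nouns included):
sowords %>%
  group_by(list_name) %>%
  summarize(ing_words = toString(unique(word)))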
How did you store the data from the lists (i.e. what does your data.frame look like)? Could you provide an example?
Without seeing this, I suggest you save the data in a list as follows:
COLUMN 1 , COLUMN 2, COLUMN 3
"List number", "Sentence", "-ING words (as vector)"
I hope this makes sense, let me know if you need more help. I wasn't able to comment on this post unfortunately.
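A minimal sketch of that structure, assuming (as the question states) that each of the 72 headers corresponds to exactly ten scraped sentences, could reuse the sentences and headers objects from the question:
library(stringr)
sentence_df <- data.frame(list_number = rep(headers, each = 10),
                          sentence = sentences,
                          stringsAsFactors = FALSE)
# store the -ing words of each sentence as a list-column (character(0) when none)
sentence_df$ing_words <- str_extract_all(sentence_df$sentence, "\\b\\w+ing\\b")
head(sentence_df)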

R: Stacking Multiple Punch Question Data

Suppose we have 2 questions in a survey; one is about how likely an individual is to recommend a company (let's say there are 2 companies for simplicity).
So, I have one data.frame with 2 columns for this question:
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Company1","Company2")
And, suppose we have another question that asks respondents to checkmark a box beside an attribute that they believe "fits" with the company.
So, I have another data.frame with 4 columns for this question:
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"Attribute1.Company1",
"Attribute2.Company1",
"Attribute1.Company2",
"Attribute2.Company2")
Now, what I would like to be able to do is review how Attributes 1 and 2 are related to the scale in the likelihood-to-recommend question, for all companies (company-independent), just to get an idea of what inertia lies between people who are highly likely to recommend and Attribute 1, for example.
So, I start off by binding the two questions together:
df <- cbind(df.recommend, df.attribute)
My problem is trying to figure out how to stack these data such that the columns look something like:
df.stacked <- data.frame(c(df$Company1, df$Company2),
                         c(df$Attribute1.Company1, df$Attribute1.Company2),
                         c(df$Attribute2.Company1, df$Attribute2.Company2))
colnames(df.stacked) <- c("Likelihood", "Attribute1", "Attribute2")
This example is simplified to a large degree. In my actual problem, I have 34 companies and 24 attributes.
Could you think of a way to stack them effectively, without having to type out all the c() statements?
Note: the column pattern for likelihood is Co1,Co2,Co3,Co4... and the pattern for the attributes is At1.Co1,At2.Co1,At3.Co1 ... At1.Co34,At2.Co34...
For this type of problem, Hadley's reshape package is the perfect tool. I combine it with a few stringr and plyr statements (also packages written by Hadley).
Here is what I believe to be a complete solution in about a dozen lines of code.
First, create some data
library(reshape2) # EDIT 1: reshape2 is faster
library(stringr)
library(plyr)
# Create data frame
# Important: note the addition of a respondent id column
df_comp <- data.frame(
  RespID = 1:10,
  Company1 = rep(1:5, 2),
  Company2 = rep(1:5, 2)
)
df_attr <- data.frame(
  RespID = 1:10,
  Attribute1.Company1 = rep(0:1, 5),
  Attribute2.Company1 = rep(1:0, 5),
  Attribute1.Company2 = rep(0:1, 5),
  Attribute2.Company2 = rep(1:0, 5)
)
Now start the data manipulation:
# Use melt to convert data from wide to tall
melt_comp <- melt(df_comp, id.vars = "RespID")
melt_comp <- rename(melt_comp, c(variable = "comp", value = "likelihood"))
melt_attr <- melt(df_attr, id.vars = "RespID")
# Use colsplit to split attribute variables into attribute and company
# "." period needs to be escaped
# EDIT 2: reshape::colsplit is simpler than str_split
split <- colsplit(melt_attr$variable, "\\.", names = c("attr", "comp"))
melt_attr <- data.frame(melt_attr, split)
melt_attr$variable <- NULL
# Use dcast (the reshape2 equivalent of cast) to convert from tall to somewhat tall
cast_attr <- dcast(melt_attr, RespID + comp ~ attr, mean)
# Combine data frames using join() in package plyr
df <- join(melt_comp, cast_attr)
head(df)
head(df)
And the output:
RespID comp likelihood Attribute1 Attribute2
1 1 Company1 1 0 1
2 2 Company1 2 1 0
3 3 Company1 3 0 1
4 4 Company1 4 1 0
5 5 Company1 5 0 1
6 6 Company1 1 1 0
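To then look at how the attributes relate to the likelihood scale across companies (the stated goal), the stacked frame can be summarised directly; for example:
# share of respondents ticking each attribute, by likelihood score (all companies pooled)
aggregate(cbind(Attribute1, Attribute2) ~ likelihood, data = df, FUN = mean)
# or a simple cross-tabulation
with(df, table(likelihood, Attribute1))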
Something I quickly cooked up. It doesn't look the best and uses a for loop, but that shouldn't be a problem with only 24 attributes to loop over.
df.recommend <- data.frame(rep(1:5,20),rep(1:5,20))
colnames(df.recommend) <- c("Co1","Co2")
df.attribute <- data.frame(rep(0:1,50),rep(1:0,50),rep(0:1,50),rep(1:0,50))
colnames(df.attribute) <- c(
"At1.Co1",
"At2.Co1",
"At1.Co2",
"At2.Co2")
df.stacked <- data.frame(
  likelihood = unlist(df.recommend)
)
str <- strsplit(names(df.attribute), split = "\\.")
atts <- unique(sapply(str, function(x) x[1]))
for (i in 1:length(atts)) {
  df.stacked[, i + 1] <- unlist(df.attribute[sapply(str, function(x) x[1] == atts[i])])
}
names(df.stacked) <- c("likelihood", paste("attribute", 1:length(atts), sep = ""))
EDIT: It assumes that companies are in the same order for each attribute
