Parsing XML for Ancient Greek Plays with speaker and dialogue

I am currently trying to read Greek plays which are available online as XML files into a data frame with a dialogue and speaker column.
I run the following commands to download the XML and parse the dialogue and speakers.
library(XML)
library(RCurl)
url <- "http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A1999.01.0186"
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText=TRUE)
plain.text <- xpathSApply(doc, "//p", xmlValue)
speakersc <- xpathSApply(doc, "//speaker", xmlValue)
dialogue <- data.frame(text = plain.text, stringsAsFactors = FALSE)
speakers <- data.frame(text = speakersc, stringsAsFactors = FALSE)
However, I then encounter a problem: the dialogue yields 300 rows (for 300 distinct lines in the play), but the speaker yields only 297.
The problem stems from the structure of the XML, reproduced below, where the <speaker> tag is not repeated for continued dialogue interrupted by a stage direction. Because I have to separate the dialogue
by the <p> tag, this yields two dialogue rows but only one speaker row, without duplicating the speaker accordingly.
<speaker>Creon</speaker>
<stage>To the Guard.</stage>
<p>
You can take yourself wherever you please,
<milestone n="445" unit="line" ed="p"/>
free and clear of a heavy charge.
<stage>Exit Guard.</stage>
</p>
</sp>
<sp>
<stage>To Antigone.</stage>
<p>You, however, tell me—not at length, but briefly—did you know that an edict had forbidden this?</p>
</sp>
How can I parse the XML so the data will correctly yield the same number of dialogue rows for the same number of corresponding speaker rows?
For the above example, I would like the resulting data frame to either contain two rows for Creon's dialogue corresponding to the two lines of dialogue prior and after the stage direction, or one row which treats Creon's dialogue as one line ignoring the separation due to the stage direction.
Thank you very much for your help.

Consider using XPath's forward-looking following-sibling axis to search for the next <p> tag when speaker is empty, all while iterating through <sp>, which is the parent of both <speaker> and <p>:
# ALL SP NODES
sp <- xpathSApply(doc, "//body/descendant::sp", xmlValue)
# ITERATE THROUGH EACH SP BY NODE INDEX TO CREATE LIST OF DFs
dfList <- lapply(seq_along(sp), function(i){
  data.frame(
    speakers = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker,'')"), xmlValue),
    dialogue = xpathSApply(doc, paste0("concat(//body/descendant::sp[",i,"]/speaker/following-sibling::p[1], ' ',
                           //body/descendant::sp[position()=",i+1," and not(speaker)]/p[1])"), xmlValue)
  )
})
# ROW BIND LIST OF DFs AND SUBSET EMPTY SPEAKER/DIALOGUE
finaldf <- subset(do.call(rbind, dfList), speakers!="" & dialogue!="")
# SPECIFIC ROWS IN OP'S HIGHLIGHT
finaldf[85,]
# speakers
# 85 Creon
#
# dialogue
# 85 You can take yourself wherever you please,free and clear of a heavy
# charge.Exit Guard. You, however, tell me—not at length, but
# briefly—did you know that an edict had forbidden this?
finaldf[86,]
# speakers dialogue
# 87 Antigone I knew it. How could I not? It was public.

Another option is to read the URL and make some updates before parsing the XML; in this case, replace milestone tags with a space (to avoid mashing words together), remove stage tags, and then fix the sp node without a speaker:
x <- readLines(url)
x <- gsub("<milestone[^>]*>", " ", x) # add space
x <- gsub("<stage>[^>]*stage>", "", x) # no space
x <- paste(x, collapse = "")
x <- gsub("</p></sp><sp><p>", "", x) # fix sp without speaker
Now the XML has the same number of sp and speaker tags.
doc <- xmlParse(x)
summary(doc)
p sp speaker div2 placeName
299 297 297 51 25 ...
Finally, get the sp nodes and parse speaker and paragraph.
sp <- getNodeSet(doc, "//sp")
s1 <- sapply( sp, xpathSApply, ".//speaker", xmlValue)
# collapse the 1 node with 2 <p>
p1 <- lapply( sp, xpathSApply, ".//p", xmlValue)
p1 <- trimws(sapply(p1, paste, collapse= " "))
speakers <- data.frame(speaker=s1, dialogue = p1)
speaker dialogue
1 Antigone Ismene, my sister, true child of my own mother, do you know any evil o...
2 Ismene To me no word of our friends, Antigone, either bringing joy or bringin...
3 Antigone I knew it well, so I was trying to bring you outside the courtyard gat...
4 Ismene Hear what? It is clear that you are brooding on some dark news.
5 Antigone Why not? Has not Creon destined our brothers, the one to honored buri...
6 Ismene Poor sister, if things have come to this, what would I profit by loose...
7 Antigone Consider whether you will share the toil and the task.
8 Ismene What are you hazarding? What do you intend?
9 Antigone Will you join your hand to mine in order to lift his corpse?
10 Ismene You plan to bury him—when it is forbidden to the city?
...

Related

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector; it has been split by words into novel.vector.words. I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string, and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC display (key words in context).
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
  # access the current match
  node <- novel.vector.words[node.positions[i]]
  # access the left context of the current match
  left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
  # access the right context of the current match
  right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
  # concatenate and print the results
  cat(left.context, "\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)
}
What I am not sure how to do, however, is use something like an if statement to capture only instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" the loop finds, I want it to check whether the word immediately following is "of", find all of those instances, and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
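For reference, a base R equivalent shifts the vector by hand (and uses exact matches rather than grepl, so e.g. "bloody" would not count):
next.word <- c(novel.vector.words[-1], NA)  # next.word[i] is the word after position i; NA pads the end
which(novel.vector.words == "blood" & next.word == "of")
[1] 1 5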
In response to the question in the comments:
This certainly could be done with a loop-based approach, but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text-mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135
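Because unnest_tokens carries the line column through, those indices point straight back into the original text if you want to inspect the matching lines:
fulltext$text[head(blood_of$line)]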

JSON applied over a dataframe in R

I used the code below on one website and it returned a perfect result, looking for the keyword Emaar pasted at the end of the query:
library(httr)
library(jsonlite)
query<-"https://www.googleapis.com/customsearch/v1?key=AIzaSyA0KdZHRkAjmoxKL14eEXp2vnI4Yg_po38&cx=006431301429107149113:as7yqcm2qc8&q=Emaar"
result11 <- content(GET(query))
print(result11)
result11_JSON <- toJSON(result11)
result11_JSON <- fromJSON(result11_JSON)
result11_df <- as.data.frame(result11_JSON)
Now I want to apply the same function over a data frame containing keywords, so I made the test .csv file below:
Company Name
[1] ADES International Holding Ltd
[2] Emirates REIT (CEIC) Limited
[3] POLARCUS LIMITED
called it Testing Website Extraction.csv
code used:
test_companies <- read.csv("... \\Testing Website Extraction.csv")
#removing space and adding "+" sign, then pasting query before it (query already has my unique Google key and search engine ID)
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
The result here is a list of length 3 (one per search term), with sublists within each term containing: url (list[2]), queries (list[2]), ..., items (list[10]); these are the same for each search term (same length separately). My issue is applying the remainder of the code.
#when i run:
result_JSON <- toJSON(result)
result_JSON <- as.list(fromJSON(result_JSON))
I get a list of 6 lists that themselves contain sublists,
and putting it into a tidy data frame where the results are listed under each other (not separately) is proving to be difficult.
Also note that I tried taking the 3 separate lists inside "result" one by one, but that is a lot of manual labor if I have a longer list of keywords.
The expected end result should include 30 observations of 37 variables (for each search term, 10 observations of 37 variables, all underneath each other).
Things I have tried unsuccessfully:
These work to flatten the list:
#do.call(c , result)
#all.equal(listofvectors, res, check.attributes = FALSE)
#unlist(result, recursive = FALSE)
# for (i in 1:length(result)) {listofvectors <- c(listofvectors, result[[i]])}
#rbind()
#rbind.fill()
even after flattening I don't know how to organize them into a tidy final output for a non-R user to interact with.
Any help here would be greatly appreciated,
I am here in case anything is not clear about my question,
Always happy to learn more about R so please bear with me as I am just starting to catch up.
All the best and thanks in advance!
Basically, what I did is extract only the columns I need from the data frame list; below is the final code:
library(httr)
library(jsonlite)
library(tidyr)
library(stringr)
library(purrr)
library(plyr)
test_companies <- read.csv("c:\\users\\... Companies Without Websites List.csv")
test_companies$plus <- gsub(" ", "+", test_companies$Company.Name)
query <- "https://www.googleapis.com/customsearch/v1?key=AIzaSyCmD6FRaonSmZWrjwX6JJgYMfDSwlR1z0Y&cx=006431301429107149113:as7yqcm2qc8&q="
test_companies$plus <- paste0(query, test_companies$plus)
a <- test_companies$plus
length(a)
function_webs_search <- function(web_search) {content(GET(web_search))}
result <- lapply(as.character(a), function_webs_search)
function_toJSONall <- function(all) {toJSON(all)}
a <- lapply(result, function_toJSONall)
function_fromJSONall <- function(all) {fromJSON(all)}
b <- lapply(a, function_fromJSONall)
function_dataframe <- function(all) {as.data.frame(all)}
c <- lapply(b, function_dataframe)
function_column <- function(all) {all[ ,15:30]}
result_final <- lapply(c, function_column)
results_df <- rbind.fill(result_final)
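For what it's worth, the repeated lapply round-trips can be collapsed into one pipeline; here is a sketch with purrr that mirrors the steps above (the 15:30 column positions are taken from the code here, not a general rule):
library(purrr)
results_df <- result %>%
  map(~ as.data.frame(fromJSON(toJSON(.x)))) %>%  # flatten each API response
  map(~ .x[ ,15:30]) %>%                          # keep only the needed columns
  plyr::rbind.fill()                              # stack into one data frame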

How to break apart a play script with the form "Speaker: Dialogue" to get all of a character's dialogue into a single text block?

The text I am using is below.
So far, I have imported the text:
tempest.v <- scan("data/plainText/tempest.txt", what="character", sep="\n")
Identified where all of the speaker positions begin:
speaker.positions.v <- grep('^[^\\s]\\w+:', tempest.v)
Added a marker at the end of the text:
tempest.v <- c(tempest.v, "END:")
Here's the part where I'm having difficulty (assuming what I've already done is useful):
for(i in 1:length(speaker.positions.v)){
  if(i != length(speaker.positions.v)){
    speaker.name <- tempest.v[speaker.positions.v[i]]
    speaker.name <- strsplit(speaker.name, ":")
    speaker.name <- unlist(speaker.name)
    start <- speaker.positions.v[i]+1
    end <- speaker.positions.v[i+1]-1
    speaker.lines.v <- tempest.v[start:end]
  }
}
Now I have a variable, speaker.name, that has, on the left-hand side of the split, the name of the character who is speaking. The right-hand side of the split is the dialogue, but only up through the first line break.
I set the start of the dialogue block at position [i]+1 and
the end at [i+1]-1 (i.e., one position back from the beginning of the subsequent speaker's name).
Now I have a variable, speaker.lines.v with all of the lines of dialogue for that speaker for that one speech.
How can I collect all of Prospero's then Miranda's (then any other character's) dialogue into a single (list? vector? data frame?) for analysis?
Any help with this would be greatly appreciated.
Happy New Year!
--- TEXT ---
Miranda: If by your art, my dearest father, you have
Put the wild waters in this roar, allay them.
The sky, it seems, would pour down stinking pitch,
But that the sea, mounting to the welkin's cheek,
Dashes the fire out. O, I have suffered
With those that I saw suffer -- a brave vessel,
Who had, no doubt, some noble creature in her,
Dash'd all to pieces. O, the cry did knock
Against my very heart. Poor souls, they perish'd.
Had I been any god of power, I would
Have sunk the sea within the earth or ere
It should the good ship so have swallow'd and
The fraughting souls within her.
Prospero: Be collected:
No more amazement: tell your piteous heart
There's no harm done.
Miranda: O, woe the day!
Prospero: No harm.
I have done nothing but in care of thee,
Of thee, my dear one, thee, my daughter, who
Art ignorant of what thou art, nought knowing
Of whence I am, nor that I am more better
Than Prospero, master of a full poor cell,
And thy no greater father.
Miranda: More to know
Did never meddle with my thoughts.
Prospero: 'Tis time
I should inform thee farther. Lend thy hand,
And pluck my magic garment from me. So:
[Lays down his mantle]
Lie there, my art. Wipe thou thine eyes; have comfort.
The direful spectacle of the wreck, which touch'd
The very virtue of compassion in thee,
I have with such provision in mine art
So safely ordered that there is no soul—
No, not so much perdition as an hair
Betid to any creature in the vessel
Which thou heard'st cry, which thou saw'st sink. Sit down;
For thou must now know farther.
--- END TEXT ---
I first saved the text you put here as test.txt. Then read it:
tempest <- scan("~/Desktop/test.txt", what = "character", sep = "\n")
Then pulled only the spoken lines, as you:
speakers <- tempest[grepl("^[^\\s]\\w+:", tempest)]
Then we split off the speaker's name:
speaker_split <- strsplit(speakers, split = ":")
And get the names:
speaker_names <- sapply(speaker_split, "[", 1L)
And what they said (collapsing because their lines may have had other colons that we lost):
speaker_parts <- sapply(speaker_split, function(x) paste(x[-1L], collapse = ":"))
From here we just need indices of who said what and we can do what we want:
prosp <- which(speaker_names == "Prospero")
miran <- which(speaker_names == "Miranda")
And play to your hearts content.
Who said the most words?
> sum(unlist(strsplit(speaker_parts[prosp], split = "")) == " ")
[1] 82
> sum(unlist(strsplit(speaker_parts[miran], split = "")) == " ")
[1] 67
Prospero.
What is the frequency of letters used by Miranda?
> table(tolower(unlist(strsplit(gsub("[^A-Za-z]", "", speaker_parts[miran]),
split = ""))))
a b c d e f g h i k l m n o p r s t u v w y
17 3 2 11 34 7 3 21 16 5 7 7 9 17 3 14 18 30 11 5 10 8
We're going to use the rebus package to create regular expressions, stringi to match those regular expressions, and data.table to store the data.
library(rebus)
library(stringi)
library(data.table)
First trim leading and trailing spaces from the lines
tempest.v <- stri_trim(tempest.v)
Get rid of empty lines
tempest.v <- tempest.v[nzchar(tempest.v)]
Remove stage directions
stage_dir_rx <- exactly(
OPEN_BRACKET %R%
one_or_more(printable()) %R%
"]"
)
is_stage_dir_line <- stri_detect_regex(tempest.v, stage_dir_rx)
tempest.v <- tempest.v[!is_stage_dir_line]
Match lines containing "character: dialogue".
character_dialogue_rx <- START %R%
optional(capture(one_or_more(alpha()) %R% lookahead(":"))) %R%
optional(":") %R%
zero_or_more(space()) %R%
capture(one_or_more(printable()))
matches <- stri_match_first_regex(tempest.v, character_dialogue_rx)
Store the matches in a data.table (we need this for the roll functionality). A line number key column is also needed in a moment.
tempest_data <- data.table(
line_number = seq_len(nrow(matches)),
character = matches[, 2],
dialogue = matches[, 3]
)
Fill in missing values, using the method described in this answer.
setkey(tempest_data, line_number)
tempest_data[, character := tempest_data[!is.na(character)][tempest_data, character, roll = TRUE]]
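That roll join is simply a last-observation-carried-forward fill; an equivalent, arguably more readable version uses zoo (assuming, as here, that the very first line names a character, so there are no leading NAs):
library(zoo)
tempest_data[, character := na.locf(character)]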
The data currently has line information preserved: each row contains one line of dialogue.
line_number character dialogue
1: 1 Miranda If by your art, my de....
2: 2 Miranda Who had, no doubt, so....
3: 3 Prospero Be collected: No more....
4: 4 Miranda O, woe the day!
5: 5 Prospero No harm. I have done ....
6: 6 Miranda More to know Did neve....
7: 7 Prospero 'Tis time I should in....
8: 8 Prospero Lie there, my art. Wi....
To get all the dialogue for a given character as a single string, summarise using the by argument.
tempest_data[, .(all_dialogue = paste(dialogue, collapse = "\n")), by = "character"]
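And to pull a single character's speech as one string, filter in i first:
tempest_data[character == "Miranda", paste(dialogue, collapse = "\n")]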
I was interested in this question because I'm developing a series of tools for these types of tasks. Here is how to solve this problem using those tools.
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/textshape", "trinker/qdapRegex")
pacman::p_load(dplyr)
pat <- '^[^\\s]\\w+:'
"tempest.txt" %>%
readLines() %>%
{.[!grepl("^(---)|(^\\s*$)", .)]} %>%
split_match(pat, regex=TRUE, include=TRUE) %>%
textshape::combine() %>%
{setNames(., sapply(., function(x) unlist(ex_default(x, pattern = pat))))} %>%
bind_list("person") %>%
mutate(content = gsub(pat, "", content)) %>%
`[` %>%
textshape::combine()
The result:
person content
1 Miranda: If by your art, my dearest father, you ...
2 Prospero: Be collected No more amazement tell you ..
To avoid combining (as @RichieCotton displays initially), leave off the last textshape::combine() in the chain.

Most efficient way to read key value pairs where values span multiple lines?

What is the fastest way to parse a text file such as the example below into a two-column data.frame, which can then be transformed into a wide format?
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
Using readLines is problematic because the multi-line fields don't repeat the keys. Reading it as a fixed-width table also doesn't work. Suggestions? If not for the multi-line issue, this would be easily accomplished with a function that operates on each row/record, like so:
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
key value
1 FN Thomson Reuters Web of Science
Notes: The fields will always be uppercase and two characters. The entire title and list of authors can be concatenated into a single cell.
This should work:
library(zoo)
x <- read.fwf(file="tempSO.txt",widths=c(2,500),as.is=TRUE)
x$V1[x$V1==" "] <- NA
x$V1 <- na.locf(x$V1)
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")
Here's another idea, that might be useful if you want to stay in base R:
parseEntry <- function(entry) {
## Split at beginning of each line that starts with a non-space character
ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
## Clean up empty characters at beginning of continuation lines
ll <- gsub("\\n(\\s){3}", "", ll)
## Split each field into its two components
read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
Read lines of the file into a character vector using readLines and append a colon to each key. The result is then in DCF format so we can read it using read.dcf - this is the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally we create long, a long data.frame with one row per key:
L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
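A bonus of the DCF route: read.dcf natively handles multiple records separated by blank lines, returning one matrix row per record. If the file held several entries (and they were blank-line separated), stacking them all into long format is a small step:
## wide has one row per record; expand column-major into long format
long_all <- data.frame(
  key   = rep(colnames(wide), each = nrow(wide)),
  value = as.vector(wide),
  stringsAsFactors = FALSE
)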

R html scrape with redirect links, word searches, and counts

I am trying to streamline a tedious process of online data collection with R scraping code. The website I am currently interested in is here: Wisconsin Bills - Author index.
The website features a redirect link to each legislator, and under each legislator there is a list of bills introduced and a link to the major action summaries for each bill. My end goal is to create a data frame that includes a column for legislator name, the number of assembly bills (only links that include "AB") introduced, the number of bills that passed the assembly, and the number of bills signed into law.
Scraping the website, I have successfully created a data frame with each legislator's first name, last name, district, state (always WI), and year (always 1999; t-1 is when the session ended). Below is my code:
#specify the URL
url <- "https://docs.legis.wisconsin.gov/1997/related/author_index/assembly"
#download the HTML code
html <- getURL(url, ssl.verifypeer = FALSE, followlocation = TRUE)
#parse the HTML code
html.parsed <- htmlTreeParse(html, useInternalNodes = T)
# Get list of legislator names:
names <- xpathSApply(html.parsed, path="//a[contains(@href, 'authorindex')]", xmlValue)
# get all links into a list:
links <- xpathSApply(html.parsed, "//a/@href")
# see what I have:
head(links) # still have hrefs in there
links <- as.vector(links)
head(links) # good, hrefs are dropped.
# I only need the links that begin with /document/authorindex/1997.
typeof(links) # confirming its character
links # looking to see which ones to keep (only ones with "authorindex" and "A__", where the number that follows A is the district)
links <- links[14:114] # now the links only have the legislator redirects!!!
# Lets begin to build the final data frame needed:
# first, take a look at names- there are 104, but there are only 100 legislators...
names # elements 3-103 are leg names
names <- names[3:103]
# split up by first name, last name, etc.
names <- as.vector(names)
names1 <- strsplit(names, ",")
last.names <- sapply(names1, "[[", 1) # good- create a data frame
id = c(1:101)
df <- data.frame(ID= id)
df$last.name = last.names # now have an ID and their last name.
# now need district, party, and first names.
first_names <- strsplit(names, "p.")
first_names # now republicans have 3 elements, dems have 2, first word of 2nd element is first name
# do another strsplit
first_names <- as.character(first_names)
first_names <- strsplit(first_names, " ()")
first_names # 4th element is almost always their name! do it that way, correct those that messed up by hand
first_names <- sapply(first_names, "[[", 4)
first_names # 10 (Timothy), 90 (William) 80 (Joan H) 80 (Tom) 47 (John)
# 25 (Jose) 17 (Stephen) 5 (Spencer)
first_names[5] <- "Spencer"
first_names[10] <- "Timothy"
first_names[90] <- "William"
first_names[80] <- "Joan H."
first_names[81] <- "Tom"
first_names[47] <- "John"
first_names[25] <- "Jose"
first_names[17] <- "Stephen"
df$first.name <- first_names # first names- done.
# district:
district <- regmatches(names, gregexpr("[[:digit:]]+", names))
df$district <- district
df$state <- "WI"
df$year <- 1999
Now I'm stumped. I need to follow each redirect link and count the number of AB links under that legislator's name ONLY, then follow the AB links and count the number of AB pages for each legislator that contain the word "passed" and the number that contain the word "Sen.". I would thus like to add the following columns to the existing df:
Bills Introduced   Bills Passed Assembly   Bills Signed into Law
               4                       3                       2
              39                      18                      14
Etc. I get the sense I need to use loops, but I don't know how to approach it.
Any help would be incredibly appreciated.
Thank you!
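Not a complete answer, but the loop structure might look roughly like the sketch below: fetch each legislator page, keep the links whose anchor text contains "AB", fetch those bill pages, and grep their text for the target words. The base.url prefix and both XPath patterns are assumptions about the site's markup, not verified selectors, so treat this as scaffolding only.
# ROUGH SKETCH ONLY -- base.url and the XPath patterns are assumed, not verified
library(RCurl)
library(XML)
base.url <- "https://docs.legis.wisconsin.gov"  # assumed prefix for relative links
bill.counts <- lapply(links, function(leg.link) {
  leg.page <- getURL(paste0(base.url, leg.link),
                     ssl.verifypeer = FALSE, followlocation = TRUE)
  leg.doc  <- htmlParse(leg.page, asText = TRUE)
  # keep only links whose anchor text contains "AB" (assembly bills)
  ab.links <- xpathSApply(leg.doc, "//a[contains(., 'AB')]/@href")
  # fetch each bill page and search its text
  pages <- vapply(ab.links, function(b) {
    getURL(paste0(base.url, b), ssl.verifypeer = FALSE, followlocation = TRUE)
  }, character(1))
  data.frame(introduced = length(ab.links),
             passed     = sum(grepl("passed", pages, ignore.case = TRUE)),
             sen        = sum(grepl("Sen\\.", pages)))
})
df <- cbind(df, do.call(rbind, bill.counts))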
