Creating Tidy Text - r

I am using R for text analysis. I used the 'readtext' function to pull in text from a PDF. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one delimiter, '%%%%%', to split records into rows, and another delimiter, '#', to split fields into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data found in the dataframe is as follows:
895 "The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […"
896 "Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
I want to take this data and split the #Published, #Authors, #Journal, #URL into columns -- c("Published", "Authors", "Journal", "URL").
Any suggestions?
Thanks in advance!

This seems to work OK:
dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)
library(magrittr)
do.call(rbind, strsplit(dfr$TEXT, "#Published::|#Authors:|#Country:|#Journal:")) %>%
  as.data.frame %>%
  setNames(nm = c("Preamble","Published","Authors","Country","Journal"))
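Basically, this splits the text on any of the four field labels (note the double :: after Published!), row-binds the pieces, converts the result to a data frame, and names the columns. The #URL label is not in the split pattern, so in the second record the URL stays at the end of the Journal column.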
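Since the question also asks for a URL column, here is a minimal base R sketch that pulls each labelled field out separately, so a record without a given field simply gets NA. The helper extract_field, the vector name x, and the shortened sample records (including the example.org URL) are made up for illustration; they are not part of the answer above.

```r
# Shortened, illustrative records in the same shape as the question's data
x <- c(
  "Title one\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S#Country: United States #Journal:Project final report.",
  "Title two#Published:: June 6, 1994#Authors: Bolling DR#Country: United States #Journal:Fort Sam Houston, publication #HR 94-\n004.#URL: http://example.org/record"
)

fields <- c("Published", "Authors", "Country", "Journal", "URL")
# Matches any known label, so a stray "#HR 94-" is not treated as a field
any_label <- paste0("#(", paste(fields, collapse = "|"), "):")

# Hypothetical helper: text after "#<field>:" up to the next known label
extract_field <- function(x, field) {
  m <- regexpr(paste0("#", field, ":+\\s*"), x)
  rest <- ifelse(m > 0, substring(x, m + attr(m, "match.length")), NA_character_)
  nxt <- regexpr(any_label, rest)
  trimws(ifelse(!is.na(nxt) & nxt > 0, substring(rest, 1, nxt - 1), rest))
}

cols <- lapply(setNames(fields, fields), function(f) extract_field(x, f))
tidy <- data.frame(
  Title = trimws(substring(x, 1, regexpr(any_label, x) - 1)),
  cols,
  stringsAsFactors = FALSE
)
tidy
```

Extracting field by field costs a few more lines than a single strsplit, but it does not depend on every record having the same fields in the same order.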

Related

extracting strings from a dataframe row containing multiple entries?

I have a messy csv dataset in which several (but not all) rows unfortunately contain multiple entries. For each row, I'd like to separate each entry out so that I can create a list of the unique values (in this case, a list of specific clinical trial sites). The multiple entries are separated by "|". To make life even more fun, I'd like to exclude any entry that isn't from the US.
I'm just having a very tough time conceptualizing how to even start. If each line only had one value I think I could work through it with base R. Maybe something from tidyverse... separate or separate_rows, or use regex to extract everything bounded by "|"?
Example data:
    Locations
1   University of Pennsylvania, Philadelphia, Pennsylvania, United States | University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States | Houston Methodist Cancer Center, Houston, Texas, United States
2   Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States | Moffitt Cancer Center, Tampa, Florida, United States | Biomira Inc.
    Edmonton, Alberta, Canada
3   Massachusetts General Hospital, Boston, Massachusetts, United States
4   Moffitt Cancer Center, Tampa, Florida, United States | Sunnybrook Health Sciences Centre
    Toronto, Ontario, Canada
5   Memorial Sloan Kettering Cancer Center, New York, New York, United States
6   Duke University Medical Center, Durham, North Carolina, United States
7   Moffitt Cancer Center, Tampa, Florida, United States
8   Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre
    Calgary, Alberta, Canada
9   Houston Methodist Cancer Center, Houston, Texas, United States
10  University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Desired output:
University of Pennsylvania, Philadelphia, Pennsylvania, United States
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Houston Methodist Cancer Center, Houston, Texas, United States
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States
Moffitt Cancer Center, Tampa, Florida, United States
(etc etc etc)
Duh, turned out to be almost trivial.
df %>%
  tidyr::separate_rows(Locations, sep = "\\|", convert = TRUE)
The tricky part was escaping the "|" symbol, which is otherwise read as the regex "or" operator!
library(stringr)
z <- c("Thing one | another thing | yet another", "different thing | other thing")
z
#> [1] "Thing one | another thing | yet another"
#> [2] "different thing | other thing"
str_split(z, "\\|") %>%
  unlist
#> [1] "Thing one " " another thing " " yet another" "different thing "
#> [5] " other thing"
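Neither snippet above trims the spaces around "|" or drops the non-US sites, so a base R sketch along these lines may help with the rest of the question. It assumes that US entries end in "United States"; the data frame is a shortened copy of the example data, and the names sites and us_sites are made up.

```r
# Shortened copy of the example data (note the embedded line break in row 2)
df <- data.frame(Locations = c(
  "University of Pennsylvania, Philadelphia, Pennsylvania, United States | Houston Methodist Cancer Center, Houston, Texas, United States",
  "Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre\nCalgary, Alberta, Canada",
  "Moffitt Cancer Center, Tampa, Florida, United States"
), stringsAsFactors = FALSE)

# fixed = TRUE treats "|" literally, so no escaping is needed
sites <- trimws(unlist(strsplit(df$Locations, "|", fixed = TRUE)))

# Keep entries that end in "United States", then drop duplicates
us_sites <- unique(sites[grepl("United States$", sites)])
us_sites
```

Using fixed = TRUE sidesteps the escaping problem entirely, at the cost of not being able to fold the surrounding whitespace into the separator; trimws handles that afterwards.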

How to remove the first few characters in a column in R?

My data (csv file) has a column that contains uninformative characters (e.g. special characters, random lowercase letters), and I want to remove them.
df <- data.frame(Affiliation = c(". Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","**Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","aas Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","ac Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
The number of characters I want to remove (e.g. ".", "**", "aas", "ac") varies from line to line, as shown above.
Expected output:
df <- data.frame(Affiliation = c("Biotechnology Centre, Malaysia Agricultural Research and Development Institute (MARDI), Serdang, Malaysia","Institute for Research in Molecular Medicine (INFORMM), Universiti Sains Malaysia, Pulau Pinang, Malaysia","Massachusetts General Hospital and Harvard Medical School, Center for Human Genetic Research and Department of Neurology , Boston , MA , USA","Albert Einstein College of Medicine , Department of Pathology , Bronx , NY , USA"))
I was thinking of using dplyr's mutate function, but I'm not sure how to go about it.
If we assume that the valid text starts from the first uppercase onwards, the following works:
library(tidyverse)
df %>%
  mutate(Affiliation = str_extract(Affiliation, "[:upper:].+"))
Base R regex solution:
df$cleaned_str <- gsub("^\\w+ |^\\*+|^\\. ", "", df$Affiliation)
Tidyverse regex solution:
library(tidyverse)
df %>%
  mutate(Affiliation = str_replace(Affiliation, "^\\w+ |^\\*+|^\\. ", ""))
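For comparison, the "valid text starts at the first uppercase letter" assumption from the first answer can also be sketched in base R. This is only an illustration on shortened strings; the vector names aff and cleaned are made up.

```r
# Shortened versions of the question's strings, plus one already-clean value
aff <- c(
  ". Biotechnology Centre, Serdang, Malaysia",
  "**Institute for Research in Molecular Medicine (INFORMM)",
  "aas Massachusetts General Hospital",
  "ac Albert Einstein College of Medicine",
  "Duke University Medical Center"
)

# Remove everything before the first uppercase letter
cleaned <- sub("^[^[:upper:]]*", "", aff)
cleaned
```

One thing to watch with the "^\\w+ " alternative in the regex answers: applied to an already-clean string it strips the first word, for example sub("^\\w+ ", "", "Duke University") returns "University". The uppercase-based pattern leaves such strings untouched, but it relies on the junk prefix never containing an uppercase letter.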

search and replace multiple patterns in R

I'm attempting to use grepl in nested ifelse statements to simplify a column of data containing researchers' institutional affiliations with the country they belong to, i.e. '1234 University Way, University, Washington, United States' would become 'United States'. The column contains universities in over 100 countries. At first I tried nested ifelse statements with grepl:
H$FAF1 <- ifelse(grepl("Hungary", H$AF1), "Hungary",
ifelse(grepl("United States", H$AF1), "United States", ...
etc., but I realized the limit is 50 for nested ifelse statements. Does anyone know another way to do this? I tried writing a function but am unfortunately not that adept at R yet.
As an alternative to the regex approach by csgroen, where you have to write out the countries manually, you could try the countrycode package, which already includes them and might save you some time. Try:
countrycode::countrycode(sourcevar = "1234 University Way, University, Washington, United States",
                         origin = "country.name",
                         destination = "country.name")
Maybe using str_extract? I've made a little example.
min_ex <- c("1234 University Way, University, Washington, United States",
            "354 A Road, University B, City A, Romania",
            "447 B Street, National C University, City B, China")
library(stringr)
str_extract(min_ex, regex("United States|Romania|China"))
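With over 100 countries, the pattern is easier to maintain if it is built from a vector rather than typed by hand. A base R sketch, assuming you keep your own list of names (countries, addresses, and country are made-up names; the last address is an invented no-match case):

```r
# Hypothetical lookup list; extend it with the countries in your data
countries <- c("United States", "Romania", "China", "Hungary")
pattern <- paste(countries, collapse = "|")

addresses <- c(
  "1234 University Way, University, Washington, United States",
  "354 A Road, University B, City A, Romania",
  "447 B Street, National C University, City B, China",
  "Unknown Institute, Nowhere"
)

# regexpr() finds the first match per string; unmatched rows stay NA
m <- regexpr(pattern, addresses)
country <- rep(NA_character_, length(addresses))
country[m > 0] <- regmatches(addresses, m)
country
```

Because only the first match in each string is kept, an address such as "China Road, ..., United States" would come back as "China", so anchoring the pattern to the end of the string may be safer if the country is always last.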

Transforming kwic objects into single dfm

I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single df.
So far I've tried the following:
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and when both keywords are present in a document even more. I can't think of a way to bring the dfm down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.

Read text file, create new rows by separator

I am trying to import a text file as a data frame with a single column and multiple rows. I want a new row created for every sentence, and then I want to repeat the process so that there is a new row for every word.
Like this.
Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things.
should be read as
V1
[1] Mr
[2] Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape
[3] With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week
[4] While the Republican electorate so far has favored political outsiders like Mr
[5] Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things
Thanks.
We can use strsplit:
strsplit(txt, '[.]\\s*')[[1]]
Data:
txt <- "Mr. Trump has been leading most national polls in the Republican presidential contest, but he is facing a potentially changed landscape. With the Iowa caucuses less than three months away, attention has shifted to national security in the wake of the terrorist attacks in Paris last week. While the Republican electorate so far has favored political outsiders like Mr. Trump and Ben Carson, the concerns over terrorism and the arrival of refugees from Syria into the United States could change things."
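To get from the split vector to the single-column data frames the question asks for, something like the following sketch may work. It uses a shortened txt; the column name V1 follows the question, while sentence_df, word_df, and the file path in the comment are made up.

```r
# Shortened sample; for a real file, one option would be
# txt <- paste(readLines("path/to/file.txt"), collapse = " ")  # hypothetical path
txt <- "Mr. Trump has been leading most national polls. With the Iowa caucuses less than three months away, attention has shifted."

# One row per sentence (splitting on "." also splits "Mr.", as in the question)
sentences <- strsplit(txt, "[.]\\s*")[[1]]
sentence_df <- data.frame(V1 = sentences, stringsAsFactors = FALSE)

# One row per word: split every sentence on runs of whitespace
words <- unlist(strsplit(sentences, "\\s+"))
word_df <- data.frame(V1 = words, stringsAsFactors = FALSE)

sentence_df
head(word_df)
```

Punctuation such as the comma in "away," stays attached to the word here; stripping it would need an extra gsub step.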
