Extract characters from a string by a succession of colons - r

I am trying to pull some information out of a variable in a data frame. I am using R 3.3.3.
The information is formatted as follows:
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I would like to break down each section into a separate variable like so:
w = "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
x = "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
y = "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
z = "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
I am having some difficulty trying to extract this information. SO questions such as this and this have been very helpful. From these, I gathered that some form of stringr/gsub can be used to pull this information, but I can't figure out how to specify the ranges within a gsub statement.
I have been able to work out how to pull the first portion:
>test4 <- gsub("(.*{1})(:.*)","\\1", t)
which gives
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC"
My overall question is:
It would be nice if I did not have to clean up the "DOMINICAN REPUBLIC" part from the end of the string.
In summary:
1. How do you extract characters from a string by a succession of colons? (1st to 2nd colon, 2nd to 3rd, etc.)
2. Is there a way to keep the words in front of the colon as well?
Any information or guidance would be greatly appreciated.

You can use strsplit with an appropriate regex:
strsplit(t, "\\.\\s(?=[\\w\\s]+:)", perl=TRUE)
or
stringr::str_split(t, "\\.\\s(?=[\\w\\s]+:)")
Notes:
\\.\\s matches a literal dot and a space.
(?=[\\w\\s]+:) is a positive lookahead that matches one or more word characters or spaces followed by a colon.
\\.\\s(?=[\\w\\s]+:) thus matches a dot and a space only if it is immediately followed by either a word character or a space one or more times and a colon. This would be the end of each paragraph.
Since I am using the regex within strsplit, I am splitting by whatever is matched by the regex. This results in splitting by the end of each paragraph.
perl=TRUE is needed in strsplit to enable lookaheads/lookbehinds; stringr's regex engine supports them by default.
Result:
[[1]]
[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion"
[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean"
[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south"
[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
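If the trailing period should stay attached to each piece (as in the desired output in the question), one option is to split only on the whitespace and check for the dot with a lookbehind. This is a minimal sketch under that assumption, using a shortened, hypothetical sample string `s` in the same layout as `t`; the names `parts` and `sections` are illustrative only:

```r
# Shortened, hypothetical sample in the same "NAME: text." layout as the question
s <- "China: Big country. Pop 1.4 billion. GUAM: An island. DOMINCAN REPUBLIC: On Hispaniola."

# Split on the whitespace that follows a dot and precedes "NAME:";
# the dot sits in a lookbehind, so it is not consumed by the split
parts <- strsplit(s, "(?<=\\.)\\s(?=[\\w\\s]+:)", perl = TRUE)[[1]]

# Name each piece after the label in front of its first colon
sections <- setNames(parts, sub(":.*$", "", parts))
print(sections)
```

With a named vector, sections[["GUAM"]] returns the GUAM entry, which may be more convenient than creating separate variables such as w, x, y and z.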

How about the following in base R?
# Your sample string
t <- "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion. GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean. MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south. DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region.";
# Get the start position of each regexp match (computed once and reused)
idx <- unlist(gregexpr(pattern = "([A-Z]*\\s*[A-Z]+:|\\w+:)", t))
# Store each start position with the distance to the next match
matches <- data.frame(
  idx = idx,
  len = c(diff(idx), nchar(t))
)
# Get substrings based on positions and store them in a character vector
lst <- apply(matches, 1, function(x) {
  trimws(substr(t, x[1], sum(x) - 1))
})
lst
#[1] "China: Officially the People's Republic of China (PRC), is a unitary sovereign state in East Asia and the world's most populous country, with a population of around 1.404 billion."
#[2] "GUAM: Is an unincorporated and organized territory of the United States in Micronesia in the western Pacific Ocean."
#[3] "MICRONESIA: Is a subregion of Oceania, comprising thousands of small islands in the western Pacific Ocean. It has a shared cultural history with two other island regions, Polynesia to the east and Melanesia to the south."
#[4] "DOMINCAN REPUBLIC: Is a country located on the island of Hispaniola, in the Greater Antilles archipelago of the Caribbean region."
Note: Regexp-matching countries is a bit awkward because your example contains all caps multi-word countries (DOMINCAN REPUBLIC), all caps single-word countries (e.g. GUAM), and "first-letter-caps" countries (China).
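The position-based idea can also be written with the vectorised substring(), which avoids the data frame and apply(). This is only a sketch on a shortened, hypothetical string `s`; the label pattern (letters and single spaces followed by a colon) is an assumption that fits this sample and would need adjusting for other label formats:

```r
# Shortened, hypothetical sample in the same "NAME: text." layout as the question
s <- "China: Big country. GUAM: An island. DOMINCAN REPUBLIC: On Hispaniola."

# Start position of every label: words separated by single spaces, then a colon
starts <- as.integer(gregexpr("[A-Za-z]+( [A-Za-z]+)*:", s)[[1]])

# Each piece ends right before the next label starts; the last one ends with the string
ends <- c(starts[-1] - 1, nchar(s))

# substring() is vectorised over start and end positions
pieces <- trimws(substring(s, starts, ends))
print(pieces)
```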

Related

extracting strings from a dataframe row containing multiple entries?

I have a messy CSV dataset in which several (but not all) rows unfortunately contain multiple entries. For each row, I'd like to separate each entry out so that I can create a list of the unique values (in this case, a list of specific clinical trial sites). The multiple entries are separated by "|". To make life even more fun, I'd like to exclude any entry that isn't from the US.
I'm just having a very tough time conceptualizing how to even start. If each line only had one value, I think I could work through it with base R. Maybe something from the tidyverse, such as separate or separate_rows, or a regex to extract everything bounded by "|"?
Example data:
    Locations
1   University of Pennsylvania, Philadelphia, Pennsylvania, United States | University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States | Houston Methodist Cancer Center, Houston, Texas, United States
2   Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States | Moffitt Cancer Center, Tampa, Florida, United States | Biomira Inc. Edmonton, Alberta, Canada
3   Massachusetts General Hospital, Boston, Massachusetts, United States
4   Moffitt Cancer Center, Tampa, Florida, United States | Sunnybrook Health Sciences Centre Toronto, Ontario, Canada
5   Memorial Sloan Kettering Cancer Center, New York, New York, United States
6   Duke University Medical Center, Durham, North Carolina, United States
7   Moffitt Cancer Center, Tampa, Florida, United States
8   Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre Calgary, Alberta, Canada
9   Houston Methodist Cancer Center, Houston, Texas, United States
10  University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Desired output:
University of Pennsylvania, Philadelphia, Pennsylvania, United States
University of Texas Southwestern Medical Center - Dallas, Dallas, Texas, United States
Houston Methodist Cancer Center, Houston, Texas, United States
Hem-Onc Associates of the Treasure Coast, Port Saint Lucie, Florida, United States
Moffitt Cancer Center, Tampa, Florida, United States
(etc etc etc)
Duh, turned out to be almost trivial.
df %>%
  tidyr::separate_rows(Locations, sep = "\\|", convert = TRUE)
Tricky thing was escaping out the "|" symbol!
library(stringr)
z <- c("Thing one | another thing | yet another", "different thing | other thing")
z
#> [1] "Thing one | another thing | yet another"
#> [2] "different thing | other thing"
str_split(z, "\\|") %>%
unlist
#> [1] "Thing one " " another thing " " yet another" "different thing "
#> [5] " other thing"
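Neither snippet above trims the leftover spaces, drops the non-US entries, or removes duplicates, which the question also asked for. A base R sketch of those remaining steps, assuming every US entry ends in "United States"; the vector `locations` is a shortened, hypothetical stand-in for the Locations column:

```r
# Shortened, hypothetical stand-in for the Locations column
locations <- c(
  "University of Pennsylvania, Philadelphia, Pennsylvania, United States | Houston Methodist Cancer Center, Houston, Texas, United States",
  "Moffitt Cancer Center, Tampa, Florida, United States | Tom Baker Cancer Centre Calgary, Alberta, Canada",
  "Moffitt Cancer Center, Tampa, Florida, United States"
)

# fixed = TRUE treats "|" literally, so no escaping is needed
sites <- trimws(unlist(strsplit(locations, "|", fixed = TRUE)))

# Keep entries that end in "United States", then drop duplicates
us_sites <- unique(sites[grepl("United States$", sites)])
print(us_sites)
```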

R data from sf package is missing data from small island states

I am creating a map with R that should include all SADC economies. This map should be coloured in a later step according to an additional data set that I want to merge with the map. At the moment I have been using the sf package to map the SADC economies.
These include the following 16 Member States: Angola, Botswana, Comoros, Democratic Republic of Congo, Eswatini, Lesotho, Madagascar, Malawi, Mauritius, Mozambique, Namibia, Seychelles, South Africa, United Republic of Tanzania, Zambia and Zimbabwe.
While selecting the countries for my map, I could not find data for the three island states: Comoros, Mauritius & Seychelles.
Is there any opportunity to manually add the geom (MULTIPOLYGON) data, and if so, where do I find the information?
Alternatively: is there an alternative package, which includes all SADC country coordinates with which I could plot the map?
I have not found the missing data in the iso_a2 column (containing all iso2 codes), in the name_long column (containing all names), or when filtering for all countries on the continent Africa in the continent column.
Here is my sample code:
# load packages
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(sf) # classes and functions for vector geographic data
library(spData) # assumption: the `world` dataset used below comes from spData
# show African countries
Africa <- world %>%
filter(continent == "Africa")
View(Africa) # find all SADC economies with the right name
# problem: missing: Comoros, Mauritius & Seychelles
# create SADC vector according to country names in dataset
SADCvector2 <- SADCvector <- c("Angola","Botswana", "Democratic Republic of the Congo", "eSwatini", "Lesotho", "Madagascar", "Malawi",
"Mozambique","Namibia", "Seychelles", "South Africa", "Tanzania","Zambia", "Zimbabwe")
# select SADC countries
SADC1 <- world %>%
filter(name_long %in% SADCvector2) %>%
#select only variables of interest
select(name_long, geom)
plot(SADC1)
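A quick way to list exactly which names fail to match is setdiff(). This is only a sketch: `available_names` is a hypothetical stand-in for world$name_long with a handful of values, and `sadc_names` is a shortened version of the SADC vector:

```r
# Hypothetical stand-in for world$name_long (a few names only, for illustration)
available_names <- c("Angola", "Botswana", "Madagascar", "South Africa", "Zambia")

# Shortened version of the SADC country vector
sadc_names <- c("Angola", "Botswana", "Comoros", "Madagascar",
                "Mauritius", "Seychelles", "South Africa", "Zambia")

# Names requested but not present in the dataset
missing_names <- setdiff(sadc_names, available_names)
print(missing_names)
```

Run against the real name_long column, the same call also catches spelling differences such as "eSwatini" versus "Eswatini".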

I need to find a way to count the number of repetition of value x where another column value is y

I am using the Twitter sentiment analysis dataset for airline flights. It has a column called negative result and another column called airline name. I need to count the repetitions of the value "Bad Flight" in the negative result column where the airline name is "Virgin America", repeat this for "Late Flight" and "Virgin America", then compare the two counts and use the bigger one for plotting.
For example:
Negative Result    Airline Name
Bad Flight         Virgin America
Bad Flight         Virgin America
Bad Flight         Virgin America
Late Flight        Virgin America
Late Flight        Virgin America
Bad Flight         United
Damaged Luggage    United
Bad Flight         United
Late Flight        United
Late Flight        United
Bad Flight         Virgin America
Bad Flight         Virgin America
Late Flight        Virgin America
The expected output is 5 for bad flight and 3 for late flight, so after comparing, bad flight is the value to be plotted.
If your dataframe is called df you can just do table(df).
Using dplyr:
library(dplyr)
df %>%
filter(`Airline Name` == "Virgin America") %>%
group_by(`Negative Result`) %>%
summarize(n = n())
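For the comparison step, a base R sketch that rebuilds the example table, counts per category for one airline, and picks the most frequent one. The column names `negative_result` and `airline_name` are assumptions (syntactic versions of the headers in the question):

```r
# Example data from the question, with assumed syntactic column names
df <- data.frame(
  negative_result = c("Bad Flight", "Bad Flight", "Bad Flight", "Late Flight",
                      "Late Flight", "Bad Flight", "Damaged Luggage", "Bad Flight",
                      "Late Flight", "Late Flight", "Bad Flight", "Bad Flight",
                      "Late Flight"),
  airline_name = c(rep("Virgin America", 5), rep("United", 5), rep("Virgin America", 3)),
  stringsAsFactors = FALSE
)

# Count each negative result for Virgin America only
va <- df[df$airline_name == "Virgin America", ]
counts <- table(va$negative_result)

# Name of the category with the largest count
top <- names(counts)[which.max(counts)]
print(counts)
print(top)
```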

Transforming kwic objects into single dfm

I have a corpus of newspaper articles of which only specific parts are of interest for my research. I'm not happy with the results I get from classifying texts along different frames because the data contains too much noise. I therefore want to extract only the relevant parts from the documents. I was thinking of doing so by transforming several kwic objects generated by the quanteda package into a single dfm.
So far I've tried the following
exampletext <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
mycorpus <- corpus(exampletext)
mycorpus.nat <- corpus(kwic(mycorpus, "nationalit*", window = 5, valuetype = "glob"))
mycorpus.cit <- corpus(kwic(mycorpus, "citizenship", window = 5, valuetype = "glob"))
mycorpus.kwic <- mycorpus.nat + mycorpus.cit
mydfm <- dfm(mycorpus.kwic)
This, however, generates a dfm that contains 4 documents instead of 2, and when both keywords are present in a document even more. I can't think of a way to bring the dfm down to the original number of documents.
Thank you for helping me out.
We recently added a window argument to tokens_select() for this purpose:
require(quanteda)
txt <- c("The only reason for (the haste) which we can discern is the prospect of an Olympic medal, which is the raison d'etat of the banana republic,'' The Guardian said in an editorial under the headline ''Whatever Zola Wants. . .'' The Government made it clear it had acted promptly on the application to insure that the 5-foot-2-inch track star could qualify for the British Olympic team. The International Olympic Organization has a rule that says athletes who change their nationality must wait three years before competing for that country - a rule, however, that is often waived by the I.O.C. The British Olympic Association said it consulted with the I.O.C. before asserting Miss Budd's eligibility for the British team. ''Since Zola is now here and has a British passport she should be made to feel welcome and accepted by other British athletes,'' said Paul Dickenson, chairman of the International Athletes Club, an organization that raises money for amateur athletes and looks after their political interests. ''The thing we objected to was the way she got into the country by the Government and the Daily Mail and the commercialization exploitation associated with it.", "That left 14 countries that have joined the Soviet-led withdrawal. Albania and Iran had announced that they would not compete and did not send written notification. Bolivia, citing financial trouble, announced Sunday it would not participate.The 1972 Munich Games had the previous high number of competing countries, 122.No Protest Planned on Zola Budd YAOUNDE, Cameroon, June 4 (AP) - African countries do not plan to boycott the Los Angeles Olympics in protest of the inclusion of Zola Budd, the South African-born track star, on the British team, according to Lamine Ba, the secretary-general of the Supreme Council for Sport in Africa. 
Because South Africa is banned from participation in the Olympics, Miss Budd, whose father is of British descent, moved to Britain in March and was granted British citizenship.75 Olympians to Train in Atlanta ATLANTA, June 4 (AP) - About 75 Olympic athletes from six African countries and Pakistan will participate in a month-long training camp this summer in Atlanta under a program financed largely by a grant from the United States Information Agency, Anne Bassarab, a member of Mayor Andrew Young's staff, said today. The athletes, from Mozambique, Tanzania, Zambia, Zimbabwe, Uganda, Somalia and Pakistan, will arrive here June 24.")
toks <- tokens(txt)
mt_nat <- dfm(tokens_select(toks, "nationalit*", window = 5))
mt_cit <- dfm(tokens_select(toks, "citizenship*", window = 5))
Please make sure that you are using the latest version of quanteda.
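To end up with one dfm rather than one per keyword, both patterns can be passed to tokens_select() in a single call. A minimal sketch, assuming a quanteda version that has the window argument; the two documents are short, hypothetical stand-ins for the articles:

```r
library(quanteda)

# Two short, hypothetical documents standing in for the newspaper articles
txt <- c(
  d1 = "Athletes who change their nationality must wait three years before competing.",
  d2 = "She moved to Britain in March and was granted British citizenship soon after."
)

toks <- tokens(txt)

# Both keywords in one call: keep each match plus 5 tokens on either side
toks_kw <- tokens_select(toks, pattern = c("nationalit*", "citizenship"), window = 5)

# One dfm with one row per original document
mt <- dfm(toks_kw)
print(ndoc(mt))
```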

Creating Tidy Text

I am using R for text analysis. I used the 'readtext' function to pull in text from a PDF. However, as you can imagine, it is pretty messy. I used 'gsub' to replace text for different purposes. The general goal is to use one type of delimiter, '%%%%%', to split records into rows, and another delimiter, '#', to split them into columns. I accomplished the first but am at a loss as to how to accomplish the latter. A sample of the data found in the data frame is as follows:
895 "The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […"
896 "Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"
I want to take this data and split the #Published, #Authors, #Journal, #URL into columns -- c("Published", "Authors", "Journal", "URL").
Any suggestions?
Thanks in advance!
This seems to work OK:
dfr <- data.frame(TEXT=c("The ambulatory case-mix development project\n#Published:: June 6, 1994#Authors: Baker A, Honigfeld S, Lieberman R, Tucker AM, Weiner JP#Country: United States #Journal:Project final report. Baltimore, MD, USA: Johns Hopkins University and Aetna Health Plans. Johns Hopkins\nUniversity and Aetna Health Plans, USA As the US […",
"Ambulatory Care Groups: an evaluation for military health care use#Published:: June 6, 1994#Authors: Bolling DR, Georgoulakis JM, Guillen AC#Country: United States #Journal:Fort Sam Houston, TX, USA: United States Army Center for Healthcare Education and Studies, publication #HR 94-\n004. United States Army Center for Healthcare Education and […]#URL: http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA27804"),
stringsAsFactors = FALSE)
library(magrittr)
do.call(rbind, strsplit(dfr$TEXT, "#Published::|#Authors:|#Country:|#Journal:")) %>%
as.data.frame %>%
setNames(nm = c("Preamble","Published","Authors","Country","Journal"))
Basically, this splits the text on any of the four field tags (note the double :: after Published!), row-binds the result, converts it to a data frame, and assigns column names.
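The split above leaves #URL inside the Journal column and relies on every record having the same tags. An alternative sketch pulls each field out by name, so a missing tag simply becomes NA. `extract_field` is a hypothetical helper, and the two records are shortened, made-up strings in the same tagged layout; it assumes a tag is always "#", letters, then a colon:

```r
# Two shortened, hypothetical records in the same tagged layout as the sample data
x <- c(
  "Title A\n#Published:: June 6, 1994#Authors: Baker A#Country: United States #Journal:Final report #HR 94",
  "Title B#Published:: June 6, 1994#Authors: Bolling DR#Country: United States #Journal:Army report#URL: http://example.org/b"
)

# Hypothetical helper: text after "#field:" up to the next "#Name:" tag or the end.
# (?s) lets "." match newlines; ":+" tolerates the double colon after Published.
extract_field <- function(x, field) {
  pattern <- paste0("(?s)#", field, ":+\\s*(.*?)\\s*(?=#[A-Za-z]+:|$)")
  hits <- regmatches(x, regexec(pattern, x, perl = TRUE))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_, character(1))
}

fields <- c("Published", "Authors", "Country", "Journal", "URL")
out <- as.data.frame(lapply(setNames(fields, fields), function(f) extract_field(x, f)),
                     stringsAsFactors = FALSE)
print(out)
```

Because "#HR 94" has no colon after the letters, it is not treated as a tag and stays inside the Journal text.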
