Assign an ID based on keywords present in Tweets - r

I have extracted tweets by feeding in 44 different keywords, and the output is a single file of 400k tweets in total, each of which contains at least one of the relevant keywords. How could I create a separate ID column which contains the keyword present in that tweet?
E.g., the tweet is:
Andhra Pradesh is the highest state with crimes against women
The keyword here is "crimes against women".
I would like to create a column that assigns the keyword "crimes against women" to the tweet, a sort of ID column, to be precise.
# input: column 1
Tweet <- "Andhra Pradesh is the highest state with crimes against women"
# expected output: column 2 beside the Tweet column
Keyword <- "crimes against women"
Edit: I do not want to extract any part of the tweet; I just want to be able to assign to the tweet, in a new column, the keyword it contains, so it will help me segregate the tweets based on this keyword.

You can perform this analysis with the stringr package; however, I don't think you need sapply.
Consider the following keyword list and table of tweets:
library(stringr)

keyword_list <- c("crimes against women", "downloading tweets", "r analysis")
tweets <- data.frame(
  tweet = c("Andhra Pradesh is the highest state with crimes against women",
            "I am downloading tweets",
            "I love r analysis",
            "downloading tweets helps with my r analysis")
)
First, you want to combine your keywords into one regular expression that searches for any of the strings.
keyword_pattern <- paste0(
  "(",
  paste0(keyword_list, collapse = "|"),
  ")"
)
keyword_pattern
#> [1] "(crimes against women|downloading tweets|r analysis)"
Finally, we can add a column to the data frame that extracts the keyword from the tweet.
tweets$keyword <- str_extract(tweets$tweet, keyword_pattern)
tweets
#>                                                            tweet              keyword
#> 1 Andhra Pradesh is the highest state with crimes against women crimes against women
#> 2                                        I am downloading tweets   downloading tweets
#> 3                                              I love r analysis           r analysis
#> 4                   downloading tweets helps with my r analysis   downloading tweets
As the final example illustrates, you need to decide what should happen when a tweet contains multiple keywords. Here, the keyword returned is simply the first one found in the tweet. You can instead use str_extract_all to return all keywords found in the tweet.
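For example, a short sketch of the str_extract_all variant, which adds a list-column holding every match per tweet:
# str_extract_all returns a list: one character vector of matches per tweet
tweets$all_keywords <- str_extract_all(tweets$tweet, keyword_pattern)
tweets$all_keywords[[4]]
#> [1] "downloading tweets" "r analysis"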

We can use stringr, which is very handy for string operations, and simply use str_extract, i.e.
library(stringr)
str_extract(Tweet, Keyword)
# [1] "crimes against women"
For multiple keywords and multiple strings, you can sapply over the tweets, i.e.
Keyword <- c("crimes against women", "something")
Tweet <- c("Andhra Pradesh is the highest state with crimes against women",
           "another string with something else")
sapply(Tweet, function(i) str_extract(i, paste(Keyword, collapse = '|')))
# Andhra Pradesh is the highest state with crimes against women             another string with something else
#                                         "crimes against women"                                    "something"

Related

How to remove the first words of specific rows that appear in another column?

Is there a way to remove the first n words of the column "content" when there are words present in the "keyword" column?
I am working with a data frame similar to this:
keyword <- c("Mr. Jones", "My uncle Sam", "Tom", "", "The librarian")
content <- c("Mr. Jones is drinking coffee", "My uncle Sam is sitting in the kitchen with my uncle Richard", "Tom is playing with Tom's family's dog", "Cassandra is jogging for her first time", "The librarian is jogging with her")
data <- data.frame(keyword, content)
data
In some cases, the first few words of the "keyword" string are contained in the "content" string.
In others, the "keyword" string remains empty and only "content" is filled.
What I want to achieve here is to remove the first appearance of the word combination in "keyword" that appears in the same row in "content".
Unfortunately, I am only able to create code that deletes all of the matching words. But as you can see, some words (like "uncle" or "Tom") appear more than once in a cell.
I'd like to only delete the first appearance and keep all that come after in the same cell.
My next-best solution was to use the following code:
data$content <- mapply(function(x, y) gsub(x, "", y), gsub(" ", "|", data$keyword), data$content)
This code was designed to remove all of the words from "content" that are present in "keyword" of the same row. (It was initially posted here).
Another option that I tried was to design a function for this:
I first created a new variable which counted the number of words that are included in the "keyword" string of the corresponding line:
numw <- lengths(gregexpr("\\S+", data$keyword))
data <- cbind(data, numw)
Second, I tried to formulate a function to remove the first n words of content[i], with n = numw[i]:
shorten <- function(v, z){
  v <- gsub(".*^\\w+", z, v)
}
shorten(data$content, data$numw)
Unfortunately, I am not able to make the function work and the following error message will be generated:
Error in gsub(".*^\w+", z, v) : invalid 'replacement' argument
So, I'd be incredibly grateful if someone could help me formulate a function that deals with the issue more appropriately.
Here is a solution based on str_remove. Because str_remove warns when the pattern is '', the first mutate replaces empty keywords with NA. Then, if keyword is not NA, it is stripped from content; otherwise content is kept as is.
library(tidyverse)

data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    !is.na(keyword) ~ str_remove(content, keyword),
    is.na(keyword)  ~ content
  ))
#>         keyword                                          content
#> 1     Mr. Jones                               is drinking coffee
#> 2  My uncle Sam  is sitting in the kitchen with my uncle Richard
#> 3           Tom               is playing with Tom's family's dog
#> 4          <NA>          Cassandra is jogging for her first time
#> 5 The librarian                              is jogging with her
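Two small caveats, not covered above: str_remove leaves the keyword's surrounding whitespace behind, and a keyword like "Mr. Jones" contains a regex metacharacter (the dot). A sketch handling both, wrapping the keyword in fixed() and trimming the result:
data |>
  mutate(keyword = na_if(keyword, '')) |>
  mutate(content = case_when(
    # treat the keyword as a literal string (fixed) and trim the leftover space
    !is.na(keyword) ~ str_trim(str_remove(content, fixed(keyword))),
    is.na(keyword)  ~ content
  ))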

How do I create two subsets out of a corpus based on multiple keywords?

I am working with a large body of political speeches in quanteda and would like to create two subsets.
The first one should contain the documents that hold one or more of a list of specific keywords (e.g. "migrant*", "migration*", "asylum*"). The second one should contain the documents which do not hold any of these terms (the speeches which do not fall into the first subset).
Any input on this would be greatly appreciated. Thanks!
#first suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern=paste0(regex_pattern), ignore_case = TRUE, collapse="|"), "yes", "no")
Warning messages:
1: In (function (case_insensitive, comments, dotall, dot_all = dotall, :
Unknown option to `stri_opts_regex`.
2: In stringi::stri_detect_regex(corp_labcon, pattern = paste0(regex_pattern), :
longer object length is not a multiple of shorter object length
> table(corp_labcon$criteria)

    no    yes
556921   6139
#Second suggestion
> corp_labcon$criteria <- ifelse(stringi::stri_detect_regex(corp_labcon, pattern = paste0(glob2rx(regex_pattern), collapse = "|")), "yes","no")
> table(corp_labcon$criteria)

    no
563060
You didn't give a reproducible example, but I will show how it can be done with quanteda and the built-in corpus data_corpus_inaugural. You can make use of the docvars that you can attach to your corpus. It is just like adding a variable to a data.frame.
With stringi::stri_detect_regex you check, for each document, whether any of the target words appears in the text; if so, the value in the criteria column is set to "yes", otherwise to "no". After that you can use corpus_subset to create two new corpora based on the criteria values. See the example code below.
library(quanteda)

# words used in regex selection
regex_pattern <- c("migrant*", "migration*", "asylum*")

# add selection to corpus
data_corpus_inaugural$criteria <- ifelse(stringi::stri_detect_regex(data_corpus_inaugural,
                                                                    pattern = paste0(regex_pattern,
                                                                                     collapse = "|")),
                                         "yes", "no")
# Check docvars and new criteria column
head(docvars(data_corpus_inaugural))
  Year  President FirstName                 Party criteria
1 1789 Washington    George                  none      yes
2 1793 Washington    George                  none       no
3 1797      Adams      John            Federalist       no
4 1801  Jefferson    Thomas Democratic-Republican       no
5 1805  Jefferson    Thomas Democratic-Republican       no
6 1809    Madison     James Democratic-Republican       no
# split corpus into segment 1 and 2
segment1 <- corpus_subset(data_corpus_inaugural, criteria == "yes")
segment2 <- corpus_subset(data_corpus_inaugural, criteria == "no")
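As a quick sanity check, the two subsets together should cover the whole corpus; ndoc() reports the number of documents in each:
ndoc(segment1) + ndoc(segment2) == ndoc(data_corpus_inaugural)
#> [1] TRUE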
Not sure how your data is organised, but you could try the function grep(). Imagining that the data is a data frame and each line is a text, you could try:
words <- c("migrant", "migration", "asylum")
pattern <- paste(words, collapse = "|")  # grep()/grepl() take a single pattern, so collapse the words
df[grepl(pattern, df$text), ]   # the lines containing any of the words
df[!grepl(pattern, df$text), ]  # the lines without any of the words
Probably though, your data is not structured like this! You should explain better what your data looks like.

Text column in R - Trying to count the keywords sequentially

I am working on a dataset that has a text column. The text has many sentences separated by a semicolon (;). I am trying to get a word count in a new column of the data frame for words that match my keywords. However, within one sentence, repeated keywords should be counted only once.
For instance -
The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods
Solar panels, Tawian tariffs, trade
Trade issues impacting the solar industry
are the text in one column of the dataframe.
My keywords include solar, solar panels, section 201
I want to count the sentences that match my keywords, but if two or more keywords are in the same sentence, it is counted only once. Counts should only accumulate across different sentences. If a sentence doesn't contain the first keyword, then move on to finding the second keyword.
My output should be -
word_count
2 (as section 201 is mentioned in both sentences, we do not search for solar because the first word in the keyword list matched)
1 (as only solar word is there)
1 (as only solar word is there)
Please suggest a way to resolve this issue. It is a crucial part of my research work. Thanks.
I think each number in your example should be treated as a separate list item, and each list item should be split into separate sentences wherever there is a semicolon. Then look for the first occurrence of a keyword in each list item; that becomes the target keyword for that item. Finally, count the first occurrence only of that target keyword in each sentence within each list item:
library(dplyr)
library(stringr)

# I modified your example sentences to include "section 201" twice in sentence 2 of
# list item #1, to show it is only counted once in that sentence.
# I also modified the order of your keywords, otherwise "solar" would be detected
# first in all list items.
sentences <- list("The section 201 solar trade case on cells and modules; Issues relating to section 201 section 201 tariffs on imported goods",
                  "Solar panels, Tawian tariffs, trade",
                  "Trade issues impacting the solar industry")
keywords <- c("section 201", "solar", "solar panels")

# Go through each list item
lapply(sentences, function(sentence){
  # Split the list item into separate sentences at ";".
  # Also lowercase it, otherwise solar != Solar
  split_string <- str_split(tolower(sentence), ";")[[1]]
  # For each keyword, detect whether it occurs anywhere in the list item
  output <- sapply(keywords, function(keyword) any(str_detect(split_string, keyword)))
  # The first keyword that occurs becomes the target for this list item
  target <- keywords[output][1]
  # Count the sentences containing the target (str_detect is TRUE at most once
  # per sentence), and name the result by the target keyword
  setNames(sum(str_detect(split_string, target)), target)
})
This gives a list with, for each list item, the first occurring keyword and the number of sentences in which it occurs:
[[1]]
section 201
          2

[[2]]
solar
    1

[[3]]
solar
    1
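If a plain named vector is more convenient than a list, store the lapply() result (say in res, an assumed name) and flatten it with unlist():
# res is the lapply() result from above (assumed variable name)
unlist(res)
#> section 201       solar       solar
#>           2           1           1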
I would first split your text column into multiple columns using tidyr::separate(); then search for your key phrases within each of these, using stringr::str_detect() or base regex functions; then sum across columns using rowSums(dplyr::across()). Like so:
library(tidyverse)
keywords <- paste("solar", "section 201", sep = "|")
df <- tibble(
  text = c(
    "The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
    "Solar panels; Tawian tariffs; trade",
    "Trade issues impacting the solar industry"
  )
)
df <- df %>%
  separate(text, into = c("text1", "text2", "text3"), sep = ";", fill = "right") %>%
  mutate(
    count = rowSums(
      across(text1:text3, ~ str_detect(str_to_lower(.x), keywords)),
      na.rm = TRUE
    )
  )
The count column then contains the results you predicted:
df %>%
  select(count)
# A tibble: 3 x 1
#   count
#   <dbl>
# 1     2
# 2     1
# 3     1
If the list of keywords is not too long (say 10 or 20 words), then you can look at the count of all the keywords for each text string. I am adding ; at the end of each text string so that a sentence always ends with a ;. The pattern paste0("[^;]*", key, "[^;]*;") identifies any sentence containing the word (stored in) key.
txt <- c("The section 201 solar trade case on cells and modules; Issues relating to section 201 tariffs on imported goods",
"Solar panels, Tawian tariffs, trade",
"Trade issues impacting the solar industry")
keys <- c("section 201", "solar panels", "solar")
counts <- sapply(keys, function(key) stringr::str_count(paste0(txt, ";"), regex(paste0("[^;]*", key, "[^;]*;"), ignore_case = T)))
Next you can go over each row of counts and look at the first non-zero element which should be the value you are looking for.
sapply(1:nrow(counts), function(i) {
  a <- counts[i, ]
  a[a != 0][1]
})
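With the three example texts, this should yield, for each row, the first non-zero count, named by the keyword that produced it, roughly:
#> section 201 solar panels        solar
#>           2            1            1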

R xml2 : How to query only corresponding xml nodes

I'm trying to read and transform many XML files into R data frames (or preferably tibbles).
All R packages I've tried (XML, flatxml, xmlconvert) unfortunately failed when I tried to convert the files using their built-in functions (e.g. xmlToDataFrame from the XML package and xml_to_df from the xmlconvert package), so I have to do it manually with xml2.
Here is my question with a small working example:
# Minimal Working Example
library(tidyverse)
library(xml2)

interimxml <- read_xml("<Subdivision>
                          <Name>Charles</Name>
                          <Salary>100</Salary>
                          <Name>Laura</Name>
                          <Name>Steve</Name>
                          <Salary>200</Salary>
                        </Subdivision>")
names <- xml_text(xml_find_all(interimxml, "//Subdivision/Name"))
salary <- xml_text(xml_find_all(interimxml, "//Subdivision/Salary"))
names
salary

# combine into a tibble (doesn't work because of unequal vector lengths)
result <- tibble(names = names,
                 salary = salary)
result
rbind(names, salary)
From the (made up) XML file you can see that Charles earns 100 dollars, Laura earns nothing (because of the missing entry; here is the problem) and Steve earns 200 dollars.
What I want xml2 to do, when querying the name and salary nodes, is to return an NA (or zero, which would also be okay) when it finds a name but no corresponding salary entry, so that I end up with a nice table like this:
Name     Salary
Charles  100
Laura    NA
Steve    200
I know that I could modify the "xpath" to only pick up the last value (for Steve), which wouldn't help me, since (in the real data) it could also be the 100th or the 23rd person with missing salary information.
[I'm aware that the salary numbers are pulled as character values from the xml file. I would mutate(across(salary, as.double)) over the columns afterwards.]
Any help is highly appreciated. Thank you very much in advance.
You need to be a bit more careful to match up the names and salaries. Basically first find all the <Name> nodes, then check only if their next sibling is a <Salary> node. If not, then return NA.
nameNodes <- xml_find_all(interimxml, "//Subdivision/Name")
names <- xml_text(nameNodes)
# for each <Name> node, take the text of its immediately following sibling,
# but only if that sibling is a <Salary> node; otherwise NA
salary <- map_chr(nameNodes, ~ xml_text(xml_find_first(., "./following-sibling::*[1][self::Salary]")))
tibble::tibble(names, salary)
#   names   salary
#   <chr>   <chr>
# 1 Charles 100
# 2 Laura   NA
# 3 Steve   200
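As noted in the question, Salary arrives as character; a short follow-up converts it to numeric (a sketch using the same tibble):
# convert salary from character to numeric afterwards
result <- tibble::tibble(names, salary) %>%
  dplyr::mutate(salary = as.numeric(salary))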

Scraping PDF tables based on title

I am trying to extract one table each from 31 pdfs. The titles of the tables all start the same way but the end varies by region.
For one document the title is "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Arusha Region, 2012 Census". Another would be "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year; Dodoma Region, 2012 Census."
I used tabulizer to scrape the first table manually based on the specific text lines I need, but given the similar naming conventions, I was hoping to automate this process.
PATH2 <- "Regions/02. Arusha Regional Profile.pdf"

txt2 <- pdf_text(PATH2) %>%
  readr::read_lines()

specific_lines2 <- txt2[4621:4639] %>%
  str_squish() %>%
  str_replace_all(",", "") %>%
  strsplit(split = " ")
What: you can find the page containing the common part of the title in each file and extract the data from there (if there is only one occurrence of the title per file).
How: build a function that gets the table from one pdf, then use lapply to run the function over all the pdfs.
Example:
First, define the function that finds a page including the title and extracts the table from there.
get_page_text <- function(url, word_find) {
  txt <- pdftools::pdf_text(url)
  p <- grep(word_find, txt, ignore.case = TRUE)[1]  # first page whose text contains the title
  L <- tabulizer::extract_tables(url, pages = p)    # all tables found on that page
  i <- which.max(lengths(L))                        # keep the largest table
  data.frame(L[[i]])
}
Second, get file names.
setwd("C:/Users/xyz/Regions")
files <- list.files(pattern = "pdf$|PDF$") # Get file names on the folder Regions.
Then, the "loop" (lapply) to run the function for each pdf.
reports <- lapply(files,
                  get_page_text,
                  word_find = "Table 13.1: Total Number of Households Engaged in Agriculture by District, Rural and Urban Residence During 2011/12 Agriculture Year")
The result is a list with one data.frame per pdf. What comes next is cleaning up your data.
The function may vary a lot depending on the patterns in your pdfs. Finding the page by its title was effective for me; you will find what fits best for you.
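One small convenience, assuming you want the results keyed by source file: name the list elements after the pdfs so each region's table is easy to look up.
# label each extracted table with the pdf it came from
names(reports) <- files
# e.g. (hypothetical file name, following the pattern in the question):
reports[["02. Arusha Regional Profile.pdf"]]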
