R: Counting frequency of words in a character column

I'm trying to count the number of times that some pre-specified words appear in a character column, Post.
This is what my dataset looks like:
Now, I want to count all green/sustainable words in each of the posts and add this number as an extra column.
I have manually created a lexicon where all green words have Polarity == 1 and non-green words have Polarity == 0.
How can I do this?

str_count() from stringr can help with this (and with a lot more string-based tasks, see this R4DS chapter).
library(stringr)
# Create a reproducible example
dat <- data.frame(Post = c(
  "This is a sample post without any target words",
  "Whilst this is green!",
  "And this is eco-friendly",
  "This is green AND eco-friendly!"))
lexicon <- data.frame(Word = c("green", "eco-friendly", "neutral"),
                      Polarity = c(1, 1, 0))
# Extract relevant words from lexicon
green_words <- lexicon$Word[lexicon$Polarity == 1]
# Create new variable
dat$n_green_words <- str_count(dat$Post, paste(green_words, collapse = "|"))
dat
Output:
#>                                             Post n_green_words
#> 1 This is a sample post without any target words             0
#> 2                          Whilst this is green!             1
#> 3                       And this is eco-friendly             1
#> 4                This is green AND eco-friendly!             2
Created on 2022-07-15 by the reprex package (v2.0.1)
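One caveat: a bare alternation like "green|eco-friendly" also counts matches inside longer words (e.g. "greenhouse" would be counted as "green"). If you want whole-word matches only, one option (a small variation on the code above, not part of the original answer) is to wrap the pattern in word boundaries:
# Variation: count whole-word matches only, using \b word boundaries
dat$n_green_words <- str_count(dat$Post,
                               paste0("\\b(", paste(green_words, collapse = "|"), ")\\b"))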

Related

Remove non-unique string components from a column in R

example <- data.frame(
  file_name = c("some_file_name_first_2020.csv",
                "some_file_name_second_and_third_2020.csv",
                "some_file_name_4_2020_update.csv"),
  a = 1:3
)
example
#>                                  file_name a
#> 1             some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3          some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often; the unique identifier is usually in the middle, and there can sometimes be suffixed information that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it to build a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
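If you want to write the cleaned names back into the data frame, the same sapply() call can simply be assigned to the column:
# Assign the cleaned names back to the original column
example$file_name <- sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))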

I subsetted a list of words from a larger list of 72 items. How do I determine what list number (1-72) those words came from?

I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists (each containing 10 sentences), and I saved them in an appropriate structure. I did those steps in R; the code is immediately below.
#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
  html_nodes("li") %>%
  html_text()
headers <- read_html(url) %>%
  html_nodes("h2") %>%
  html_text()
#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1
for (sentence in sentences) {
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if (length(sentenceList) == 10) { # once we have 10 sentences...
    harvardList[[headers[n]]] <- sentenceList # ...store them under the header of the list they came from
    sentenceList <- list() # empty the temporary list that held those 10 sentences
    n <- n + 1 # advance to the next list name
  }
}
#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
Then, in Python, I computed a list of all the words ending in "ing" and their frequency, i.e., how many times they appeared across all 72 lists.
import os
import pandas as pd

path = "/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
print(df)
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)  # regex=False so "." is removed literally
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)
ingWordCountDictionary = {}
for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1
print(ingWordCountDictionary)
f = open("ingWordCountDictionary.txt", "w")
for key, value in ingWordCountDictionary.items():
    keyValuePairToWrite = "%s, %s\n" % (key, value)
    f.write(keyValuePairToWrite)
f.close()
Now, I am being asked to create a dataset which shows which list (1 to 72) each "ing" word came from. THIS IS WHAT I DON'T KNOW HOW TO DO. I obviously know they are a subset of the 72 lists, but how do I figure out which list those words came from?
The expected output should look something like this:
[List Number] [-ing Word]
List 1 swing, ring, etc.
List 2 moving
and so on
Here is one way for you. Judging from the expected result, you seem to want verbs in progressive form (V-ing). (I do not understand why you have "king" in your result; if "king" counts, then "spring" should count here as well, for example.) If you need to consider lexical classes, I think you want the koRpus package; if not, you can use the textstem package, for example.
First, I scraped the link and created a data frame. Then I split the sentences into words using unnest_tokens() from the tidytext package and subsetted the words ending in "ing". Then I used treetag() from the koRpus package; you need to install TreeTagger yourself before you use the package. Finally, I counted how many times these progressive-form verbs appear in the data set. I hope this helps.
library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("h2") %>%
html_text() -> so_list
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("li") %>%
html_text() -> so_text
# Create a data frame
sodf <- tibble(list_name = rep(so_list, each = 10),
text = so_text)
# Split senteces into words and get words ending with ING.
unnest_tokens(sodf, input = text, output = word) %>%
filter(grepl(x = word, pattern = "ing$")) -> sowords
# Use the koRpus package to lemmatize the words in sowords$word.
treetag(sowords$word, treetagger = "manual", format = "obj",
        TT.tknz = FALSE, lang = "en", encoding = "UTF-8",
        TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
                          preset = "en")) -> out
# Access the result data frame and filter the words. It seems that you are
# looking for verbs, so I did that here.
filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>%
  count(token)
# A tibble: 16 x 2
#    token        n
#    <chr>    <int>
# 1 adding       1
# 2 bring        4
# 3 changing     1
# 4 drenching    1
# 5 dying        1
# 6 lodging      1
# 7 making       1
# 8 raging       1
# 9 shipping     1
#10 sing         1
#11 sleeping     2
#12 wading       1
#13 waiting      1
#14 wearing      1
#15 winding      2
#16 working      1
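If you only need the list-to-word mapping from the expected output (and not lemmatization), a grouped summary of sowords is already enough; a minimal sketch using the objects created above:
# Sketch: which -ing words came from which list (no TreeTagger required)
sowords %>%
  group_by(list_name) %>%
  summarise(ing_words = paste(unique(word), collapse = ", "))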
How did you store the data from the lists (i.e., what does your data.frame look like)? Could you provide an example?
Without seeing this, I suggest you save the data in a list as follows:
COLUMN 1 , COLUMN 2, COLUMN 3
"List number", "Sentence", "-ING words (as vector)"
I hope this makes sense; let me know if you need more help. Unfortunately, I wasn't able to comment on this post.
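As a sketch of that structure (values are placeholders, not real data), a tibble with a list-column would work:
library(tibble)
# Placeholder values only: one row per list, -ing words kept in a list-column
suggested <- tibble(list_number = c(1, 2),
                    sentence = c("<sentences of list 1>", "<sentences of list 2>"),
                    ing_words = list(c("swing", "ring"), "moving"))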

Efficiently transform XML to data frame

I need to transform some vanilla XML into a data frame. The XML is a simple representation of rectangular data (see the example below). I can achieve this pretty straightforwardly in R with xml2 and a couple of for loops. However, I'm sure there is a much better/faster way (purrr?). The XML files I will ultimately be working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.
library(tidyverse)
library(xml2)
demo_xml <-
  "<DEMO>
    <EPISODE>
      <item1>A</item1>
      <item2>1</item2>
    </EPISODE>
    <EPISODE>
      <item1>B</item1>
      <item2>2</item2>
    </EPISODE>
  </DEMO>"
dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
df <- data.frame()
for (i in seq_along(episodes)) {
  for (j in seq_along(dx_names)) {
    df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
  }
}
names(df) <- dx_names
df
#>   item1 item2
#> 1     A     1
#> 2     B     2
Created on 2019-09-19 by the reprex package (v0.3.0)
Thank you in advance.
This is a general solution which handles a varying number of different sub-nodes for each parent node; each EPISODE node may have different sub-nodes.
The strategy is to parse the children nodes, identifying the name and value of each sub-node, convert that into a long-format data frame, and then reshape it into your desired wide format:
library(tidyr)
library(xml2)
demo_xml <-
  "<DEMO>
    <EPISODE>
      <item1>A</item1>
      <item2>1</item2>
    </EPISODE>
    <EPISODE>
      <item1>B</item1>
      <item2>2</item2>
    </EPISODE>
  </DEMO>"
dx <- read_xml(demo_xml)
# Find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
# Extract the node names and values from all of the episodes
nodenames <- xml_name(xml_children(episodes))
contents <- trimws(xml_text(xml_children(episodes)))
# Identify the number of sub-nodes under each episode, for labeling
IDlist <- rep(1:length(episodes), xml_length(episodes))
# Make a long data frame
df <- data.frame(episodes = IDlist, nodenames, contents, stringsAsFactors = FALSE)
# Make the data frame wide; remove unused blank nodes:
answer <- spread(df[df$contents != "", ], nodenames, contents)
# tidyr 1.0.0 version
# answer <- pivot_wider(df, names_from = nodenames, values_from = contents)
# A tibble: 2 x 3
  episodes item1 item2
     <int> <chr> <chr>
1        1 A     1
2        2 B     2
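Since the question explicitly mentions purrr, here is a sketch of the same idea with map_dfr(), which builds one named list per EPISODE and row-binds them (all values kept as character, as above; this is an alternative sketch, not part of the original answer):
library(purrr)
# One row per EPISODE: name each child's text by its tag, then bind the rows
episodes <- xml_find_all(dx, xpath = "//EPISODE")
map_dfr(episodes, function(ep) {
  kids <- xml_children(ep)
  as.list(set_names(xml_text(kids), xml_name(kids)))
})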
This may be an option without using a for loop:
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
# You can get all values between the tags with xml_text()
values <- xml_children(episodes) %>% xml_text()
as.data.frame(matrix(values,
                     ncol = length(dx_names),
                     dimnames = list(NULL, dx_names),
                     byrow = TRUE))
gives,
  item1 item2
1     A     1
2     B     2
Note that you may need to change the item2 column to a numeric one with as.numeric(), since it has been assigned as a factor by this solution.
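For example, if the result is stored as result (a name used here just for illustration):
# as.character() first guards against a factor returning its level codes
result$item2 <- as.numeric(as.character(result$item2))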

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame which stores the total number of occurrences of each unique word (except the stop words) in the column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is:
Combine the strings in the column into one big string.
Make a list storing the unique words in the big string.
Make a data frame whose one column is those unique words.
Compute the frequency of each.
But this seems really inefficient and I don't know how to actually code it.
First, you can create a vector of all words with str_split() and then create a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
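Note that splitting on spaces keeps punctuation attached, so "you." and "you?" will not match the stop words. One way around this (an addition to the answer above) is to strip sentence punctuation before splitting:
# Strip sentence punctuation first so "you." and "you?" match the stop word "you"
all_words <- unlist(str_split(str_remove_all(df$strings, "[.,!?]"), pattern = " "))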
One way is using tidytext; the Text Mining with R book covers this approach. Here is the code:
library("tidytext")
library("tidyverse")
#> df <- data.frame(id = 1:6,
#>                  strings = c("I want to go to school", "how about you?",
#>                              "I like you.", "I like you so much",
#>                              "I like you very much", "I don't like you"))
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
  filter(!word %in% c("I", "i", "don't", "you")) %>%
  count(word)
#> # A tibble: 11 x 2
#>   word      n
#>   <chr> <int>
#> 1 about     1
#> 2 go        1
#> 3 how       1
#> 4 like      4
#> 5 much      2
EDIT
All the tokens are transformed to lower case, so either include "i" in the stop words or add the argument to_lower = FALSE to unnest_tokens.
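For example, keeping the original case:
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings, to_lower = FALSE) %>%
  filter(!word %in% c("I", "don't", "you")) %>%
  count(word)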
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# Split text into a vector of words
wordvector <- strsplit(mystring, " ")[[1]]
# Remove stopwords from the vector
wordvector <- wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df <- data.frame(table(wordvector))
Let me know if this can help you.

Inserting random letters at random locations within a string

I am trying to make a little script to demonstrate how DNA sequences can evolve using a sentence as an example. I would like to repeatedly replace or insert letters or words into a string in R. I would like this to happen repeatedly so one can watch the string change over time. Finally I would like there to be a greater probability of letters changing than words changing.
So far I have defined a string and created lists of both letters and words and sample randomly from both these lists.
However, I do not know how to then modify the text with a set probability. For example, how do I make it so there is a 50% chance of a letter in the text being replaced with a letter from my letter list, with the replacement occurring at a random location in the text?
I also want this process to repeat X times so I can show the text changing over time. Any help or suggestions are greatly appreciated. My current, incomplete code is below.
#First I define the string
text <- c("This sentence is changing")
#Then make a vector of words from the string
word_list <- strsplit(text, " ")
word_list <- unlist(word_list)
#Also make a vector of letters from the string
letters_and_gaps <- substring(text, seq(1, nchar(text), 1), seq(1, nchar(text), 1))
letters_and_gaps <- unlist(letters_and_gaps)
#Now, with probability 1 in 2 of it occurring, select a random character from letters_and_gaps:
sample(letters_and_gaps, 1)
#Then choose a random character in text and replace it with this randomly sampled character:
#Now, with probability 1 in 10 of it occurring, select a random word from word_list:
sample(word_list, 1)
#Then choose a random word in text and replace it with this randomly sampled word:
#Then print the updated text:
text
#Iteratively repeat this process X times
My goal is ultimately to put this in a Shiny app where one can select the probability of the different events occurring (letter vs. word replacement) and then watch how this influences the way the text evolves.
Here is the beginning of an implementation. We just wrap your logic up in a function and use a for loop to apply it again and again. Here I put the output in a table and then display only unique rows (dropping iterations where the string mutated back to a previous value, which is probably not significant) so you can see the changes happening. Note that because we are sampling from the words and characters of the previous sentence, and we are including spaces, new words can form when spaces are inserted, and the distribution will tend to become more uniform (if a character is common, it will tend to be substituted more often).
library(tidyverse)
evolve_sentence <- function(sentence) {
  chars <- str_split(sentence, "") %>% pluck(1)
  if (runif(1) > 0.5) { # 50% chance: substitute one random character
    chars[sample(1:length(chars), 1)] <- sample(chars, 1)
  }
  sentence <- str_c(chars, collapse = "")
  words <- str_split(sentence, " ") %>% pluck(1)
  if (runif(1) > 0.9) { # 10% chance: substitute one random word
    words[sample(1:length(words), 1)] <- sample(words, 1)
  }
  sentence <- str_c(words, collapse = " ")
  sentence
}
tbl_evolve <- tibble(iteration = 1:500, text = "This sentence is changing")
for (i in 2:500) {
  tbl_evolve$text[i] <- evolve_sentence(tbl_evolve$text[i - 1])
}
tbl_evolve %>%
  distinct(text, .keep_all = TRUE)
#> # A tibble: 204 x 2
#>    iteration text
#>        <int> <chr>
#>  1         1 This sentence is changing
#>  2         3 hhis sentence is changing
#>  3         4 hhis sentence is chasging
#>  4         6 hhis sestence is chasging
#>  5        10 hhi sestence is chasging
#>  6        12 hhi sesnence is chasging
#>  7        14 hhi sesnesce is chasging
#>  8        15 hhi se nesce is chasging
#>  9        18 hhi se nesceiis chasging
#> 10        20 hhi se nesceiis chasgihg
#> # … with 194 more rows
Created on 2019-04-17 by the reprex package (v0.2.1)
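For the Shiny use case, the two probabilities can be exposed as function arguments so that sliders can drive them; a minimal sketch (the argument names p_letter and p_word are illustrative, not from the original answer):
evolve_sentence <- function(sentence, p_letter = 0.5, p_word = 0.1) {
  chars <- str_split(sentence, "") %>% pluck(1)
  if (runif(1) < p_letter) { # letter substitution with probability p_letter
    chars[sample(seq_along(chars), 1)] <- sample(chars, 1)
  }
  sentence <- str_c(chars, collapse = "")
  words <- str_split(sentence, " ") %>% pluck(1)
  if (runif(1) < p_word) { # word substitution with probability p_word
    words[sample(seq_along(words), 1)] <- sample(words, 1)
  }
  str_c(words, collapse = " ")
}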
