If I have a specific list with terms like this:
df_specific <- data.frame(terms = c("hi", "why here", "see you soon"))
and a data frame with text
df_text <- data.frame(text = c("hi my name is", "why here you are",
"hi see you later", "I hope to see you soon"))
How can I use the first list's terms to count how many of the texts in df_text each term occurs in?
Example of expected output:
term num
hi 2
why here 1
see you soon 1
You can use grepl for an individual term and use sapply to map that across all of your test terms.
sapply(df_specific$terms, function(x) sum(grepl(x, df_text$text)))
[1] 2 1 1
If you want to get the specific format that you listed, just cbind the previous result onto df_specific
num = sapply(df_specific$terms, function(x) sum(grepl(x, df_text$text)))
cbind(df_specific, num)
terms num
1 hi 2
2 why here 1
3 see you soon 1
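One caveat before relying on this: grepl() treats each term as a regular expression, so a term containing metacharacters (e.g. "." or "?") could match more than intended. If the terms should always be matched literally, a sketch with fixed = TRUE:
num <- sapply(df_specific$terms, function(x) sum(grepl(x, df_text$text, fixed = TRUE)))
cbind(df_specific, num)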
Overview
Using the tidyverse, I supplied each value in df_specific$terms as the pattern to test for its existence in all the values of df_text$text, via map_df() and str_count().
# load necessary packages ----
library(tidyverse)
# load necessary data ------
df_specific <- tibble(terms = c("hi", "why here", "see you soon"))
df_text <- tibble(text = c("hi my name is"
, "why here you are"
, "hi see you later"
, "I hope to see you soon"))
# perform analysis --------
df_specific %>%
  pull(terms) %>%
  set_names() %>%
  # for each term .x, count how many times it appears
  # across all the values in df_text$text
  map_df(.f = ~ str_count(string = df_text$text, pattern = .x) %>% sum()) %>%
  # transform data from wide to long
  gather(key = "term", value = "num")
# A tibble: 3 x 2
# term num
# <chr> <int>
# 1 hi 2
# 2 why here 1
# 3 see you soon 1
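Note that gather() has since been superseded in the tidyverse by pivot_longer(); the same reshape with the newer verb would look like this (a sketch, assuming the tibbles above):
df_specific %>%
  pull(terms) %>%
  set_names() %>%
  map_df(.f = ~ str_count(string = df_text$text, pattern = .x) %>% sum()) %>%
  pivot_longer(everything(), names_to = "term", values_to = "num")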
# end of script #
I need to compare two adjacent rows in a column of a data frame; if the data in both rows matches, then save the most recent row, e.g.
# Animals
# 1 dog
# 2 cat
# 3 cat
It should compare dog and cat, find that they differ, and not save any data, so rows 1 and 2 won't be saved.
But when it moves on to compare cat and cat, it should realise they are the same and save those rows, i.e. rows 2 and 3. There are several other columns, but the Animals column is the only one I need to use to decide whether a row is saved. However, I want to keep all the data in the columns of the saved rows.
I need to do this for lots of rows, iterating through to compare a big set of data (~68,000 rows).
I've tried to produce an if statement inside a loop:
results <- list()

for (i in 1:(nrow(data) - 1)) {
  if (isTRUE(data$Animals[i+1] == data$Animals[i])) {
    output <- print(data$Animals[i+1])
    results[[i+1]] <- output
    output <- print(data$Animals[i])
    results[[i]] <- output
  }
}
I then converted this results list into a data frame for further manipulation. However, this method only provides me with the animal name; I would prefer that the entire row was saved. I'm not too sure how to achieve this; I've been trying to edit the statement but I can't seem to get it working.
I'm new to R and learning, please help any way you can, I'd appreciate it :)
To "prove" that we're saving the "most recent row", I'll add a row-number column. The data:
dat <- structure(list(Animals = c("dog", "cat", "cat"), row = 1:3), row.names = c(NA, -3L), class = "data.frame")
dat
# Animals row
# 1 dog 1
# 2 cat 2
# 3 cat 3
base R
dat[c(with(dat, Animals[-nrow(dat)] != Animals[-1]), TRUE),,drop=FALSE]
# Animals row
# 1 dog 1
# 3 cat 3
dplyr
library(dplyr)
dat %>%
filter(Animals != lead(Animals, default = ''))
# Animals row
# 1 dog 1
# 2 cat 3
The only caution I have with this is that if package-loading is at all out-of-order, there exist both stats::filter and stats::lag, which behave completely differently. If you see odd results, try prepending dplyr:: to make sure it isn't a which-function-am-I-using problem.
dat %>%
dplyr::filter(Animals != dplyr::lead(Animals, default = ''))
We could use lead and filter
library(dplyr)
dat %>%
  mutate(helper = lead(Animals)) %>%
  filter(Animals == helper) %>%
  select(Animals)
Output:
#   Animals
# 1     cat
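Since the question actually asks to keep both matching rows with all of their columns, a hedged variant of the same lead()/lag() idea (using the dat defined earlier and dropping the select()):
library(dplyr)
dat %>%
  filter(Animals == lead(Animals) | Animals == lag(Animals))
#   Animals row
# 1     cat   2
# 2     cat   3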
I imported 720 sentences from this website (https://www.cs.columbia.edu/~hgs/audio/harvard.html). There are 72 lists (each containing 10 sentences), and I saved them in an appropriate structure. I did those steps in R; the code is shown immediately below.
#Q.1a
library(xml2)
library(rvest)
url <- 'https://www.cs.columbia.edu/~hgs/audio/harvard.html'
sentences <- read_html(url) %>%
html_nodes("li") %>%
html_text()
headers <- read_html(url) %>%
html_nodes("h2") %>%
html_text()
#Q.1b
harvardList <- list()
sentenceList <- list()
n <- 1
for (sentence in sentences) {
  sentenceList <- c(sentenceList, sentence)
  print(sentence)
  if (length(sentenceList) == 10) { # if we have 10 sentences
    harvardList[[headers[n]]] <- sentenceList # append those 10 sentences under the header of the list they came from
    sentenceList <- list() # empty the temporary list the 10 sentences were collected into
    n <- n + 1 # move on to the next list name
  }
}
#Q.1c
sentences1 <- split(sentences, ceiling(seq_along(sentences)/10))
getwd()
setwd("/Users/juliayudkovicz/Documents/Homework 4 Datascience")
sentences.df <- do.call("rbind", lapply(sentences1, as.data.frame))
names(sentences.df)[1] <- "Sentences"
write.csv(sentences.df, file = "sentences1.csv", row.names = FALSE)
THEN, in PYTHON, I computed a list of all the words ending in "ing" and what their frequency was, aka, how many times they appeared across all 72 lists.
path="/Users/juliayudkovicz/Documents/Homework 4 Datascience"
os.chdir(path)
cwd1 = os.getcwd()
print(cwd1)
import pandas as pd
df = pd.read_csv(r'/Users/juliayudkovicz/Documents/Homework 4 Datascience/sentences1.csv', sep='\t', engine='python')
print(df)
df['Sentences'] = df['Sentences'].str.replace(".", "", regex=False)  # remove the literal "." character, not the regex wildcard
print(df)
sen_List = df['Sentences'].values.tolist()
print(sen_List)
ingWordList = []
for line in sen_List:
    for word in line.split():
        if word.endswith('ing'):
            ingWordList.append(word)

ingWordCountDictionary = {}
for word in ingWordList:
    word = word.replace('"', "")
    word = word.lower()
    if word in ingWordCountDictionary:
        ingWordCountDictionary[word] = ingWordCountDictionary[word] + 1
    else:
        ingWordCountDictionary[word] = 1
print(ingWordCountDictionary)
f = open("ingWordCountDictionary.txt", "w")
for key, value in ingWordCountDictionary.items():
keyValuePairToWrite = "%s, %s\n"%(key, value)
f.write(keyValuePairToWrite)
f.close()
Now, I am being asked to create a dataset which shows which list (1 through 72) each "ing" word is derived from. THIS IS WHAT I DON'T KNOW HOW TO DO. I know they are a subset of the huge 72-list structure, but how do I figure out which list those words came from?
The expected output should look something like this:
[List Number] [-ing Word]
List 1 swing, ring, etc.,
List 2 moving
so and so forth
Here is one way for you. As far as I can see from the expected result, you seem to want to get verbs in progressive form (V-ing). (I do not understand why you have king in your result; if you have king, you should have spring here as well, for example.) If you need to consider lexical classes, I think you want to use the koRpus package. If not, you can use the textstem package, for example.
First, I scraped the link and created a data frame. Then, I split the sentences into words using unnest_tokens() in the tidytext package and subsetted the words ending with 'ing'. Then, I used treetag() in the koRpus package; you need to install TreeTagger yourself before you use the package. Finally, I counted how many times these verbs in progressive form appear in the data set. I hope this will help you.
library(tidyverse)
library(rvest)
library(tidytext)
library(koRpus)
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("h2") %>%
html_text() -> so_list
read_html("https://www.cs.columbia.edu/~hgs/audio/harvard.html") %>%
html_nodes("li") %>%
html_text() -> so_text
# Create a data frame
sodf <- tibble(list_name = rep(so_list, each = 10),
text = so_text)
# Split senteces into words and get words ending with ING.
unnest_tokens(sodf, input = text, output = word) %>%
filter(grepl(x = word, pattern = "ing$")) -> sowords
# Use koRpus package to lemmatize the words in sowords$word.
treetag(sowords$word, treetagger = "manual", format = "obj",
TT.tknz = FALSE , lang = "en", encoding = "UTF-8",
TT.options = list(path = "C:\\tree-tagger-windows-3.2\\TreeTagger",
preset = "en")) -> out
# Access to the data frame and filter the words. It seems that you are looking
# for verbs. So I did that here.
filter(out@TT.res, grepl(x = token, pattern = "ing$") & wclass == "verb") %>%
count(token)
# A tibble: 16 x 2
# token n
# <chr> <int>
# 1 adding 1
# 2 bring 4
# 3 changing 1
# 4 drenching 1
# 5 dying 1
# 6 lodging 1
# 7 making 1
# 8 raging 1
# 9 shipping 1
#10 sing 1
#11 sleeping 2
#12 wading 1
#13 waiting 1
#14 wearing 1
#15 winding 2
#16 working 1
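If installing TreeTagger is not an option, the textstem package mentioned above offers a lighter, cruder route: lemmatize each -ing token and keep the ones whose lemma differs from the token, as a rough proxy for inflected verb forms. A minimal sketch, assuming the sowords data frame built above:
library(textstem)
sowords %>%
  mutate(lemma = lemmatize_words(word)) %>%
  filter(lemma != word) %>% # rough proxy: nouns like "king" lemmatize to themselves and drop out
  count(word)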
How did you store the data from the lists (i.e. what does your data.frame look like)? Could you provide an example?
Without seeing this, I suggest you save the data in a list as follows:
COLUMN 1 , COLUMN 2, COLUMN 3
"List number", "Sentence", "-ING words (as vector)"
I hope this makes sense, let me know if you need more help. I wasn't able to comment on this post unfortunately.
I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame which stores the total number of occurrences of each unique word (except stop words) in the column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is:
1. Combine the strings in the column into one big string.
2. Make a list storing the unique words in that big string.
3. Make a df whose one column is the unique words.
4. Compute the frequency.
But this seems really inefficient and I don't know how to code it.
First, you can create a vector of all the words with str_split() and then build a frequency table of those words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
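One caveat: splitting on a single space leaves punctuation glued to words ("school," vs "school"). A hedged refinement is to split on runs of non-word characters instead (keeping the apostrophe so "don't" still matches the stop word):
all_words <- unlist(str_split(df$strings, pattern = "[^\\w']+"))
all_words <- all_words[all_words != ""] # drop empties left by leading/trailing punctuation
word_list <- as.data.frame(table(all_words))
word_list[!word_list$all_words %in% stop_words, ]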
One way is using tidytext; see the book Text Mining with R (Silge & Robinson) and the code below.
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
  filter(!word %in% c("I", "i", "don't", "you")) %>%
  count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you can either include i in the stop_words or add the argument to_lower = FALSE to unnest_tokens.
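For the second option, that would look like this (a sketch against the same pipeline):
df %>%
  mutate(strings = as.character(strings)) %>%
  unnest_tokens(word, strings, to_lower = FALSE) %>%
  filter(!word %in% c("I", "don't", "you")) %>%
  count(word)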
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df = data.frame(table(wordvector))
Let me know if this can help you.
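To apply this to the question's data frame, mystring could be built by collapsing the column first, e.g.:
mystring <- paste(df$strings, collapse = " ")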
I could not find an answer for how to count words in a data frame while excluding rows where another word is also present.
I have got below df:
words <- c("INSTANCE find", "LA LA LA", "instance during",
"instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = T)
It flags all instances of "instance". I have been trying to also exclude any row where the word "find" is present.
I could add another grepl to look for "find" and exclude based on that, but I am trying to limit the number of lines in my code.
I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = T, ignore.case = T))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
lapply(grepl, df$words, perl = T, ignore.case = T) %>%
reduce(`&`)
If all you need is the number of times "instance" appears in each string, zeroing that count whenever "find" is found anywhere in the same string:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0
New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a vector and use gsub to remove them, then convert back to a data frame and maintain the same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a))
but clearly this doesn't work, even before converting back to a data frame.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
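One side effect worth noting: deleting a word leaves its surrounding spaces behind (e.g. "ai and x" becomes " and x"). If that matters, a hedged follow-up is to collapse runs of whitespace and trim, assuming the dat1 built above:
dat1$text <- trimws(gsub("\\s+", " ", dat1$text))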
Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are bound with \b so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = T) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
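For example, a sketch of the replacement variant, reusing the same pattern but substituting a placeholder of your choice:
dat <- dat %>%
  mutate(text = str_replace_all(text, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = T), "[removed]"))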
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = T)))