Text-mining including patterns and numbers - r

Dataset contains a free text field with information on building plans. I need to split the content of the field in 2 parts, the first part contains only the number of planned buildings, the other only the type of building. I have a reference lexicon list with the types of buildings.
Example
Plans<- c("build 10 houses ","5 luxury apartments with sea view",
"renovate 20 cottages"," transform 2 bungalows and a school", "1 hotel")
Reference list
Types <-c("houses", "cottages", "bungalows", "luxury apartments")
Desired Output 2 colums, Number and Type, with this content:
Number Type
10 houses
5 apartments
20 cottages
2 bungalows
Tried
matches <- unique (grep(paste(Types,collapse="|"), Plans, value=TRUE))
I can match the plans and types, but I can’t extract the numbers and types into two columns.
I tried str_split_fixed and grepl using :digit: and :alpha: but it isn’t working.

Assuming there is only going to be one numeric part in the string, we can extract the numeric part by replacing all the characters to empty strings. We create the Type column by extracting any of the string present in the Plans.
library(stringr)
data.frame(Number = as.numeric(gsub("[[:alpha:]]", "", Plans)),
Type = str_extract(Plans, paste(Types,collapse="|")))
# Number Type
#1 10 houses
#2 5 luxury apartments
#3 20 cottages
#4 2 bungalows
#5 1 <NA>
For the 5th row, "hotel" is not present in Types so it gives output as NA, if you need to ignore such cases you can do it with is.na. Extracting number from the string part is taken from here.

You can also use strcapture from base R:
strcapture(pattern = paste0("(\\d+)\\s(",paste(Types,collapse="|"),")"),x = Plans,
proto = data.frame(Number=numeric(),Type=character()))
Number Type
1 10 houses
2 5 luxury apartments
3 20 cottages
4 2 bungalows
5 NA <NA>

Related

How can I merge two dataframes equivalently to VLookup criteria?

I have just recently started using R for my master thesis. I need to match the ID number (uuid) of dataframe 1 to the investee names in dataframe 2.
Dataframe 1
investee_name uuid
1 Wetpaint e1393508
2 Zoho bf4d7b0e
3 Digg 5f2b40b8
4 Omidyar Network f4d5ab44
5 Facebook df662812
6 Trinity Ventures 7ca12f
Dataframe 2:
investee_name investor_name investor_type
1 Facebook cel organization
2 Facebook Grock Partners organization
3 Facebook Partners organization
4 Photobucket Ventures organization
5 Geni Fund organization
6 Gizmoz Capital organization
As you can see, in Dataframe 2 the investee names appear mutliple times. With VLookup in Excel I could have easily matched the respective IDs from dataframe 1 but for some reason the merging does not work in R.
I have tried the following:
investments_complete <- merge(v2_investments, ID_organizations, by.x= names(v2_investments)[1], by.y= names(ID_organizations)[1])
v2_investments_complete <- (merge(ID_organizations,v2_investments, by = "investee_name"))
for both options it merges the ID colums but I get 0 observations.
At last, I tried this:
v2_investments_merged <- merge(v2_investments, ID_organizations, by.x = "investee_name", by.y = "investee_name", all.x= TRUE)
here the merge works and all needed observations are there but al IDs have the value NA.
Is there any kind of merge function that mirrors the Vlookup that I intend to do? I've spent hours trying to solve this but couldn't, so I would be very grateful for support!
Cheers,
Philipp
It is possible that there are some leading/lagging spaces in the by columns. One option is trimws from base R which would remove the whitespace from both ends (if any)
v2_investments$investee_name <- trimws(v2_investments$investee_name)
ID_organizations$investee_name <- trimws(ID_organizations$investee_name)
Now, the merge should work

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a dataframe that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (word document). I'm new to R and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311

Find the most frequently occuring words in a text in R

Can someone help me with how to find the most frequently used two and three words in a text using R?
My text is...
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
The tidytext package makes this sort of thing pretty simple:
library(tidytext)
library(dplyr)
data_frame(text = text) %>%
unnest_tokens(word, text) %>% # split words
anti_join(stop_words) %>% # take out "a", "an", "the", etc.
count(word, sort = TRUE) # count occurrences
# Source: local data frame [73 x 2]
#
# word n
# (chr) (int)
# 1 phrase 8
# 2 sentence 6
# 3 words 4
# 4 called 3
# 5 common 3
# 6 grammatical 3
# 7 meaning 3
# 8 alex 2
# 9 bird 2
# 10 complete 2
# .. ... ...
If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:
library(tokenizers)
tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>% # tokenize bigrams and trigrams
as_data_frame() %>% # structure
count(value, sort = TRUE) # count
# Source: local data frame [531 x 2]
#
# value n
# (fctr) (int)
# 1 of the 5
# 2 a phrase 4
# 3 the sentence 4
# 4 as a 3
# 5 in the 3
# 6 may be 3
# 7 a complete 2
# 8 a phrase is 2
# 9 a sentence 2
# 10 a white 2
# .. ... ...
Your text is:
text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
In Natural Language Processing, 2-word phrases are referred to as "bi-gram", and 3-word phrases are referred to as "tri-gram", and so forth. Generally, a given combination of n-words is called an "n-gram".
First, we install the ngram package (available on CRAN)
# Install package "ngram"
install.packages("ngram")
Then, we will find the most frequent two-word and three-word phrases
library(ngram)
# To find all two-word phrases in the test "text":
ng2 <- ngram(text, n = 2)
# To find all three-word phrases in the test "text":
ng3 <- ngram(text, n = 3)
Finally, we will print the objects (ngrams) using various methods as below:
print(ng, output="truncated")
print(ngram(x), output="full")
get.phrasetable(ng)
ngram::ngram_asweka(text, min=2, max=3)
We can also use Markov Chains to babble new sequences:
# if we are using ng2 (bi-gram)
lnth = 2
babble(ng = ng2, genlen = lnth)
# if we are using ng3 (tri-gram)
lnth = 3
babble(ng = ng3, genlen = lnth)
We can split the words and use table to summarize the frequency:
words <- strsplit(text, "[ ,.\\(\\)\"]")
sort(table(words, exclude = ""), decreasing = T)
Simplest?
require(quanteda)
# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
## of_the a_phrase the_sentence may_be as_a in_the in_common phrase_is
## 5 4 4 3 3 3 2 2
## is_usually group_of
## 2 2
# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
## a_phrase_is group_of_words of_a_sentence of_the_sentence for_example_in example_in_the
## 2 2 2 2 2 2
## in_the_sentence an_orange_bird orange_bird_with bird_with_a
# 2 2 2 2
Here's a simple base R approach for the 5 most frequent words:
head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)
# a the of in phrase
# 21 18 12 10 8
What it returns is an integer vector with the frequency count and the names of the vector correspond to the words that were counted.
gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
table() to count unique elements' frequency
sort(..., decreasing = TRUE) to sort them in decreasing order
head(..., 5) to select only the top 5 most frequent words

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all other names and pick the one with the lowest distance. But I have thousands of names and want to group them all into groups.
I do not know anything about elastic search, but would one of the functions in the elastic package or some other function help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using Longest Common Subsequence metric (try others, e.g. cosine, Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow)
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group then explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)

Selecting strings and using in logical expressions to create new variable - R

I have a categorical variable indicating location of flu clinics as well as an "other" category. Participants who select the "other" category give open-ended responses for their location. In most cases, these open-ended responses fit with one of the existing categories (for example, one category is "public health clinic", but some respondents picked "other" and cited "mall" which was a public health clinic). I could easily do this by hand but want to learn the code to select "mall" strings then use logical expressions to assign these people to "public health clinic" (e.g. create a new variable for location of flu clinics).
My categorical variable is "lrecflu2" and my character string variable is "lfother"
So far I have:
mall <- grep("MALL", Motiv82012$lfother, value = TRUE)
This gives me a vector with all the string responses containing "MALL" (all strings are in caps in the dataframe)
How do I use this vector in a logical expression to create a new variable that assigns these people to the "public health clinic" category and assigns the original value of flu clinic location variable for people that did not select "other" (and do not have values in the character string variable) to the new flu clinic location variable?
Perhaps, grep is not even the right function to be using.
As I understand it, you have a column in a data frame, where you want to reassign one character value to another. If so, you were almost there...
set.seed(1) # for generating an example
df1 <- data.frame(flu2=sample(c("MALL","other","PHC"),size=10,replace=TRUE))
df1$flu2[grep("MALL",df1$flu2)] <- "PHC"
Here grep() is giving you the required vector index; you then subset the vector based on this and change those elements.
Update 2
This should produce a data.frame similar to the one you are using:
set.seed(1)
lreflu2 <- sample(c("PHC","Med","Work","other"),size=10,replace=TRUE)
Ifother <- rep("",10) # blank character vector
s1 <- c("Frontenac Mall","Kingston Mall","notMALL")
Ifother[lreflu2=="other"] <- s1
df1 <- data.frame(lreflu2,Ifother)
### alternative:
### df1 <- data.frame(lreflu2,Ifother, stringsAsFactors = FALSE)
df1
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 other Frontenac Mall
5 PHC
6 other Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
If you're looking for an exact string match you don't need grep at all:
df1$lreflu2[df1$Ifother=="MALL"] <- "PHC"
Using a regex:
df1$lreflu2[grep("Mall",df1$Ifother)] <- "PHC"
gives:
lreflu2 Ifother
1 Med
2 Med
3 Work
4 PHC Frontenac Mall
5 PHC
6 PHC Kingston Mall
7 other notMALL
8 Work
9 Work
10 PHC
Whether Ifother is a factor or vector with mode character doesn't affect things. data.frame will coerce string vectors to factors by default.

Resources