Count the number of words per line - R

I am trying to move some R code into Spark using sparklyr, and I am having trouble with a few functions. I need to do the following:
-Count the total number of words in a row: for example, for "Hello how are you" the number of words is 4.
-Count the number of characters in the first word: for "Hello how are you" the first word ("Hello") has 5 characters.
-Count the number of characters in the second word: for "Hello how are you" the second word ("how") has 3 characters.
I tried the dplyr and stringr packages but I can't get what I need.
I connect to a Spark session:
install.packages("DBI")
install.packages("ngram")
require(DBI)
require(sparklyr)
require(dplyr)
require(stringr)
require(stringi)
require(base)
require(ngram)
# Spark Config
config <- spark_config()
config$spark.executor.cores <- 2
config$spark.executor.memory <- "4G"
spark <- spark_connect(master = "yarn-client",version = "2.3.0",app_name = "Test", config=config)
Then I try to retrieve some data with a SQL statement:
test_query<-sdf_sql(spark,"SELECT ID, NAME FROM table.name LIMIT 10")
NAME <- c('John Doe','Peter Gynn','Jolie Hope')
ID <- c(1,2,3)
test_query <- data.frame(NAME,ID) # example data; shown here as an R data frame, but in practice it is a Spark DataFrame
When I try to do feature engineering I get an error on the last line:
test_query <- test_query %>%
  mutate(Total_char = nchar(NAME)) %>%                            # works
  mutate(Name_has_numbers = str_detect(NAME, "[[:digit:]]")) %>%  # works
  mutate(Total_words = str_count(NAME, '\\w+'))                   # this line throws the error
The error message I am getting is this one: Error: org.apache.spark.sql.AnalysisException: Undefined function: 'STR_COUNT'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.
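A workaround (sketched only, since it needs a live Spark cluster to verify): dplyr verbs on a Spark DataFrame are translated to Spark SQL, and function calls that sparklyr does not recognize are passed through verbatim, so Spark SQL built-ins such as `split` and `size` can stand in for `str_count`:

```r
# Sketch, assuming the `spark` connection and the Spark table `test_query` from above.
# split() and size() here are Spark SQL built-ins, passed through untranslated by sparklyr.
test_query <- test_query %>%
  mutate(Total_words = size(split(NAME, " ")))
```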

library(tidyverse)
test_query %>%
  mutate(NAME = as.character(NAME),
         word_count = str_count(NAME, "\\w+"),                     # total number of words in the row
         N_char_first_word = nchar(gsub("(\\w+).*", "\\1", NAME))  # number of characters in the first word
  )
        NAME ID word_count N_char_first_word
1   John Doe  1          2                 4
2 Peter Gynn  2          2                 5
3 Jolie Hope  3          2                 5
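As a plain-R sanity check of the three counts from the question (base R only; this is a local illustration, not a Spark translation):

```r
x <- "Hello how are you"
words <- strsplit(x, "\\s+")[[1]]  # split the sentence on whitespace

length(words)    # total number of words: 4
nchar(words[1])  # characters in the first word ("Hello"): 5
nchar(words[2])  # characters in the second word ("how"): 3
```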


In R, compare string variable of two dataframes to create new flag variable indicating match in both dataframes, using a for-loop?

I have two dataframes which I would like to compare. One of them contains a complete list of sentences as a string variable as well as manually assigned codes of 0 and 1 (i.e. data.1). The second dataframe contains a subset of the sentences of the first dataframe and is reduced to those sentences that were matched by a dictionary.
This is, in essence, what these two datasets look like:
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
I would like to merge the results of the data.2 into data.1 and ideally create a new code_2 variable there that indicates whether a sentence was matched by the dictionary. This would yield something like this:
> data.1
texts code code_2
1 This is a sentence 1 1
2 This is another sentence 1 0
3 This is not a sentence 0 1
4 Yet another sentence 1 0
To make this slightly more difficult, and as you can see above, the sentences in data.2 are not just a subset of data.1 but they may also be in a different order (e.g. "This is not a sentence" is in the third row of the first dataframe but in the first row of the second dataframe).
I was thinking that looping through all of the texts of data.1 would do the trick, but I'm not sure how to implement this.
for (i in 1:nrow(data.1)) {
# For each i in data.1...
# compare sentence to ALL sentences in data.2...
# create a new variable called "code_2"...
# assign a 1 if a sentence occurs in both dataframes...
# and a 0 otherwise (i.e. if that sentence only occurs in `data.1` but not in `data.2`).
}
Note: My question is similar to this one, where the string variable "Letter" corresponds to my "texts", yet the problem is somewhat different, since the matching of sentences itself is the basis for the creation of a new flag variable in my case (which is not the case in said other question).
Can you just join the dataframes?
NOTE: Added replace_na to substitute the NAs with 0.
data.1 = data.frame(texts = c("This is a sentence", "This is another sentence", "This is not a sentence", "Yet another sentence"),
code = c(1,1,0,1))
data.2 = data.frame(texts = c("This is not a sentence", "This is a sentence"),
code = c(1,1))
data.1 %>% dplyr::left_join(data.2, by = 'texts') %>%
dplyr::mutate(code.y = tidyr::replace_na(code.y, 0))
I believe the following match()-based solution does what the question asks for.
i <- match(data.2$texts, data.1$texts)
data.1$code_2 <- 0L
data.1$code_2[i] <- data.2$code  # note: sorting i here would break the alignment with data.2$code
data.1
# texts code code_2
#1 This is a sentence 1 1
#2 This is another sentence 1 0
#3 This is not a sentence 0 1
#4 Yet another sentence 1 0
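If only the 0/1 flag is needed, and not the code values carried over from data.2, a simpler base-R alternative (a sketch, equivalent for this data) uses %in%, which ignores row order entirely:

```r
data.1 <- data.frame(texts = c("This is a sentence", "This is another sentence",
                               "This is not a sentence", "Yet another sentence"),
                     code = c(1, 1, 0, 1))
data.2 <- data.frame(texts = c("This is not a sentence", "This is a sentence"),
                     code = c(1, 1))

# 1 if the sentence occurs anywhere in data.2, 0 otherwise
data.1$code_2 <- as.integer(data.1$texts %in% data.2$texts)
data.1$code_2  # 1 0 1 0
```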

Limiting word count in a character column in R and saving extra words in another variable [duplicate]

I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also, for example, I have a CSV file whose format looks like this:
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write it onto a CSV file.
Please help me in this regard.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 when this pattern is found, replace it with the first captured group (the ten words)
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is a small function that splits the string, subsets the first ten words, and pastes them back together.
string_fun <- function(x) {
  ul <- unlist(strsplit(x, split = "\\s+"))[1:10]
  paste(ul, collapse = " ")
}
string_fun(x)
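One caveat: the `[1:10]` subset pads with NA when a string has fewer than 10 words, and paste() then prints those NAs literally. A head()-based variant of the same sketch avoids that:

```r
string_fun2 <- function(x) {
  ul <- strsplit(x, split = "\\s+")[[1]]
  paste(head(ul, 10), collapse = " ")  # head() never pads, so short strings pass through unchanged
}

string_fun2("only three words")  # "only three words", with no trailing NAs
```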
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function doesn't use the second column):
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
This is probably a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = " ")[[1]], 10)  # strsplit returns a list, so take [[1]] before head()

R: Extract list of list

I am trying to extract the elements from a nested list.
I have a list as below
> terms[1:3]
$`1`
mathew
1
$`2`
apr expires gmt thu
1 1 1 1
$`3`
distribution world
1 1
When I use unlist I get the following output, where each term's name is prefixed by its position in the list:
> unlist(terms)[1:6]
1.mathew 2.apr 2.expires 2.gmt 2.thu 3.distribution
1 1 1 1 1 1
How can I extract the term names and the values associated with them? For example, mathew has the value 1.
In the end I need to create a data frame of term and count.
Reproducible Example
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
findMostFreqTerms(tdm,10)
findMostFreqTerms will return a named list by default. If you just want to combine those terms into a single vector ignoring the document names, use
unlist(unname(terms))
But note that this may count some words multiple times if more than one document shares a most frequent word. If you want to treat the entire corpus as a single document, you can do
findMostFreqTerms(tdm, 10, INDEX=rep(1, ncol(tdm)))[[1]]
Does this help?
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
terms=findMostFreqTerms(tdm,10)
a = unlist(terms)
words = gsub('[0-9.]+', '', attr(a,'names'))
words
df = t(data.frame(a))
colnames(df) = words
# colnames(df)
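To finish the term/count data frame the question asks for, the cleaned names and values of the unlisted vector can be combined directly. A sketch on a hand-built vector standing in for unlist(terms), so it does not need tm:

```r
# Stand-in for unlist(terms): names have the form "<doc>.<term>", values are counts.
a <- c("1.mathew" = 1, "2.apr" = 1, "2.expires" = 1, "3.distribution" = 1)

df <- data.frame(term  = unname(sub("^[0-9]+\\.", "", names(a))),  # strip the "<doc>." prefix
                 count = as.vector(a),
                 stringsAsFactors = FALSE)
df  # one row per term, with its count
```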


Frequency of occurrence of two-pair combinations in text data in R

I have a file with several string (text) variables where each respondent has written a sentence or two for each variable. I want to be able to find the frequency of each combination of words (i.e. how often "capability" occurs with "performance").
My code so far goes:
#Setting up the data file
data.text <- scan("C:/temp/tester.csv", what="char", sep="\n")
#Change everything to lower text
data.text <- tolower(data.text)
#Split the strings into separate words
data.words.list <- strsplit(data.text, "\\W+", perl=TRUE)
data.words.vector <- unlist(data.words.list)
#List each word and frequency
data.freq.list <- table(data.words.vector)
This gives me a list of each word and how often it appears in the string variables. Now I want to see the frequency of every 2 word combination. Is this possible?
Thanks!
An example of the string data:
ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
I'm not sure if this is what you mean, but rather than splitting on every two-word boundary (which I found a pain to regex), you could paste every two words together using the trusty head and tail shift trick...
# How I read your data
df <- read.table( text = 'ID Reason_for_Dissatisfaction Reason_for_Likelihood_to_Switch
1 "not happy with the service" "better value at other place"
2 "poor customer service" "tired of same old thing"
3 "they are overchanging me" "bad service"
' , h = TRUE , stringsAsFactors = FALSE )
# Split to words
wlist <- sapply( df[,-1] , strsplit , split = "\\W+", perl=TRUE)
# Paste word pairs together
outl <- sapply( wlist , function(x) paste( head(x,-1) , tail(x,-1) , sep = " ") )
# Table as per usual
table(unlist( outl ) )
are overchanging at other bad service better value customer service
1 1 1 1 1
happy with not happy of same old thing other place
1 1 1 1 1
overchanging me poor customer same old the service they are
1 1 1 1 1
tired of value at with the
1 1 1
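The head/tail pasting step generalizes to any character vector; a self-contained sketch of bigram counting on a single sentence:

```r
words <- strsplit(tolower("poor customer service poor customer"), "\\W+", perl = TRUE)[[1]]

# Pair each word with its successor: head() drops the last word, tail() drops the first.
bigrams <- paste(head(words, -1), tail(words, -1))
table(bigrams)  # "poor customer" appears twice, the other pairs once
```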
