Count unique string patterns in a row

I have the following example:
library(stringi)
dat <- read.table(text="index string
1 'I have first and second'
2 'I have first, first'
3 'I have second and first and thirdeen'", header=TRUE)
toMatch <- c('first', 'second', 'third')
dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))
dat
index string count
1 1 I have first and second 2
2 2 I have first, first 2
3 3 I have second and first and thirdeen 2
I want to add a column count to the data frame that tells me how many UNIQUE words from toMatch each row contains. The desired output in this case would be:
index string count
1 1 I have first and second 2
2 2 I have first, first 1
3 3 I have second and first and thirdeen 2
Could you please give me a hint on how to modify the original formula? Thank you very much.

With base R you could do the following:
sapply(dat$string, function(x)
{sum(sapply(toMatch, function(y) {grepl(paste0('\\b', y, '\\b'), x)}))})
which returns
[1] 2 1 2
Hope this helps!
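The original stri_count_regex call counts every match, so a row that repeats one word is counted twice. Testing each word separately and summing the TRUE results counts each word at most once. A minimal base R sketch of the same idea, assuming the data from the question; the helper name patterns and the use of vapply instead of sapply are illustrative choices, not part of the answer above:

```r
dat <- data.frame(
  index = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One regex per word, wrapped in word boundaries (hypothetical helper name)
patterns <- paste0("\\b", toMatch, "\\b")

# For each row, count how many of the patterns match at least once
dat$count <- vapply(dat$string, function(s) {
  sum(vapply(patterns, grepl, logical(1), x = s))
}, integer(1), USE.NAMES = FALSE)

dat$count
```

Using vapply with USE.NAMES = FALSE keeps the result an unnamed integer vector, which is convenient when assigning it to a column.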

We can use stri_match_all instead, which returns the actual matches, and then count the distinct values with n_distinct (dplyr) or length(unique(x)) (base R).
library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), n_distinct)
#[1] 2 1 2
Or similarly, with length(unique(x)) from base R in place of n_distinct:
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), function(x) length(unique(x)))
#[1] 2 1 2
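The same extract-then-deduplicate idea can be sketched without any packages, using regmatches and gregexpr. This is an illustration, not the answer's own code. The fourth string is an added assumption to show a row with no match: here it yields 0, whereas, as far as I know, stri_match_all returns an NA row for such a string unless omit_no_match = TRUE, which would be counted as one distinct value.

```r
strings <- c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen",
             "nothing to see here")   # extra row with no match (assumption)
toMatch <- c("first", "second", "third")

# Single alternation pattern with word boundaries around each word
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# All matches per string; character(0) when nothing matches
matches <- regmatches(strings, gregexpr(pattern, strings))

# Number of distinct matched words per string
counts <- lengths(lapply(matches, unique))
counts
```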

Related

R: How to tell where in a word a repeating letters appears in order to add to a dataframe

I'm trying to detect how many words in a vector contain a repeated letter, and to count how often each repeated pair occurs across all words, collecting the results in a data frame.
For example: x = c("google", "blood", "street")
the data frame will appear as
letter n
1 oo 2
2 ee 1

You can match repeated letters with a regex backreference and extract them using stringr::str_match_all():
library(stringr)
as.data.frame(table(unlist(sapply(str_match_all(x, regex("([A-Za-z]{1})\\1")), `[`, , 1))))
Var1 Freq
1 ee 1
2 oo 2

One option in base R is to convert to raw, use rle to get the run-length encoding, keep only the runs with length greater than 1, convert back to character, and get the frequency count with table:
stack(table(sapply(x, function(y) rawToChar(with(rle(charToRaw(y)),
rep(values[lengths > 1], lengths[lengths > 1]))))))[2:1]
# ind values
#1 ee 1
#2 oo 2
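The run-length steps above can be spelled out one at a time. The sketch below is a variation under stated assumptions: it splits each word into characters with strsplit instead of converting to raw, and repeated_runs is a hypothetical helper name. Unlike the one-liner, it returns each run as a separate element, so a word with several repeated runs contributes one entry per run.

```r
x <- c("google", "blood", "street")

# Hypothetical helper: return every run of identical adjacent characters
# that is longer than 1, e.g. "google" -> "oo"
repeated_runs <- function(word) {
  runs <- rle(strsplit(word, "")[[1]])
  keep <- runs$lengths > 1
  # Rebuild each kept run as a string: value repeated 'length' times
  strrep(runs$values[keep], runs$lengths[keep])
}

runs <- unlist(lapply(x, repeated_runs))

# Frequency table as a data frame with the column names from the question
freq <- as.data.frame(table(letter = runs), responseName = "n",
                      stringsAsFactors = FALSE)
freq
```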
Or with str_extract (assuming there is only a single repeated substring)
library(stringr)
stack(table(str_extract(x, "(\\w)\\1")))[2:1]
# ind values
#1 ee 1
#2 oo 2
Or using dplyr
library(dplyr)
library(tidyr)
str_extract_all(x, "(\\w)\\1") %>%
tibble(letter = .) %>%
unnest(c(letter)) %>%
count(letter)

Another base R solution, using regmatches + table:
dfout <- as.data.frame(table(unlist(regmatches(x,gregexpr("(\\w)\\1+",x)))))
which gives
> dfout
Var1 Freq
1 ee 1
2 oo 2

Count the frequency of strings in a dataframe R

I want to count the frequencies of certain strings within a data frame.
strings <- c("pi","pie","piece","pin","pinned","post")
df <- as.data.frame(strings)
I would then like to count the frequency of the strings:
counts <- c("pi", "in", "pie", "ie")
To give me something like:
string freq
pi 5
in 2
pie 2
ie 2
I have experimented with grepl and table, but I don't see how to specify which strings I want to search for.

You can use sapply() to loop over counts and match each item against the strings column in df using grepl(). This returns a logical vector (TRUE for a match, FALSE otherwise), which you can sum to get the number of matches.
sapply(df, function(x) {
sapply(counts, function(y) {
sum(grepl(y, x))
})
})
This will return:
strings
pi 5
in 2
pie 2
ie 2
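If a two-column data frame like the one in the question is preferred over a matrix, the same grepl idea can be sketched as follows. Two details are assumptions added here: fixed = TRUE treats each search string literally rather than as a regex, and the column names string and freq are taken from the desired output.

```r
strings <- c("pi", "pie", "piece", "pin", "pinned", "post")
counts  <- c("pi", "in", "pie", "ie")

# Number of strings that contain each search string at least once
freq <- vapply(counts,
               function(p) sum(grepl(p, strings, fixed = TRUE)),
               integer(1))

result <- data.frame(string = names(freq), freq = unname(freq),
                     stringsAsFactors = FALSE)
result
```

Note that this counts strings containing the pattern, not total occurrences; a string that contains a pattern twice still contributes 1.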

colSums(sapply(counts, stringr::str_count, string = df$strings))
pi in pie ie
5 2 2 2
You can use adist from base R:
data.frame(counts,freq=rowSums(!adist(counts,strings,partial = T)))
counts freq
1 pi 5
2 in 2
3 pie 2
4 ie 2
If you are comfortable with regular expressions then you can do:
a=sapply(paste0(".*(",counts,").*|.*"),sub,"\\1",strings)
table(grep("\\w",a,value = T))
ie in pi pie
2 2 5 2

Frequency table created by qgrams from the stringdist package
library(stringdist)
strings <- c("pi","pie","piece","pin","pinned","post")
frequency <- data.frame(t(stringdist::qgrams(freq = strings, q = 2)))
freq
pi 5
po 1
st 1
ie 2
in 2
nn 1
os 1
ne 1
ec 1
ed 1
ce 1

Here's my solution using only base R and tidyverse functions, however it might not be as efficient as other packages that people mentioned.
new_df <- data.frame('VarName'=unique(df$VarName), 'Count'=0)
for (row_no in 1:nrow(new_df)) {
new_df[row_no,'Count'] = df %>%
filter(VarName==new_df[row_no, 'VarName']) %>%
nrow()
}
All you need to switch out is df and VarName.

Flatten list column in data frame with ID column

My data frame contains the output of a survey with a select multiple question type. Some cells have multiple values.
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
a b
1 1 1
2 2 1, 2
3 3 1, 2, 3
I would like to flatten out the list to obtain the following output:
df
a b
1 1 1
2 2 1
3 2 2
4 3 1
5 3 2
6 3 3
This should be easy, but somehow I can't find the right search terms. Thanks.

You can just use unnest from "tidyr":
library(tidyr)
unnest(df, b)
# a b
# 1 1 1
# 2 2 1
# 3 2 2
# 4 3 1
# 5 3 2
# 6 3 3

Using base R, one option is stack after naming the list elements of 'b' column with that of the elements of 'a'. We can use setNames to change the names.
stack(setNames(df$b, df$a))
Or another option would be to use unstack to automatically name the list element of 'b' with 'a' elements and then do the stack to get a data.frame output.
stack(unstack(df, b~a))
Or we can use a convenient function listCol_l from splitstackshape to convert the list to data.frame.
library(splitstackshape)
listCol_l(df, 'b')

Here's one way, with data.table:
require(data.table)
data.table(df)[,as.integer(unlist(b)),by=a]
If b is stored consistently, as.integer can be skipped. You can check with
unique(sapply(df$b,class))
# [1] "numeric" "integer"

Here's another base solution, far less elegant than any other solution posted thus far. Posting for the sake of completeness, though personally I would recommend akrun's base solution.
with(df, cbind(a = rep(a, sapply(b, length)), b = do.call(c, b)))
This constructs the first column as the elements of a, where each is repeated to match the length of the corresponding list item from b. The second column is b "flattened" using do.call() with c().
As Ananda Mahto pointed out in a comment, sapply(b, length) can be replaced with lengths(b), which is available from R 3.2.0 onward.
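A small sketch of the same repeat-and-flatten idea with lengths(), returning a data frame instead of the matrix that cbind produces. The name flat is illustrative.

```r
df <- data.frame(a = 1:3, b = I(list(1, 1:2, 1:3)))

# Repeat each id once per element of its list cell, then flatten the cells
flat <- data.frame(a = rep(df$a, lengths(df$b)),
                   b = unlist(df$b))
flat
```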

A base R approach might also be to create a new data.frame for each row and rbind it afterwards:
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df
df <- lapply(seq_along(df$a), function(x){data.frame(a = df$a[[x]], b = df$b[[x]])})
df <- do.call("rbind", df)
df

Add index numbers when converting sorted table to dataframe

I have a vector of strings that I'm trying to convert into a data frame with a frequency column. So far so good, but when I dim my data frame, I get only one column instead of two. I guess R is using the words as the index values.
Anyway here is how it starts. My list:
a<-c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay", "whatsyourexcuse", "yay")
Then, I tried to sort the frequency values in decreasing order and store as data frame using:
df <- as.data.frame(sort(table(a), decreasing=TRUE))
Problem is when I dim(df) I get [1] 5 1 instead of [1] 5 2. Here is what df looks like:
sort(table(a), decreasing = TRUE)
whatsyourexcuse 2
yay 2
welcoming 1
whiteway 1
zero 1
instead of:
a Freq
[1] whatsyourexcuse 2
[2] yay 2
[3] welcoming 1
[4] whiteway 1
[5] zero 1
Any pointers please? Thanks.

Try:
library(plyr)
a1 <- count(a)
a1[order(-a1$freq),]
# x freq
# 2 whatsyourexcuse 2
# 4 yay 2
# 1 welcoming 1
# 3 whiteway 1
# 5 zero 1
dim(a1)
#[1] 5 2
Or
a2 <- stack(sort(table(a),decreasing=TRUE))[,2:1]
dim(a2)
#[1] 5 2
When you convert with as.data.frame(sort(table(a), decreasing=TRUE)), the names of the elements become the row names of the data frame, so you get only one column instead of two. After sort, the result is no longer a table object. For comparison, check str(table(a)) and str(sort(table(a), decreasing=TRUE)).
You can also create the data.frame by
tbl <- sort(table(a), decreasing=TRUE)
data.frame(col1= names(tbl), Values= as.vector(tbl))
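Put together as a runnable sketch, with the column names a and Freq taken from the desired output in the question. Building the data frame from names() and as.vector() does not depend on how as.data.frame treats the sorted object.

```r
a <- c("welcoming", "whatsyourexcuse", "whiteway", "zero", "yay",
       "whatsyourexcuse", "yay")

tbl <- sort(table(a), decreasing = TRUE)

# Take the words from the names and the counts from the values
df <- data.frame(a = names(tbl), Freq = as.vector(tbl),
                 stringsAsFactors = FALSE)
dim(df)
```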

Order vector by number of occurrences in R

I have a vector:
x<-c(1,2,3,3,2,2)
Now I want to order this vector by number of occurrences. I know I can count the occurrences with table:
x.order <- table(x)[rev(order(table(x)))]
Gives me:
2 3 1
3 2 1
Now I know, I first have to select the values of x, which are 2, then the values of x which are 3 and then the values where x is 1. How can I perform this last step?
The final output has to look like:
2,2,2,3,3,1
Or is there a better way to order the vector by number of occurrences?

x<-c(1,2,3,3,2,2)
x.order <- sort(table(x), TRUE)
rep(as.numeric(names(x.order)), times=x.order)
#[1] 2 2 2 3 3 1
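An alternative sketch that avoids converting names back to numbers, so it should also work for character vectors: compute each element's group size with ave, then order by that size. The names n and x_sorted are illustrative, and breaking ties between equally frequent values by the value itself is an assumption added here.

```r
x <- c(1, 2, 3, 3, 2, 2)

# For every element, the number of times its value occurs in x
n <- ave(seq_along(x), x, FUN = length)

# Most frequent values first; ties between values broken by the value itself
x_sorted <- x[order(-n, x)]
x_sorted
```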
