Count the frequency of strings in a dataframe in R

I want to count the frequencies of certain strings within a dataframe.
strings <- c("pi","pie","piece","pin","pinned","post")
df <- as.data.frame(strings)
I would then like to count the frequency of the strings:
counts <- c("pi", "in", "pie", "ie")
To give me something like:
string freq
pi 5
in 2
pie 2
ie 2
I have experimented with grepl and table, but I don't see how I can specify the strings I want to search for.

You can use sapply() to loop over counts and match each item against the strings column in df using grepl(). This returns a logical vector (TRUE for a match, FALSE otherwise), which you can sum to get the number of matches.
sapply(df, function(x) {
  sapply(counts, function(y) {
    sum(grepl(y, x))
  })
})
This will return:
strings
pi 5
in 2
pie 2
ie 2
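To get the two-column layout shown in the question, the same grepl() idea can be wrapped into a data frame. This is only a minimal sketch: the column names string and freq are taken from the desired output, and fixed = TRUE is an assumption that the search terms are literal text rather than regular expressions.

```r
strings <- c("pi", "pie", "piece", "pin", "pinned", "post")
counts  <- c("pi", "in", "pie", "ie")

# For each search term, count how many elements of `strings` contain it.
# fixed = TRUE (an assumption) treats the term as literal text, not a regex.
freq <- vapply(counts,
               function(term) sum(grepl(term, strings, fixed = TRUE)),
               integer(1))

# Assemble the result in the layout asked for in the question.
result <- data.frame(string = counts, freq = unname(freq))
result
```

Note that this counts each element of strings at most once per term, which matches the numbers in the question.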

colSums(sapply(counts, stringr::str_count, string = df$strings))
pi in pie ie
5 2 2 2
You can use adist from base R:
data.frame(counts,freq=rowSums(!adist(counts,strings,partial = T)))
counts freq
1 pi 5
2 in 2
3 pie 2
4 ie 2
If you are comfortable with regular expressions then you can do:
a=sapply(paste0(".*(",counts,").*|.*"),sub,"\\1",strings)
table(grep("\\w",a,value = T))
ie in pi pie
2 2 5 2

Frequency table created by qgrams from the stringdist package
library(stringdist)
strings <- c("pi","pie","piece","pin","pinned","post")
frequency <- data.frame(t(stringdist::qgrams(freq = strings, q = 2)))
freq
pi 5
po 1
st 1
ie 2
in 2
nn 1
os 1
ne 1
ec 1
ed 1
ce 1

Here's my solution using only base R and tidyverse functions. It might not be as efficient as the other packages people mentioned, and note that it counts exact occurrences of each distinct value rather than substring matches.
library(dplyr)
new_df <- data.frame('VarName'=unique(df$VarName), 'Count'=0)
for (row_no in seq_len(nrow(new_df))) {
  new_df[row_no,'Count'] = df %>%
    filter(VarName==new_df[row_no, 'VarName']) %>%
    nrow()
}
All you need to switch out is df and VarName.

Related

Filtering/subsetting R dataframe based on each rows n'th position value

I have a 'df' with 2 columns:
Combinations <- c(0011111111, 0011113111, 0013113112, 0022223114)
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)
I am trying to find a way to subset or filter the dataframe where the 7th, 8th, and 9th digits of the 'Combinations' column equal 311. For the example given, I would expect the Combinations 0011113111, 0013113112, and 0022223114.
There are also instances where I would need to find different combinations, in different nth positions.
I know substring() can find these values for single rows but I'm not sure how to apply it to an entire dataframe.
substring() works on vectors as well.
subset(df, substring(Combinations, 7, 9) == 311)
# Combinations Values
#2 0011113111 2
#3 0013113112 3
#4 0022223114 4
data
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
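Since the question also mentions needing different patterns at different positions, the substring() call can be wrapped in a small function. The following is a sketch under assumptions: filter_by_position and its argument names are hypothetical, and the column is assumed to hold character strings so that leading zeros are preserved.

```r
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1, 2, 3, 4)
df <- data.frame(Combinations, Values)

# Hypothetical helper: keep the rows where column `col` contains `pattern`
# starting at character position `start`.
filter_by_position <- function(data, col, start, pattern) {
  stop_pos <- start + nchar(pattern) - 1
  data[substring(data[[col]], start, stop_pos) == pattern, , drop = FALSE]
}

# Positions 7 to 9 equal "311"
res <- filter_by_position(df, "Combinations", 7, "311")
res

# A different pattern at a different position: positions 3 to 4 equal "22"
res2 <- filter_by_position(df, "Combinations", 3, "22")
res2
```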
Another base R idea:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- data.frame(Combinations, Values)
df[grep(pattern = "^[0-9]{6}311.$", df$Combinations), ]
Output:
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
As a tip, if you want to know more about regular expressions, this website helps me a lot: https://regexr.com/3elkd
Would this work?
library(dplyr)
library(stringr)
df %>% filter(str_sub(Combinations, 7,9) == 311)
Combinations Values
1 0011113111 2
2 0013113112 3
3 0022223114 4
Not pretty but works:
df[which(lapply(strsplit(df$Combinations, ""), function(x) which(x[7]==3 & x[8]==1 & x[9]==1))==1),]
Combinations Values
2 0011113111 2
3 0013113112 3
4 0022223114 4
Data:
Combinations <- c("0011111111", "0011113111", "0013113112", "0022223114")
Values <- c(1,2,3,4)
df <- cbind.data.frame(Combinations, Values)

R: How to tell where in a word repeating letters appear in order to add them to a dataframe

I'm trying to detect how many words in a vector have a repeating letter and count the number of times that it is repeated in other words also, adding it to a data frame each time the repeated letters are encountered.
For example: x = c("google", "blood", "street")
the data frame will appear as
letter n
1 oo 2
2 ee 1
You can match repeating letters using regex and match using stringr::str_match_all():
library(stringr)
as.data.frame(table(unlist(sapply(str_match_all(x, regex("([A-Za-z]{1})\\1")), `[`, , 1))))
Var1 Freq
1 ee 1
2 oo 2
One option in base R is to convert to raw, use rle to get the run-length encoding, subset only the elements having lengths greater than 1, reconvert to character, and get the frequency count with table:
stack(table(sapply(x, function(y) rawToChar(with(rle(charToRaw(y)),
    rep(values[lengths > 1], lengths[lengths > 1]))))))[2:1]
# ind values
#1 ee 1
#2 oo 2
Or with str_extract (assuming there is only a single repeated substring)
library(stringr)
stack(table(str_extract(x, "(\\w)\\1")))[2:1]
# ind values
#1 ee 1
#2 oo 2
Or using dplyr
library(dplyr)
library(tidyr)
str_extract_all(x, "(\\w)\\1") %>%
tibble(letter = .) %>%
unnest(c(letter)) %>%
count(letter)
Another base R solution using regmatches + table
dfout <- as.data.frame(table(unlist(regmatches(x,gregexpr("(\\w)\\1+",x)))))
which gives
> dfout
Var1 Freq
1 ee 1
2 oo 2
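The answers above use two slightly different patterns, "(\\w)\\1" and "(\\w)\\1+". They agree on the example data but differ once a letter is repeated three or more times. A small sketch to illustrate, where "zzz" is an extra element added purely for demonstration and perl = TRUE is a choice made here, not something the answers require:

```r
x <- c("google", "blood", "street", "zzz")  # "zzz" added for illustration

# "(\\w)\\1" matches exactly two identical characters in a row
pairs_only <- unlist(regmatches(x, gregexpr("(\\w)\\1", x, perl = TRUE)))

# "(\\w)\\1+" matches the whole run of identical characters
full_runs <- unlist(regmatches(x, gregexpr("(\\w)\\1+", x, perl = TRUE)))

table(pairs_only)
table(full_runs)
```

Which pattern is appropriate depends on whether a run such as "zzz" should be reported as "zz" or as "zzz".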

Count unique string patterns in a row

I have the following example:
dat <- read.table(text="index string
1 'I have first and second'
2 'I have first, first'
3 'I have second and first and thirdeen'", header=TRUE)
library(stringi)
toMatch <- c('first', 'second', 'third')
dat$count <- stri_count_regex(dat$string, paste0('\\b',toMatch,'\\b', collapse="|"))
dat
index string count
1 1 I have first and second 2
2 2 I have first, first 2
3 3 I have second and first and thirdeen 2
I want to add a column count to the dataframe that tells me how many UNIQUE words each row has. The desired output in this case would be
index string count
1 1 I have first and second 2
2 2 I have first, first 1
3 3 I have second and first and thirdeen 2
Could you please give me a hint on how to modify the original formula? Thank you very much.
With base R you could do the following:
sapply(dat$string, function(x) {
  sum(sapply(toMatch, function(y) grepl(paste0('\\b', y, '\\b'), x)))
})
which returns
[1] 2 1 2
Hope this helps!
We can use stri_match_all instead which gives us the exact matches and then calculate distinct values using n_distinct or length(unique(x)) in base.
library(stringi)
library(dplyr)
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
collapse="|")), n_distinct)
#[1] 2 1 2
Or similarly, replacing n_distinct with length(unique(x)) from base R (the matching itself still uses stringi):
sapply(stri_match_all(dat$string, regex = paste0('\\b',toMatch,'\\b',
    collapse="|")), function(x) length(unique(x)))
#[1] 2 1 2
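If stringi is not available, the same idea can be sketched entirely in base R with gregexpr() and regmatches(). This is an illustrative variant, not the method of the answers above; perl = TRUE is a choice made here for the \\b word boundaries, and the data frame is rebuilt directly instead of via read.table.

```r
dat <- data.frame(
  index = 1:3,
  string = c("I have first and second",
             "I have first, first",
             "I have second and first and thirdeen"),
  stringsAsFactors = FALSE
)
toMatch <- c("first", "second", "third")

# One alternation pattern with word boundaries around every term
pattern <- paste0("\\b", toMatch, "\\b", collapse = "|")

# Extract all matches per row, then count the distinct ones
matches <- regmatches(dat$string, gregexpr(pattern, dat$string, perl = TRUE))
dat$count <- vapply(matches, function(m) length(unique(m)), integer(1))
dat$count
```

One difference worth noting: a row with no match at all yields 0 here, because regmatches() returns an empty vector for it.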

summarize results on a vector of different length of the original - Pivot table r

I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be use as a "guide"
and the database:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then add to the first vector "time.int" into a second database.
freq<-c(4,0,0,0,3) #wished result
This vector is the number of events that belong to each time interval: there are four 1s in "time", so the first value is four, and so on.
Ideally I would like to generalize it so that I can choose the interval, for example counting in a new vector the events in "time" for every 3 values of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
Let's say I want it every 2 seconds (every 2 values of time.int):
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do that in Excel, with a pivot table.
Sorry if this is a duplicate; I could not find a fitting question on this website, and I do not even know how to ask this or where to start.
The following will produce vector freq.
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)): cbind() first builds a character matrix, so time would no longer be numeric. Use instead
dat1 <- data.frame(time, value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
  inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
  sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
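The same grouping can also be sketched without the nested apply calls by mapping each event to its position in time.int and counting with tabulate(). The function name count_by_group is hypothetical, and the sketch assumes the values in time.int are unique; like fun above, it drops a trailing incomplete group.

```r
time.int <- c(1, 2, 3, 4, 5, 6)
time <- c(1, 1, 1, 2, 5, 5, 5, 6)

# Hypothetical helper: count events per consecutive group of n guide values.
count_by_group <- function(guide, events, n) {
  pos <- match(events, guide)        # position of each event in the guide
  pos <- pos[!is.na(pos)]            # ignore events not present in the guide
  group <- (pos - 1) %/% n + 1       # group index of each event
  tabulate(group, nbins = length(guide) %/% n)
}

count_by_group(time.int, time, 1)
count_by_group(time.int, time, 2)
count_by_group(time.int, time, 3)
```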
We can use the table function to count the event number and use merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3

Using sum(x:y) to create a new variable/vector from existing values in R

I am working in R with a data frame d:
ID <- c("A","A","A","B","B")
eventcounter <- c(1,2,3,1,2)
numberofevents <- c(3,3,3,2,2)
d <- data.frame(ID, eventcounter, numberofevents)
> d
ID eventcounter numberofevents
1 A 1 3
2 A 2 3
3 A 3 3
4 B 1 2
5 B 2 2
where numberofevents is the highest value in the eventcounter for each ID.
Currently, I am trying to create an additional vector z <- c(6,6,6,3,3).
If numberofevents == 3, it should calculate sum(1:3), equal to 3 + 2 + 1 = 6.
If numberofevents == 2, it should calculate sum(1:2), equal to 2 + 1 = 3.
Working with a large set of data, I thought it might be convenient to create this additional vector
by using the sum function in R d$z<-sum(1:d$numberofevents), i.e.
sum(1:3) # for the rows 1-3
and
sum(1:2) # for the rows 4-5.
However, I always get this warning:
Numerical expression has x elements: only the first is used.
You can try ave
d$z <- with(d, ave(eventcounter, ID, FUN=sum))
Or using data.table
library(data.table)
setDT(d)[,z:=sum(eventcounter), ID][]
Try using one of the apply-family functions in R, such as sapply or lapply.
sapply(numberofevents, function(x) sum(1:x))
It works for me.
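The warning in the question appears because the `:` operator is not vectorised over its operands: in 1:d$numberofevents only the first element of d$numberofevents is used. As a further alternative to the answers above, sum(1:n) equals n * (n + 1) / 2, which is vectorised arithmetic. A minimal sketch, assuming numberofevents always holds positive whole numbers:

```r
ID <- c("A", "A", "A", "B", "B")
eventcounter <- c(1, 2, 3, 1, 2)
numberofevents <- c(3, 3, 3, 2, 2)
d <- data.frame(ID, eventcounter, numberofevents)

# sum(1:n) == n * (n + 1) / 2, computed for every row at once
n <- d$numberofevents
d$z <- n * (n + 1) / 2
d
```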
