Counting the frequency of differing patterns in a character string - r

I currently have a string in R that looks like this:
a <- "BMMBMMMMBMMMBMMBBMMM"
First, I need to determine the frequency of different patterns of "M" that appear in the string.
In this example it would be:
MM = 2
MMM = 2
MMMM = 1
Secondly, I then need to designate a numerical value/score for each different pattern.
i.e:
MM = 1
MMM = 2
MMMM = 3
This would mean that the total value/score of M's in a would equal 9.
If anyone knows a script that would let me do this for multiple strings like this in a data frame, that would be great.
Thank you.

a <- "BMMBMMMMBMMMBMMBBMMM"
tbl <- table(strsplit(a, "B"), exclude="")
tbl
# MM MMM MMMM
# 2 2 1
score <- sum(tbl * 1:3)
score
# 9
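For the data-frame part of the question, here is a minimal sketch along the same strsplit lines. The column name `seq`, the helper `score_string` and the lookup vector `scores` are made up for illustration. Runs that are not in the lookup (a single "M", or anything longer than "MMMM") are assumed to score 0.

```r
# Hypothetical lookup: score for each pattern of M's (values taken from the question)
scores <- c(MM = 1, MMM = 2, MMMM = 3)

# Hypothetical helper: split one string on "B" and add up the scores of its runs.
# Empty pieces (from adjacent B's) and runs missing from `scores` give NA,
# which na.rm = TRUE drops, so they contribute 0.
score_string <- function(s, scores) {
  runs <- strsplit(s, "B", fixed = TRUE)[[1]]
  sum(scores[runs], na.rm = TRUE)
}

# Example data frame; `seq` is an assumed column name
df <- data.frame(seq = c("BMMBMMMMBMMMBMMBBMMM", "BMMB", "BBBB"),
                 stringsAsFactors = FALSE)

# One score per row
df$score <- vapply(df$seq, score_string, numeric(1),
                   scores = scores, USE.NAMES = FALSE)
df
```

The first row reproduces the total of 9 from the question. Changing `scores` is enough to try a different scoring scheme.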

You could also use the table function.
a_list<-unlist(strsplit(a,"B"))
a_list<-a_list[!a_list==""] #remove cases when 2 B are together
a_list<-table(a_list)
# MM MMM MMMM
# 2 2 1

Here's a solution that uses the dplyr package. First, I load the library and define my string.
library(dplyr)
a <- "BMMBMMMMBMMMBMMBBMMM"
Next, I define a function that counts the occurrences of character x in string y.
char_count <- function(x, y){
# Get runs of same character
tmp <- rle(strsplit(y, split = "")[[1]])
# Count runs of character stored in `x`
tmp <- data.frame(table(tmp$lengths[tmp$values == x]))
# Return strings and frequencies
tmp %>%
# Var1 is a factor, so convert it via character to get the real run lengths
mutate(String = strrep(x, as.integer(as.character(Var1)))) %>%
select(String, Freq)
}
Then, I run the function.
# Run the function
res <- char_count("M", a)
# String Freq
# 1 MM 2
# 2 MMM 2
# 3 MMMM 1
Finally, I define my value vector and calculate the total value of string a.
# My value vector
value_vec <- c(MM = 1, MMM = 2, MMMM = 3)
# Total `value` of string `a`, looking each pattern up by name
sum(value_vec[res$String] * res$Freq)
#[1] 9

If it's acceptable to skip the first step, you could do:
nchar(gsub("(B+M)|(^M)","",a))
# [1] 9
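Why the one-liner works: with the scores from the question, a run of n M's is worth n - 1, and the regex deletes exactly one M per run (together with the B's in front of it), so the remaining characters add up to the score. A small sketch that checks this against rle(). The variable names are only illustrative, and the last lines show a caveat: a trailing "B" is not removed by the pattern and would be counted.

```r
a <- "BMMBMMMMBMMMBMMBBMMM"

# Run lengths of "M" via run-length encoding
runs <- rle(strsplit(a, "")[[1]])
m_len <- runs$lengths[runs$values == "M"]

# Score as defined in the question: a run of n M's is worth n - 1
score_rle <- sum(m_len - 1)

# The regex version: drop one M per run plus the preceding B's
stripped <- gsub("(B+M)|(^M)", "", a)
score_regex <- nchar(stripped)

# Caveat: a string that ends in "B" keeps that B after the substitution
trailing <- gsub("(B+M)|(^M)", "", "BMMB")
trailing
```

For `a` both scores agree, but `trailing` still contains the final "B", so strings ending in "B" would need that character stripped first.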

First, compute all the different patterns (substrings) that appear in your string:
a <- "BMMBMMMMBMMMBMMBBMMM"
chars = unlist(strsplit(a, ""))
pat = c()
# i is the substring length, j its starting position
for (i in seq_along(chars)) {
  for (j in 1:(length(chars) - i + 1)) {
    pat = c(pat, paste(chars[j:(j + i - 1)], collapse = ""))
  }
}
pat = sort(unique(pat))
pat[1:5]
# [1] "B" "BB" "BBM" "BBMM" "BBMMM"
Next, count the occurrences of each pattern:
counts = sapply(pat, function(w) length(gregexpr(w, a, fixed = TRUE)[[1]]))
Finally, build a data frame to summarize everything:
df = data.frame(counts = counts, num = 1:length(pat))
head(df, 10)
counts num
B 6 1
BB 1 2
BBM 1 3
BBMM 1 4
BBMMM 1 5
BM 5 6
BMM 5 7
BMMB 2 8
BMMBB 1 9
BMMBBM 1 10

library(stringr)
str_count(a, "MMMM")
# [1] 1
# Count how many times "MMM" occurs, after first deleting every "MMMM"
str_count(gsub("MMMM", "", a), "MMM")
# [1] 2
# Count how many times "MM" occurs, after first deleting every "MMM"
str_count(gsub("MMM", "", a), "MM")
# [1] 2
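Deleting the longer patterns by hand gets awkward once there are many run lengths. A base R sketch that extracts every maximal run of M's in one pass and tabulates them (the names `runs` and `counts` are just for illustration):

```r
a <- "BMMBMMMMBMMMBMMBBMMM"

# "M+" matches each maximal run of M's; regmatches() returns the matched text
runs <- regmatches(a, gregexpr("M+", a))[[1]]

# Frequency of each distinct run
counts <- table(runs)
counts
```

Because the regex is greedy, an "MMMM" is never also counted as an "MMM" or "MM".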

Related

For loop with an index inside a string in R

I have a rather simple problem but somehow I cannot solve it.
So I have a dataset with a column cycle with rows cycle1, cycle2, cycle3. I want to replace e.g. the word cycle1 with just the number 1. How to somehow separate the index i from the string cycle?
for (i in 1:3){
data$cycle[data$cycle=="cyclei"]<-i
}
Replace "cycle" with the empty string and convert to numeric:
data <- data.frame(cycle = c("cycle2", "cycle1", "cycle3")) # sample input
transform(data, cycle = as.numeric(sub("cycle", "", cycle)))
giving:
cycle
1 2
2 1
3 3
Use gsub()
# load data
df <- data.frame( cycle = c( "cycle1", "cycle2", "cycle3" ), stringsAsFactors = FALSE )
# Identify the pattern and replace with nothing
# and cast the values as numeric
df$cycle <- as.numeric( gsub( pattern = "cycle", replacement = "", x = df$cycle ) )
# view results
df
# cycle
# 1 1
# 2 2
# 3 3
# end of script #
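If you do want to keep the loop from the question, the index has to be pasted into the string, because "cyclei" is just a literal string that contains the letter i. A minimal sketch using paste0(), with the same kind of sample data as above:

```r
data <- data.frame(cycle = c("cycle2", "cycle1", "cycle3"),
                   stringsAsFactors = FALSE)

for (i in 1:3) {
  # paste0("cycle", i) builds "cycle1", "cycle2", "cycle3"
  data$cycle[data$cycle == paste0("cycle", i)] <- i
}

# The column is still character after the loop, so convert it
data$cycle <- as.numeric(data$cycle)
data
```

The sub()/gsub() answers are shorter and do not need to know the number of cycles in advance.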

Convert single column dataframe to dataframe with multiple rows and named columns

dfOrig <- data.frame(rbind("1",
"C",
"531404",
"3",
"B",
"477644"))
library(data.table) # provides setnames()
setnames(dfOrig, "Value")
I have a single column vector, which actually comprises two observations of three variables. How do I convert it to a data.frame with the following structure:
ID Code Tag
"1" "C" "531404"
"3" "B" "477644"
Obviously, this is just a toy example to illustrate a real-world problem with many more observations and variables.
Here's another approach - it does rely on the dfOrig column being ordered 1,2,3,1,2,3 etc.
x <- c("ID", "Code", "Tag") # new column names
n <- length(x) # number of columns
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
type.convert))
The resulting data is:
> str(res)
#'data.frame': 2 obs. of 3 variables:
# $ Code: Factor w/ 2 levels "B","C": 2 1
# $ ID : int 1 3
# $ Tag : int 531404 477644
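As you can see, the column classes have been converted. If you want the Code column to be character instead of factor, pass as.is = TRUE to type.convert, i.e. lapply(..., type.convert, as.is = TRUE). Setting stringsAsFactors = FALSE in the data.frame call does not help here, because type.convert has already created the factor.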
And it looks like this:
> res
# Code ID Tag
#1 C 1 531404
#2 B 3 477644
Note: You have to get the column name order in x in line with the order of the entries in dfOrig$Value.
If you want to get the column order of res as specified in x, you can use the following:
res <- res[, match(x, names(res))]
Maybe convert to a matrix with ncol:
# set number of columns
myNcol <- 3
# convert to matrix, then dataframe
res <- data.frame(matrix(dfOrig$Value, ncol = myNcol, byrow = TRUE),
stringsAsFactors = FALSE)
# convert the type and add column names
res <- as.data.frame(lapply(res, type.convert),
col.names = c("resID", "Code", "Tag"))
res
# resID Code Tag
# 1 1 C 531404
# 2 3 B 477644
You can create a repeating index with the modulo operator and use it to pick out the rows that belong to each column:
x <- seq_len(nrow(dfOrig)) %% 3 # change this 3 to the number of columns you need
data.frame(ID = dfOrig$Value[x == 1],
Code = dfOrig$Value[x == 2],
Tag = dfOrig$Value[x == 0])
#ID Code Tag
#1 1 C 531404
#2 3 B 477644
Another approach would be to split the data frame according to the index generated above and then bind the pieces as columns using do.call:
x <- seq_len(nrow(dfOrig)) %% 3
res <- do.call("cbind", split(dfOrig,x))
You can definitely change the column names
colnames(res) <- c("Tag", "Id", "Code")
# Tag Id Code
#3 531404 1 C
#6 477644 3 B

How to find and label start / end points of unique sequences in R

I have a sequence of 1s and 0s alongside a time vector. I'd like to find the start and end time points of all the sequences of 1s, and give each sequence a unique ID. Here is some example data and my attempt so far.
Create dummy data
# Create the sequence
x = c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
# Create the time vector
t = 10:34
This is my effort
#Get changepoints using diff()
diff_result <- diff(x)
# Use ifelse() to get start and end times (i.e. on and off)
on_t <- ifelse(diff_result == 1, t, NA)
off_t <- ifelse(diff_result == -1, t, NA)
# Combine into data frame and remove NAs, add 1 to on_t
results <- data.frame(on_t = on_t[!is.na(on_t)] + 1, off_t = off_t[!is.na(off_t)])
# Create unique ID for each sequence
results$ID <- factor(1:nrow(results))
print(results)
on_t off_t ID
1 14 17 1
2 21 26 2
3 30 33 3
I'm sure there's a better way...
Another option is to put the two vectors in a data.table and then do a typical group-by, filter and mutate:
library(data.table)
dt = data.table(seq = x, time = t)
# Use rleid() to give every run its own group ID, then record the start time,
# end time and value (lab) of each run
runs = dt[, .(on_t = min(time), off_t = max(time), lab = unique(seq)), .(ID = rleid(seq))]
# Keep only the runs of 1s, then remove the label and recreate the ID
runs[lab == 1][, `:=`(lab = NULL, ID = seq_along(ID))][]
# ID on_t off_t
# 1: 1 14 17
# 2: 2 21 26
# 3: 3 30 33
Following OP's logic, which might be a better way:
d = diff(c(0, x, 0))
# pad x with a 0 at the beginning and at the end so that this also works
# if the sequence starts or ends with 1.
results = data.frame(on_t = t[d == 1], off_t = t[(d == -1)[-1]])
# take the time where a run of 1s starts as the on time, and the last time before a run of 0s starts as the off time.
# d is one element longer than t and x, but the last element of d == 1 is always FALSE, so it does not affect the result.
results$ID = 1:nrow(results)
# create an ID
results
# on_t off_t ID
# 1 14 17 1
# 2 21 26 2
# 3 30 33 3
You can also do it this way.
x = c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
# Create the time vector
t = 10:34
xy <- data.frame(x, t)
mr <- rle(xy$x)$lengths
xy$group <- rep(letters[1:length(mr)], times = mr)
onesies <- xy[xy$x == 1, ]
out <- by(onesies, INDICES = onesies$group,
FUN = function(x) {
data.frame(on_t = x$t[1], off_t = x$t[nrow(x)], ID = unique(x$group))
})
do.call("rbind", out)
on_t off_t ID
b 14 17 b
d 21 26 d
f 30 33 f
Here is one method for finding the starting and stopping positions of the above vector:
# get positions of the 1s
onePos <- which(x == 1)
# get the ending positions
stopPos <- onePos[c(which(diff(onePos) != 1), length(onePos))]
# get the starting positions
startPos <- onePos[c(1, which(diff(onePos) != 1) + 1)]
The corresponding values of t can be obtained through subsetting:
t[startPos]
[1] 14 21 30
t[stopPos]
[1] 17 26 33
Finally, to add an id:
df <- data.frame(id=seq_along(startPos), on_t=t[startPos], off_t=t[stopPos])
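Another base R variant, offered as a sketch: run-length encode x once and get the end of every run from cumsum() of the run lengths. The names `r`, `end_pos`, `start_pos` and `is_one` are only illustrative.

```r
x <- c(0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,0)
t <- 10:34

r <- rle(x)

# Position of the last and first element of every run (0s and 1s alike)
end_pos <- cumsum(r$lengths)
start_pos <- end_pos - r$lengths + 1

# Keep only the runs of 1s and look up their times
is_one <- r$values == 1
results <- data.frame(ID = seq_len(sum(is_one)),
                      on_t = t[start_pos[is_one]],
                      off_t = t[end_pos[is_one]])
results
```

This does not need padding with zeros, since rle() handles runs at the start or end of x like any other run.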

R count matched strings

I am trying to count matched items between strings:
target_str = "a,b,c"
table1 = data.frame(name = c("p1","p2","p3","p4"),
str = c("a,b","a","d,e,f","a,a"))
Based on target_str, I want to count how many items match. I want my output table to look like this:
name matches
p1 2 #matches a and b
p2 1 #matches a
p3 0 #no matches
p4 1 #if has duplicate, count only once
I have about 1 million target_strs to calculate matches for, so speed is very important. I'd appreciate any suggestions. Thanks in advance!
target_str = "a,b,c"
split_str <- strsplit(target_str, split = ",")[[1]]
table1 = data.frame(name = c("p1","p2","p3","p4"),
str = c("a,b","a","d,e,f","a,a"))
data.frame(name = table1$name,
matches = rowSums(sapply(split_str, grepl, x = table1$str)))
# name matches
# 1 p1 2
# 2 p2 1
# 3 p3 0
# 4 p4 1
This should be fairly fast:
# split the target string into a character vector:
target_vec <- unlist(strsplit("a,b,c", split = ","))
# split each observation's string into its items:
stringList <- strsplit(as.character(table1$str), split = ",")
# count how many target items occur in each observation (duplicates count once)
# and add the counts to the data.frame
table1$Counts <- sapply(stringList, function(i) sum(target_vec %in% i))
This cbinds the counts to the first column, which is kept as a data frame with drop=FALSE. The counts come from testing each target item in turn against table1$str with grepl and summing the results per row:
cbind( table1[ ,1,drop=FALSE], counts=rowSums(sapply( scan(text=target_str, sep= ",", what=""), function(t) { grepl( t, table1$str)})) )
Read 3 items
name counts
a p1 2
b p2 1
c p3 0
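One caveat for the grepl-based answers: grepl matches substrings, so a target item "a" also matches an entry such as "ab". If the items can be longer than one character, splitting on the comma and comparing whole items avoids that. A small sketch of the difference; the example vector `strs` is made up and deliberately contains "ab":

```r
target <- c("a", "b", "c")
strs <- c("a,b", "ab", "d,e,f", "a,a")

# Substring matching: "ab" counts as containing both "a" and "b"
substring_counts <- rowSums(sapply(target, grepl, x = strs, fixed = TRUE))

# Whole-item matching: split on "," and test which targets occur among the items.
# Using target %in% items means duplicates such as "a,a" are counted once.
items <- strsplit(strs, ",", fixed = TRUE)
exact_counts <- vapply(items, function(v) sum(target %in% v), integer(1))

data.frame(strs, substring_counts, exact_counts)
```

For the single-letter data in the question both approaches give the same answer, so this only matters for real data with longer items.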

Count number of occurrences for each unique value

Let's say I have:
v = rep(c(1,2, 2, 2), 25)
Now, I want to count the number of times each unique value appears. unique(v) returns what the unique values are, but not how many times each one occurs.
> unique(v)
[1] 1 2
I want something that gives me
length(v[v==1])
[1] 25
length(v[v==2])
[1] 75
but as a more general one-liner :) Something close (but not quite) like this:
#<doesn't work right> length(v[v==unique(v)])
Perhaps table is what you are after?
dummyData = rep(c(1,2, 2, 2), 25)
table(dummyData)
# dummyData
# 1 2
# 25 75
## or another presentation of the same data
as.data.frame(table(dummyData))
# dummyData Freq
# 1 1 25
# 2 2 75
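As for why the attempt in the question does not work: v == unique(v) compares the two vectors element by element, recycling the shorter one, instead of comparing v against each unique value in turn. If you want to stay close to the original idea, a sketch that applies the comparison once per unique value:

```r
v <- rep(c(1, 2, 2, 2), 25)

# For each unique value u, count how many elements of v equal u
counts <- vapply(unique(v), function(u) sum(v == u), integer(1))
names(counts) <- unique(v)
counts
```

table() does the same job with less code and is usually the better choice. This version only makes the per-value comparison explicit.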
If you have multiple factors (= a multi-dimensional data frame), you can use the dplyr package to count unique values in each combination of factors:
library("dplyr")
data %>% group_by(factor1, factor2) %>% summarize(count=n())
It uses the pipe operator %>% to chain method calls on the data frame data.
It is a one-line approach by using aggregate.
> aggregate(data.frame(count = v), list(value = v), length)
value count
1 1 25
2 2 75
If you only need the number of distinct values (not how often each one occurs), length(unique(df$col)) is the simplest way I can see.
table() function is a good way to go, as Chase suggested.
If you are analyzing a large dataset, an alternative is to use the special symbol .N from the data.table package.
Make sure the data.table package is installed:
install.packages("data.table")
Code:
# Import the data.table package
library(data.table)
# Generate a data table object, which draws a number 10^7 times
# from 1 to 10 with replacement
DT<-data.table(x=sample(1:10,1E7,TRUE))
# Count Frequency of each factor level
DT[,.N,by=x]
To get an un-dimensioned integer vector that contains the count of unique values, use c().
dummyData = rep(c(1, 2, 2, 2), 25) # Chase's reproducible data
c(table(dummyData)) # get un-dimensioned integer vector
1 2
25 75
str(c(table(dummyData)) ) # confirm structure
Named int [1:2] 25 75
- attr(*, "names")= chr [1:2] "1" "2"
This may be useful if you need to feed the counts of unique values into another function, and is shorter and more idiomatic than the t(as.data.frame(table(dummyData))[,2]) posted in a comment to Chase's answer. Thanks to Ricardo Saporta, who pointed this out to me.
This works for me. Take your vector v
length(summary(as.factor(v),maxsum=50000))
Comment: set maxsum to be large enough to capture the number of unique values
or with the magrittr package
v %>% as.factor %>% summary(maxsum=50000) %>% length
Also making the values categorical and calling summary() would work.
> v = rep(as.factor(c(1,2, 2, 2)), 25)
> summary(v)
1 2
25 75
You can also try the tidyverse:
library(tidyverse)
dummyData %>%
  as_tibble() %>% # as.tibble() is deprecated in favour of as_tibble()
  count(value)
# A tibble: 2 x 2
value n
<dbl> <int>
1 1 25
2 2 75
If you need to have the number of unique values as an additional column in the data frame containing your values (a column which may represent sample size for example), plyr provides a neat way:
data_frame <- data.frame(v = rep(c(1,2, 2, 2), 25))
library("plyr")
data_frame <- ddply(data_frame, .(v), transform, n = length(v))
You can also try dplyr::count
df <- dplyr::tibble(x=c('a','b','b','c','c','d'), y=1:6)
dplyr::count(df, x, sort = TRUE)
# A tibble: 4 x 2
x n
<chr> <int>
1 b 2
2 c 2
3 a 1
4 d 1
If you want to run unique on a data.frame (e.g., train.data), and also get the counts (which can be used as the weight in classifiers), you can do the following:
unique.count = function(train.data, all.numeric=FALSE) {
# first convert each row in the data.frame to a string
train.data.str = apply(train.data, 1, function(x) paste(x, collapse=','))
# use table to index and count the strings
train.data.str.t = table(train.data.str)
# get the unique data string from the row.names
train.data.str.uniq = row.names(train.data.str.t)
weight = as.numeric(train.data.str.t)
# convert the unique data string to data.frame
if (all.numeric) {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) as.numeric(unlist(strsplit(x, split=","))))))
} else {
train.data.uniq = as.data.frame(t(apply(cbind(train.data.str.uniq), 1,
function(x) unlist(strsplit(x, split=",")))))
}
names(train.data.uniq) = names(train.data)
list(data=train.data.uniq, weight=weight)
}
I know there are many other answers, but here is another way to do it using the sort and rle functions. The function rle stands for Run Length Encoding. It can be used for counts of runs of numbers (see the R man docs on rle), but can also be applied here.
test.data = rep(c(1, 2, 2, 2), 25)
rle(sort(test.data))
## Run Length Encoding
## lengths: int [1:2] 25 75
## values : num [1:2] 1 2
If you capture the result, you can access the lengths and values as follows:
## rle returns a list with two items.
result.counts <- rle(sort(test.data))
result.counts$lengths
## [1] 25 75
result.counts$values
## [1] 1 2
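A hand-rolled alternative that tallies the values in a named list (intended for a character vector, since each value is used as a list name):
count_unique_words <- function(wlist) {
  ucountlist <- list() # running count for each distinct word
  unamelist <- c()     # words seen so far
  for (i in wlist) {
    if (is.element(i, unamelist)) {
      # word seen before: increment its count
      ucountlist[[i]] <- ucountlist[[i]] + 1
    } else {
      # first occurrence: start its count at 1 and remember the word
      ucountlist[[i]] <- 1
      unamelist <- c(unamelist, i)
    }
  }
  ucountlist
}
# `population` is the character vector of words to count
expt_counts <- count_unique_words(population)
for (i in names(expt_counts))
  cat(i, expt_counts[[i]], "\n")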

Resources