Counting words within factors in R

I have millions of keywords in a column labeled Keyword.text. Each factor or keyword can contain multiple words (or shall we say tokens). Here is an example with 4 keywords:
Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9
I'd like to count the number of tokens in each Keyword, so as to obtain:
Keyword.length
5
4
1
4
I installed the tau package but I haven't gotten very far...
textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
returns an error I don't understand. Maybe it's due to having factors; it worked fine when practicing with a string.
I know how to do it in Excel, but it doesn't work for the last line. If A2 holds the keyword, then =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do it.
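For what it's worth, that Excel formula translates almost directly into base R (using the Keyword.text column from above; like the Excel version, it only counts spaces, so it also fails on the last line):

x <- as.character(Mydf$Keyword.text)
# LEN(TRIM(A2)) - LEN(SUBSTITUTE(A2," ","")) + 1, vectorized over the whole column
nchar(trimws(x)) - nchar(gsub(" ", "", x, fixed = TRUE)) + 1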

Edit: For a data frame and the total number of words per keyword, just use strsplit. There's no need for textcnt if you're not interested in the counts per word. That's where I got you wrong:
tt <- data.frame(
  a = rnorm(3),
  b = rnorm(3),
  c = c("the quick fox lazy", "rbrown+fr even", "what what goes & around"),
  stringsAsFactors = FALSE
)
# number of tokens in each element of tt$c
sapply(tt$c, function(n) {
  length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})
To read the data, also take a look at ?readLines and/or ?scan. These preserve the string format and allow you to process the file line by line (or row by row). If you use a file connection, you can even load the file in parts, which helps when you hit memory limits.
A simple example using readLines:
library(tau)
con <- textConnection("
The lazy fog+fog fog
never ended for fog jumping over the
fog whatever . $ plus.
")
# for a real file, use con <- file("myfile.txt") instead
Text <- readLines(con)
sapply(Text, textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
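And if you do hit memory limits, a rough sketch of the "load the file in parts" idea mentioned above (file name hypothetical):

con <- file("myfile.txt", open = "r")
while (length(chunk <- readLines(con, n = 10000)) > 0) {
  # process each batch of up to 10000 lines here
}
close(con)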
On a side note, using the option Dirk mentioned (stringsAsFactors=FALSE) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as shown above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=FALSE option and drop the as.character()).

Please show the error.
Also try:
require(tau)
textcnt(as.character(Mydf$Keyword.text), split, ....)
... to force character mode.
Or load your data with stringsAsFactors=FALSE -- the same question has come up here before.
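For example, something along these lines (file name and separator are assumptions; adjust to your data):

Mydf <- read.table("myfile.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
str(Mydf$Keyword.text)  # now character, not factor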

What about a nice little function that lets us decide which kind of words we would like to count and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE){
  # pseudo = TRUE counts every whitespace-separated token;
  # otherwise only runs of alphabetic characters count as words
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6
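And since str_count is vectorized, the function works on whole vectors unchanged:

nwords(c("one, two three", "four 5 six"))
# 3 2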

Related

Can't use loop on str_count() function

I have a data frame that has two columns, like this:
USER ID   text
1         "..."
2         "..."
...       ...
100       "..."
Let's say there are 100 users and each user has a text.
I want to count the proportion of texts that have question marks in them: for example, let's say I have only 20 texts in which there are question marks. That means the value I will get is 20/100 (I don't care how many question marks are within each text).
I tried to use str_count() and build a loop for it:
for (i in 1:length(data_frame$text)) {
  str_count(data_frame$text[i], pattern = "\\?")
}
but it's just not working; it's not even producing an error.
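(A likely reason nothing seems to happen: R only auto-prints results at the top level, not inside a for loop, so each str_count() value is computed and then discarded. A minimal fix is to store or print the result:

counts <- numeric(length(data_frame$text))
for (i in seq_along(data_frame$text)) {
  counts[i] <- str_count(data_frame$text[i], pattern = "\\?")
}

The answers below avoid the loop entirely.)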
If you want to find out whether there is a question mark in the string (dichotomize as 1/0), you could do this in base R:
df <- data.frame(id = 1:10,
text = c(LETTERS[1:5], paste0(LETTERS[1:5],"?")))
df$question_mark <- grepl("\\?", df$text)*1
You can find the proportion by:
sum(df$question_mark) / nrow(df)
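Since grepl already returns a logical vector and the mean of a logical vector is the share of TRUEs, the proportion can also be computed in one step:

mean(grepl("\\?", df$text))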
You may want to use stringr::str_detect(); you do not need a for loop.
Most of the str_* functions are vectorized, which is one of R's core strengths. (It is still a hidden for loop, of course, but it is implemented in C++, so it's much faster as well as easier to write.)
Consider:
df <- data.frame(test = c("asa", "asa?", "asa??", "asa???", "asa??"))
result <- paste0(sum(stringr::str_detect(df$test, "\\?")), "/", nrow(df))
print(result)
# [1] "4/5"

Alignment of multiple (non-biological, discrete state) sequences

I have some data that describes an ordered set of discrete events (or states). There are 34 possible states, which may occur in any order and may repeat. Each sequence of events can contain any number of events, and crucially there are more than 2 sequences of events. My eventual aim is to cluster these sequences into similar subsets, but my hunch is that this cannot be meaningful unless these sequences are aligned such that equivalent events occupy the same position in all sequences.
I'm very familiar with multiple alignment of biological sequences, but all the software I've come across for this (MUSCLE, MAFFT, T-COFFEE, Clustal*, etc.) requires DNA, RNA or AA sequences, and I have more states than any of those alphabets allow, so I can't get them to work.
I've found various implementations of the pairwise alignment algorithms such as Needleman-Wunsch in R, but so far haven't come across any generic (non-biological) implementations of any multiple sequence alignment algorithms.
For example, say my data looks like this:
1: ABCDEFG
2: ACDGH
3: BDEFEGI
4: AH
5: DEGHI
My aim is to have it look like this:
1: ABCDEF-G--
2: A-CD---GH-
3: -B-DEFE--I
4: A-------H-
5: ---DE--GHI
Where the - symbol denotes the absence of an event in this sequence. This is a simplified example, in reality I'm looking for something that penalises the opening of gaps (-) in the same way that biological sequence MSA algorithms do.
The only piece of software I've found that seems to possibly do this is Alphamalig (http://alggen.lsi.upc.es/recerca/align/alphamalig/intro-alphamalig.html) but it's old and I can't get it working on my machine. Ideally I'd like something that can be implemented in R.
I would advise using MAFFT sequence alignment. Typically, this is used to align biological sequences, but it has the option to align text using --anysymbol. Note that MAFFT is a bash script and requires an input/output file.
input file (mafft_anysymbol_input.txt):
>Seq1
ABCDEFG
>Seq2
ACDGH
>Seq3
BDEFEGI
>Seq4
AH
>Seq5
DEGHI
R code to run bash script:
#Be sure that input/output and R files share the same path, otherwise you'll have to specify the path in the mafft script call.
x <- 'mafft --anysymbol mafft_anysymbol_input.txt > mafft_anysymbol_output.txt'
system(x)
Contents of output file (mafft_anysymbol_output.txt):
>Seq1
ABCDEFG--
>Seq2
-ACDGH---
>Seq3
--BDEFEGI
>Seq4
----AH---
>Seq5
---DEGHI-
Edit - I see now that you are familiar with biological alignment tools. If you want to make a customized scoring matrix for your text alignments, check out mafft options --text and --textmatrix. It requires ascii code input (extra data type conversions), but you would have the option of associating similar letters (however you choose to define similar) by score. For example, you could associate upper and lowercase letters, or letters with/without accent marks.
Assuming that we need to match against LETTERS, one option is str_match, then change the NAs to -, and paste:
library(stringr)
library(tidyr)  # for replace_na
f1 <- Vectorize(function(x) str_match(x, LETTERS))
out1 <- f1(v1)
do.call(paste0, as.data.frame(t(replace_na(out1[!!rowSums(!is.na(out1)), ], '-'))))
#[1] "ABCDEFG--" "A-CD--GH-" "-B-DEFG-I" "A------H-" "---DE-GHI"
It can also be done with match after splitting:
lst <- strsplit(v1, "")
mx <- match(max(sapply(lst, tail, 1)), LETTERS)
sapply(lst, function(x)
  paste(replace_na(x[match(LETTERS[seq_len(mx)], x)], '-'), collapse = ""))
data
v1 <- c("ABCDEFG", "ACDGH", "BDEFEGI", "AH", "DEGHI")

R: replace text within a string by lookup table

I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new at programming, so please be patient :D...
I have a (in reality much larger) text string which look like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I've looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R to look at what's inside the brackets, look up that same value in the lookup table, and give back what's in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use regex lookarounds to match only numbers that are inside square brackets:
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test [3]."
Or without regex lookarounds, by pasting the square brackets onto each column of 'lookup':
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a two-column table) into a data frame. If your source information is a flat text file, then you can easily use read.csv to obtain a data frame. In the example below, I hard-code a data frame with just two entries. Then, I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test 5 test. Test 3."
Note: As @Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) happen to have replacements which are numbers also appearing as other markers. That is, if replacing a key with a value results in yet another key, there could be problems. If this is a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
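A hypothetical illustration of that pitfall: suppose 438 mapped to 82 while 82 maps to 3. The sequential loop then rewrites the first marker twice:

df <- data.frame(keys = c(438, 82), values = c(82, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
  string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl = TRUE)
}
string
# [1] "Test test [3] test. Test [3]."   ([438] became [82], then [3])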
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])

How to use apply + cat in R and get a variable-width result

I'm trying to use R to print some data in a custom fashion for use in a separate program. It keeps trying to left-pad my numbers, and I can't quite get it to stop. I have not been able to find anything in ?format, ?print, ?cat, etc. to quite fix my problem. Also, searching for "fixed width" or "variable width" turns up people looking to solve somewhat different problems (and most looking to change the style of padding -- not remove it).
Take the following data setup:
> df.play <- data.frame(name = c('a','b','c'), value = c(1,11,111))
> df.play
  name value
1    a     1
2    b    11
3    c   111
This is my desired output
#Goal
> for (j in seq(nrow(df.play))) {cat(as.character(df.play[j,1]),'.',df.play[j,2],'\n',sep='')}
a.1
b.11
c.111
How do I get this output format without explicitly looping (preferably avoiding external libraries)?
#Naive Attempt 1
# Why does it left pad the second "column"?
# How do I get it to stop?
# Why does cat even parse it as a column to begin with?
junk <- apply(df.play, 1, function(x) cat(x[1], '.', x[2], '\n', sep=''))
a.  1
b. 11
c.111
#Naive Attempt 2
# Perhaps this will help provide some insight.
# The number is a string before it gets to cat. Why?
t <- apply(df.play, 1, function(x) cat(x[1], '.', sprintf('%d', x[2]), '\n', sep=''))
Error in sprintf("%d", x[2]) :
invalid format '%d'; use format %s for character objects
Maybe this will do it:
cat(do.call(paste, c(df.play, list(sep = '.'))), sep = '\n')
# a.1
# b.11
# c.111
In addition, apply by row will give fixed-width results, because format adds extra spacing when the data.frame is converted to a matrix with as.matrix (see this post).
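You can see that padding by doing the conversion yourself; as.matrix runs format on the numeric column, which pads it to a common width (output sketched):

as.matrix(df.play)
#   name value
# 1 "a"  "  1"
# 2 "b"  " 11"
# 3 "c"  "111"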
You can use sprintf (not sure you wanted an sprintf solution, though); you just need to put a "-" sign before the field width to left-align, like below:
data.frame(value=sprintf("%-5s",paste0(df.play$name,".",df.play$value)))
Or a base R solution with gsub:
df <- data.frame(value =gsub("\\s+","",apply(df.play,1,paste0,collapse=".")))
data.frame(value1=sprintf("%-5s",df$value))
Or, in case you don't want paste0, we can use tidyr::unite as well:
df <- tidyr::unite(df.play,value,1:2,sep=".")
data.frame(value1=sprintf("%-5s",df$value))
Output:
value
1 a.1
2 b.11
3 c.111

In R, how to read file with custom end of line (eol)

I have a text file to read in R (and store in a data.frame). The file is organized in several rows and columns. Both "sep" and "eol" are customized.
Problem: the custom eol, i.e. "\t&nd" (without quotations), can't be set in read.table(...) (or read.csv(...), read.csv2(...), ...) nor in fread(...), and I haven't been able to find a solution.
I've searched here ("[r] read eol" and other queries I don't remember) and haven't found a solution: the only one was to preprocess the file to change the eol (not possible in my case because some fields can contain things like \n, \r, \n\r, ", ..., and this is the reason for the customization).
Thanks!
You could approach this two different ways:
A. If the file is not too wide, you can read your desired rows using scan and split it into your desired columns with strsplit, then combine into a data.frame. Example:
# Provide a reproducible example of the file ("raw.txt" here) you are starting with
your_text <- "a~b~c!1~2~meh!4~5~wow"
write(your_text, "raw.txt"); rm(your_text)
eol_str <- "!" # the character the rows divide on (scan's sep must be a single character)
sep_str <- "~" # whatever character(s) the columns divide on
# read and parse the text file:
# scan gives you an array of row strings (one string per row);
# sapply + strsplit turns that into a list of row arrays (one element per column)
row_list <- sapply(scan("raw.txt", what = character(), sep = eol_str),
                   strsplit, split = sep_str)
df <- data.frame(do.call(rbind, row_list[2:length(row_list)]))
row.names(df) <- NULL
names(df) <- row_list[[1]]
df
# a b c
# 1 1 2 meh
# 2 4 5 wow
B. If A doesn't work, I agree with @BondedDust that you probably need an external utility -- but you can invoke it in R with system() and do a find/replace to reformat your file for read.table. Your invocation will be specific to your OS. Example: https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands . Since you note that you have \n and \r\n in your text already, I recommend that you first find and replace them with temporary placeholders -- perhaps quoted versions of themselves -- and then convert them back after you have built your data.frame.
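If the file fits in memory, the placeholder idea can also stay inside R: since scan's sep must be a single character, read the whole file as one string and split on the multi-character eol yourself. A rough sketch (file name and the question's eol assumed):

raw <- readChar("myfile.txt", file.info("myfile.txt")$size)
rows <- strsplit(raw, "\t&nd", fixed = TRUE)[[1]]
# then split each row on the custom sep, as in A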
