I have tried several packages in order to build an R program that takes a text file as input and produces a list of the words in that file. Each word should have a vector with all the positions at which that word occurs in the file.
As an example, if the text file has the string:
"this is a nice text with nice characters"
The output should be something like:
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
I came across a useful post, http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-td4644053.html, but it does not include the positions of each word.
I found a similar function called "str_locate"; however, I want to count positions in "words" and not in "characters".
Any guidance on which packages or techniques to use for this would be really appreciated.
You can do this with base R (which, curiously, produces exactly your suggested output):
# data
x <- "this is a nice text with nice characters"
# split on whitespace
words <- strsplit(x, split = ' ')[[1]]
# find positions of every word
sapply(unique(words), function(x) which(x == words))
### result ###
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
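For text that comes from a file and contains capital letters or punctuation, the same idea can be sketched as follows. This is only a sketch: the sample string stands in for the file contents, and the choice to lower-case the text and split on anything that is not a letter or apostrophe is an assumption, not part of the answer above.

```r
# Hypothetical input; for a real file one could use
# txt <- paste(readLines("myfile.txt"), collapse = " ")
txt <- "This is a nice text, with nice characters."

# Lower-case the text and split on runs of characters that are not letters
# or apostrophes (an assumed definition of "word")
words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
words <- words[nzchar(words)]  # drop empty strings, e.g. from leading punctuation

# Group the word positions by word, keeping the order of first appearance
positions <- split(seq_along(words), factor(words, levels = unique(words)))
positions$nice
```

Using split() on seq_along(words) avoids scanning the whole vector once per unique word, which may matter for long files.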
I have a regex problem with handling strings in R.
I have data structure provided by RNAfold software that looks like this:
"....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
This is a typical secondary structure for miRNAs, but I also have other sequences that are not miRNAs, which look somewhat like this:
...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
This second sequence has two hairpin loops, one at the beginning and another one in the middle, whereas the first sequence just has one hairpin loop in the middle.
Dots (".") represent nucleotides that are not paired, while "(" represent nucleotides that are paired with their counterparts, represented as ")".
I want to split this string so that I can get the stems in the structure.
The output I would like to obtain is:
Input:
[1] "....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
Output:
[1] "....(((..((((((((.(((((((((((........."
[2] "))))))))))).))))))))..))).."
That way I can count the number of split strings and, from that, the number of stems.
The result for the second sequence would be:
Input:
[1] ...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
Output:
[1] "...((((....."
[2] "))))...........(((((((...((..(((..((((...((((((....."
[3] ")))).))...)))).))).))...)))))))......."
So in essence, what I want is to parse the strings so that they are split where they find a ")" symbol, preserving all the symbols of the string.
I have tried using strsplit() and some regex variations, but I haven't been able to find the trick...
Any help?
Thanks
You could use a lookahead to find an opening parenthesis that is immediately followed by dots and then a closing parenthesis.
x <- c("....(((..((((((((.(((((((((((..))))))))))).))))))))..)))..",
"...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
strsplit(x, "\\((?=(\\.+\\)))", perl = TRUE)
# [[1]]
# [1] "....(((..((((((((.((((((((((" "..))))))))))).))))))))..))).."
#
# [[2]]
# [1] "...(((" ".....))))...........(((((((...((..(((..((((...((((("
# [3] ".....)))).))...)))).))).))...)))))))......."
If you are looking to count characters, it might be more convenient to do this:
x <- "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...)))))))......."
with(rle(strsplit(x, "")[[1]]), setNames(lengths, values))
## . ( . ) . ( . ( . ( . ( . ( . ) . ) . ) . ) . ) . ) .
## 3 4 5 4 11 7 3 2 2 3 2 4 3 6 5 4 1 2 3 4 1 3 1 2 3 7 7
You can get close to the output you specified using DavidArenburg's logic with a twist. David uses a lookahead to find the ( that precedes the pattern .{N}), where N can be any number. A variable-length lookbehind (one whose pattern contains an unspecified number of characters) would be ideal, but it is not allowed. The trick is to reverse the string and use a variable-length lookahead, which then behaves much like a variable-length lookbehind would. Note that the parenthesis matched by the split pattern is consumed, so each split loses one ")".
Data
S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..", "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
Functions
reverse_string <- function(S) {
paste(rev(unlist(strsplit(S, ""))), collapse="")
}
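myfun <- function(S) {
# reverse the input (named rev_S rather than T, which would mask the shorthand for TRUE)
rev_S <- reverse_string(S)
# split the reversed string at each ")" that is followed by dots and a "("
result <- unlist(strsplit(rev_S, "\\)(?=(\\.+\\())", perl = TRUE))
# reverse every piece back and restore the original order
unname(rev(sapply(result, reverse_string)))
}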
Result
lapply(S, myfun)
# [[1]]
# [1] "....(((..((((((((.(((((((((((........."
# [2] ")))))))))).))))))))..))).."
# [[2]]
# [1] "...((((....."
# [2] ")))...........(((((((...((..(((..((((...((((((....."
# [3] "))).))...)))).))).))...)))))))......."
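Both strsplit() approaches above consume the parenthesis they split on. If every symbol has to be preserved, one alternative is to locate each hairpin loop with gregexpr() and cut the string with substring() just before the closing parenthesis. The following is a minimal sketch under that assumption; the function name split_at_hairpins is hypothetical and not from the answers above.

```r
# Hypothetical helper: cut a dot-bracket string just before the ")" that
# closes each hairpin loop, i.e. each "(" + dots + ")" run
split_at_hairpins <- function(s) {
  m <- gregexpr("\\(\\.+\\)", s)[[1]]
  if (m[1] == -1) return(s)  # no hairpin loop found, nothing to split
  # position of the closing parenthesis that ends each loop
  close_pos <- as.integer(m) + attr(m, "match.length") - 1L
  starts <- c(1L, close_pos)
  ends <- c(close_pos - 1L, nchar(s))
  substring(s, starts, ends)
}

S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..",
       "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
pieces <- lapply(S, split_at_hairpins)
lengths(pieces) - 1  # number of hairpin loops found in each string
```

Because nothing is removed, pasting the pieces back together reproduces the input string.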
I have a column of coordinates that I am splitting with strsplit() and cleaning of unwanted characters with gsub(). Note that there are 3034 rows.
> head(bike_parking$Geom)
[1] "(37.7606289177, -122.410647009)" "(37.752476948, -122.410625009)"
[3] "(37.7871729481, -122.402401009)" "(37.7776039475, -122.422764009)"
[5] "(37.7658325695, -122.46649784)" "(37.7693399479, -122.432820008)"
> length(bike_parking$Geom)
[1] 3034
> sum(is.na(bike_parking$Geom))
[1] 0
For some reason, after I run
dat <- data.frame(do.call(rbind, strsplit(as.vector(gsub("[()]", "", bike_parking$Geom)), split = ",")))
I am left with 3033 rows. How did that happen, and what steps can I take to figure out what went wrong?
> head(dat)
X1 X2
1 37.7606289177 -122.410647009
2 37.752476948 -122.410625009
3 37.7871729481 -122.402401009
4 37.7776039475 -122.422764009
5 37.7658325695 -122.46649784
6 37.7693399479 -122.432820008
> nrow(dat)
[1] 3033
It seems that your strings do not have the same structure everywhere. You will have to know which structure they all have in common to split them properly. From the comments below the question, I gather that some strings may not contain a comma separating the coordinates. You can remove all commas and split the strings at the space instead. I'll post a solution in base R and a solution with the stringr package.
Option 1: Base R:
We can remove the parentheses and commas from your strings by using gsub(). Then we can split the strings at the space using strsplit(). The result will be:
splitted <- strsplit(gsub("[(),]", "", bike_parking$Geom), " ")
# [[1]]
# [1] "37.7606289177" "-122.410647009"
# [[2]]
# [1] "37.752476948" "-122.410625009"
# [[3]]
# [1] "37.7871729481" "-122.402401009"
# [[4]]
# [1] "37.7776039475" "-122.422764009"
# [[5]]
# [1] "37.7658325695" "-122.46649784"
# [[6]]
# [1] "37.7693399479" "-122.432820008"
We have to reorganise these results a bit so that you end up with a two-column character matrix, which you can pass to data.frame() if needed:
sapply(1:2, function(x) sapply(splitted, `[[`, x))
# [,1] [,2]
# [1,] "37.7606289177" "-122.410647009"
# [2,] "37.752476948" "-122.410625009"
# [3,] "37.7871729481" "-122.402401009"
# [4,] "37.7776039475" "-122.422764009"
# [5,] "37.7658325695" "-122.46649784"
# [6,] "37.7693399479" "-122.432820008"
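If numeric columns are needed, the split pieces can be bound into a matrix and converted, roughly as in the sketch below. The two sample rows are copied from the output above; the column names lat and lon are assumptions.

```r
# Two sample elements, as strsplit() would return them
splitted <- list(c("37.7606289177", "-122.410647009"),
                 c("37.752476948", "-122.410625009"))

# Bind the pieces row-wise into a character matrix
coords <- do.call(rbind, splitted)

# Convert each column to numeric; lat/lon are assumed names
dat <- data.frame(lat = as.numeric(coords[, 1]),
                  lon = as.numeric(coords[, 2]))
str(dat)
```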
Option 2: stringr: This package contains a function str_split() (not strsplit()!) that lets you skip the last step of the base R solution, because with simplify = TRUE it returns a character matrix directly instead of a list of vectors:
library(stringr)
str_split(gsub("[(),]", "", bike_parking$Geom), " ", simplify = TRUE)
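To find out which rows caused the mismatch in the first place, it can help to count the pieces each string splits into before binding them. The sketch below uses made-up data: the empty string and the string without a comma are assumptions about what might be hiding in the real column.

```r
# Hypothetical data with two malformed entries
geom <- c("(37.7606289177, -122.410647009)",
          "",                                  # empty string (assumed)
          "(37.752476948 -122.410625009)",     # missing comma (assumed)
          "(37.7871729481, -122.402401009)")

parts <- strsplit(gsub("[()]", "", geom), split = ",")

# Number of pieces per row; every well-formed row should give 2
n_parts <- lengths(parts)
table(n_parts)

# Indices and contents of the rows that do not split into two pieces
bad <- which(n_parts != 2)
geom[bad]

# rbind() ignores zero-length vectors, so the empty string loses a row
nrow(do.call(rbind, parts))
```

An empty string splits into a zero-length vector, which rbind() skips, so a single empty entry would be enough to explain one missing row.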
I want to use str_view from stringr in R to find all the words that start with "y" and all the words that end with "x." I have a list of words generated by Corpora, but whenever I launch the code, it returns a blank view.
Common_words<-corpora("words/common")
#start with y
start_with_y <- str_view(Common_words, "^[y]", match = TRUE)
start_with_y
#finish with x
str_view(Common_words, "$[x]", match = TRUE)
Also, I would like to find the words that are only 3 letters long, but I have no ideas so far.
I'd say this is not about programming with stringr but learning some regex. Here are some sites I have found useful for learning:
http://www.regular-expressions.info/tutorial.html
http://www.rexegg.com/
https://www.debuggex.com/
Here \\w, the shorthand class for word characters (i.e., [A-Za-z0-9_]), is useful with quantifiers (+ and {3} in these two cases). PS: I use stringi here because stringr uses it as its backend anyway, so this just skips the middleman.
x <- c("I like yax because the rock to the max!",
"I yonx & yix to pick up stix.")
library(stringi)
stri_extract_all_regex(x, 'y\\w+x')
stri_extract_all_regex(x, '\\b\\w{3}\\b')
## > stri_extract_all_regex(x, 'y\\w+x')
## [[1]]
## [1] "yax"
##
## [[2]]
## [1] "yonx" "yix"
## > stri_extract_all_regex(x, '\\b\\w{3}\\b')
## [[1]]
## [1] "yax" "the" "the" "max"
##
## [[2]]
## [1] "yix"
EDIT: These may be of use too:
## Just words starting with y
stri_extract_all_regex(x, '\\by\\w*\\b')
## Just words ending with x
stri_extract_all_regex(x, '\\b\\w*x\\b')
## Words with 4 or more characters
stri_extract_all_regex(x, '\\b\\w{4,}\\b')
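For a vector that holds one word per element, as in the question, anchors are enough. In the question's second pattern, "$[x]" asks for an "x" after the end of the string, which can never match; the anchor has to come last, as in "x$". A minimal sketch in base R, with a made-up word vector standing in for the corpora list:

```r
# Hypothetical word list standing in for Common_words
words <- c("yes", "box", "yellow", "fix", "you", "taxi", "year", "relax")

starts_with_y <- grep("^y", words, value = TRUE)      # ^ anchors the start
ends_with_x   <- grep("x$", words, value = TRUE)      # $ anchors the end
three_letters <- grep("^.{3}$", words, value = TRUE)  # exactly three characters

starts_with_y
ends_with_x
three_letters
```

The same three patterns can be passed to str_view() in place of the ones in the question.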
For digits I have done this:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] class to detect punctuation. It matches
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either with gregexpr
x = c("we are friends!, Good Friends!!")
gregexpr("[[:punct:]]", x)
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear
counts = stringi::stri_count_fixed(x, punct)
Decorate the vector
setNames(counts, punct)
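If only the marks that actually occur are of interest, a base R alternative is to extract the matches and tabulate them. This is a sketch of that idea rather than the approach from the comments; it reuses the example string from above.

```r
x <- "we are friends!, Good Friends!!"

# Extract every punctuation character that occurs in x
marks <- regmatches(x, gregexpr("[[:punct:]]", x))[[1]]

# Total number of punctuation characters
length(marks)

# Count per punctuation mark
counts <- table(marks)
counts
```

Unlike the fixed vector of marks, this reports only the punctuation present in the string, so no zero counts appear.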
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.
I have a list written to a file created by sink(), "file.txt". The file contains a single list of numbers only, which looks like this:
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
How can I read the data back in as a list from such a file?
EDIT:
I'm going to try reading it as a string, then using a regex to remove '[[*]]' and replace '[*]' with a special symbol, say '#'. Then I would take every substring between '#' characters, split it into a vector, and put it into an empty list.
Something like this should do the trick. (The exact details may vary, but at least this will give you some ideas to work with.)
l <- readLines("file.txt")
l2 <- gsub("\\[{2}\\d+\\]{2}", "#", l) # Replace [[*]] with '#'
l3 <- gsub("\\[\\d+\\]\\s", "", l2)[-1] # Remove all [*]
l4 <- paste(l3, collapse=" ") # Paste together into one string
l5 <- strsplit(l4, "#")[[1]] # Break into list
lapply(l5, function(X) scan(text = X, quiet = TRUE)) # Use scan to convert to numeric
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
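A variant of the same idea that can be run without an existing file is sketched below. It writes sink()-style lines to a temporary file first, and it groups lines by their [[i]] header instead of pasting everything into one string, so long vectors that wrap over several printed lines should also be handled. The file contents and variable names here are assumptions for illustration.

```r
# Write a small sink()-style file to a temporary location (assumed contents)
tmp <- tempfile(fileext = ".txt")
writeLines(c("[[1]]", "[1] 1 2", "", "[[2]]", "[1] 1 2 3", ""), tmp)

lines <- readLines(tmp)

# Lines such as "[[1]]" start a new list element
is_header <- grepl("^\\[\\[\\d+\\]\\]$", lines)
group <- cumsum(is_header)

# Strip the leading "[1]" index from value lines and blank out the headers
body <- sub("^\\s*\\[\\d+\\]", "", lines)
body[is_header] <- ""

# Parse the numbers of each group with scan()
result <- unname(lapply(split(body, group), function(b) {
  scan(text = paste(b, collapse = " "), quiet = TRUE)
}))
result

unlink(tmp)  # remove the temporary file
```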