I am looking to split a string into n-grams of 3 characters, e.g. HelloWorld would become "Hel", "ell", "llo", "loW", etc.
How would I achieve this using R?
In Python it would take a loop using the range function, e.g. [myString[i:i+3] for i in range(len(myString) - 2)]
Is there a neat way to loop through the letters of a string using stringr (or another suitable function/package) to tokenize the word into a vector?
e.g.
dfWords <- c("HelloWorld", "GoodbyeMoon", "HolaSun") %>%
data.frame()
names(dfWords)[1] = "Text"
I would like to generate a new column which would contain a vector of the tokenized Text variable (preferably using dplyr). This can then be split later into new columns.
For others who come here, as I did, looking for the R equivalent of Python's range() function: I have found the answer.
It is the seq() function. A few examples say more than words. The usage is similar to Python, but note that seq() includes the end value and that seq(n) counts from 1:
> seq(from = 1, to = 5, by = 1)
[1] 1 2 3 4 5
> seq(from = 1, to = 6, by = 2)
[1] 1 3 5
> seq(5)
[1] 1 2 3 4 5
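As a small supplement to the seq() examples above, here is a sketch of the related base R helpers seq_len() and seq_along(), which are often the safer choice for loop indices. The variable names x and evens are only illustrative.

```r
# seq_len(n) counts 1..n and returns an empty vector when n is 0
seq_len(3)          # 1 2 3
seq_len(0)          # integer(0), so a for loop over it runs zero times
1:0                 # 1 0, so a for loop over it would run twice

# seq_along(x) gives the indices of x, a common replacement for 1:length(x)
x <- c("a", "b", "c")
seq_along(x)        # 1 2 3

# Python's range(0, 10, 2) stops before 10; seq() includes its end value
evens <- seq(from = 0, to = 8, by = 2)   # 0 2 4 6 8
```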
In base R you could do something like this
ss <- "HelloWorld"
len <- 3
lapply(seq_len(nchar(ss) - len + 1), function(x) substr(ss, x, x + len - 1))
#[[1]]
#[1] "Hel"
#
#[[2]]
#[1] "ell"
#
#[[3]]
#[1] "llo"
#
#[[4]]
#[1] "loW"
#
#[[5]]
#[1] "oWo"
#
#[[6]]
#[1] "Wor"
#
#[[7]]
#[1] "orl"
#
#[[8]]
#[1] "rld"
Explanation: The approach is a basic sliding window method to extract substrings from ss. The return object is a list.
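If a plain character vector is preferred over a list, the same sliding window can be wrapped in a small function. This is only a sketch: char_ngrams is a hypothetical helper name, and the guard for strings shorter than the window is an added assumption about the desired behaviour.

```r
# Hypothetical helper: all character n-grams of a single string
char_ngrams <- function(s, n = 3) {
  n_windows <- nchar(s) - n + 1
  # strings shorter than n have no complete window
  if (n_windows < 1) return(character(0))
  starts <- seq_len(n_windows)
  # substring() is vectorised over the start and stop positions
  substring(s, starts, starts + n - 1)
}

tri <- char_ngrams("HelloWorld")
short <- char_ngrams("Hi")
tri
```

Because the result is an ordinary character vector, it can be stored in a list column with lapply(df$words, char_ngrams).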
Another (sliding window) alternative could be zoo::rollapply with strsplit
library(zoo)
len <- 3
rollapply(unlist(strsplit(ss, "")), len, paste, collapse = "")
[1] "Hel" "ell" "llo" "loW" "oWo" "Wor" "orl" "rld"
In response to your comment/edit, here's a tidyverse option
# Sample data
df <- data.frame(words = c("HelloWorld", "GoodbyeMoon", "HolaSun"))
library(tidyverse)
library(zoo)
df %>% mutate(lst = map(str_split(words, ""), function(x) rollapply(x, len, paste, collapse = "")))
# words lst
#1 HelloWorld Hel, ell, llo, loW, oWo, Wor, orl, rld
#2 GoodbyeMoon Goo, ood, odb, dby, bye, yeM, eMo, Moo, oon
#3 HolaSun Hol, ola, laS, aSu, Sun
Suppose I have the following
X <- "1,2,3,4,5"
How do I get the sequence of numeric values
#[1] 1 2 3 4 5
I've already seen this example https://statisticsglobe.com/convert-character-to-numeric-in-r/ But it doesn't quite match with the problem above.
This is one way of doing this:
library(stringr)
l="1,2,3,4,5"
as.numeric(str_split(l, ',', simplify = TRUE))
1) scan will convert such a string to a numeric vector. Omit the quiet argument if you would like it to report the length of the result. No packages are used.
x <- "1,2,3,4,5"
scan(text = x, sep = ",", quiet = TRUE)
## [1] 1 2 3 4 5
2) If what you have is actually a vector of comma-separated character strings, xx, and a list of numeric vectors is wanted, then lapply over them.
xx <- c(x, x)
lapply(xx, function(x) scan(text = x, sep = ",", quiet = TRUE))
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] 1 2 3 4 5
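For comparison, a base R sketch of the same conversion using strsplit() and as.numeric() instead of scan(). The names parts, nums and nums_list are illustrative only.

```r
x <- "1,2,3,4,5"

# strsplit() returns a list with one element per input string
parts <- strsplit(x, ",", fixed = TRUE)[[1]]
nums <- as.numeric(parts)

# the same idea for a vector of strings gives a list of numeric vectors
xx <- c("1,2,3", "10,20")
nums_list <- lapply(strsplit(xx, ",", fixed = TRUE), as.numeric)
```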
Sadly, I was unable to work out how to do this in R, but the idea seems simple.
What I want is a list of pairs of numbers within a range, where the first pair consists of the start value and the start value plus the maximum length. In the end I should have something like:
somefun <- function(start, end, step){...}
l <- somefun (5, 30, 5)
l
#[[1]]
#[[1]][[1]]
#[1] 5
#
#[[1]][[2]]
#[1] 10
#
#[[2]]
#[[2]][[1]]
#[1] 11
#
#[[2]][[2]]
#[1] 16
#
#[[3]]
#[[3]][[1]]
#[1] 17
#
#[[3]][[2]]
#[1] 22
#
#[[4]]
#[[4]][[1]]
#[1] 23
#
#[[4]][[2]]
#[1] 28
#
#[[5]]
#[[5]][[1]]
#[1] 29
#[[5]][[2]]
#[1] 30
So, the final list should have the first start and the last end values, but the difference within each list shouldn't be larger than the step.
Also, I don't know if it is the best way, but my objective is to pass these values with lapply to build a plot using grid with gridExtra::grid.arrange
So the list should fit in this code
p_list = lapply(myRanges, function(a,b){
my_gg_function(myData[a:b], font=f)
})
do.call(gridExtra::grid.arrange, c(p_list, ncol=2))
Thanks in advance
How about this
somefun <- function(start, end, step){
starts <- seq(start, end, step+1)
ends <- pmin(starts + step, end)
mapply(list, starts, ends, SIMPLIFY = FALSE)
}
somefun(5, 30, 5)
We just use a basic seq() and trim as needed.
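One way the resulting list of pairs could be consumed with lapply() is sketched below. Note that lapply() passes a single argument to its function, so each pair arrives as one list and is unpacked inside. my_data is a hypothetical stand-in for the real data, and the plotting call is left out.

```r
somefun <- function(start, end, step) {
  starts <- seq(start, end, step + 1)
  ends <- pmin(starts + step, end)
  mapply(list, starts, ends, SIMPLIFY = FALSE)
}

my_data <- 1:30                      # hypothetical stand-in for the real data
my_ranges <- somefun(5, 30, 5)

# each element r is a pair list(start, end); unpack it inside the function
chunks <- lapply(my_ranges, function(r) my_data[r[[1]]:r[[2]]])
lengths(chunks)
```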
I would like to find the location of a character in a string.
Say: string = "the2quickbrownfoxeswere2tired"
I would like the function to return 4 and 24 -- the character location of the 2s in string.
You can use gregexpr
gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired")
[[1]]
[1] 4 24
attr(,"match.length")
[1] 1 1
attr(,"useBytes")
[1] TRUE
or perhaps str_locate_all from package stringr, which is a wrapper for stringi::stri_locate_all (as of stringr version 1.0; earlier versions wrapped gregexpr)
library(stringr)
str_locate_all(pattern ='2', "the2quickbrownfoxeswere2tired")
[[1]]
start end
[1,] 4 4
[2,] 24 24
note that you could simply use stringi
library(stringi)
stri_locate_all("the2quickbrownfoxeswere2tired", fixed = '2')
Another option in base R would be something like
lapply(strsplit(x, ''), function(x) which(x == '2'))
should work (given a character vector x)
Here's another straightforward alternative.
> which(strsplit(string, "")[[1]]=="2")
[1] 4 24
You can make the output just 4 and 24 using unlist:
unlist(gregexpr(pattern ='2',"the2quickbrownfoxeswere2tired"))
[1] 4 24
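gregexpr() reports -1 when there is no match, which is easy to mistake for a position. A minimal sketch of a wrapper that returns an empty vector in that case follows; char_positions is a hypothetical name, and fixed = TRUE is an added assumption so that the search string is treated literally rather than as a regular expression.

```r
# Hypothetical helper: positions of a literal substring in a single string
char_positions <- function(string, ch) {
  pos <- gregexpr(ch, string, fixed = TRUE)[[1]]
  # gregexpr() signals "no match" with -1
  if (pos[1] == -1L) return(integer(0))
  as.integer(pos)   # drops the match.length and other attributes
}

hits <- char_positions("the2quickbrownfoxeswere2tired", "2")
none <- char_positions("abc", "2")
```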
Find the position of the nth occurrence of str2 in str1 (same order of parameters as Oracle SQL INSTR); returns 0 if not found.
instr <- function(str1, str2, startpos = 1, n = 1) {
  # literal (non-regex) match positions in the part of str1 from startpos onwards
  pos <- gregexpr(str2, substring(str1, startpos), fixed = TRUE)[[1]]
  if (pos[1] == -1 || length(pos) < n) return(0)
  # shift back to a position in the full string
  pos[n] + startpos - 1
}
instr('xxabcdefabdddfabx','ab')
[1] 3
instr('xxabcdefabdddfabx','ab',1,3)
[1] 15
instr('xxabcdefabdddfabx','xx',2,1)
[1] 0
To only find the first locations, use lapply() with min():
my_string <- c("test1", "test1test1", "test1test1test1")
unlist(lapply(gregexpr(pattern = '1', my_string), min))
#> [1] 5 5 5
# or the readable tidyverse form
my_string %>%
gregexpr(pattern = '1') %>%
lapply(min) %>%
unlist()
#> [1] 5 5 5
To only find the last locations, use lapply() with max():
unlist(lapply(gregexpr(pattern = '1', my_string), max))
#> [1] 5 10 15
# or the readable tidyverse form
my_string %>%
gregexpr(pattern = '1') %>%
lapply(max) %>%
unlist()
#> [1] 5 10 15
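For the first-location case there is also a shorter route, sketched here: regexpr() (as opposed to gregexpr()) only ever returns the first match per string, so no lapply() is needed. The name first_pos is illustrative, and fixed = TRUE is an added assumption to treat the pattern literally.

```r
my_string <- c("test1", "test1test1", "test1test1test1")

# regexpr() is vectorised over the strings and gives the first match in each;
# a string without a match would yield -1
first_pos <- as.integer(regexpr("1", my_string, fixed = TRUE))
first_pos
```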
You could use grep as well:
grep('2', strsplit(string, '')[[1]])
#4 24
In python we can do this..
numbers = [1, 2, 3]
characters = ['foo', 'bar', 'baz']
for item in zip(numbers, characters):
print(item[0], item[1])
(1, 'foo')
(2, 'bar')
(3, 'baz')
We can also unpack the tuple rather than using the index.
for num, char in zip(numbers, characters):
print(num, char)
(1, 'foo')
(2, 'bar')
(3, 'baz')
How can we do the same using base R?
To do something like this in an R-native way, you'd use the idea of a data frame. A data frame has multiple variables which can be of different types, and each row is an observation of each variable.
d <- data.frame(numbers = c(1, 2, 3),
characters = c('foo', 'bar', 'baz'))
d
## numbers characters
## 1 1 foo
## 2 2 bar
## 3 3 baz
You then access each row using matrix notation, where leaving an index blank includes everything.
d[1,]
## numbers characters
## 1 1 foo
You can then loop over the rows of the data frame to do whatever you want to do. Presumably you actually want to do something more interesting than printing.
for(i in seq_len(nrow(d))) {
print(d[i,])
}
## numbers characters
## 1 1 foo
## numbers characters
## 2 2 bar
## numbers characters
## 3 3 baz
For another option, how about mapply, which is the closest analog to zip I can think of in R. Here I'm using the c function to make a new vector, but you could use any function you'd like:
numbers<- c(1, 2, 3)
characters<- c('foo', 'bar', 'baz')
mapply(c,numbers, characters, SIMPLIFY = FALSE)
[[1]]
[1] "1" "foo"
[[2]]
[1] "2" "bar"
[[3]]
[1] "3" "baz"
Which way is of most use depends on what you want to do with your output, but as the other answers mention, a dataframe is the most natural approach in R (and pandas dataframe probably in python).
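One detail worth noting about the mapply example: c() coerces each pair to a single type, which is why the numbers come back as strings. A sketch that keeps the original types by pairing with list() instead, using Map() (a thin wrapper around mapply with SIMPLIFY = FALSE); the name pairs is illustrative.

```r
numbers <- c(1, 2, 3)
characters <- c("foo", "bar", "baz")

# list() keeps each element's own type, unlike c()
pairs <- Map(list, numbers, characters)

for (p in pairs) {
  cat(p[[1]], p[[2]], "\n")
}
```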
To index a vector x in R, use x[1], which returns the first element of the vector. R element numbering starts at 1, in contrast to Python, which starts at 0.
For this problem it would be:
x = seq(1,10)
j = seq(11,20)
for (i in seq_along(x)){
print (c(x[i],j[i]))
}
Many functions in R are vectorized and don't require loops:
numbers = c(1, 2, 3)
characters = c('foo', 'bar', 'baz')
myList <- list(numbers, characters)
myDF <- data.frame(numbers,characters, stringsAsFactors = F)
print(myList)
print(myDF)
This is the conceptual equivalent (the output below is for numbers <- 1:5 and characters <- letters[1:5]):
for (item in Map(list,numbers,characters)){
print(item[c(1,2)])
}
# [[1]]
# [1] 1
#
# [[2]]
# [1] "a"
#
# [[1]]
# [1] 2
#
# [[2]]
# [1] "b"
#
# [[1]]
# [1] 3
#
# [[2]]
# [1] "c"
#
# [[1]]
# [1] 4
#
# [[2]]
# [1] "d"
#
# [[1]]
# [1] 5
#
# [[2]]
# [1] "e"
Though most of the time you would actually do all your work inside Map and do something like this:
Map(function(nu,ch){print(data.frame(nu,ch))},numbers,characters)
This is the closest I could get to a clone:
zip <- function(...){ Map(list,...)}
print2 <- function(...){do.call(cat,c(list(...),"\n"))}
for (item in zip(numbers,characters)){
print2(item[[1]],item[[2]])
}
# 1 a
# 2 b
# 3 c
# 4 d
# 5 e
To be able to call items by their names (indices still work):
zip <- function(...){
names <- sapply(substitute(list(...))[-1],deparse)
Map(function(...){setNames(list(...),names)}, ...)
}
for (item in zip(numbers,characters)){
print2(item[["numbers"]],item[["characters"]])
}
The tidyverse solution would be to use purrr::map2 function. Ex:
numbers <- c(1, 2, 3)
characters <- c('foo', 'bar', 'baz')
map2(numbers, characters, ~paste0(.x, ',', .y))
#[[1]]
#[1] "1,foo"
#[[2]]
#[1] "2,bar"
#[[3]]
#[1] "3,baz"
See the purrr documentation for map2 for details.
Other scalable alternatives: Store the vectors in the list and iterate over.
vect1 <- c('foo', 'bar', 'baz')
vect2 <- c('a', 'b', 'c')
idx_list <- list(vect1, vect2)
idx_vect <- seq_along(idx_list[[1]])
for(i in idx_vect){
x <- idx_list[[1]][i]
j <- idx_list[[2]][i]
print(c(i, x, j))
}
NOTE: I have updated the question to reflect specific patterns in the data.
Say that I have two vectors.
names_data <- c('A', 'B', 'C', 'D', 'E', 'F')
levels_selected <- c('A1','A3', 'Blow', 'Bhigh', 'D(4.88e+03,9.18+e+04]', 'F')
I want to know how to get a vector, a data frame, a list, or whatever, that checks the levels vector and returns which levels of which variables were selected. Something that says:
A: 1, 3
B: low, high
D: (4.88e+03,9.18e+04]
Ultimately, there is a data frame X for which names_data = names(data) and levels_selected are some, but not all, of the levels in each of the variables. In the end what I want to do is to make a matrix (for, say for example, a random forest) using model.matrix where I want to include only the variables AND levels in levels_selected. Is there a straightforward way of doing so?
We can create a grouping variable ('grp') by keeping only the part of each element of 'levels_selected' that matches one of the 'names_data', remove that prefix to get the values, and then split the values by 'grp' to get a list.
grp <- sub(paste0("^(", paste(names_data, collapse="|"), ").*"), "\\1", levels_selected)
value <- gsub(paste(names_data, collapse="|"), "",
levels_selected)
lst <- split(value, grp)
lst
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "x"
If a tabulated form is what is meant, something like
library(qdapTools)
mtabulate(lst)
# 1 3 high low x
#A 1 1 0 0 0
#B 0 0 1 1 0
#D 0 0 0 0 1
Or another option is using strsplit
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
aggregate(V2~V1, d1, FUN= toString)
# V1 V2
#1 A 1, 3
#2 B low, high
#3 D x
and possibly the model.matrix would be
model.matrix(~V1+V2-1, d1)
Update
By using the OP's new example
d1 <- as.data.frame(do.call(rbind, strsplit(levels_selected,
paste0("(?<=(", paste(names_data, collapse="|"), "))"),
perl=TRUE)), stringsAsFactors=FALSE)
split(d1$V2, d1$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
It is also working with the first method.
Update2
If there are no characters that follow the elements in 'names_data', we can filter them out
lst <- strsplit(levels_selected, paste0("(?<=(", paste(names_data,
collapse="|"), "))"), perl = TRUE)
d2 <- as.data.frame(do.call(rbind,lst[lengths(lst)==2]), stringsAsFactors=FALSE)
split(d2$V2, d2$V1)
#$A
#[1] "1" "3"
#$B
#[1] "low" "high"
#$D
#[1] "(4.88e+03,9.18+e+04]"
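For comparison, a regex-free sketch of the same split using startsWith(), which may be easier to reason about when variable names or levels contain characters that are special in regular expressions. split_levels is a hypothetical helper name, and the sketch assumes each level begins with exactly its variable name (so one name being a prefix of another would need extra handling).

```r
names_data <- c("A", "B", "C", "D", "E", "F")
levels_selected <- c("A1", "A3", "Blow", "Bhigh", "D(4.88e+03,9.18+e+04]", "F")

# Hypothetical helper: group levels by the variable name they start with
split_levels <- function(var_names, levels) {
  out <- lapply(var_names, function(nm) {
    hit <- startsWith(levels, nm)
    # strip the variable name, keeping only the level part
    rest <- substring(levels[hit], nchar(nm) + 1)
    # drop entries with nothing after the name, such as "F"
    rest[nzchar(rest)]
  })
  names(out) <- var_names
  # keep only variables that have at least one selected level
  out[lengths(out) > 0]
}

res <- split_levels(names_data, levels_selected)
res
```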
An option that returns a list with levels as a vector stored under each corresponding name:
> setNames(lapply(names_data, function(x) gsub(x, "", levels_selected[grepl(x, levels_selected)])), names_data)
$A
[1] "1" "3"
$B
[1] "low" "high"
$C
character(0)
$D
[1] "x"
$E
character(0)
So this is a handy little function I extended from the regexpr help example, using perl-style regex
parseAll <- function(data, pattern) {
result <- gregexpr(pattern, data, perl = TRUE)
do.call(rbind,lapply(seq_along(data), function(i) {
if(any(result[[i]] == -1)) return("")
st <- data.frame(attr(result[[i]], "capture.start"))
le <- data.frame(attr(result[[i]], "capture.length") - 1)
mapply(function(start,leng) substring(data[i], start, start + leng), st, le)
}))
}
EDIT: It's extended in that it finds multiple matches of the pattern, so you can look for, say, several occurrences per line. A pattern like "(?<first>[[:upper:]][[:lower:]]+) (?<last>[[:upper:]][[:lower:]]+)" (from the original regexpr help) then finds all instances of the pattern in each string, rather than just one.
suppose I had data that looked like this:
dat <- c('A1','A2','A3','B3')
I could then search this data via
parseAll(dat,'A(?<A>.*)|B(?<B>.*)') to get a matrix with the levels selected:
parseAll(dat,'A(?<A>.*)|B(?<B>.*)')
A B
[1,] "1" ""
[2,] "2" ""
[3,] "3" ""
[4,] "" "3"
and which selection had each level (though that may not be useful to you). I can also generate these patterns programmatically from your vectors:
pattern <- paste(paste0(names_data,'(?<',names_data,'>.*)'),collapse = '|')
Then your selected levels are the unique elements of each column (the result is a matrix, so the conversion to a list is easy enough).
This is my omnitool for this kind of stuff; hope it's handy.