Making a for loop for a character vector in R

char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport") # character vector
Suppose I have the above character vector.
I would like to create a for loop that prints only the elements of the vector that have more than 5 characters and start with a vowel,
and that also deletes from the vector the elements that do not start with a vowel.
I created this for loop, but it also prints empty results (character(0)):
for (i in char_vector){
if (str_length(i) > 5){
i <- str_subset(i, "^[AEIOUaeiou]")
print(i)
}
}
The result for the above is
[1] "Africa"
[1] "identical"
[1] "ending"
character(0)
character(0)
My desired result would only be the first 3 characters
I'm really new to R and facing huge difficulty with creating a for loop for this problem. Any help would be greatly appreciated!

Use grepl with the pattern ^[AEIOUaeiuo]\w{5,}$:
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
char_vector <- char_vector[grepl("^[AEIOUaeiuo]\\w{5,}$", char_vector)]
char_vector
[1] "Africa" "identical" "ending"
The regex pattern used here says to match words which:
^ from the start of the word
[AEIOUaeiuo] starts with a vowel
\w{5,} followed by 5 or more word characters (so the total length is > 5)
$ end of the word
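To see which words the pattern accepts, it can help to inspect the logical vector that grepl returns before using it to subset. A minimal sketch, reusing the vector from the question (the names pattern and keep are only illustrative):

```r
char_vector <- c("Africa", "identical", "ending", "aa", "bb", "rain", "Friday", "transport")

# Vowel at the start, then at least 5 more word characters up to the end
pattern <- "^[AEIOUaeiou]\\w{5,}$"

# grepl is vectorized: it returns one TRUE/FALSE per element
keep <- grepl(pattern, char_vector)
names(keep) <- char_vector
keep

# Subsetting with the logical vector keeps only the matching words
char_vector[keep]
# [1] "Africa"    "identical" "ending"
```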

You don't need a for loop, because R's string functions are vectorized.
A simple solution using grep and substr (refer to Tim Biegeleisen's answer for details):
substr(grep("^[aeiou].{5}", char_vector, ignore.case = TRUE, value = TRUE), 1, 3)
# [1] "Afr" "ide" "end"

With stringr functions, you'd rather use str_detect instead of str_subset, and you can take advantage of the fact that those functions are vectorized:
library(stringr)
char_vector[str_length(char_vector) > 5 & str_detect(char_vector, "^[AEIOUaeiou]")]
#[1] "Africa" "identical" "ending"
Or, if you want to keep the for loop and collect the results in a single vector:
vec <- c()
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")){
vec <- c(vec, i)
}
}
vec
# [1] "Africa" "identical" "ending"
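The same loop can also be written without stringr. The following is a base R sketch, assuming nchar and grepl as stand-ins for str_length and str_detect (the variable names are illustrative). It also shows where the character(0) lines in the question come from: "Friday" and "transport" pass the length test, and subsetting a string that does not match returns an empty vector, which the original loop then printed.

```r
char_vector <- c("Africa", "identical", "ending", "aa", "bb", "rain", "Friday", "transport")

# Subsetting a non-matching string gives an empty character vector
empty <- grep("^[AEIOUaeiou]", "Friday", value = TRUE)
empty
# character(0)

kept <- character(0)
for (word in char_vector) {
  long_enough <- nchar(word) > 5
  starts_with_vowel <- grepl("^[AEIOUaeiou]", word)
  # && is the scalar "and", which is what if() expects
  if (long_enough && starts_with_vowel) {
    print(word)
    kept <- c(kept, word)
  }
}
kept
# [1] "Africa"    "identical" "ending"
```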

The first 3 characters?
library(stringr)
for (i in char_vector){
if (str_length(i) > 5 && str_detect(i, "^[AEIOUaeiou]")) {
word <- str_sub(i, 1, 3)
print(word)
}
}
output is:
[1] "Afr"
[1] "ide"
[1] "end"

Using only base R functions. No need for a loop. I wrapped the steps in a function so you can use it with other character vectors. You could make this code shorter (see @utubun's answer), but I feel it is easier to understand the process with a "one line, one step" approach.
char_vector <- c("Africa", "identical", "ending" ,"aa" ,"bb", "rain" ,"Friday" ,"transport")
yourfun <- function(char_vector){
char_vector <- char_vector[nchar(char_vector) > 5] # keep only the strings that have more than 5 characters
char_vector <- char_vector[grep(pattern = "^[AEIOUaeiou]", char_vector)] # keep the strings that start with a vowel
return(char_vector) # return the matching strings
# To get the first three characters of each string instead, replace the return() above with:
# out <- substring(char_vector, 1, 3) # select only the first 3 characters of each string
# return(out)
}
yourfun(char_vector = char_vector)
#> [1] "Africa" "identical" "ending"
Created on 2022-05-09 by the reprex package (v2.0.1)

Related

String matching within a list of lists [duplicate]

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex do not return to me the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply call runs grep on each element of the list and returns the match positions within that element (integer(0) if there is no match). Comparing the result with != 0 and wrapping it in which gives you the indices of the list elements in which your string occurs.
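A variant of the same idea that does not rely on comparing a list with a number is to reduce each element to a single TRUE/FALSE first. This is a sketch rather than the answer's own code; has_match is an illustrative name:

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# For each list element, ask whether any of its strings matches exactly
has_match <- vapply(map_tmp, function(x) any(grepl("^HIJ$", x)), logical(1))
hij_index <- which(has_match)
hij_index
# [1] 2

kml_index <- which(vapply(map_tmp, function(x) any(grepl("^KML$", x)), logical(1)))
kml_index
# [1] 3
```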
Use either mapply or Map with str_detect to find the position. I have run it only for the string "KML"; you can run it for the others in the same way. I hope this is helpful.
First, pad the list elements to the same length so that they are easy to process:
library(stringr)
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
### Making the list even
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
I had the same question. I cannot explain why grep would work well in a list with characters but not with regex. Anyway, the best way I found to match a character string using common R script is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with similar structure as the input with 'NA' or '1', depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
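A likely explanation for the grep behaviour described above: grep coerces its input with as.character, so a list element that holds several strings becomes one deparsed string, and an anchored pattern such as "^KML$" can no longer match it. The sketch below shows the coerced value and, as an alternative that avoids regular expressions altogether, an exact match with %in% (the variable names are illustrative):

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# What grep actually searches: the third element is one deparsed string
coerced <- as.character(map_tmp)
coerced[3]
# [1] "c(\"KML\", \"ABC-IOP\")"

# Exact match inside each element, without regex
kml_index <- which(vapply(map_tmp, function(x) "KML" %in% x, logical(1)))
kml_index
# [1] 3
```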

Finding number of r's in the vector (Both R and r) before the first u

rquote <- "R's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
In the above code, we need to find the number of r's (both R and r) in rquote.
You could use substrings.
## find position of first 'u'
u1 <- regexpr("u", rquote, fixed = TRUE)
## get count of all 'r' or 'R' before 'u1'
lengths(gregexpr("r", substr(rquote, 1, u1), ignore.case = TRUE))
# [1] 5
This follows what you ask for in the title of the post. If you want the count of all the "r", case insensitive, then simplify the above to
lengths(gregexpr("r", rquote, ignore.case = TRUE))
# [1] 6
Then there's always stringi
library(stringi)
## count before first 'u'
stri_count_regex(stri_sub(rquote, 1, stri_locate_first_regex(rquote, "u")[,1]), "r|R")
# [1] 5
## count all R or r
stri_count_regex(rquote, "r|R")
# [1] 6
To get the number of R's before the first u, you need to make an intermediate step. (You probably don't need to. I'm sure akrun knows some incredibly cool regular expression to get the job done, but it won't be as easy to understand as this).
rquote <- "R's internals are irrefutably intriguing"
before_u <- gsub("u[[:print:]]+$", "", rquote)
length(stringr::str_extract_all(before_u, "(R|r)")[[1]])
You may try this,
> length(str_extract_all(rquote, '[Rr]')[[1]])
[1] 6
To get the count of all r's before the first u
> length(str_extract_all(rquote, perl('u.*(*SKIP)(*F)|[Rr]'))[[1]])
[1] 5
EDIT: Just saw before the first u. In that case, we can get the position of the first 'u' from either which or match.
Then use grepl in the 'chars' up to the position (ind) to find the logical index of 'R' with ignore.case=TRUE and use sum using the strsplit output from the OP's code.
ind <- which(chars=='u')[1]
Or
ind <- match('u', chars)
sum(grepl('r', chars[seq(ind)], ignore.case=TRUE))
#[1] 5
Or we can use sub and gsub on the original string ('rquote'). The inner call removes everything from the first u to the end of the string (u.*$), and the outer call matches all characters except R and r ([^Rr]) and replaces them with ''. We can then use nchar to count the characters that remain.
nchar(gsub('[^Rr]', '', sub('u.*$', '', rquote)))
#[1] 5
Or if we want to count the 'r' in the entire string, gregexpr to get the position of matching characters from the original string ('rquote') and get the length
length(gregexpr('[rR]', rquote)[[1]])
#[1] 6
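One caveat with counting via length(gregexpr(...)) is that gregexpr returns -1 when there is no match, so the length is 1 rather than 0. A sketch that also handles the no-match case, assuming a small helper (count_r is a hypothetical name) built on regmatches:

```r
rquote <- "R's internals are irrefutably intriguing"

# Count R/r by extracting the matches; no match gives character(0), i.e. 0
count_r <- function(x) {
  length(regmatches(x, gregexpr("[Rr]", x))[[1]])
}

# Drop everything from the first 'u' onwards, then count
before_u <- sub("u.*$", "", rquote)
count_r(before_u)
# [1] 5
count_r(rquote)
# [1] 6

# A string without any r: the helper gives 0, the plain length gives 1
count_r("xyz")
# [1] 0
lengths(gregexpr("[Rr]", "xyz"))
# [1] 1
```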

Get rid of repetitive characters from a column name in R

Here is a portion of my large dataframe
> a
SS29.SS29 PP1.PP1 SS4.SS4 CC43.CC43 FF57.FF57 NN23.NN23 MM25.MM25 KK9.KK9 MM55.MM55 AA75.AA75 SS88.SS88
1 669.9544 1.068153 35.86534 24.47688 1.058007 72.20306 1.854856 10.15414 0.08715572 0.02006310 0.1817582
2 651.2092 1.164428 37.59895 27.41381 1.095322 73.48029 1.927993 10.09958 0.09096972 0.02261701 0.1855258
How can I get rid of the doubled column names separated by a dot? E.g. for the first column I'd like to have SS29 instead of the repetitive SS29.SS29, for the second column PP1, and so on. Is there an automated way of doing it?
The simplest way would be to use sub to remove the substring after the dot . character.
names(a) <- sub('\\.[^.]*', '', names(a))
You could use sub
names(a) <- sub("[.](.*)", "", names(a))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or a substring
substring(names(a), 1, regexpr("[.]", names(a))-1)
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
or strsplit
names(a) <- unlist(strsplit(names(a), "[.](.*)"))
# [1] "SS29" "PP1" "SS4" "CC43" "FF57" "NN23"
# [7] "MM25" "KK9" "MM55" "AA75" "SS88"
You can assign new column names with
colnames(a) <- new_column_names
To compute new_column_names, you can use regular expressions, e.g. the gsub function, as ssdecontrol suggested.
new_column_names <- gsub(...)
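If some columns contain a dot without being repeated, a stricter pattern may be safer. The sketch below uses a backreference so that only names of the form X.X are shortened; the small data frame, including the other.name column, is made up for illustration:

```r
# Toy data frame with two doubled names and one name that merely contains a dot
a <- data.frame(SS29.SS29 = c(669.9544, 651.2092),
                PP1.PP1 = c(1.068153, 1.164428),
                other.name = c(1, 2))

# (.+) captures the first half; \\1 requires the same text after the dot
names(a) <- sub("^(.+)\\.\\1$", "\\1", names(a), perl = TRUE)
names(a)
# [1] "SS29"       "PP1"        "other.name"
```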

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
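For a data frame with named columns, mapply is an alternative to apply that walks the two columns in parallel. A minimal sketch, assuming a hypothetical data frame words with columns first and second, and collapsing the result into one string per row:

```r
# Hypothetical data; the column names are assumptions for illustration
words <- data.frame(first = c("carpet", "bag", "dog"),
                    second = c("carpelt", "flag", "dig"),
                    stringsAsFactors = FALSE)

# Split a single word into its characters
chars_for <- function(str) strsplit(str, "")[[1]]

# Letters that occur in the second word but not in the first
words$new_letters <- mapply(
  function(w1, w2) paste(setdiff(chars_for(w2), chars_for(w1)), collapse = ""),
  words$first, words$second,
  USE.NAMES = FALSE
)
words$new_letters
# [1] "l"  "fl" "i"
```

Because setdiff returns each letter once, a letter that is new and repeated in the second word is reported a single time.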
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those characters that are in y but not in x.
