Find the index of a list element with zero characters in R

I have a list of character vectors, and after some lines of code my list has an element with zero characters. How can I extract the index of the element which has zero characters?
Original list
blocks <- list(
c("A", "B"),
c("C","D", "E", "R", "T"),
c("X"),
c("N")
)
Transformed list
blocks <- list(
character(0),
c("C","D", "E", "R", "T"),
c("X"),
c("N")
)

Not sure what you want, but I guess grep can do the trick:
if you want to know in which element of the list a letter is, use grep('A', blocks)
if you want to know the position of a letter in the whole list, you can try grep('A', unlist(blocks))
if you want something else, well, try it as well!

If we want to get a logical index of elements that have character(0), we can use lengths on the second 'blocks'
!lengths(blocks)
#[1] TRUE FALSE FALSE FALSE
lengths is a convenient wrapper for sapply(blocks, length) and is much faster.
lengths(blocks)
#[1] 0 5 1 1
returns a length of 0 for the first list element. By negating (!), the 0 gets coerced to TRUE and all the others to FALSE.
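If you need the numeric index rather than a logical one, a small follow-up (my own sketch, not part of the original answer) is to wrap the same test in which():
which(lengths(blocks) == 0)
#[1] 1
which() returns the positions of the TRUE values, so here it points at the first (empty) element.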

Related

Generate all possible combinations of a text string with two specific letters substituted for each other in R

Using R, I have generated several strings of letters that range from 6-25 characters. For each one, I'd like to generate an output that consists of all the combinations of the string with every "I" substituted for an "L" and vice versa; the order of the characters should stay the same.
For example:
Input
"IVGLWEA"
Output
"IVGLWEA"
"LVGLWEA"
"LVGIWEA"
"IVGIWEA"
many thanks
rob
Edit: Thanks to @Skaqqs for the dynamic solution!
string <- "IVGLWEA"
# find the number of I's and L's in the string
n <- length(unlist(gregexpr("I|L", string)))
# make a grid of all possible combinations with this amount of I's and L's
df <- expand.grid(rep(list(c("I", "L")), n))
# replace I's and L's with %s
string_ <- gsub("I|L", "\\%s", string)
# replace %s with letters in grid
do.call(sprintf, as.list(c(string_, df)))
Result:
[1] "IVGIWEA" "LVGIWEA" "IVGLWEA" "LVGLWEA"
Here's an extremely inefficient (but concise!) approach:
Create all 7^7 = 823,543 potential combinations of your input characters and use regex to extract the desired pattern.
pattern <- "(I|L)VG(I|L)WEA"
b <- c("I", "V", "G", "L", "W", "E", "A")
strings <- apply(expand.grid(rep(list(b), 7)), 1, paste0, collapse = "")
grep(pattern, strings, value = TRUE)
[1] "IVGIWEA" "LVGIWEA" "IVGLWEA" "LVGLWEA"

Problem with regex (check string for certain repetitions)

I would like to check whether a text contains a) three consonants in a row or b) four identical letters in a row. Can someone please help me with the regular expressions?
library(tidyverse)
df <- data.frame(text = c("Completely valid", "abcdefg", "blablabla", "flahaaaa", "asdf", "another text", "a last one", "sj", "ngbas"))
consonants <- c("q", "w", "r", "t", "z", "p", "s", "d", "f", "g", "h", "k", "l", "m", "n", "b", "x")
df %>% mutate(
invalid = FALSE,
# Length too short
invalid = ifelse(nchar(text)<3, TRUE, invalid),
# Contains three consonants in a row: e.g. "ngbas"
invalid = ifelse(str_detect(text,"???"), TRUE, FALSE), # <--- Regex missing
# More than 3 identical characters in a row: e.g. "flahaaaa"
invalid = ifelse(str_detect(text,"???"), TRUE, FALSE) # <--- Regex missing
)
Three consonants in a row:
[qwrtzpsdfghklmnbx]{3}
Sequences of length > 3 of a specific char:
([a-z])(\\1){3}
# The double backslash is needed because the backslash is the escape character in R strings.
The latter uses a backreference. The number is the ordinal number of the capture group (= the expression in parentheses) being referenced - in this case the character class of Latin lowercase letters.
For clarity, character case is not taken into account here.
Without backreferences, the solution gets a bit lengthy:
(aaaa|bbbb|cccc|dddd|eeee|ffff|gggg|hhhh|iiii|jjjj|kkkk|llll|mmmm|nnnn|oooo|pppp|qqqq|rrrr|ssss|tttt|uuuu|vvvv|wwww|xxxx|yyyy|zzzz)
The relevant docs can be found here.
You don't need to check the length of the word; the regexes will handle that for you.
There is an error in your code: each later ifelse overwrites the previous result. For example, if the second ifelse is TRUE and the third is FALSE, the output will be FALSE, so you are effectively making an AND condition. Passing invalid (rather than FALSE) as the "else" value fixes this.
I have corrected that error.
Here is the complete code:
df %>% mutate(
invalid = FALSE,
# Contains three consonants in a row: e.g. "ngbas"
invalid = ifelse(str_detect(text, regex("[BCDFGHJKLMNPQRSTVWXYZ]{3}", ignore_case = TRUE)), TRUE, invalid),
# More than 3 identical characters in a row: e.g. "flahaaaa"
invalid = ifelse(str_detect(text, regex("([a-zA-Z])\\1{3}", ignore_case = TRUE)), TRUE, invalid)
)
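As a quick sanity check (my own addition, using example strings from the question), the two patterns behave as intended on their own:
str_detect("ngbas", regex("[BCDFGHJKLMNPQRSTVWXYZ]{3}", ignore_case = TRUE))
# [1] TRUE ("ngb" is three consonants in a row)
str_detect("flahaaaa", regex("([a-zA-Z])\\1{3}", ignore_case = TRUE))
# [1] TRUE ("aaaa" is four identical letters)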

Detect in list2 if there are any strings (whole or part of a bigger string) that are contained in list1

I have two lists:
list1<-list("q","w","e","r","t")
list2<-list("a","a","aq","c","f","g")
I need code that will return TRUE because "q" is part of the third element of list2. I need to search every element of list2 for any of the strings in list1. Matching should work both for whole matches and for partial matches (when a string from list1 is part of a bigger string in list2), and in both cases I need to receive TRUE.
Not sure the list input is particularly important here. Here is a way that avoids using any iteration functions like apply: we can collapse the first list into a single regular expression pattern and then check the whole of the second list against it. You may need to be careful if you have any special characters in list1, though that is the case for any string-matching method (see the sketch after the reprex below).
library(stringr)
list1 <- list("q", "w", "e", "r", "t")
list2 <- list("a", "a", "aq", "c", "f", "g")
pat <- unlist(list1) %>% str_c(collapse = "|")
list2 %>%
unlist %>%
str_detect(pat) %>%
any
#> [1] TRUE
Created on 2019-05-16 by the reprex package (v0.2.1)
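As a hedged illustration of that caveat (not part of the original answers): if list1 could contain regex metacharacters such as ".", you can fall back to literal matching with fixed = TRUE instead of building a pattern:
list1_special <- list("q", ".") # "." would otherwise match any character
any(sapply(list1_special, grepl, unlist(list2), fixed = TRUE))
# [1] TRUE (TRUE only because "q" is part of "aq"; "." no longer matches everything)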
any(sapply(list1, grepl, list2))
# [1] TRUE
Or equivalently
greplv <- Vectorize(grepl, 'pattern')
any(greplv(list1, list2))
# [1] TRUE

Using a list's assigned name from a character string in a vector

I have some lists:
my_list1 <- list("data" = list(c("a", "b", "c")), "meta" = list(c("a", "b")))
my_list2 <- list("data" = list(c("x", "y", "z")), "meta" = list(c("x", "y")))
I'd like to be able to perform some operations on these lists but I need to use the names of the lists stored in a vector as I'm creating them dynamically from an API call. Such a vector might be:
list_vec <- c("my_list1", "my_list2")
I'm running into problems turning the character string in the vector into the name of the list. I know this topic has been covered, but the part I'm stuck on specifically is being able to extract just the data sublist when running functions within assign. Essentially a situation like this:
library(purrr)
for(i in seq_along(1:length(list_vec))){
assign(list_vec[[i]], map_df(list_vec[[i]][["data"]], unlist))
}
Which would give a result of:
# A tibble: 3 x 1
data
<chr>
1 a
2 b
3 c
I could also do something like:
my_list1$meta <- NULL
with
list_vec[[1]][["meta"]] <- NULL
to reduce the list to just the data sublist, but that doesn't work with dynamically assigned names.
I've also tried wrapping things with eval but can't get that to work.
So specifically I need to evaluate the list's name from a string so I can extract a sublist from it.
We can pass the vector list_vec to mget, which returns a named nested list. We use lapply to extract ([[) the data element of each list, and unlist with recursive = FALSE to flatten one level of nesting.
unlist(lapply(mget(list_vec), `[[`, "data"), recursive = FALSE)
Result
#$my_list1
#[1] "a" "b" "c"
#$my_list2
#[1] "x" "y" "z"

Manipulating the quotes on strings when coding in R

This is actually a series of questions about referencing character-type values in R. I will add more bullets if I recall other related questions that I believe are interesting and related to this topic. For simplicity, I shall use some simple random examples to explain my questions. Hope this helps:
When building up a set of datasets using for loops, I want to output a series of vectors whose names are stored in a list called name_list = c("a", "b", "c", "d", "e", "f"). In the loop I would like to define:
for(i in 1:4){
a <- data[data$Year == 2010,]
b <- unique(data$Name)
c <- summarise(group_by(data,Year,Name), avg = mean(quantity))
...
f <- left_join(data, data1, by = c("Year", "Names"))
}
Is there any function that allows me to use function(name_list[1]) through function(name_list[6]) to replace a through f in the for loop? The same question applies to creating columns from column names in tables/data frames embedded in a chunk of code. (as.name and noquote work when just referencing the vector/dataset, but don't work when attempting to assign values to the target variable; if possible, could anyone share why this happens?)
When we extract some information from SQL or other data sources, we might have values separated by commas or some other delimiter stored as one variable. How can we test whether a certain value is among the values separated by commas? See the example below:
1567 %in% c(1567,1456,123)
TRUE
a <- "c(1567,1456,123)"
noquote(a)
c(1567,1456,123)
1567 %in% noquote(a)
FALSE
1567 %in% list(noquote(a))
FALSE
b <- "1567,1456,123"
noquote(b)
1567,1456,123
1567 %in% noquote(strsplit(a,","))
FALSE
1567 %in% list(noquote(strsplit(a,",")))
FALSE
I kind of get why %in% doesn't work here; it seems like R is taking 1567,1456,123 as one element. So I used strsplit to separate them, but it seems that it's still not working. Is there any way to get R to treat the string as commands?
If all you need to do is convert comma-separated lists like "1567,1456,123" into R vectors like c(1567, 1456, 123), you definitely do not need to wrap them in c(...) and try to evaluate them directly as vectors. You should just use strsplit to split the data:
data_str <- "1567,1456,123"
data_vec <- as.integer(strsplit(string_data, ","))
stopifnot(1567 %in% data_vec)
Note that strsplit returns a list, because it can also handle character vectors of length greater than one:
stopifnot(
  all.equal(
    list(c("a", "b"), c("x", "y")),
    strsplit(c("a,b", "x,y"), ",")) == TRUE)
which makes it useful for operating on columns of SQL output:
| id | concatenated_field |
|----|--------------------|
| 1 | 5362,395,9000,7 |
| 2 | 319,75624,63 |
(etc.)
d <- data.frame(
id = c(1, 2),
concatenated_field = c("5362,395,9000,7", "319,75624,63"))
d$split_field <- strsplit(d$concatenated_field, ",")
sapply(d, class)
# id concatenated_field split_field
# "numeric" "character" "list"
d$split_field[[1]]
# [1] "5362" "395" "9000" "7"
Alternatively, if you're reading in one big stream of comma-separated data, you can use scan:
data_vec <- scan(
what = 0, # arcane way to say "expect numeric input"
sep = ",",
text = "1,2,3,4,5,6,7,8,9,10")
stopifnot(all.equal(data_vec, 1:10) == TRUE)
scan is more heavy-duty than strsplit and can handle more complicated inputs as well, such as data with quoted fields:
weird_data <- scan(what="", sep=",", text='marvin,ruby,"joe,joseph",dean')
print(weird_data)
# [1] "marvin" "ruby" "joe,joseph" "dean"
If you are really really sure you need to be able to accept and evaluate R code passed as an input (this can be VERY DANGEROUS since it means you will be executing arbitrary unverified R code), you can use
r_code_string <- 'c("a", "b"), c("x", "y"))'
stopifnot(
all.equal(
c("a", "b"), c("x", "y")),
eval(parse(r_code_string))) == TRUE)
parse converts raw text (here passed via its text argument) into an unevaluated "expression", which is a representation of R code in the form of a special R object; eval passes the expression to the interpreter for execution.
As for noquote, it doesn't do what you think it does. It doesn't actually modify the string; it just adds a flag to the variable so that it will print without quotation marks. You can emulate this behavior with print(..., quote = FALSE).
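A quick illustration of that point (my own addition):
x <- noquote("1567,1456,123")
x
# [1] 1567,1456,123
class(x)
# [1] "noquote"
identical(unclass(x), "1567,1456,123")
# [1] TRUE (the underlying string is unchanged)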
