Consecutive character matching and extraction with position - r

I'm trying to write generic code in R that looks for two (or, in the future, more) specific characters that occur consecutively, in a given order, in a vector. Every command I have tried returns a match for the first character only.
I have a character vector that looks similar to data, and I want to extract the positions where "L" and "V" sit next to each other, in that order. So the only matches should be positions 3 & 4 and 7 & 8. However, I get back positions 1, 3, and 7 as matches for "L". Is it possible to return only the "LV" matches?
Reproducible data to work with:
data <- c("L", "D", "L", "V", "A", "V", "L", "V")

Here are some possibilities:
which(ts(data) == "L" & stats::lag(ts(data)) == "V")
## [1] 3 7
which(head(data, -1) == "L" & tail(data, -1) == "V")
## [1] 3 7
which(apply(t(embed(data, 2)) == c("V", "L"), 2, all))
## [1] 3 7
which(data == "L" & dplyr::lead(data) == "V")
## [1] 3 7

The vector data could first be collapsed into one string with paste. Then we can find the starting positions via gregexpr. After that, we can form a list of the start and finish points by concatenating the result from gregexpr with the adjusted match length attribute.
x <- gregexpr("LV", paste(data, collapse = ""))[[1]]
Map(c, x, x + attr(x, "match.length") - 1)
# [[1]]
# [1] 3 4
#
# [[2]]
# [1] 7 8
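Since the question mentions matching more than two characters later on, the head/tail comparison shown earlier can be generalised. The following is a minimal sketch in base R, not taken from any of the answers; find_run is a hypothetical helper name, and it assumes the pattern is given as a character vector of single elements.

```r
# Hypothetical helper: start positions where `pattern` occurs as a
# run of consecutive elements in `x`.
find_run <- function(x, pattern) {
  n <- length(x)
  k <- length(pattern)
  if (k == 0 || n < k) return(integer(0))
  starts <- seq_len(n - k + 1)          # every possible start position
  hit <- rep(TRUE, length(starts))
  for (j in seq_len(k)) {
    # compare the j-th pattern element against the j-th offset of each window
    hit <- hit & x[starts + j - 1] == pattern[j]
  }
  starts[which(hit)]                    # which() also drops NA comparisons
}

data <- c("L", "D", "L", "V", "A", "V", "L", "V")
find_run(data, c("L", "V"))
#> [1] 3 7
find_run(data, c("L", "V", "A"))
#> [1] 3
```

The end position of each match is the start plus length(pattern) - 1, which mirrors the match.length adjustment in the gregexpr approach.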


R: How to count length of intervals between specific word/symbol in a vector?

I have a vector that contains series of texts and numbers, like:
t <- c("A", 1:3, "A", 1:4, "A", 1:3)
t
#> [1] "A" "1" "2" "3" "A" "1" "2" "3" "4" "A" "1" "2" "3"
Created on 2022-08-06 by the reprex package (v2.0.1)
That is, the actual data is taken from a pdf, with the data frame collapsed into a single column vector, and the wrap length is uneven for some reason (probably because of the cell merging).
To process this data efficiently, I want to know the length from "A" to next "A" or end. In this example the answer would be 3, 4, 3 (Edit: sorry for a simple mistake, it would be 4, 5, 4).
I have tried many different methods but can't find one that works. Does anyone know of a better way?
An alternative using rle (run-length encoding)
with(rle(t == "A"), subset(lengths, !values))
#> [1] 3 4 3
You want the number of elements
(1) between adjacent "A"s;
(2) from the last "A" (excluding it) to the end.
We can use either of the following:
diff(c(which(t == "A"), length(t) + 1)) - 1
#[1] 3 4 3
diff(which(c(t, "A") == "A")) - 1
#[1] 3 4 3
Essentially we pad an "A" at the end to turn (2) into (1). If the last element of t happens to be an "A", the last value in the result will be 0.
Extension:
If you further want to know the number of elements from the beginning to the first "A" (excluding it), we can pad a leading "A":
diff(c(0, which(t == "A"), length(t) + 1)) - 1
#[1] 0 3 4 3
diff(which(c("A", t, "A") == "A")) - 1
#[1] 0 3 4 3
Here, the first value is 0, because the first element of t happens to be an "A".
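If the counts should include the "A" itself (the 4, 5, 4 from the question's edit), a small variation seems to be enough: drop the final `- 1`, or count group membership with cumsum. This is only a sketch using the example vector from the question; the tabulate variant ignores any elements that come before the first "A".

```r
t <- c("A", 1:3, "A", 1:4, "A", 1:3)

# Distance from each "A" to the next "A" (or to one past the end)
lens_diff <- diff(c(which(t == "A"), length(t) + 1))
lens_diff
#> [1] 4 5 4

# Same counts: label each element with the index of the most recent "A",
# then count how many elements carry each label
lens_tab <- tabulate(cumsum(t == "A"))
lens_tab
#> [1] 4 5 4
```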

R version of Vlookup with multiple matches per cell

I have a vector with numbers and a lookup table. I want the numbers replaced by the description from the lookup table.
This is easy when vectors are straightforward, like this example:
> variable <- sample(1:5, 10, replace=T)
> variable
[1] 5 4 5 3 2 3 2 3 5 2
>
> lookup <- data.frame(var = 1:5, description=LETTERS[1:5])
> lookup
var description
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
>
> with(lookup, description[match(variable, var)])
[1] E D E C B C B C E B
Levels: A B C D E
However, when single elements of a vector contain multiple outcomes, I get in trouble:
variable <- c("1", "2^3", "1^5", "4", "4")
I would like the vector returned to give:
c("A", "B^C", "A^E", "D", "D")
If every value to match and every replacement is a single character, you can use chartr
chartr(paste0(lookup$var, collapse = ""),
paste0(lookup$description, collapse = ""), variable)
#[1] "A" "B^C" "A^E" "D" "D"
chartr essentially says to replace each character of
paste0(lookup$var, collapse = "")
#[1] "12345"
with
paste0(lookup$description, collapse = "")
#[1] "ABCDE"
It is also useful since it does not change or return NA for characters which do not match.
As mentioned in the comments, there are a couple of steps needed to achieve the desired output. The following splits your variable, indexes the results against the description variable and then uses paste to collapse multiple elements.
sapply(strsplit(variable, "\\^"), function(x) paste0(lookup$description[as.numeric(x)], collapse = "^"))
[1] "A" "B^C" "A^E" "D" "D"
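Indexing lookup$description by as.numeric(x) relies on lookup$var being exactly 1, 2, 3, ... in row order. A sketch that looks each piece up with match instead, so it should also cope with keys that are not consecutive integers; translate_one is a hypothetical helper name, and stringsAsFactors = FALSE is set only to keep the example independent of the R version.

```r
lookup <- data.frame(var = 1:5, description = LETTERS[1:5],
                     stringsAsFactors = FALSE)
variable <- c("1", "2^3", "1^5", "4", "4")

# Hypothetical helper: translate the pieces of one split element
translate_one <- function(pieces) {
  idx <- match(pieces, as.character(lookup$var))  # NA if a key is unknown
  paste(lookup$description[idx], collapse = "^")
}

out <- vapply(strsplit(variable, "^", fixed = TRUE),
              translate_one, character(1))
out
#> [1] "A"   "B^C" "A^E" "D"   "D"
```

Using fixed = TRUE avoids having to escape the caret as a regular expression.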
You can use scan to parse text into numeric, which can then be used as an index to pick items which can then be collapsed together. Add quiet=TRUE to suppress "Read" messages.
sapply(variable, function(t) {
paste( lookup$description[ scan(text=t, sep="^")], collapse="^")} )
Read 1 item
Read 2 items
Read 2 items
Read 1 item
Read 1 item
1 2^3 1^5 4 4
"A" "B^C" "A^E" "D" "D"

Why are empty levels in my factor tabulated after I assign NAs to missing values?

I have a dataframe df with a column foo containing data of type factor:
df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))
When I inspect the structure with str(df$foo), I get this:
Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 ..
Why does it report 3 levels when there are only 2 in my data?
Edit:
There seems to be a missing value "" that I clean up by assigning it NA.
When I call table(df$foo), it seems to still count the "missing value" level, but finds no occurrences:
F M
0 2 2
However, when I call df$foo I find it reports only two levels:
Levels: F M
How is it possible that table still counts the empty level, and how can I fix that behaviour?
Check whether your dataframe really has no missing values, because it looks as though it does contain one. Try this:
# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)
# works if your missing value is just ""
which(df$MF == "")
You should then clean up your dataframe to properly reflect missing values. A factor will handle NA:
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""))
df$sex[which(as.integer(df$sex) == 1)] <- NA
Once you have cleaned your data, you will have to drop unused levels to avoid tabulations such as table counting occurrences of the empty level.
Observe this sequence of steps and its outputs:
# Build a dataframe to reproduce your behaviour
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""))
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"
# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)
F M
1 2 2
# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)
# AHA! So despite us changing the value, the original factor
# was not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] "" "F" "M"
# Indeed, the empty level is present in the factor, but there are
# no occurrences!
> table(df$MF)
F M
0 2 2
# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)
# factors fixed
> levels(df$MF)
[1] "F" "M"
# tabulation fixed
> table(df$MF)
F M
2 2
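The same clean-up can be written without looking up the integer code of the empty level. This is a condensed sketch of the steps above, not a different method; note that since R 4.0.0 data.frame() no longer converts strings to factors by default, so stringsAsFactors = TRUE is passed explicitly here to reproduce the factor behaviour.

```r
df <- data.frame(Restaurant = 1:5,
                 MF = c("M", "F", "F", "M", ""),
                 stringsAsFactors = TRUE)

df$MF[df$MF == ""] <- NA      # mark the empty string as missing
df$MF <- droplevels(df$MF)    # remove the now unused "" level

levels(df$MF)
#> [1] "F" "M"

# useNA = "ifany" makes table() report the missing value explicitly
table(df$MF, useNA = "ifany")
```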

Function that returns the first column values based on the other columns?

I'm starting with a data.frame similar to this in R:
data <- data.frame(Names=c("A", "B", "C", "D"), E1=c(NA, 1, 0, 4), E2=c(3, 0, 0, NA))
Names E1 E2
1 A NA 3
2 B 1 0
3 C 0 0
4 D 4 NA
My goal is to create a list that shows the Names where the value of each column is nonzero, zero, or NA. In other words:
[[1]]
$Nonzero
"B", "D"
$Zero
"C"
$N/A
"A"
[[2]]
$Nonzero
"A"
$Zero
"B", "C"
$N/A
"D"
So far I've written the following function:
my.function <- function(x) {
nonzero <- which(x!=0 & !is.na(x))
zero <- which(x==0 & !is.na(x))
na <- which(is.na(x))
rows <- list("Nonzero"=nonzero, "Zero"=zero, "N/A"=na)
return(rows)
}
Then I used lapply:
lapply(data[,-1], my.function)
The result is this:
[[1]]
$Nonzero
2, 4
$Zero
3
$N/A
1
[[2]]
$Nonzero
1
$Zero
2, 3
$N/A
4
So I've got the row numbers, but now I can't figure out how to get to the Names from here. My real data set has ~50 columns, so I definitely need something I can use with lapply, rather than doing it separately for each column. Advice is greatly appreciated!
Edit: I should add that I would like this function to be transferable for use on other datasets. Thus inserting the name of this individual dataset into the function will not work.
A very quick fix is:
library(magrittr)
my.function <- function(x) {
nonzero <- which(x!=0 & !is.na(x)) %>% data$Names[.]
zero <- which(x==0 & !is.na(x)) %>% data$Names[.]
na <- which(is.na(x)) %>% data$Names[.]
rows <- list("Nonzero"=nonzero, "Zero"=zero, "N/A"=na)
return(rows)
}
Then call
lapply(data, my.function)[-1]
Because you don't want the list results for column "Names".
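The quick fix hard-codes data$Names inside the function, which conflicts with the question's edit about reusing it on other datasets. One possible way around that, sketched here in base R, is to pass the labels in as an argument; classify_names and its labels parameter are hypothetical names, not part of the original answer.

```r
# Hypothetical helper: `labels` is whatever vector names the rows
classify_names <- function(x, labels) {
  list(
    Nonzero = labels[which(x != 0)],   # which() skips NA comparisons
    Zero    = labels[which(x == 0)],
    "N/A"   = labels[is.na(x)]
  )
}

data <- data.frame(Names = c("A", "B", "C", "D"),
                   E1 = c(NA, 1, 0, 4),
                   E2 = c(3, 0, 0, NA),
                   stringsAsFactors = FALSE)

# Extra arguments to lapply are forwarded to the function
res <- lapply(data[-1], classify_names, labels = data$Names)
res$E1$Nonzero
#> [1] "B" "D"
```

Because the labels are supplied by the caller, the same function can be applied to another data frame by changing only the lapply call.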

How can I remove an element from a list?

I have a list and I want to remove a single element from it. How can I do this?
I've tried looking up what I think the obvious names for this function would be in the reference manual and I haven't found anything appropriate.
If you don't want to modify the list in-place (e.g. for passing the list with an element removed to a function), you can use indexing: negative indices mean "don't include this element".
x <- list("a", "b", "c", "d", "e"); # example list
x[-2]; # without 2nd element
x[-c(2, 3)]; # without 2nd and 3rd
Also, logical index vectors are useful:
x[x != "b"]; # without elements that are "b"
This works with dataframes, too:
df <- data.frame(number = 1:5, name = letters[1:5])
df[df$name != "b", ]; # rows without "b"
df[df$number %% 2 == 1, ] # rows with odd numbers only
I don't know R at all, but a bit of creative googling led me here: http://tolstoy.newcastle.edu.au/R/help/05/04/1919.html
The key quote from there:
I do not find explicit documentation for R on how to remove elements from lists, but trial and error tells me
myList[[5]] <- NULL
will remove the 5th element and then "close up" the hole caused by deletion of that element. That shuffles the index values, so I have to be careful in dropping elements. I must work from the back of the list to the front.
A response to that post later in the thread states:
For deleting an element of a list, see R FAQ 7.1
And the relevant section of the R FAQ says:
... Do not set x[i] or x[[i]] to NULL, because this will remove the corresponding component from the list.
Which seems to tell you (in a somewhat backwards way) how to remove an element.
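The FAQ wording makes more sense next to the two cases it is contrasting: keeping a slot whose value is NULL versus deleting the slot. A short sketch with a made-up list x:

```r
x <- list(a = 1, b = 2, c = 3)

# Keep the component but set its value to NULL
x["b"] <- list(NULL)
len_after_set <- length(x)
len_after_set
#> [1] 3

# Assigning NULL directly removes the component
x[["b"]] <- NULL
len_after_remove <- length(x)
names(x)
#> [1] "a" "c"
```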
I would like to add that if it's a named list you can simply use within.
l <- list(a = 1, b = 2)
> within(l, rm(a))
$b
[1] 2
So you can overwrite the original list
l <- within(l, rm(a))
to remove element named a from list l.
Here is how to remove the last element of a list in R:
x <- list("a", "b", "c", "d", "e")
x[length(x)] <- NULL
If x might be a vector then you would need to create a new object:
x <- c("a", "b", "c", "d", "e")
x <- x[-length(x)]
The second form, x[-length(x)], works for both lists and vectors.
Removing NULL elements from a list in a single line:
# logical indexing also works when the list contains no NULL elements
x <- x[!vapply(x, is.null, logical(1))]
If you have a named list and want to remove a specific element you can try:
lst <- list(a = 1:4, b = 4:8, c = 8:10)
if("b" %in% names(lst)) lst <- lst[ - which(names(lst) == "b")]
The first line makes a list lst with elements a, b, c. The second line removes element b after checking that it exists (to avoid the problem @hjv mentioned).
or better:
lst$b <- NULL
This way it is not a problem to try to delete a non-existent element (e.g. lst$g <- NULL)
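The existence check matters because negative indexing with an empty which() result selects nothing rather than everything. A small sketch of the pitfall and two alternatives that do not need the check; the name "g" is just an example of a name that is not in the list.

```r
lst <- list(a = 1:4, b = 4:8, c = 8:10)

# Pitfall: which() returns integer(0), and lst[-integer(0)] is an empty list
bad <- lst[-which(names(lst) == "g")]
length(bad)
#> [1] 0

# Logical indexing keeps everything when nothing matches
ok1 <- lst[names(lst) != "g"]

# So does selecting by the names that should remain
ok2 <- lst[setdiff(names(lst), "g")]
```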
Use - (the negative sign) together with the position of the element. For example, to remove the 3rd element, use your_list[-3].
Input
my_list <- list(a = 3, b = 3, c = 4, d = "Hello", e = NA)
my_list
# $`a`
# [1] 3
# $b
# [1] 3
# $c
# [1] 4
# $d
# [1] "Hello"
# $e
# [1] NA
Remove single element from list
my_list[-3]
# $`a`
# [1] 3
# $b
# [1] 3
# $d
# [1] "Hello"
# $e
# [1] NA
Remove multiple elements from list
my_list[c(-1,-3,-2)]
# $`d`
# [1] "Hello"
# $e
# [1] NA
my_list[c(-3:-5)]
# $`a`
# [1] 3
# $b
# [1] 3
my_list[-seq(1:2)]
# $`c`
# [1] 4
# $d
# [1] "Hello"
# $e
# [1] NA
There's the rlist package (http://cran.r-project.org/web/packages/rlist/index.html) to deal with various kinds of list operations.
Example (http://cran.r-project.org/web/packages/rlist/vignettes/Filtering.html):
library(rlist)
devs <-
list(
p1=list(name="Ken",age=24,
interest=c("reading","music","movies"),
lang=list(r=2,csharp=4,python=3)),
p2=list(name="James",age=25,
interest=c("sports","music"),
lang=list(r=3,java=2,cpp=5)),
p3=list(name="Penny",age=24,
interest=c("movies","reading"),
lang=list(r=1,cpp=4,python=2)))
list.remove(devs, c("p1","p2"))
Results in:
# $p3
# $p3$name
# [1] "Penny"
#
# $p3$age
# [1] 24
#
# $p3$interest
# [1] "movies" "reading"
#
# $p3$lang
# $p3$lang$r
# [1] 1
#
# $p3$lang$cpp
# [1] 4
#
# $p3$lang$python
# [1] 2
I don't know if you still need an answer to this, but in my limited experience with R (three weeks of self-teaching) I found that NULL assignment can fail, especially if you're dynamically updating an object in something like a for-loop.
To be more precise, if myList is actually an atomic vector rather than a true list, using
myList[[5]] <- NULL
will throw the error
myList[[5]] <- NULL : replacement has length zero
or
more elements supplied than there are to replace
What I found to work more consistently, for both lists and vectors, is negative indexing with single brackets:
myList <- myList[-5]
Just wanted to quickly add (because I didn't see it in any of the answers) that, for a named list, you can also do l["name"] <- NULL. For example:
l <- list(a = 1, b = 2, cc = 3)
l['b'] <- NULL
In the case of named lists I find those helper functions useful
member <- function(list,names){
## return the elements of the list with the input names
member..names <- names(list)
index <- which(member..names %in% names)
list[index]
}
exclude <- function(list,names){
## return the elements of the list not belonging to names
member..names <- names(list)
index <- which(!(member..names %in% names))
list[index]
}
aa <- structure(list(a = 1:10, b = 4:5, fruits = c("apple", "orange"
)), .Names = c("a", "b", "fruits"))
> aa
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
## $b
## [1] 4 5
## $fruits
## [1] "apple" "orange"
> member(aa,"fruits")
## $fruits
## [1] "apple" "orange"
> exclude(aa,"fruits")
## $a
## [1] 1 2 3 4 5 6 7 8 9 10
## $b
## [1] 4 5
Using lapply and grep:
lst <- list(a = 1:4, b = 4:8, c = 8:10)
# say you want to remove a and c
toremove<-c("a","c")
lstnew<-lst[-unlist(lapply(toremove, function(x) grep(x, names(lst)) ) ) ]
#or
pattern<-"a|c"
lstnew<-lst[-grep(pattern, names(lst))]
You can also negatively index from a list using the extract function of the magrittr package to remove a list item.
a <- seq(1,5)
b <- seq(2,6)
c <- seq(3,7)
l <- list(a,b,c)
library(magrittr)
extract(l,-1) #simple one-function method
[[1]]
[1] 2 3 4 5 6
[[2]]
[1] 3 4 5 6 7
There are a few options in the purrr package that haven't been mentioned:
pluck and assign_in work well with nested values and you can access it using a combination of names and/or indices:
library(purrr)
l <- list("a" = 1:2, "b" = 3:4, "d" = list("e" = 5:6, "f" = 7:8))
# select values (by name and/or index)
all.equal(pluck(l, "d", "e"), pluck(l, 3, "e"), pluck(l, 3, 1))
[1] TRUE
# or if element location stored in a vector use !!!
pluck(l, !!! as.list(c("d", "e")))
[1] 5 6
# remove values (modifies in place)
pluck(l, "d", "e") <- NULL
# assign_in to remove values with name and/or index (does not modify in place)
assign_in(l, list("d", 1), NULL)
$a
[1] 1 2
$b
[1] 3 4
$d
$d$f
[1] 7 8
Or you can remove values using list_modify by assigning zap() or NULL:
all.equal(list_modify(l, a = zap()), list_modify(l, a = NULL))
[1] TRUE
You can remove or keep elements using a predicate function with discard and keep:
# remove numeric elements
discard(l, is.numeric)
$d
$d$e
[1] 5 6
$d$f
[1] 7 8
# keep numeric elements
keep(l, is.numeric)
$a
[1] 1 2
$b
[1] 3 4
Here is a simple solution that can be done using base R. It removes the number 5 from the original list of numbers. You can use the same method to remove whatever element you want from a list.
#the original list
original_list = c(1:10)
#the list element to remove
remove = 5
#the new list (which will not contain whatever the `remove` variable equals)
new_list = c()
#go through all the elements in the list and add them to the new list if they don't equal the `remove` variable
counter = 1
for (n in original_list){
if (n != remove){
new_list[[counter]] = n
counter = counter + 1
}
}
The new_list variable no longer contains 5.
new_list
# [1] 1 2 3 4 6 7 8 9 10
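The loop can usually be replaced by a single vectorised comparison. A sketch of that shorter form, along with a Filter call for the case where the data really is a list rather than a vector (c(1:10) above is an integer vector); the variable names here are illustrative.

```r
original <- 1:10
remove <- 5

# Vectorised: keep the elements that differ from `remove`
kept <- original[original != remove]
kept
#> [1]  1  2  3  4  6  7  8  9 10

# For an actual list, Filter() applies the test element by element
kept_list <- Filter(function(n) n != remove, as.list(original))
length(kept_list)
#> [1] 9
```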
How about this? Again, using indices
> m <- c(1:5)
> m
[1] 1 2 3 4 5
> m[1:length(m)-1]
[1] 1 2 3 4
or
> m[-(length(m))]
[1] 1 2 3 4
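The first form works somewhat by accident: the colon operator binds tighter than subtraction, so 1:length(m)-1 is (1:length(m)) - 1, and the resulting index 0 is silently ignored. A sketch that shows what is going on, plus head as an alternative that also behaves sensibly for a vector of length one:

```r
m <- 1:5

idx <- 1:length(m) - 1   # parsed as (1:length(m)) - 1
idx
#> [1] 0 1 2 3 4
via_idx <- m[idx]        # the 0 index selects nothing

# Parenthesised version of what was probably intended
via_paren <- m[1:(length(m) - 1)]

# head() with a negative n drops that many elements from the end
via_head <- head(m, -1)
via_head
#> [1] 1 2 3 4
```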
You can use which.
x<-c(1:5)
x
#[1] 1 2 3 4 5
x<-x[-which(x==4)]
x
#[1] 1 2 3 5
If you'd like to avoid numeric indices, you can use
a <- a[setdiff(names(a),c("name1", ..., "namen"))]
to delete the elements named name1 ... namen from a. This works for lists
> l <- list(a=1,b=2)
> l[setdiff(names(l),"a")]
$b
[1] 2
as well as for vectors
> v <- c(a=1,b=2)
> v[setdiff(names(v),"a")]
b
2
