How to drop substring from variable names?

I have the following names of variables:
vars <- c("var1.caps(12, For]","var2(5,For]","var3.tree.(15, For]","var4.caps")
I need to clean these names in order to get the following result:
clean_vars <- c("var1.caps","var2","var3.tree.","var4.caps")
So, basically I would like to drop (..].
Is there any automated way to do it in R?
I was trying to adapt str_replace(vars, pattern, ""), but not sure how to make pattern flexible because it could have different values between ( and ].

gsub("\\(.*\\]","",vars)
[1] "var1.caps" "var2" "var3.tree." "var4.caps"

Using stringr and purrr:
library(magrittr)  # provides the %>% pipe
stringr::str_split(vars, "\\(") %>% purrr::map(., 1) %>% unlist()
[1] "var1.caps" "var2" "var3.tree." "var4.caps"

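Another option using gsub:
# The empty lookbehind (?<=) and the \\1 backreference have no effect here;
# the match is simply replaced with an empty string.
gsub("(?<=)\\(.*\\]", "\\1", vars, perl = TRUE)
[1] "var1.caps" "var2" "var3.tree." "var4.caps"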

Eliminate the first ( (in regex, \\() in the string and everything that comes after it (.+), replacing it with nothing (""):
sub("\\(.+", "", vars)
# [1] "var1.caps" "var2" "var3.tree." "var4.caps"
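The patterns above differ in how much they remove: \\(.*\\] is greedy and spans from the first ( to the last ], while \\(.+ drops everything after the first (. A minimal sketch of a stricter variant, assuming the bracketed part always sits at the very end of the name (the anchored pattern is an illustration, not one of the original answers):

```r
vars <- c("var1.caps(12, For]", "var2(5,For]", "var3.tree.(15, For]", "var4.caps")

# Match "(", then any run of characters that are not "]", then a "]" at the
# end of the string. [^]] is a bracket expression meaning "anything but ]".
clean_vars <- sub("\\([^]]*\\]$", "", vars)

print(clean_vars)
```

Anchoring with $ means a name that merely contains parentheses in the middle would be left alone, which may or may not be what you want.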

Related

combining elements in a list of lists in R

I have a list called samples_ID with 116 vectors; each vector has three elements like these:
"11" "GT20-16829" "S27"
I want to keep the 116 vectors, but combine the elements of each into a single element like this:
"11_GT20-16829_S27"
I tried something like this
samples_ID_ <- paste(samples_ID, collapse = "_")
it returns a single vector, below is just a part of it:
..._c(\"33\", \"GT20-16846\", \"S24\")_c(\"33\", \"GT20-18142\", \"S72\")_c(\"34\", \"GT20-16819\", \"S50\")_c...
What am I doing wrong?
Can you help me please?
Thanks

A tidyverse option.
library(stringr)
library(purrr)
map(samples_ID, ~ str_c(., collapse = '_'))
# [[1]]
# [1] "11_GT20-16829_S27"
#
# [[2]]
# [1] "12_GT20-16830_S28"
Data:
samples_ID <- list(c("11", "GT20-16829", "S27"), c("12", "GT20-16830", "S28"))

In base R, we can use sapply
sapply(samples_ID, paste, collapse="_")

Another base R option using paste
do.call(paste, c(data.frame(t(list2DF(samples_ID))), sep = "_"))
or
do.call(paste, c(data.frame(do.call(rbind, samples_ID)), sep = "_"))
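To see why the original attempt failed and what the element-wise fix does, here is a small sketch using the two-element sample data from the answer above. It uses vapply instead of sapply, which is a substitution on my part: it behaves the same here but also checks that each result is a single string.

```r
samples_ID <- list(c("11", "GT20-16829", "S27"), c("12", "GT20-16830", "S28"))

# paste() on the list itself coerces each element with as.character(),
# which is why the deparsed c("11", ...) text shows up in the result.
wrong <- paste(samples_ID, collapse = "_")

# Applying paste() to each element collapses within each vector instead.
# vapply() additionally checks that every result is a single string.
combined <- vapply(samples_ID, paste, character(1), collapse = "_")

print(combined)
```

If the result should stay a list of 116 elements rather than a character vector, lapply(samples_ID, paste, collapse = "_") does the same work without simplifying.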

keep duplicates using `make_clean_names` in R janitor package

I am trying to clean a character column using the make_clean_names function from the janitor package in R. I need to keep the duplicates in this case and not append a number to them. Is this possible? My code looks like this:
x <- c(' x y z', 'xyz', 'x123x', 'xy()','xyz','xyz')
janitor::make_clean_names(x)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz_2" "xyz_3"
janitor::make_clean_names(x, unique_sep = '.')
[1] "x_y_z" "xyz" "x123x" "xy" "xyz.1" "xyz.2"
janitor::make_clean_names(x, unique_sep = NULL)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz_2" "xyz_3"
Using unique_sep = NULL doesn't seem to work. Is there any other way to keep the duplicate values?
Desired Output:
[1] "x_y_z" "xyz" "x123x" "xy" "xyz" "xyz"
I know how to use regular expressions to do this. Just searching for a shortcut.
PS: I know this function is created to clean names of a data.frame, I am trying to apply this to a different use case. This functionality might help a lot in cleaning character columns.

You can use sapply to go through the vector elements one by one and thus avoid adding numeric suffixes to duplicates:
sapply(x, janitor::make_clean_names, USE.NAMES = FALSE)
[1] "x_y_z" "xyz" "x123x" "xy" "xyz" "xyz"

Unfortunately no, it's not possible. If you look at the code for make_clean_names you'll see it ends with this:
# Handle duplicated names - they mess up dplyr pipelines. This appends the
# column number to repeated instances of duplicate variable names.
while (any(duplicated(cased_names))) {
  dupe_count <-
    vapply(
      seq_along(cased_names), function(i) {
        sum(cased_names[i] == cased_names[1:i])
      },
      1L
    )
  cased_names[dupe_count > 1] <-
    paste(
      cased_names[dupe_count > 1],
      dupe_count[dupe_count > 1],
      sep = "_"
    )
}
I think you're on the right track passing the unique_sep argument through to the underlying function that make_clean_names uses, snakecase::to_any_case. But that while loop, recently introduced to ensure there are never duplicated names resulting from make_clean_names, will always deduplicate at the end.
You could try adapting your own function that is the first part of make_clean_names, without the loop, or you could perhaps make use of snakecase::to_any_case.
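As a rough illustration of the "write your own function without the loop" idea, here is a minimal base R sketch. The helper name clean_keep_dupes is hypothetical, and it only approximates the cleaning step for simple ASCII input like the example; it does not reproduce everything make_clean_names does.

```r
# Hypothetical helper: a rough base R stand-in for the cleaning step only,
# with no de-duplication at the end.
clean_keep_dupes <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # runs of other characters become "_"
  gsub("^_+|_+$", "", x)           # trim leading and trailing underscores
}

x <- c(" x y z", "xyz", "x123x", "xy()", "xyz", "xyz")
cleaned <- clean_keep_dupes(x)
print(cleaned)
```

Because nothing is appended to repeated values, the three "xyz" entries stay identical.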

R subsetting list "incorrect number of dimensions"

I am working with some text in a list. The text is separated by CR/LF, so I split the string on that. Then I have to clean up the list to make it usable.
library(tidyverse)
my_list <-("abc\r\ndef\r\nghi\r\njkl\r\n")
# The str_split gives me a list that has an empty element at the end. Why?
split_list <- str_split(my_list, "\r\n")
[[1]]
[1] "abc" "def" "ghi" "jkl" ""
I need to remove the first two elements and then sort in reverse order:
split_list %>%
split_list[[1]][-1:-2] %>%
sort(split_list, decreasing = TRUE)
But it fails with: Error in .[split_list[[1]], -1:-2] : incorrect number of dimensions
I've read so many discussions of subsetting but they all seem more complicated than my example. I clearly don't understand this yet. Thank you for your suggestions!

You could do :
library(magrittr)
split_list %>% .[[1]] %>% tail(-2) %>% sort(decreasing = TRUE)
#[1] "jkl" "ghi" ""

Here's a way of using "[[" and "[" inside the tidyverse framework. They are both functions so you need to backtick them when they are used in this manner. (Your error arises from referring to the data-object twice. You should not need to refer to split_list twice.) The tidyverse creates an implicit pass-through of the leading data-object as it gets progressively modified by the sequence of functions. Functions become somewhat like 'infix'-functions in base R:
split_list %>%
`[[`(1) %>% # pulls the first element (the character vector) out of split_list
`[`(-1:-2) %>% # both extraction functions used by back-ticked names
sort( decreasing = TRUE)
[1] "jkl" "ghi" ""
It's actually quite similar to the arrangement you could use in the base R use of these functions which are also infix:
sort( split_list
[[ 1]]
[ (-1:-2)],
decreasing = TRUE)
[1] "jkl" "ghi" ""

If you are only working on one vector such that str_split only ever returns a list with one element containing the split vector, you could wrap your str_split() inside the unlist() function to obtain the vector of split elements directly. It could look something like this:
sort(unlist(str_split(my_list, "\r\n"))[-c(1:2)], decreasing = TRUE)
Above I also subset the unlisted vector to remove the first two elements and then wrap the entire expression inside the sort() function with decreasing = TRUE.
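On the question of the empty element at the end: the input ends with a separator, and str_split returns the (empty) piece that follows it, as the output in the question shows. A minimal base R sketch, assuming you would rather not have that empty string in the result (the variable names here are illustrative):

```r
my_text <- "abc\r\ndef\r\nghi\r\njkl\r\n"

# Base strsplit() does not emit an empty string for a separator at the
# very end of the input.
parts <- strsplit(my_text, "\r\n", fixed = TRUE)[[1]]

# More generally, nzchar() filters empty strings out of any character vector.
parts <- parts[nzchar(parts)]

# Drop the first two elements, then sort in reverse order.
result <- sort(parts[-(1:2)], decreasing = TRUE)
print(result)
```

The nzchar() filter also works on the output of str_split if you prefer to stay with stringr.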

handling sequential tasks with purrr

I would like to take a list of objects and build a single object out of all of them. The actual use case is to combine multiple Seurat objects into a single object. Currently I use a for loop; however, I was curious whether I could use purrr::map. To make the problem simpler, let's just concatenate a part of a list. Try not to get too cute with the result, because the true problem is more difficult (a more complex function).
w1 = list(a="first",b="This")
w2 = list(a="second",b="is")
w3 = list(a="third",b="the")
w4 = list(a="fourth",b="desired results")
The desired result would be "This is the desired results".
list(w1,w2,w3,w4) %>% map(paste,.$b," ")
gives
[[1]] [1] "This "
[[2]] [1] "is "
[[3]] [1] "the "
[[4]] [1] "desired result "
I would like to save the results of the previous iteration and add it as a parameter to the function.
essentially I would like to replace the following line with a functional.
y=NULL;for (x in list(w1,w2,w3,w4)){ y=ifelse(is.null(y),x$b,paste0(y," ",x$b))}
#y
#"This is the desired results"

library(purrr)
list(w1, w2, w3, w4) %>%
accumulate(~paste(.x, .y[2][[1]]), .init = '') %>%
tail(1) %>%
substr(2, nchar(.))
# [1] "This is the desired results"

With do.call and lapply in Base R:
do.call(paste, lapply(list(w1,w2,w3,w4), `[[`, "b"))
# [1] "This is the desired results"

I would recommend this approach using purrr:
list(w1,w2,w3,w4) %>%
map_chr("b") %>%
paste(collapse=" ")
We can pass a string to map() to return just that named element, and since we are expecting only character values, we can use map_chr to get just a vector of character values rather than a list. Finally just pipe that to paste(collapse=) to turn it into just one string.
But more generally if you want to collapse incrementally, you can use reduce.
list(w1, w2, w3, w4) %>%
map_chr("b") %>%
reduce(~paste(.x, .y))
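For comparison, the same fold can be sketched in base R with Reduce, which carries the previous result into each call much like the for loop in the question. The combine function below is a hypothetical stand-in for whatever two-argument merge step the real problem needs.

```r
w1 <- list(a = "first",  b = "This")
w2 <- list(a = "second", b = "is")
w3 <- list(a = "third",  b = "the")
w4 <- list(a = "fourth", b = "desired results")
ws <- list(w1, w2, w3, w4)

# Hypothetical combining step: takes the accumulated value and the next
# list element, and returns the new accumulated value.
combine <- function(acc, w) paste(acc, w$b)

# Start from the first element's b and fold in the remaining elements.
final <- Reduce(combine, ws[-1], ws[[1]]$b)

# accumulate = TRUE keeps every intermediate result as well.
steps <- Reduce(combine, ws[-1], ws[[1]]$b, accumulate = TRUE)

print(final)
```

Unlike the map_chr("b") version, this operates on the whole list elements, so the combining function can use any part of each object.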

numeric sort a list of strings in R

I have a list:
a <- ["12file.txt", "8file.txt", "66file.txt"]
I would like to sort by number:
a would be: ["8file.txt", "12file.txt", "66file.txt"]
Now I could get only this:
a = ["12file.txt", "66file.txt", "8file.txt"]
Thanks

I'm assuming you have a character vector:
a <- c("12file.txt", "8file.txt", "66file.txt")
I would approach this by pulling out the number at the start of each string and sorting on that:
num <- as.numeric(sub("([0-9]+).*", "\\1", a))
a[order(num)]
#[1] "8file.txt" "12file.txt" "66file.txt"

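You could also pad your strings with spaces by setting a field width in sprintf to achieve the sorting you want. Note that this relies on the width being at least as long as the longest string and on the text after the number having the same length in every element: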
a[order(sprintf("%10s",a))]
[1] "8file.txt" "12file.txt" "66file.txt"

You can use str_sort(..., numeric = TRUE) function from stringr package:
library(stringr)
a <- c("12file.txt", "8file.txt", "66file.txt")
str_sort(a, numeric = TRUE)
#> [1] "8file.txt" "12file.txt" "66file.txt"
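A small sketch of the numeric-extraction approach on a slightly larger input. The extra element "100file.txt" is my own addition to show a case where the number of digits varies more; a "%10s" width only pads strings shorter than 10 characters, so it would not pad that 11-character name, whereas ordering by the extracted number does not depend on string length.

```r
a <- c("12file.txt", "8file.txt", "66file.txt", "100file.txt")

# Extract the leading digits and order by their numeric value.
num <- as.numeric(sub("^([0-9]+).*", "\\1", a))
sorted <- a[order(num)]

print(sorted)
```

This assumes every element starts with digits; an element without a leading number would yield NA with a warning and be placed last by order().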
