splitting a character vector at specified intervals in R - r

I have some sentences in a specific format and I need to split them at fixed positions.
The sentences look like this:
"abxyzpqrst34245"
"mndeflmnop6346781"
I want to split each of these sentences after the 2nd, 5th and 10th characters, i.e. at positions c(2,5,10), so that the output will be:
[1] c("ab", "xyz", "pqrst", "34245")
[2] c("mn", "def", "lmnop", "6346781")
NOTE: The numeric part after the 3rd split is of variable length, whereas the previous parts are of fixed length.
I tried to use cut, but it only works with integer vectors.
I looked at split, but I'm not sure if it works without factors.
So, I finally went with substr to divide each of the sentences separately like this:
substr("abxyzpqrst34245", 1,2)
[1] "ab"
substr("abxyzpqrst34245", 3,5)
[1] "xyz"
substr("abxyzpqrst34245", 6,10)
[1] "pqrst"
substr("abxyzpqrst34245", 11,10000)
[1] "34245"
I'm using this long process to split these strings. Is there any easier way to achieve this splitting?

You're looking for (the often overlooked) substring:
x <- "abxyzpqrst34245"
substring(x,c(1,3,6,11),c(2,5,10,nchar(x)))
[1] "ab" "xyz" "pqrst" "34245"
which is handy because it is fully vectorized. If you want to do this over multiple strings in turn, you might do something like this:
x <- c("abxyzpqrst34245","mndeflmnop6346781")
lapply(x,function(y) substring(y,first = c(1,3,6,11),last = c(2,5,10,nchar(y))))
[[1]]
[1] "ab" "xyz" "pqrst" "34245"
[[2]]
[1] "mn" "def" "lmnop" "6346781"
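The start and end positions above can also be derived from the cut points c(2,5,10) rather than typed by hand. A minimal sketch, assuming a hypothetical helper named split_at() (not part of base R) built on substring():

```r
# split_at() is a hypothetical helper name used for illustration only.
# x:    character vector of strings to split
# cuts: positions after which to split, e.g. c(2, 5, 10)
split_at <- function(x, cuts) {
  starts <- c(1, cuts + 1)  # each piece starts right after the previous cut
  lapply(x, function(s) substring(s, starts, c(cuts, nchar(s))))
}

res <- split_at(c("abxyzpqrst34245", "mndeflmnop6346781"), c(2, 5, 10))
res
```

Because the last end position is taken from nchar() of each string, the variable-length tail is handled without a magic number such as 10000.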

If you have a vector of strings to be split, you might also find read.fwf() handy. Use it like so:
x <- c("abxyzpqrst34245", "mndeflmnop6346781")
df <- read.fwf(file = textConnection(x),
               widths = c(2,3,5,10000),
               colClasses = "character")
df
# V1 V2 V3 V4
# 1 ab xyz pqrst 34245
# 2 mn def lmnop 6346781
str(df)
# 'data.frame': 2 obs. of 4 variables:
# $ V1: chr "ab" "mn"
# $ V2: chr "xyz" "def"
# $ V3: chr "pqrst" "lmnop"
# $ V4: chr "34245" "6346781"
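read.fwf() expects column widths rather than cut positions. As a small sketch, the widths can be computed from the cut points in the question with diff(); the names cuts and tail_width are illustrative, and the large final width follows the same assumption as the answer above (no string is longer than that).

```r
x <- c("abxyzpqrst34245", "mndeflmnop6346781")

cuts <- c(2, 5, 10)          # split after these character positions
tail_width <- 10000          # assumed upper bound for the variable-length tail
widths <- c(diff(c(0, cuts)), tail_width)  # widths of the fixed pieces: 2, 3, 5

df <- read.fwf(file = textConnection(x),
               widths = widths,
               colClasses = "character")
df
```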

Related

Convert a list of characters partially to int

I have a list of data:
$nPerm
[1] "1000"
$minGSSize
[1] "10"
$maxGSSize
[1] "100"
$by
[1] "DOSE"
$seed
[1] "TRUE"
This list is supposed to be flexible, so these values could be different and could be something else.
Every element of this list is of class character, numbers and words alike. I would like to know if it is possible to convert only the numbers to numeric, but leave the others as characters/strings.
Thank you in advance!
L <- list(a="1000", b="DOSE", c="99")
type.convert(L, as.is = TRUE)
# $a
# [1] 1000
# $b
# [1] "DOSE"
# $c
# [1] 99
Evan's answer is very neat; just for completeness, here is also a {purrr} option:
L <- list(a="1000", b="DOSE", c="99")
L |> purrr::map(~ifelse(stringr::str_detect(.x,"^[:digit:]+$"), as.numeric(.x), .x))
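One point worth checking with type.convert(): it converts any string it recognises, so a value such as "TRUE" becomes a logical, not just the numbers becoming numeric. If only numeric-looking strings should change, a base R sketch along these lines may help. The helper name to_num() and the regular expression (optional minus sign, digits, optional decimal part) are assumptions for illustration, not taken from the answers above.

```r
L <- list(nPerm = "1000", minGSSize = "10", by = "DOSE", seed = "TRUE")

# to_num() is a hypothetical helper: convert a single string to numeric
# only when it looks like a plain number, otherwise return it unchanged.
to_num <- function(x) {
  if (grepl("^-?[0-9]+(\\.[0-9]+)?$", x)) as.numeric(x) else x
}

converted <- lapply(L, to_num)
str(converted)

# For comparison: type.convert() turns "TRUE" into a logical value.
seed_converted <- type.convert("TRUE", as.is = TRUE)
```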

R - function parameter that is list of functions--inspect parameter without evaluating?

EDIT: the initial responses suggest my write-up focused people's attention on questions of best practice rather than questions of technique. I'd like to focus on a technical issue, however, with the below just as a toy example:
If a person passes a list to a function parameter, how can you capture and inspect individual elements of that list without risking errors from the system attempting to call/evaluate those elements?
For instance, if a user passes to a parameter a list of functions that may or may not be appropriate, or have the associated packages loaded, how can the function safely examine what functions were requested?
Say I would like to build a function that iterates through other functions that might be applied. The actual example would call different modeling functions, but here's a toy example that's easier to see:
newfunc <- function(func.list){
  lapply(func.list,
         function(f){
           f(letters)
         }
  )
}
Let's say that among the functions newfunc() can take are the functions nchar() and length(). If we provide those, we get the following:
newfunc(
func.list = list(nchar, length)
)
[[1]]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[[2]]
[1] 26
But, let's say that newfunc() is also capable of taking something like str_to_upper(), which comes from the package stringr. Passing str_to_upper() works fine, but only if stringr has been loaded beforehand:
newfunc(
func.list = list(nchar, length, str_to_upper)
)
Error in lapply(func.list, function(f) f(letters)) :
object 'str_to_upper' not found
require(stringr)
newfunc(func.list = list(nchar, length, str_to_upper))
[[1]]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[[2]]
[1] 26
[[3]]
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O"
[16] "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
I'd like to put code in the function that can investigate the elements of the list and determine whether any packages (like stringr) need to be loaded. Also, I'd like to check whether the functions listed are from an acceptable set (so it catches if someone passes mean() or, worse, rcorr() from an unloaded Hmisc).
# This works here but is undesirable:
newfunc(func.list = list(nchar, length, str_to_upper, mean))
# This creates issues no matter what:
newfunc(func.list = list(nchar, length, str_to_upper, rcorr))
require(Hmisc)
newfunc(func.list = list(nchar, length, str_to_upper, rcorr))
I know how to do something like func.list.test <- deparse(substitute(func.list)) to get the literal text of the parameter, but I don't know how to do that on individual elements without risking triggering an error if some function isn't present.
(and I don't want to take the hacky route of string manipulation on the overall deparsed output of func.list.test)
Ideally for this use case I'd like to know if this can be done with base R techniques. However, feel free to explain how to do this using newer approaches like tidy evaluation/quosures if it's the best/only way (though know that my familiarity with those is currently pretty limited).
Any help would be appreciated.
Here's a pure base function that uses find() to determine what function is being used and help.search() to locate any installed packages that might have the function:
resolve <- function( func.list )
{
    ## Disassemble the supplied list of functions (lfs)
    lf <- as.list(substitute( func.list ))[-1]
    lfs <- lapply( lf, deparse )
    lfs <- setNames( lfs, lfs )
    ## Find functions (ff) in the loaded namespaces
    ff <- lapply( lfs, find )
    ## Existing functions (fex) are listed in the order of masking
    ## The first element is used by R in the absence of explicit ::
    fex <- subset( ff, lapply(ff, length) > 0 )
    fex <- lapply( fex, `[`, 1 )
    ## Search for empty entries (ee) among installed packages
    ee <- names(subset( ff, lapply(ff, length) < 1 ))
    ee <- setNames( ee, ee )
    eeh <- lapply( ee, function(e)
        help.search( apropos = paste0("^", e, "$"),
                     fields = "name", ignore.case=FALSE )$matches$Package )
    ## Put everything together
    list( existing = fex, to_load = eeh )
}
Example usage:
str(resolve(func.list = list(nchar, length, str_to_upper, lag, between)))
# List of 2
# $ existing:List of 3
# ..$ nchar : chr "package:base"
# ..$ length: chr "package:base"
# ..$ lag : chr "package:stats"
# $ to_load :List of 2
# ..$ str_to_upper: chr "stringr"
# ..$ between : chr [1:3] "data.table" "dplyr" "rex"
library(dplyr)
str(resolve(func.list = list(nchar, length, str_to_upper, lag, between)))
# List of 2
# $ existing:List of 4
# ..$ nchar : chr "package:base"
# ..$ length : chr "package:base"
# ..$ lag : chr "package:dplyr"
# ..$ between: chr "package:dplyr"
# $ to_load :List of 1
# ..$ str_to_upper: chr "stringr"
library(data.table)
str(resolve(func.list = list(nchar, length, str_to_upper, lag, between)))
# List of 2
# $ existing:List of 4
# ..$ nchar : chr "package:base"
# ..$ length : chr "package:base"
# ..$ lag : chr "package:dplyr"
# ..$ between: chr "package:data.table"
# $ to_load :List of 1
# ..$ str_to_upper: chr "stringr"
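The core technique here is that substitute() captures the argument as an unevaluated call, so its elements can be inspected as symbols without R ever looking them up. A minimal sketch of the allow-list check mentioned in the question, assuming a hypothetical helper check_funcs() and an illustrative allowed set; it only works when the caller writes the list(...) call literally in the argument, not when a pre-built list variable is passed:

```r
# check_funcs() and its default `allowed` set are hypothetical, for illustration.
check_funcs <- function(func.list,
                        allowed = c("nchar", "length", "str_to_upper")) {
  # Capture the unevaluated call `list(f1, f2, ...)` and drop the `list` symbol
  exprs <- as.list(substitute(func.list))[-1]
  # Turn each captured expression into its literal text
  fn_names <- vapply(exprs, function(e) paste(deparse(e), collapse = ""),
                     character(1))
  list(
    requested = fn_names,
    allowed   = fn_names %in% allowed,
    # exists() checks the search path without calling the function
    available = unname(vapply(fn_names, exists, logical(1), mode = "function"))
  )
}

# `no_such_function_xyz` is deliberately undefined; no error is raised
# because the argument is never evaluated.
res <- check_funcs(list(nchar, length, no_such_function_xyz))
str(res)
```

Since func.list is never forced inside the function, an undefined name in the list does not trigger an "object not found" error.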

Why do identical row names yield different results on barplot axis labels? [duplicate]

I've come across a strange behavior when playing with some dataframes: when I create two identical dataframes a,b, then swap their rownames around, they don't come out as identical:
rm(list=ls())
a <- data.frame(a=c(1,2,3),b=c(2,3,4))
b <- a
identical(a,b)
#TRUE
identical(rownames(a),rownames(b))
#TRUE
rownames(b) <- rownames(a)
identical(a,b)
#FALSE
Can anyone reproduce/explain why?
This is admittedly a bit confusing. Starting with ?data.frame we see that:
If row.names was supplied as NULL or no suitable component was found
the row names are the integer sequence starting at one (and such row
names are considered to be ‘automatic’, and not preserved by
as.matrix).
So initially a and b each have an attribute called row.names that is an integer vector:
> str(attributes(a))
List of 3
$ names : chr [1:2] "a" "b"
$ row.names: int [1:3] 1 2 3
$ class : chr "data.frame"
But rownames() returns a character vector (it calls dimnames() under the hood, which returns a list of character vectors). So after reassigning the row names you end up with:
> str(attributes(b))
List of 3
$ names : chr [1:2] "a" "b"
$ row.names: chr [1:3] "1" "2" "3"
$ class : chr "data.frame"
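The difference can be confirmed directly by looking at the type of the row.names attribute, and undone by resetting the row names to NULL, which restores the automatic integer row names. A short sketch (the variable names b_char and b_reset are illustrative):

```r
a <- data.frame(a = c(1, 2, 3), b = c(2, 3, 4))

# Reassigning row names via rownames() stores them as character
b_char <- a
rownames(b_char) <- rownames(a)

type_a <- typeof(attr(a, "row.names"))       # automatic row names: integer
type_b <- typeof(attr(b_char, "row.names"))  # reassigned row names: character

# Setting row names to NULL restores the automatic row names
b_reset <- b_char
rownames(b_reset) <- NULL

identical(a, b_char)   # differs only in the row.names attribute
identical(a, b_reset)
```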

R-programming: How can I change the variable names when I combine lists?

I have data in this format:
x = c(list(a=1,b=2),list(a=5,b=6))
How can I change it to the following format?
x = list(a=1,b=2,a1=5,b1=6)
I am aware that I can achieve the above by using
names(x)[3:4]=c('a1','b1')
but this isn't practical, as the length of each list varies in the data set that I have.
We can use make.unique and it works for all cases without doing any conversion
names(x) <- make.unique(names(x), sep="")
names(x)
#[1] "a" "b" "a1" "b1"
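To illustrate why this copes with lists of varying length, here is a small sketch with three combined lists of different sizes (the third list and the name c are made up for the example). make.unique() leaves the first occurrence of each name alone and numbers later duplicates in order:

```r
# Three lists of different lengths combined into one; names repeat unevenly
x <- c(list(a = 1, b = 2), list(a = 5, b = 6, c = 7), list(a = 9))

names(x) <- make.unique(names(x), sep = "")
new_names <- names(x)
new_names
```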
How about this...
as.list(as.data.frame(x))
$a
[1] 1
$b
[1] 2
$a.1
[1] 5
$b.1
[1] 6

Why does sapply() return a list with attributes when used on characters?

There is a strange behaviour of sapply() when used on a vector of characters:
y <- c("Hello", "bob", "daN")
z <- sapply(y, function(x) {toupper(x)})
z
# Hello bob daN
# "HELLO" "BOB" "DAN"
str(z)
# Named chr [1:3] "HELLO" "BOB" "DAN"
# - attr(*, "names")= chr [1:3] "Hello" "bob" "daN"
Why does sapply() return a vector with the old values as attributes? I don't want them, I don't need them and I am not aware of this behaviour when applied on e.g. numerical vectors.
By default, when X is a character vector that has no names of its own, sapply() uses the elements of X as names for the result.
The result can be delivered without the names by using USE.NAMES = FALSE in the call.
sapply(y, toupper, USE.NAMES = FALSE)
# [1] "HELLO" "BOB" "DAN"
This is explained in help(sapply)
USE.NAMES - logical; if TRUE and if X is character, use X as names for the result unless it had names already. Since this argument follows ... its name cannot be abbreviated.
Note that when you are applying a single function only, there is no need to use an anonymous function (anonymous function usage is also slightly less efficient). This is also shown above.
Also note that sapply() is not necessary here, as toupper() is vectorized.
toupper(y)
# [1] "HELLO" "BOB" "DAN"
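A short sketch contrasting the cases discussed above: names appear for a character input but not for a numeric one, and they can be dropped afterwards with unname() or avoided with USE.NAMES = FALSE (shown here with vapply(), which takes the same argument). The variable names are illustrative.

```r
y <- c("Hello", "bob", "daN")

named_res  <- sapply(y, toupper)                      # character input: names added
num_res    <- sapply(c(1, 2, 3), function(v) v * 2)   # numeric input: no names
plain_res  <- unname(named_res)                       # drop names after the fact
vapply_res <- vapply(y, toupper, character(1), USE.NAMES = FALSE)

names(named_res)
names(num_res)
```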
