order strings according to some characters - r

I have a vector of strings, each of which contains a number, and I would like to sort the vector by that number.
MWE:
> str = paste0('N', sample(c(1,2,5,10,11,20), 6, replace = FALSE), 'someotherstring')
> str
[1] "N11someotherstring" "N5someotherstring" "N2someotherstring" "N20someotherstring" "N10someotherstring" "N1someotherstring"
> sort(str)
[1] "N10someotherstring" "N11someotherstring" "N1someotherstring" "N20someotherstring" "N2someotherstring" "N5someotherstring"
while I'd like to have
[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
I have thought of using something like:
num = sapply(strsplit(str, split = NULL), function(s) {
as.numeric(paste0(head(s, -15)[-1], collapse = ""))
})
str = str[sort(num, index.return=TRUE)$ix]
but I guess there might be something simpler

There is an easy way to do this with the gtools package:
gtools::mixedsort(str)
#[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
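If you would rather avoid an extra dependency, a base R sketch along the same lines is to pull the digits out with a regular expression and index by order(). It assumes every string starts with "N" followed by the number, as in the question's example; the names num and sorted are only illustrative.

```r
# Example vector from the question (fixed order instead of sample()).
str <- c("N11someotherstring", "N5someotherstring", "N2someotherstring",
         "N20someotherstring", "N10someotherstring", "N1someotherstring")

# Keep only the run of digits that follows the leading "N".
num <- as.numeric(sub("^N([0-9]+).*$", "\\1", str))

# order() returns the permutation that sorts num; apply it to str.
sorted <- str[order(num)]
sorted
```

Unlike the strsplit approach in the question, this does not depend on the length of the trailing text.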

Related

Format thousand to Ks in R

How can I format numbers like 465456.6789 as a tidy 465,4K in R? Other examples: 13567.566 to 13,5K, 3567.5 to 3,5K, and so on. In general I want something like
roundup_to <- function(x, to = 10, up = FALSE){
if(up) round(.Machine$double.eps^0.5 + x/to)*to else round(x/to)*to
}
roundup_to(c((74453.867574737)), to = 100)
to become 74,5K
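You can also have a look at the function scales::label_number_si, which rounds the number and appends an SI suffix. (Recent versions of scales deprecate it in favour of label_number() with scale_cut = cut_short_scale().)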
a <- c(465456.6789, 3567.5, 1465458.12)
scales::label_number_si(accuracy = 0.1)(a)
#[1] "465.5K" "3.6K" "1.5M"
You could do:
a <- c(465456.6789, 13567.566, 3567.5)
sprintf("%sK", format(round(a/1000, 1), decimal.mark = ","))
[1] "465,5K" " 13,6K" "  3,6K"
Note that format() pads every element to a common width, hence the leading spaces.
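If the padding is unwanted, one possible variant uses formatC(), which does not pad by default. This is only a sketch: format_k is a hypothetical helper name, it rounds rather than truncates, and it always uses the K suffix regardless of magnitude.

```r
# Hypothetical helper: thousands with one decimal, a comma as the decimal
# mark, and no padding (formatC defaults to width = 0).
format_k <- function(x, digits = 1) {
  paste0(formatC(x / 1000, format = "f", digits = digits, decimal.mark = ","), "K")
}

a <- c(465456.6789, 13567.566, 3567.5)
format_k(a)
# [1] "465,5K" "13,6K"  "3,6K"
```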

How to sub matching words with bracketed words?

Trying to create a function to bracket reserved words in Access for a SQL query:
library(dplyr)
tester <- data.frame(names=c("Add", "Date", "Test", "DOB"))
bracket_access <- function(x) {x %>% gsub(c("ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|
AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|
COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|
CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|
CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|
DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|
DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|
FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|
GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|
IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|
INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|
JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|
LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|
NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|
OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|
PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|
Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|
Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|
SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|
StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|
TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|
VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO"), paste0("[",.,"]"), ignore.case = T)
}
bracket_access(tester)
I get a numeric output and I don't really understand why:
> bracket_access(tester)
[1] "[c(1, 2, 4, 3)]"
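The output looks numeric because the pipe passes tester as the first argument of gsub, i.e. as the pattern, while paste0("[", ., "]") ends up in the x position; pasting a whole data frame deparses its factor column to the integer codes, hence "[c(1, 2, 4, 3)]". You can fix the approach by passing the names column as x, capturing whichever alternative matches the whole string, and using [\\1] as the replacement: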
bracket_access <- function(x) {
gsub("^(ADD|ALL|Alphanumeric|ALTER|AND|ANY|Application|AS|ASC|Assistant|AUTOINCREMENT|Avg|BETWEEN|BINARY|BIT|BOOLEAN|BY|BYTE|CHAR|CHARACTER|COLUMN|CompactDatabase|CONSTRAINT|Container|Count|COUNTER|CREATE|CreateDatabase|CreateField|CreateGroup|CreateIndex|CreateObject|CreateProperty|CreateRelation|CreateTableDef|CreateUser|CreateWorkspace|CURRENCY|CurrentUser|DATABASE|DATE|DATETIME|DELETE|DESC|Description|DISALLOW|DISTINCT|DISTINCTROW|Document|DOUBLE|DROP|Echo|Else|End|Eqv|Error|EXISTS|Exit|FALSE|Field |Fields|FillCache|FLOAT |FLOAT4 |FLOAT8|FOREIGN|Form |Forms|FROM|Full|FUNCTION|GENERAL|GetObject|GetOption|GotoPage|GROUP|GROUP BY|GUID|HAVING|Idle|IEEEDOUBLE|IEEESINGLE|If|IGNORE|Imp|IN|INDEX|Index|Indexes|INNER|INSERT|InsertText|INT|INTEGER|INTEGER1 |INTEGER2 |INTEGER4|INTO|IS|JOIN|KEY|LastModified|LEFT|Level|Like|LOGICAL |LOGICAL1|LONG |LONGBINARY|LONGTEXT|Macro|Match|Max |Min |Mod|MEMO|Module|MONEY|Move|NAME|NewPassword|NO|Not|Note|NULL|NUMBER |NUMERIC|Object|OLEOBJECT|OFF|ON|OpenRecordset|OPTION|OR|ORDER|Orientation|Outer|OWNERACCESS|Parameter|PARAMETERS|Partial|PERCENT|PIVOT|PRIMARY|PROCEDURE|Property|Queries|Query|Quit|REAL|Recalc|Recordset|REFERENCES|Refresh|RefreshLink|RegisterDatabase|Relation|Repaint|RepairDatabase|Report|Reports|Requery|RIGHT|SCREEN|SECTION|SELECT|SET|SetFocus|SetOption|SHORT|SINGLE|SMALLINT|SOME|SQL|StDev|StDevP|STRING|Sum|TABLE|TableDef|TableDefs|TableID|TEXT|TIME |TIMESTAMP|TOP|TRANSFORM|TRUE|Type|UNION|UNIQUE|UPDATE|USER|VALUE|VALUES|Var|VarP|VARBINARY|VARCHAR|VERSION|WHERE|WITH|Workspace|Xor|Year|YES|YESNO)$",
"[\\1]",
x,
ignore.case = TRUE)
}
bracket_access(tester$names)
## => [1] "[Add]" "[Date]" "Test" "DOB"
Here, the gsub pattern looks like ^(word1|word2|...|wordN)$ and once there is a match, the whole string is wrapped with [...] and put back (the \\1 is a placeholder for the capturing group #1 in the pattern, and there is one defined with a pair of unescaped parentheses).
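Because the pattern is anchored with ^ and $, alternatives that carry a trailing space in the list (for example "Field " or "FLOAT ") can only match names that also end in a space. A sketch that sidesteps this keeps the reserved words in a character vector, trims them, and compares case-insensitively with %in% instead of a regex. The vector below is a small illustrative subset, and bracket_reserved is a hypothetical name.

```r
# Illustrative subset of the reserved words; the full list would go here.
reserved <- c("ADD", "DATE", "Field ", "GROUP BY", "SELECT")
reserved <- trimws(reserved)  # drop stray spaces left over from the list

# Wrap exact, case-insensitive matches in square brackets.
bracket_reserved <- function(x, words = reserved) {
  x <- as.character(x)  # also handles factor columns
  hit <- toupper(x) %in% toupper(words)
  x[hit] <- paste0("[", x[hit], "]")
  x
}

bracket_reserved(c("Add", "Date", "Test", "DOB", "field"))
# [1] "[Add]"   "[Date]"  "Test"    "DOB"     "[field]"
```

An exact comparison also avoids having to escape any regex metacharacters that might appear in a word list.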

Avoiding for loop, Naming Example

I would like to avoid the for loop in the following example. The goal is to repeat a string vector several times, appending a second part that changes with each repetition. Is that possible?
str2D = mtcars
Vector = c(10,20)
Dimen = dim( str2D )
nn = c()
for ( i in Dimen[2]*(1:length(Vector)) ){
nn[ (i+1-Dimen[2]): i ] = rep(paste("|d",Vector[i/Dimen[2]],sep=""), Dimen[2] )
}
Name = paste( rep(names(str2D) , length(Vector) ),nn,sep="")
The correct result for the "Name" vector is the following:
"mpg|d10" "cyl|d10" "disp|d10" "hp|d10" "drat|d10" "wt|d10" "qsec|d10" "vs|d10" "am|d10" "gear|d10" "carb|d10" "mpg|d20" "cyl|d20" "disp|d20" "hp|d20" "drat|d20" "wt|d20" "qsec|d20" "vs|d20" "am|d20" "gear|d20" "carb|d20"
Thank you
I don't quite understand the end goal here but at least this achieves your desired output without a loop:
Name <- paste0(paste(names(mtcars)), "|d", rep(1:2, each = length(names(mtcars))), "0")
> Name
[1] "mpg|d10" "cyl|d10" "disp|d10" "hp|d10" "drat|d10" "wt|d10" "qsec|d10"
[8] "vs|d10" "am|d10" "gear|d10" "carb|d10" "mpg|d20" "cyl|d20" "disp|d20"
[15] "hp|d20" "drat|d20" "wt|d20" "qsec|d20" "vs|d20" "am|d20" "gear|d20"
[22] "carb|d20"
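The line above hardcodes 1:2 and the trailing "0". A slightly more general sketch, assuming the suffix values should come from Vector as in the question, relies on paste0 recycling the shorter vector (cols is just an illustrative name):

```r
Vector <- c(10, 20)
cols <- names(mtcars)  # the 11 column names

# rep(..., each = ) repeats every suffix once per column;
# paste0 then recycles cols across the longer suffix vector.
Name <- paste0(cols, "|d", rep(Vector, each = length(cols)))
Name
```

This works for any length of Vector without changing the code.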

Use grep() to select character strings with "XXX-0000" syntax

Given a character vector:
id.data = c("XXX-2355",
"XYz-03",
"XYU-3",
"ABC-1234",
"AX_2356",
"AbC234")
What is the appropriate way to grep for ONLY the entries that DON'T follow an "XXX-0000" pattern? In the example above I'd want to end up with only "XXX-2355" and "ABC-1234". There are tens of thousands of records.
I tried selecting by individual issue. For example,
id.error = rep(NA, length(id.data))
id.error[-grep("-", id.data)] = "hyphen"
This was obviously really inefficient, and I have no way of knowing every possible error. strsplit was useful up to a point, but only when I know where to split.
Thanks!
You seem to be looking for invert:
invert logical. If TRUE return indices or values for elements that do not match.
> id.data = c("XXX-2355",
+ "XYz-03",
+ "XYU-3",
+ "ABC-1234",
+ "AX_2356",
+ "AbC234")
> grep("[A-Z]{3}-[0-9]{4}", id.data)
[1] 1 4
> grep("[A-Z]{3}-[0-9]{4}", id.data, value = TRUE)
[1] "XXX-2355" "ABC-1234"
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE)
[1] 2 3 5 6
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE, value = TRUE)
[1] "XYz-03" "XYU-3" "AX_2356" "AbC234"
Not sure whether you want strings that match the said pattern, or those that don't match. The above example lists both options.
One way:
library(stringr)
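id.data[str_detect(id.data, "[A-Za-z]{3}-[0-9]{4}")]
[1] "XXX-2355" "ABC-1234"
(The range is written as [A-Za-z] rather than [A-z], because [A-z] also matches the punctuation characters that sit between "Z" and "a" in ASCII, such as "_" and "^".)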
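None of the patterns above is anchored, so a longer string such as "XABC-12345" would also count as a match. If the whole entry has to have the "XXX-0000" shape, a sketch with ^ and $ anchors and grepl() gives a logical vector that can select either group. The extra test entry is made up for illustration.

```r
id.data <- c("XXX-2355", "XYz-03", "XYU-3", "ABC-1234", "AX_2356", "AbC234",
             "XABC-12345")  # last entry is a made-up edge case

# Anchors force the entire string to be 3 capitals, a hyphen, 4 digits.
pattern <- "^[A-Z]{3}-[0-9]{4}$"
ok <- grepl(pattern, id.data)

valid   <- id.data[ok]   # entries that follow the pattern
invalid <- id.data[!ok]  # everything else, whatever the error is
valid
invalid
```

Keeping the logical vector around means you do not have to enumerate each kind of error separately.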

How to remove outliers from a list of vectors?

I have this list of vectors :
tdatm.sp=structure(list(X3CO = c(24.88993835, 25.02366257, 24.90308762
), X3CS = c(25.70629883, 25.26747704, 25.1953907), X3CD = c(26.95723343,
26.84725571, 26.2314415), X3CSD = c(36.95250702, 36.040905, 36.90475845
), X5CO = c(25.44123077, 24.97585869, 24.86075592), X5CS = c(25.71570396,
26.10244179, 25.39032555), X5CD = c(27.67508507, 27.18985558,
26.93682098), X5CSD = c(36.26528549, 34.88553238, 33.97910309
), X7CO = c(24.7142601, 24.08443642, 23.97057915), X7CS = c(24.55734444,
24.56562042, 24.7589817), X7CD = c(27.14260101, 26.65704346,
26.49533081), X7CSD = c(33.89881897, 32.91091919, 32.79199219
), X9CO = c(26.86141014, 26.42648888, 25.8350563), X9CS = c(28.17367744,
27.27400589, 26.58813667), X9CD = c(28.88915062, 28.32597542,
28.2713623), X9CSD = c(34.61352158, 35.84189987, 35.80329132)), .Names = c("X3CO",
"X3CS", "X3CD", "X3CSD", "X5CO", "X5CS", "X5CD", "X5CSD", "X7CO",
"X7CS", "X7CD", "X7CSD", "X9CO", "X9CS", "X9CD", "X9CSD"))
> head(tdatm.sp)
$X3CO
[1] 24.88994 25.02366 24.90309
$X3CS
[1] 25.70630 25.26748 25.19539
$X3CD
[1] 26.95723 26.84726 26.23144
$X3CSD
[1] 36.95251 36.04091 36.90476
$X5CO
[1] 25.44123 24.97586 24.86076
$X5CS
[1] 25.71570 26.10244 25.39033
I would like to remove outliers from each individual vector using the Hampel method.
One way I found to do it is :
repoutliers=function(x){ med=median(x); mad=mad(x); x[x>med+3*mad | x<med-3*mad]=NA; return(x)}
lapply(tdatm.sp, repoutliers)
But I was wondering whether it is possible to do this without declaring a new function, directly within lapply. lapply sends each individual vector to the function repoutliers; do you know how to operate on these individual vectors directly within lapply? Say I swap repoutliers for the function "replace": I could then do the same work by referring to the individual vectors in the arguments of replace (lapply(X, FUN, ...), where ... holds the replace arguments).
In brief: how can I manipulate the individual vectors that lapply sends to the function, within lapply itself?
It's really more or less a tomato tomahtoe thing. Doing it all in lapply doesn't get you very far.
lapply( tdatm.sp, function(x){
med=median(x)
mad=mad(x)
x[x>med+3*mad | x<med-3*mad]=NA
return(x)} )
Now lapply is just sending everything to an anonymous function. But if you didn't want the function hanging around afterwards this is handy syntax.
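For the replace() idea from the question, note that the arguments passed through ... are evaluated once, before lapply hands over any vector, so they cannot refer to the current element. A compact sketch therefore still wraps replace() in an anonymous function. The list below is made-up data for illustration, not the tdatm.sp values.

```r
# Made-up example data: the last value of a is an obvious outlier.
dat <- list(a = c(10, 10.1, 9.9, 10.2, 9.8, 50),
            b = c(1, 2, 3))

# Same rule as repoutliers: anything further than 3 MADs from the median
# becomes NA. The two-sided test is written with abs() for brevity.
res <- lapply(dat, function(x) replace(x, abs(x - median(x)) > 3 * mad(x), NA))
res
```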
