Use grep() to select character strings with "XXX-0000" syntax - r

Given a character vector:
id.data = c("XXX-2355",
"XYz-03",
"XYU-3",
"ABC-1234",
"AX_2356",
"AbC234")
What is the appropriate way to grep for ONLY the entries that DON'T follow an "XXX-0000" pattern? In the example above I'd want to end up with only "XXX-2355" and "ABC-1234". There are tens of thousands of records.
I tried selecting by individual issue. For example,
id.error = rep(NA, length(id.data))
id.error[-grep("-", id.data)] = "hyphen"
This was obviously really inefficient, and I have no way of knowing every possible error. strsplit() was useful up to a point, but only when I know where to split.
Thanks!

You seem to be looking for invert:
invert logical. If TRUE return indices or values for elements that do not match.
> id.data = c("XXX-2355",
+ "XYz-03",
+ "XYU-3",
+ "ABC-1234",
+ "AX_2356",
+ "AbC234")
> grep("[A-Z]{3}-[0-9]{4}", id.data)
[1] 1 4
> grep("[A-Z]{3}-[0-9]{4}", id.data, value = TRUE)
[1] "XXX-2355" "ABC-1234"
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE)
[1] 2 3 5 6
> grep("[A-Z]{3}-[0-9]{4}", id.data, invert = TRUE, value = TRUE)
[1] "XYz-03" "XYU-3" "AX_2356" "AbC234"
>
Not sure whether you want strings that match the said pattern, or those that don't match. The above example lists both options.
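Note that the patterns above are unanchored, so a longer string such as "XXXX-23555" would still count as a match. A minimal sketch, assuming the intended format is exactly three uppercase letters, a hyphen, and four digits (the names pattern, ok and the "bad format" label are only illustrative):

```r
id.data <- c("XXX-2355", "XYz-03", "XYU-3", "ABC-1234", "AX_2356", "AbC234")

# ^ and $ anchor the pattern to the whole string (assumed requirement)
pattern <- "^[A-Z]{3}-[0-9]{4}$"

# grepl() returns one TRUE/FALSE per element, which is handy for tagging
ok <- grepl(pattern, id.data)

id.data[ok]   # entries that follow the pattern
id.data[!ok]  # entries that do not

# Tag every non-conforming record in one pass instead of one rule at a time
id.error <- ifelse(ok, NA_character_, "bad format")
```

A logical vector from grepl() lines up with the original data, so it can be stored next to the IDs in a data frame, which may be easier to work with than the index vector returned by grep().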

One way:
library(stringr)
# [A-Z] rather than [A-z]: the latter also matches lowercase letters and some punctuation
id.data[str_detect(id.data, "^[A-Z]{3}-[0-9]{4}$")]
> [1] "XXX-2355" "ABC-1234"

How can I add an element to the beginning of a character in R conditionally? [duplicate]

I am new to coding and I am learning the basics of R. I have a data set of zip codes that I made in Excel; however, the leading 0 of zip codes starting with 0 was automatically dropped when exporting. I am attempting to iterate through the codes and add the 0 back.
My thoughts were, assuming that the zip codes w/o an initial zero are 4 characters long, I simply find the iterations that have length of 4 and then add a 0 to the front, but I am not getting the right answer.
zip<-c(61415, 19087, 63122, 3104, 1938)
zip<-as.character(zip)
> for (i in zip) {
+   if (nchar(i) == 4) {
+     paste0("0", i)
+   }
+ }
NULL
I should get:
"61415", "19087", "63122", "03104", "01938"
It can be done by formatting the numeric vector with sprintf
sprintf("%05d", zip)
#[1] "61415" "19087" "63122" "03104" "01938"
Another option is str_pad
library(stringr)
str_pad(zip, pad = "0", width = 5)
#[1] "61415" "19087" "63122" "03104" "01938"
NOTE: Neither option requires a loop or conditional statements.
data
zip <- c(61415, 19087, 63122, 3104, 1938)
If zip is a character vector, you can also try:
ifelse(nchar(zip) != 5, paste0("0", zip), zip)
[1] "61415" "19087" "63122" "03104" "01938"
If zip is a numeric vector:
formatC(zip, width = 5, format = "d", flag = "0")
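As for why the loop in the question produces nothing useful: paste0() returns a new string, and the loop never stores it anywhere. A sketch of a working loop, shown only to illustrate the fix (the vectorised calls above remain simpler; zip_chr is an illustrative name):

```r
zip <- c(61415, 19087, 63122, 3104, 1938)
zip_chr <- as.character(zip)

# Loop over positions so the padded value can be written back into the vector
for (i in seq_along(zip_chr)) {
  if (nchar(zip_chr[i]) == 4) {
    zip_chr[i] <- paste0("0", zip_chr[i])
  }
}

zip_chr
# [1] "61415" "19087" "63122" "03104" "01938"
```

Like the approach in the question, this assumes only one leading zero can be missing.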

order strings according to some characters

I have a vector of strings, each of which contains a number, and I'd like to sort the vector by that number.
MWE:
> str = paste0('N', sample(c(1,2,5,10,11,20), 6, replace = FALSE), 'someotherstring')
> str
[1] "N11someotherstring" "N5someotherstring" "N2someotherstring" "N20someotherstring" "N10someotherstring" "N1someotherstring"
> sort(str)
[1] "N10someotherstring" "N11someotherstring" "N1someotherstring" "N20someotherstring" "N2someotherstring" "N5someotherstring"
while I'd like to have
[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
I have thought of using something like:
num = sapply(strsplit(str, split = NULL), function(s) {
as.numeric(paste0(head(s, -15)[-1], collapse = ""))
})
str = str[sort(num, index.return=TRUE)$ix]
but I guess there might be something simpler
There is an easy way to do this with the gtools package:
gtools::mixedsort(str)
#[1] "N1someotherstring" "N2someotherstring" "N5someotherstring" "N10someotherstring" "N11someotherstring" "N20someotherstring"
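If adding a package is not an option, a base R sketch along the following lines may also work. It assumes every string starts with "N" followed directly by the number, as in the example; strs and num are illustrative names (strs is used instead of str to avoid masking utils::str).

```r
strs <- paste0("N", c(11, 5, 2, 20, 10, 1), "someotherstring")

# Capture the run of digits right after the leading "N" and convert it
num <- as.numeric(sub("^N([0-9]+).*$", "\\1", strs))

# order() gives the permutation that sorts num; apply it to the strings
sorted <- strs[order(num)]
sorted
# [1] "N1someotherstring"  "N2someotherstring"  "N5someotherstring"
# [4] "N10someotherstring" "N11someotherstring" "N20someotherstring"
```

Unlike the strsplit() attempt in the question, this does not depend on the trailing text having a fixed length of 15 characters.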

How to store a "complex" data structure in R (not "complex numbers")

I need to train, store, and use a list/array/whatever of several ksvm SVM models, so that once I get a set of sensor readings I can call predict() on each of the models in turn. I want to store these models and metadata about them in some sort of data structure, but I'm not very familiar with R, and getting a handle on its data structures has been a challenge. My familiarity is with C++, C, and C#.
I envision some sort of array or list that contains both the ksvm models as well as the metadata about them. (The metadata is necessary, among other things, for knowing how to select & organize the input data presented to each model when I call predict() on it.)
The data I want to store in this data structure includes the following for each entry of the data structure:
The ksvm model itself
A character string saying who trained the model & when they trained it
An array of numbers indicating which sensors' data should be presented to this model
A single number between 1 and 100 that represents how much I, the trainer, trust this model
Some "other stuff"
So in tinkering with how to do this, I tried the following....
First I tried what I thought would be really simple & crude, hoping to build on it later if this worked: A (list of (list of different data types))...
>
> uname = Sys.getenv("USERNAME", unset="UNKNOWN_USER")
> cname = Sys.getenv("COMPUTERNAME", unset="UNKNOWN_COMPUTER")
> trainedAt = paste("Trained at", Sys.time(), "by", uname, "on", cname)
> trainedAt
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
> sensorsToUse = c(12,14,15,16,24,26)
> sensorsToUse
[1] 12 14 15 16 24 26
> trustFactor = 88
>
> TestModels = list()
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
> TestModels
[[1]]
[1] "Trained at 2015-04-22 20:54:54 by mminich on MMINICH1"
>
...wha? What did it think I was trying to replace? I was just trying to populate element 1 of TestModels. Later I would add an element [2], [3], etc... but this didn't work and I don't know why. Maybe I need to define TestModels as a list of lists right up front...
> TestModels = list(list())
> TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Hmm. That no workie either. Let's try something else...
> TestModels = list(list())
> TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor)
Warning message:
In TestModels[1][1] = list(trainedAt, sensorsToUse, trustFactor) :
number of items to replace is not a multiple of replacement length
>
Drat. Still no workie.
Please clue me in on how I can do this. And I'd really like to be able to access the fields of my data structure by name, perhaps something along the lines of...
> print(TestModels[1]["TrainedAt"])
Thank you very much!
You were very close. To avoid the warning, you shouldn't use
TestModels[1] = list(trainedAt, sensorsToUse, trustFactor)
but instead
TestModels[[1]] = list(trainedAt, sensorsToUse, trustFactor)
To access a single list element you use [[ ]]. Using [ ] on a list returns a list containing the selected elements. The warning is shown because TestModels[1] selects a sub-list of length one, and you were assigning a list of three elements to it; R keeps only the first element and warns about the rest. The same happens for any single index, whether or not the element already exists. If you prefer single brackets, wrap the record in an outer list so the replacement also has length one:
TestModels[1] = list(list(trainedAt, sensorsToUse, trustFactor)) # Replacement has length 1, so no warning
To understand list subsetting better, take a look at this:
item1 <- list("a", 1:10, c(T, F, T))
item2 <- list("b", 11:20, c(F, F, F))
mylist <- list(item1=item1, item2=item2)
mylist[1] #This returns a list containing the item 1.
#$item1 #Note the item name of the container list
#$item1[[1]]
#[1] "a"
#
#$item1[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#$item1[[3]]
#[1] TRUE FALSE TRUE
#
mylist[[1]] #This returns item1
#[[1]] #Note this is the same as item1
#[1] "a"
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
#
#[[3]]
#[1] TRUE FALSE TRUE
To access the list items by name, just name them when creating the list:
mylist <- list(var1 = "a", var2 = 1:10, var3 = c(T, F, T))
mylist$var1 #Or mylist[["var1"]]
# [1] "a"
You can nest these operators as you suggested. So you could use
containerlist <- list(mylist)
containerlist[[1]]$var1
#[1] "a"
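Putting the pieces together for the model registry described in the question, one possible layout is a list of named lists. This is only a sketch: make_entry is a hypothetical helper, and plain strings stand in for the fitted ksvm objects so that the example runs without kernlab.

```r
# Hypothetical helper: bundles one model with its metadata in a named list
make_entry <- function(model, trainedAt, sensorsToUse, trustFactor) {
  list(model = model,
       trainedAt = trainedAt,
       sensorsToUse = sensorsToUse,
       trustFactor = trustFactor)
}

TestModels <- list()

# [[ ]] assigns one whole record to one slot (strings are placeholders for models)
TestModels[[1]] <- make_entry("placeholder for ksvm model A",
                              "Trained by userA",
                              c(12, 14, 15, 16, 24, 26), 88)

# Appending: assign to the slot one past the current end
TestModels[[length(TestModels) + 1]] <- make_entry("placeholder for ksvm model B",
                                                   "Trained by userB",
                                                   c(3, 7), 42)

# Access fields by name
TestModels[[1]]$trainedAt
TestModels[[2]][["sensorsToUse"]]

# Pull one field out of every record, e.g. to pick the most trusted model
trust <- sapply(TestModels, function(e) e$trustFactor)
best <- TestModels[[which.max(trust)]]
```

With real models, each record's sensorsToUse could be used to select the input columns before calling predict() on that record's model.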

Truncate decimal to specified places

This seems like it should be a fairly easy problem to solve but I am having some trouble locating an answer.
I have a vector which contains long decimals and I want to truncate it to a specific number of decimals. I do not wish to round it, but rather just remove the values beyond my desired number of decimals.
For example I would like 0.123456789 to return 0.1234 if I desired 4 decimal digits. This is not an issue of printing a specific number of digits but rather returning the original value truncated to a given number.
Thanks.
trunc(x*10^4)/10^4
yields 0.1234 like expected.
More generally,
trunc <- function(x, ..., prec = 0) base::trunc(x * 10^prec, ...) / 10^prec;
print(trunc(0.123456789, prec = 4)) # 0.1234
print(trunc(14035, prec = -2)) # 14000
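The definition above masks base::trunc for the rest of the session. A variant under a separate name avoids that; this is only a sketch of the same scale, truncate, rescale idea, and truncate_dec is a hypothetical name.

```r
# Truncate toward zero at `digits` decimal places (negative digits
# truncate to tens, hundreds, ...), without masking base::trunc
truncate_dec <- function(x, digits = 0) {
  scale <- 10^digits
  trunc(x * scale) / scale
}

truncate_dec(0.123456789, 4)   # 0.1234
truncate_dec(-0.123456789, 4)  # -0.1234 (truncation is toward zero)
truncate_dec(14035, -2)        # 14000

# Caveat: binary floating point can shift a value just below the cut,
# e.g. 0.29 * 100 is stored slightly below 29, so this can give 0.28
truncate_dec(0.29, 2)
```

The function is vectorised, so it can be applied to a whole numeric vector at once.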
I used the techniques above for a long time. One day I ran into problems when copying the results to a text file, and I solved them this way:
trunc_number_n_decimals <- function(numberToTrunc, nDecimals){
numberToTrunc <- numberToTrunc + (10^-(nDecimals+5))
splitNumber <- strsplit(x=format(numberToTrunc, digits=20, scientific=FALSE), split="\\.")[[1]]
decimalPartTrunc <- substr(x=splitNumber[2], start=1, stop=nDecimals)
truncatedNumber <- as.numeric(paste0(splitNumber[1], ".", decimalPartTrunc))
return(truncatedNumber)
}
print(trunc_number_n_decimals(9.1762034354551236, 6), digits=14)
[1] 9.176203
print(trunc_number_n_decimals(9.1762034354551236, 7), digits=14)
[1] 9.1762034
print(trunc_number_n_decimals(9.1762034354551236, 8), digits=14)
[1] 9.17620343
print(trunc_number_n_decimals(9.1762034354551236, 9), digits=14)
[1] 9.176203435
This solution is very handy when it's necessary to write a number with many decimals, such as 16, to a file.
Just remember to convert the number to string before writing to the file, using format()
numberToWrite <- format(trunc_number_n_decimals(9.1762034354551236, 9), digits=20)
Not the most elegant way, but it'll work.
string_it<-sprintf("%06.9f", old_numbers)
pos_list<-gregexpr(pattern="\\.", string_it)
pos<-unlist(lapply(pos_list, '[[', 1)) # position of the decimal point in each string
# keep everything up to 4 digits after the decimal point;
# change pos+4 for a different number of decimals
new_number<-as.numeric(substring(string_it, 1, pos+4))

Converting geo coordinates from degree to decimal

I want to convert my geographic coordinates from degrees to decimals, my data are as follows:
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281
I have this code, but I cannot see why it does not work:
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"\\.")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
return(dec[1]+dec[2]/60+dec[3]/3600)
}
don_convert=don1
for(i in 1:nrow(don1)){don_convert[i,2]=convert(as.character(don1[i,2])); don_convert[i,3]=convert(as.character(don1[i,3]))}
The convert function works, but the loop that is supposed to apply it to each row does not.
Any suggestion is appreciated.
Use the measurements package from CRAN which has a unit conversion function already so you don't need to make your own:
x = read.table(text = "
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header = TRUE, stringsAsFactors = FALSE)
Once your data.frame is set up then:
# change the degree symbol to a space
x$lat = gsub('°', ' ', x$lat)
x$long = gsub('°', ' ', x$long)
# convert from decimal minutes to decimal degrees
x$lat = measurements::conv_unit(x$lat, from = 'deg_dec_min', to = 'dec_deg')
x$long = measurements::conv_unit(x$long, from = 'deg_dec_min', to = 'dec_deg')
Resulting in the end product:
lat long
105252 30.4210666666667 9.02218333333333
105253 30.65395 8.18018333333333
105255 31.6293333333333 8.10066666666667
105258 31.6865 8.10928333333333
105259 31.68715 8.11036666666667
105260 31.6481833333333 8.10468333333333
Try using the char2dms function in the sp library. It has other functions that will additionally do decimal conversion.
library("sp")
?char2dms
A bit of vectorization and matrix manipulation will make your function much simpler:
x <- read.table(text="
lat long
105252 30°25.264 9°01.331
105253 30°39.237 8°10.811
105255 31°37.760 8°06.040
105258 31°41.190 8°06.557
105259 31°41.229 8°06.622
105260 31°38.891 8°06.281",
header=TRUE, stringsAsFactors=FALSE)
x
The function itself makes use of:
strsplit() with the regex pattern "[°\\.]" - this does the string split in one step
sapply to loop over the vector
Try this:
convert<-function(x){
z <- sapply((strsplit(x, "[°\\.]")), as.numeric)
z[1, ] + z[2, ]/60 + z[3, ]/3600
}
Try it:
convert(x$long)
[1] 9.108611 8.391944 8.111111 8.254722 8.272778 8.178056
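Disclaimer: I didn't check your math; this reproduces the formula from your question, which treats the digits after the decimal point as seconds. Use at your own discretion.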
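If the values are degrees and decimal minutes, as in 30°25.264 (30 degrees, 25.264 minutes), the part after the degree sign would be divided by 60 as a whole. A base R sketch under that assumption, for non-negative coordinates only; convert_ddm is a hypothetical name:

```r
# Degrees + decimal minutes -> decimal degrees.
# Assumes each string holds exactly two numbers (degrees, then minutes)
# and that the coordinates are non-negative.
convert_ddm <- function(x) {
  # Pull out the runs of digits/dots: c("30", "25.264") for "30°25.264"
  parts <- regmatches(x, gregexpr("[0-9.]+", x))
  deg <- as.numeric(vapply(parts, `[`, character(1), 1))
  minutes <- as.numeric(vapply(parts, `[`, character(1), 2))
  deg + minutes / 60
}

convert_ddm(c("30°25.264", "9°01.331"))
# approximately 30.42107 and 9.02218
```

These values agree with the conv_unit() results shown earlier for the first row of the data.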
Thanks for the answers by @Gord Stephen and @CephBirk. They sure helped me out.
I thought I'd just mention that measurements::conv_unit doesn't deal with "E/W" or "N/S" entries; it requires positive/negative degrees.
My coordinates come as character strings such as "1 1 1W" and first need to be converted to "-1 1 1".
Here is my solution for that.
df <- c("1 1 1E", "1 1 1W", "2 2 2N","2 2 2S")
measurements::conv_unit(df, from = 'deg_min_sec', to = 'dec_deg')
[1] "1.01694444444444" NA NA NA
Warning message:
In split(as.numeric(unlist(strsplit(x, " "))) * c(3600, 60, 1), :
NAs introduced by coercion
library(stringr) # for str_extract, str_sub, str_length
ewns <- ifelse( str_extract(df,"\\(?[EWNS,.]+\\)?") %in% c("E","N"),"+","-")
dms <- str_sub(df,1,str_length(df)-1)
df2 <- paste0(ewns,dms)
df_dec <- measurements::conv_unit(df2,
from = 'deg_min_sec',
to = 'dec_deg')
df_dec
[1] "1.01694444444444" "-1.01694444444444" "2.03388888888889" "-2.03388888888889"
as.numeric(df_dec)
[1] 1.016944 -1.016944 2.033889 -2.033889
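The same sign handling can be sketched in base R without stringr or measurements. This assumes every string has the form "D M S" followed by a single hemisphere letter, as in the example; dms_to_dd is a hypothetical name.

```r
# "D M S<hemisphere>" -> signed decimal degrees (W and S become negative)
dms_to_dd <- function(x) {
  n <- nchar(x)
  hemi <- substring(x, n, n)                      # last character: E, W, N or S
  parts <- strsplit(substring(x, 1, n - 1), " ", fixed = TRUE)
  dd <- vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] + p[2] / 60 + p[3] / 3600                # degrees + minutes + seconds
  }, numeric(1))
  ifelse(hemi %in% c("W", "S"), -dd, dd)
}

dms_to_dd(c("1 1 1E", "1 1 1W", "2 2 2N", "2 2 2S"))
# approximately 1.016944 -1.016944 2.033889 -2.033889
```

The result is already numeric, so no as.numeric() step is needed afterwards.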
Have a look at the command degree in the package OSMscale.
As Jim Lewis commented before, it seems you are using floating-point minutes. In that case only two elements (degrees and decimal minutes) should be concatenated in
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]]))
For degrees, minutes and seconds in the form 43°21'8.02", which as.character() returns as "43°21'8.02\"", I updated your function to
convert<-function(coord){
tmp1=strsplit(coord,"°")
tmp2=strsplit(tmp1[[1]][2],"'")
tmp3=strsplit(tmp2[[1]][2],"\"")
dec=c(as.numeric(tmp1[[1]][1]),as.numeric(tmp2[[1]][1]),as.numeric(tmp3[[1]]))
dd<-abs(dec[1])+dec[2]/60+dec[3]/3600 # named dd to avoid shadowing base c()
dd<-ifelse(dec[1]<0,-dd,dd)
return(dd)
}
This adds handling for negative coordinates and works great for me. I still don't get why the char2dms function in the sp library didn't work for me.
Thanks
Another, less elegant option uses substring instead of strsplit. This will only work if all your positions have the same number of digits. For negative co-ordinates, just multiply by -1 to get the correct decimal degrees.
x$LatDD<-(as.numeric(substring(x$lat, 1,2))
+ (as.numeric(substring(x$lat, 4,9))/60))
x$LongDD<-(as.numeric(substring(x$long, 1,1))
+ (as.numeric(substring(x$long, 3,8))/60))
