String recognition in IDL

I have the following strings:
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat
F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat
From each I want to extract three variables: (1) SWIR32, (2) the date, and (3) the text following the date. I want to automate this for about 200 files, so picking out the positions by hand for each one won't work for me.
So I want:
variable1=SWIR32
variable2=2005210
variable3=East_A
variable4=SWIR32
variable5=2005210
variable6=Froemke-Hoy
I am going to use these to add titles to graphs later on, but since the position of the text varies from string to string, I am unsure how to do this using STRMID.

I think you want to use a combination of STRPOS and STRSPLIT. Something like the following:
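s = ['F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat', $
     'F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat']
name = STRARR(s.length)
date = name
txt = name
foreach sub, s, i do begin
  ; Keep only the file name (everything after the last backslash).
  sub = STRMID(sub, 1+STRPOS(sub, '\', /REVERSE_SEARCH))
  ; Drop the extension so that txt does not end in ".dat".
  sub = STRMID(sub, 0, STRPOS(sub, '.', /REVERSE_SEARCH))
  parts = STRSPLIT(sub, '_', /EXTRACT)
  name[i] = parts[0]
  date[i] = parts[1]
  ; Rejoin the rest, since the trailing text may itself contain underscores.
  txt[i] = STRJOIN(parts[2:*], '_')
endforeach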
You could also do this with a regular expression (using just STRSPLIT), but regular expressions tend to be complicated and error-prone.
Hope this helps!
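For comparison, here is a minimal sketch of the regular-expression route, written in Python rather than IDL purely for illustration. It assumes every file name follows the name_date_text.dat layout of the two examples above; split_name and PATTERN are hypothetical names, not part of the answer.

```python
import re
from pathlib import PureWindowsPath

# Assumed layout: <name>_<digits>_<text>, where <text> may contain underscores.
PATTERN = re.compile(r"^([^_]+)_(\d+)_(.+)$")


def split_name(path):
    """Return (name, date, text) parsed from a Windows-style file path."""
    stem = PureWindowsPath(path).stem  # file name without directory or extension
    match = PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {stem!r}")
    return match.groups()


paths = [
    r"F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_East_A.dat",
    r"F:\Sheyenne\ROI\SWIR32_subset\SWIR32_2005210_Froemke-Hoy.dat",
]
for p in paths:
    print(split_name(p))
```

Only the first two underscores act as separators, so trailing text such as East_A stays in one piece.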

Related

Naming the rows in a matrix in R

I am trying to name the rows in a matrix, but it adds the prefixes 'X.', 'X..' etc. in front of these names. Also, the row names don't come out properly. For example, the first-row name is supposed to be 'e subscript (t+1)' but it shows something else. The even-numbered row names should be vacant, but they are given names. Could you help please?
Please see the dataset here.
Below is the code I used:
rownames(table2a)=paste(c("$e_t+1$"," ","$r_t+1$", " ","$\\Delta y_{n,t+1}$"," ",
"$s_{n,t+1}$"," ", "$d_t+1-p_t+1$"," ", "$rb_t+1$"," "))
Included libraries: matrix, dplyr, tidyverse, xtable.
Below is the data from dput(table2a):
structure(c(-0.011918875562309, 0.0493186644094629, 0.00943711646402318,
0.0084043692395113, 0.0140061617086464, 0.00795790133389348,
-0.00372516684283399, 0.00631517247007723, 0.00514156266584497,
0.0039339752041611, 0.0148362913561212, 0.00793003246354337,
-0.0807656037164587, 0.0599852917766847, 0.991792361285981, 0.0102220639400435,
-0.00608828061911691, 0.00967903407684488, 0.010343002117867,
0.00768101625973846, 0.0578541455030235, 0.00478481429473926,
-0.00902328873121743, 0.00964513773477125, -0.799680080407018,
0.340494864072598, 0.0519648273240202, 0.0580235615884655, 0.0850517813830584,
0.0549411579861702, -0.0665428760388874, 0.0435997977143392,
-0.032698572959578, 0.027160069487786, 0.114163705951583, 0.0547487519805466,
0.352025366916776, 0.197746547959218, 0.0476825327079758, 0.0336978915546042,
0.0464511908714403, 0.0319077480426568, 0.904849333951824, 0.0253211146465119,
0.132904050913606, 0.0157735418364402, 0.0653710645280691, 0.0317960059066269,
0.939695537568421, 0.612311426298072, -0.0578948128653228, 0.104343687684969,
-0.0744692071400603, 0.0988006057025484, 0.121089017775182, 0.0784054537723728,
0.0345069733304992, 0.048841914052704, -0.090885199308955, 0.0984546022582597,
-0.280821673428002, 0.248826811381596, -0.0288068135696716, 0.0424024540117092,
-0.0239685609446809, 0.0401498953370305, 0.00219488911775388,
0.0318618569231297, 0.066433933135983, 0.0198480335553826, 0.871940074366622,
0.0400092888905855), .Dim = c(12L, 6L), .Dimnames = list(c("$e_t+1$",
" ", "$r_t+1$", " ", "$\\Delta y_{n,t+1}$", " ", "$s_{n,t+1}$",
" ", "$d_t+1-p_t+1$", " ", "$rb_t+1$", " "), c("ex_stock_ret_100.l1",
"real_int_100.l1", "Chg_1month.l1", "spreads.l1", "log_dp.l1",
"rb_rate_100.l1")))
My desired output (row & column names) is as shown in this picture
To drop just the leading X (the first character of each name), you can do the following:
rownames(table2a) <- substring(rownames(table2a), 2)
You can remove the first X and all following dots (no matter how many there are) with the gsub command:
rownames(table2a) <- gsub("^X\\.*","",rownames(table2a))
^ = beginning of the string;
X = your actual X;
\\. = a dot;
* = 0 or more of the preceding element (in this case \\.); so in total ^X\\.* means: an X as the first character, plus any dots that directly follow it.
gsub replaces this match with "", meaning nothing, leaving only whatever comes after.
EDIT:
To also get rid of every second row name, extend the pattern slightly:
rownames(table2a) <- gsub("^X\\.*[0-9]*","",rownames(table2a))
This also removes any digits directly behind the dots (using [0-9] rather than [1-9] so that zeros are covered too). This should leave those rows empty.
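The effect of that pattern can be checked outside R. Below is a small Python sketch using the same regular expression, with [0-9] for the digits so that zeros are matched as well. The names in mangled are hypothetical, made up to resemble the prefixed row names described in the question.

```python
import re

# Hypothetical prefixed row names, resembling those described in the question.
mangled = ["X.e_t.1.", "X.", "X.r_t.1.", "X..1"]

# ^X      an X at the start of the string
# \.*     zero or more literal dots
# [0-9]*  zero or more digits directly after the dots
cleaned = [re.sub(r"^X\.*[0-9]*", "", name) for name in mangled]
print(cleaned)
```

Names that consisted only of the prefix come out as empty strings, which is the intended result for the even-numbered rows.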

Platform Data Extension: how do I parse ROAD_NAME_FCN

I'm using HERE's Platform Data Extension to retrieve road names. However, I don't understand the strings that I'm getting. I suspect they're encoded somehow but I don't know how to decode them.
For example:
ENGBNFDR Dr NNASN"e|fe "de "e|rre "dri|ve "nol|te;NASY"e|fe "de "e|rre;<snip>
If I split them by a "record separator" character, e.g. link_names.split('\x1e') the values look slightly more intelligible, but only slightly. There are still bizarre abbreviations I don't understand, e.g. ENGBN.
The PDE Layers documents can be found here: http://pde.cit.api.here.com/1/doc/content.html?detail=1&app_id=xxx&app_code=yyy
Layers > ROAD_NAME_FC1 > NAMES.
List of all names for this object, in all languages, latin1/pinyin/phonetic transliterations.
For convenience, non-exonym base names are listed first.
Format:
NAMES = NAME1 \u001D NAME2 \u001D NAME3 ...
NAME = NAME_TEXT \u001E TRANSLIT1 ; TRANSLIT2 ; ... \u001E PHONEME1 ; PHONEME2 ; ...
NAME_TEXT = LANGUAGE_CODE NAME_TYPE IS_EXONYM text
TRANSLIT = LANGUAGE_CODE text
PHONEME = LANGUAGE_CODE IS_PREFERRED text
LANGUAGE_CODE is a 3 character string
NAME_TYPE is one letter (A = abbreviation, B = base name, E = exonym, K = shortened name, S = synonym)
IS_EXONYM = Y if the name is a translation into another language
IS_PREFERRED = Y if this is the preferred phoneme.
Please note, the delimiters are:
\u001D between languages (NAMES level)
\u001E between name text, transliterations, and phonemes
';' between different transliterations and phonemes of the same name.
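The format above can be decoded with plain string splitting. The following Python sketch is one way to do it, under two assumptions that the documentation implies but does not spell out: the fixed-width fields (3-character language code, one-letter flags) are concatenated with no separator, and the text itself contains no ';'. The function name parse_names and the sample string are hypothetical.

```python
GS = "\u001d"  # separates NAME entries
RS = "\u001e"  # separates name text, transliterations, phonemes


def parse_names(raw):
    """Split a NAMES value into a list of dicts, one per NAME entry."""
    names = []
    for entry in raw.split(GS):
        if not entry:
            continue
        # Pad so that missing transliteration/phoneme sections become "".
        name_text, translits, phonemes = (entry.split(RS) + ["", ""])[:3]
        names.append({
            "language": name_text[0:3],
            "name_type": name_text[3:4],         # A, B, E, K or S
            "is_exonym": name_text[4:5] == "Y",
            "text": name_text[5:],
            "translits": [
                {"language": t[0:3], "text": t[3:]}
                for t in translits.split(";") if t
            ],
            "phonemes": [
                {"language": p[0:3], "preferred": p[3:4] == "Y", "text": p[4:]}
                for p in phonemes.split(";") if p
            ],
        })
    return names


# Hypothetical sample built from the documented grammar, not real PDE output.
sample = "ENGBNFDR Dr" + RS + RS + 'NASN"e|fe;NASY"dri|ve'
parsed = parse_names(sample)
print(parsed[0]["language"], parsed[0]["name_type"], parsed[0]["text"])
```

Read this way, a prefix such as ENGBN is simply the language code ENG, the name type B (base name), and the exonym flag N.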

regex to replace text *outside* of {}

I want to use regex to replace commands or tags around strings. My use case is converting LaTeX commands to bookdown commands, which means doing things like replacing \citep{*} with [#*], \ref{*} with \#ref(*), etc. However, let's stick to the generalized question:
Given a string <begin>somestring<end> where <begin> and <end> are known and somestring is an arbitrary sequence of characters, can we use regex to substitute <newbegin> and <newend> to get the string <newbegin>somestring<newend>?
For example, consider the LaTeX command \citep{bonobo2017}, which I want to convert to [#bonobo2017]. For this example:
<begin> = \citep{
somestring = bonobo2017
<end> = }
<newbegin> = [#
<newend> = ]
This question is basically the inverse of this question.
I'm hoping for an R or notepad++ solution.
Additional Examples
Convert \citet{bonobo2017} to #bonobo2017
Convert \ref{myfigure} to \#ref(myfigure)
Convert \section{Some title} to # Some title
Convert \emph{something important} to *something important*
I'm looking for a template regex that I can fill in my <begin>, <end>, <newbegin> and <newend> on a case-by-case basis.
You can try something like this with dplyr + stringr:
string = "\\citep{bonobo2017}"
begin = "\\citep{"
somestring = "bonobo2017"
end = "}"
newbegin = "[#"
newend = "]"
library(stringr)
library(dplyr)
string %>%
  str_extract(paste0("(?<=\\Q", begin, "\\E)\\w+(?=\\Q", end, "\\E)")) %>%
  paste0(newbegin, ., newend)
or:
string %>%
  str_replace_all(paste0("\\Q", begin, "\\E|\\Q", end, "\\E"), "") %>%
  paste0(newbegin, ., newend)
You can also make it a function for convenience:
convertLatex = function(string, BEGIN, END, NEWBEGIN, NEWEND){
  string %>%
    str_replace_all(paste0("\\Q", BEGIN, "\\E|\\Q", END, "\\E"), "") %>%
    paste0(NEWBEGIN, ., NEWEND)
}
convertLatex(string, begin, end, newbegin, newend)
# [1] "[#bonobo2017]"
Notes:
Notice that I manually added an extra \ to "\\citep{bonobo2017}". Inside an ordinary R string a single \ starts an escape sequence, so a second \ is needed to escape the first. (R 4.0.0 and later also accept raw strings such as r"(\citep{bonobo2017})", which avoid the doubling.)
The regex in str_extract uses a positive lookbehind and a positive lookahead to extract the somestring in between begin and end.
str_replace_all takes another approach: it removes begin and end from string, and what is left is then wrapped in newbegin and newend.
The "\\Q" ... "\\E" pair makes the regex engine treat everything between them as literal text: "\\Q" starts the quoted section and "\\E" ends it. This is especially useful in your case since your LaTeX commands likely contain special characters such as \ and {. The pair escapes them for you.
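The same template idea can be sketched in Python, where re.escape plays the role of \Q...\E. This is an illustration under the assumption that somestring never contains the end delimiter (so nested braces are not handled); the function name convert is hypothetical.

```python
import re


def convert(text, begin, end, newbegin, newend):
    """Replace every begin...end pair in text with newbegin...newend."""
    # re.escape makes the delimiters literal; (.*?) lazily captures somestring.
    pattern = re.escape(begin) + r"(.*?)" + re.escape(end)
    # A function replacement keeps backslashes in newbegin/newend literal.
    return re.sub(pattern, lambda m: newbegin + m.group(1) + newend, text)


print(convert(r"\citep{bonobo2017}", r"\citep{", "}", "[#", "]"))
print(convert(r"\emph{something important}", r"\emph{", "}", "*", "*"))
```

Because the pattern matches each begin...end pair as a unit, surrounding text and unrelated closing braces elsewhere in the input are left alone.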

Unable to combine specific lines in notepad++ files using nested for loops

I'm trying to compare portions of lines in two notepad++ files against each other using two variables (vg_line and sn_line) in order to combine them if they are equal. Once it has found a pair it prints certain information from each for loop, but it only finds the first pair and doesn't continue looping through the vg_lines file to compare its other lines with the sn_lines file.
input_file = open(input_VG_name)
input_Server_name = open(input_Server_name)
for line in input_file:
    line_data = line.strip()
    vg_line = line_data[0:44]
    volume_group = line_data[44:58]
    for line1 in input_Server_name:
        line_data = line1.strip()
        sn_line = line_data[0:44]
        server_name = line_data[46:64]
        if vg_line == sn_line:
            print(vg_line, volume_group, server_name)
First post so any tips on what I can do better coding/asking questions is much appreciated!
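The inner loop consumes the second file object on its first pass, so for every later line of the first file there is nothing left to read. Reopening the second file inside the outer loop fixes that.
Try the following: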
input_file = r'c:\file.txt'
input_Server_name = r'c:\server_file.txt'
with open(input_file, 'r') as file:
    for line in file.readlines():
        line_data = line.strip()
        vg_line = line_data[0:44]
        volume_group = line_data[44:58]
        # Reopen the second file for each line of the first one.
        with open(input_Server_name, 'r') as file1:
            for line1 in file1.readlines():
                line1_data = line1.strip()
                sn_line = line1_data[0:44]
                server_name = line1_data[46:64]
                if vg_line == sn_line:
                    print(vg_line, volume_group, server_name)
The catch is that this code has to read the second file again for every line in the first file (which is what I took your original code to be doing).
There are other methods to match two files up; have a search around, there are plenty of answers. Don't forget to check "Code Review", which has some good examples as well.
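One such method, sketched here under the assumption that both files fit comfortably in memory, is to read the second file once into a dictionary keyed by the first 44 characters and then look each line of the first file up in it. The function name match_lines and the sample lines are hypothetical; the slice positions are taken from the question.

```python
def match_lines(vg_lines, server_lines):
    """Pair up lines whose first 44 characters are equal."""
    # Pass 1: index the server lines by their 44-character key.
    servers = {}
    for line in server_lines:
        data = line.strip()
        servers.setdefault(data[0:44], []).append(data[46:64])

    # Pass 2: look up each volume-group line instead of rescanning the file.
    matches = []
    for line in vg_lines:
        data = line.strip()
        key = data[0:44]
        for server_name in servers.get(key, []):
            matches.append((key, data[44:58], server_name))
    return matches


# Hypothetical sample data; real input would come from the two opened files.
key_a = "disk-a".ljust(44)
key_b = "disk-b".ljust(44)
vg = [key_a + "vg01\n", key_b + "vg02\n"]
sn = [key_b + "  srvB\n", key_a + "  srvA\n"]
for row in match_lines(vg, sn):
    print(*row)
```

Any iterable of lines works as input, so open file objects can be passed in directly, and each file is read only once.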

grep on two strings

I'm working to grab two different elements in a string.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is,
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
any help would be appreciated.
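Use normal parentheses ( ) for grouping, not square brackets [ ]. Square brackets define a character class, which matches a single character from the set.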
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
this should work: grep('_abc|_zxy', str, value=T)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
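To compare the patterns from the answers side by side, here is a short Python sketch that mimics grep(..., value = TRUE) with re.search. The helper name grep_values is hypothetical, and Python is used only for illustration.

```python
import re

strings = ["a_abc", "b_abc", "abc", "z_zxy", "x_zxy", "zxy"]


def grep_values(pattern, items):
    """Return the items that contain a match for pattern."""
    return [s for s in items if re.search(pattern, s)]


# Alternation inside a group: underscore followed by abc or zxy.
print(grep_values(r"_(abc|zxy)", strings))
# Underscore, then exactly three characters, then end of string.
print(grep_values(r"_.{3}$", strings))
# The original attempt: each [...] matches ONE character, and ':' never occurs.
print(grep_values(r"[_abc]:[_zxy]", strings))
```

The last call returns an empty list because the pattern requires a literal colon between two single characters, which none of the strings contain.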
