How to replace string by its own part - r

I have a column in a data.table in R that looks like this:
[1] "<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_RESULT\",\"SK190400\",
[2] "=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\",
[3] "<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\",
[4] "=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"OEE_DATA\",
[5] "<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"PING\",\"SK190400\",
The only thing I care about is whether a row contains "UNIT_RESULT", "UNIT_CHECKIN", "OEE_DATA" or "PING", so I would like to replace each row with that string ("UNIT_RESULT" etc.).
The result should look like:
[1] "UNIT_RESULT"
[2] "UNIT_CHECKIN"
[3] "UNIT_CHECKIN"
[4] "OEE_DATA"
[5] "PING"
I have spent many hours trying to find out how to replace a string with a part of itself, but nothing I found gave a useful result.
Replace specific characters within strings
Reference - What does this regex mean?
Test if characters in string in R
At first substring(x, 53, 63) looked like a solution, but it only picks characters at fixed positions, so it is useless unless all rows have the same layout.
Any hints?

The str_match_all function will apply a regex to each element of a vector of strings and return only the match. So we can make a list of all the terms we want to extract and use paste0 to join them together with the | OR operator to make a single regular expression that matches any of the 4 desired terms.
Then we just run the str_match_all function and unlist the resulting list into a character vector.
strings <- c("<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_RESULT\",\"SK190400\"",
"=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\"",
"<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\"",
"=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"OEE_DATA\"",
"<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"PING\",\"SK190400\""
)
items <- c('UNIT_RESULT', 'UNIT_CHECKIN', 'OEE_DATA', 'PING')
library(stringr)
unlist(str_match_all(strings, paste0(items,collapse = '|')))
[1] "UNIT_RESULT" "UNIT_CHECKIN" "UNIT_CHECKIN" "OEE_DATA" "PING"

An alternative is to use str_extract. You pass your string as the 'string' argument and the alternatives you gave as the 'pattern' argument, and it will return whatever of your alternatives is the first one to appear in the string.
library(stringr)
DT[, newstring := str_extract(string_column, "UNIT_RESULT|UNIT_CHECKIN|OEE_DATA|PING")]

I suggest
gsub("^.*?(UNIT_RESULT|UNIT_CHECKIN|OEE_DATA|PING).*","\\1",strings,perl=TRUE)

If you do not have a finite list of strings to search for, I would recommend a more general regex pattern. Here is one that works on the examples you provided:
# Code to create example data.table
library(data.table)
dt <- data.table(f1 = c("<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_RESULT\",\"SK190400\"",
"=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\"",
"<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"UNIT_CHECKIN\",\"SK190400\"",
"=> MSG: 'MessageReq', BODY: '{\"MessageReq\":{\"Parameters\":[\"OEE_DATA\"",
"<= MSG: 'ACK', BODY: '{\"MessageRep\":{\"Parameters\":[\"PING\",\"SK190400\""
))
# Start of code to parse out values:
rex_pattern <- "(?<=(\"))[A-Z]{2,}_*[A-Z]+(?=(\"))"
dt[, .(parsed_val = regmatches(f1, regexpr(pattern = rex_pattern, f1, perl = TRUE)))]
This gives you:
parsed_val
1: UNIT_RESULT
2: UNIT_CHECKIN
3: UNIT_CHECKIN
4: OEE_DATA
5: PING
If you really want to "overwrite" the original field f1 with the new substring, you can use the following:
dt[, `:=`(f1 = regmatches(f1, regexpr(pattern = rex_pattern, f1, perl = TRUE)))]

Related

split a key-value pair in Python

I have a dictionary as follows:
{
"age": "76",
"Bank": "98310",
"Stage": "final",
"idnr": "4578",
"last number + Value": "[345:K]"}
I am trying to adjust the dictionary by splitting the last key-value pair and storing the result under a new key ('Total data'). It should look like this:
"Total data": [
{
"last number": "345",
"Value": "K"
}]
}
Does anyone know if there is a split function based on ':' and '+' or a for loop to accomplish this?
Thanks in advance.
One option is to take the last key of the dict, split the key on + and the value on : (after stripping the outer square brackets), and zip the two lists together. This assumes the format of the data is always the same.
If you want Total data to contain a list, you can wrap the resulting dict in []
from pprint import pprint
d = {
"age": "76",
"Bank": "98310",
"Stage": "final",
"idnr": "4578",
"last number + Value": "[345:K]"
}
last = list(d.keys())[-1]
d["Total data"] = dict(
zip(
last.strip().split('+'),
d[last].strip("[]").split(':')
)
)
pprint(d)
Output (tested with Python 3.9.4)
{'Bank': '98310',
'Stage': 'final',
'Total data': {' Value': 'K', 'last number ': '345'},
'age': '76',
'idnr': '4578',
'last number + Value': '[345:K]'}
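Note that the keys in the output above still carry the spaces around the +. A minimal sketch of a variant that strips that whitespace, wraps the result in a list and drops the original combined key, which is what the question's desired output seems to ask for. The function name split_last_pair and its new_key parameter are made up for illustration, and the sketch assumes the last value always has the "[a:b]" shape:

```python
def split_last_pair(d, new_key="Total data"):
    """Return a copy of d whose last key/value pair is split into a nested dict."""
    result = dict(d)                      # leave the caller's dict untouched
    last_key = list(result)[-1]           # dicts keep insertion order (Python 3.7+)
    raw_value = result.pop(last_key)      # remove the combined pair
    # "last number + Value" -> ["last number", "Value"]
    names = [part.strip() for part in last_key.split("+")]
    # "[345:K]" -> ["345", "K"]
    values = [part.strip() for part in raw_value.strip("[]").split(":")]
    result[new_key] = [dict(zip(names, values))]
    return result


d = {
    "age": "76",
    "Bank": "98310",
    "Stage": "final",
    "idnr": "4578",
    "last number + Value": "[345:K]",
}
new_d = split_last_pair(d)
print(new_d["Total data"])
```

If the original pair should stay in the dict, replace the pop call with a plain lookup.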

Regex find the string between last two quotes " "?

For example, this is my string -> abcd 1234abcda="author 1" content="author 2.">\n
I only want to extract the string author 2. using the function str_extract() in R. How can I use regex to do that? Thank you so much.
You can use :
string = 'abcd 1234abcda="author 1" content="author 2.">\n'
sub('.*"(.*)".*', '\\1', string)
#[1] "author 2."
With str_match
library(stringr)
str_match(string, '.*"(.*)"')[, 2]
Another option is to extract all the values with "author" followed by a number and select the last one using tail. The optional \\.? keeps the trailing period, which a bare \\d+ would drop.
tail(str_extract_all(string, 'author \\d+\\.?')[[1]], 1)

Splitting character string in R - Extracting the timestamp

Thank you in advance for any feedback.
I am attempting to clean some data in R where a timestamp and a text string are included together in the same cell. I am not getting the expected result. I know the regex needs validation work, but for now I am just testing out this particular function.
Expected:
"04/05/2018 17:14:35" " -(Additional comments) update"
Actual:
"04/05/2018 17:14:35 -(Additional comments) update"
What I tried:
string <- "04/05/2018 17:14:35 -(Additional comments) update"
pattern <- "[:digit:][:digit:][:punct:]
[:digit:][:digit:][:punct:]
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]
[:punct:]
[:digit:][:digit:]"
strsplit(string, pattern)
I also tried this variation, same result
pattern <- "[:digit:][:digit:]\\/
[:digit:][:digit:]\\/
[:digit:][:digit:][:digit:][:digit:]
[[:space:]]
[:digit:][:digit:]
\\:
[:digit:][:digit:]
\\:
[:digit:][:digit:]"
You can try :
string <- "04/05/2018 17:14:35 -(Additional comments) update"
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2}).*","\\1", string)
#[1] "04/05/2018 17:14:35"
#RHS part
gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)","\\2", string)
#[1] " -(Additional comments) update"
Regex explanation:
\\d{2} - 2 digits
\\d{4} - 4 digits
/ - separator
: - separator
() - Group for selection
.* - Followed by anything
It seems the OP is keen on using strsplit. One option is to insert a marker between the two groups and split on it:
strsplit(gsub("(\\d{2}/\\d{2}/\\d{4} \\d{2}:\\d{2}:\\d{2})(.*)",
paste("\\1","####","\\2",sep=""), string), split = "####")
# [[1]]
# [1] "04/05/2018 17:14:35" " -(Additional comments) update"
Try this:
sub('-.*','',string)
[1] "04/05/2018 17:14:35 "

how to print recursively a Python dictionary and its subdictionaries with whitespace alignment into columns

I want to create a function that can take a dictionary of dictionaries such as the following
information = {
"sample information": {
"ID": 169888,
"name": "ttH",
"number of events": 124883,
"cross section": 0.055519,
"k factor": 1.0201,
"generator": "pythia8",
"variables": {
"trk_n": 147,
"zappo_n": 9001
}
}
}
and then print it in a neat way such as the following, with alignment of keys and values using whitespace:
sample information:
ID: 169888
name: ttH
number of events: 124883
cross section: 0.055519
k factor: 1.0201
generator: pythia8
variables:
trk_n: 147
zappo_n: 9001
My attempt at the function is the following:
def printDictionary(
dictionary = None,
indentation = ''
):
for key, value in dictionary.iteritems():
if isinstance(value, dict):
print("{indentation}{key}:".format(
indentation = indentation,
key = key
))
printDictionary(
dictionary = value,
indentation = indentation + ' '
)
else:
print(indentation + "{key}: {value}".format(
key = key,
value = value
))
It produces the output like the following:
sample information:
name: ttH
generator: pythia8
cross section: 0.055519
variables:
zappo_n: 9001
trk_n: 147
number of events: 124883
k factor: 1.0201
ID: 169888
As shown, it successfully prints the dictionary of dictionaries recursively; however, it does not align the values into a neat column. What would be a reasonable way of doing this for dictionaries of arbitrary depth?
Try using the pprint module. Instead of writing your own function, you can do this:
import pprint
pprint.pprint(my_dict)
Be aware that this will print characters such as { and } around your dictionary and [] around your lists, but if you can ignore them, pprint() will take care of all the nesting and indentation for you.
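pprint does not line the values up in a single column, though. If that alignment matters, here is a rough sketch of one way to get it: first find the widest "indentation + key + colon" label at any depth, then pad every label to that width. The helper names label_width and aligned_lines are made up for this example, and it is written for Python 3 (items() instead of iteritems()):

```python
def label_width(dictionary, indent="   ", level=0):
    """Width of the longest 'indent + key + colon' label among non-dict entries."""
    widths = [0]
    for key, value in dictionary.items():
        if isinstance(value, dict):
            widths.append(label_width(value, indent, level + 1))
        else:
            widths.append(len(indent) * level + len(str(key)) + 1)
    return max(widths)


def aligned_lines(dictionary, indent="   ", level=0, width=None):
    """Return the lines to print, with all values starting in the same column."""
    if width is None:
        width = label_width(dictionary, indent)
    lines = []
    for key, value in dictionary.items():
        label = f"{indent * level}{key}:"
        if isinstance(value, dict):
            lines.append(label)  # a sub-dictionary gets a heading line only
            lines.extend(aligned_lines(value, indent, level + 1, width))
        else:
            lines.append(f"{label.ljust(width)} {value}")
    return lines


information = {
    "sample information": {
        "ID": 169888,
        "name": "ttH",
        "number of events": 124883,
        "cross section": 0.055519,
        "k factor": 1.0201,
        "generator": "pythia8",
        "variables": {"trk_n": 147, "zappo_n": 9001},
    }
}
lines = aligned_lines(information)
print("\n".join(lines))
```

Returning the lines instead of printing inside the recursion keeps the function easy to test. The key order follows the dict's insertion order, which Python 3.7+ preserves.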

grep on two strings

I'm working to grab two different elements in a string.
The strings look like this:
str <- c('a_abc', 'b_abc', 'abc', 'z_zxy', 'x_zxy', 'zxy')
I have tried the different options in ?grep, but I can't get it right. I'm doing something like this:
grep('[_abc]:[_zxy]',str, value = TRUE)
and what I would like is,
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
any help would be appreciated.
Use normal parentheses (, not the square brackets [
grep('_(abc|zxy)',str, value = TRUE)
[1] "a_abc" "b_abc" "z_zxy" "x_zxy"
To make the grep a bit more flexible, you could do something like:
grep('_.{3}$',str, value = TRUE)
Which will match an underscore _ followed by any character . three times {3} followed immediately by the end of the string $
This should work: grep('_abc|_zxy', str, value = TRUE)
X|Y matches when either X matches or Y matches
In this case just doing:
str[grep("_",str)]
will work... is it more complicated in your specific case?
