How can I extract a substring based on a regex in U-SQL? - u-sql

I am trying to extract some substring based on a regex in U-SQL. But I couldn't find a built-in function to do so.
Maybe there is even an easier way to solve my problem.
I have version codes like "1.10.12 ABC" or "10.1" or "10.1.10" and want to standardize them in a way that I only get the first two numbers.
So something like "^\d+\.\d+" in regex.
Is there a way to get that result in U-SQL?
@someData =
SELECT * FROM
( VALUES
("1.1.10 ABC"),
("1.10.1"),
("15.3.2")
) AS T(version);
I want the versions in the following format:
"1.1"
"1.10"
"15.3"

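You can try Regex.Match and take its Value. Regex.Replace needs a replacement argument, and the pattern's backslashes need a verbatim (@"...") string:
@someData =
SELECT * FROM
( VALUES
(Regex.Match("1.1.10 ABC", @"^\d+\.\d+").Value),
(Regex.Match("1.10.1", @"^\d+\.\d+").Value),
(Regex.Match("15.3.2", @"^\d+\.\d+").Value)
) AS T(version);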
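As a quick way to check the pattern itself outside U-SQL, here is a minimal Python sketch of the same extraction. Only the pattern ^\d+\.\d+ and the sample values come from the question; the helper name major_minor is hypothetical.

```python
import re

# Pattern from the question: digits, a dot, digits, anchored at the start.
VERSION_PREFIX = re.compile(r"^\d+\.\d+")


def major_minor(version: str) -> str:
    """Return the leading 'major.minor' part, or '' if there is none."""
    match = VERSION_PREFIX.match(version)
    return match.group(0) if match else ""


versions = ["1.1.10 ABC", "1.10.1", "15.3.2"]
shortened = [major_minor(v) for v in versions]
print(shortened)
```

Returning an empty string for non-matching input is a choice made for this sketch; .NET's Regex.Match(...).Value behaves the same way when nothing matches.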

Related

Unexpected string constant in R when try to select colname from data.table

I am trying to group my customized MovieLens dataset:
groupBy<- data.table(unifiedTbl)
x<- groupBy[,list(rating=sum(rating)
,Unknown=sum(unknown)
,Action=sum(Action)
,Adventure = sum(Adventure)
,Animation = sum(Animation)
,"Children's" = sum(Children's)
),by=list(user_id,age,occupation)]
But because of the Children's column I get an error about an unexpected string constant.
If I remove the following part of my code, everything works:
,"Children's" = sum(Children's)
How can I refer to this column by its full name?
How can I fix my code?
You can use backticks with names that aren't valid syntax:
`Children's` = sum(`Children's`)
And of course, I'd recommend creating valid names instead:
setnames(groupBy, make.names(names(groupBy)))

Silverstripe combine filterAny and filter to have an OR with an AND in it

I have a question, and maybe somebody can help me figure out the best way to solve it. What I want to do is the reverse of:
http://doc.silverstripe.com/framework/en/topics/datamodel#highlighter_745635
WHERE ("LastName" = 'Minnée' AND ("FirstName" = 'Sam' OR "Age" = '17'))
I want to get something along the lines of:
WHERE( ("LastName" = 'Minnée') OR ("FirstName" = 'Sam' AND "Age" = '17'))
I cannot find any way to achieve this, since I cannot add a filter within the filterAny.
For now I am doing it with the get()->where( ... ) option, but I am wondering whether there are alternatives that do not require writing plain SQL.
Because AND has higher precedence than OR, the AND part doesn't need to be in parentheses. You can just write it as:
WHERE( "LastName" = 'Minnée' OR "FirstName" = 'Sam' AND "Age" = '17')
But as far as I can tell at first glance, there isn't a way to write this without using where(). Let us know if you find one. For debugging, you can display the generated SQL query by calling the function sql():
... get()->where( ... )->sql();
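The precedence claim can be checked mechanically. Python's and/or follow the same ordering as SQL's AND/OR, so the minimal sketch below compares the groupings over every combination of truth values. It only illustrates operator precedence and is not SilverStripe code; the function and parameter names are hypothetical.

```python
from itertools import product


def without_parens(last_name_ok: bool, first_name_ok: bool, age_ok: bool) -> bool:
    # "and" binds tighter than "or", as AND does relative to OR in SQL.
    return last_name_ok or first_name_ok and age_ok


def with_parens(last_name_ok: bool, first_name_ok: bool, age_ok: bool) -> bool:
    # The grouping the question asks for, written out explicitly.
    return last_name_ok or (first_name_ok and age_ok)


def documented_grouping(last_name_ok: bool, first_name_ok: bool, age_ok: bool) -> bool:
    # Same shape as the documented filter/filterAny example (an AND around an OR).
    return last_name_ok and (first_name_ok or age_ok)


rows = list(product([False, True], repeat=3))
same = all(without_parens(*r) == with_parens(*r) for r in rows)
differs = [r for r in rows if with_parens(*r) != documented_grouping(*r)]
print(same, differs)
```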

How to get the part of the name in whole column?

I have a column like this:
X.at1g01050.1.symbols.atppa1.ppa1.pyrophosphorylase1.chr1.31382.32670reverselength.212
X.at1g01080.1.symbols..rna.binding.rrm.rbd.rnpmotifs.familyprotein.chr1.45503.46789reverselength.293
X.at1g01090.1.symbols.pdh.e1alpha.pyruvatedehydrogenasee1alpha.chr1.47705.49166reverselength.428
X.at1g01220.1.symbols.fkgp.atfkgp.l.fucokinase.gdp.l.fucosepyrophosphorylase.chr1.91750.95552forwardlength.1055
X.at1g01320.2.symbols..tetratricopeptiderepeat.tpr..likesuperfamilyprotein.chr1.121582.130099reverselength.1787
X.at1g01420.1.symbols.ugt72b3.udp.glucosyltransferase72b3.chr1.154566.156011reverselength.481
X.at1g01470.1.symbols.lea14.lsr3.lateembryogenesisabundantprotein.chr1.172295.172826reverselength.151
X.at1g01800.1.symbols..nad.p..bindingrossmann.foldsuperfamilyprotein.chr1.293396.294888forwardlength.295
X.at1g01910.5.symbols..p.loopcontainingnucleosidetriphosphatehydrolasessuperfamilyprotein.chr1.313595.315644reverselength.249
X.at1g01920.2.symbols..setdomain.containingprotein.chr1.316204.319507forwardlength.547
X.at1g01960.1.symbols.eda10.sec7.likeguaninenucleotideexchangefamilyprotein.chr1.330830.337582reverselength.1750
The interesting part of this data is bolded below:
X.**at1g01050.1**.symbols.atppa1.ppa1.pyrophosphorylase1.chr1.31382.32670reverselength.212
I can easily get it by applying the function =MID(B1;3;11) in Excel. I would like to do the same in R.
The column with the names:
tbl_end[,1]
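Use substr. Note that its third argument is the stop position, not the length as in MID, so 11 characters starting at position 3 end at position 13:
substr(tbl_end[,1],3,13)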
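Excel's MID takes a 1-based start and a length, while R's substr takes a 1-based start and a stop position, so the stop is start + length - 1. A minimal Python sketch of that conversion, using the first value from the question (the helper name mid is hypothetical):

```python
def mid(text: str, start: int, length: int) -> str:
    """Mimic Excel's MID: 1-based start position and a fixed length."""
    begin = start - 1  # convert to Python's 0-based indexing
    return text[begin:begin + length]


name = ("X.at1g01050.1.symbols.atppa1.ppa1.pyrophosphorylase1"
        ".chr1.31382.32670reverselength.212")
gene_id = mid(name, 3, 11)
stop_position = 3 + 11 - 1  # the 1-based stop that R's substr would need
print(gene_id, stop_position)
```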

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for whole words but for substrings, so for example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns True because "EW" is contained in "EWM".
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- splitting the commaseparatedstring on commas
- comparing each resulting item for equality instead of calling Contains() on the whole string?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)
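The difference between the substring test and the split-then-compare test can be seen in a minimal Python sketch using the values from the question. The function name contains_word is hypothetical, and trimming whitespace around each item is an assumption that the original code does not make.

```python
def contains_word(comma_separated: str, word: str) -> bool:
    """True only if word equals one whole item of the comma-separated list."""
    items = [item.strip() for item in comma_separated.split(",")]
    return word in items


codes = "EWM,KI,KP"
substring_hit = "EW" in codes             # substring test: matches inside "EWM"
whole_word_hit = contains_word(codes, "EW")
exact_item_hit = contains_word(codes, "KI")
print(substring_hit, whole_word_hit, exact_item_hit)
```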

xQuery substring problem

I now have a full path for a file as a string like:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
However, I now need only the folder path, that is, the string above without whatever follows the last slash:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/"
But it seems that the substring() function in XQuery only has the forms substring(string,start,len) and substring(string,start). I am trying to figure out a way to find the last occurrence of the slash, but no luck.
Could experts help? Thanks!
Try out the tokenize() function (for splitting a string into its component parts) and then re-assembling it, using everything but the last part.
let $full-path := "/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
$segments := tokenize($full-path,"/")[position() ne last()]
return
concat(string-join($segments,'/'),'/')
For more details on these functions, check out their reference pages:
fn:tokenize()
fn:string-join()
fn:replace can do the job with a regular expression:
replace("/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"[^/]+$",
"")
This can be done even with a single XPath 2.0 (subset of XQuery) expression:
substring($fullPath,
1,
string-length($fullPath) - string-length(tokenize($fullPath, '/')[last()])
)
where $fullPath should be substituted with the actual string, such as:
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml"
The following code tokenizes the path, drops the last token, appends an empty string in its place, and joins the parts back together, so the result keeps its trailing slash.
string-join(
(
tokenize(
"/db/Liebherr/Content_Repository/Techpubs/Topics/HyraulicPowerDistribution/Released/TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml",
"/"
)[position() ne last()],
""
),
"/"
)
It seems to return the desired result on try.zorba-xquery.com. Does this help?
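The answers above all cut the string at its last slash. For comparison, a minimal Python sketch of the same idea, with the path taken from the question (the function name folder_of is hypothetical):

```python
def folder_of(full_path: str) -> str:
    """Return everything up to and including the last '/', or '' if there is none."""
    head, separator, _file_name = full_path.rpartition("/")
    return head + separator


full_path = ("/db/Liebherr/Content_Repository/Techpubs/Topics/"
             "HyraulicPowerDistribution/Released/"
             "TRN_282C_HYD_MOD_1_Drive_Shaft_Rev000.xml")
folder = folder_of(full_path)
print(folder)
```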
