I'm working with an address database. For further cleaning I need to identify the leading zeros that are stored in the string containing the door number. So a clean and friendly case would be something like 7/5 - where 7 would be the house number, 5 the door number.
The problem is that in some cases, there are leading zeros involved - not only at the beginning of the string, but also in the middle. And of course, there are also "normal" and necessary zeros. So I could have an address like 7/50, where the zero is totally fine and should stay there. But I could also have 007/05, where all the zeros need to go away. So the only pattern I can think of is to check whether there is a digit greater than zero before that zero. If not, delete that zero; if so, keep it.
Is there any function to achieve something like this? Or would this require something custom built? Thanks for any suggestions!
You can try the base R option below, using gsub
> gsub("\\b0+", "", s)
[1] "1/1001001" "7/50" "7/50" "7/5" "7/5"
with given
s <- c("01/1001001", "07/050", "0007/50", "7/5", "007/05")
Maybe a negative lookbehind will help
x <- c("7/50", "7/5", "007/05")
stringr::str_remove_all(x, "\\b(?<![1-9])0+")
# [1] "7/50" "7/5" "7/5"
Hard to say for sure with such a limited set of test cases.
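For comparison, here is a minimal Python sketch of the same idea (Python rather than R, purely as an illustration). The first function mirrors the gsub pattern above; the second function, with its name and lookahead, is my own assumption for the case where a number that is just "0" should survive, which the question does not cover.

```python
import re

def strip_leading_zeros(address: str) -> str:
    """Remove every run of zeros that starts a number, e.g. "007/05" -> "7/5"."""
    # \b matches between a non-word character (or the string start) and a
    # word character, so only zeros at the beginning of a number are removed.
    return re.sub(r"\b0+", "", address)

def strip_leading_zeros_keep_lone_zero(address: str) -> str:
    """Hypothetical variant: keep a number that consists only of zeros as "0"."""
    # The lookahead (?=\d) requires at least one digit to remain after the
    # removed zeros, so "00" shrinks to "0" instead of disappearing.
    return re.sub(r"\b0+(?=\d)", "", address)

samples = ["01/1001001", "07/050", "0007/50", "7/5", "007/05"]
cleaned = [strip_leading_zeros(s) for s in samples]
print(cleaned)
```

Note that the first version turns "7/0" into "7/", so which variant fits depends on whether a door number of 0 can occur in the data.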
I am working with a large data frame in R which includes a column containing the text content of a number of tweets. Each value starts with "RT #(account which is retweeted): ", for example "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...". I need to change each value in this column to only include the account name ("#RosannaXia"). How would I be able to do this? I understand that I may be able to do this with gsub and regular expressions (a lookbehind and a lookahead), but when I tried the following lookahead code it did not do anything (or show an error):
Unnested_rts$rt_user <- gsub("[a-z](?=:)", "", Unnested_rts$rt_user, perl=TRUE)
Is there a better way to do this? I am not sure what went wrong, but I am still a very inexperienced coder. Any help would be greatly appreciated!
You can extract everything from # till a colon (:).
x <- "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet..."
sub('RT (#.*?):.*', '\\1', x)
#[1] "#RosannaXia"
For your case, it would be:
Unnested_rts$rt_user <- sub('RT (#.*?):.*', '\\1', Unnested_rts$rt_user)
A few things:
According to Twitter, a handle can include alphanumeric characters ([A-Za-z0-9]) and underscores; this needs to be in your pattern.
Your pattern needs to capture the handle and preserve it, and discard everything else. Since we don't always know how to match everything else, we'll stick with matching what we know and use .* on either side.
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...", perl=TRUE)
# [1] "#RosannaXia"
Since you want this for the entire column, you can probably just do
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", Unnested_rts$rt_user, perl=TRUE)
The only catch is that if there is a failed match (pattern is not found), then the entire string is returned, which may not be what you want. If you want to extract what you found, then there are several techniques that use gregexpr and regmatches, or perhaps stringr::str_extract.
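To illustrate the difference between replacing and extracting, here is a small Python sketch (not R; the function name and the shortened sample tweet are my own). It searches for the handle directly and returns None when nothing is found, instead of handing back the whole string.

```python
import re
from typing import Optional

# A handle: "#" followed by letters, digits or underscores, directly before a colon.
HANDLE = re.compile(r"#[A-Za-z0-9_]+(?=:)")

def extract_handle(text: str) -> Optional[str]:
    """Return the retweeted handle, or None if the pattern is not found."""
    match = HANDLE.search(text)
    return match.group(0) if match else None

tweet = "RT #RosannaXia: some deep ocean wonder"  # shortened sample text
print(extract_handle(tweet))
print(extract_handle("a tweet without a retweet prefix"))
```

With extraction, a failed match is visible as a missing value, which is usually easier to filter than an unchanged tweet.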
I am currently working on a particular algorithm, but I am facing a problem that I'm not sure how to resolve. I would appreciate it if anyone could help me out.
There are some objects {O1, O2, O3, ...}, each with a value we don't know; we call the values {V1, V2, V3, ...}. There is also another sequence w (w1, w2, w3, ...) which gives the differences between successive values: w1 = v2 - v1, w2 = v3 - v2, w3 = v4 - v3, and so on. I'm wondering if there is any way to get the values v1, v2, v3, etc. without knowing the value of V1?
Looking forward to your reply,
Thanks.
Not in general. Knowing the differences between successive numbers in a list of numbers under-determines the set of numbers. This is particularly obvious in the case when w1 = w2 = w3 = ... = wk = 1. That would tell you that the vi are consecutive numbers, but nothing else could be inferred. You wouldn't be able to distinguish 3,4,5,6,7 from 10,11,12,13,14 (for example).
Having said that, it would of course be possible if you know one of the numbers, and the known number wouldn't need to be the first one. Knowing any single one of the numbers would suffice. Furthermore, knowing something like the sum of the vi would be sufficient since you could express the sum as a function of the unknown number v1 and solve the resulting equation.
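A minimal Python sketch of both cases, in case it helps. All function names here are made up for illustration. The idea is that every vi equals v1 plus a running sum of the differences, so one known value or the total is enough to pin down v1.

```python
from itertools import accumulate

def offsets_from_differences(w):
    """Return v_i - v_1 for every i, given w_i = v_(i+1) - v_i."""
    return [0] + list(accumulate(w))

def values_from_known(w, index, value):
    """Recover all values when the value at position `index` (0-based) is known."""
    offsets = offsets_from_differences(w)
    v1 = value - offsets[index]
    return [v1 + off for off in offsets]

def values_from_sum(w, total):
    """Recover all values when only the sum of the values is known."""
    offsets = offsets_from_differences(w)
    # total = n * v1 + sum(offsets)  =>  v1 = (total - sum(offsets)) / n
    v1 = (total - sum(offsets)) / len(offsets)
    return [v1 + off for off in offsets]

w = [1, 1, 1, 1]
print(values_from_known(w, 2, 5))   # the third value is known to be 5
print(values_from_sum(w, 60))       # the values are known to add up to 60
```

With w = [1, 1, 1, 1], the first call gives the 3..7 sequence and the second gives the 10..14 sequence from the example above, which shows that the differences alone cannot tell them apart.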
In Google Sheets I want to count the number of cells in a range (C4:U4) that are non-empty and non-blank. Counting non-empty is easy with COUNTIF. The tricky issue seems to be that I want to treat cells with one or more blank as empty. (My users keep leaving blanks in cells which are not visible and I waste a lot of time cleaning them up.)
=COUNTIF(C4:U4,"<>") treats a cell with one or more blanks as non-empty and counts it. I've also tried =COUNTA(C4:U4) but that suffers from the same problem of counting cells with one or more blanks.
I found an answer on Stack Overflow flagged as a solution by 95 people, but it doesn't work for cells with blanks.
After much reading I have come up with a fancy formula:
=COUNTIF(FILTER(C4:U4,TRIM(C4:U4)>="-"),"<>")
The idea is that TRIM removes leading and trailing blanks before FILTER tests whether the cell is greater than or equal to a hyphen (the lowest-ordered printable character I could find). The FILTER function then returns an array to the COUNTIF function which only contains non-empty and non-blank cells. COUNTIF then tests against "<>".
This works (or at least "seems" to work) but I was wondering if I've missed something really obvious. Surely the problem of hidden blanks is very common and has been around since the dawn of Excel and Google Sheets. There must be a simpler way.
(My first question so apologies for any breaches of forum rules.)
I don't know about Google. But for Excel you could use this array formula for multiple contiguous columns:
=ROWS(A1:B10) * COLUMNS(A1:B10)-(COUNT(IF(ISERROR(CODE(A1:B10)),1,""))+COUNT(IF(CODE(A1:B10)=32,1,"")))
Could try this but I'm not at all sure about it
=SUMPRODUCT(--(trim((substitute(A2:A5,char(160),"")))<>""))
It seems that in Google Sheets you have to use char(160) to match a space entered into a cell?
Seems this is due to a non-breaking space and could possibly apply to Excel also - as explained here - the suggestion is that you could also pass it through the CLEAN function to eliminate invisible characters with codes in range 0-31.
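To show why the non-breaking space matters, here is a rough Python sketch (not a spreadsheet formula; the function names and sample row are invented). It mimics the SUBSTITUTE-then-TRIM idea: drop character 160 first, then trim ordinary spaces, and only then compare with the empty string.

```python
NBSP = "\u00a0"  # character code 160, the non-breaking space

def is_blank(cell: str) -> bool:
    """True if the cell holds nothing but ordinary and non-breaking spaces."""
    # Remove non-breaking spaces first, then trim ordinary spaces only.
    return cell.replace(NBSP, "").strip(" ") == ""

def count_non_blank(cells) -> int:
    """Count the cells that still contain something after cleaning."""
    return sum(not is_blank(cell) for cell in cells)

row = ["a", "", "   ", NBSP, " b ", NBSP + " "]
print(count_non_blank(row))  # 2
```

Trimming ordinary spaces alone would leave the cells containing character 160 looking non-empty, which is the effect described above.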
I found another way to do it using:
=ARRAYFORMULA(SUM(IF(TRIM($C4:$U4)<>"",1,0)))
I'm still looking for a simpler way to do it if one is available.
This should work:
=countif(C4:U4,">""")
I found this solution here:
Is COUNTA counting blank (empty) cells in new Google spreadsheets?
Please let me know if it does.
=COLUMNS(C4:U4)-COUNTBLANK(C4:U4)
This will count how many cells are in your range (C4 to U4 = 19 cells), and subtract those that are truly "empty".
Blank spaces will not get counted by COUNTBLANK, despite its name, which should really be COUNTEMPTY.
I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset looking at crime incidents in SF, and I want to sort cases that have a resolution or do not. In the resolution field, cases have things listed like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions like the different arrests to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, even if there was a way to just use gsub, how do I say "not NONE" for my regex in R?
trainData$Resolution = gsub("!NONE", RESOLVED, trainData$Resolution) << what's the way to negate the string here?
Based on your explanation, it seems as though you don't need regular expressions (i.e. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Perhaps you want
within(trainData, {
    ## next line only necessary if you have a factor column
    Resolution <- as.character(Resolution)
    Resolution[Resolution != "NONE"] <- "RESOLVED"
})
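resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl=TRUE)
You need to use the option perl=TRUE for the lookahead, and drop fixed=TRUE, which makes grep treat the pattern as a literal string instead of a regular expression.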
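Both approaches can be sketched in Python for comparison (not R; the sample values and function names are made up). One relabels with a plain inequality test, the other uses the same negative lookahead to find the entries that are not exactly "NONE". Note that Python positions are 0-based, while R's grep returns 1-based positions.

```python
import re

# Matches any string that is not exactly "NONE".
NOT_NONE = re.compile(r"^(?!NONE$).*")

def relabel(resolutions):
    """Keep "NONE" as it is and relabel everything else as "RESOLVED"."""
    return ["NONE" if r == "NONE" else "RESOLVED" for r in resolutions]

def indices_not_none(resolutions):
    """Return the 0-based positions of entries that are not exactly "NONE"."""
    return [i for i, r in enumerate(resolutions) if NOT_NONE.match(r)]

sample = ["ARREST BOOKED", "NONE", "JUVENILE BOOKED", "NONE", "ARREST CITED"]
print(relabel(sample))
print(indices_not_none(sample))
```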
I made the following regex:
(\d{5}|\d-\d{4}|\d{2}-\d{3}|\d{3}-\d{2}|\d{4}-\d)
And it seems to work. That is, it will match a 5 digit number or a 5 digit number with only 1 hyphen in it, but the hyphen can not be the lead or the end.
I would like a similar regex, but for a 25 digit number. If I use the same tactic as above, the regex will be very long.
Can anyone suggest a simpler regex?
Additional Notes:
I'm putting this regex into an XML file which is to be consumed by an ASP.NET application. I don't have access to the .NET backend code. But I suspect they would do something like this:
Match match = Regex.Match("Something goes here", "my regex", RegexOptions.None);
You need to use a lookahead:
^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$
Explanation:
Either it's \d{25} from start to end, 25 digits.
Or: it is 26 characters of [\d\-] (digits or hyphen) AND it matched \d+-\d+ - meaning it has exactly one hyphen in the middle.
Working example with test cases
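A quick way to check the pattern against a few cases, sketched in Python rather than .NET (the helper name and the test strings are mine; the pattern is the one above, unchanged):

```python
import re

# 25 digits, or 26 characters of digits/hyphens with exactly one hyphen inside.
PATTERN = re.compile(r"^(?:\d{25}|(?=\d+-\d+$)[\d\-]{26})$")

def is_valid(code: str) -> bool:
    """True if `code` is 25 digits with at most one hyphen, not leading or trailing."""
    return PATTERN.match(code) is not None

digits = "1234567890" * 2 + "12345"  # exactly 25 digits

print(is_valid(digits))                            # plain 25 digits
print(is_valid(digits[:10] + "-" + digits[10:]))   # one hyphen in the middle
print(is_valid("-" + digits))                      # leading hyphen
print(is_valid(digits + "-"))                      # trailing hyphen
print(is_valid(digits + "1"))                      # 26 digits, no hyphen
```

The first two strings should be accepted and the last three rejected.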
You could use this regex:
^[0-9](?:(?=[0-9]*-[0-9]*$)[0-9-]{24}|[0-9]{23})[0-9]$
The lookahead makes sure there's only 1 dash and the character class makes sure there are 23 numbers between the first and the last. Might be made shorter though I think.
EDIT: A 'bit' shorter xP
^(?:[0-9]{25}|(?=[^-]+-[^-]+$)[0-9-]{26})$
A bit similar to Kobi's though, I admit.
If you aren't fussy about the length at all (i.e. you only want a string of digits with an optional hyphen) you could use:
([\d]+-[\d]+){1}|\d
(You may want to add line/word boundaries to this, depending on your circumstances)
If you need to have a specific length of match, this pattern doesn't really work. Kobi's answer is probably a better fit for you.
I think the fastest way is to do a simple match and then add up the lengths of the capture buffers. Why attempt math in a regex? It makes no sense.
^(\d+)-?(\d+)$
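A minimal sketch of that idea in Python (the asker's setup is .NET and only accepts a regex, so this is just to show the logic; the function name and default length are assumptions). The regex only checks the shape, and the length check happens in ordinary code.

```python
import re

# Shape only: digits, an optional single hyphen, then more digits.
SHAPE = re.compile(r"^(\d+)-?(\d+)$")

def has_n_digits(code: str, n: int = 25) -> bool:
    """True if `code` has the right shape and contains exactly n digits."""
    match = SHAPE.match(code)
    if match is None:
        return False
    # Add up the lengths of the two capture groups instead of counting in the regex.
    return len(match.group(1)) + len(match.group(2)) == n

digits = "1234567890" * 2 + "12345"  # exactly 25 digits
print(has_n_digits(digits))
print(has_n_digits(digits[:7] + "-" + digits[7:]))
print(has_n_digits(digits + "-"))
```

Because both groups need at least one digit, this shape never matches a single digit, which does not matter for a length of 25.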
This will match 25 digits and exactly one hyphen in the middle (the $ inside the lookahead keeps a plain 26-digit string from matching):
^(?=(-*\d){25}$)\d.{24}\d$