How to remove leading zeroes from a column in Google Data Studio? - google-analytics

I have a column of store IDs which all have leading zeroes, e.g. 0017 shows rather than 17, and 0876 rather than 876.
All store IDs are 4 digits long with these leading zeroes. Is there a way to remove the leading zeroes and so leave me with 17 and 876 (as above)?
I imagine this would involve a REGEXP statement, but I haven't yet been able to create one that works.

Create a calculated field using this formula: REGEXP_REPLACE(Store ID, r'^\D*0*', '').
Working example here.
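To see what that pattern actually strips, here is the same substitution applied outside Data Studio, in R (illustration of the regex only, not Data Studio code):

# Illustration only: R's sub() applying the same pattern.
# '^\D*0*' strips any leading non-digit characters, then any leading zeroes.
ids <- c("0017", "0876", "0005")
sub("^\\D*0*", "", ids, perl = TRUE)
#> [1] "17"  "876" "5"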

Related

Perform operation on all substrings of a string in SQL (MariaDB)

Disclaimer: This is not a database administration or design question. I did not design this database and I do not have rights to change it.
I have a database in which many fields are compound. For example, a single column is used for acre usage for a district. Many districts have one primary crop and the value is a single number, such as 14. Some have two primary crops and it has two numbers separated by a comma like "14,8". Some have three, four, or even five primary crops resulting in a compound value like "14,8,7,4,3".
I am pulling data out of this database for analytical research. Right now, I am pulling columns like that into R, splitting them into 5 values (padding nulls if there aren't 5 values), and performing work on the values. I want to do it in the database itself. I want to split the value on the comma, perform an operation on the resulting values, and then concatenate the result of the operation back into the original column format.
Example: I have a column that is in acres and I want it in square meters. So I want to take "14,8", temporarily turn it into 14 and 8, multiply each of those by 4046.86, and get "56656.04,32374.88" as my result. What I am currently doing is using regexp_replace. I start with all rows where "acres REGEXP '^[0-9.]+,[0-9.]+,[0-9.]+,[0-9.]+,[0-9.]+$'" in the where clause. That gives me the rows with 5 numbers in the field. Then I can get the first number with "cast(regexp_replace(acres, ',.*', '') as float) * 4046.86". I can get each of the 5 using a different regexp_replace and concatenate those values back together. Then I run a query for those with 4 numbers, then 3, then 2, and finally the single-number rows.
Is this possible as a single query?
Use a function to parse the string and convert it to the desired result. This will allow you to do the job in a single query.
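The function itself is left as an exercise; as a rough sketch of the transformation it needs to perform (shown in R rather than as a MariaDB stored function, since R is where the asker is currently doing this work):

# Sketch of the per-row transformation: split on commas, convert each
# acreage to square meters, and glue the pieces back together.
acres_to_sqm <- function(s, factor = 4046.86) {
  parts <- as.numeric(strsplit(s, ",", fixed = TRUE)[[1]])
  paste(parts * factor, collapse = ",")
}
acres_to_sqm("14,8")
#> [1] "56656.04,32374.88"
acres_to_sqm("14,8,7,4,3")
#> [1] "56656.04,32374.88,28328.02,16187.44,12140.58"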

R: Regex for identifying numbers within HTML chunk

This is my first post on Stack Overflow, so please be indulgent if it is lacking in quality.
I want to learn some web scraping with R and started with a simple example: extracting a table from a Wikipedia page.
I managed to download the specific page and identified the HTML sections I am interested in:
<td style="text-align:right">511.000.000\n</td>
Now I want to extract the number in the table data by using regex. So I created a regex which, from my point of view, should match the structure of the number:
pattern<-"\\d*\\.\\d*\\.\\d*\\.\\d*\\."
I also tried other variations, but none of them found the number within the HTML code. I wanted to keep the pattern open, as the numbers might be in the hundreds, thousands, millions, or billions.
My questions: since the number sits inside HTML code, might it be necessary to include something in the pattern for the surrounding non-number text (which should not be extracted)? And what would be the correct version of the pattern to identify the number correctly?
Thank you very much for your support!!
So many stars imply a lot of backtracking.
One further point: \\d* would match more than 3 digits in any group, and would also match a group with no digits at all.
Assuming your numbers are always integers, formatted using a . as thousand separator, you could use the following: \\d{1,3}(?:\\.\\d{3})* (note the use of the non-capturing group construct (?:...), which implies passing perl = TRUE in base R regex functions, as mentioned in Regular Expressions as used in R).
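A quick check of that pattern against the fragment from the question, using stringr (this assumes the HTML is already available as a character string):

library(stringr)
html <- '<td style="text-align:right">511.000.000\n</td>'
# stringr uses ICU regexes, so the non-capturing group works without perl = TRUE
str_extract(html, "\\d{1,3}(?:\\.\\d{3})*")
#> [1] "511.000.000"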
Look closely at your regex. You are assuming that the number will have 4 periods (\\.) in it, but in your own example there are only two periods. It's not going to match because while the asterisk marks \\d as optional (zero or more), the periods are not marked as optional. If you add a ? modifier after the 3rd and 4th period, you may find that your pattern starts matching.

SQLite sorting a column containing numbers or text

I have a column containing user-entered descriptions. These descriptions can be anything, but I do need them sorted into a logical order.
The text can be anything, like:
16 to 26 months
40 to 60 months
Literacy
Mathematics
When I order these in a SQL statement, the text items come back fine. However, any beginning with numbers come back in an order that is not logical,
i.e.
16 to 26 months
will be before
8 to 20 months
I understand why, as it compares the strings character by character, but I don't know how to alter the SQL statement (using SQLite) to improve the ordering without messing up the entries beginning with text.
When I cast to numeric, the numbers sort fine but the items beginning with text go wrong.
Thanks
What you need is sorting the values in "natural order". To achieve this you will need to implement your own collating sequence; SQLite doesn't provide one for this case.
There are some questions (and answers) regarding this topic here on SO, but they are for other RDBMS. The best I could find in a quick search was this:
http://wiki.ozanh.com/doku.php?id=python:database:sqlite:how_to_natural_sort
You should think about improving your table schema, e.g. splitting the period into separate integer columns (monthsMin, monthsMax) instead of using text, which would make sorting much easier. You can always build a string from these values if necessary.
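If sorting outside the database is an option, a rough natural order can also be produced by extracting the leading number; a sketch in R, purely to illustrate the idea:

# Extract the leading number; entries without one (pure text) get Inf so
# they sort after the numeric ranges, alphabetically among themselves.
x <- c("16 to 26 months", "40 to 60 months", "Literacy",
       "Mathematics", "8 to 20 months")
lead_num <- suppressWarnings(as.numeric(sub("^(\\d+).*", "\\1", x)))
lead_num[is.na(lead_num)] <- Inf
x[order(lead_num, x)]
#> [1] "8 to 20 months"  "16 to 26 months" "40 to 60 months" "Literacy"
#> [5] "Mathematics"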

COUNTIF of non-empty and non-blank cells

In Google Sheets I want to count the number of cells in a range (C4:U4) that are non-empty and non-blank. Counting non-empty is easy with COUNTIF. The tricky issue seems to be that I want to treat cells with one or more blank as empty. (My users keep leaving blanks in cells which are not visible and I waste a lot of time cleaning them up.)
=COUNTIF(C4:U4,"<>") treats a cell with one or more blanks as non-empty and counts it. I've also tried =COUNTA(C4:U4) but that suffers from the same problem of counting cells with one or more blanks.
I found a solution on Stack Overflow flagged as a solution by 95 people, but it doesn't work for cells with blanks.
After much reading I have come up with a fancy formula:
=COUNTIF(FILTER(C4:U4,TRIM(C4:U4)>="-"),"<>")
The idea is that TRIM removes leading and trailing blanks before FILTER tests whether the cell is greater than or equal to a hyphen (the lowest-order printable character I could find). FILTER then returns to COUNTIF an array containing only non-empty, non-blank cells, and COUNTIF tests it against "<>".
This works (or at least "seems" to work), but I was wondering if I've missed something really obvious. Surely the problem of hidden blanks is very common and has been around since the dawn of Excel and Google Sheets; there must be a simpler way.
(My first question so apologies for any breaches of forum rules.)
I don't know about Google Sheets, but for Excel you could use this array formula for multiple contiguous columns:
=ROWS(A1:B10) * COLUMNS(A1:B10)-(COUNT(IF(ISERROR(CODE(A1:B10)),1,""))+COUNT(IF(CODE(A1:B10)=32,1,"")))
You could try this, but I'm not at all sure about it:
=SUMPRODUCT(--(trim((substitute(A2:A5,char(160),"")))<>""))
It seems that in Google Sheets you have to use char(160) to match a space entered into a cell?
This appears to be due to a non-breaking space and could possibly apply to Excel as well, as explained here; the suggestion is that you could also pass the value through the CLEAN function to eliminate invisible characters with codes in the range 0-31.
I found another way to do it using:
=ARRAYFORMULA(SUM(IF(TRIM($C4:$U4)<>"",1,0)))
I'm still looking for a simpler way to do it if one is available.
This should work:
=countif(C4:U4,">""")
I found this solution here:
Is COUNTA counting blank (empty) cells in new Google spreadsheets?
Please let me know if it does.
=COLUMNS(C4:U4)-COUNTBLANK(C4:U4)
This will count how many cells are in your range (C4 to U4 = 19 cells), and subtract those that are truly "empty".
Blank spaces will not get counted by COUNTBLANK, despite its name, which should really be COUNTEMPTY.
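For what it's worth, the logic all of these formulas are reaching for can be sketched in R (illustration only; the sample cell values below are made up):

# Treat whitespace-only and non-breaking-space-only cells as empty,
# then count whatever is left.
cells <- c("apple", "  ", "", "\u00a0", "banana")  # hypothetical row
cleaned <- trimws(gsub("\u00a0", " ", cells))      # turn NBSP into a normal space, then trim
sum(cleaned != "")
#> [1] 2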

Extracting Data from a column based on symbols

I have a tricky question that I'm hoping someone can help me with. I have an output file that looks pretty standard in that there is one value per row, per column - except for one column (excerpt below) that contains multiple entries per row:
4:103806204-103940896,4:103806204-103940896,4:103822084-103940896,4:103806204-103940896
7:27135712-27139877,7:27135712-27139877
2:209030070-209054773
1:16091458-16113084,1:16090993-16101715,1:16085254-16113084
16:70333061-70367735,16:70323669-70367735,16:70333061-70367735,16:70333061-70367735,16:70328735-70367735,16:70328699-70367735,16:70333061-70367735
It would be easy enough to split this column on ',', but then I wouldn't be able to read it into, say, R very easily.
Instead, I'm hoping I can use a simple bit of code to select only the first two values, and then make one column into two, removing the rest. So the above would become the below:
4 103806204
7 27135712
2 209030070
1 16091458
16 70333061
I lose a little bit of info this way, but it makes the data more manageable. Does anyone have any suggestions?
We can use str_extract_all from library(stringr). We extract the numeric elements (\\d+) as a list, convert them from character to numeric, keep the first two elements of each with head, and rbind the list elements together.
library(stringr)
do.call(rbind, lapply(str_extract_all(df$col, '\\d+'),
                      function(x) head(as.numeric(x), 2)))
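Applied to the first few sample rows from the question (the column name col is assumed):

df <- data.frame(
  col = c("4:103806204-103940896,4:103806204-103940896",
          "7:27135712-27139877,7:27135712-27139877",
          "2:209030070-209054773"),
  stringsAsFactors = FALSE
)
do.call(rbind, lapply(str_extract_all(df$col, '\\d+'),
                      function(x) head(as.numeric(x), 2)))
#>      [,1]      [,2]
#> [1,]    4 103806204
#> [2,]    7  27135712
#> [3,]    2 209030070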
