BigQuery adding unnecessary decimals to defined STRING - R

I have defined a schema in BigQuery as such:
+------------------+----------+----------+
| name             | type     | mode     |
+------------------+----------+----------+
| warehouse        | INTEGER  | NULLABLE |
| transaction_date | DATETIME | NULLABLE |
| style            | STRING   | NULLABLE |
| piece            | STRING   | NULLABLE |
| fabric_1         | STRING   | NULLABLE |
| fabric_2         | STRING   | NULLABLE |
| serial           | STRING   | NULLABLE |
| customer_po      | STRING   | NULLABLE |
| order_number     | STRING   | NULLABLE |
+------------------+----------+----------+
The two fields I'm focusing on are serial and order_number, which when previewed in R, look like this:
+-----------+------------------+--------+-------+-----------+----------+------------+--------------+--------------+
| warehouse | transaction_date | style  | piece | fabric_1  | fabric_2 | serial     | customer_po  | order_number |
+-----------+------------------+--------+-------+-----------+----------+------------+--------------+--------------+
| 80        | 4/3/19           | K28300 | ARMH  | ALL CHAR  | NA       | 8040418253 | 1486838165   | 464374       |
| 80        | 4/3/19           | K28300 | ARMH  | ALL CHAR  | NA       | 9040542252 | 1485798731-P | 464069       |
| 80        | 4/3/19           | K28300 | ARMH  | ELEG NAVY | NA       | 8040355550 | 1486826068   | 464369       |
| 80        | 4/3/19           | K28300 | ARMH  | ELEG NAVY | NA       | 8040532364 | 1485366411-R | 464071       |
+-----------+------------------+--------+-------+-----------+----------+------------+--------------+--------------+
Within R, those two fields appear to be read as characters in the data frame I'm uploading, which is what I'm looking for. Yet when I push the data to BigQuery, those two fields end up like this:
+-----------+------------------+--------+-------+-----------+----------+--------------+--------------+--------------+
| warehouse | transaction_date | style  | piece | fabric_1  | fabric_2 | serial       | customer_po  | order_number |
+-----------+------------------+--------+-------+-----------+----------+--------------+--------------+--------------+
| 80        | 4/3/19           | K28300 | ARMH  | ALL CHAR  | NA       | 8040418253.0 | 1486838165   | 464374.0     |
| 80        | 4/3/19           | K28300 | ARMH  | ALL CHAR  | NA       | 9040542252.0 | 1485798731-P | 464069.0     |
| 80        | 4/3/19           | K28300 | ARMH  | ELEG NAVY | NA       | 8040355550.0 | 1486826068   | 464369.0     |
| 80        | 4/3/19           | K28300 | ARMH  | ELEG NAVY | NA       | 8040532364.0 | 1485366411-R | 464071.0     |
+-----------+------------------+--------+-------+-----------+----------+--------------+--------------+--------------+
Why is this happening, and how can I change it? For reference, my code to upload it:
bqr_upload_data(projectId = "project-test",
                datasetId = "orders",
                tableId = "daily_orders",
                upload_data = df_daily_orders,
                maxBadRecords = 1000,
                overwrite = TRUE)

The upload from R looks at the class of each data frame column to decide on the best schema for BigQuery. It looks like those two columns are being treated as numeric and therefore uploaded as FLOAT. Try changing the class of the data frame columns to character before uploading, via something like
df$column <- as.character(df$column)
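For the two columns in question, a minimal sketch (data frame and column names taken from the question):
# Coerce the affected columns to character so the upload maps them to STRING
df_daily_orders$serial <- as.character(df_daily_orders$serial)
df_daily_orders$order_number <- as.character(df_daily_orders$order_number)
# Both should now show as chr before calling bqr_upload_data()
str(df_daily_orders[c("serial", "order_number")])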

I'm not completely sure about my answer, as I am still a beginner, but it may help you. I would add this as a comment, but I don't have enough reputation yet.
If I understood correctly, an implicit cast is happening from a numeric value to a string value, and BigQuery keeps the decimal point to be sure that it captures the whole value.
Check BigQuery's conversion rules (second table, FLOAT64 to STRING).
In your place, and depending on what you need to do with the table, I would:
Recreate the table, but change the schema for the serial and order_number columns to an integer type,
or
Try to update the already-created table with an UPDATE query, removing the '.0' at the end of every string value.
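Alternatively, if re-uploading is acceptable, the trailing '.0' can be stripped on the R side first; a minimal sketch, assuming the values follow the pattern shown above:
# Remove the trailing ".0" left over from the numeric-to-string conversion
df_daily_orders$serial <- sub("\\.0$", "", as.character(df_daily_orders$serial))
df_daily_orders$order_number <- sub("\\.0$", "", as.character(df_daily_orders$order_number))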


How to create a variable based on character and number iteration in R?

I'm trying to create a dummy variable based on a character-type variable.
For example, I need to create a "newcat" variable covering the codes "I00" through "I99".
In the code I wrote, I spell out all the values from I00 to I99.
Is there any way to make this code more efficient, e.g., with a loop that iterates the number after the letter?
Thank you in advance!!
mort <- mort %>%
mutate(newcat = ifelse(ucod=="I00" |
ucod=="I01" | ucod=="I02" | ucod=="I03" | ucod=="I04" | ucod=="I05" |
ucod=="I06" | ucod=="I07" | ucod=="I08" | ucod=="I09" | ucod=="I10" |
ucod=="I11" | ucod=="I12" | ucod=="I13" | ucod=="I14" | ucod=="I15" |
ucod=="I16" | ucod=="I17" | ucod=="I18" | ucod=="I19" | ucod=="I20" |
ucod=="I21" | ucod=="I22" | ucod=="I23" | ucod=="I24" | ucod=="I25" |
ucod=="I26" | ucod=="I27" | ucod=="I28" | ucod=="I29" | ucod=="I30" |
ucod=="I31" | ucod=="I32" | ucod=="I33" | ucod=="I34" | ucod=="I35" |
ucod=="I36" | ucod=="I37" | ucod=="I38" | ucod=="I39" | ucod=="I40" |
ucod=="I41" | ucod=="I42" | ucod=="I43" | ucod=="I44" | ucod=="I45" |
ucod=="I46" | ucod=="I47" | ucod=="I48" | ucod=="I49" | ucod=="I50" |
ucod=="I51" | ucod=="I52" | ucod=="I53" | ucod=="I54" | ucod=="I55" |
ucod=="I56" | ucod=="I57" | ucod=="I58" | ucod=="I59" | ucod=="I60" |
ucod=="I61" | ucod=="I62" | ucod=="I63" | ucod=="I64" | ucod=="I65" |
ucod=="I66" | ucod=="I67" | ucod=="I68" | ucod=="I69" | ucod=="I70" |
ucod=="I71" | ucod=="I72" | ucod=="I73" | ucod=="I74" | ucod=="I75" |
ucod=="I76" | ucod=="I77" | ucod=="I78" | ucod=="I79" | ucod=="I80" |
ucod=="I81" | ucod=="I82" | ucod=="I83" | ucod=="I84" | ucod=="I85" |
ucod=="I86" | ucod=="I87" | ucod=="I88" | ucod=="I89" | ucod=="I90" |
ucod=="I91" | ucod=="I92" | ucod=="I93" | ucod=="I94" | ucod=="I95" |
ucod=="I96" | ucod=="I97" | ucod=="I98" | ucod=="I99", 1, 0))
Try %in% instead of chained == comparisons joined with |:
x <- c(paste0("I0", 0:9), paste0("I", 10:99))
mort %>%
  mutate(newcat = ifelse(ucod %in% x, 1, 0))
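As a side note, the same vector can be built in one step with zero-padded formatting:
# sprintf pads each number to two digits, giving "I00" through "I99"
x <- sprintf("I%02d", 0:99)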
Another option is to use a regex, with str_detect from stringr:
library(stringr)
mort <- mort %>%
  mutate(newcat = +str_detect(ucod, '^I[0-9]{2}$'))
where ^ is a metacharacter indicating the beginning of the string. Then I[0-9]{2} matches the letter I followed by any two digits 0-9. Finally, $ is another metacharacter indicating the end of the string. So a matching string must start with I, be followed by exactly two digits, and then end. Any string that does not match the pattern is flagged as FALSE, and the unary + coerces the logical TRUE/FALSE result to 1/0.
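For instance, a quick check on a toy vector (hypothetical values, not from the question's data):
library(stringr)
ucod <- c("I00", "I42", "I99", "J20", "I100")
+str_detect(ucod, '^I[0-9]{2}$')
# [1] 1 1 1 0 0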

MariaDB DATETIME index not working with BETWEEN FROM_UNIXTIME()

I have a table with a DATETIME field, which is indexed with a B-tree. Now I want to query it with the following statement:
SELECT
    count(us.CITY) as metric,
    us.CITY as Name,
    us.LATITUDE as latitude,
    us.LONGITUDE as longitude
FROM
    FACT
LEFT JOIN
    USER us
ON
    us.ID_USER = FACT.USER
WHERE
    ASSESSMENT_DATE BETWEEN FROM_UNIXTIME(1601568552) AND FROM_UNIXTIME(1604028277)
GROUP BY us.CITY, us.LATITUDE, us.LONGITUDE;
EXPLAIN:
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
| id   | select_type | table | type   | possible_keys              | key     | key_len | ref                          | rows   | Extra                                        |
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
|    1 | SIMPLE      | FACT  | ALL    | INDEX_FACT_ASSESSMENT_DATE | NULL    | NULL    | NULL                         | 762621 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | us    | eq_ref | PRIMARY                    | PRIMARY | 46      | dwh0.FACT.USER,dwh0.FACT.ENV |      1 |                                              |
+------+-------------+-------+--------+----------------------------+---------+---------+------------------------------+--------+----------------------------------------------+
2 rows in set (0.001 sec)
Interestingly, just by changing the dates manually into DATETIME-format strings, it uses the index. But in my opinion the FROM_UNIXTIME() function should return exactly the same thing...
SELECT
    count(us.CITY) as metric,
    us.CITY as Name,
    us.LATITUDE as latitude,
    us.LONGITUDE as longitude
FROM
    FACT
LEFT JOIN
    USER us
ON
    us.ENV = FACT.ENV AND us.ID_USER = FACT.USER
WHERE
    -- ASSESSMENT_DATE BETWEEN FROM_UNIXTIME(1596649101) AND FROM_UNIXTIME(1599108827)
    ASSESSMENT_DATE BETWEEN '2020-08-05 11:30:11.987' AND '2020-09-03 11:30:11.987'
GROUP BY us.CITY, us.LATITUDE, us.LONGITUDE;
EXPLAIN:
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
| id   | select_type | table | type   | possible_keys              | key                        | key_len | ref                          | rows   | Extra                                                  |
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
|    1 | SIMPLE      | FACT  | range  | INDEX_FACT_ASSESSMENT_DATE | INDEX_FACT_ASSESSMENT_DATE | 5       | NULL                         | 132008 | Using index condition; Using temporary; Using filesort |
|    1 | SIMPLE      | us    | eq_ref | PRIMARY                    | PRIMARY                    | 46      | dwh0.FACT.USER,dwh0.FACT.ENV |      1 |                                                        |
+------+-------------+-------+--------+----------------------------+----------------------------+---------+------------------------------+--------+--------------------------------------------------------+
2 rows in set (0.001 sec)
Has anyone come across such a problem? The WHERE clause is generated by Grafana, so I cannot change that, but I can change the rest if it makes a difference.
Thanks for any suggestions!
Sorry for bothering... after around 10^5 more inserts, it works in both cases. Maybe it was just bad luck: with fewer rows, the optimizer may simply have estimated that a full scan was cheaper than the index range scan.

R: Regex to match more than one pipe occurrence

I have a dataset in which I paste values in a dplyr chain and collapse with the pipe character (e.g. " | "). If any of the values in the dataset are blank, I just get recurring pipe characters in the pasted list.
Some of the values look like this, for example:
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
I want to match all the pipes that occur more than once and delete them, so that just the names appear like so:
correctstring = "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
I tried the following, but to no avail:
mutate(names = gsub('[\\|]{2,}', '', name_list))
The difficulty in this question is in formulating a regex which can selectively remove every pipe except the ones we want to remain as actual separators between terms. Note that the original attempt fails because the pipes are separated by spaces, so [\\|]{2,} never sees two adjacent pipe characters. We can match on the following pattern:
\|\s+(?=\|)
and then replace with an empty string. This pattern removes any pipe (and any whitespace following it) so long as what follows is another pipe. No removal occurs when a pipe is followed by an actual term, or by the end of the string.
badstring = "| | | | | | GHOULSBY,SCROGGINS | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CAT,JOHNSON | | | | | | | | | | | | BURGLAR,PALA | | | | | | | | |"
result <- gsub("\\|\\s+(?=\\|)", "", badstring, perl=TRUE)
result
[1] "| GHOULSBY,SCROGGINS | CAT,JOHNSON | BURGLAR,PALA |"
Edit:
If you expect to have inputs like | | | which are devoid of any terms, and you would expect empty string as the output, then my solution would fail. I don't see an obvious way to modify the above regex, but you can handle this case with one more call to sub:
result <- sub("^\\|$", "", result)
We might also be tempted to fold both steps into a single pattern using an alternation, such as
result <- gsub("\\|\\s+(?=\\|)|^\\|$", "", badstring, perl=TRUE)
but gsub makes a single left-to-right pass, so for an input like "| | |" the lone pipe left over after the first alternative's removals is never rescanned; the separate call to sub above is the more reliable fix.

R - Multiple search and replace based on partial match within a column of a dataframe

I have a list of publishers that looks like this:
+--------------+
| Site Name    |
+--------------+
| Radium One   |
| Euronews     |
| EUROSPORT    |
| WIRED        |
| RadiumOne    |
| Eurosport FR |
| Wired US     |
| Eurosport    |
| EuroNews     |
| Wired        |
+--------------+
I'd like to create the following result:
+--------------+----------------+
| Site Name    | Publisher Name |
+--------------+----------------+
| Radium One   | RadiumOne      |
| Euronews     | Euronews       |
| EUROSPORT    | Eurosport      |
| WIRED        | Wired          |
| RadiumOne    | RadiumOne      |
| Eurosport FR | Eurosport      |
| Wired US     | Wired          |
| Eurosport    | Eurosport      |
| EuroNews     | Euronews       |
| Wired        | Wired          |
+--------------+----------------+
I would like to understand how I can replicate this code I use in Power Query:
// search the first 4 characters
if Text.Start([Site Name],4) = "WIRE" then "Wired" else
// search the last 3 characters
if Text.End([Site Name],3) = "One" then "RadiumOne" else
// if no match is found, then use "Rest"
"Rest"
It does not have to be case-sensitive.
Using properCase from the ifultools package and gsub, we replace everything after the first word with "" (i.e., delete it) and treat the exceptional case of Radium separately. If you have many exceptions like the Radium case, please update your post with them so that we can find a neater solution to this hack :)
library("ifultools")
siteName=c("Radium One","Euronews","EUROSPORT","WIRED","RadiumOne","Eurosport FR","Wired US","Eurosport","EuroNews","Wired")
publisherName = gsub("^Radium$","Radiumone",gsub("\\s+.*","",properCase(siteName)))
# [1] "Radiumone" "Euronews" "Eurosport" "Wired" "Radiumone" "Eurosport" "Wired"
# [8] "Eurosport" "Euronews" "Wired"

Unable to forecast linear model in R

I'm able to do forecasts with an ARIMA model, but when I try to do a forecast for a linear model, I do not get any actual forecasts - it stops at the end of the data set (which isn't useful for forecasting since I already know what's in the data set). I've found countless examples online where using this same code works just fine, but I haven't found anyone else having this same error.
library("stats")
library("forecast")
y <- data$Mfg.Shipments.Total..USA.
model_a1 <- auto.arima(y)
forecast_a1 <- forecast.Arima(model_a1, h = 12)
The above code works perfectly. However, when I try to do a linear model....
model1 <- lm(y ~ Mfg.NO.Total..USA. + Mfg.Inv.Total..USA., data = data )
f1 <- forecast.lm(model1, h = 12)
I get an error message saying that I MUST provide a new data set (which seems odd to me, since the documentation for the forecast package says that it is an optional argument).
f1 <- forecast.lm(model1, newdata = x, h = 12)
If I do this, I am able to get the function to work, but the forecast only predicts values for the existing data - it doesn't predict the next 12 periods. I have also tried using the append function to add additional rows to see if that would fix the issue, but when trying to forecast a linear model, it immediately stops at the most recent point in the time series.
Here's the data that I'm using:
+------------+---------------------------+--------------------+---------------------+
| | Mfg.Shipments.Total..USA. | Mfg.NO.Total..USA. | Mfg.Inv.Total..USA. |
+------------+---------------------------+--------------------+---------------------+
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-01-01 | 3.59746e+11 | 3.58464e+11 | 5.01361e+11 |
| 2110-02-01 | 3.62268e+11 | 3.63441e+11 | 5.10439e+11 |
| 2110-03-01 | 4.23748e+11 | 4.24527e+11 | 5.10792e+11 |
| 2110-04-01 | 4.08755e+11 | 4.02769e+11 | 5.16853e+11 |
| 2110-05-01 | 4.08187e+11 | 4.02869e+11 | 5.18180e+11 |
| 2110-06-01 | 4.27567e+11 | 4.21713e+11 | 5.15675e+11 |
| 2110-07-01 | 3.97590e+11 | 3.89916e+11 | 5.24785e+11 |
| 2110-08-01 | 4.24732e+11 | 4.16304e+11 | 5.27734e+11 |
| 2110-09-01 | 4.30974e+11 | 4.35043e+11 | 5.28797e+11 |
| 2110-10-01 | 4.24008e+11 | 4.17076e+11 | 5.38917e+11 |
| 2110-11-01 | 4.11930e+11 | 4.09440e+11 | 5.42618e+11 |
| 2110-12-01 | 4.25940e+11 | 4.34201e+11 | 5.35384e+11 |
| 2111-01-01 | 4.01629e+11 | 4.07748e+11 | 5.55057e+11 |
| 2111-02-01 | 4.06385e+11 | 4.06151e+11 | 5.66058e+11 |
| 2111-03-01 | 4.83827e+11 | 4.89904e+11 | 5.70990e+11 |
| 2111-04-01 | 4.54640e+11 | 4.46702e+11 | 5.84808e+11 |
| 2111-05-01 | 4.65124e+11 | 4.63155e+11 | 5.92456e+11 |
| 2111-06-01 | 4.83809e+11 | 4.75150e+11 | 5.86645e+11 |
| 2111-07-01 | 4.44437e+11 | 4.40452e+11 | 5.97201e+11 |
| 2111-08-01 | 4.83537e+11 | 4.79958e+11 | 5.99461e+11 |
| 2111-09-01 | 4.77130e+11 | 4.75580e+11 | 5.93065e+11 |
| 2111-10-01 | 4.69276e+11 | 4.59579e+11 | 6.03481e+11 |
| 2111-11-01 | 4.53706e+11 | 4.55029e+11 | 6.02577e+11 |
| 2111-12-01 | 4.57872e+11 | 4.81454e+11 | 5.86886e+11 |
| 2112-01-01 | 4.35834e+11 | 4.45037e+11 | 6.04042e+11 |
| 2112-02-01 | 4.55996e+11 | 4.70820e+11 | 6.12071e+11 |
| 2112-03-01 | 5.04869e+11 | 5.08818e+11 | 6.11717e+11 |
| 2112-04-01 | 4.76213e+11 | 4.70666e+11 | 6.16375e+11 |
| 2112-05-01 | 4.95789e+11 | 4.87730e+11 | 6.17639e+11 |
| 2112-06-01 | 4.91218e+11 | 4.87857e+11 | 6.09361e+11 |
| 2112-07-01 | 4.58087e+11 | 4.61037e+11 | 6.19166e+11 |
| 2112-08-01 | 4.97438e+11 | 4.74539e+11 | 6.22773e+11 |
| 2112-09-01 | 4.86994e+11 | 4.85560e+11 | 6.23067e+11 |
| 2112-10-01 | 4.96744e+11 | 4.92562e+11 | 6.26796e+11 |
| 2112-11-01 | 4.70810e+11 | 4.64944e+11 | 6.23999e+11 |
| 2112-12-01 | 4.66721e+11 | 4.88615e+11 | 6.08900e+11 |
| 2113-01-01 | 4.51585e+11 | 4.50763e+11 | 6.25881e+11 |
| 2113-02-01 | 4.56329e+11 | 4.69574e+11 | 6.33157e+11 |
| 2113-03-01 | 5.04023e+11 | 4.92978e+11 | 6.31055e+11 |
| 2113-04-01 | 4.84798e+11 | 4.76750e+11 | 6.35643e+11 |
| 2113-05-01 | 5.04478e+11 | 5.04488e+11 | 6.34376e+11 |
| 2113-06-01 | 4.99043e+11 | 5.13760e+11 | 6.25715e+11 |
| 2113-07-01 | 4.75700e+11 | 4.69012e+11 | 6.34892e+11 |
| 2113-08-01 | 5.05244e+11 | 4.90404e+11 | 6.37735e+11 |
| 2113-09-01 | 5.00087e+11 | 5.04849e+11 | 6.34665e+11 |
| 2113-10-01 | 5.05965e+11 | 4.99682e+11 | 6.38945e+11 |
| 2113-11-01 | 4.78876e+11 | 4.80784e+11 | 6.34442e+11 |
| 2113-12-01 | 4.80640e+11 | 4.98807e+11 | 6.19458e+11 |
| 2114-01-01 | 4.56779e+11 | 4.57684e+11 | 6.36568e+11 |
| 2114-02-01 | 4.62195e+11 | 4.70312e+11 | 6.48982e+11 |
| 2114-03-01 | 5.19472e+11 | 5.25900e+11 | 6.47038e+11 |
| 2114-04-01 | 5.04217e+11 | 5.06090e+11 | 6.52612e+11 |
| 2114-05-01 | 5.14186e+11 | 5.11149e+11 | 6.58990e+11 |
| 2114-06-01 | 5.25249e+11 | 5.33247e+11 | 6.49512e+11 |
| 2114-07-01 | 4.99198e+11 | 5.52506e+11 | 6.57645e+11 |
| 2114-08-01 | 5.17184e+11 | 5.07622e+11 | 6.59281e+11 |
| 2114-09-01 | 5.23682e+11 | 5.24051e+11 | 6.55582e+11 |
| 2114-10-01 | 5.17305e+11 | 5.09549e+11 | 6.59237e+11 |
| 2114-11-01 | 4.71921e+11 | 4.70093e+11 | 6.57044e+11 |
| 2114-12-01 | 4.84948e+11 | 4.86804e+11 | 6.34120e+11 |
+------------+---------------------------+--------------------+---------------------+
Edit - Here's the code I used for adding new datapoints for forecasting.
library(xts)
library(mondate)
d <- as.mondate("2115-01-01")
d11 <- d + 11
newdates <- seq(d, d11)
new_xts <- xts(order.by = as.Date(newdates))
new_xts$Mfg.Shipments.Total..USA. <- NA
new_xts$Mfg.NO.Total..USA. <- NA
new_xts$Mfg.Inv.Total..USA. <- NA
x <- append(data, new_xts)
Not sure if you ever figured this out, but just in case I thought I'd point out what's going wrong.
The documentation for forecast.lm says:
An optional data frame in which to look for variables with which to predict. If omitted, it is assumed that the only variables are trend and season, and h forecasts are produced.
so it's optional if trend and season are your only predictors.
The ARIMA model works because it's using lagged values of the time series in the forecast. For the linear model, it uses the given predictors (Mfg.NO.Total..USA. and Mfg.Inv.Total..USA. in your case) and thus needs their corresponding future values; without these, there are no independent variables to predict from.
In the edit, you added those variables to your future dataset, but they still have values of NA for all future points, thus the forecasts are also NA.
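To make that concrete, here is a minimal sketch of supplying actual future predictor values (the numbers are placeholders; in practice they might come from separate univariate forecasts of each predictor):
# Hypothetical future values for the two predictors over a 12-month horizon
future <- data.frame(
  Mfg.NO.Total..USA. = rep(4.9e+11, 12),
  Mfg.Inv.Total..USA. = rep(6.4e+11, 12)
)
f1 <- forecast(model1, newdata = future)  # now yields 12 real forecasts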
Gabe is correct. You need future values of your causals.
You should consider the transfer function modeling process instead of regression (which was developed for use with cross-sectional data). By prewhitening your X variables (i.e., building a model for each one), you can calculate the cross-correlation function to see any lead or lag relationship.
It is very apparent from the standardized graph of Y and the two X's that Inv.Total is a lead variable (b**-1). When Inv.Total moves down, so do shipments. In addition, there is an AR seasonal component beyond the causals that is driving the data. There are a few outliers as well, so this calls for a robust solution. I am the developer of the software used here, but this can be run in any tool.
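For reference, the prewhitening step described above can be sketched with the forecast package (illustrative only; column names taken from the question):
library(forecast)
x <- data$Mfg.Inv.Total..USA.
y <- data$Mfg.Shipments.Total..USA.
fit_x <- auto.arima(x)                        # model the input series
x_white <- residuals(fit_x)                   # prewhitened input
y_filt <- residuals(Arima(y, model = fit_x))  # filter y with the same model
ccf(x_white, y_filt)                          # inspect lead/lag structure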
