Run R Script on Power BI Column - r

I have a dataset in Power BI with many columns that contain information on incident tickets (e.g. how long it took to solve the issue).
Unfortunately, the data I'm getting is not in the correct time format. I wrote a simple R function that recalculates the time and returns the correct value:
calculateHours <- function(hours) {
  x <- trunc(hours / 24)
  rest <- hours %% 24 # %% is the base R modulo operator; mod() is not a base R function
  y <- trunc(rest / 10)
  z <- rest %% 10
  result <- (((x + y) * 10) + z)
  return(result)
}
Example: 204 hours would turn into 92 hours if you run this through the Function.
Now I need to have a new column with the calculated values in it.
E.g. 'Business Elapsed Time = 204' -> 'Business Elapsed Time calculated (new Column) = 92'
How can I use this function in Power BI to add a new column that takes the values from another column of this table and calculates the correct time values?
I'm still new to Power BI and R, so any help would be appreciated! Thanks in advance!

In Power BI Query Editor you can add an R Script (Transform -> Run R script) to your query. Here's a simple example that assumes you have a column Number:
# 'dataset' holds the input data for this script
myfunction <- function(x) {
  return(x + 1)
}
dataset$NewNumber <- myfunction(dataset$Number) # apply the function and add the result as a new column
output <- dataset # Power BI uses "output" as the result of this query step
Here's a more detailed intro: https://www.red-gate.com/simple-talk/sql/bi/power-bi-introduction-working-with-r-scripts-in-power-bi-desktop-part-3/
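Applied to the question's own function, the same pattern might look like the sketch below. It assumes the source column is literally named "Business Elapsed Time" (taken from the question's example); the small data frame at the top is only a stand-in so the snippet runs outside Power BI and should be dropped inside the Run R script step, where Power BI supplies dataset itself.

```r
# Stand-in for the 'dataset' data frame that Power BI passes to the script.
# Assumption: the real column is named "Business Elapsed Time".
dataset <- data.frame(`Business Elapsed Time` = c(204, 24, 7), check.names = FALSE)

# Vectorized version of the question's function, using base R integer
# division (%/%) and modulo (%%).
calculateHours <- function(hours) {
  x <- hours %/% 24
  rest <- hours %% 24
  y <- rest %/% 10
  z <- rest %% 10
  (x + y) * 10 + z
}

# [[ ]] is used because the column names contain spaces.
dataset[["Business Elapsed Time calculated"]] <-
  calculateHours(dataset[["Business Elapsed Time"]])

output <- dataset # Power BI reads the result from 'output'
```

Because the arithmetic operators work element-wise, the function handles the whole column in one call and no loop or sapply is needed.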

Power Query can handle most of this calculation itself with an M formula. That would be much simpler than invoking an R script, better integrated, and probably faster.
In Power Query Editor, go to Add Column > Custom Column, then enter an M formula like the one below (here the source column is assumed to be named Hours).
let
x = Number.IntegerDivide([Hours], 24),
rest = Number.Mod([Hours], 24),
y = Number.IntegerDivide(rest, 10),
z = Number.Mod(rest, 10),
result = (x + y) * 10 + z
in
result

Related

How to build a function with summation

I need to find a way to create a function that can work like the function counts described in the picture.
This is what I tried so far working with a code from an answer I found here, but I can't work out how the elements that I see in that function can be translated to my case.
[Image: function I need to replicate]
[Image: how my database looks]
Also, the function can actually be simplified with respect to the one above because the database already shows the total number of cites per year... Am I wrong? This is what I have so far:
j <- patents_grant$Company
t<- patents_grant$Year
x <- patents_grant$count
fun_counts <- function(j,t) {
for (i in j)
sum(x[1:M, j], na.rm = T)
}
counts_try <- sapply(1:j ,fun_counts, M=3)
I'm pretty sure this one must be easy to build and I just don't have the knowledge. So even if you just have suggestions on good places to look at to learn how to build functions, that would be immensely appreciated.
What you actually want to do is calculate the 5 (or 3)-year moving total of the number of patents lagged by one year and transformed with log(x + 1).
I have created example data in patent_grants, and this is what one can do:
patent_grants <- expand.grid(Company = LETTERS[1:6],
Year = 1990:2010)
patent_grants$count <- rpois(nrow(patent_grants), 4)
M <- 5
We sort the data by Company and Year, which makes the rest a lot easier, and create the new column for the transformed moving total:
patent_grants <- patent_grants[with(patent_grants, order(Company, Year)),]
patent_grants$count_avg <- NA
Now split the data by Company, since the value is computed per company (this is done once for the vector holding the moving total and once for the patent counts). For each company_data (i.e. that company's patent counts, sorted by year), we calculate the moving total [the iteration over companies is done with lapply].
For each year, we select the data for the following M years with company_data[(t + 1):(t + M)], remove NA values with na.omit, take the sum, and calculate log(x + 1), which is equivalent to log1p(x) [the "for each year" part is done with sapply].
split(patent_grants$count_avg, patent_grants$Company) <-
  lapply(split(patent_grants$count, patent_grants$Company), function(company_data) {
    sapply(1:length(company_data), function(t) {
      log1p(sum(na.omit(company_data[(t + 1):(t + M)])))
    })
  })
Created on 2022-06-14 by the reprex package (v2.0.1)
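To see what the inner sapply does, here is a small sketch for a single company. The counts and M = 2 are made up for illustration. Indexing past the end of the vector returns NA, which is why the windows near the end shrink once na.omit is applied.

```r
M <- 2
company_data <- c(1, 2, 3, 4) # made-up yearly counts for one company

moving_total <- sapply(seq_along(company_data), function(t) {
  window <- company_data[(t + 1):(t + M)] # positions past the end come back as NA
  log1p(sum(na.omit(window)))             # log(1 + sum of the available values)
})

# Raw window sums before the log transform: 5 (2 + 3), 7 (3 + 4), 4 (4 only), 0 (empty window)
```

The last year always gets log1p(0) = 0 because no following years exist in the data.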

User defined function with ticker as input?

Here is my code right now:
f=function(Symbol, start, end, interval){
getSymbols(Symbols=Symbol, from=start, to= end)
Symbol=data.frame(Symbol)
a=length(Symbol$Symbol.Adjusted)
b=a/interval
c=ceiling(b)
origData=as.data.frame(matrix(`length<-`(Symbol$Symbol.Adjusted, c * interval), ncol = interval, byrow = TRUE))
return(origData)
}
f("SPY", "2012-01-01", "2013-12-31", 10)
Next I need to get the adjusted close price and use only this price data for the following tasks: split the daily adjusted close prices into N blocks, stored as rows in a data frame, so that each block contains M days (columns) of data, where M equals the time interval value. This is what origData refers to in my code.
The function is supposed to return the data frame origData, but whenever I try running it, it tells me that the Symbol data frame is empty. How do I need to change my function to get the data frame output?
@IRTFM's observations are correct. Incorporating those changes, you can change your function to:
library(quantmod)
f = function(Symbol, start, end, interval){
getSymbols(Symbols=Symbol, from=start, to= end)
data= get(Symbol)
col = data[, paste0(Symbol, '.Adjusted')]
a=length(col)
b=a/interval
c=ceiling(b)
origData= as.data.frame(matrix(`length<-`(col, c * interval),
ncol = interval, byrow = TRUE))
return(origData)
}
f("SPY", "2012-01-01", "2013-12-31", 10)
I haven't figured out what the set of expressions inside the data.matrix call is supposed to do, and you made no effort to explain your intent. However, your error occurs farther up the line. If you put in a debugging call to str(Symbol), you will see that Symbol evaluates to "SPY", but that is just a character value and not an R object name. The object you want is named SPY, and the way to retrieve an object's value when you only have access to its name as a character value is the R function get. So try adding this after the getSymbols call inside the function:
library(quantmod) # I'm assuming this was the package in use
...
Symbol=data.frame( get(Symbol) )
str(Symbol) # will print the result at your console
....
# then perhaps you can work on what you were trying inside the data.matrix call
You will also find that the name Symbol.Adjusted will not work (since R is not a macro language). You will need to do something like:
a=length( Symbol[[ paste0(Symbol, ".Adjusted")]] )
Oh wait. You overwrote the value for Symbol. That won't work. You need to use a different name for your dataframe. So why don't you edit your question to fix the errors I've identified so far and also describe what you are trying to do when you were using as.data.frame.
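The two ideas in these answers (looking an object up by name with get, and padding a vector to a multiple of the interval before reshaping it) can be tried without downloading anything. In the sketch below, SPY is a hand-made stand-in data frame with invented prices, not real quantmod output.

```r
# Hypothetical stand-in for the object that getSymbols() would create.
SPY <- data.frame(SPY.Adjusted = c(10, 11, 12, 13, 14))

Symbol <- "SPY"   # a character value, not the object itself
interval <- 2

data <- get(Symbol)                           # fetch the object named "SPY"
col <- data[[paste0(Symbol, ".Adjusted")]]    # build the column name at run time

blocks <- ceiling(length(col) / interval)
padded <- `length<-`(col, blocks * interval)  # extend with NA to a full last block
origData <- as.data.frame(matrix(padded, ncol = interval, byrow = TRUE))
```

With five prices and an interval of 2, this gives three rows, and the last cell is NA because the final block is incomplete.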

Poisson Process algorithm in R (renewal processes perspective)

I have the following MATLAB code and I'm working on translating it to R:
nproc=40
T=3
lambda=4
tarr = zeros(1, nproc);
i = 1;
while (min(tarr(i,:))<= T)
tarr = [tarr; tarr(i, :)-log(rand(1, nproc))/lambda];
i = i+1;
end
tarr2=tarr';
X=min(tarr2);
stairs(X, 0:size(tarr, 1)-1);
It is the Poisson Process from the renewal processes perspective. I've done my best in R but something is wrong in my code:
nproc<-40
T<-3
lambda<-4
i<-1
tarr=array(0,nproc)
lst<-vector('list', 1)
while(min(tarr[i]<=T)){
tarr<-tarr[i]-log((runif(nproc))/lambda)
i=i+1
print(tarr)
}
tarr2=tarr^-1
X=min(tarr2)
plot(X, type="s")
The loop prints an aleatory number of arrays and only the last is saved by tarr after it.
The result has to look like...
Thank you in advance. All interesting and supportive comments will be rewarded.
Adding on to the previous comment, a few things happen in the MATLAB script that do not happen in the R code:
[tarr; tarr(i, :)-log(rand(1, nproc))/lambda]; as I understand it, this appends another row to the matrix and fills it with tarr(i, :)-log(rand(1, nproc))/lambda.
You will need a different method, because MATLAB and R handle this differently.
One glaring difference is that you treat tarr[i] in R and tarr(i, :) in MATLAB as equivalent, but they are very different. I think you want all the columns of row i, which in R is tarr[i, ].
The use of min also differs: in R, min() returns the minimum of the whole matrix (a single number), whereas in MATLAB min() returns the minimum of each column. In R you can use Rfast::colMins from the Rfast package for that.
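A quick sketch of these indexing and min differences on a tiny made-up matrix. It uses base R's apply(m, 2, min) as a stand-in for Rfast::colMins, so nothing needs to be installed.

```r
m <- matrix(c(3, 1, 4, 1, 5, 9), nrow = 2) # filled by column: row 1 = 3 4 5, row 2 = 1 1 9

row_1 <- m[1, ]               # all columns of row 1, like MATLAB's m(1, :)
overall_min <- min(m)         # a single number for the whole matrix
col_mins <- apply(m, 2, min)  # one minimum per column, like MATLAB's min(m)
```

Here row_1 is 3 4 5, overall_min is 1, and col_mins is 1 1 5.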
I am not very familiar with stairs, but something like ggplot2::qplot(..., geom = "step") may work.
I have tried to create something that works in R, but I am not sure what the required output really is. Nevertheless, hopefully some of the basics help you get it done on your side. Below is a quick attempt:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
# Major alteration, create a temporary row from previous row in tarr
temp <- matrix(tarr[i, ] - log((runif(nproc))/lambda), nrow = 1)
# Join temp row to tarr matrix
tarr <- rbind(tarr, temp)
i = i + 1
}
# I am not sure what was meant by tarr' in the matlab script I took it as inverse of tarr
# which in matlab is tarr.^(-1)??
tarr2 = tarr^(-1)
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
As you can see, I sorted min_for_each_col so that the plot is actually a stair plot and not some random stepwise plot. I think there is a problem: in the MATLAB code, 0:size(tarr, 1)-1 gives the number of rows less 1, but I can't figure out why, if we take colMins (and there are 40 columns), we would get around 20 steps. But I might be completely misunderstanding! I also changed T to T0, since in R T exists as a shorthand for TRUE and it is not good to overwrite it.
Hope this helps!
I downloaded GNU Octave today to actually run the MATLAB code. After watching the code run, I made a few tweaks to the great answer by @Croote:
nproc <- 40
T0 <- 3
lambda <- 4
i <- 1
tarr <- matrix(rep(0, nproc), nrow = 1, ncol = nproc)
while(min(tarr[i, ]) <= T0){
temp <- matrix(tarr[i, ] - log(runif(nproc))/lambda, nrow = 1) #fixed paren
tarr <- rbind(tarr, temp)
i = i + 1
}
tarr2 = t(tarr) #takes transpose
library(ggplot2)
library(Rfast)
min_for_each_col <- colMins(tarr2, value = TRUE)
qplot(seq_along(min_for_each_col), sort(min_for_each_col), geom="step")
Edit: Some extra plotting tweaks that seem to be closer to the original:
qplot(seq_along(min_for_each_col), c(1:length(min_for_each_col)), geom="step", ylab="", xlab="")
#or with ggplot2
df1 <- as.data.frame(cbind(min_for_each_col, 1:length(min_for_each_col))) # plain call instead of %>%, which ggplot2 does not provide
colnames(df1)[2] <- "index"
ggplot() +
geom_step(data = df1, mapping = aes(x = min_for_each_col, y = index), color = "blue") +
labs(x = "", y = "")
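For comparison, the same simulation can be sketched in base R only. This swaps Rfast::colMins on the transpose for apply(tarr, 1, min), which takes the minimum of each row of tarr directly, and uses plot(..., type = "s") in place of MATLAB's stairs. The set.seed call is an addition, there only to make the run repeatable.

```r
set.seed(1) # assumption: fixed seed, only for reproducibility

nproc <- 40
T0 <- 3
lambda <- 4

tarr <- matrix(0, nrow = 1, ncol = nproc)
i <- 1
while (min(tarr[i, ]) <= T0) {
  # next arrival time of each process = previous time + exponential waiting time
  tarr <- rbind(tarr, tarr[i, ] - log(runif(nproc)) / lambda)
  i <- i + 1
}

# Minimum over the processes at each step (one value per row of tarr),
# equivalent to MATLAB's min(tarr').
X <- apply(tarr, 1, min)

# Step plot of the event index against X, like stairs(X, 0:size(tarr, 1)-1).
plot(X, 0:(nrow(tarr) - 1), type = "s", xlab = "", ylab = "")
```

Since every entry of a new row is larger than the entry above it, X is non-decreasing without any sorting, and the loop stops at the first row whose minimum exceeds T0.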
I'm not too familiar with renewal processes or matlab so bear with me if I misunderstood the intention of your code. That said, let's break down your R code step by step and see what is happening.
The first 4 lines assign numbers to variables.
The fifth line creates an array with 40 (nproc) zeros.
The sixth line (which doesn't seem to be used later) creates an empty vector with mode 'list'.
The seventh line starts a while loop. I suspect this line is supposed to say while the min value of tarr is less than or equal to T ...
or it's supposed to say while i is less than or equal to T ...
It actually takes the minimum of a single boolean value (tarr[i] <= T). Now this can work because TRUE and FALSE are treated like numbers. Namely:
TRUE == 1 # returns TRUE
FALSE == 0 # returns TRUE
TRUE == 0 # returns FALSE
FALSE == 1 # returns FALSE
However, since the value of tarr[i] depends on a random number (see line 8), this could lead to the same code running differently each time it is executed. This might explain why the code "prints an aleatory number of arrays ".
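A small sketch of that loop condition, next to what was probably intended. T0 is used here instead of T so that R's built-in shorthand for TRUE is not overwritten.

```r
tarr <- array(0, 40)
T0 <- 3

# As written in the question: the comparison happens first, so min() only
# ever sees one logical value and returns it as an integer.
cond_as_written <- min(tarr[1] <= T0)

# Probably intended: take the minimum of all values first, then compare.
cond_intended <- min(tarr) <= T0
```

cond_as_written is the integer 1, which while() treats as TRUE, whereas cond_intended is a proper logical TRUE.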
The eighth line overwrites tarr with the result of the computation on the right. It takes the single value tarr[i] and subtracts from it the natural log of runif(nproc) divided by 4 (lambda), which gives 40 different values. Only the forty values from the last pass through the loop remain stored in tarr.
If you want to store all forty values from each pass through the loop, I'd suggest storing them in, say, a matrix or data frame instead. If that's what you want to do, here's an example of storing them in a matrix:
for (i in 1:nrow(yourMatrix)) {
  # computations
  yourMatrix[i, ] <- rowCreatedByComputations
}
See this answer for more info about that. Also, since it's a set number of values per run, you could keep them in a vector and simply append to the vector each loop like this:
vector <- c(vector,newvector)
The ninth line increases i by one.
The tenth line prints tarr.
The eleventh line closes the loop.
After the loop, tarr2 is assigned 1/tarr. Again, this will be the 40 values from the last pass through the loop (line 8).
Then X is assigned the min value of tarr2.
This single value is plotted in the last line.
Also note that runif samples from the uniform distribution -- if you're looking for a Poisson distribution see: Poisson
Hope this helped! Let me know if there's more I can do to help.

How to create a new column using function in R?

I have a data frame containing geographic positions. The positions are strings.
This is my function to parse the strings and get the positions in decimal degrees.
Example position: 23º 30.0'N
latitud.decimal <- function(y) {
latregex <- str_match(y,"(\\d+)º\\s(\\d*.\\d*).(.)")
latitud <- (as.numeric(latregex[1,2])) +((as.numeric(latregex[1,3])) / 60)
if (latregex[1,4]=="S") {latitud <- -1*latitud}
return(latitud)
}
Results> 23.5
Then I would like to create a new column in my original data frame by applying the function to every item in the Latitude column.
The same goes for the longitude, in another new column.
I know how to do this using Python and pandas, but I am a newbie in R and cannot find the solution.
I am trying with
lapply(datos$Latitude, 2 , FUN= latitud.decimal(y))
but it does not pick up the y argument, which should be each value in the column.
Note that str_match is vectorized, as stated on its help page, help("str_match").
I lack a reproducible example and data to answer the question properly. This page describes how to write questions that are easier to reproduce and therefore more likely to get good answers.
As I lack data and code, I cannot test whether I am actually hitting the spot, but I will give it a shot anyway.
Because str_match is vectorized, we can apply the entire function without lapply and create the new column directly. I'll slightly rewrite your function to take advantage of the vectorization. Note the missing 1's in latregex[., .]
library(stringr) # provides str_match
latitud.decimal <- function(y) {
  latregex <- str_match(y, "(\\d+)º\\s(\\d*.\\d*).(.)")
  # columns 2 and 3 hold the degrees and minutes of every input string
  latitud <- as.numeric(latregex[, 2]) + as.numeric(latregex[, 3]) / 60
  which_south <- which(latregex[, 4] == "S")
  latitud[which_south] <- -latitud[which_south]
  latitud
}
Now that the function is ready, creating a column can be done with the $ operator. If the data is very large, it can be done more efficiently with the data.table package. See this Stack Overflow page for an example of how to assign via data.table.
In base R we would simply perform the action as
datos$new_column <- latitud.decimal(datos$Latitude)
datos$lat_decimal = sapply(datos$Latitude, latitud.decimal)
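If stringr is not available, a base R variant could look like the sketch below. It swaps str_match for sub() with capture groups, which is a different technique from the answers above. The function name to_decimal, the slightly stricter pattern, and the sample rows (including the Longitude column) are all made up for illustration. The degree sign is written as \u00ba in the test data to sidestep file-encoding issues.

```r
# Hypothetical helper: converts strings such as "23º 30.0'N" to decimal degrees.
# Strings that do not match the pattern end up as NA (with a coercion warning).
to_decimal <- function(pos) {
  pattern <- "^(\\d+)\\D+(\\d+\\.?\\d*)'?\\s*([NSEW])$"
  degrees <- as.numeric(sub(pattern, "\\1", pos))
  minutes <- as.numeric(sub(pattern, "\\2", pos))
  hemisphere <- sub(pattern, "\\3", pos)
  value <- degrees + minutes / 60
  ifelse(hemisphere %in% c("S", "W"), -value, value) # south and west are negative
}

# Made-up sample data with the column names used in the question.
datos <- data.frame(
  Latitude = c("23\u00ba 30.0'N", "10\u00ba 15.0'S"),
  Longitude = c("3\u00ba 45.0'W", "100\u00ba 30.0'E")
)

datos$lat_decimal <- to_decimal(datos$Latitude)
datos$lon_decimal <- to_decimal(datos$Longitude)
```

Because sub() and the arithmetic are vectorized, the same function serves both columns and no lapply or sapply call is needed.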

Converting slow "WHILE" loop to "apply"-type function

I have created a while loop that is being executed across a sizable data set. The loop is as such:
i = 1
while(i<=m){
Date = London.Events$start_time[i]
j=1
while(j<=n){
Hotel = London.Hotels$AS400.ID[j]
Day.Zero[i,j] = sum(London.Bookings$No.of.Rooms[London.Bookings$Stay.Date == Date & London.Bookings$Legacy.Hotel.Code == Hotel])
j=j+1
}
i=i+1
}
Where:
m = 9957 #Number of Events
n = 814 #Number of Hotels
Day.Zero = as.data.frame(matrix(0, 9957, 814))
Briefly explained, for each combination of date and hotel (pulled from two other data frames), produce the sum from the column London.Bookings$No.of.Rooms and deposit that into the corresponding row of the matrix.
The loop appears to run without error, however when stopping it after 5 mins+ it is still running and nowhere near complete!
I would like to know how one of the apply family of functions could be used as a replacement here for much faster completion.
Thanks!
Probably,
xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)
gets you something similar to what you want.
Using library dplyr, you can do something like the following (assuming your input data frame has such column names - vaguely interpreted from your code / question):
library(dplyr)
London.Bookings %>% group_by(Legacy.Hotel.Code, Stay.Date) %>% summarise(Total.No.of.Rooms = sum(No.of.Rooms))
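To make the idea concrete, here is a sketch on a tiny invented bookings table that reuses the column names from the question. A single xtabs call produces the whole date-by-hotel table of room totals that the nested while loops fill one cell at a time.

```r
# Made-up bookings; only the column names come from the question.
London.Bookings <- data.frame(
  Stay.Date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01"),
  Legacy.Hotel.Code = c("H1", "H1", "H1", "H2"),
  No.of.Rooms = c(2, 3, 4, 1)
)

# Rows are stay dates, columns are hotel codes, cells are summed room counts.
totals <- xtabs(No.of.Rooms ~ Stay.Date + Legacy.Hotel.Code, data = London.Bookings)

rooms_h1_day1 <- totals["2020-01-01", "H1"] # 2 + 3
rooms_h2_day2 <- totals["2020-01-02", "H2"] # no bookings for this combination
```

Combinations with no bookings come out as 0, which matches what sum() over an empty selection gives in the original loop. Dates or hotels that never appear in London.Bookings are not part of the table at all, so they would need to be added separately.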
