How do I create a dynamic count of categories in Power BI (histogram view)?

I am trying to take a dataset that is currently at the hourly level and build a histogram from a calculated field, "WeekNumCounter". The data shown below has been aggregated via a binary calculation. Can someone recommend the best way to visualize this in Power BI? The expected result is one instance of 3 for Finance, and two instances of 3 for Tech.
| Department | Name | WeeknumYear | WeekNumCounter |
|------------|------|-------------|----------------|
| Finance    | Joe  | W21 2022    | 3              |
| Finance    | Jane | W21 2022    | 4              |
| Tech       | Bill | W21 2022    | 3              |
| Tech       | Sam  | W21 2022    | 3              |
WeekNumCounter =
INT (
    AVERAGEX (
        SUMMARIZE (
            s2_upload,
            s2_upload[Name],
            'Date'[WeeknumYear],
            "Count", COUNTA ( 'Date'[WeeknumYear] )
        ),
        [Count]
    )
)

Related

Plotting incrementally increasing dates in RStudio

I want to plot country versus a date range of exposures to COVID-19, as a learning exercise in RStudio.
I've been trying to read the CSV and store it as a dataframe, then plot via ggplot, but I think I'm doing this incorrectly, since this is a date range.
How could I approach plotting the infected countries against the dates, which increase daily in the header?
| Province/State | 1/21/2020 22:00 | 1/22/2020 12:00 | 1/23/2020 12:00 | 1/24/2020 0:00 |...
|----------------|-----------------|-----------------|-----------------|----------------|
| Anhui | 1 | 1 | 2 | 5 |...
| Beijing | 1 | 1 | 3 | 4 |...
| Chongqing | 2 | 4 | 5 | 6 |...
These case counts are not accurate; they were just written in Markdown to provide a sample table of data.
Thank you!
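One possible approach, sketched under assumptions: the CSV (here the placeholder name "covid.csv") is wide, with a Province/State column followed by one column per timestamp. Reshape it to long format with tidyr, parse the header dates, and let ggplot draw one line per province:

# A minimal sketch, assuming a wide CSV with a Province/State column followed
# by one column per timestamp; the file name "covid.csv" is a placeholder.
library(tidyr)
library(dplyr)
library(ggplot2)

wide <- read.csv("covid.csv", check.names = FALSE)

long <- wide %>%
  pivot_longer(
    cols = -`Province/State`,   # every date column
    names_to = "timestamp",
    values_to = "cases"
  ) %>%
  mutate(timestamp = as.POSIXct(timestamp, format = "%m/%d/%Y %H:%M"))

ggplot(long, aes(x = timestamp, y = cases, colour = `Province/State`)) +
  geom_line() +
  labs(x = "Date", y = "Confirmed cases")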

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R with which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I do not have data; hence no proportion table can be calculated. So what I would like to do is take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here is some insight into my dataframe and function.
My raw dataframe looks roughly as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.22 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (year and quarter) and area for which X individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically subset the raw dataframe for a particular year, quarter and area combination, and from that subset calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand to one row per measured individual, then build the age-length key
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)  # return() must come last, or the plotting branch is never reached
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area and year I have data for quarters 1 and 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here quarter 2, with the same area and year), and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that in some cases I do not have data for a particular year at all! In the example above, one can see this by looking at area 24 in year 2007. In this case I cannot borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I haven't made any progress.
So, any help here will be very much appreciated.
Here is the LAK function I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of an empty subset, or a placeholder row that is all NA
  if(nrow(sALK) == 0 || any(rowSums(is.na(sALK)) > 0)){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: INCLUDE IMPUTATION RULES HERE
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)  # again, return() must come after the plotting branch
}
So, I finally came up with a partial solution to my problem and will include my function here in case it might be of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter, area and species
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  # an "empty" combination shows up as a single all-NA placeholder row
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all rows for the same year and area, regardless of quarter
    sALK2 <- subset(df, year==syear & area==sarea)
    # find a quarter (comb_index) that actually has measurements
    vals <- as.data.frame(table(sALK2$comb_index))
    colnames(vals)[1] <- "comb_index"
    idx <- which(vals$Freq > 1)
    quarterId <- as.numeric(as.character(vals[idx, "comb_index"]))
    imput <- subset(df, year==syear & area==sarea & comb_index==quarterId)
    # expand to one row per individual and build the proportion table
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_at_length_age), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_at_length_age), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular year & area combination. Yet I'm still struggling to figure out what to do when I have no data at all for a particular year & area combination. In that case I need to borrow data from the closest year that contains data for all the quarters for the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter=1, and so on and so forth.
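A hedged sketch of that year-fallback rule (the helper name nearest_subset and the max_back cut-off are my own inventions; the column names follow the sample data above): given a quarter and area, step back one year at a time until a subset with measured individuals turns up, then feed that subset into the rest of the pipeline.

# Sketch of the year-fallback rule: for a given quarter and area, step back
# one year at a time until a subset with measured individuals is found.
# Column names (year, quarter, area, no_individuals) follow the sample data;
# nearest_subset and max_back are illustrative, not part of the original code.
nearest_subset <- function(df, Year, Quarter, Area, max_back = 10) {
  for (back in 0:max_back) {
    s <- subset(df, year == (as.numeric(Year) - back) &
                    quarter == Quarter & area == Area)
    # accept the first subset that has at least one measured individual
    if (nrow(s) > 0 && !all(is.na(s$no_individuals))) return(s)
  }
  stop("No non-empty subset found within max_back years")
}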
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also lets you see how the imputed data are distributed, so that you can choose the method best suited to your problem. Check this brief explanation and the original package description.
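For illustration, a minimal mice call might look like the following; mydata is a placeholder for your dataframe, and m and method are illustrative defaults rather than tuned choices:

# Minimal mice example on a generic data frame containing NAs.
# mydata is a placeholder; m = 5 and method = "pmm" are illustrative defaults.
library(mice)
imp <- mice(mydata, m = 5, method = "pmm", seed = 1)  # 5 imputed datasets
summary(imp)                    # inspect the imputation setup
completed <- complete(imp, 1)   # extract the first completed dataset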

Time Series Forecast in R - Weekly Sparse Data with Variable Starting Point

I am working on building a time series forecast model for a problem that involves a dataset of manufacturers and their product offerings in a large retail outlet.
The problem is as follows:
- Let's say you have thousands of manufacturers (M1 to Mn).
- There is a retail outlet that takes goods from these manufacturers to sell.
- The manufacturers supply product offerings to the store on a weekly basis (the same product, or a new product at the same or a different price; but for simplicity let's say they supply new distinct products), from week W1 to Wn.
- Each manufacturer started working with the retail outlet on a different date (which can be anything in the past).
- From the time they started working with the retail outlet, the history suggests that some manufacturers have been constantly supplying products weekly and some have been sparse.
- In the dataset below, N/A means they hadn't yet started doing business with the retail outlet; e.g. manufacturer M1 joined the retailer in week 3 of our time window, with a first offering count of 10.
|     | W1  | W2  | W3  | W4  | W5 | W6 | W7 | W8 | W9 | ... | Wn |
|-----|-----|-----|-----|-----|----|----|----|----|----|-----|----|
| M1  | N/A | N/A | 10  | 0   | 5  | 0  | 10 | 15 | 12 | ... | 23 |
| M2  | 10  | 5   | 12  | 8   | 4  | 9  | 0  | 0  | 0  | ... | 4  |
| M3  | 9   | N/A | 0   | 0   | 45 | 45 | 45 | 38 | 12 | ... | 11 |
| ... | N/A | N/A | N/A | N/A | 12 | 0  | 10 | 15 | 12 | ... | 28 |
| ... | N/A | N/A | N/A | 0   | 5  | 0  | 8  | 15 | 12 | ... | 12 |
| ... | 5   | N/A | 60  | 0   | 5  | 0  | 40 | 67 | 23 | ... | 46 |
| ... | N/A | N/A | 12  | 9   | 12 | 15 | 10 | 15 | 43 | ... | 9  |
| Mn  | 0   | N/A | 90  | 78  | 65 | 0  | 10 | 15 | 12 | ... | 65 |
Now, for all the manufacturers, I want to forecast from the point Wn their supply for the next 8 weeks, using this historical data (i.e. Wn+1 to Wn+8).
I am trying to forecast using the auto ARIMA, seasonal naive, ETS and STL models in R (via the forecast package).
library(forecast)

# build the full weekly series, then restrict it to the training window
tsfull   <- ts(tsdata, start = c(ts_series_strt_dt, weeknum_strt), frequency = 52)
tswindow <- window(tsfull, start = c(ts_series_strt_dt, weeknum_strt),
                   end = c(ts_series_end_dt, weeknum_end))

SN     <- snaive(tswindow, h = 8)                # seasonal naive
AR     <- forecast(auto.arima(tswindow), h = 8)  # auto ARIMA (forecast.Arima is deprecated)
etsfr  <- forecast(ets(tswindow), h = 8)         # exponential smoothing
stlffr <- forecast(stlf(tswindow), h = 8)        # STL decomposition + forecast
Since the data contain a lot of zeros, the starting point of each series differs per manufacturer, and there are thousands of manufacturers, I see the RMSE vary erratically because each series is unique. I have also tried grouping the manufacturers based on their age with the outlet. I am not able to decide on the best forecast model for this problem.
I am not an expert in this field. Any thoughts and opinions would be really helpful.
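To make the comparison more systematic, one option is to score each manufacturer's series separately. The sketch below is only an illustration under stated assumptions: mfr_series is a hypothetical list of weekly numeric vectors (NA before a manufacturer joined), and the 52-week seasonality follows the code above.

library(forecast)

# Rough sketch: per-manufacturer model comparison. Trim the leading N/A weeks,
# hold out the last 8 weeks, and compare seasonal naive vs auto.arima by RMSE.
# mfr_series is a placeholder: a list with one numeric weekly vector per
# manufacturer (NA for the weeks before they joined).
score_models <- function(x, h = 8) {
  if (all(is.na(x))) return(c(snaive = NA, arima = NA))
  x <- x[which(!is.na(x))[1]:length(x)]      # drop the pre-join N/A weeks
  if (length(x) < 52 + h)                    # need one full season plus hold-out
    return(c(snaive = NA, arima = NA))
  train <- ts(x[seq_len(length(x) - h)], frequency = 52)
  test  <- x[(length(x) - h + 1):length(x)]
  c(snaive = accuracy(snaive(train, h = h), test)["Test set", "RMSE"],
    arima  = accuracy(forecast(auto.arima(train), h = h), test)["Test set", "RMSE"])
}

# rmse <- t(sapply(mfr_series, score_models))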

Group by Week with Columns for each day of week

I'm trying to create a report for my ASP.NET application which will show the quantity of each item, in combination with its unit, that was ordered for each day of the week. The days of the week are columns.
To be more specific:
I have two tables. One is the Orders table, with order id, customer name, date, etc.
The second table is OrderItems; this table has order id as a foreign key, plus order item id, item name, unit (e.g. each, box, case), quantity, and price.
When a user picks a date range for the report on my ASP page, for example from 3/2/12 to 4/2/12, the report will group order items by week and will look as follows:
**week (1), starting from the Sunday of such date to the Saturday of such date**

| item  | unit | Sun | Mon | Tues | Wed | Thur | Fri | Sat | Total price for week |
|-------|------|-----|-----|------|-----|------|-----|-----|----------------------|
| item1 | bag  | 3   | 0   | 12   | 8   | 45   | 1   | 4   | $1234                |
| item4 | box  | 2   | 4   | 5    | 0   | 5    | 2   | 6   | $1234                |

**week (2), starting from the Sunday of such date to the Saturday of such date**

| item  | unit | Sun | Mon | Tues | Wed | Thur | Fri | Sat | Total price for week |
|-------|------|-----|-----|------|-----|------|-----|-----|----------------------|
| item1 | bag  | 3   | 0   | 12   | 8   | 45   | 1   | 4   | $1234                |
| item4 | box  | 2   | 4   | 5    | 0   | 5    | 2   | 6   | $2354                |

**week (3), starting from the Sunday of such date to the Saturday of such date**

| item  | unit | Sun | Mon | Tues | Wed | Thur | Fri | Sat | Total price for week |
|-------|------|-----|-----|------|-----|------|-----|-----|----------------------|
| item1 | bag  | 3   | 0   | 12   | 8   | 45   | 1   | 4   | $1234                |
| item4 | box  | 2   | 4   | 5    | 0   | 5    | 2   | 6   | $2354                |
I wish I could show that I have already started something, but Crystal isn't my strong point and I don't even know where to start tackling this one. I do know how to pass parameters and a datatable that I have pre-filtered before passing it to the report, for example filtering items by date range and customer or order id.
Any help would be much appreciated.
Create a formula for each day of the week that totals the order quantity. For example, the Sunday quantity in Crystal formula syntax (the field names here are illustrative placeholders for your own):

// Sunday quantity; DayOfWeek returns 1 for Sunday by default
if DayOfWeek({Orders.OrderDate}) = 1
then {OrderItems.Quantity}
else 0

Add each day formula to the detail section of the report and then summarize it at each group level. To group by week, just group on the date field, then set the grouping option to "by week". Suppress the detail section, and you'll have what you are looking for.
The function is DayOfWeek; it returns a number from 1 (Sunday) to 7 (Saturday) by default.

Interpolate variables on subsets of dataframe

I have a large dataframe of survey observations from multiple states across several years. Here's the data structure:
| state | survey.year | time1 | obs1 | time2 | obs2 |
|-------|-------------|-------|------|-------|------|
| CA    | 2000        | 1     | 23   | 1.2   | 43   |
| CA    | 2001        | 2     | 43   | 1.4   | 52   |
| CA    | 2002        | 5     | 53   | 3.2   | 61   |
| ...   |             |       |      |       |      |
| CA    | 1998        | 3     | 12   | 2.3   | 20   |
| CA    | 1999        | 4     | 14   | 2.8   | 25   |
| CA    | 2003        | 5     | 19   | 4.3   | 29   |
| ...   |             |       |      |       |      |
| ND    | 2000        | 2     | 223  | 3.2   | 239  |
| ND    | 2001        | 4     | 233  | 4.2   | 321  |
| ND    | 2003        | 7     | 256  | 7.9   | 387  |
For each state/survey.year combination, I would like to interpolate obs2 so that its time locations line up with (time1, obs1).
I.e., I would like to break the dataframe up into state/survey.year chunks, perform my linear interpolation, and then stitch the individual state/survey.year dataframes back together into a master dataframe.
I have been trying to figure out how to use the plyr and Hmisc packages for this, but I keep getting myself in a tangle.
Here's the code that I wrote to do the interpolation:
require(Hmisc)
df <- new.obs2 <- NULL
for (i in 1:(0.5*(ncol(indirect)-1))){
df[,"new.obs2"] <- approxExtrap(df[,"time1"],
df[,"obs1"],
xout = df[,"obs2"],
method="linear",
rule=2)
}
But I am not sure how to unleash plyr on this problem. Your generous advice and suggestions would be much appreciated. Essentially, I am just trying to interpolate obs2, within each state/survey.year combination, so that its time references line up with those of obs1.
Of course, if there's a slick way to do this without invoking plyr functions, I'd be open to that too.
Thank you!
This should be as simple as,
library(plyr)
library(Hmisc)

ddply(df, .(state, survey.year), transform,
      new.obs2 = approxExtrap(time2, obs2, xout = time1,
                              method = "linear",
                              rule = 2)$y)
(Note the $y: approxExtrap returns a list, and you only want the interpolated values; and you interpolate (time2, obs2) at the time1 locations, per your stated goal.) But I can't promise you anything, since I haven't the foggiest idea what the point of your for loop is. (It's overwriting df[,"new.obs2"] each time through the loop? You initialize the entire data frame df to NULL? What's indirect?)
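If you would rather avoid plyr altogether, as the question allows, a base R sketch with split()/lapply() does the same thing; it assumes the same df as above, with approxExtrap again coming from Hmisc:

# A plyr-free sketch: split by state/survey.year, interpolate (time2, obs2)
# onto the time1 grid within each chunk, then rbind the pieces back together.
library(Hmisc)

interp_one <- function(d) {
  d$new.obs2 <- approxExtrap(d$time2, d$obs2, xout = d$time1,
                             method = "linear", rule = 2)$y
  d
}

out <- do.call(rbind,
               lapply(split(df, list(df$state, df$survey.year), drop = TRUE),
                      interp_one))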
