SPOTFIRE % of total in cross table - percentage

I would like to use the "% of Total" aggregation in a cross table to calculate each shop's percentage of the total income per year (see table below).
However, it does not calculate this correctly: it computes OVER AXIS ROWS instead of OVER YEAR:
Sum([INCOME]) THEN [Value] / Sum([Value]) OVER (All([Axis.Rows]))
So I switched to the "Max" aggregation, where I get the proper value (it also works with "Min").
However, the subtotal per year then shows the MAX value and not 100%.
Do you know how I should write my formula so that it works correctly when I select the "% of Total" aggregation?

Related

Plotting proportions of choices of each participant separately

I'd like to find an efficient way to plot, for each participant ($participant_num), the proportion of responses ($resp) every 10 trials ($trial, out of 200 trials per participant).
When I did it for a subset of my sample (only 30 participants) I used very rudimentary code, for which I first created a separate data frame for each subject:
whichSubject<-6 # Which subject do you want to analyse?
sData<-filter(banditData,subject==whichSubject)
and then I tried to get the proportions for each block of 10 trials and put them in separate columns:
sData$newcolumn <- NULL
sData$newcolumn1_10<- table(sData[1:10,]$resp)/length(sData[1:10,]$resp)
sData$newcolumn11_20<- table(sData[11:20,]$resp)/length(sData[11:20,]$resp)
sData$newcolumn21_30<- table(sData[21:30,]$resp)/length(sData[21:30,]$resp)
and so on for all 200 trials, separately for each subject. Then I reshaped the data frame to long format and plotted it with the following script:
ggplot()+
geom_line(data=rewardDF,aes(x=Trial,y=pHappy,colour=Bandit), linetype="dashed", size=1.03)+
geom_point(data=longdf,aes(x=trial, y=resp_prop,colour=bandit,shape=bandit),size=3)+
geom_line(data=longdf,aes(x=trial, y=resp_prop,colour=bandit),size=1)+
scale_shape_manual(values=SymTypes)+
scale_colour_manual(values=cbPalette)+
labs(col='bandit',y='p(choice)',x='trials')+
scale_x_continuous(breaks = seq(0,200,by=10), limits=c(0,203), expand=(c(0,0)))+
scale_y_continuous(breaks = seq(0,1,by=0.1), limits=c(0,1.03), expand=(c(0.02,0)))+
theme_bw()+
ggsave(paste(c("data/S",whichSubject,"p(choice_absorangeblue).png"),collapse=""), scale=2,dpi = 300)
The output was something like this. Each dot represents how often a participant selected left (resp=0) vs. right (resp=1) within 10 trials. For example, in a task where you choose between two arms and left corresponds to arm 1, a participant who selected left 3 times out of 10 would get a dot for left at 0.3 on the y axis and, conversely, a dot for right at 0.7.
However, I now have over 200 participants, and this approach is far too time-consuming!
I was thinking of adding facet_grid(participant_num ~ .)+ to my ggplot code in order to plot each participant separately without having to subset the data first. However, I haven't found a way to plot the proportion of choices without calculating them separately. Do you have any tips on how I could do this within ggplot?
Many thanks in advance for your help!!

Scaling GGplot based on other dataset

I have a graph showing total sales per state. I have another data frame with population data, among other measures, per city that can be rolled up to state. Given that there is a large variance in population per state, I want to scale the sales by each state's share of the total population, i.e. see whether particular states actually buy more or less relative to their size. Any ideas where to start, please?
Very basic code I am using to start with.
State_Sales_Summary_Plot <- ggplot(
data=Customers_DF_Clean,
aes(x=State.Code, y=Total.Spent)
) +
geom_bar(stat="identity")
Population Version of the code below:
Population_Per_State <- ggplot(
data=SSC_IndexData_Education_and_Occupation_DF_Clean,
aes(x=State.Name, y=Population)
) +
geom_bar(stat="identity")

Plot every year as line with months on Xaxis and variable on Y-axis from NetCDF

I have netcdf data with lat,lon,time as dimensions and temperature temp as variable. It has daily temperature data for 10 years.
For a single location I can plot the time series. But how do I plot every year separately, with year as hue, months on the x-axis and temp on the y-axis? I want 10 lines on my graph, one per year. Every line is a year, made up of either 12 monthly means or the daily data. An example is here.
If possible, please also tell me how to add the mean and median of all the years as separate lines among these 10 yearly lines (see the example picture).
I'm tempted to agree with the comment that it would be good to show a little more effort in terms of what you've tried. It would also be good to mention what you've read (in e.g. the xarray documentation: https://xarray.pydata.org/en/stable/), which I believe has many of the components you need.
I'll start by setting up some mock data, like you mention, with five years of daily (synthetic) data.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2004-12-31")
base = xr.DataArray(
data=np.ones((time.size, 3, 2)),
dims=("time", "lat", "lon"),
coords={
"time": time,
"lat": [1, 2, 3],
"lon": [0.5, 1.5],
},
)
To make the data a bit more comparable with your example, I'm going to add yearly seasonality (based on day of year), and make every year increase by 0.1.
seasonality = xr.DataArray(
data=np.sin((time.dayofyear / 365.0) * (2 * np.pi)),
coords={"time": time},
dims=["time"],
)
trend = xr.DataArray(
data=(time.year - 2000) * 0.1,
coords={"time": time},
dims=["time"],
)
da = base + seasonality + trend
(You can obviously skip these two parts; in your case, you'd only call xarray.open_dataset() or xarray.open_dataarray().)
I don't think your example is grouped by month: it's too smooth. So I'm going to group by day of year instead.
Let's start by getting a single location, then using the dt accessor:
https://xarray.pydata.org/en/stable/time-series.html#datetime-components
In this case, it's also most convenient to store the data as a DataFrame, since it essentially becomes a table (month or dayofyear as the rows, separate years etc. as columns). First we select one location, calculate the minimum and maximum values per day of year, and store them in a pandas DataFrame:
location = da.isel(lat=0, lon=0)
dataframe = location.groupby(da["time"].dt.dayofyear).min().drop_vars(["lat", "lon"]).to_dataframe(name="min")
dataframe["max"] = location.groupby(da["time"].dt.dayofyear).max().values
Next, grab the year by year data, and add it to the DataFrame:
for year, yearda in location.groupby(location["time"].dt.year):
    dataframe[year] = pd.Series(index=yearda["time"].dt.dayofyear, data=yearda.values)
If you want monthly values, add another groupby step:
for year, yearda in location.groupby(location["time"].dt.year):
    monthly_mean = yearda.groupby(yearda["time"].dt.month).mean()
    dataframe[year] = pd.Series(index=monthly_mean["month"], data=monthly_mean.values)
Note that by turning the data into a pandas Series first, pandas can align the values appropriately, based on the values of the index (dayofyear here), even though we don't have 366 values for every year.
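To illustrate that alignment in isolation, here is a minimal sketch using only pandas. The names table and short_year and all the values are made up for the illustration; they are not part of the data above.

```python
import pandas as pd

# Hypothetical miniature table indexed by day of year (only days 1..4 here).
table = pd.DataFrame(index=pd.Index([1, 2, 3, 4], name="dayofyear"))

# A "year" that only has values for days 1..3, listed in a different order.
short_year = pd.Series(index=[3, 1, 2], data=[30.0, 10.0, 20.0])

# Column assignment aligns on the index labels, not on position:
# day 4 has no matching label in the Series, so it becomes NaN.
table["short_year"] = short_year

print(table)
```

The same mechanism is what fills day 366 with NaN for the non-leap years in the loop above.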
Next, plot it:
dataframe.plot()
It will automatically assign hue based on the columns.
(My minimum and maximum coincide with 2000 and 2004 due to the way I set up the mock data, but you get the idea.)
In terms of styling, options, etc., you might like seaborn better:
https://seaborn.pydata.org/index.html
import seaborn as sns
sns.lineplot(data=dataframe)
If you want different styling or different kinds of plots (e.g. the colored zones your example has), you'll have to combine different plots, e.g. as follows:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.fill_between(x=dataframe.index, y1=dataframe["min"], y2=dataframe["max"], alpha=0.5, color="orange")
dataframe.plot(ax=ax)
Note that seaborn, pandas, xarray, etc. all use matplotlib behind the scenes. Many of the plotting functions also accept an ax argument, to draw on top of an existing plot.
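The question also asks for the mean and median of all years as separate lines. One possible way, sketched here on a small made-up table rather than on the data above, is to take row-wise statistics over the year columns only. The names yearly, year_columns and summary and all the values are assumptions for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the table built above: rows are day of year,
# one column per year (values are made up for illustration).
days = pd.Index(range(1, 6), name="dayofyear")
yearly = pd.DataFrame(
    {
        2000: [1.0, 2.0, 3.0, 4.0, 5.0],
        2001: [2.0, 3.0, 4.0, 5.0, 6.0],
        2002: [6.0, 7.0, 8.0, 9.0, np.nan],  # a missing day, like day 366
    },
    index=days,
)

# Restrict the statistics to the year columns, so that helper columns
# such as "min" and "max" would not be averaged in by accident.
year_columns = [2000, 2001, 2002]

summary = yearly.copy()
summary["mean"] = yearly[year_columns].mean(axis=1)      # NaN is skipped by default
summary["median"] = yearly[year_columns].median(axis=1)

print(summary)
```

Calling summary.plot() would then draw the mean and median as two extra lines next to the yearly ones. With the real table, the list of year columns would have to be adapted to the years actually present.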

Histogram to show the count per month or day in R

I'm trying to create a histogram that shows the count of events per date for each month, so I can see the total for each month. When I create the histogram, the left-hand axis shows a density instead of a count.
How do I get a graph that shows the total number of dates per month?
Example code (rough indication only)
data_toview <- read.csv("file_with_data.csv", stringsAsFactors = FALSE)
# Distribution of count per day, so I know how the data can spike
hist(data_toview$interesting_date, breaks = "days")
Note: I may be using the wrong plot type, which is why I did not specify histogram in the question title. Any suggestions for getting the months on the labels are also welcome.

How to order multiple plots through facet_wrap in ggplot2

I have a dataset like the following one, and I have about 1 million rows like this:
orderid prodid priceperitem date category_eng
3010419 2 62420 18.90 2014-10-09 roll toliet paper
I am currently drawing scatterplots of these products, using priceperitem as the y-axis and date as the x-axis. I have also ranked the products by the coefficient of variation of their prices over time. I have summarized these results in another dataset like the following one:
prodid mean count sd cv
424657 12.7124 5541.0000 10.239 80.54999886
158726193 23.7751 1231.0000 17.7567 74.68621596
And I have used the following code to get the scatterplots of many products at the same time:
ggplot(Roll50.last, aes(x=date, y=priceperitem)) + geom_point() + facet_wrap(~prodid)
But I want to order the plots by the products' CV, which I have summarized in another data.frame. Is there a way to specify that the panels should be ordered by a value in another data frame?
Below is a small sample dataset. Basically, the idea is to get each product's price CV, which is s.d./mean, and to plot the scatterplots of these products in order of CV from highest to lowest.
#generate reproducible example
id = c(1: 3)
date = seq(as.Date("2011-3-1"), as.Date("2011-6-8"), by="days")
id = rep(c(1,2,3,4),each=25)
set.seed(01)
id = sample(id)
price = rep(c(1:20), each = 5)
price = sample(price)
data = data.frame(date, id, price)
You can turn prodid into a factor and set the order of the factor categories to be based on the size of the coefficient of variation. For example, let's assume your data frame with the cv values is called cv. Then, to order prodid by the values of cv:
Roll50.last$prodid = factor(Roll50.last$prodid,
levels = cv$prodid[order(cv$cv, decreasing=TRUE)])
Now, when you plot the data, the prodid facets will be ordered by decreasing size of cv.
