SAP BO XI Desktop Intelligence Aggregate Calculations

I am new to Business Objects, and more specifically to Desktop Intelligence. We are trying to use it as a reporting tool for our scientific data, but we are running into issues when performing calculations to "create" objects and then trying to perform statistical or aggregate functions on them. For example, I run a query that pulls the columns subject name, result day, parameter, and result value. In a table it would look like this:
SUBJECT  DAY  PARAM   RV
10001    0    Length  5.32
10001    0    Width   4.68
10002    0    Length  3.98
10002    0    Width   1.64
10001    7    Length  8.89
10001    7    Width   7.30
10002    7    Length  4.17
10002    7    Width   2.19
We then use the equation for Volume, L * W^2 * 0.52, defined in the report as a measure variable. Using a cross tab with days across the top and subjects down the side, I display Length, Width, and Tumor Volume like so:
          0                          7
SUBJECT   L     W     V              L     W     V
10001     5.32  4.68  60.59          8.89  7.30  246.35
10002     3.98  1.64  5.57           4.17  2.19  10.40
COUNT                 #                          #
MEAN                  #                          #
Within the footers I'd like to display aggregates such as count, standard deviation, and percent change from day zero, but they all come out wrong. It isn't simply that the n is doubled to account for the fact that Length and Width make up Volume, either. I have no clue and am at a loss. Any advice, suggestions, or guidance would be welcomed.
Thanks in advance,
Jeff

I assume that your cross tab looks like the following in Slice and Dice:
¦ <DAY> (Break)
¦ <PARAM>
--------------------
<SUBJECT> ¦ <RV>
So your table should look something like:
          0                                7
          Length  Width  Volume            Length  Width  Volume
10001     5.32    4.68   60.59             8.89    7.30   246.35
10002     3.98    1.64   5.57              4.17    2.19   10.40
With <DAY>'s break footer having the volume variable.
For your volume calculation I've used the formula: =(<RV> Where (<PARAM>="Length"))*(Power(<RV> Where (<PARAM>="Width") , 2))*0.52
Right-click on the cross tab edge and select Format Crosstab..., then check the Show Footer check box in the Down Edge Display section of the General tab. Add extra rows to the footer if you need them.
Then manually add the formulas for the count, =Count(<VOLUME>), and the mean, =Average(<VOLUME>).
For me, the final table now looks like this (with values rounded to 2 dp):
          0                                7
          Length  Width  Volume            Length  Width  Volume
10001     5.32    4.68   60.59             8.89    7.30   246.35
10002     3.98    1.64   5.57              4.17    2.19   10.40
Count                    2.00                             2.00
Mean                     33.08                            128.37
The trick is making sure the calculations happen in the right context, that is, with respect to the header variables in the different sections of the table. You can add variables to and remove variables from the calculation context with the In, ForAll, and ForEach functions, although I haven't needed them for this table.
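For the other footer aggregates the question asks about, the same pattern should extend, although I haven't verified these exact formulas in Desktop Intelligence, so treat them as a sketch: a standard deviation footer along the lines of =StdDev(<VOLUME>), and a percent-change-from-day-zero footer built with the same Where technique used for the volume variable, e.g. =(((<VOLUME> Where (<DAY>=7)) - (<VOLUME> Where (<DAY>=0))) / (<VOLUME> Where (<DAY>=0))) * 100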

Subset/extract the highest values of groups in one column based on values of groups in other columns

I'm looking at oxygen concentrations in relation to bottom trawling at different depths in inner Danish waters for the last 40 years.
I have a data frame (Oxy) with four columns: ID, Date, Depth and Oxygen. The Oxygen has been measured throughout many years (Date), at many different locations (ID) and at many different Depths down the water column, spanning from 0-50 meters.
I would like to create a data frame with the Oxygen for the last 4 meters of Depth (from the bottom up to 4 meters above it) for each station and the corresponding date. The measurements are not taken at every whole meter but at varying depths, and the depths at which Oxygen has been measured are not the same for each ID: one ID was sampled at 0.2, 0.4, 0.6 meters, etc., and another at 0.67, 1.3, 1.55 meters, etc. The maximum Depth also varies by ID: for one station the deepest measurement is at 30 meters, and for another it's 46 meters.
I have about 4 million rows, so this is just an output of my data:
ID Date Depth Oxygen
------ ---------- ----- ------
957001 2002-01-14 1.20 12.10
967503 2002-01-28 2.00 11.60
957001 2002-01-22 25.00 7.80
965206 2002-01-28 5.40 11.70
953001 2002-01-31 23.60 10.30
941101 2002-01-22 8.67 12.00
940201 2002-01-17 5.00 11.70
965404 2002-01-30 38.80 9.40
952003 2002-01-08 23.40 6.30
957101 2002-01-15 6.00 11.60
I have been searching on Google for an answer but can't seem to find the right one. I can extract the highest value, or the top 5 highest values, using arrange(), group_by(), and slice(). However, that won't work for my data frame, because the measurement depths vary, so a fixed number of rows per group isn't comparable across IDs and Dates.
I imagine it could be something like: take the deepest value, then keep the values that are within 4 meters above it.
So I need to end up with all the deepest measurements (the last 4 meters of Depth) of Oxygen for each ID and Date.
It would look something like this:
ID Date Depth Oxygen
------ ---------- ----- ------
957001 2002-01-14 30.20 2.10
967503 2002-01-28 28.00 1.60
957001 2002-01-22 29.00 7.80
965206 2002-01-28 30.40 5.70
953001 2002-01-31 23.60 10.30
941101 2002-01-22 28.67 7.00
940201 2002-01-17 30.00 8.70
965404 2002-01-30 38.80 9.40
952003 2002-01-08 23.40 6.30
957101 2002-01-15 46.00 1.60
Just as you said: within each ID, filter to Depth of at least max(Depth) - 4. Using dplyr:
library(dplyr)

oxy %>%
  group_by(ID) %>%
  filter(Depth >= max(Depth) - 4) %>%
  ungroup()
# A tibble: 9 × 4
      ID Date       Depth Oxygen
   <dbl> <date>     <dbl>  <dbl>
 1 967503 2002-01-28  2     11.6
 2 957001 2002-01-22 25      7.8
 3 965206 2002-01-28  5.4   11.7
 4 953001 2002-01-31 23.6   10.3
 5 941101 2002-01-22  8.67  12
 6 940201 2002-01-17  5     11.7
 7 965404 2002-01-30 38.8    9.4
 8 952003 2002-01-08 23.4    6.3
 9 957101 2002-01-15  6     11.6
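The question asks for this per ID and Date; if each ID/Date combination should get its own bottom-4-meter window (i.e., one depth profile per station per sampling date), the same idea extends by grouping on both:

oxy %>%
  group_by(ID, Date) %>%
  filter(Depth >= max(Depth) - 4) %>%
  ungroup()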

Struggling to create a box plot, histogram, and qqplot in R [closed]

I am a very new R user, and I am trying to use R to create a box plot of prices at Target vs. at Walmart. I also want to create 2 histograms of the prices at each store, as well as qqplots. I keep getting various errors: hist(mydata) gives "Error in hist.default(mydata) : 'x' must be numeric" and boxplot(mydata) gives "Error in x[floor(d)] + x[ceiling(d)] : non-numeric argument to binary operator". I have correctly uploaded my csv file, and I will attach my data for clarity. I have also added a direct copy-and-paste of some of my code. I have tried hist(mydata), boxplot(mydata), and qqplot(mydata) as well, all of which returned the x-must-be-numeric error. I'm sorry if any of this is dumb; I am extremely new to R, not to mention extremely bad at it. Thank you all for your help!
#[Workspace loaded from ~/.RData]
mydata <- read.csv(file.choose(), header = T) names(mydata)
#Error: unexpected symbol in " mydata <- read.csv(file.choose(), header = T) names"
mydata <- read.csv(file.choose(), header = T)
names(mydata)
#[1] "Product" "Walmart" "Target"
mydata
Product
1 Sara lee artesano bread
2 Store brand dozen large eggs
3 Store brand 2% milk 1 gallon (128 fl oz)
4 12.4 oz cheez its
5 Ritz cracker fresh stacks 8ct, 11.8 oz
6 Sabra classic hummus 10 oz
7 Oreo chocolate sandwich cookies 14.3 oz
8 Motts applesauce 6 ct/4oz cups
9 Bananas (each)
10 Hass Avocado (each)
11 Chips ahoy original family size, 18.2 oz
12 Lays potato chips party size, 13 oz
13 Amy’s frozen mexican casserole, 9.5 oz
14 Jack’s frozen pizza original thin crust, 13.8 oz
15 Store brand sweet cream unsalted butter, 4 count, 16 oz
16 Sour cream and onion pringles, 5.5 oz
17 Philadelphia original cream cheese spread, 8 oz
18 Daisy sour cream, regular, 16 oz:
19 Kraft singles, 24 ct/16 oz:
20 Doritos nacho cheese, party size, 14.5 oz
21 Tyson Fun Chicken nuggets, 1.81 lb (29 oz), frozen
22 Kraft mac n cheese original, 7.25 oz
23 appleapple gogo squeeze, 12ct, 3.2 oz each
24 Yoplait original french vanilla yogurt, 6oz
25 Essentia bottled water, 1 liter
26 Premium oyster crackers, 9oz
27 Aunt Jemima buttermilk pancake miz, 32 oz
28 Eggo frozen homestyle waffles, 10ct/12.3 oz
29 Kellogg's Froot Loops, 10.1 oz
30 Tostitos scoops tortilla chips, 10 oz
Walmart Target
1 2.98 2.99
2 1.93 1.99
3 2.92 2.99
4 3.14 3.19
5 3.28 3.29
6 3.68 3.69
7 3.48 3.39
8 2.26 2.29
9 0.17 0.25
10 1.18 1.19
11 3.98 4.49
12 4.48 4.79
13 4.58 4.59
14 3.42 3.59
15 3.18 2.99
16 1.78 1.79
17 3.24 3.39
18 1.94 2.29
19 4.18 4.39
20 4.48 4.79
21 6.42 6.69
22 1.00 0.99
23 5.98 6.49
24 0.56 0.69
25 1.88 1.99
26 3.12 2.99
27 2.64 2.79
28 2.63 2.69
29 2.98 2.99
30 3.48 3.99
hist(mydata)
#Error in hist.default(mydata) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
df
x
1 E
2 B
3 A
4 B
5 E
6 B
7 A
8 A
9 C
10 E
11 A
12 B
13 A
14 B
15 C
16 D
17 C
18 E
19 A
20 D
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
hist(df$x)
#Error in hist.default(df$x) : 'x' must be numeric
x<-sample(LETTERS[1:5],20,replace=TRUE)
df<-data.frame(x)
barplot(table(df$x))
boxplot(mydata)
#Error in x[floor(d)] + x[ceiling(d)] :
# non-numeric argument to binary operator
qqplot("Walmart")
#Error in sort(y) : argument "y" is missing, with no default
qqplot(mydata)
#Error in `[.data.frame`(x, order(x, na.last = na.last, decreasing = decreasing)) :
# undefined columns selected
#In addition: Warning message:
#In xtfrm.data.frame(x) : cannot xtfrm data frames
There seems to be a problem with the data you uploaded, but no matter... I will just create data resembling your problem and show you how to do it with some simple code. (Some may offer alternatives like ggplot, but I think my example uses shorter code and is more intuitive.)
First, we can load ggpubr for plotting functions:
# Load ggpubr for plotting functions:
library(ggpubr)
Then we can create a new data frame, first with the prices and store names, then combining them into a data frame we can use:
# Create price values and store values:
prices.1 <- c(1, 2, 3, 4, 5, 3)
prices.2 <- c(8, 6, 4, 2, 0, 1)
store <- c("walmart", "walmart", "walmart",
           "target", "target", "target")

# Create a data frame for these values:
store.data <- data.frame(prices.1, prices.2, store)
Now we can plug our data into each of these plots in nearly the same way. The first part of the code is the plot function name, the data argument is our stored data, and the x and y values are what we use for our variables:
# Scatterplot:
ggscatter(data = store.data,
          x = "prices.1",
          y = "prices.2")

# Boxplot:
ggboxplot(data = store.data,
          x = "store",
          y = "prices.1")

# Histogram:
gghistogram(data = store.data,
            x = "prices.1")

# QQ plot:
ggqqplot(data = store.data,
         x = "prices.1")
There are simpler base R alternatives, such as plot(x, y), but I find them much harder to customize than ggpubr and ggplot.
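On that note, the errors in your transcript come from passing the whole data frame (which still contains the character Product column) to functions that expect a numeric vector. A minimal base R sketch, assuming mydata has the Product, Walmart, and Target columns shown above:

# Pass the numeric columns, not the whole data frame:
boxplot(mydata$Walmart, mydata$Target, names = c("Walmart", "Target"))
hist(mydata$Walmart)
hist(mydata$Target)
qqnorm(mydata$Walmart)  # normal QQ plot of one store's prices
qqline(mydata$Walmart)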
Of course, you can really customize the ggpubr and ggplot output to look much better, but that's up to you and what you want to learn:
ggboxplot(data = store.data,
          x = "store",
          y = "prices.1",
          fill = "store",
          title = "Prices of Merchandise by Store",
          caption = "*Data obtained from Stack Overflow",
          palette = "jco",
          legend = "none",
          xlab = "Store Name",
          ylab = "Prices of Merchandise",
          ggtheme = theme_pubclean())
Hope that's helpful. Let me know if you have questions!

R log-transformation on dataframe

I have a dataframe (df) with the values (V) of different stocks at different dates (t). I would like to get a new df with the profitability for each time period.
Profitability is: ln(V_i,t / V_i,t-1)
where:
ln is the natural logarithm,
V_i,t is the value of stock i at date t, and
V_i,t-1 is the value of the same stock at the previous date.
This is the output of df[1:3, 1:10]
date SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
1 01/08/88 1507.5 3.63 4.98 159.20 15.62 14.64 4.01 4.59 11.33
2 01/09/88 1467.4 3.69 4.97 161.55 15.69 14.40 4.06 4.87 11.05
3 01/10/88 1538.0 3.27 5.47 173.72 16.02 14.72 4.14 5.05 11.94
Specifically, instead of 1467.4 at [2, "SMI"] I want the profitability, which is ln(1467.4/1507.5), and the same for all the rest of the values in the data frame.
As I am new to R, I am stuck. I was thinking of using something like mapply and creating the transformation function myself.
Any help is highly appreciated.
This will compute the profitabilities (assuming the data is in a data frame called d). embed() lines each row up against the previous one, and the element-wise division effectively pairs the time-t values with the time-(t-1) values:
(d2 <- log(embed(as.matrix(d[, -1]), 2) / d[-dim(d)[1], -1]))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#1 -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#2 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
Then, you can add in the dates, if you want:
d2$date <- d$date[-1]
Alternatively, you could use an apply based approach:
(d2 <- apply(d[-1], 2, function(x) diff(log(x))))
# SMI Bond ABB ADDECO Credit Holcim Nestle Novartis Roche
#[1,] -0.02696052 0.01639381 -0.002010051 0.01465342 0.004471422 -0.01652930 0.01239173 0.05921391 -0.02502365
#[2,] 0.04699074 -0.12083647 0.095858776 0.07263012 0.020814375 0.02197891 0.01951281 0.03629431 0.07746368
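If you would rather keep everything in one data frame with the date column attached, a tidyverse-style sketch of the same computation (assuming the date column is named date, as above; the first row becomes NA since it has no earlier date to compare against):

library(dplyr)
d %>%
  mutate(across(-date, ~ c(NA, diff(log(.x)))))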

Adding abline to augPred plot

Apologies for what is likely to be a very basic question; I am very new to R.
I am looking to read values off my augPred plot in order to average them and provide a prediction over a time period.
> head(tthm.groupeddata)
Grouped Data: TTHM ~ Yearmon | WSZ_Code
WSZ_Code Treatment_Code Year Month TTHM CL2_FREE BrO3 Colour PH TURB Yearmon
1 2 3 1996 1 30.7 0.35 0.00030 0.75 7.4 0.055 Jan 1996
2 6 1 1996 2 24.8 0.25 0.00055 0.75 6.9 0.200 Feb 1996
3 7 4 1996 2 60.4 0.05 0.00055 0.75 7.1 0.055 Feb 1996
4 7 4 1996 2 58.1 0.15 NA 0.75 7.5 0.055 Feb 1996
5 7 4 1996 3 62.2 0.20 NA 2.00 7.6 0.055 Mar 1996
6 5 2 1996 3 40.3 0.15 0.00140 2.00 7.7 0.055 Mar 1996
This is my model:
modellme<- lme(TTHM ~ Yearmon, random = ~ 1|WSZ_Code, data=tthm.groupeddata)
and my current plot:
plot(augPred(modellme, order.groups=T),xlab="Date", ylab="TTHM concentration", main="TTHM Concentration with Time for all Water Supply Zones")
I would like a way to read off the graph, for example by placing lines at the ends of a specific time period within a specific WSZ_Code (my group) and averaging the values between them...
Of course any other way/help or guidance would be much appreciated!
Thanks in advance
I don't think we can tell whether it is "entirely incorrect", since you have not described the question and have not included any data. (The plotting question is close to being entirely incorrect, though.) I can tell you that the answer is NOT to use abline, since augPred objects are plotted with plot.augPred, which returns (and plots) a lattice object; abline is a base graphics function and does not share a coordinate system with the lattice device. Lattice objects are lists that can be modified. Your plot probably has different panels at different levels of WSZ_Code, but the location of the desired lines is entirely unclear, especially since you trail off with an ellipsis. You refer to "times", but there is no "times" variable.
There are lattice functions, such as trellis.focus and update.trellis, that let you modify lattice objects. You would first assign the plot object to a named variable, make the modifications, and then plot() it again.
help(package='lattice')
?Lattice
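For instance, a minimal sketch of the trellis.focus route; here x1 and x2 are hypothetical placeholders for the endpoints of the period you want to mark:

library(lattice)
p <- plot(augPred(modellme, order.groups = TRUE))
print(p)                       # draw the lattice object
trellis.focus("panel", 1, 1)   # focus on the first panel
panel.abline(v = c(x1, x2), lty = 2)  # hypothetical period endpoints
trellis.unfocus()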
(If this is a rush job, you might be better off making any calculations by hand and using ImageMagick to edit pdf or png output.)

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, taking the first column as the column names?
      name              at_rank  to_center  predicted
#797  "Stachy, Poland"  1        4.70       4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, reading the data in, should not be an issue if your strings containing commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, you would delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
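As the printed message says, you can name the value column explicitly with value.var to silence it:

dcast(myDF, V2 ~ V1, value.var = "V3")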
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row, and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame, but I don't know what they are right now.
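A rough sketch of that transpose idea on the four sample rows from the question (illustrative only; I haven't run it against the full dataset):

# Read the sample rows (no header):
x <- read.csv(text = 'name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70', header = FALSE)

wide <- as.data.frame(t(x), stringsAsFactors = FALSE)
colnames(wide) <- wide[1, ]  # first row holds the column names
wide <- wide[-1, ]           # drop that row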
