R: summarise data frame with repeating rows into boxplots

R: summarise data frame with repeating rows into boxplots - r

I am an R neophyte, with a data frame of database function runtimes with the following data:
> head(data2)
dbfunc runtime
1 fn_slot03_byperson 38.083
2 fn_slot03_byperson 32.396
3 fn_slot03_byperson 41.246
4 fn_slot03_byperson 92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853
The data has data for 127 discrete functions comprising of some 1940170 rows.
I would like to:
Summarise the data to only include database functions with a mean runtime of over 100 ms
Produce boxplots of the 25 slowest database functions showing the distribution of runtimes, sorted by slowest first.
I'm particularly stumped by the summary step.
Note : I've also asked this questions at stats.stackexchange.com.

Here's one approach using ggplot and plyr. The steps you outlined could be combined to be slightly more efficient, but for learning purposes I'll show you the steps as you asked them.
#Load ggplot and make some fake data
library(ggplot2)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
, runtime = runif(1000, max = 300))
#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))
#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)
#Reorder the level for the dbfunc variable in terms of the mean runtime. Note that relevel
#accepts a function like mean so if the subset step above isn't necessary, then we can simply
#use that instead.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)
#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])
#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)
#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) +
geom_boxplot()
Like I said, I feel a few of those steps could be combined and made more efficient, but wanted to show you the steps as you outlined them.

The summary step is easy:
attach(data2)
func_mean = tapply(runtime, dbfunc, mean)
ad question 1:
func_mean[func_mean > 100]
ad question 2:
slowest25 = head(sort(func_mean, decreasing = TRUE), n=25)
sl25_data = merge(data.frame(dbfunc = names(slowest25), data2, sort = F)
plot(sl25_data$runtime ~ sl25_data$dbfunc)
Hope this helps. Yet the boxplots are not sorted in the plot.

I'm posting this as the 'answer' whereas Tomas and Chases' answers are in fact more complete. In Chase's case I couldn't get ggplot to operate, and time was short. In Tomas' case I got stuck at the sl25_data step.
We ended up using the following, which works with one remaining problem:
# load data frame
dbruntimes <- read.csv("db_runtimes.csv",sep=',',header=FALSE)
# calc means
meanruns <- aggregate(dbruntimes["runtime"],dbruntimes["dbfunc"],mean)
# filter
topmeanruns <- meanruns[meanruns$runtime>100,]
# order by means
meanruns <- meanruns[rev(order(meanruns$runtime)),]
# get top 25 results
drawfuncs <- meanruns[1:25,"dbfunc"]
# subset for plot
forboxplot <- subset(dbruntimes,dbfunc %in% levels(drawfuncs)[0:25])
# plot
boxplot(forboxplot$runtime~forboxplot$dbfunc)
This gives us the result we are looking for, but all the functions are still shown on the plot xaxis, rather than just the top 25.

Related

R summaries when dates in main df fall within ranges from small df

Similar to do.call/lapply approach here, and data.table approach here, but both have the setup of:
MainDF with data and startdate/enddate ranges
SubDF with a vector of single dates
Where the users are looking for summaries of all the MainDF ranges that overlap each SubDF date. I have
MainDF with data and a vector of single dates
SubDF with startdate/enddate ranges
And am looking to append summaries, to SubDF, for multiple rows of MainDF data which fall within each SubDF range. Example:
library(lubridate)
MainDF <- data.frame(Dates = seq.Date(from = as.Date("2020-02-12"),
by = "days",
length.out = 10),
DataA = 1:10)
SubDF <- data.frame(DateFrom = as.Date(c("2020-02-13", "2020-02-16", "2020-02-19")),
DateTo = as.Date(c("2020-02-14", "2020-02-17", "2020-02-21")))
SubDF$interval <- interval(SubDF$DateFrom, SubDF$DateTo)
Trying the data.table approach from the second link I figure it should be something like:
MainDF[SubDF, on = .(Dates >= DateFrom, Dates <= DateTo), allow = TRUE][
, .(SummaryStat = max(DataA)), by = .(Dates)]
But it errors with unused arguments for on. On my actual data I got a result by using (the equivalent of) max(MainDF$DataA), but it was 3 repeats of the second value (In my actual data the final row won't run as it doesn't have a value for DateTo). I suspect using MainDF$ means I've subverting the grouping.
I suspect I'm close but I'm really struggling to get my head around the data.table mindset for complex use cases. The summary stats I'm looking to do are (for example data):
Mean & Max of DataA
length(which(DataA > 3))
difftime(last(Dates), first(Dates), units = "mins")
Dates[which.max(DataA)]
I added the interval line above as data.table's %between% help suggests one might be able to use a Dates %between% interval format but it doesn't mention intervals/difftimes specifically in the text nor examples and my attempts are already failing elsewhere so I'm loathe to concentrate on improving my running while I can't walk!
I've focused on the data.table approach since it's used for a similar problem, but I've been wondering whether dplyr's group_by/group_by_if could be used instead? group_by_if's .predicate seems to be constrained to tests on the columns (e.g. are they factors) rather than relating to data in the columns' rows, but I could be wrong.
Thanks in advance for any help!

Summing rows grouped by another parameter in R

I am trying to calculate some rates for time on condition parameters, and have written the following, which successfully calculates the desired rates. But, I'm sure there must be a more succinct way to do this using the data.table methods. Any suggestions?
Background on what I'm trying to achieve with the code.
For each run number there are 10 record numbers. Each record number refers to a value bin (the full range of values for each parameter is split into 10 equal sized bins). The values are counts of time spent in each bin. I am trying to sum the counts for P1 over each run number (calling this opHours for the run number). I then want to divide each of the bin counts by the opHours to show the proportion of each run that is spent in each bin.
library(data.table)
#### Create dummy parameter values
P1 <- rnorm(2000,400, 50);
Date <- seq(from=as.Date("2010/1/1"), by = "day", length.out = length(P1));
RECORD_NUMBER <- rep(1:10, 200);
RUN_NUMBER <- rep(1:200, each=10, len = 2000);
#### Combine the dummy parameters into a dataframe
data <- data.frame(Date, RECORD_NUMBER, RUN_NUMBER, P1);
#### Calculating operating hours for each run
setDT(data);
running_hours_table <- data[ , .(opHours = sum(P1)), by = .(RUN_NUMBER)];
#### Set the join keys for the data and running_hours tables
setkey(data, RUN_NUMBER);
setkey(running_hours_table, RUN_NUMBER);
#### Combine tables row-wise
data <- data[running_hours_table];
data$P1.countRate <- (data$P1 / data$opHours)
Is it possible to generate the opHours column in the data table without first creating a separate table and then joining them back together?

data2[ , opHours := sum(P1), by = .(RUN_NUMBER)]
You should probably read some materials about data.table:
wiki Getting-started
or
data.table.cheat.sheet

Plot multiple ordered conditional box plots using columns from a data frame

I am trying to plot several ordered (ie., from high to low median) conditional box plots from a single data frame. The general sequence is as follows:
Reverse sort group medians for variable1 according to variable.group ;
Create ordered conditional box plot using variable.group and sorted medians;
Repeat (loop?) process for remaining variables in data frame.
I want to loop through about 70 variables using the above process but am stuck moving from tapply to aggregate, accessing each variable in the dataframe, and coding the looping sequence. Apologies in advance for the lack of elegance in my R code below:
bpdf = data.frame(group=c("A","A","A","B","B","B","C","C","C"),
x=c(1,1,2,2,3,3,3,4,4),
y=c(7,5,2,9,7,6,3,1,2),
z=c(4,5,2,9,8,9,7,6,7))
sorted.medians = rev(sort(with(bpdf,tapply(bpdf$x,bpdf$group,median))))
boxplot(bpdf$x~factor(bpdf$group,levels=names(sorted.medians)))

I think, you need just to put your 2 lines within lapply:
lapply(bpdf[,-1],function(x){
## decreasing better than rev here
y <- sort(tapply(x,bpdf$group,median),decreasing=TRUE)
boxplot(x~factor(bpdf$group,levels=names(y)))
})
EDIT to plot variable name , you use main argument of the boxplot and you loop over the colanmes of bpdf:
lapply(colnames(bpdf[,-1]),function(i){
## decreasing better than rev here
x <- bpdf[,i]
title <- paste0('title',i) ## you can change it here
y <- sort(tapply(x,bpdf$group,median),decreasing=TRUE)
boxplot(x~factor(bpdf$group,levels=names(y)),main=title)
})

If I understand the question correctly, I think following should do what you want:
Load in a few packages and create some data:
library(plyr)
library(reshape2)
dd = data.frame(group=c("A","B","C", "D"),
x1=runif(40),x2=runif(40),x3=runif(40),x4=runif(40))
Now calculate the median conditional on the variable and group
dd_m = melt(dd, "group")
meds = ddply(dd_m, c("variable", "group"), summarise, m = median(value))
Order the data frame by variable and median:
sorted_meds = meds[with(meds, order(variable, -m)), ]
Look through the variables, and sort each data frame in turn:
for(var in unique(sorted_meds$variable)){
grp_order = sorted_meds[sorted_meds$variable==var, ]$group
dd_tmp = dd_m[dd_m$variable==var,]
dd_tmp$group = factor(dd_tmp$group, levels = grp_order)
boxplot(dd_tmp$value ~ dd_tmp$group)
}

How do I tell R to remove the outlier from a correlation calculation?

How do I tell R to remove an outlier when calculating correlation? I identified a potential outlier from a scatter plot, and am trying to compare correlation with and without this value. This is for an intro stats course; I am just playing with this data to start understanding correlation and outliers.
My data looks like this:
"Australia" 35.2 31794.13
"Austria" 29.1 33699.6
"Canada" 32.6 33375.5
"CzechRepublic" 25.4 20538.5
"Denmark" 24.7 33972.62
...
and so on, for 26 lines of data. I am trying to find the correlation of the first and second numbers.
I did read this question, however, I am only trying to remove a single point, not a percentage of points. Is there a command in R to do this?

You can't do that with the basic cor() function but you can
use a correlation function from one of the robust statistics packages, eg robCov() from package robust
use a winsorize() function, eg from robustHD, to treat your data
Here is a quick example for the 2nd approach:
R> set.seed(42)
R> x <- rnorm(100)
R> y <- rnorm(100)
R> cor(x,y) # correlation of two unrelated series: almost zero
[1] 0.0312798
The we "contaminate" one point each with a big outlier:
R> x[50] <- y[50] <- 10
R> cor(x,y) # bigger correlation due to one bad data point
[1] 0.534996
So let's winsorize:
R> x <- robustHD::winsorize(x)
R> y <- robustHD::winsorize(y)
R> cor(x,y)
[1] 0.106519
R>
and we're back down to a less correlated measure.

If you apply the same conditional expression for both vectors you could exclude that "point".
cor( DF[2][ DF[2] > 100 ], # items in 2nd column excluded based on their values
DF[3][ DF[2] > 100 ] ) # items in 3rd col excluded based on the 2nd col values

In the following, I worked from the presumption (that I read between your lines) that you have identified that single outlier visually (ie., from a graph). From your limited data set it's probably easy to identify that point based on its value. If you have more data points, you could use something like this.
tmp <- qqnorm(bi$bias.index)
qqline(bi$bias.index)
(X <- identify(tmp, , labels=rownames(bi)))
qqnorm(bi$bias.index[-X])
qqline(bi$bias.index[-X])
Note that I just copied my own code because I couldn't work from sample code from you. Also check ?identify before.

It makes sense to put all your data on a data frame, so it's easier to handle.
I always like to keep track of outliers by using an extra column (in this case, B) in my data frame.
df <- data.frame(A=c(1,2,3,4,5), B=c(T,T,T,F,T))
And then filter out data I don't want before getting into the good analytical stuff.
myFilter <- with(df, B==T)
df[myFilter, ]
This way, you don't lose track of the outliers, and you are able to manage them as you see fit.
EDIT:
Improving upon my answer above, you could also use conditionals to define the outliers.
df <- data.frame(A=c(1,2,15,1,2))
df$B<- with(df, A > 2)
subset(df, B == F)

You are getting some great and informative answers here, but they seem to be answers to more complex questions. Correct me if I'm wrong, but it sounds like you just want to remove a single observation by hand. Specifying the negative of its index will remove it.
Assuming your dataframe is A and columns are V1 and V2.
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[-1],a$V2[-1])
or you can remove several indexes. Let's say 1, 5 and 20
ToRemove <- c(-1,-5,-20)
WithAus <- cor(A$V1,A$V2)
WithoutAus <- cor(A$V1[ToRemove],a$V2[ToRemove])

Finding the mean of a variable within an imputed data set for population quintiles

I have a data set "base_data" which has missing values. I have therefore used the package 'Amelia' to impute the missing values into an object "a.output".
I have been able to find the mean for some variables within the imputed results using the following code:
q.out<-NULL
se.out<-NULL
for(i in 1:m) {
dclus <- svydesign(id=~site, data=a.output$base_data[[i]])
q.out <- rbind(q.out, coef(svymean(~hh_expenditure, dclus)))
se.out <- rbind(se.out, SE(svymean(~hh_expenditure, dclus)))}
I have combined the results using:
svymean.combine <- mi.meld(q = q.out, se = se.out)
Which gives me the mean and standard error for household expenditure (hh_expenditure) across the population.
However I have a variable which splits the population into wealth quintiles (wealth_quin).
As such, I am now wanting to find the average, and standard error, of the household expenditure per wealth_quin (a variable which is either 1,2,3,4,or 5).
I initially tried subsetting the imputed data, but this came up with many errors.
Is there a way to do this without having to split up the data into the 5 wealth quintiles before imputing the data?
Cheers,
Timothy
EDIT: HERE IS A WORKABLE EXAMPLE
require(Amelia)
require(survey)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000)
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
so no we have a data frame to work with, with missing values for hh_inc which I want to impute.
first step, set the number of imputations
m<-5
now run the imputation:
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
a.output is now holds the data from the 5 imputations.
What I now want to do is find the average (and standard error) hh_inc for site 1 and site 2 separately using the imputed values from amelia.
How is that possible to do? I know it is possible to do if I just ignore the NA's. But this might introduce bias, hence why I imputed the values in the first place.
Cheers,
Timothy
EDIT:
I have placed a bounty to this. If no one knows the exact way to do it, then the results from the individual imputed data sets can be combined using Rubins formula (http://sites.stat.psu.edu/~jls/mifaq.html#minf)
As such, I will award to bounty to someone who can transform the 5 separate imputed datasets from the Amelia object into 5 separate, complete, data frames.

require(Amelia)
require(survey)
require(data.table)
require(plotrix)
a<-as.data.frame(c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16))
b<-as.data.frame(c(1,2,2,1,2,1,1,2,1,2,2,1,1,2,1,2))
c<-as.data.frame(c(2,7,8,5,4,4,3,8,7,9,10,1,3,3,2,8))
d<-as.data.frame(c(3,9,7,4,5,5,2,10,8,10,12,2,4,4,3,7))
e<-as.data.frame(c(2500,8000,NA,4500,4500,NA,2500,NA,7400,9648,1112,1532,3487,3544,NA,7000))
impute<-cbind(a,b,c,d,e)
names(impute) <- c("X","site","var2","var3", "hh_inc")
summary(impute)
m <- 5
a.output <- amelia(x = impute, m=m, autopri=0.5,cs="X",
idvars=c("site","var2"),
logs=c("hh_inc","var3"))
stats.out <- NULL
for(i in 1:m){
df2 <- data.table(a.output$imputations[[i]])
df3 <- data.frame(dataset=i,df2[,list(std.error(hh_inc),mean(hh_inc)), by="site"])
stats.out <- rbind(stats.out, df3)
}
colnames(stats.out) <- c("dataset","site","stdError","mean")
stats.out

I'm not sure I understand your question or the structure of your data (specifically the importance of whether the data is imputed or not) but here's how I've done some summary stats by group.
require(data.table)
require(plotrix)
# create some data
df1 <- data.frame(id=seq(1,50,1), wealth = runif(50)*1000)
df1$cutter <- cut(df1$wealth, 5, labels=FALSE)
head(df1)
# put the data into a data.table to speed things up
df2 <- as.data.table(df1)
head(df2)
grp1StdErr <- df2[,std.error(wealth), by="cutter"]
grp1Mean <- df2[,mean(wealth), by="cutter"]
Hope this helps.
Or, in one grouping step :
df2[,list(std.error(wealth),mean(wealth)), by=cut(wealth,5,labels=FALSE)]

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: summarise data frame with repeating rows into boxplots - r

Related

R summaries when dates in main df fall within ranges from small df

Summing rows grouped by another parameter in R

Plot multiple ordered conditional box plots using columns from a data frame

How do I tell R to remove the outlier from a correlation calculation?

Finding the mean of a variable within an imputed data set for population quintiles

Categories

Resources