drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
My data (a sample; the original has 20 observations per imputation and 13 variables per group, and all values range from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the ANOVA results by 5 to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, max. and min.?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.

I assume you want to keep the identifiers .imp and .id after melting, so use this instead:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
EDIT: fixed the data mess-up
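To make the steps above concrete, here is a minimal sketch that rebuilds the eight sample rows from the question and runs the same melt-and-facet pipeline. It assumes the reshape2 and ggplot2 packages are installed. The fill by imputation round is an optional addition for illustration, not part of the answer above; it is one way to see whether the imputations differ visibly before plotting them all together.

```r
library(reshape2)
library(ggplot2)

# Sample rows copied from the question (two imputations, four observations each)
df <- data.frame(
  .imp   = rep(1:2, each = 4),
  .id    = rep(1:4, times = 2),
  FTE_RM = c(25, 4, 11, 12, 20, 4, 0, 20),
  FTE_PD = c(25, 0, 5, 3, 15, 1, 0, 0),
  OMZ_RM = c(24, 2, 3, 3, 15, 2, 0, 0),
  OMZ_PD = c(24, 6, 2, 3, 15, 3, 6, 0)
)

# Long format: one row per (.imp, .id, variable), keeping both identifiers
data_melt <- melt(df, id.vars = c(".imp", ".id"))

# Label each variable by its prefix so the plot can be split by type
data_melt$type <- ifelse(grepl("^FTE_", data_melt$variable), "FTE", "OMZ")

# One panel per type; the fill (an illustrative extra) separates imputation rounds
p <- ggplot(data_melt, aes(x = variable, y = value, fill = factor(.imp))) +
  geom_boxplot() +
  facet_wrap(~type, scales = "free_x") +
  labs(fill = "imputation")
print(p)
```

Dropping the fill aesthetic gives exactly the plot call from the answer, with all imputations combined in each box.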

Related

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept known as Tidy Data, a set of principles for data handling and storage adopted by ggplot2 and a number of other packages.
You are free to use a matrix or list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a data frame or single table using the following convention of columns:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, which spread the values of x across several columns, when you really need them in the format shown above. Converting them is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
  gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify that the two columns context and sample are left alone, as they are not part of the gathering effort. But now we are left with df$x containing character values. We can't plot that, because we want them to be in the form of a number (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
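As a side note, tidyr's documentation marks gather() as superseded by pivot_longer(), which can do the reshaping and the numeric conversion of x in one call. The following is a minimal sketch under that assumption (tidyr 1.1 or later); the four-row tablefreqs here is a hypothetical subset of the question's table, with two samples per context, to keep the example short.

```r
library(tidyr)

# Hypothetical subset of the question's tablefreqs (two samples per context)
tablefreqs <- data.frame(
  context = c(0, 0, 1, 1),
  x0 = c(20, 34, 15, 27), x1 = c(10, 40, 21, 8), x2 = c(10, 6, 14, 6),
  x3 = c(21, 10, 15, 5),  x4 = c(37, 1, 14, 29), x5 = c(2, 8, 21, 25)
)
tablefreqs$sample <- rep(paste0("sample", 1:2), 2)

# Reshape x0..x5 into two columns: x (numeric, prefix stripped) and freq
df <- pivot_longer(
  tablefreqs,
  cols = x0:x5,
  names_to = "x",
  names_prefix = "x",
  names_transform = list(x = as.numeric),
  values_to = "freq"
)

# context is a discrete label, so store it as a factor
df$context <- factor(df$context)
head(df)
```

The resulting df has the same context | sample | x | freq layout as the gather() version, so the plotting code applies unchanged.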
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(
    aes(color=context, linetype=sample)
  ) +
  geom_hline(yintercept=0, color='gray50') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to separate the two plots physically, with context = 1 drawn in its own panel and one panel on top of the other.
Here's the code:
ggplot(df, aes(x=x, y=freq_adj)) +
  geom_line(aes(group=sample), alpha=0.3) +
  facet_grid(context ~ ., scales='free_y') +
  scale_x_continuous(expand=expansion(mult=0)) +
  theme_bw()
There the use of aes(group=sample) is quite important, since I want all the lines for each sample to be the same (alpha setting and color), yet ggplot2 needs to know that the connections between the points should be based on "sample". This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data according to each facet.
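If the data lives in a list of matrices like freqs from the question rather than in a table, the same tidy layout can be reached with base R alone. This is a sketch, not the only way to do it: the two-sample freqs below is a hypothetical subset of the question's list, and it relies on as.data.frame(as.table(m)) to unroll each matrix into one row per cell.

```r
# Hypothetical two-sample subset of the question's `freqs` list
xnames <- paste0("x", 0:5)
snames <- c("sample1", "sample2")
freqs <- list(
  context0 = matrix(c(20, 10, 10, 21, 37, 2,
                      34, 40, 6, 10, 1, 8),
                    nrow = 2, byrow = TRUE, dimnames = list(snames, xnames)),
  context1 = matrix(c(15, 21, 14, 15, 14, 21,
                      27, 8, 6, 5, 29, 25),
                    nrow = 2, byrow = TRUE, dimnames = list(snames, xnames))
)

# Unroll each matrix into (sample, x, freq) rows and tag it with its context
long <- do.call(rbind, lapply(names(freqs), function(ctx) {
  d <- as.data.frame(as.table(freqs[[ctx]]), stringsAsFactors = FALSE)
  names(d) <- c("sample", "x", "freq")
  d$context <- sub("context", "", ctx)
  d
}))

# Same clean-up as above: numeric x, factor context
long$x <- as.numeric(sub("x", "", long$x))
long$context <- factor(long$context)
head(long)
```

After adding freq_adj as described earlier, long can be passed to either ggplot() call in place of df.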

boxplot with filtered values

I am new to coding and want to create boxplots based on my data.
For that, I want to filter a boxplot by specific values:
My data structure is called "Auswertungen" and is structured like this:
Participant Donation Treatment Manipulation
1 0 1 passed
2 0.4 2 passed
3 0.2 2 failed
4 0 3 failed
5 0.3 3 passed
Now I want to plot the Donations based on the Treatments, using a boxplot. I want two graphs, one with all data points and one without those who failed the manipulation.
I found something like
boxplot(Donation ~ Treatment)
with(subset(Auswertungen, Manipulation == "passed"), boxplot(Donation ~ Treatment))
but the second formula is exactly showing me the same boxplots as before, so I guess the subset is not working?
Got it, sorry.
boxplot(Donation ~ Treatment)
boxplot(Donation[Manipulation == "passed"] ~ Treatment[Manipulation == "passed"])
If your data is roughly structured like this:
set.seed(222)
Donation <- abs(rnorm(20))
Treatment <- sample(1:3, 20, replace = TRUE)
Manipulation <- sample(c("passed", "failed"), 20, replace = TRUE)
df <- data.frame(Donation, Treatment, Manipulation)
df
Donation Treatment Manipulation
1 1.487757090 3 passed
2 0.001891901 2 failed
3 1.381020790 1 failed
4 0.380213631 3 passed
5 0.184136230 1 failed
6 0.246895883 3 passed
7 1.215560910 3 failed
8 1.561405098 1 failed
9 0.427310197 2 passed
10 1.201023506 3 passed
11 1.052458495 2 passed
12 1.305063566 2 failed
13 0.692607634 3 failed
14 0.602648854 3 failed
15 0.197753074 2 failed
16 1.185874517 2 passed
17 2.005512989 3 passed
18 0.007509885 2 passed
19 0.519490356 2 failed
20 0.746295471 2 failed
and you want to have two boxplots, you can first define a two-panel layout:
par(mfrow = c(1,2))
And then fill your two boxplots into it, the first one unfiltered:
boxplot(df$Donation ~ df$Treatment)
and the second filtered on the condition that Manipulation=="passed":
boxplot(df$Donation[df$Manipulation=="passed"] ~ df$Treatment[df$Manipulation=="passed"])
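An equivalent way to write the filtered call, which avoids repeating the condition on both sides of the formula, is to use the data and subset arguments of boxplot()'s formula method. The sketch below reuses the simulated data from above; the panel titles are illustrative additions.

```r
# Same simulated data as above
set.seed(222)
df <- data.frame(
  Donation     = abs(rnorm(20)),
  Treatment    = sample(1:3, 20, replace = TRUE),
  Manipulation = sample(c("passed", "failed"), 20, replace = TRUE)
)

par(mfrow = c(1, 2))  # two panels side by side

# Left panel: every participant
all_rows <- boxplot(Donation ~ Treatment, data = df,
                    main = "All participants")

# Right panel: only rows where the manipulation check was passed;
# `subset` is evaluated inside `data`, so the column name is enough
passed <- boxplot(Donation ~ Treatment, data = df,
                  subset = Manipulation == "passed",
                  main = "Passed only")
```

boxplot() returns its summary statistics invisibly, so all_rows$n and passed$n can be used to confirm that the second panel really was drawn from fewer observations.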
The result is two boxplots side by side, the unfiltered one on the left and the filtered one on the right.

Factor Level issues after filling data frame using match

I am using two large data files, each having >2m records. The sample data frames are
x <- data.frame("ItemID" = c(1,2,1,1,3,4,2,3,4,1), "SessionID" = c(111,112,111,112,113,114,114,115,115,115), "Avg" = c(1.0,0.45,0.5,0.5,0.46,0.34,0.5,0.6,0.10,0.15),"Category" =c(0,0,0,0,0,0,0,0,0,0))
y <- data.frame("ItemID" = c(1,2,3,4,3,4,5,7),"Category" = c("1","0","S","120","S","120","512","621"))
I successfully filled x$Category using the following command
x$Category <- y$Category[match(x$ItemID,y$ItemID)]
but
x$Category
gave me
[1] 1 0 1 1 S 120 0 S 120 1
Levels: 0 1 120 512 621 S
In x there are only four distinct categories, but the Levels line shows six. Similarly, the frequency table shows 512 and 621 with a frequency of 0. I am using the same data for classification, where it shows six classes instead of four, which negatively affects the F-measure, recall, etc.
table(x$Category)
0 1 120 512 621 S
2 4 2 0 0 2
while I want
table(x$Category)
0 1 120 S
2 4 2 2
I tried merge, along with suggestions from a number of other questions, but it gives me an error message. From the question "Practical limits of R data frame" I gathered that this is a limitation of R.
I would omit the Category column from your x data.frame, since it seems to only be serving as a placeholder until values from the y data.frame are filled in. Then, you can use left_join from dplyr with ItemID as the key variable, followed by droplevels() as suggested by TingITangIBob.
This gets you close, but my table does not exactly match yours:
library(dplyr)  # provides %>%, select() and left_join()
x2 <- dplyr::select(x, -Category) %>%
  dplyr::left_join(y, by = "ItemID") %>%
  droplevels()
table(x2$Category)
0 1 120 S
2 4 4 4
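I think this has to do with the repeated ItemIDs in y rather than in x: ItemIDs 3 and 4 each appear twice in y, so the join returns every matching pair of rows and doubles those counts. De-duplicating y first, for example with unique(y), gives the counts you expect.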
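If you would rather keep the original match() approach, a minimal base R sketch looks like the following. It assumes Category in y is a factor, as the Levels line in the question implies; from R 4.0 onward data.frame() no longer converts strings to factors by default, so the factor is created explicitly here.

```r
x <- data.frame(ItemID = c(1, 2, 1, 1, 3, 4, 2, 3, 4, 1))
y <- data.frame(
  ItemID   = c(1, 2, 3, 4, 3, 4, 5, 7),
  Category = factor(c("1", "0", "S", "120", "S", "120", "512", "621"))
)

# match() returns the first matching row of y for each ItemID in x,
# so duplicated ItemIDs in y do not multiply rows of x
x$Category <- y$Category[match(x$ItemID, y$ItemID)]

# The factor still carries all six levels from y; drop the unused ones
x$Category <- droplevels(x$Category)

levels(x$Category)
table(x$Category)
```

Because match() never adds rows, x keeps its ten observations and the table only lists the four categories that actually occur.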

Plot empty groups in boxplot

I want to plot a lot of boxplots in one particular style to compare them.
But when a group is empty, that group isn't plotted.
Let's say I have a dataframe:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
Then I get the plot without group 3
(which I can't show because I did not have "10 reputation")
I found some solutions for removing empty groups via Google but my problem is the other way around.
I also found the solution via at=c(1,2,4), but as I generate the R script with Python and different groups can be empty, I would prefer that the groups aren't dropped at all.
Oh I don't think I have the time to grapple with additional packages.
Therefore I would be thankful for solutions without them.
You can get the group on the x-axis by
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)
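Since the script is generated and the empty groups vary, it may be more robust to fix the factor levels explicitly, so that a group keeps its slot even when it has no rows at all. This is a sketch assuming the generating script knows the full set of groups (1 to 4 here); it uses base R only.

```r
# Data from the question; group 3 has only missing values
df <- data.frame(
  a = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  b = c(5, 4, 6, 4, 9, 8, 9, NaN, NaN, NaN, 2, 8)
)

# Assumption: the complete list of groups is known in advance
df$a <- factor(df$a, levels = 1:4)

# Rows with missing b are dropped, but the factor keeps level 3,
# so the x-axis still reserves an (empty) position for it
res <- boxplot(b ~ a, data = df)

res$names  # group labels placed on the x-axis
res$n      # number of non-missing observations per group
```

The same call works unchanged if the rows for group 3 are removed from df entirely, because the level is defined by factor() and not by the data.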

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
So for G.test(x) to work when x is a matrix, it must be constructed as 2 columns containing numbers, with 1 row per population. If I use the apply() function, I can run the G.test on each row, provided my data set contains only two columns of numbers. I want to look only at the treated and control columns, for example, but I'm not sure how to omit columns so the G.test can ignore the headers and the other columns. I've tried using the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
The G.test spits out this, when you compare two numbers.
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question is: is there some way to add up all the df and G values that I get for each row (once I'm able to get all these numbers) for each treatment? Is there also some way to have R report the G, df and p-values in a table to be summed, rather than as above for each row?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert with as.matrix(...).
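On the two remaining points in the question: the "object 'G.test' not found" error usually means the package is installed but not attached, so library(RVAideMemoire) has to be called first. For collecting G, df and p-values per row in a table and summing them per treatment, here is a rough base R sketch. Note that g_row() is a hand-written stand-in for a goodness-of-fit G-test against equal proportions, not RVAideMemoire::G.test itself, so its numbers may differ if the package applies a correction; bio holds only the six head() rows shown in the question.

```r
# First six rows of `bio`, restricted to the columns used here
bio <- data.frame(
  Trt     = c("citol", "cital", "gerol", "mix", "cital", "cella"),
  Treated = c(1, 1, 0, 0, 0, 0),
  Control = c(3, 5, 3, 5, 5, 5)
)

# Hypothetical helper: G-test of one row of counts against equal proportions
g_row <- function(counts) {
  expected <- rep(sum(counts) / length(counts), length(counts))
  # Cells with a zero count contribute nothing to G
  terms <- ifelse(counts > 0, counts * log(counts / expected), 0)
  G  <- 2 * sum(terms)
  df <- length(counts) - 1
  c(G = G, df = df, p.value = pchisq(G, df, lower.tail = FALSE))
}

# One row of results per row of bio
per_row <- t(apply(as.matrix(bio[, c("Treated", "Control")]), 1, g_row))
results <- cbind(bio["Trt"], per_row)

# Sum G and df within each treatment, then compute a p-value for the totals
totals <- aggregate(cbind(G, df) ~ Trt, data = results, FUN = sum)
totals$p.value <- pchisq(totals$G, totals$df, lower.tail = FALSE)

results
totals
```

To use the package function instead, g_row() could be replaced by a wrapper that calls G.test() on each row and extracts the statistic, degrees of freedom and p-value from the returned test object, assuming it follows the usual structure of R test results.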
