ggplot stat-ecdf cumulative distribution custom maximum - r

I have a df in the following format:
df <- read.table(text="
DAYS STATUS ID
2 Complete A
10 Complete A
15 Complete B
NA Incomplete A
NA Incomplete B
20 Complete C", header=TRUE)
I have plotted the cumulative distribution using:
ggplot(df,aes(x=DAYS, color=ID)) +
stat_ecdf(geom = "step")
Since this is only plotting the completed rows I would like to include the incomplete rows that have an NA for days. By doing this the cumulative distributions for each ID would not reach 100% because some of the rows do not have a value for days.
ID PERCENT_COMPLETE
A .95
B .55
C .5
For example, in my full dataset ID A has .95 of its rows with status complete, so its distribution line would reach a maximum of .95, while B would reach a maximum of .55.

None of the plotting functions appears to handle NA values the way you want, so we can pre-calculate the values ourselves using dplyr:
library(ggplot2)
library(dplyr)
df <- read.table(text="
DAYS STATUS ID
2 Complete A
10 Complete A
15 Complete B
NA Incomplete A
NA Incomplete B
20 Complete C", header=TRUE)
incomplete_cdf <- function(x, gmin, gmax) {
cdf <- rle(sort(na.omit(x)))
obsx <- cdf$values
obsy <- cumsum(cdf$lengths)/length(x)
data.frame(x = c(gmin, obsx, gmax) , y=c(0, obsy, tail(obsy, 1)))
}
df %>%
mutate(gmin =min(DAYS, na.rm=TRUE), gmax=max(DAYS, na.rm=TRUE)) %>%
group_by(ID) %>%
summarize(incomplete_cdf(DAYS, first(gmin), first(gmax)))%>%
ggplot(aes(x=x, y=y, color=ID)) +
geom_step()
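As a quick sanity check on the approach above, here is a minimal base R sketch (the names ceiling_by_id, steps_for and top_by_id are made up for illustration). It assumes the same toy data and recomputes the step heights the way incomplete_cdf() does, to confirm that each curve levels off at the fraction of completed rows for that ID.

```r
# Same toy data as in the question; NA marks an incomplete row.
df <- read.table(text = "
DAYS STATUS ID
2 Complete A
10 Complete A
15 Complete B
NA Incomplete A
NA Incomplete B
20 Complete C", header = TRUE)

# Fraction of non-missing DAYS per ID: the height each curve should level off at.
ceiling_by_id <- tapply(df$DAYS, df$ID, function(x) mean(!is.na(x)))

# Step heights computed as in incomplete_cdf(): cumulative counts of the
# observed values divided by ALL rows of the group, NA rows included.
steps_for <- function(x) {
  runs <- rle(sort(x))  # sort() drops NA by default
  cumsum(runs$lengths) / length(x)
}

# Last (highest) step per ID.
top_by_id <- sapply(split(df$DAYS, df$ID), function(x) tail(steps_for(x), 1))
```

With this toy data both vectors come out as 2/3 for A, 1/2 for B and 1 for C, which is why A and B stop short of 100% in the plot.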

Related

Selecting 10 names based on 10 highest numbers of other column

I want to select the top 10 voted restaurants, and plot them together.
So I want to create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me all the columns of the dataset for the top 10 highest votes; however, I want just the number of votes and the restaurant names.
My question is: how do I select only the top 10 highest votes and their restaurant names, and plot them together?
expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions creating another variable:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
tibble(
restaurant = c("res1", "res2", "res3", "res4"),
votes = c(2, 5, 8, 6)
)
df %>%
arrange(-votes) %>%
head(3) %>%
ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
geom_col() +
coord_flip()
The top_n command also works in this case but is designed for grouped data.
It's also possible, though less readable, to use base functions:
#toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
#reorder by, and retain, the top 10
newdata <- data.frame()
for (i in 1:10) {
newdata <- rbind(newdata,d[which(d$value == sort(d$value, decreasing = T)[1:10][i]),])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
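The loop above can probably be condensed into a single vectorised step. This is a minimal sketch, assuming a data frame with columns Names and Votes (the restaurants data here is simulated, not the asker's data): order() sorts the rows by votes and head() keeps the first ten.

```r
set.seed(1)
# Hypothetical stand-in for the restaurant data: 26 names, distinct vote counts.
restaurants <- data.frame(Names = LETTERS,
                          Votes = sample(40:400, length(LETTERS)))

# Negating Votes makes order() sort from highest to lowest;
# head() then keeps ten rows of just the two columns of interest.
top10 <- head(restaurants[order(-restaurants$Votes), c("Names", "Votes")], 10)
top10
```

Passing setNames(top10$Votes, top10$Names) to barplot() should then draw the ten bars with the restaurant names as labels.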

loess regression on each group with dplyr::group_by()

Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also it still isn't applied as a new column with dplyr::mutate
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my first attempt and most closely resembles what I want to do. The problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purrr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing
This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
tidyr::nest(data = -CpG) %>%
dplyr::mutate(
# Perform loess calculation on each CpG group
m = purrr::map(data, loess,
formula = Meth ~ AVGMOrder, span = .5),
# Retrieve the fitted values from each model
fitted = purrr::map(m, `[[`, "fitted")
)
# Apply fitted y's as a new column
results <- models %>%
dplyr::select(-m) %>%
tidyr::unnest(c(data, fitted))
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))
You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
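fems <- fems %>%
group_by(CpG) %>%
arrange(CpG, AVGMOrder) %>%
# No data = . here: in a pipe, . is the whole data frame, not the current group.
# Without it, the formula picks up each group's columns inside mutate().
mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5),
data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))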
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.
Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
df %>%
group_by(profile) %>%
arrange(profile, day)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
# Declare that the loess columns exist, but are blank
x$loess_60 <- NA
x$loess_365 <- NA
# Identify all unique profile IDs
all_ids <- unique(x$profile)
# Iterate through the unique profile IDs, determine the length of each vector (which should correspond to 365 days)
# and isolate the rows that belong to the profile ID.
for (i in all_ids) {
len_entries <- length(which(x$profile == i))
queried_rows <- x[which(x$profile == i), ]
# Run the loess fit and write the result to the according column
fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
x[which(x$profile == i), "loess_60"] <- fit_60
x[which(x$profile == i), "loess_365"] <- fit_365
}
# Return the initial data frame
return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
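The per-group loop above amounts to a split-apply-combine, which base R can also express without an explicit loop. The following is a minimal sketch on simulated data (the group labels, the Meth values and the helper fit_group are made up for illustration; only the column names follow the question): split() cuts the data frame by CpG, loess is fitted to each piece, and unsplit() puts the fitted values back in the original row order.

```r
set.seed(1)
# Simulated stand-in for the methylation data: 4 groups, 34 ordered points each.
test <- data.frame(
  CpG       = rep(c("cgA", "cgB", "cgC", "cgD"), times = 34),
  AVGMOrder = rep(1:34, each = 4)
)
test$Meth <- 0.5 + 0.01 * test$AVGMOrder + rnorm(nrow(test), sd = 0.05)

# Fit loess on ONE group's rows and return its fitted values.
fit_group <- function(g) {
  as.numeric(predict(loess(Meth ~ AVGMOrder, data = g, span = 0.85)))
}

# split() -> one data frame per CpG; lapply() -> one fit per group;
# unsplit() -> fitted values returned to the rows they came from.
fits <- lapply(split(test, test$CpG), fit_group)
test$Loess <- as.numeric(unsplit(fits, test$CpG))
```

Because each fit only ever sees its own group's rows, the smoothed values cannot leak across CpG sites, and the new column lines up with Meth row by row.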

r - ggplot multiple line graphs for each unique instance over time

The Problem
I am plotting a bunch of line plots on top of one another, but I only want to color 10 of them once they are all plotted, to visualize how my 'targets' traveled over time while still being able to see the mass of others behind them. An example would be 100 line graphs over time, of which I color 5 or 10 specifically so I can discuss them with respect to the trend of the other 90 grayscale ones.
The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. I want MANY lines behind those 3, all grayscale, with those 3 being my highlighted cities in the foreground, per se.
My original data was in the following form:
# The unique identifier is a City-State combo,
# there can be the same cities in 1 state or many.
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.
r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")
df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population | ... 11 Variables ... | Cluster #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ... #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.
cols <- 1:35 # 35 names for the 35 year columns
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- cols
row.names(df) <- rows
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Year1 Year2 .......... Year 35 #
# UniqueCityState1 VAL NA .......... VAL #
# UniqueCityState2 VAL VAL .......... NA #
# . #
# . #
# . #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Prior Attempts
I have tried using melt to get the data into a format that is possible for ggplot to accept and plot each of these cities over time, but nothing has seemed to work. Also, I have tried creating my own functions to loop through each of my unique city-state combinations to stack ggplots which had some fair amount of research available on the topic, but nothing yet still. I am not sure how I could find each of these unique citystate pairs and plot them over time taking their cluster value or any numeric value for that matter. Or maybe what I am seeking is not possible, I am not sure.
Thoughts?
EDIT: More information about data structure
> head(df)
city state year population stat1 stat2 stat3 stat4 stat5
1 BESSEMER 1 1 31509 0.3808436 0 0.63473928 2.8563268 9.5528262
2 BIRMINGHAM 1 1 282081 0.3119671 0 0.97489728 6.0266377 9.1321287
3 MOUNTAIN BROOK 1 1 18221 0.0000000 0 0.05488173 0.2744086 0.4390538
4 FAIRFIELD 1 1 12978 0.1541069 0 0.46232085 3.0050855 9.8628448
5 GARDENDALE 1 1 7828 0.2554931 0 0.00000000 0.7664793 1.2774655
6 LEEDS 1 1 7865 0.2542912 0 0.12714558 1.5257470 13.3502861
stat6 stat6 stat7 stat8 stat9 cluster
1 26.976419 53.54026 5.712654 0 0.2856327 9
2 35.670605 65.49183 11.982374 0 0.4963113 9
3 6.311399 21.40387 1.426925 0 0.1097635 3
4 21.266759 68.11527 11.480968 0 1.0787487 9
5 6.770567 23.24987 3.960143 0 0.0000000 3
6 24.157661 39.79657 4.450095 0 1.5257470 15
agg
1 99.93970
2 130.08675
3 30.02031
4 115.42611
5 36.28002
6 85.18754
And ultimately I need it in the form of unique cities as row.names, 1:35 as col.names and the value inside each cell to be agg if that year was present or NA if it wasn't. Again I am sure this is possible, I just can't attain a good solution to it and my current way is unstable.
If I understand your question correctly, you want to plot all the lines in one color, and then plot a few lines with several different colors. You may use ggplot2, calling geom_line twice on two data frames. The first time plot all city data without mapping lines to color. The second time plot just the subset of your target city and mapping lines to color. You will need to re-organize your original data frame and subset the data frame for the target city. In the following code I used tidyr and dplyr to process the data frame.
### Set.seed to improve reproducibility
set.seed(123)
### Load package
library(tidyr)
library(dplyr)
library(ggplot2)
### Prepare example data frame
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- 1:35
df <- df %>% mutate(City = 1:5)
### Reorganize the data for plotting
df2 <- df %>%
gather(Year, Value, -City) %>%
mutate(Year = as.numeric(Year))
The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers; these are the names of every column in df except City. gather also creates a column called Value, which stores the numeric values from every column in df except City. The City column itself is not reshaped, so -City tells gather "do not transform the data from the City column".
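To make the reshaping concrete, here is a small base R sketch of what that gather call does, written out by hand on a made-up two-city, two-year table (the values are invented for illustration). It is not how tidyr implements it, only an equivalent result for this case.

```r
# Tiny wide table: one column per year, plus a City column at the end.
wide <- data.frame(c(0.1, 0.2), c(0.3, 0.4), City = 1:2)
names(wide)[1:2] <- c("1", "2")

# Every column except City is a year column.
year_cols <- setdiff(names(wide), "City")

# Stack the year columns on top of each other:
#   City  repeats once per year column,
#   Year  holds the former column name, converted to a number,
#   Value holds the cell contents, column by column.
long <- data.frame(
  City  = rep(wide$City, times = length(year_cols)),
  Year  = rep(as.numeric(year_cols), each = nrow(wide)),
  Value = unlist(wide[year_cols], use.names = FALSE)
)
long
```

The result has one row per city-year pair, which is the shape ggplot needs to draw one line per city.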
### Subset df2, select the city of interest
df3 <- df2 %>%
# In this example, assuming that City 2 and City 3 are of interest
filter(City %in% c(2, 3))
### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
# Plot all city data here in gray lines
geom_line(size = 1, color = "gray") +
# Plot target city data with colors
geom_line(data = df3,
aes(x = Year, y = Value, group = City, color = factor(City)),
size = 2)
The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png
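The question also asks for a table with unique cities as rows, years 1:35 as columns, and agg in each cell (NA when a year is missing). One possible way, sketched here in base R on made-up records (the numbers and the city_state key are illustrative assumptions), is tapply() with the year turned into a factor that carries all 35 levels, so that absent years still get a column.

```r
# Hypothetical long-format records: one row per city-state-year that was observed.
records <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS", "LEEDS", "LEEDS"),
  state = c(1, 1, 1, 1, 1),
  year  = c(1, 3, 1, 2, 35),
  agg   = c(99.9, 101.2, 85.2, 80.4, 77.0)
)

# A city name can repeat across states, so build a combined identifier.
records$city_state <- paste(records$city, records$state, sep = "_")

# Declaring all 35 levels up front keeps a column for years nobody reported.
year_f <- factor(records$year, levels = 1:35)

# One cell per (city_state, year); combinations with no record stay NA.
wide <- tapply(records$agg, list(records$city_state, year_f),
               function(v) v[1])
dim(wide)
```

The long records table is still the one to hand to ggplot; the wide matrix is mainly useful for inspection or export.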

Plotting Bacteria according to Food Groups & Abundance in R

I have a dataframe that includes four bacteria types: R, B, P, Bi - this is in variable.x
value.y is their abundance and variable.y is various groups they are in.
I would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory". I have made 4 separate files, each structured as follows:
Sample Bacteria Abundance Category Level
30841102 R 0.005293192 1 Low
30841102 P 0.000002570 1 Low
30841102 B 0.005813275 1 Low
30841102 Bi 0.000000000 1 Low
49812105 R 0.003298709 1 Low
49812105 P 0.000000855 1 Low
49812105 B 0.131147541 1 Low
49812105 Bi 0.000350086 1 Low
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
I have tried this code:
library(dplyr)
genus_veg %>% group_by(Genus, Abundance) %>% summarise(Abundance = sum(Abundance)) %>%
ggplot(aes(x = Level, y= Abundance, fill = Genus)) + geom_bar(stat="identity")
But get this error:
Error: cannot modify grouping variable
Any suggestions?
TL;DR Combine individual plots with cowplot
In another interpretation of the super unclear question, this time from:
Plotting Bacteria according to Food Groups & Abundance in R
and
would like to plot them according to their food categories: "FiberCategory", "FruitCategory", "VegetablesCategory" & "WholegrainCategory." I have made 4 separate files
You might be asking for:
You want a bar chart
You want 4 plots, one for each of the food categories
x-axis = bacteria type
y-axis = abundance of bacteria
Input
Let's say you have a data frame for each food category. (Again, I'm using dummy data)
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
## If we have 4 data frames
filter_food <- function(dummydata, foodcat){
dummydata %>%
filter(food == foodcat) %>%
select(-food)
}
dd_fiber <- filter_food(dummydata, "FiberCategory")
dd_fruit <- filter_food(dummydata, "FruitCategory")
dd_veg <- filter_food(dummydata, "VegetablesCategory")
dd_grain <- filter_food(dummydata, "WholegrainCategory")
Where one data frame looks something like
#> dd_grain
# bacteria abundance
#1 R 0.02106203
#2 B 0.10073499
#3 P 0.06624655
#4 Bi 0.00775332
Plot
You can create separate plots. (Here, I'm using a function to generate my plots)
plot_food <- function(dd, title=""){
dd %>%
ggplot(aes(x = bacteria, y = abundance)) +
geom_bar(stat = "identity") +
ggtitle(title)
}
plt_fiber <- plot_food(dd_fiber, "fiber")
plt_fruit <- plot_food(dd_fruit, "fruit")
plt_veg <- plot_food(dd_veg, "veg")
plt_grain <- plot_food(dd_grain, "grain")
And then combine them using cowplot
cowplot::plot_grid(plt_fiber, plt_fruit, plt_veg, plt_grain)
TL;DR Plotting by facets
How you posed the question is super unclear. So I have interpreted your question from
So, I would like a bar plot of how much of each bacteria is in each category. So it should be 4 plots, for each bacteria, with value on the y-axis and food category on the x-axis.
as:
You want a bar chart
You want 4 plots, one for each of the bacteria types: R, B, P, Bi
x-axis = food category
y-axis = abundance of bacteria
Input
Regarding the input data, it was unclear, e.g. you did not describe what "Sample", "Level", or "Category" is. Ideally, you would keep all the food categories in one data frame, e.g.
library(tidyr)
library(dplyr)
library(ggplot2)
## The categories you have defined
bacteria <- c("R", "B", "P", "Bi")
food <- c("FiberCategory", "FruitCategory", "VegetablesCategory", "WholegrainCategory")
## Create dummy data for plotting
set.seed(1)
num_rows <- length(bacteria)
num_cols <- length(food)
dummydata <-
matrix(data = abs(rnorm(num_rows*num_cols, mean=0.01, sd=0.05)),
nrow=num_rows, ncol=num_cols)
rownames(dummydata) <- bacteria
colnames(dummydata) <- food
dummydata <-
dummydata %>%
as.data.frame() %>%
tibble::rownames_to_column("bacteria") %>%
gather(food, abundance, -bacteria)
of which the output looks like:
#> dummydata
# bacteria food abundance
#1 R FiberCategory 0.021322691
#2 B FiberCategory 0.019182166
#3 P FiberCategory 0.031781431
#4 Bi FiberCategory 0.089764040
#5 R FruitCategory 0.026475389
#6 B FruitCategory 0.031023419
#7 P FruitCategory 0.034371453
#8 Bi FruitCategory 0.046916235
#9 R VegetablesCategory 0.038789068
#10 B VegetablesCategory 0.005269419
#11 P VegetablesCategory 0.085589058
#12 Bi VegetablesCategory 0.029492162
#13 R WholegrainCategory 0.021062029
#14 B WholegrainCategory 0.100734994
#15 P WholegrainCategory 0.066246546
#16 Bi WholegrainCategory 0.007753320
Plot
Once you have the data formatted as above, you can simply do:
dummydata %>%
ggplot(aes(x = food,
y = abundance,
group = bacteria)) +
geom_bar(stat="identity") +
## Split into 4 plots
## Note: can also use 'facet_grid' to do this
facet_wrap(~bacteria) +
theme(
## rotate the x-axis label
axis.text.x = element_text(angle=90, hjust=1, vjust=.5)
)
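As for the error in the question, "cannot modify grouping variable" most likely comes from grouping by Abundance and then overwriting Abundance inside summarise. Grouping by the bacteria and the level instead avoids that. Here is a minimal base R sketch using the sample rows from the question; it assumes the bacteria column is called Bacteria as in the printed table (the question's code calls it Genus), and the name totals is made up.

```r
# The eight sample rows shown in the question.
genus_veg <- read.table(text = "
Sample Bacteria Abundance Category Level
30841102 R 0.005293192 1 Low
30841102 P 0.000002570 1 Low
30841102 B 0.005813275 1 Low
30841102 Bi 0.000000000 1 Low
49812105 R 0.003298709 1 Low
49812105 P 0.000000855 1 Low
49812105 B 0.131147541 1 Low
49812105 Bi 0.000350086 1 Low", header = TRUE)

# Sum Abundance within each Bacteria x Level combination.
# Abundance is the quantity being summarised, so it is NOT a grouping variable.
totals <- aggregate(Abundance ~ Bacteria + Level, data = genus_veg, FUN = sum)
totals
```

The totals table keeps the Level column, so it can go straight into ggplot with x = Level, y = Abundance and fill = Bacteria.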

barchart and standard errors

I have the following table in R (inspired by a cran help datasheet) :
> dfx <- data.frame(
+ group = c(rep('A', 108), rep('B', 115), rep('C', 106)),
+ sex = sample(c("M", "F","U"), size = 329, replace = TRUE),
+ age = runif(n = 329, min = 18, max = 54)
+ )
> head(dfx)
group sex age
1 A U 47.00788
2 A M 32.40236
3 A M 21.95732
4 A F 19.82798
5 A F 30.70890
6 A M 30.00830
I am interested in plotting the percentages of males (M), females (F) and "unknown"(U) in each group using barcharts, including error bars.
To make this graph, I plan to use the panel.ci/prepanel.ci commands.
I can easily build a proportion table for each group using the prop.table command :
> with(dfx, prop.table(table(group,sex), margin=1)*100)
sex
group F M U
A 29.62963 28.70370 41.66667
B 35.65217 35.65217 28.69565
C 37.73585 33.01887 29.24528
But now, I would like to build a similar table with error bars, and use these two tables to make a barchart.
If possible, I would like to use the ddply command, which I use for similar purposes (except that it was not percentages but means).
Try something like this:
library(plyr)
library(ggplot2)
summary(dfx) # for example, each variable
dfx$interaction <- interaction(dfx$group, dfx$sex)
ddply(dfx, .(interaction), summary) #group by interaction, summary on dfx
ggplot(dfx, aes(x = sex, y = age, fill = group)) + geom_boxplot()
You can get a good on-line tutorial on building graphs here.
edit
I'm pretty sure you would need more than 1 value for the proportion in order to have any error. I only see 1 value for the proportion for each unique combination of variables group and sex.
This is the most I can help you with (below), but I'd be interested to see you post an answer to your own question when you find a suitable solution.
dfx$interaction <- interaction(dfx$group, dfx$sex)
dfx.summary <- ddply(dfx, .(group, sex), summarise, total = length(group))
dfx.summary$prop <- with(dfx.summary, total/sum(total))
dfx.summary
# group sex prop
# 1 A F 0.06382979
# 2 A M 0.12158055
# 3 A U 0.14285714
# 4 B F 0.12462006
# 5 B M 0.11854103
# 6 B U 0.10638298
# 7 C F 0.10334347
# 8 C M 0.12158055
# 9 C U 0.09726444
ggplot(dfx.summary, aes(sex, total, color = group)) + geom_point(size = 5)
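One common way to attach an error to a single proportion, which is not something the answers above propose, is the binomial standard error sqrt(p * (1 - p) / n), with n the number of rows in the group. The sketch below is only that, a sketch: it assumes each group can be treated as a simple random sample, it rebuilds dfx with a fixed seed, and the names counts, props and se are made up for illustration.

```r
set.seed(1)
# Same structure as the question's data (age is not needed here).
dfx <- data.frame(
  group = c(rep("A", 108), rep("B", 115), rep("C", 106)),
  sex   = sample(c("M", "F", "U"), size = 329, replace = TRUE)
)

# Counts per group and sex, and the group sizes.
counts <- table(group = dfx$group, sex = dfx$sex)
n <- as.vector(rowSums(counts))

# Within-group proportions (each row sums to 1).
p <- prop.table(counts, margin = 1)

# Binomial standard error; n is recycled down the columns,
# so every cell in row i is divided by that group's size n[i].
se <- sqrt(p * (1 - p) / n)

# Long table: one row per group/sex with proportion and standard error.
props <- as.data.frame(p, responseName = "prop")
props$se <- as.vector(se)
props
```

Multiplying prop and se by 100 gives percentages, and prop - se and prop + se can serve as the lower and upper ends of the error bars.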
