ggplot2 - Pie/Bar Chart from Multiple Columns in Data Frame - r

I have a data frame that looks like the one below. There are three variables per observation, and I would like to create a bar graph per observation showing each of these three variables. However, ggplot2 doesn't appear to have a way to specify multiple columns from the same data frame. What is the correct way to graph this data?
Aiming for something similar to the image below from Wikimedia (with a graph for each observation). Source: https://commons.wikimedia.org/wiki/File:Article_count_(en-de-fr).png
x English German French
Sample 1 5 10 14
Sample 2 4 4 14
Sample 3 5 10 53

I don't know why there are two rows per x-value.
This makes no sense. What do you want to plot? The sum per A, B, C? The mean?
Assuming you want to take the mean, just do:
dat <- read.table(textConnection(
"x A B C
1 5 10 14
1 4 4 14
2 5 10 14
2 4 4 14
3 5 10 14
3 4 4 14
"), header=TRUE)
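# Collapse the duplicate rows per x; swap mean for any other summary function
dat <- aggregate(. ~ x, data=dat, mean)
library(reshape2)
# Reshape to long format: one row per (x, variable) pair
dat_molten <- melt(dat, id.vars="x")
library(ggplot2)
# One facet per x, one bar per variable
ggplot(dat_molten, aes(x=variable, y=value)) +
  geom_bar(stat="identity") +
  facet_grid(.~x)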
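The same reshape-then-facet idea can be applied directly to the table from the question. The following is only a sketch: it assumes the tidyr package is available as an alternative to reshape2, and the column names language and count are made up for illustration.

```r
library(tidyr)
library(ggplot2)

# Data as shown in the question
dat <- data.frame(
  x       = c("Sample 1", "Sample 2", "Sample 3"),
  English = c(5, 4, 5),
  German  = c(10, 4, 10),
  French  = c(14, 14, 53)
)

# Long format: one row per (sample, language) pair.
# "language" and "count" are illustrative names, not from the original post.
long <- pivot_longer(dat, cols = -x, names_to = "language", values_to = "count")

# One panel per sample, one bar per language
p <- ggplot(long, aes(x = language, y = count)) +
  geom_col() +
  facet_wrap(~ x)
p
```

geom_col() is shorthand for geom_bar(stat="identity"), so the plot is built the same way as in the answer above.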

Related

How do you randomly assign data into equal sized control and treatment groups in R?

set.seed(31)
resample(1:534, 90, replace = FALSE)
df.orig <- read.csv("project1data.csv")
df.groups <- filter(df.orig, participate == "y")
str(df.groups)
I randomly selected 90 house numbers from 534, recorded in an Excel sheet whether each household was willing to participate in the study, and then filtered out those who did not want to participate. How do I now randomly assign the participants to two equally sized groups (control and treatment)?
You haven't provided data or code that runs, so I'll generate some fake data to show the idea:
set.seed(31)
# Create dataset with three variables
# Participate are the ones that we wish to include in the study.
# You have those in your excel file.
fakedata <- data.frame(houseid=1:534,
                       size=rbinom(534, size=5, prob=.5),
                       participate=sample(c("y", "n"), size=534, replace=TRUE))
which produces
head(fakedata)
houseid size participate
1 1 3 y
2 2 4 n
3 3 2 n
4 4 2 y
5 5 4 y
6 6 2 n
Now we can use the tidyverse to generate a random permutation of case/control labels. First we create a vector of the correct length (using rep with length.out), and then we shuffle it using sample.
library("tidyverse")
fakedata %>%                      # Take data
  filter(participate=="y") %>%    # Keep only the willing participants
  mutate(group=sample(rep(c("Case", "Ctrl"), length.out=n())))
This gives
houseid size participate group
1 1 3 y Case
2 4 2 y Case
3 5 4 y Ctrl
4 7 4 y Case
5 8 1 y Case
6 9 4 y Ctrl
7 13 3 y Case
8 16 1 y Ctrl
.
.
.
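For readers who prefer base R, the same rep-then-sample idea can be sketched without the tidyverse. The eight house IDs below are a hypothetical stand-in for the filtered participants, and the labels "Treatment" and "Control" are illustrative.

```r
set.seed(31)

# Hypothetical stand-in for the participants left after filtering
participants <- data.frame(houseid = c(1, 4, 5, 7, 8, 9, 13, 16))

# Alternate the two labels until there is one per participant,
# then shuffle so the assignment is random
labels <- rep(c("Treatment", "Control"), length.out = nrow(participants))
participants$group <- sample(labels)

# Check the group sizes
table(participants$group)
```

With an even number of participants the two groups are exactly equal; with an odd number they differ by one.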

sum up certain variables (columns) by variable names

I want to sum up certain variables (columns in a data frame), selecting them by parts of their names.
The complication is that I have several conditions, so a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy,bc_yy,cd_yy,de_xx)
dat
  ab_yy bc_yy cd_yy de_xx
1 1 5 2 3
2 2 6 3 4
3 3 7 4 5
4 4 8 5 6
5 5 9 6 7
# sum up all variables that contain yy and meet certain extra conditions
# may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc"))))
desired result:
6 8 10 12 14
EDIT: Fixed, sorry, low on caffeine
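If you want to use dplyr, try using matches, which takes a regular expression:
library(dplyr)
dat %>%
  select(matches("yy$")) %>%        # keep columns ending in yy
  select(matches("^(ab|bc)")) %>%   # of those, keep columns starting with ab or bc
  rowSums()
[1] 6 8 10 12 14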
I don't think it's the best way, but you can also do it with grepl:
rowSums(dat[, grepl("ab.*yy|bc.*yy", colnames(dat))])
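Both conditions can also be folded into one anchored regular expression. This is a minimal base R sketch; it assumes the column names follow the prefix_suffix pattern of the example data.

```r
dat <- data.frame(ab_yy = 1:5, bc_yy = 5:9, cd_yy = 2:6, de_xx = 3:7)

# TRUE for columns whose name is exactly ab_yy or bc_yy
keep <- grepl("^(ab|bc)_yy$", names(dat))

# drop = FALSE keeps a data frame even if only one column matches
total <- rowSums(dat[, keep, drop = FALSE])
unname(total)
```

Anchoring with ^ and $ avoids accidental matches on names that merely contain "ab" or "yy" somewhere in the middle.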

R plotly: Customize x-axis values in box plot

I have a data frame with 3 variables and 260 rows. (Sample below)
HouseID<-c(1:10)
Town<-c("D","A","B","C","A","B","C","C","C","A")
Occupants<-c(5,3,2,4,5,2,3,8,1,3)
df<-data.frame(HouseID,Town,Occupants)
HouseID Town Occupants
1 D 5
2 A 3
3 B 2
4 C 4
5 A 5
6 B 2
7 C 3
8 C 8
9 C 1
10 A 3
I want to create a box plot of the distribution of Occupants, with the x-axis ordered by descending frequency of Town:
Town Freq
A 3
B 2
C 4
D 1
(Shown a sample image)
I tried sorting the data frame, but still, the box plot x-axis is displayed based on alphabetical order by default. Is there a way I could do this?
You simply have to use factor to reorder the levels of df$Town according to their counts, table(df$Town):
library(plotly)
# Levels sorted from most to least frequent town
df$Town <- factor(df$Town, levels = names(sort(table(df$Town), decreasing = TRUE)))
plot_ly(df, x=~Town, y=~Occupants, type="box")
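To see what the reordering does without drawing anything, the level order can be inspected directly. This sketch uses the sample Town vector from the question; town_f is an illustrative name.

```r
Town <- c("D", "A", "B", "C", "A", "B", "C", "C", "C", "A")

# Count each town and sort from most to least frequent
counts <- sort(table(Town), decreasing = TRUE)

# Use the sorted names as the factor levels; plotting functions
# follow the level order rather than alphabetical order
town_f <- factor(Town, levels = names(counts))
levels(town_f)
```

For the sample data the counts are C = 4, A = 3, B = 2, D = 1, so the levels come out in that order.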

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example, suppose I have a dataset with 10 observations and 2 variables and I want to group every 3 rows.
I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will be tagged with consecutive group numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when the number of rows is divided by the group size (a remainder of 0 means the last group is complete):
nrow(df) %% 3 # change the divisor to your group size
Assuming your data is df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
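An equivalent way to build the labels is to divide the row index by the group size and round up, which needs no trimming afterwards. The sketch below rebuilds the ten example rows from the question; group_size and last_group_size are illustrative names.

```r
df <- data.frame(
  speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
  dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17)
)
group_size <- 3

# Rows 1-3 get 1, rows 4-6 get 2, and so on; a short final group is handled
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Number of rows in the final group
last_group_size <- sum(df$newcol == max(df$newcol))
last_group_size < group_size   # TRUE when the last group is incomplete
```

Here the last group holds a single row, which agrees with nrow(df) %% 3 being 1.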

How to plot two data sets having different maximum X-axis values in a single plot?

For example, I have the two data sets shown below. Using position as X and count as Y, how can I plot them as differently colored lines in a single plot using ggplot2 geom_line?
dataset a:
position count
1 3
2 9
3 10
4 15
5 19
6 28
7 15
8 13
9 11
10 5
dataset b:
position count
1 4
2 8
3 16
4 17
5 19
6 10
The trick is to combine your two data frames into a single data frame. First, we create a new identifier column on each data frame:
a$dataset = "a"
b$dataset = "b"
Then we combine them
dd = rbind(a, b)
All that's left is to add geom_line and map colour to the dataset column, so that each data set gets its own line:
ggplot(dd) + geom_line(aes(position, count, colour=dataset))
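Put together as one runnable script, the steps might look like the sketch below. The data frames a and b are typed in from the tables in the question.

```r
library(ggplot2)

a <- data.frame(position = 1:10,
                count = c(3, 9, 10, 15, 19, 28, 15, 13, 11, 5))
b <- data.frame(position = 1:6,
                count = c(4, 8, 16, 17, 19, 10))

# Tag each data frame with its origin, then stack them
a$dataset <- "a"
b$dataset <- "b"
dd <- rbind(a, b)

# One line per dataset; the x-axis spans the longer series,
# and the shorter one simply stops at position 6
p <- ggplot(dd, aes(position, count, colour = dataset)) +
  geom_line()
p
```

The differing maximum x values need no special handling, because each line is drawn only over the positions present for its own dataset.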
