I have a data frame like this
structure(list(cli_exp = c(1L, 1L, 2L, 1L, 1L, 0L, 2L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 2L, 2L, 0L, 1L, 0L,
1L, 1L, 2L, 0L, 1L), vcs_exp = c(0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 2L, 1L,
1L, 0L, 0L, 0L, 2L, 1L, 0L), web_exp = c(2L, 2L, 2L, 1L, 0L,
0L, 1L, 2L, 0L, 0L, 3L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 2L, 1L, 1L,
1L, 1L, 0L, 0L, 1L, 1L, 2L, 0L, 0L)), .Names = c("cli_exp", "vcs_exp",
"web_exp"), row.names = c(NA, 30L), class = "data.frame")
I want to use ggplot2 to show the relationship between these three variables, and I tried a simple point plot:
ggplot(data = data) +
  geom_point(mapping = aes(x = web_exp, y = vcs_exp, color = cli_exp))
But many of the data points overlap, so a plain point plot does not display them well. Is there a better way?
I would use ggpairs from the GGally package:
tmp_df <- structure(list(cli_exp = c(1L, 1L, 2L, 1L, 1L, 0L, 2L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 2L, 2L, 0L, 1L, 0L,
1L, 1L, 2L, 0L, 1L), vcs_exp = c(0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 2L, 1L,
1L, 0L, 0L, 0L, 2L, 1L, 0L), web_exp = c(2L, 2L, 2L, 1L, 0L,
0L, 1L, 2L, 0L, 0L, 3L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 2L, 1L, 1L,
1L, 1L, 0L, 0L, 1L, 1L, 2L, 0L, 0L)), .Names = c("cli_exp", "vcs_exp",
"web_exp"), row.names = c(NA, 30L), class = "data.frame")
library(GGally)
ggpairs(tmp_df,
        upper = list(continuous = wrap("cor", size = 10)),
        lower = list(continuous = "smooth"))
Edit: you can also use pairs from base R:
pairs(tmp_df)
Or use pairs.panels from the psych package:
library(psych)
pairs.panels(tmp_df,
             method = "pearson",
             density = TRUE,
             ellipses = TRUE)
As you mentioned, the points overlap, so some points aren't visible when using geom_point (df below is the data frame from your question).
ggplot(data = df, aes(x = web_exp, y = vcs_exp, color = cli_exp)) +
  geom_point()
This can be solved by adding a small amount of jitter. Making the points slightly transparent also makes any overlaps easier to see.
ggplot(data = df, aes(x = web_exp, y = vcs_exp, color = cli_exp)) +
  geom_jitter(width = 0.05, height = 0.05, alpha = 0.8)
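As a complement to jittering, it can help to count how many rows share each (web_exp, vcs_exp) pair, so the amount of overlap is explicit. The following is a minimal sketch using the sample data from the question; the names df, overlap, and n are illustrative, and the plotting step assumes ggplot2 is installed (geom_count sizes each point by the number of observations at that location).

```r
# Sample data from the question
df <- data.frame(
  cli_exp = c(1L, 1L, 2L, 1L, 1L, 0L, 2L, 0L, 1L, 1L,
              1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
              2L, 2L, 0L, 1L, 0L, 1L, 1L, 2L, 0L, 1L),
  vcs_exp = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
              1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
              1L, 2L, 1L, 1L, 0L, 0L, 0L, 2L, 1L, 0L),
  web_exp = c(2L, 2L, 2L, 1L, 0L, 0L, 1L, 2L, 0L, 0L,
              3L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 2L, 1L,
              1L, 1L, 1L, 0L, 0L, 1L, 1L, 2L, 0L, 0L)
)

# Count the rows that fall on each (web_exp, vcs_exp) position
overlap <- aggregate(n ~ web_exp + vcs_exp,
                     data = transform(df, n = 1L),
                     FUN = sum)
print(overlap)

# Optional plot: point size encodes the number of overlapping observations
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = web_exp, y = vcs_exp)) +
    geom_count()
  print(p)
}
```

Because geom_count draws one point per position, the third variable (cli_exp) is not shown here; faceting by cli_exp would be one way to bring it back.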
I noticed this just as I was about to put the histograms in my thesis: the frequencies shown in the graph did not match the actual counts. To double-check, I built the same histogram in Excel, which confirmed that the frequencies displayed by ggplot2 were incorrect. My code included a call to xlim. Out of curiosity I removed it to see the result, and ggplot2 then produced the correct histogram!
This is the code that I'm using:
ggplot(data, aes(x = variable)) +
  geom_histogram(binwidth = 1) +
  xlim(0, 40)
The base R call that produces the correct histogram is this:
hist(data$variable, breaks = seq(0, 40, 1), ylim = c(0,700))
Can anybody please help me here? I've spent a lot of time trying to get this to work but to no avail. Any help would be greatly appreciated.
# example data
variable <- c(1L, 1L, 1L, 3L, 4L, 1L, 2L, 1L, 2L, 0L, 1L, 2L, 1L, 1L, 0L,
3L, 1L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 1L, 0L, 5L, 0L, 0L, 2L, 1L,
1L, 2L, 1L, 3L, 2L, 5L, 4L, 3L, 2L, 3L, 0L, 1L, 1L, 1L, 1L, 2L,
0L, 2L, 1L, 3L, 1L, 4L, 2L, 6L, 2L, 1L, 6L, 5L, 5L, 1L, 1L, 0L,
2L, 1L, 1L, 0L, 0L, 1L, 2L, 1L, 1L, 5L, 2L, 1L, 0L, 3L, 2L, 2L,
4L, 6L, 3L, 2L, 1L, 6L, 1L, 4L, 2L, 1L, 2L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 2L, 3L, 1L, 3L, 2L, 2L, 1L, 1L, 2L, 13L, 3L, 2L, 5L,
5L, 1L, 3L, 0L, 2L, 1L, 2L, 1L, 0L, 10L, 2L, 0L, 1L, 2L, 2L,
0L, 1L, 4L, 0L, 2L, 0L, 0L, 1L, 0L, 1L, 13L, 15L, 2L, 4L, 4L,
12L, 7L, 4L, 4L, 0L, 0L, 1L, 0L, 1L, 2L, 6L, 3L, 0L, 2L, 2L,
0L, 1L, 5L, 0L, 3L, 3L, 4L, 1L, 1L, 3L, 20L, 2L, 1L, 0L, 4L,
4L, 5L, 6L, 9L, 2L, 4L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 0L, 1L, 1L,
1L, 2L, 0L, 3L, 2L, 1L, 2L, 1L, 2L, 4L, 18L, 16L, 3L, 3L, 1L,
3L, 1L, 7L, 13L, 2L, 3L, 2L, 4L, 2L, 2L, 1L, 0L, 0L, 0L, 0L,
1L, 2L, 1L, 1L, 1L, 1L, 3L, 2L, 2L, 2L, 4L, 3L, 4L, 4L, 5L, 2L,
1L, 1L, 6L, 4L, 0L, 3L, 3L, 1L, 4L, 0L, 0L, 2L, 2L, 1L, 0L, 1L,
1L, 0L, 0L, 1L, 2L, 4L, 1L, 2L, 1L, 0L, 0L, 5L, 2L, 10L, 4L,
1L, 2L, 3L, 2L, 2L, 1L, 2L, 0L, 4L, 2L, 1L, 0L, 0L, 3L, 1L, 3L,
1L, 1L, 0L, 0L, 0L, 1L, 4L, 2L, 2L, 3L, 0L, 4L, 1L, 34L, 20L,
1L, 3L, 3L, 1L, 7L, 5L, 1L, 3L, 5L, 2L, 1L, 1L, 3L, 0L, 1L, 4L,
1L, 2L, 2L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 5L, 4L, 5L,
9L, 9L, 3L, 5L, 1L, 2L, 1L, 2L, 1L, 0L, 3L, 2L, 1L, 0L, 2L, 1L,
2L, 0L, 1L, 2L, 1L, 1L, 1L, 2L, 0L, 1L, 5L, 9L, 8L, 0L, 5L, 2L,
3L, 1L, 0L, 0L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L,
2L, 2L, 1L, 2L, 0L, 1L, 1L, 1L, 7L, 0L, 1L, 1L, 1L, 1L, 2L, 2L,
3L, 2L, 0L, 1L, 5L, 6L, 3L, 6L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 1L, 0L, 1L, 1L, 2L, 0L, 1L, 0L, 0L, 1L,
3L, 2L, 3L, 3L, 3L, 4L, 6L, 7L, 6L, 3L, 1L, 0L, 1L, 0L, 0L, 2L,
1L, 1L, 1L, 2L, 1L, 3L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 2L,
0L, 0L, 1L, 2L, 0L, 3L, 3L, 0L, 3L, 1L, 1L, 2L, 3L, 0L, 0L, 0L,
0L, 1L, 1L, 3L, 2L, 0L, 4L, 3L, 0L, 0L, 1L, 1L, 1L, 2L, 1L, 1L,
0L, 1L, 2L, 2L, 1L, 2L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 3L, 0L, 1L,
1L, 1L, 0L, 0L, 3L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 2L, 3L, 1L, 0L,
1L, 4L, 2L, 1L, 0L, 2L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 4L, 1L, 2L,
0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 2L, 1L, 1L,
1L, 1L, 3L, 1L, 1L, 0L, 3L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 0L, 0L, 5L, 8L, 6L, 4L, 2L, 1L, 1L, 0L, 1L, 0L, 2L, 1L,
1L, 1L, 1L, 0L, 1L, 0L, 2L, 0L, 1L, 0L, 3L, 3L, 1L, 0L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 0L, 1L, 2L, 3L, 3L, 2L, 3L, 2L, 1L,
1L, 0L, 0L, 1L, 0L, 0L, 2L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 2L, 0L,
2L, 0L, 1L, 2L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 4L, 0L, 1L, 0L, 0L,
2L, 1L, 0L, 4L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L,
1L, 2L, 1L, 0L, 3L, 5L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 2L, 1L, 0L, 0L, 3L,
2L, 0L, 1L, 0L, 2L, 2L, 3L, 2L, 1L, 0L, 0L, 2L, 0L, 2L, 1L, 1L,
0L, 0L, 0L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 4L,
0L, 1L, 0L, 0L, 2L, 2L, 0L, 2L, 0L, 4L, 3L, 3L, 4L, 1L, 2L, 1L,
1L, 1L, 1L, 2L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L,
2L, 1L, 1L, 0L, 1L, 3L, 3L, 2L, 1L, 1L, 1L, 4L, 2L, 2L, 3L, 2L,
1L, 3L, 1L, 4L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 4L, 3L, 3L, 1L, 3L, 3L, 3L,
2L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 5L, 5L, 2L, 4L, 3L, 7L, 5L, 3L,
0L, 1L, 2L, 2L, 1L, 3L, 2L, 0L, 0L, 0L, 1L, 0L, 2L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 2L, 7L, 11L, 5L, 8L, 15L, 6L, 6L,
0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 4L, 1L, 0L, 1L, 0L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 0L, 2L, 14L, 19L, 8L, 9L, 3L, 4L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 2L, 1L, 1L, 2L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
2L, 1L, 1L, 7L, 7L, 3L, 4L, 6L, 2L, 1L, 2L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 2L, 0L, 0L, 1L, 1L,
0L, 2L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 5L, 2L, 2L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 0L,
2L, 0L, 0L, 1L, 1L, 0L, 1L, 2L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 2L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 2L, 1L,
2L, 1L, 0L, 1L, 0L, 2L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L,
2L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 0L, 1L, 0L, 1L, 1L, 11L, 1L, 0L, 0L, 1L, 1L, 3L, 4L, 0L,
0L, 0L, 1L, 6L)
data <- data.frame(variable)
OK, I see. The difference is in how a bin is defined, i.e. whether you use [0,1) or [0,1] for the first bin. Try
ggplot(data, aes(x = variable)) +
  geom_histogram(breaks = seq(0, 40, by = 1), right = TRUE)
or, if you don't use explicit breaks, you also have to specify the origin, either with xlim as you did, or with
ggplot(data, aes(x = variable)) +
  geom_histogram(binwidth = 1, right = TRUE, origin = 0)
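To see how much the bin definition matters for integer data, here is a small sketch with base R's hist on a made-up vector (x and breaks are illustrative, not the question's data). It compares right-closed and left-closed bins over the same breaks. The last part assumes a newer ggplot2 release, where the same choice is expressed with closed and boundary, and right and origin are deprecated.

```r
# Small illustrative vector (not the question's data)
x <- c(0, 0, 1, 1, 1, 2, 3, 3)
breaks <- 0:4

# Right-closed bins, hist()'s default: [0,1], (1,2], (2,3], (3,4]
right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

# Left-closed bins: [0,1), [1,2), [2,3), [3,4]
left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

# Same data, same breaks, different counts per bin
print(rbind(right_closed, left_closed))

# Assumption: newer ggplot2 spells this closed = "right", boundary = 0
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(x = x), aes(x = x)) +
    geom_histogram(binwidth = 1, boundary = 0, closed = "right")
}
```

With integer values every observation sits exactly on a bin edge, so switching the closed side moves whole groups of observations from one bar to the next.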
I am trying to calculate the proportion of correct responses for each participant as a function of three factors (group, sound and language). My data frame looks like this:
participant group sound lang resp
advf03 adv a in 1
advf03 adv a sp 0
advf03 adv a in 1
advf03 adv a sp 0
advf03 adv a in 0
advf03 adv a sp 1
advf03 adv a sp 0
advf03 adv a in 1
advf03 adv a in 0
advf03 adv a in 1
begf03 beg a in 1
begf03 beg a in 1
begf03 beg a sp 0
"Group" has 3 levels: adv, int, and beg. "Sound" has 3 levels: a, e, i. "Lang" has 2 levels: in, sp. A "1" implies a correct response and a "0" implies an incorrect response. I would like to have a proportion (i.e. percent correct) of the "1"'s for each participant as a new column in a new data frame. An example of the type of information I would like to have: Participant advf03 got 53% correct for "a" in "sp".
Here are 50 observations from my data:
structure(list(sound = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("a",
"e", "i"), class = "factor"), resp = c(0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L), participant = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("2advf03", "2advf05", "2advm04", "2advm06", "2begf01",
"2begf02", "2begf04", "2begf05", "2begm03", "2advf01", "2intf01",
"2intf03", "2intf04", "2intf06", "2advm05"), class = "factor"),
group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("adv",
"beg", "int"), class = "factor"), lang = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("in", "sp"), class = "factor")), .Names = c("sound",
"resp", "participant", "group", "lang"), row.names = c(10L, 31L,
36L, 43L, 47L, 49L, 52L, 59L, 61L, 65L, 66L, 68L, 71L, 79L, 97L,
99L, 106L, 125L, 133L, 138L, 147L, 149L, 162L, 165L, 174L, 175L,
33L, 37L, 112L, 136L, 154L, 186L, 11L, 50L, 89L, 92L, 104L, 105L,
123L, 126L, 129L, 143L, 153L, 173L, 177L, 187L, 188L, 191L, 7L,
12L), class = "data.frame")
This is what I have so far:
# get counts of subsets of factors
df <- as.data.frame(table(df))
# new column that gives the proportion of responses
df$prop <- df$Freq / 32
But this does not seem to give me the correct proportions. I know that I need to reduce the data so that I don't have so many observations (i.e. one value for each sound for each language for each participant), but I don't know the correct steps to do that.
If I understand your question correctly, you would like to know the proportion of 1s by participant, sound, and language.
Because the proportion of 1s in a vector containing only 0s and 1s is just its mean, this should work (sound is included in the formula so that each sound gets its own row):
aggregate(resp ~ participant + group + sound + lang, data = df, FUN = mean)
The output of that with your 50 observations, which all have sound "a", is:
  participant group sound lang      resp
1     2advf03   adv     a   in 0.1875000
2     2advf03   adv     a   sp 0.1111111
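As a quick check of the idea that the mean of a 0/1 column is the proportion correct, here is a minimal sketch built from the 13 rows shown in the question's table. The column names prop_correct and pct_correct are illustrative, and the grouping includes sound so that each participant, sound, and language combination gets its own row.

```r
# Toy data: the 13 rows shown in the question's table
df <- data.frame(
  participant = c(rep("advf03", 10), rep("begf03", 3)),
  group       = c(rep("adv", 10), rep("beg", 3)),
  sound       = "a",
  lang        = c("in", "sp", "in", "sp", "in", "sp", "sp",
                  "in", "in", "in", "in", "in", "sp"),
  resp        = c(1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0)
)

# Mean of a 0/1 vector = proportion of 1s within each combination
prop <- aggregate(resp ~ participant + group + sound + lang,
                  data = df, FUN = mean)

# Rename the result and add a percentage column for readability
names(prop)[names(prop) == "resp"] <- "prop_correct"
prop$pct_correct <- round(100 * prop$prop_correct, 1)
print(prop)
```

Only combinations that actually occur in the data appear in the result, so no fixed divisor (such as 32) has to be assumed.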