I've been limping my way around R for a few months now. Sorry if any of this seems basic. I've been finding all kinds of close problems and solutions, but somehow I can't seem to adapt them to my situation. I'm starting to wonder if it's something I should be trying to do at all, but I suppose it can't hurt to ask.
I have a data frame that has a single scalar variable and multiple T/F (yes/no; 1/0; 1/2) variables, like this:
scal var1 var2 var3
25 0 1 0
21 0 1 1
14 1 1 0
30 1 0 1
I know I can make a boxplot which separates the scalar variable column into categories using "by" for a single variable, like so:
boxplot(df$scal~df$var1)
I also know that I can make box plots for multiple scalar variables at once. I'd like to combine the two somehow to make a boxplot which can plot the dependent variable of each "true" subset and "false" subset of each variable next to one another. In my world, one solution should look something like "boxplot(df$scal~df$var1, df$scal~df$var2, df$scal~df$var3)", but R doesn't agree with me. It complains about not being able to coerce a data type.
I could also write a rough loop to go through each of the variables and generate all the plots separately, but I'd like to compare them side-by-side.
I've also thought of rearranging the dataset such that the "true" and "false" sets are in different columns (using subset(df$var1, df$var1==1) etc.), then making multiple boxplots as described before (though this is quite tedious).
var1t var1f var2t var2f var3t var3f
14    25    25    30    21    25
30    21    21          30    14
            14
boxplot(df2$var1t, df2$var1f, df2$var2t, df2$var2f, df2$var3t, df2$var3f)
However, the different lengths (number of rows) of the columns are giving me fits when creating the new dataset. I know that I can make a dataset in another program (saved as .csv, .xls, etc.) and then import it. The null values would remain intact, but I'd really rather not do this manually. As one might imagine, this becomes quite tedious and prone to errors on larger scales.
Help with either approach would be most welcome.
Learning how to manipulate data in R can be hard when you're starting out. I agree with @jentjr that learning ggplot2 would be helpful, and Hadley's book provides great tips for working with data in addition to covering ggplot2.
To start off, I would suggest using the reshape2 Package to melt your data:
(I created a dummy set so it would be easier for other people to follow along)
library(reshape2)
nObs = 10
df = data.frame(
scal = rnorm(nObs),
var1 = rbinom(nObs, 1, 0.5),
var2 = rbinom(nObs, 1, 0.5),
var3 = rbinom(nObs, 1, 0.5))
Then `melt` the data from wide form into long form.
df2 = melt(df, id.vars = c('scal'),
variable.name ='myVars', value.name = "zeroOne")
Now you may create your desired boxplot using base R:
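A minimal sketch of such a base R call follows. It is an illustration, not necessarily the code this answer originally showed. As assumptions for the example, it fixes a seed, rebuilds the long-form data with base R only (so it runs without reshape2) and uses a two-factor formula so that each variable gets a 0 box and a 1 box next to one another.

```r
set.seed(42)  # assumption: fixed seed so the example is reproducible
nObs <- 10
df <- data.frame(
  scal = rnorm(nObs),
  var1 = rbinom(nObs, 1, 0.5),
  var2 = rbinom(nObs, 1, 0.5),
  var3 = rbinom(nObs, 1, 0.5))

# Long form with the same column names as the melt() call above,
# built with base R instead of reshape2
vars <- c("var1", "var2", "var3")
df2 <- data.frame(
  scal    = rep(df$scal, times = length(vars)),
  myVars  = factor(rep(vars, each = nrow(df)), levels = vars),
  zeroOne = unlist(df[vars], use.names = FALSE))

# One box per zeroOne/myVars combination, drawn side by side
boxplot(scal ~ zeroOne + myVars, data = df2,
        las = 2, xlab = "", ylab = "scal")
```

Writing the formula as scal ~ myVars + zeroOne instead groups all the 0 boxes together, followed by all the 1 boxes.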
However, investing the time to learn ggplot2 would allow you to create figures such as this one:
Using code such as this:
library(ggplot2)
ggplot(data = df2, aes(x = factor(zeroOne), y = scal)) +  # factor() so 0 and 1 get separate boxes
geom_boxplot(aes(fill = myVars))
Note that ggplot2 can make much fancier plots than this (and do so more easily than base R!), and I would encourage you to browse the ggplot2 webpage to see more examples. You may also wish to experiment with swapping zeroOne and myVars because it changes the plot groupings.
Plotluck is a library based on ggplot2 that aims at automating the choice of plot type based on characteristics of 1-3 variables. Here is an example with the resulting plot:
library(plotluck)
nObs = 100
df = data.frame(
scal = rnorm(nObs),
var1 = rbinom(nObs, 1, 0.5),
var2 = rbinom(nObs, 1, 0.5),
var3 = rbinom(nObs, 1, 0.5))
plotluck.multi(df, y=scal, opts=plotluck.options(use.geom.violin=F))
This command means: plot column scal (on the y-axis) against each other column in df (on the x-axis; this includes scal itself, which results in a density plot or histogram). We specify use.geom.violin=F to enforce a box plot, since the default is a violin plot, which often conveys the shape of the distribution better. If the number of rows is very low, individual points will be plotted.
For an individual feedback sheet generated by a Shiny app in R, I would like to visually compare an individual's value in variable X to the mean of the whole group, the mean of people of the same age, and the mean of people playing the same sport. I was considering a barplot with four bars, one for each value, and since I keep reading that ggplot2 is neat for making plots, I tried to figure out how to do it in ggplot2. However, the factor on the x-axis would conceptually be the subsets of the dataset, and since the subsets are built from different variables and one individual can be in more than one subset, I can't wrap my head around how to feed that into any barplot syntax I found. I wondered if you could just make a list along the lines of c(your_value, mean(group), mean(age_subset), mean(sports_subset)), but I couldn't find out whether that was possible. Also, first making a list or even a second data frame seems kind of messy to me. Isn't there an easier and more elegant way to do something like that?
Below I start with arbitrary numbers (equivalent to the list you considered starting with). The code might give you an idea how to make a general function of the kind you're seeking.
library(ggplot2)
library(dplyr)
own_result <- 5.4
mean_age <- 5.6
mean_sport <- 4.5
data.frame(group = c("age", "sport"),
means = c(mean_age, mean_sport)) %>%
ggplot(aes(x = group, y = means)) +
geom_bar(stat = "identity") +
geom_hline(yintercept = own_result, lty = 2, col = "red")
Created on 2021-07-20 by the reprex package (v2.0.0)
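The same idea can be wrapped in a function that computes the comparison means from the data. The sketch below is only one possible shape: the data frame scores, its columns id, x, age_group and sport, and the helper comparison_means() are all made-up names for illustration, and the whole-group mean is added as a third bar. The plot is drawn only if ggplot2 is installed.

```r
# Hypothetical data: one row per person (all column names are assumptions)
scores <- data.frame(
  id        = 1:6,
  x         = c(5.4, 6.1, 4.8, 5.9, 4.2, 5.0),
  age_group = c("20-29", "20-29", "30-39", "20-29", "30-39", "30-39"),
  sport     = c("tennis", "golf", "tennis", "golf", "tennis", "golf"))

# Build the small data frame of comparison means for one person
comparison_means <- function(data, person_id) {
  me <- data[data$id == person_id, ]
  data.frame(
    group = c("all", "age", "sport"),
    means = c(mean(data$x),
              mean(data$x[data$age_group == me$age_group]),
              mean(data$x[data$sport == me$sport])))
}

cmp <- comparison_means(scores, person_id = 1)
own_result <- scores$x[scores$id == 1]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Bars for the group means, dashed line for the person's own value
  p <- ggplot(cmp, aes(x = group, y = means)) +
    geom_col() +
    geom_hline(yintercept = own_result, linetype = 2, colour = "red")
  print(p)
}
```

Because each mean is computed from its own subset, it does not matter that one person belongs to several subsets at once.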
I have a data set that is measuring emotions of respondents as they are shown different stimuli. Here is an example:
Sample Attribute Score Rank
A Delighted 180 High
A Happy 200 High
A Tired 130 Medium
B Delighted 160 Medium
B Happy 128 Low
B Tired 115 Low
I am fairly new to R, and I'm having issues actually making a bar chart that only shows sample A. This is what I tried doing:
ggplot(data =
filter(DATA, Category == "A"),
mapping = aes(x = Score, y = Attribute)) +
geom_
But R gives me this error:
Error: `data` must be a data frame, or other object coercible by `fortify()`, not an S3 object with class mts/ts.
Ideally, I am trying to get a bar chart that has the attributes listed on the vertical axis, the scores on the x-axis, and only shows sample A, with the bars color coded by Rank. Any help would be great!
Welcome to SO. Starting with the ggplot2 reference site and working through the examples there will help you get oriented to working with your data and plotting it in ggplot2. That said, here are some suggestions to get you started.
First, as indicated by the error, you need a dataframe, rather than whatever format your data is currently stored in. Based on what you gave me, this is how I got a data.frame, but you might use something else.
t <- "Sample Attribute Score Rank
A Delighted 180 High
A Happy 200 High
A Tired 130 Medium
B Delighted 160 Medium
B Happy 128 Low
B Tired 115 Low"
df <- read.table(text=t, header=T)
## this might be more relevant for your current situation
df <- as.data.frame(DATA)
Try not to use "data" as an object name, as it is already a named function.
Second, if you're learning R, it's probably wise to separate steps in your analysis to ensure you understand what each part is doing. So next do the subset to get just Sample A.
library(tidyverse)
df.sub <- df %>%
filter(Sample=="A")
Now it'll be easier to do the plotting. Your plotting code looked like it was on the right track, but you didn't complete the line.
library(ggplot2)
ggplot(df.sub, aes(x=Score, y=Attribute, fill=Rank)) +
geom_bar(stat="identity")
You'll need to specify stat="identity" to tell ggplot that you want it to recognize the values you provide as data, rather than generating counts (as for a histogram).
This should get you what you are looking for.
I picked up R recently and was trying some code for data visualization. For practice, I created a small data frame to plot the data and understand the result.
First I tried plotting a simple vector, like temperature over a week, and the barplot function worked like a charm.
Later I moved on to plotting simple tabular data: the marks of students in 2 subjects, as shown below:
stuname sub1 sub2
st1 rocket 95 70
st2 Ash 58 85
I used below to create the dataframe
plotdata=data.frame("stuname"=c("rocket","Ash"),
"sub1"=c(95,58),
"sub2"=c(70,85),
row.names = c("st1","st2"))
I am using below to plot the data
barplot(as.matrix(plotdata[ ,2:3]), xlab = "Stu", ylab = "marks", beside = TRUE)
I think the requirement is basic enough so I have not moved to ggplot yet.
This is what I'm getting:
This is what I was expecting:
I mean, this is how we would usually like to plot: we can keep adding rows of data and the plot keeps growing, and I get a single figure showing all the marks for a particular student.
Separate just the numeric values and transpose them so that they will plot in the order you want. Note that if you transpose without separating the numeric values, they may be converted to character.
barplot(height = t(plotdata[c("sub1", "sub2")]),
names.arg = plotdata$stuname,
beside = TRUE)
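To see why the numeric columns are separated first, here is a small check in base R using the question's data frame. It is only an illustration of the coercion, not part of the plotting code.

```r
plotdata <- data.frame(stuname = c("rocket", "Ash"),
                       sub1 = c(95, 58),
                       sub2 = c(70, 85),
                       row.names = c("st1", "st2"))

# Transposing everything: the character column forces a character matrix
all_cols <- t(plotdata)
is.character(all_cols)

# Transposing only the numeric columns keeps the values numeric
marks <- t(plotdata[c("sub1", "sub2")])
is.numeric(marks)
dim(marks)  # rows are subjects, columns are students
```

With beside = TRUE, barplot() draws one group of bars per matrix column, so after the transpose each student becomes one group.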
I would still recommend using ggplot, as it takes care of so many things for you:
library(reshape2)
library(ggplot2)
#Convert to long format
d = melt(plotdata, id.vars = "stuname")
ggplot(data = d,
mapping = aes(x = stuname, y = value, fill = variable)) +
geom_col(position = position_dodge())
I want to visualize many time series at once. I am new at R, and have spent about 6 hours searching the web and reading about how to tackle this relatively simple problem. My dataset has five time points arranged as rows, and 100 columns. I can easily plot any column against the time points with qplot(time, var2, geom="line"). But I want to learn how to do this for a flexible number of columns, and how to print 6 to 12 of the individual graphs on one page.
Here I learned about the multiplot function, got that to work in terms of layout.
What I am stuck on is how to get the list of variables into a FOR statement so I can have one statement to plot all the variables against the same five time points.
This is what I am playing with. It makes 9 plots, 3 columns wide, but I do not know how to get all my variables into the array for yvars.
for (i in 1:9) {
p1 = qplot(symbol,yvar, geom ="smooth", main = i)
plots[[i]] <- p1 # add each plot into plot list
}
multiplot(plotlist = plots, cols = 3)
Stupidly on my part right now it makes 9 identical plots. So how do I create the list so the above will cycle through all my columns and make those plots?
First, melt all your data using the reshape2 package:
library(reshape2)
datm <- melt(your.original.data.frame, id = "time")
Now plot it using facets:
qplot(time, value, data = datm, facets= variable ~ ., geom="point")
Let me know if this works. If you could, please upload your data, it would help tremendously.
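For the "several graphs on one page" part, facet_wrap() lays the panels out in a grid. The following is a sketch on made-up data (five time points and nine series named v1 to v9, all assumptions). The long format is built in base R so that only ggplot2 is needed, and the plot is drawn only if ggplot2 is installed.

```r
set.seed(1)
# Hypothetical wide data: 5 time points as rows, 9 series as columns
wide <- data.frame(time = 1:5, matrix(rnorm(45), nrow = 5))
names(wide)[-1] <- paste0("v", 1:9)

# Long format, equivalent in shape to melt(wide, id = "time")
series <- names(wide)[-1]
datm <- data.frame(
  time     = rep(wide$time, times = length(series)),
  variable = factor(rep(series, each = nrow(wide)), levels = series),
  value    = unlist(wide[series], use.names = FALSE))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One panel per series, three panels per row
  p <- ggplot(datm, aes(x = time, y = value)) +
    geom_line() +
    facet_wrap(~ variable, ncol = 3)
  print(p)
}
```

No loop or plot list is needed: the number of panels follows the number of levels in variable, so the same code works for any number of columns.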
I want to plot some clinical characteristics of a sample of patients with a particular disease. There are four dichotomous variables, and if any one of them is TRUE for being aggressive, then the patient is labelled as having an aggressive course.
To do just one variable at a time would mean we could use a stacked or dodged bar plot. We could even use a pie chart.
But to display all the variables and the composite on a single chart is more challenging.
I created some dummy data (just three characteristics + composite). I cannot believe how many manipulations I had to take the data through to plot what I wanted. I encountered every problem that exists. Each problem needed more manipulation. When I looked for answers (for instance on stackoverflow) I could find nothing, probably because I do not know what the buzz words are to describe what I was trying to do.
Questions
1) What are the buzz words for what I am trying to do?
2) Does it really need to be this hard, or is there a more direct route in ggplot2 that would let me go straight to the chart from the raw data file containing as many rows as there are human subjects?
created some simulated data
require(data.table)
aggr.freq <- sample(c(TRUE, FALSE), size=100, replace=TRUE, prob=c(0.1, 0.9) )
aggr.count <- sample(c(TRUE, FALSE), size=100, replace=TRUE, prob=c(0.2, 0.8) )
aggr.spread <- sample(c(TRUE, FALSE), size=100, replace=TRUE, prob=c(0.4, 0.6) )
human.subjects <- data.table(aggr.freq, aggr.count, aggr.spread)
human.subjects[,aggr.course.composite:=aggr.freq|aggr.count|aggr.spread]
tally the trues
aggr.true <- human.subjects [,list(aggr.freq = sum(aggr.freq), aggr.count = sum(aggr.count), aggr.spread = sum(aggr.spread), aggr.course.composite= sum(aggr.course.composite))]
that tally is in the wrong orientation for plotting
aggr.true.vertical <- data.table(t(aggr.true))
aggr.true.vertical[,clinical.characteristic:=factor(dimnames(t(aggr.true))[[1]], ordered=TRUE, levels= c("aggr.freq", "aggr.count", "aggr.spread", "aggr.course.composite"))]#have to specify levels otherwise ggplot2 will plot the variables in alphabetical order
setnames(x=aggr.true.vertical, old = "V1", new = "aggressive")
aggr.true.vertical[,indolent:=human.subjects[,.N]-aggressive]#we had the tally of trues, now we need to tally the falses
ggplot(aggr.true.vertical, aes(x=clinical.characteristic, y=aggressive)) + geom_bar(stat="identity") # alas, this graph only shows the count of those with an aggressive characteristic and does not give the reader a feel for the proportion.
reshape for the second time
require(reshape2)
long <- melt(aggr.true.vertical, variable.name="aggressiveness",value.name="count")
ggplot(long, aes(x=clinical.characteristic, y=count, fill=aggressiveness)) + geom_bar(stat="identity")
Thanks.
I think I can see what happened in how you were thinking about the problem, but I think you "took a wrong turn" early on in the process. I'm not sure I can help you with the keywords to search on. Anyway, all you need is one melt and then you can plot. After your data generation:
human.subjects$id<-1:nrow(human.subjects) # Create an id variable (which you probably have)
melted.humans<-melt(human.subjects,id='id')
ggplot(melted.humans, aes(x=variable,fill=value)) + geom_bar()
Maybe you would prefer to flip the order of true and false, but you get the idea.
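The question noted that a plain count does not give the reader a feel for the proportion. One option, sketched here under assumptions, is position = "fill", which scales every bar to the same height. The example simulates the data with a fixed seed and builds the long form with base R instead of data.table and melt(), so the names subjects and melted are illustrative only. The plot is drawn only if ggplot2 is installed.

```r
set.seed(1)
n <- 100
subjects <- data.frame(
  aggr.freq   = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.1, 0.9)),
  aggr.count  = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.2, 0.8)),
  aggr.spread = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.4, 0.6)))
subjects$aggr.course.composite <-
  subjects$aggr.freq | subjects$aggr.count | subjects$aggr.spread

# Long form: one row per subject and characteristic
vars <- names(subjects)
melted <- data.frame(
  variable = factor(rep(vars, each = n), levels = vars),
  value    = unlist(subjects, use.names = FALSE))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Each bar is scaled to 1, so the fill shows the TRUE/FALSE share
  p <- ggplot(melted, aes(x = variable, fill = value)) +
    geom_bar(position = "fill") +
    ylab("proportion")
  print(p)
}
```

Setting the factor levels explicitly keeps the characteristics in the intended order rather than alphabetical order.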
Also, you may be interested in some simplified code for the other parts of what you were doing, which was counting the trues and falses. (In my solution, I just let ggplot do it.)
# Count the trues:
sapply(human.subjects,sum)
# Collect all the trues and falses into a single matrix,
# by running table on each column.
sapply(human.subjects,table)
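A self-contained sketch of the same counting idea on made-up logical columns follows (the data frame flags and its column names are assumptions). It applies the functions to logical columns only, so a column such as id would need to be left out first, and it fixes the factor levels so that the result stays a matrix even if a column contains only one of the two values.

```r
set.seed(1)
# Hypothetical logical data; all_false never contains TRUE
flags <- data.frame(
  a         = sample(c(TRUE, FALSE), 100, replace = TRUE),
  b         = sample(c(TRUE, FALSE), 100, replace = TRUE, prob = c(0.2, 0.8)),
  all_false = rep(FALSE, 100))

# Count of TRUE values per column
true_counts <- colSums(flags)

# FALSE/TRUE counts per column; fixing the factor levels keeps the
# result a 2-row matrix even when a column holds only one value
counts <- sapply(flags, function(x) table(factor(x, levels = c(FALSE, TRUE))))
counts
```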