How hist() handles zeroes - r

first-time poster, relative R newbie (I've taken a few Coursera classes over the past two years). I've found some behavior with the hist() function I don't really understand; apologies if this has already been answered, but I couldn't find anything.
I read an article saying there were 1,280 police killings in the U.S. last year and only 14 days when they didn't happen. I thought 14 sounded low, so I decided to see what random numbers would look like. (Should have just done ppois, but bear with me.) I don't have enough reputation to do images, but if you run this code it sure looks like like 40-some zeros.
set.seed(0)
hist(rpois(365, lambda = 1280/365))
But when I run
sum(rpois(365, lambda = 1280/365) == 0)
I always get somewhere around 10, which is the correct answer (and in line with the article). And if I plot it in ggplot instead, I get an extra bar that goes up to around 10, properly placed between 0 and 1, with the other bars shifted to the right:
set.seed(0)
randomnos <- as.data.frame(rpois(365, lambda = 1280/365))
colnames(randomnos) <- "Numbers"
ggplot(randomnos, aes(Numbers)) + geom_histogram(binwidth = 1)
Apparently the hist() function (A) isn't plotting the zeroes and (B) is putting the 1's between 0 and 1 on the X axis. ggplot behaves differently. What gives?

Related

Linear regression with R: How do I get labels on data point in qq plot, scale location plot, Residuals vs Leverage etc

I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.
I am doing a linear regression with R.
In short the hypothesis is:
The more activity a member state shows, the more success it will have in negotiations.
I played around a lot with the data, transformed it etc.
What I have done so far:
# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")
# As I like to switch variables and test different models, I send the relevant ones to objects x and y.
# So it is easier for me to change it in the future.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high
# I put the label for each observation in a factor lab
lab = linData$MS_short
# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)
# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)
So far so good. Now I want to check for model quality. For visual insepection I found out I can use the command
plot(linModel)
This produces 4 plots in a row:
Residuals vs Fitted
Normal Q-Q
Scale Location
Residuals vs Leverage
As you can see in every picture R marks problematic observations by a number. It would be very convenient if R could just use the column "MS_short" from te dataset and add the label to the marked observations. I am sure this is possible... but how?
I work with R for 2 months now. I found some stuff here and via googe but nothing helped me to solve the problem. I have no one I can ask. This is my 1st post here an stackoverflow.
Thank you in advance
Rainer
With the help of G. Grothendieck I solved the problem.
After entering the R-help of plot, more specific the help for plot and linear regression (plot.lm) with the command
?plot.lm
I read the box with the "arguments and usage" part and identified the labels.id argument AND the id.n argument.
id.n is "number of points to be labelled in each plot, starting with the most extreme."
I needed that. I was interested in the identification of this extreme points. R already marked the 3 most extreme points in all graphics (see initial post) but used the observations numbers and not any useful labels. Any other labelling would mess up the graphics. So, we remember: In my case I want the 3 most extreme values to be labelled.
Now let's add this to the command:
I started the same as above, with a plot of my already computed linear model -> plot(linModel). After that I added "id.n =" and set the value to "3". That looked like that:
plot(linModel, id.n = 3,
So far so good, now R knows what to label BUT still not what should be used as label.
For this we have to add the labels.id to the command.
labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
I assumed that one column in my dataset (NOT the linear model!) has the property of a vector and so I added a comma and then "labels.id =" to the command and typed in the name of my dataset and then the column, so in my case: "linData$MS_short" where linData is the dataset and MS_short the column with the 2 letter string for each member state. The final command looked like this:
plot(linModel, id.n = 3, labels.id = linData$MS_short)
And then it worked (see here). End of story.
Hope this can help some other newbies. Greetings.

R: how to make multiple plots from one CSV, grouping by a column

I'd like to put multiple plots onto a single visual output in R, based on data that I have in a CSV that looks something like this:
user,size,time
fred,123,0.915022
fred,321,0.938769
fred,1285,1.185608
wilma,5146,2.196687
fred,7506,1.181990
barney,5146,1.860287
wilma,1172,1.158015
barney,5146,1.219313
wilma,13185,1.455904
wilma,8754,1.381372
wilma,878,1.216908
barney,2974,1.223852
I can read this just fine, using, e.g.:
data = read.csv('data.csv')
For the moment, a fairly simple plot is fine, so I'm just trying plot(), without much to it (setting type='o' to get lines and points), and' from solving a past problem, I know that I can do, e.g., the following, to get data for just fred:
plot(data$time[which(data$user == 'fred')], data$size[which(data$user == 'fred')], type='o')
What I'd like, though, is to have the data for each user all showing up on one set of axes, with color coding (and a legend to match users to colors) to identify different user data.
And if another user shows up, I'd like another line to show up, with another color (perhaps recycling if I have too many users at once).
However, just this doesn't do it:
plot(data$size, data$time, type='o',col=c("red", "blue", "green"))
Because it doesn't seem to group by the user.
And just this:
plot(data, type='o')
gives me an error:
Error in plot.default(...) :
formal argument "type" matched by multiple actual arguments
This:
plot(data)
does do something, but not what I want.
I've poked around, but I'm new enough to R that I'm not quite sure how best to search for this, nor where to look for examples that would hit a use-case like this.
I even got somewhat closer with this:
plot(data$size[which(data$user == 'wilma')], data$time[which(data$user == 'wilma')], type='o', col=c('red'))
lines(data$size[which(data$user == 'fred')], data$time[which(data$user == 'fred')], type='o', col=c('green'))
lines(data$size[which(data$user == 'barney')], data$time[which(data$user == 'barney')], type='o', col=c('blue'))
This gives me a plot (which I'd post inline, but as a new user, I'm not allowed to yet):
not-quite-right plot
which is kind of close to what I want, except that it:
doesn't have a legend
has ugly axis labels, instead of just time and size
is scaled to the first plot, and thus is missing data from some of the others
isn't sorted by x-axis, which I could do externally, though I'm guessing I could do it fairly easily in R.
So, the question, ultimately, is this:
What's an easy way to plot data like this which:
has multiple lines based on the labels in the first column of the CSV
uses the same set of axes for the data in columns 2 and 3, regardless of the label
has a legend and color-coding for which label is being used for a particular line (or set of points)
will adapt to adding new labels to the data file, hopefully without change to the R code.
Thanks in advance for any help or pointers on this.
P.S. I looked around for similar questions, and found one that's sort of close, but it's not quite the same, and I failed to figure out how to adapt it to what I'm trying to do.
Good question. This is doable in base plot, but it's even easier and more intuitive using ggplot2. Below is an example of how to do this with random data in ggplot2
First download and install the package
install.packages("ggplot2",repos='http://cran.us.r-project.org')
require(ggplot2)
Next generate the data
a <- c(rep('a',3),rep('b',3),rep('c',3))
b <- rnorm(9,50,30)
c <- rep(seq(1,3),3)
dat <- data.frame(a,b,c)
Finally, make the plot
ggplot(data=dat, aes(x=c, y=b , group=a, colour=a)) + geom_line() + geom_point()
Basically, you are telling ggplot that your x axis corresponds to the c column (dat$c), your y axis corresponds to the b column (y$b) and to group (draw separate lines) by the a column (dat$a). Colour specifies that you want to group colour by the a column as well.
The resulting graph looks like this:

Stacked barplot using ggplot2

I am a newbie with R (we are using it at university for marketing research).
So, here's the thing:
I want to create a stacked barplot using ggplot2 that show 3 rectangles, with three different colours, for every var on the x line! On the Y line instead I would like to put percentages, but it's ok even with the frequencies.
I've been googling three day before posting, I swear. I'm not a programmer, so sometimes even If i may have found what I needed I think I haven't recognized it :/
This are my tries:
require(ggplot2)
head(dati)
ggplot(dati, aes(B4_1,B4_2,B4_3)) + geom_bar(position="dodge", stat="identity")
(but of course the results isn't right, because it puts the B4_1 var on the X and the B4_2 var on the Y)
so I tried:
ggplot(dati, aes(B4_1)) + geom_bar(position="dodge",binwidth=x)
because i tought it would at least give me the roots! But it says that x wasn't found.
I tried messing around with this code but it keeps giving me different errors, like that x wasn't found, that bin was wrong. I honestly tried and I can post my chronology on the internet of the last two days if anyone thinks I'm just asking you to do my work for me!
Thank you for the support, any help will be VERY appreciated! happy new year!
I need to make some assumptions because the structure of your data is not entirely clear. What I'm pretty sure of is that you are sampling pairs with two values:
x
something that can be either B4_1, B4_2, B4_3
I'm basing this on your comment about frequencies/counts. If B4_* are actual variables with values, then the answer will be different but will almost certainly involve "melting" your data into long format.
So, with that:
# Make up data
data <- data.frame(
x=sample(c("a", "b", "c"), 150, replace=T),
lab=sample(c("B4_1", "B4_2", "B4_3"), 150, replace=T)
)
# Plot
ggplot(data, aes(x=x, fill=lab)) + geom_bar()
Notice how I only specify the x values, and then geom_bar automatically counts them for me. You definitely don't want position=dodge.
Either way, you really should at a minimum post the results of dput(head(dati)) so people can provide better answers.

Uniform plot points in R -- Research / HW

This is for research I am doing for my Masters Program in Public Health
I am graphing data against each other, a standard x,y type deal, over top of that I am plotting a predicted line. I get what I think to be the most funky looking point/boxplot looking thing ever with an x axis that is half filled out and I don't understand why as I do not call a boxplot function. When I call the plot function it is my understanding that only the points will plot.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response") <-- produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol)) <-- Produces goofy graph
lines(tl_ord, pred_ord) <-- this produces the line not boxplots
Here is the resulting picture
Not to sure how to proceed from here, this is an off shoot of another problem I had with the same data set at this link here I am not understanding why boxplots are being drawn, the reason being is I did not specifically call the boxplot() command so I don't know why they appeared along with point plots. When I issue the following command: plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot like I expected, but when I change the order of what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem as pointed out by #Dwin et all Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, you have converted DAYS.TO.FAILURE to a factor without meaning to. Presumably this was done in the pch=unique(psymbol) argument via the code psymbol <- unique(FAILURE + 1) above. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.

pch in plot with R

I have a dataset that I have plotted, I am now trying to build a legend with the corresponding point styles, the points are plotted correctly on the graph but the legend shows the same symbol for the binary response set. I am a bit confused as to why and hope it is something small. Here is my code
# data should already be loaded in from the project on the school drive
library(survival)
attach(lace)
lace
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- FAILURE + 1
table(psymbol)
plot(AGE, TOTAL.LACE, pch=(psymbol))
legend(0, 15, c("FAILURE = 1", "FAILURE = 0"), pch=(psymbol))]
picture
Thank you,
pysmbol is a vector of length n, where n is the number of data points in your data set. Your legend call is passing this entire vector to pch where you really only need a vector of length 2. Hence legend uses the first two elements of psymbol for pch. Now, go look at psymbol[1:2]. I'll be very surprised if that doesn't return two 1s.
I'd suggest you do pch = unique(psymbol). It looks like it should be a numeric vector so that should work.
Note that you don't need parentheses around psymbol in your calls, and attach()ing an object is considered poor practice unless you quickly detach() immediately after. See ?with for an alternative approach.

Resources