I have a dataset that I have plotted, and I am now trying to build a legend with the corresponding point styles. The points are plotted correctly on the graph, but the legend shows the same symbol for both values of the binary response. I am a bit confused as to why and hope it is something small. Here is my code:
# data should already be loaded in from the project on the school drive
library(survival)
attach(lace)
lace
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- FAILURE + 1
table(psymbol)
plot(AGE, TOTAL.LACE, pch=(psymbol))
legend(0, 15, c("FAILURE = 1", "FAILURE = 0"), pch=(psymbol))
Thank you,
psymbol is a vector of length n, where n is the number of data points in your data set. Your legend call passes this entire vector to pch, where you really only need a vector of length 2. Hence legend uses the first two elements of psymbol for pch. Now, go look at psymbol[1:2]. I'll be very surprised if that doesn't return two 1s.
I'd suggest you do pch = unique(psymbol). It looks like psymbol is a numeric vector, so that should work, provided the order of the unique values matches the order of your legend labels.
Note that you don't need parentheses around psymbol in your calls, and attach()ing an object is considered poor practice unless you quickly detach() immediately after. See ?with for an alternative approach.
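One caveat with unique(): it returns values in order of first appearance, so the symbols may not line up with hand-written labels. A minimal sketch of a legend that fixes the order explicitly follows. The data frame lace_demo and the names outcomes and symbols are made up for illustration and merely stand in for the real lace data.

```r
# Hypothetical stand-in for the lace data: AGE, TOTAL.LACE and a 0/1 FAILURE column
set.seed(1)
lace_demo <- data.frame(
  AGE        = round(runif(30, 20, 90)),
  TOTAL.LACE = sample(0:19, 30, replace = TRUE),
  FAILURE    = rbinom(30, 1, 0.4)
)

# One symbol per outcome, in a fixed order: FAILURE 0 -> pch 1, FAILURE 1 -> pch 2
outcomes <- c(0, 1)
symbols  <- outcomes + 1

# with() gives access to the columns without attach()
with(lace_demo, {
  plot(AGE, TOTAL.LACE, pch = FAILURE + 1)
  legend("topleft",
         legend = paste("FAILURE =", outcomes),
         pch    = symbols)
})
```

Because the labels and the symbols are both derived from outcomes, they cannot get out of step with each other.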
I want to plot the distribution of a dataset using a histogram in R. I tried different values of the breaks argument (a fixed number, Freedman-Diaconis, and Scott) to get the best representation. I am considering a log scale later, but first I want to see the raw distribution without any scaling. However, the results look different. Why is that? The dataset I use can be downloaded from here data or here data. The code I'm running is:
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = 200)
result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Scott")
Result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks="Freedman-Diaconis")
result is
Please help. Thank you very much.
Histograms are very sensitive to the choice of cell break points. Even for the same (!) number of cells, the histogram can become considerably different by just a small shift of the cell borders. It is thus generally preferable to use kernel density estimators instead of histograms, because they do not depend on random cell border placement:
# increase n if you have a wide range of values
d <- density(as.matrix(deviation_all_genes_all_spots), n=512)
plot(d$x, d$y)
In your second and third call of hist, you ask for an automatic way to select the number of cells and the cell borders. Obviously, this results in more cells than in your first call with breaks=200. You can query the cells from the return value of hist, e.g.
h <- hist(as.matrix(deviation_all_genes_all_spots))
cat(sprintf("number of cells = %i\n", length(h$mids)))
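To see how far apart the automatic rules can be, here is a small sketch using the nclass.* helpers from grDevices. The vector x is simulated heavy-tailed data, an assumption made purely for illustration, not the deviation data from the question.

```r
set.seed(42)
# Hypothetical heavy-tailed sample standing in for the deviation data
x <- c(rnorm(5000, sd = 100), rnorm(50, sd = 5000))

# Number of cells each rule proposes for the same data
n_sturges <- nclass.Sturges(x)
n_scott   <- nclass.scott(x)
n_fd      <- nclass.FD(x)
print(c(Sturges = n_sturges, Scott = n_scott, FD = n_fd))

# hist() treats a cell count as a suggestion only, so check what it actually used
h <- hist(x, breaks = "FD", plot = FALSE)
cat(sprintf("cells actually used = %d\n", length(h$mids)))
```

Note also that xlim only changes the visible window of the plot; it does not change how the cells are computed.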
I have a small dataset with EU member states that contains values on their degree of negotiation success and the activity level the member states showed in the negotiations.
I am doing a linear regression with R.
In short the hypothesis is:
The more activity a member state shows, the more success it will have in negotiations.
I played around a lot with the data, transformed it etc.
What I have done so far:
# Stored the dataset from a csv file in object linData
linData = read.csv(file.choose(), sep = ";", encoding = "de_DE.UTF-8")
# As I like to switch variables and test different models, I send the relevant ones to objects x and y.
# So it is easier for me to change it in the future.
x = linData$ALL_Non_Paper_Art.Ann.Recit.Nennung
y = linData$Success_high
# I put the label for each observation in a factor lab
lab = linData$MS_short
# After this I run the linear model
linModel = lm(y~x, data = linData)
summary(linModel)
# I create a simple scatterplot. Here the labels from the factor lab work fine
plot(x, y)
text(x, y, labels=lab, cex= 0.5, pos = 4)
So far so good. Now I want to check for model quality. For visual inspection I found out I can use the command
plot(linModel)
This produces 4 plots in a row:
Residuals vs Fitted
Normal Q-Q
Scale Location
Residuals vs Leverage
As you can see in every picture, R marks problematic observations with a number. It would be very convenient if R could just use the column "MS_short" from the dataset and add that label to the marked observations. I am sure this is possible... but how?
I have been working with R for 2 months now. I found some material here and via Google, but nothing helped me solve the problem. I have no one I can ask. This is my first post here on Stack Overflow.
Thank you in advance
Rainer
With the help of G. Grothendieck I solved the problem.
After opening the R help for plot, more specifically the help for plotting a linear regression (plot.lm), with the command
?plot.lm
I read the box with the "arguments and usage" part and identified the labels.id argument AND the id.n argument.
id.n is "number of points to be labelled in each plot, starting with the most extreme."
I needed that. I was interested in identifying these extreme points. R already marked the 3 most extreme points in all graphics (see initial post) but used the observation numbers rather than any useful labels. Labelling every point would clutter the graphics. So, remember: in my case I want the 3 most extreme values to be labelled.
Now let's add this to the command:
I started the same as above, with a plot of my already computed linear model -> plot(linModel). After that I added "id.n =" and set the value to "3". That looked like that:
plot(linModel, id.n = 3)
So far so good, now R knows what to label BUT still not what should be used as label.
For this we have to add the labels.id to the command.
labels.id is the "vector of labels, from which the labels for extreme points will be chosen."
I assumed that one column in my dataset (NOT the linear model!) has the property of a vector and so I added a comma and then "labels.id =" to the command and typed in the name of my dataset and then the column, so in my case: "linData$MS_short" where linData is the dataset and MS_short the column with the 2 letter string for each member state. The final command looked like this:
plot(linModel, id.n = 3, labels.id = linData$MS_short)
And then it worked (see here). End of story.
Hope this can help some other newbies. Greetings.
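For anyone who wants to try this without the original data, here is a minimal sketch on the built-in mtcars data. Using mtcars and its row names in place of linData and MS_short is an assumption made for illustration only.

```r
# Built-in mtcars data stands in for linData; row names play the role of MS_short
fit <- lm(mpg ~ wt, data = mtcars)

# which = 1 draws only the Residuals vs Fitted panel
plot(fit, which = 1, id.n = 3, labels.id = rownames(mtcars))

# The three observations with the largest absolute residuals,
# which should be the ones labelled in that panel
extreme <- names(sort(abs(residuals(fit)), decreasing = TRUE))[1:3]
print(extreme)
```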
I am trying to plot a time series graph, but am having issues getting it to be a line graph while showing the decades at the bottom.
My data set has the decades (as factors) next to performance (integer)
If I write
plot(StockPerformance$Decade, StockPerformance$Performance)
I will get a graph that has horizontal lines in it
PLOT PICTURE
adding,
type ="o"
like this:
plot(StockPerformance$Decade, StockPerformance$Performance, type ="o")
doesn't change it....
In R, when you read a data frame using read.table (or a variant thereof) or make one using data.frame, it tries to figure out what you have and treat it appropriately. Specifically, character vectors (like "1830s") get converted to factors. (This was the default before R 4.0.0; since then stringsAsFactors defaults to FALSE.)
Factors are a way to efficiently store character strings, which was a lot more important when R was first created than it is now. The important thing for you is that plot() treats a factor on the x-axis as a set of categories, not as positions on a numeric scale, so R automatically makes boxplots out of them. That's why you are seeing lines: they are boxplots with only one point each.
To get around this, you need to convert them to numbers for the purpose of plotting. Then, you need to "fix" the axes afterwards. Here's how:
plot(Performance ~ as.numeric(Decade),
data = StockPerformance,
xlab = "Decade", # otherwise we have "as.numeric(Decade)"
xaxt = 'n', # removes default axis ticks and labels
pch = 1 # default open circle. Change the number to get other options. 16 and 20 are both closed circles (20 is small, 16 is big)
)
with(StockPerformance, # This just makes it so I don't have to type StockPerformance twice below.
axis(1, at = 1:nlevels(Decade),
labels = levels(Decade)
))
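Since the question asked for a line graph, the same idea can be combined with type = "o". Below is a self-contained sketch; stock_demo and its values are invented for illustration and only mimic the shape of StockPerformance.

```r
# Hypothetical data in the same shape as StockPerformance
stock_demo <- data.frame(
  Decade      = factor(c("1830s", "1840s", "1850s", "1860s", "1870s")),
  Performance = c(3, -1, 4, 2, 6)
)

# Integer codes of the factor levels, here 1 to 5
x_pos <- as.numeric(stock_demo$Decade)

plot(x_pos, stock_demo$Performance,
     type = "o",   # points joined by lines
     xaxt = "n",   # suppress the default x axis so we can draw our own
     xlab = "Decade", ylab = "Performance")
axis(1, at = seq_along(levels(stock_demo$Decade)),
     labels = levels(stock_demo$Decade))
```

The lines join the points in row order, so if the rows are not already in chronological order it is worth sorting the data frame by Decade first.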
I am following along with the MITx: 15.071x The Analytics Edge course online and am trying to work out how to color code points on a graph. I can successfully make a graph of latitude and longitude using line 1 below. However, when I try the second line I just get a blank graph. When I leave the original graph from line 1 in focus, I can't type any code without returning focus to the console, and then typing line 2 does not do anything. I am new to both Stack Overflow and R, so any insight is appreciated. Thanks!
plot(boston$LON, boston$LAT)
points(boston$LON[boston$CHAS == 1], boston$LAST[boston$CHAS ==1], col = "blue", pch = 19)
Here's how to do it on one plot call:
plot(boston$LON, boston$LAT,
col=c("black", "blue")[ (boston$CHAS==1)+1 ] )
This uses implicit coercion (via the +1 operation) of a logical vector formed by (boston$CHAS==1) from FALSE/TRUE (or {0,1}) to {1,2} which then is used to index from the color vector. Such color vectors are called palettes in R. The findInterval function would allow you to compactly create a multiple valued integer vector that could select from a much larger color palette vector.
(Your method should have succeeded in plotting blue points on the plot that was formed by the first line, but perhaps it was the misspelling of "LAT" that tripped you up?)
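The findInterval idea can be sketched as follows, using the built-in mtcars data. The cut points and the three colours are arbitrary choices for illustration, and palette_demo, cuts and idx are hypothetical names.

```r
# Three-colour palette; the cut points are arbitrary and chosen for illustration
palette_demo <- c("black", "blue", "red")
cuts <- c(20, 25)   # mpg < 20 -> 1, 20 <= mpg < 25 -> 2, mpg >= 25 -> 3

# findInterval() returns 0, 1 or 2 here, so add 1 to get a valid index
idx <- findInterval(mtcars$mpg, cuts) + 1

plot(mtcars$wt, mtcars$mpg, col = palette_demo[idx], pch = 19)
```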
The function points can be used only once the function plot has been called previously, so you could call the following sequentially
plot(mtcars$disp[mtcars$cyl==4],mtcars$mpg[mtcars$cyl==4],col='red',pch=16,ylim=range(mtcars$mpg),xlim=range(mtcars$disp))
points(mtcars$disp[mtcars$cyl==6],mtcars$mpg[mtcars$cyl==6],col='blue',pch=16)
points(mtcars$disp[mtcars$cyl==8],mtcars$mpg[mtcars$cyl==8],col='black',pch=16)
You might also look at the package ggplot2, which makes this kind of thing simpler
library(ggplot2)
ggplot(data=mtcars,aes(x=disp,y=mpg,color=factor(cyl))) + geom_point()
I'm looking to plot a set of sparklines in R with just a 0 and 1 state that looks like this:
Does anyone know how I might create something like that ideally with no extra libraries?
I don't know of any simple way to do this, so I'm going to build up this plot from scratch. This would probably be a lot easier to design in illustrator or something like that, but here's one way to do it in R (if you don't want to read the whole step-by-step, I provide my solution wrapped in a reusable function at the bottom of the post).
Step 1: Sparklines
You can use the pch argument of the points function to define the plotting symbol. ASCII symbols are supported, which means you can use the "pipe" symbol for vertical lines. The ASCII code for this symbol is 124, so to use it for our plotting symbol we could do something like:
plot(df, pch=124)
Step 2: labels and numbers
We can put text on the plot by using the text command:
text(x,y,char_vect)
Step 3: Alignment
This is basically just going to take a lot of trial and error to get right, but it'll help if we use values relative to our data.
Here's the sample data I'm working with:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
I'm going to start out by plotting an empty box to use as our canvas. To make my life a little easier, I'm going to set the coordinates of the box relative to values meaningful to my data. The Y positions of the 4 data series will be the same across all plotting elements, so I'm going to store that for convenience.
n=ncol(df)
m=nrow(df)
plot(1:m,
seq(1,n, length.out=m),
# The following arguments suppress plotting values and axis elements
type='n',
xaxt='n',
yaxt='n',
ann=FALSE)
With this box in place, I can start adding elements. For each element, the X values will all be the same, so we can use rep to set that vector, and seq to set the Y vector relative to the Y range of our plot (1:n). I'm going to shift the positions by percentages of the X and Y ranges to align my values, and modify the size of the text using the cex parameter. Ultimately, I found that this works out:
ypos = rev(seq(1+.1*n,n*.9, length.out=n))
text(rep(1,n),
ypos,
colnames(df), # These are our labels
pos=4, # This positions the text to the right of the coordinate
cex=2) # Increase the size of the text
I reversed the sequence of Y values because I built my sequence in ascending order, and the values on the Y axis in my plot increase from bottom to top. Reversing the Y values then makes it so the series in my dataframe will print from top to bottom.
I then repeated this process for the second label, shifting the X values over but keeping the Y values the same.
text(rep(.37*m,n), # Shifted towards the middle of the plot
ypos,
colSums(df), # new label
pos=4,
cex=2)
Finally, we shift X over one last time and use points to build the sparklines with the pipe symbol as described earlier. I'm going to do something sort of weird here: I'm going to tell points to plot at as many positions as I have data points, but I'm going to use ifelse to decide whether to actually draw a pipe symbol at each one. This way everything will be properly spaced. When I don't want to plot a line, I'll use a space as my plotting symbol (ASCII code 32). I will repeat this procedure, looping through all columns in my data frame:
for(i in 1:n){
points(seq(.5*m,m, length.out=m),
rep(ypos[i],m),
pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
cex=2,
col='gray')
}
So, piecing it all together and wrapping it in a function, we have:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
BinarySparklines = function(df,
L_adj=1,
mid_L_adj=0.37,
mid_R_adj=0.5,
R_adj=1,
bottom_adj=0.1,
top_adj=0.9,
spark_col='gray',
cex1=2,
cex2=2,
cex3=2
){
# 'adj' parameters are scalar multipliers in [-1,1]. For most purposes, use [0,1].
# The exception is L_adj which is any value in the domain of the plot.
# L_adj < mid_L_adj < mid_R_adj < R_adj
# and
# bottom_adj < top_adj
n=ncol(df)
m=nrow(df)
plot(1:m,
seq(1,n, length.out=m),
# The following arguments suppress plotting values and axis elements
type='n',
xaxt='n',
yaxt='n',
ann=FALSE)
ypos = rev(seq(1+bottom_adj*n,n*top_adj, length.out=n)) # use bottom_adj instead of a hard-coded .1
text(rep(L_adj,n),
ypos,
colnames(df), # These are our labels
pos=4, # This positions the text to the right of the coordinate
cex=cex1) # Increase the size of the text
text(rep(mid_L_adj*m,n), # Shifted towards the middle of the plot
ypos,
colSums(df), # new label
pos=4,
cex=cex2)
for(i in 1:n){
points(seq(mid_R_adj*m, R_adj*m, length.out=m),
rep(ypos[i],m),
pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
cex=cex3,
col=spark_col)
}
}
BinarySparklines(df)
Which gives us the following result:
Try playing with the alignment parameters and see what happens. For instance, to shrink the side margins, you could try decreasing the L_adj parameter and increasing the R_adj parameter like so:
BinarySparklines(df, L_adj=-1, R_adj=1.02)
It took a bit of trial and error to get the alignment right for the result I provided (which is what I used to inform the default values for BinarySparklines), but I hope I've given you some intuition about how I achieved it and how moving things using percentages of the plotting range made my life easier. In any event, I hope this serves as both a proof of concept and a template for your code. I'm sorry I don't have an easier solution for you, but I think this basically gets the job done.
I did my prototyping in Rstudio so I didn't have to specify the dimensions of my plot, but for posterity I had 832 x 456 with the aspect ratio maintained.