I am doing a CCA on a table of species abundance data and environmental parameters, very similar to the doubs data in package ade4.
cil.cca.r<-cca(cil~.,envm.r)
Where cil is a table with abundance data with 16 observations (sample sites) and 76 variables (abundances) and envm.r is my table with 16 observations (sample sites) and 11 variables (environmental parameters).
My problem is that I have a lot of species and I want the species names in the plot to be more readable.
Or, if this doesn't work, I would need to find out which species (please allow me to say) "prefer" the environmental parameter PK/ml. How can I get at that information when the species names are not readable?
The command I started with is the following:
plot(cil.cca.r,scaling=1,display=c("sp","lc","cn"),main="Triplot CCA spe ~envm.r - scaling 1",repel = TRUE )
Next I also tried to separate the commands to make it more readable.
plot(cil.cca.rh, type="n",xlim=c(-2,2))
text(cil.cca.rh, "species", col="red", cex=0.6)
text(cil.cca.rh, dis="cn",cex=0.7,adj=0.8,col="blue")
I don't get anywhere by trying out the parameters.
Then I found the function orditorp()
plot(cil.cca.rh, type="n",xlim=c(-2,2))
orditorp(cil.cca.rh, "species", col="red", cex=0.6)
orditorp(cil.cca.rh,display="sites",cex=0.8,adj=0.1,col="darkgreen",air=0.7)
Plot with orditorp
This makes the plot look nicer, but I still don't get the information about my species, since a lot of them are drawn as points.
I hoped for a solution like package ggrepel (e.g. geom_text_repel() for ggplot2), but this doesn't work for a CCA.
Error: "ggplot2 doesn't know how to deal with data of class cca"
I also hoped for a function like fviz_pca_ind() from package factoextra, but for CCA data.
But I couldn't find anything and would be very grateful for solutions, either for plotting the labels better or for getting the species names together with their locations!
Of course, I could give the species shorter names or numbers instead of names, but it would be nicer to see at a glance which species it is...
Or perhaps "zooming in" on part of the graphic?
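One way to get at the species names without relying on the plot might be to pull the scores out of the fitted object. The following is only a minimal sketch under stated assumptions: it uses vegan's built-in varespec and varechem data as a stand-in for cil and envm.r, and both the choice of the variable "P" and the idea of ranking species by their projection onto its biplot arrow are made up for illustration. The last lines redraw a window of the ordination, which is one way of "zooming in".

```r
library(vegan)

# Stand-in data shipped with vegan (assumption: replace with cil and envm.r)
data(varespec)
data(varechem)
mod <- cca(varespec ~ Al + P + K, data = varechem)

# Species scores and biplot arrows of the environmental variables (axes 1-2)
sp <- scores(mod, display = "species", scaling = 1)
bp <- scores(mod, display = "bp", scaling = 1)

# Project every species onto the direction of one arrow (here "P");
# large positive values lie furthest along that arrow in this 2-D view
arrow <- bp["P", ] / sqrt(sum(bp["P", ]^2))
proj <- sort(drop(sp %*% arrow), decreasing = TRUE)
head(proj, 10)

# "Zooming in": redraw only a window of the ordination so labels have room
plot(mod, scaling = 1, display = c("sp", "bp"), type = "n",
     xlim = c(0, 1), ylim = c(0, 1))
text(mod, display = "species", scaling = 1, cex = 0.6, col = "red")
text(mod, display = "bp", scaling = 1, col = "blue")
```

The projection only reflects the two plotted axes, so it is a reading aid for the plot rather than a statistical test of preference.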
This is my first question here so please excuse any mistakes I (may or may not) make.
The premise:
I have a vegetation dataset containing paired data on different plots for old and new observations. I used the 'openxlsx' package to load my data and the 'vegan' package to run an NMDS as follows:
mydata <- read.xlsx(mydata)
mydataMDS <- metaMDS(mydata, k=2, trymax=500)
The result is then used for a model via the "envfit()"-function, including environmental variables:
myenvdata <- read.xlsx(myenvdata)
mydataMDS_fit <- envfit(mydataMDS, myenvdata, perm=10000, na.rm=TRUE)
plot(mydataMDS, display="sites")
plot(mydataMDS_fit, p.max=0.01, axis=TRUE)
Now I have a plot of my "mydataMDS" analysis, including the vectors that R calculated for "mydataMDS_fit".
The problem:
I want to colour and connect certain points within this plot. As "mydata" consists of observations within the same plot at different times, I intend to colour all points of old observations in one colour, and all the new ones in a different one. I've read about adding columns in order to group old and new observations, but as I'm working with a model there are no columns. How can I edit my datasets ("mydata", "mydataMDS", "myenvdata", "mydataMDS_fit") in order to show old and new plots in 2 different colours (one colour for old, one colour for new), and connect the paired observations with lines? Or: Is there a possibility to directly re-colour the points within my graphical output via checking for old/new observations?
(Sorry, I feel like my explanation is quite complicated, but I still hope someone may be able to help)
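A possible approach, sketched here under assumptions: the ordination object does store one pair of coordinates per observation, so a separate grouping vector in the same row order can drive the colours without editing the model itself. The example uses vegan's dune data in place of mydata, and the period and pair vectors (first ten rows old, last ten rows new, paired in order) are hypothetical and would have to come from your own plot records.

```r
library(vegan)

# Stand-in data (assumption: replace dune with mydata)
data(dune)
set.seed(1)
ord <- metaMDS(dune, k = 2, trace = FALSE)

# Hypothetical layout: rows 1-10 are old surveys, rows 11-20 the matching
# new surveys of the same plots, in the same order
period <- factor(rep(c("old", "new"), each = 10), levels = c("old", "new"))
pair   <- rep(1:10, times = 2)

# Site coordinates are stored in the ordination, one row per observation
pts <- scores(ord, display = "sites")

plot(ord, display = "sites", type = "n")
# Connect each old point with its new partner
old <- pts[period == "old", ][order(pair[period == "old"]), ]
new <- pts[period == "new", ][order(pair[period == "new"]), ]
segments(old[, 1], old[, 2], new[, 1], new[, 2], col = "grey50")
# Colour the points by period
points(pts, pch = 19, col = c("steelblue", "firebrick")[period])
legend("topright", legend = levels(period), pch = 19,
       col = c("steelblue", "firebrick"))
```

The envfit vectors can still be added afterwards with plot(mydataMDS_fit, p.max=0.01), since that call draws on top of the existing plot.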
For my thesis I want to create a histogram of standardized earnings. Ideally, this histogram should have the following properties:
It should be possible to adjust the intervals (bins) of the data.
Since I have my data in a spreadsheet: is it possible to consider more than one column?
It should also be possible to set the range of the data that is included in the histogram, for example from -50 mio. to 200 mio. (but I could do this in my input).
Sadly, I was not able to do this on my own.
I downloaded the data from Orbis as a spreadsheet (xlsx). Afterwards I cleaned my data of symbols that R can't read, saved everything as a tab-separated .txt and imported it into RStudio:
setwd("/path")
getwd()
df<- read.table("importFile", header = TRUE)
View(df)
This worked nicely.
Now I tried creating the histogram:
library(ggplot2)
myplot=ggplot(df, aes(JuStandartisiert2007))
myplot+ stat_count(width = 1000)
Then I received the following warning:
position_stack requires non-overlapping x intervals
My histogram looks horrible:
This perplexes me: I tried making a histogram of the airquality dataset and it works without problems.
Also note that I have to use stat_count for my histogram; in a YouTube video I saw, they did it the following way:
myplot+ geom_histogram(binwidth = 10)
My questions are now:
What is wrong with my data? Why do I have overlapping x values? To my naked eye my data looks the same as R's airquality dataset.
How can I separate my x values?
Can I set max and min values for the data that enters my histogram?
Can I consider more than one column in my dataset?
Here is my dataset as a tab-separated txt file.
https://www.dropbox.com/sh/jbscj6cftpcqaxh/AADglvv_xnG2wWN-o2SIrTwpa?dl=0
I would rather begin with base plotting such as:
hist(df$JuStandartisiert2007,breaks=1000,xlim=c(-2,2))
Note the xlim argument, which sets the limits of the x-axis.
To plot two columns against each other, try:
plot(df$JuStandartisiert2007,df$BilanzsummeAktiva2007,xlim = c(-5,5),ylim=c(-1,1000))
Once again, note the x and y limits set by xlim and ylim.
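If you prefer to stay with ggplot2, something along these lines may work. It is only a sketch on simulated data: the data frame and its column names (earnings2007, earnings2008) are invented stand-ins for your columns, and the bin width and range are arbitrary. As far as I can tell, stat_count() tallies every distinct x value, which suits discrete data, so wide bars on continuous values overlap and that is probably what triggers the position_stack warning; geom_histogram() bins a continuous variable instead.

```r
library(ggplot2)

# Simulated stand-in for the imported data (column names are made up)
set.seed(42)
df <- data.frame(earnings2007 = rnorm(500, mean = 0, sd = 1),
                 earnings2008 = rnorm(500, mean = 0.5, sd = 1.5))

# Reshape to long format so several columns can share one histogram
long <- stack(df)                    # gives columns "values" and "ind"
names(long) <- c("value", "column")

# Keep only the range of interest before binning
long <- subset(long, value >= -2 & value <= 2)

# binwidth controls the bins; overlapping columns are drawn semi-transparent
p <- ggplot(long, aes(x = value, fill = column)) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.5)
print(p)
```

Filtering the data beforehand drops values outside the range entirely, whereas adding coord_cartesian(xlim = c(-2, 2)) would only zoom the view without changing the bins.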
I ran a PCA on a set of 45000 genes on 5 different samples, and when I draw a biplot, all I see is a mass of text (corresponding to the observation names), and I cannot see the location of my samples. Is there a way to plot the location of the samples only, and not the observations, in a biplot?
Using built-in data from R:
usa <- USArrests
pca1 <- prcomp(usa)
biplot(pca1)
This generates a biplot where all the states (observation names) overlap the variables (my different samples) rape, etc. Is it possible to plot only the variables (samples), and not the states (observation names)?
biplot.default uses text to write the categorical variable name of the observation. As it doesn't use points you need to modify the source if you only want the points (and not the labels) to be plotted.
However, you could "hack" it by doing something like:
biplot(pca1, xlabs = rep(".", nrow(usa)))
I hope this is what you're looking for!
Edit: If this is not satisfactory, you can modify the source given when running stats:::biplot.default to use points.
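Another option that avoids touching the biplot source is to draw the variable loadings yourself. This is a rough sketch rather than a replacement for biplot(): it plots only the first two columns of pca1$rotation as arrows, without the rescaling that biplot() applies, and the colours and arrow size are arbitrary choices.

```r
usa  <- USArrests
pca1 <- prcomp(usa)

# Loadings of the variables on the first two principal components
load <- pca1$rotation[, 1:2]

# Empty frame, then one arrow and one label per variable; no observations
plot(load, type = "n", asp = 1,
     xlim = range(c(0, load[, 1])), ylim = range(c(0, load[, 2])))
arrows(0, 0, load[, 1], load[, 2], length = 0.1, col = "red")
text(load, labels = rownames(load), pos = 3, col = "red")
```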
I realize this is perhaps more of a data frame issue than an xyplot question, but here it goes.
I have a data frame dat that has 108 rows and 5 columns. dat$Treatment is a factor with 5 levels. I want to create an xy plot with ONLY the data where dat$Treatment=="Control". Since I didn't know any better way to do it, I created tmp as shown below. xyplot plots the correct graph, with only the data in the rows where dat$Treatment=="Control". However the legend displays all the data, for example those where dat$Treatment=="High dose"
Where is auto.key getting that from? I thought my tmp data frame didn't even have it. Can someone please help me understand?
tmp <- dat[dat$Treatment=="Control",]
xyplot(tmp[,5] ~ Day, groups=tmp$Animal, data=tmp,
type="b", ylab="Tumor volume",
par.settings=simpleTheme(col=1:8,
pch=20,
cex=1.3,
lwd=2,
lty="dotted"),
auto.key=list(title="Animal", x=.05, y=.95,
corner=c(0,1), border=T, lines=T, points=F, type="b"))
I'm not too familiar with the lattice package, so others with more experience will have to weigh in. My guess is that you're seeing this behavior because of how R is handling dat$Treatment. I'm guessing this variable is stored as a factor, with levels you don't want to include in the plot. As a rough first step, I'd try saving the new data frame (as you have), but additionally run the following command:
tmp$Treatment = as.factor(as.character(tmp$Treatment))
This should save the Treatment variable as a factor with only one level. My guess is that the xyplot function looks up the levels of that factor when it plots. As a related example, consider the following:
data(iris)
iris.2 = iris[iris$Species == "setosa",]
table(iris.2$Species)
iris.2$Species = factor(as.character(iris.2$Species))
table(iris.2$Species)
Here, the two tables are reported differently because we've resaved the Species variable as a new factor. Hope this helps --
auto.key gets its values from the levels of the factor variables. When you subset a factor variable, all the levels are maintained (so in the future, you can know which levels are missing from a particular subset). If you want to remove levels that aren't used in your subset you can use
tmp <- droplevels(dat[dat$Treatment=="Control",])
This way auto.key will never see the other factor levels.
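A quick way to see the effect, using the built-in iris data as a stand-in for dat:

```r
sub <- iris[iris$Species == "setosa", ]

levels(sub$Species)              # all three species levels are still listed
levels(droplevels(sub)$Species)  # only "setosa" remains
```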
This is for research I am doing for my Masters Program in Public Health
I am plotting two variables against each other, a standard x,y type deal, and on top of that I am plotting a predicted line. I get what I think is the most funky-looking point/boxplot thing ever, with an x axis that is only half filled out, and I don't understand why, as I do not call a boxplot function. It is my understanding that when I call the plot function only the points will be plotted.
The data I am plotting looks like this
TOTAL.LACE | DAYS.TO.FAILURE
9 | 15
16 | 7
... | ...
The range of the TOTAL.LACE is from 0 to 19 and DAYS.TO.FAILURE is 0 - 30
My code is as follows, maybe it is something before the plot but I don't think it is:
# To control the type of symbol we use we will use psymbol, it takes
# value 1 and 2
psymbol <- unique(FAILURE + 1)
# Build a test frame that will predict values of the lace score due to
# a patient being in a state of failure
test <- survreg(Surv(time = DAYS.TO.FAILURE, event = FAILURE) ~ TOTAL.LACE,
dist = "logistic")
pred <- predict(test, type="response")  # produces numbers from about 14 to 23
summary(pred)
ord <- order(TOTAL.LACE)
tl_ord <- TOTAL.LACE[ord]
pred_ord <- pred[ord]
plot(TOTAL.LACE, DAYS.TO.FAILURE, pch=unique(psymbol))  # produces goofy graph
lines(tl_ord, pred_ord)  # this produces the line, not the boxplots
Here is the resulting picture
Not too sure how to proceed from here. This is an offshoot of another problem I had with the same data set, at this link here. I don't understand why boxplots are being drawn, since I did not call the boxplot() command, so I don't know why they appeared along with the points. When I issue the command plot(DAYS.TO.FAILURE, TOTAL.LACE) I only get points on the resulting plot, as I expected, but when I swap what is plotted on x and y the boxplots show up, which to me is unexpected.
Here is a link to sample data that will hopefully help in reproducing the problem, as pointed out by @Dwin et al.: Some Sample Data
Thank you,
Since you don't have a reproducible example, it is a little hard to provide an answer that deals with your situation. Here I generate some vaguely similar-looking data:
set.seed(4)
TOTAL.LACE <- rep(1:19, each=1000)
zero.prob <- rbinom(19000, size=1, prob=.01)
DAYS.TO.FAILURE <- rpois(19000, lambda=15)
DAYS.TO.FAILURE <- ifelse(zero.prob==1, DAYS.TO.FAILURE, 0)
And here is the plot:
First, the problem with some of the categories not being printed on the x-axis is because they don't fit. When you have so many categories, to make them all fit you have to display them in a smaller font. The code to do this is to use cex.axis and set the value <1 (you can read more about this here):
boxplot(DAYS.TO.FAILURE~TOTAL.LACE, cex.axis=.8)
As to the question of why your plot is "goofy" or "funky-looking", it is a bit hard to say, because those terms are rather nebulous. My guess is that you need to more clearly understand how boxplots work, and then understand what these plots are telling you about the distribution of your data. In a boxplot, the midline of the box is the 50th percentile of your data, while the bottom and top of the box are the 25th and 75th percentiles. Typically, the 'whiskers' will extend out to the furthest datapoint that is at most 1.5 times the inter-quartile range beyond the ends of the box. In your case, for the first 9 TOTAL.LACEs, more than 75% of your data are 0's, so there is no box and thus no whiskers are possible. Everything beyond the whisker limits is plotted as an individual point. I don't think your plots are "funky" (although I'll admit I have no idea what you mean by that), I think your data may be "funky" and your boxplots are representing the distributions of your data accurately according to the rules by which boxplots are constructed.
In the future (and I mean this politely), it will help you get more useful and faster answers if you can write questions that are more clearly specified, and contain a reproducible example.
Update: Thanks for providing more information. I gather by "funky" you mean that it is a boxplot, rather than a typical scatterplot. The thing to realize is that plot() is a generic function that will call different methods depending on what you pass to it. If you pass simple continuous data, it will produce a scatterplot, but if you pass continuous data and a factor, then it will produce a boxplot, even if you don't call boxplot explicitly. Consider:
plot(TOTAL.LACE, DAYS.TO.FAILURE)
plot(as.factor(TOTAL.LACE), DAYS.TO.FAILURE)
Evidently, TOTAL.LACE (the variable on the x-axis) has been converted to a factor without your meaning to; is.factor(TOTAL.LACE) will confirm this. Separately, pch=unique(psymbol), with psymbol <- unique(FAILURE + 1) above, supplies only the distinct symbol values rather than one per observation. Although I haven't had time to try this, I suspect eliminating that line of code and using pch=(FAILURE + 1) will accomplish your goals.
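To illustrate the point about factors and plot symbols, here is a small sketch on simulated data. The values are random stand-ins for your variables, and the conversion step assumes the factor levels are the original numbers written as text.

```r
set.seed(4)
TOTAL.LACE      <- sample(0:19, 200, replace = TRUE)
DAYS.TO.FAILURE <- sample(0:30, 200, replace = TRUE)
FAILURE         <- rbinom(200, size = 1, prob = 0.3)

# If TOTAL.LACE arrived as a factor, convert it back to numbers first;
# going through as.character() recovers the labels, not the level codes
lace_f   <- factor(TOTAL.LACE)
lace_num <- as.numeric(as.character(lace_f))

# Numeric x gives a scatterplot; FAILURE + 1 gives one symbol per observation
plot(lace_num, DAYS.TO.FAILURE, pch = FAILURE + 1)
```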