I do not understand the letter ordering produced by the function multcompLetters from multcompView. According to the documentation, the letters should follow the group means. In the following example, the middle group got c (out of a, b, c) when it should have got b. Is this a bug?
require(multcompView)
# Data
datacol <- c(21.1,20.2,21.8,20.9,23.3,21.1,20.2,21.8,20.9,23.3,19.8,16.4,
16.9,16.0,17.6,17.5,16.9,13.3,18.0,17.6,13.5,12.2,15.2,15.1,15.2,14.0)
# Group
faccol <- c(rep(c(1,2),each=10),rep(3,6))
# Combined Dataframe
tukeyset <- data.frame(datacol,as.factor(faccol))
colnames(tukeyset)[2] <- "faccol"
# Tukeytest
tukeyres <- TukeyHSD(x=aov(lm(datacol~faccol,data=tukeyset)))
Tlevels <- tukeyres$faccol[,4]
multcompLetters(Tlevels) # WRONG ORDER, even reversed
# Boxplot
boxplot(tukeyset$datacol~tukeyset$faccol)
# adding the labels
text(x=c(1,2,3),y=c(aggregate(data=tukeyset,datacol~faccol,mean)$datacol),
labels=as.character(multcompLetters(Tlevels,reversed=TRUE)$Letters)[order(names(multcompLetters(Tlevels,reversed=TRUE)['Letters']$Letters))])
I just ran into this problem too, and spent two hours sorting it out!
Do not call multcompLetters directly. It gave me letters in completely the wrong order.
Use multcompLetters2, multcompLetters3 or multcompLetters4 instead.
Another very important point: the input dataset has to be a plain data frame, not a tibble. A tibble did not work for me.
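For what it's worth, here is a minimal sketch of what that could look like with the data from the question. It assumes an installed multcompView version that provides multcompLetters4, and that the function takes the fitted aov model first and the TukeyHSD result second; the names fit and letters_by_term are made up for this example.

```r
library(multcompView)

# Data and grouping from the question, built as a plain data frame (not a tibble)
datacol <- c(21.1, 20.2, 21.8, 20.9, 23.3, 21.1, 20.2, 21.8, 20.9, 23.3,
             19.8, 16.4, 16.9, 16.0, 17.6, 17.5, 16.9, 13.3, 18.0, 17.6,
             13.5, 12.2, 15.2, 15.1, 15.2, 14.0)
tukeyset <- data.frame(
  datacol = datacol,
  faccol  = factor(c(rep(c(1, 2), each = 10), rep(3, 6)))
)

# Fit the model once and reuse it for both the Tukey test and the letters
fit <- aov(datacol ~ faccol, data = tukeyset)
tukeyres <- TukeyHSD(fit)

# Assumed call shape: model first, Tukey result second; one entry per model term
letters_by_term <- multcompLetters4(fit, tukeyres)
letters_by_term$faccol$Letters
```

The letters come back as a named vector, so they can be matched to the boxplot positions by factor level name rather than by position.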
I'm trying to make a histogram but I keep running into an error message.
Here is my code
library(readxl)
data <- read_excel("data.xls")
hist(data)
This is my sample. I want to create a histogram with the y-axis running from 0 to 100, the categories (Safely, Basic, Limited, etc.) on the x-axis, and the numbers (39, 29, 8, 12, 12) shown in the graph. Does this make sense?
Safely  Basic  Limited  Unimproved  Open
39      29     8        12          12
Error in hist.default(data) : 'x' must be numeric
What am I doing wrong? I don't understand the error message.
In your case data is not a variable, but a dataframe that contains variables.
You can take the histogram of each single variable like this:
library(readxl)
data <- read_excel("data.xls")
If you want to look at the histogram of variable Safely:
hist(data$Safely)
You can access each variable contained in data in the same way.
The issue is that you are passing a dataframe to the hist() function, when it requires a vector for its argument x (see ?hist). Based on your edited post, you would want:
hist(as.numeric(data[1,]))
Here data[1,] selects the first row of your dataframe, and as.numeric() converts that row to a numeric vector.
Though it seems like you may actually be looking for a bar plot. In that case, try:
library(ggplot2)
library(magrittr) # provides the %>% pipe
plot_data <- data.frame(t(data)) %>%
  tibble::rownames_to_column()
ggplot(plot_data, aes(x = rowname, y = t.data.)) +
  stat_identity(geom = "bar")
From @user2554330, a simpler base graphics method:
f <- as.numeric(data[1,])
names(f) <- names(data)
barplot(f)
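To try the base graphics version without the Excel file, here is a small self-contained sketch. The survey data frame is typed in by hand from the sample in the question (an assumption about what read_excel returns), and ylim is set to give the 0-100 y-axis that was asked for.

```r
# Stand-in for the one-row data frame read from data.xls (values from the question)
survey <- data.frame(Safely = 39, Basic = 29, Limited = 8,
                     Unimproved = 12, Open = 12)

# unlist() on a one-row data frame gives a named numeric vector
counts <- unlist(survey[1, ])

# Bar plot with the category names on the x-axis and a 0-100 y-axis
barplot(counts, ylim = c(0, 100))
```

Because the vector keeps the column names, barplot labels the bars without a separate names() step.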
I have a list of responses to 7 questions from a survey, each their own column, and am trying to find the response within the first 6 that is closest (numerically) to the 7th. Some won't be the exact same, so I want to create a new variable that produces the difference between the closest number in the first 6 and the 7th. The example below would produce 0.
s <- c(1,2,3,4,5,6,3)
s <- t(s)
s <- as.data.frame(s)
s
Any help is deeply appreciated. I apologize for not having attempted code as nothing I have tried has actually gotten close.
How about this?
which.min( abs(s[1, 1:6] - s[1, 7]))
I'm assuming you want it generalized somehow, but you'd need to provide more info for that. Or just run it through a loop :-)
EDIT: added the loop from the comment and changed exactly 2 tiny things.
s <- c(1,2,3,4,5,6,3)
t <- c(1,2,3,4,5,6,7)
p <- c(1,2,3,4,5,6,2)
s <- data.frame(s,t,p)
k <- t(s)
k <- as.data.frame(k)
k$t <- NA  # need to initialize the column
for (i in 1:3) {
  # need to refer to each row of k when populating the t column
  k[i, ]$t <- which.min(abs(k[i, 1:6] - k[i, 7]))
}
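Note that which.min gives the position of the closest response, not the difference itself. If the difference is what you are after, a sketch along these lines might do it without a loop. It assumes one respondent per row with the seven answers in columns 1 to 7; responses and closest_diff are names made up for this example.

```r
# One respondent per row, seven answers per respondent (same values as above)
responses <- data.frame(rbind(
  c(1, 2, 3, 4, 5, 6, 3),
  c(1, 2, 3, 4, 5, 6, 7),
  c(1, 2, 3, 4, 5, 6, 2)
))

# For each row: smallest absolute gap between answers 1-6 and answer 7
responses$closest_diff <- apply(responses, 1, function(r) min(abs(r[1:6] - r[7])))
responses$closest_diff
```

For these three rows the differences are 0, 1 and 0, since only the second respondent has a seventh answer (7) that does not appear among the first six.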
I've been working on a project for a little bit for a homework assignment and I've been stuck on a logistical problem for a while now.
What I have at the moment is a list that returns 10000 values in the format:
[[10000]]
X-squared
0.1867083
(This is the 10000th value of the list)
What I really would like is to just have the chi-squared value alone so I can do things like create a histogram of the values.
Is there any way I can do this? I'm fine with repeating the test from the start if necessary.
My current code is:
nsims <- 10000
cancer.cells <- c(rep("M",24),rep("B",13))
malig <- numeric(nsims) # initialize before filling by index
for (i in 1:nsims) {
  malig[i] <- sum(sample(cancer.cells,21)=="M")
}
benign <- 21 - malig
rbenign <- 13 - benign
rmalig <- 24 - malig
cancerchi <- list() # initialize the result list
for (i in 1:nsims) {
  test <- cbind(c(rbenign[i],benign[i]),c(rmalig[i],malig[i]))
  cancerchi[i] <- chisq.test(test,correct=FALSE)
}
It gives me all I need, I just cannot perform follow-up analysis on it such as creating a histogram.
Thanks for taking the time to read this!
I'll provide an answer at the suggestion of @Dr. Mike.
hist requires a numeric vector as input. hist(cancerchi) does not work because cancerchi is a list, not a vector.
There are several ways to convert cancerchi from a list into a format that hist can work with; three are shown further down. The most direct is to unlist it inline:
hist(unlist(cancerchi))
Note that if you do not reassign cancerchi it will still be a list and cannot be passed directly to hist.
# i.e
class(cancerchi)
hist(cancerchi) # will still give you an error
If you reassign, it can be another type of object:
(class(cancerchi2 <- unlist(cancerchi)))
(class(cancerchi3 <- as.data.frame(unlist(cancerchi))))
# using the ldply function in the plyr package
library(plyr)
(class(cancerchi4 <- ldply(cancerchi)))
These new objects can be passed to hist (for the data frames, select the column):
hist(cancerchi2)
hist(cancerchi3[,1]) # specify column because cancerchi3 is a data frame, not a vector
hist(cancerchi4[,1]) # specify column because cancerchi4 is a data frame, not a vector
A little extra information: other useful commands for looking at your objects include str and attributes.
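Since you said you are fine with repeating the test from the start, another option is to store only the statistic in a plain numeric vector to begin with. This is just a sketch: it uses a smaller nsims and a fixed seed so it runs quickly, suppresses the small-expected-count warnings from chisq.test, and chi_stats is a name I made up.

```r
set.seed(1)
nsims <- 200  # smaller than the 10000 in the question, only to keep the example fast
cancer.cells <- c(rep("M", 24), rep("B", 13))

# One chi-squared statistic per simulated sample, returned as a numeric vector
chi_stats <- vapply(seq_len(nsims), function(i) {
  malig  <- sum(sample(cancer.cells, 21) == "M")
  benign <- 21 - malig
  tab <- cbind(c(13 - benign, benign), c(24 - malig, malig))
  # keep only the test statistic and drop its "X-squared" name
  unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
}, numeric(1))

hist(chi_stats)
```

vapply with numeric(1) guarantees a numeric vector of length nsims, so no conversion is needed before calling hist.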
Is there a way to only show the first 2 lines of output from the describe command in Hmisc?
For data safety reasons I can only really show n, missing, unique and mean in my output and possibly a histogram.
This means that I would have to hide the output for lowest, highest as well as frequencies and percentiles.
Is this possible? If not I'll probably have to calculate the values myself.
library(Hmisc)
res <- describe(rnorm(400))
#Look at the structure.
str(res)
#It's a list! You can change the objects in it.
res$counts <- res$counts[1:4]
res$values <- NULL
print(res)
# rnorm(400)
#       n missing  unique    Mean
#     400       0     400 0.05392
I am trying to run some Monte Carlo simulations on animal position data. So far, I have sampled 100 X and Y coordinates, 100 times. This results in a list of 200 elements. I then convert this list into a data frame that is more conducive to the functions I eventually want to run on each sample (kernel.area).
Now I have a data frame with 200 columns, and I would like to perform the kernel.area function using each successive pair of columns.
I can't reproduce my own data here very well, so I've tried to give a basic example just to show the structure of the data frame I'm working with. I've included the for loop I've tried so far, but I am still an R novice and would appreciate any suggestions.
# generate dataframe representing X and Y positions
df <- data.frame(x=seq(1:200),y=seq(1:200))
# 100 replications of sampling 100 "positions"
resamp <- replicate(100,df[sample(nrow(df),100),])
# convert to data frame (kernel.area needs an xy dataframe)
df2 <- do.call("rbind", resamp[1:2,])
# xy positions need to be in columns for kernel.area
df3 <- t(df2)
#edit: kernel.area requires you have an id field, but I am only dealing with one individual, so I'll construct a fake one of the same length as the positions
id=replicate(100,c("id"))
id=data.frame(id)
Here is the structure of the for loop I've tried (edited since first post):
for (j in seq(1,ncol(df3)-1,2)) {
kud <- kernel.area(df3[,j:(j+1)],id=id,kern="bivnorm",unin=c("m"),unout=c("km2"))
print(kud)
}
My end goal is to calculate kernel.area for each resampling event (ie rows 1:100 for every pair of columns up to 200), and be able to combine the results in a dataframe. However, after running the loop, I get this error message:
Error in df[, 1] : incorrect number of dimensions
Edit: I realised my id format was not the same as my data frame, so I changed it and now get the error:
Error in kernelUD(xy, id, h, grid, same4all, hlim, kern, extent) :
id should have the same length as xy
First, a disclaimer: I have never worked with the package adehabitat, which has a function kernel.area, which I assume you are using. Perhaps you could confirm which package contains the function in question.
I think there are a couple of suggestions I can make that are independent of knowledge of the specific package, though.
The first lies in the creation of df3. This should probably be df3 <- t(df2), but this is most likely correct in your actual code and just a typo in your post.
The second suggestion has to do with the way you subset df3 in the loop. j:j+1 is just a single number, since : has higher precedence than + (see ?Syntax for operator precedence in R). To get the desired two columns, use j:(j+1) instead.
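A quick way to see the precedence issue for yourself (j = 3 is an arbitrary value picked for the illustration):

```r
j <- 3

# Parsed as (j:j) + 1, so this is a single number
single <- j:j + 1
single

# Parentheses force the addition first, giving two consecutive indices
pair <- j:(j + 1)
pair
```

Here single is 4 and pair is 3 4, so only the second form selects two columns.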
EDIT:
When loading adehabitat, I was warned to "Be careful" and use the related new packages, among which is adehabitatHR, which also contains a function kernel.area. This function has slightly different syntax and behavior, but perhaps it would be worthwhile examining. Using adehabitatHR (I had to install from source since the package is not available for R 2.15.0), I was able to do the following.
library(adehabitatHR)
for (j in seq(1, ncol(df3) - 1, 2)) {
  kud <- kernelUD(SpatialPoints(df3[, j:(j+1)]), kern = "bivnorm")
  kernAr <- kernel.area(kud, unin = c("m"), unout = c("km2"))
  print(kernAr)
}
detach(package:adehabitatHR, unload=TRUE)
This prints something for each pair of columns. Note that, as mentioned in a comment below, kernelUD() is called before kernel.area().
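As for combining the results in a data frame: one possible approach is to keep each resample as its own two-column data frame in a list, instead of spreading them over 200 columns, and then collect one value per resample. The sketch below only shows that structure. bbox_area is a hypothetical stand-in that I wrote so the example runs without adehabitatHR; it is not kernel.area, and you would swap in your kernelUD()/kernel.area() calls there.

```r
set.seed(42)
df <- data.frame(x = 1:200, y = 1:200)

# simplify = FALSE keeps each resample as a separate x/y data frame in a list
resamp <- replicate(100, df[sample(nrow(df), 100), ], simplify = FALSE)

# Hypothetical stand-in for the per-sample computation (NOT kernel.area):
# area of the bounding box around the sampled positions
bbox_area <- function(xy) diff(range(xy$x)) * diff(range(xy$y))

# One row per resampling event, ready for further analysis
results <- data.frame(
  resample = seq_along(resamp),
  area     = vapply(resamp, bbox_area, numeric(1))
)
head(results)
```

With a list of data frames there is no column-pair indexing at all, which also avoids the j:j+1 problem.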