I have an assignment for a statistics class where I copied code from the teacher, but I keep getting an error saying 'x' must be numeric, and I don't see where the issue is.
Survey=read.csv("ClassSurvey.csv", na.strings = c("", " ", "NA"))
attach(Survey)
library(FSAdata)
op = par(oma=c(0,0,1.5,0), mar=c(3,3,2,1))
hist(Height ~ Gender,
las=1,
nrow=2, ncol=1,
cex.main=0.9, cex.lab=0.8, cex.axis=0.8,
mgp=c(1.8,0.6,0),
xlab = "Height (cms)")
thank you!!
You have a couple of problems. First, you can't make a histogram of a variable that is not numeric. Try this:
data(ToothGrowth)
str(ToothGrowth)
'data.frame': 60 obs. of 3 variables:
$ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
$ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
$ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
attach(ToothGrowth)
Attempt to draw a histogram of supp, a variable of class "factor" (it's not numeric).
> hist(supp)
Error in hist.default(supp) : 'x' must be numeric
Your second problem is that the hist function doesn't accept a formula y ~ x. So
hist(len~supp)
fails as well, even though len is numeric. This works though:
hist(len)
I wonder if you copied the teacher's code incorrectly, or if the teacher's code itself is incorrect, or something else is going on.
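If what you want is one histogram of Height per Gender group, here is a minimal base-R sketch of one way to do it (my own suggestion, not the teacher's code; it assumes Height is numeric and Gender is a factor or character column in Survey):
g_lev <- levels(factor(Survey$Gender))            # the gender levels present
op <- par(mfrow = c(length(g_lev), 1), mar = c(3, 3, 2, 1))
for (g in g_lev) {
  # one panel per gender, subsetting Height on the current level
  hist(Survey$Height[Survey$Gender == g],
       main = g, xlab = "Height (cms)", las = 1)
}
par(op)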
Related
I need to compute weighted Mann-Whitney U test results a few hundred times. Each iteration is a two-sample test for differences between two groups. I can't figure out how to get the existing function to handle missing values without dynamically deleting cases.
The data for a few of the comparisons are here, in a data frame I call dat. All variables with numbers in this sheet are numeric in type.
Here's how I call the sjstats::mannwhitney() function:
mannwhitney(dat, measure1, group)
When I do so, I get the following error:
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
I suspect this is because of the missing value in the 212th observation of measure1. But wrapping the vector names in na.omit() or !is.na() doesn't address the problem, perhaps because doing so still leaves a data frame where the number of non-NA values of group is greater than the number of non-NA values of measure1.
Any thoughts on how I could incorporate dynamic NA handling into the function call?
I am not sure what class your group column is, but if I do it like this:
library(sjstats)
dat = read.csv("question - Sheet1.csv")
str(dat)
'data.frame': 301 obs. of 5 variables:
$ measure1 : num 2 1.6 2.2 2.7 1.8 1.8 4 4 3.9 -3.7 ...
$ measure2 : num 0.9 0.1 0 0.4 -1 -1.3 2.1 0 -1.1 -3.9 ...
$ measure3 : num 1.1 1.1 2.2 1.2 1.9 1.2 0 3 1.9 -3.8 ...
$ measurre4: num 2 2 2 3 3 2 3 4 3 2.36 ...
$ group : int 0 0 0 0 0 0 0 0 0 0 ...
I get:
mannwhitney(dat, measure1, group)
Error in `[[<-.data.frame`(`*tmp*`, "grp1.label", value = character(0)) :
replacement has 0 rows, data has 1
Convert your group to a factor:
dat$group = factor(dat$group)
mannwhitney(dat, measure1, group)
# Mann-Whitney-U-Test
Groups 1 = 0 (n = 110) | 2 = 1 (n = 190):
U = 16913.000, W = 10808.000, p = 0.621, Z = 0.495
effect-size r = 0.029
rank-mean(1) = 153.75
rank-mean(2) = 148.62
Reading the code, the bug comes from this:
labels <- sjlabelled::get_labels(grp, attr.only = F, values = NULL,
non.labelled = T)
If your group is numeric, it doesn't have attributes and hence you get no labels:
sjlabelled::get_labels(0:1)
NULL
sjlabelled::get_labels(factor(0:1))
[1] "0" "1"
I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed. Even so, at this point I am able to perform the two-way ANOVA. However, when I perform a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow about this error mention that one or more of the factors used in the ANOVA has fewer than two levels. But as seen above, that is not the case here.
Another explanation I have read is that missing values (NAs) may be the issue. There are some:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way, and the missing values do not make a difference. Either I am making a crucial mistake, or the issue is in the log transformation below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not enough that the factors have 2 levels; those levels must also actually be present in the data. For example, below f has 2 levels but only 1 of them actually occurs.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
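A quick diagnostic (a sketch I am adding, base R only) is to count, after na.omit(), how many levels of each factor still actually occur; any factor with fewer than 2 occurring levels will trigger this error in the model fit:
# For every factor column, count the levels that still occur in the data.
sapply(Filter(is.factor, m_runjumpFREQ),
       function(f) sum(table(f) > 0))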
I am using ggplot2 2.1.0 to plot histograms, and I am seeing unexpected behaviour concerning the histogram bins.
Here is an example with left-closed bins (i.e. [0, 0.1[) and a binwidth of 0.1.
mydf <- data.frame(myvar=c(-1,-0.5,-0.4,-0.1,-0.1,0.05,0.1,0.1,0.25,0.5,1))
myplot <- ggplot(mydf, aes(myvar)) + geom_histogram(aes(y=..count..),binwidth = 0.1, boundary=0.1,closed="left")
myplot
ggplot_build(myplot)$data[[1]]
In this example, one would expect the value -0.4 to fall within the bin [-0.4, -0.3[, but it falls instead (mysteriously) in the bin [-0.5, -0.4[. The same happens for the value -0.1, which falls in [-0.2, -0.1[ instead of [-0.1, 0[, and so on.
Is there something here I do not fully understand (especially with the new "center" and "boundary" params)? Or is ggplot2 doing weird things there?
Thanks in advance,
Best regards,
Arnaud
PS: Also asked here: https://github.com/hadley/ggplot2/issues/1651
Edit: The problem described below was fixed in a recent release of ggplot2.
Your issue is reproducible and appears to be caused by rounding errors, as suggested in the comments by Roland. At this point, this looks to me like a bug introduced in version ggplot2_2.0.0. I speculate below about its origin, but first let me present a workaround based on the boundary option.
PROBLEM:
df <- data.frame(var = seq(-100,100,10)/100)
as.list(df) # check the data
$var
[1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2
[10] -0.1 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
[19] 0.8 0.9 1.0
library("ggplot2")
p <- ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.1,
boundary = 0.1,
closed = "left")
p
SOLUTION
Tweak the boundary parameter. In this example, setting it just below 1, say to 0.99, works. Your use case should be amenable to similar tweaking.
ggplot(data = df, aes(x = var)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
boundary = 0.99,
closed = "left")
(I have made the binwidth narrower for a better visual.)
Another workaround is to introduce your own fuzziness, e.g. multiply the data by 1 plus a tiny offset on the order of the machine epsilon (see eps below). In ggplot2 the built-in fuzz is 1e-7 (earlier versions) or 1e-8 (later versions).
CAUSE:
The problem appears clearly in ncount:
str(ggplot_build(p)$data[[1]])
## 'data.frame': 20 obs. of 17 variables:
## $ y : num 1 1 1 1 1 2 1 1 1 0 ...
## $ count : num 1 1 1 1 1 2 1 1 1 0 ...
## $ x : num -0.95 -0.85 -0.75 -0.65 -0.55 -0.45 -0.35 -0.25 -0.15 -0.05 ...
## $ xmin : num -1 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 ...
## $ xmax : num -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0 ...
## $ density : num 0.476 0.476 0.476 0.476 0.476 ...
## $ ncount : num 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.5 0 ...
## $ ndensity: num 1.05 1.05 1.05 1.05 1.05 2.1 1.05 1.05 1.05 0 ...
## $ PANEL : int 1 1 1 1 1 1 1 1 1 1 ...
## $ group : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ ymin : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ymax : num 1 1 1 1 1 2 1 1 1 0 ...
## $ colour : logi NA NA NA NA NA NA ...
## $ fill : chr "grey35" "grey35" "grey35" "grey35" ...
## $ size : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ linetype: num 1 1 1 1 1 1 1 1 1 1 ...
## $ alpha : logi NA NA NA NA NA NA ...
ggplot_build(p)$data[[1]]$ncount
## [1] 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 0.5 0.0 1.0 0.5
## [13] 0.5 0.5 0.0 1.0 0.5 0.0 1.0 0.5
ROUNDING ERRORS?
It looks like it:
df <- data.frame(var = as.integer(seq(-100,100,10)))
# eps <- 1.000000000000001 # on my system
eps <- 1+10*.Machine$double.eps
p <- ggplot(data = df, aes(x = eps*var/100)) +
geom_histogram(aes(y = ..count..),
binwidth = 0.05,
closed = "left")
p
(I have removed the boundary option altogether)
This behaviour appears sometime after ggplot2_1.0.1. Looking at the source code (e.g. bin.R and stat-bin.r in https://github.com/hadley/ggplot2/blob/master/R) and tracing the computation of count leads to the function bin_vector(), which contains the following lines:
bin_vector <- function(x, bins, weight = NULL, pad = FALSE) {
... STUFF HERE I HAVE DELETED FOR CLARITY ...
cut(x, bins$breaks, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
}
By comparing the current versions of these functions with older ones, you should be able to find the reason for the different behaviour... to be continued...
SUMMING UP DEBUGGING
By "patching" the bin_vector function and printing the output to screen, it appears that:
bins$fuzzy correctly stores the fuzzy parameters
The non-fuzzy bins$breaks are used in the computations, but as far as I can see (and correct me if I'm wrong) the bins$fuzzy are not.
If I simply replace bins$breaks with bins$fuzzy at the top of bin_vector, the correct plot is returned. Not a proof of a bug, but a suggestion that perhaps more could be done to emulate the behaviour of previous versions of ggplot2.
At the top of bin_vector I expected to find a condition upon which to return either bins$breaks or bins$fuzzy. I think that's missing now.
PATCHING
To "patch" the bin_vector function, copy the function definition from the github source or, more conveniently, from the terminal, with:
ggplot2:::bin_vector
Modify it (patch it) and assign it into the namespace:
library("ggplot2")
bin_vector <- function (x, bins, weight = NULL, pad = FALSE)
{
... STUFF HERE I HAVE DELETED FOR CLARITY ...
## MY PATCH: Replace bins$breaks with bins$fuzzy
bin_idx <- cut(x, bins$fuzzy, right = bins$right_closed,
include.lowest = TRUE)
... STUFF HERE I HAVE DELETED FOR CLARITY ...
ggplot2:::bin_out(bin_count, bin_x, bin_widths)
## THIS IS THE PATCHED FUNCTION
}
assignInNamespace("bin_vector", bin_vector, ns = "ggplot2")
df <- data.frame(var = seq(-100,100,10)/100)
ggplot(data = df, aes(x = var)) + geom_histogram(aes(y = ..count..), binwidth = 0.05, boundary = 1, closed = "left")
Just to be clear, the code above is edited for clarity: the function has a lot of type-checking and other calculations which I have removed, but which you would need in order to patch the function. Before you run the patch, restart your R session or detach your currently loaded ggplot2.
OLD VERSIONS
The unexpected behaviour is NOT observed in versions 0.9.3 or 1.0.1 and appears to originate in the current release 2.1.0 (or perhaps the earlier 2.0.0, which gave me an error when I tried to call it).
To install and load an old version, say ggplot2_0.9.3, create a separate directory (no point in overwriting the current version), say ggplot2093:
URL <- "http://cran.r-project.org/src/contrib/Archive/ggplot2/ggplot2_0.9.3.tar.gz"
install.packages(URL, repos = NULL, type = "source",
lib = "~/R/testing/ggplot2093")
To load the old version, call it from your local directory:
library("ggplot2", lib.loc = "~/R/testing/ggplot2093")
Consider the dataset ToothGrowth from the datasets package: a 60-row data frame with three variables: len (tooth length), supp (supplement type) and dose (dose in milligrams per day).
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
I use the function coplot() to see the effect of the variable dose on the variable len for each factor of supp.
with(ToothGrowth, coplot(len ~ dose | supp))
How can I create the same plot with boxplots for len ~ dose, instead of having a point for each case?
Using the coplot() function from R base graphics would be preferable.
Try this:
library(lattice)
data("ToothGrowth")
ToothGrowth[,3] <- factor(ToothGrowth[,3]) # treat dose as a factor so it defines groups
#before
xyplot(len ~ dose | supp, data=ToothGrowth, layout=c(2,1))
#after
bwplot(len ~ dose | supp, data=ToothGrowth, layout=c(2,1))
Edit:
If you want to use only base R graphics, you can do the following.
coplot(len ~ dose | supp, data=ToothGrowth, xlim = c(0, 4),
panel = function(x, y, ...){boxplot(y ~ x, add=TRUE)})
I created a histogram for a simulation, and now I need to find the total number of instances where the x-variable is greater than a given value. Specifically, my data are correlations (ranging from -1 to 1, with bin size 0.05), and I want to find the percentage of events where the correlation is greater than 0.1. Finding the total number of events greater than 0.1 is enough, because from that the percentage is easy to compute.
library(psych)
library(lessR)
corrData=NULL
for (i in 1:1000){
x1 <- rnorm(mean=0, sd = 1, n=20)
x2 <- rnorm(mean=0, sd = 1, n=20)
data <- data.frame(x1,x2)
r <- with(data, cor(x1, x2))
corrData <- append(corrData,r)
}
describe(corrData)
hist <- hist(corrData, breaks=seq(-1,1,by=.05), main="N=20")
describe(hist)
count(0.1, "N=20")
Try something like this:
N = 500
bh = hist(runif(N, -1, 1)) # simulated stand-in for your corrData
#str(bh)
sum(bh$counts[bh$mids >= .1]) / N # counts in bins whose midpoints are at least 0.1
Look at what hist is actually giving you (see ?hist):
set.seed(10230)
x<-hist(2*runif(1000)-1)
> str(x)
List of 6
$ breaks : num [1:11] -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 ...
$ counts : int [1:10] 92 99 100 105 92 116 95 102 100 99
$ density : num [1:10] 0.46 0.495 0.5 0.525 0.46 0.58 0.475 0.51 0.5 0.495
$ mids : num [1:10] -0.9 -0.7 -0.5 -0.3 -0.1 0.1 0.3 0.5 0.7 0.9
$ xname : chr "2 * runif(1000) - 1"
$ equidist: logi TRUE
- attr(*, "class")= chr "histogram"
The breaks list item tells you the endpoints of the "catching" intervals. The counts item tells you the counts in the (one fewer) bins defined by these breaks.
So, to get as close as you can to what you want using only your hist object, you could do:
sum(x$counts[which(x$breaks>=.1)-1L])/sum(x$counts)
But, as @Frank said, this may be incorrect, particularly if the bin containing 0.1 does not have an endpoint at 0.1.
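If the raw simulated values are still available (as corrData is in the question), an exact alternative that does not depend on the bin edges at all is to compute the proportion directly:
# Exact proportion of simulated correlations above 0.1; no binning involved.
mean(corrData > 0.1)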