Converting factor to integer from a .csv using RStudio.
Hi, I know this question has been asked frequently but I've been trying to wrap my head around things for an hour with no success.
In my .csv file 'Weighted.average' is a calculation of Weighted.count/count (before conversion), but when I use the file in R it is a factor, despite being completely numeric (with decimal points).
I'm aiming to aggregate the data using Weighted.average's numeric values. But as it is still considered a factor it doesn't work. I'm newish to R so I'm having trouble converting other examples to my own.
Thanks
RENA <- read.csv('RENA.csv')
RENAVG <- aggregate(Weighted.average~Diet+DGRP.Line, data = RENA, FUN = sum)
ggplot(RENAVG, aes(x=DGRP.Line, y=Weighted.average, colour=Diet)) +
geom_point()
Expected to form a dot plot using Weighted.average, error
Error in Summary.factor(c(3L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
‘sum’ not meaningful for factors
occurs. I know it's due to it not being read as an integer, but I'm lost at how to convert.
Thanks
Output from dput
> dput(head(RENA))
structure(list(DGRP.Line = structure(c(19L, 19L, 19L, 19L, 20L,
20L), .Label = c("105a", "105b", "348", "354", "362a", "362b",
"391a", "391b", "392", "397", "405", "486a", "486b", "712", "721",
"737", "757a", "757b", "853", "879"), class = "factor"), Diet = structure(c(1L,
1L, 2L, 2L, 1L, 1L), .Label = c("Control", "Rena"), class = "factor"),
Sex = structure(c(2L, 1L, 2L, 1L, 2L, 1L), .Label = c("Female",
"Male"), class = "factor"), Count = c(0L, 0L, 0L, 0L, 1L,
0L), Date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("16/07/2019",
"17/07/2019", "18/07/2019", "19/07/2019", "20/07/2019", "21/07/2019",
"22/07/2019"), class = "factor"), Day = c(1L, 1L, 1L, 1L,
1L, 1L), Weighted.count = c(0L, 0L, 0L, 0L, 1L, 0L), Weighted.average = structure(c(60L,
59L, 52L, 63L, 44L, 36L), .Label = c("", "#DIV/0!", "1.8",
"1.818181818", "2", "2.275862069", "2.282608696", "2.478873239",
"2.635135135", "2.705882353", "2.824561404", "2.903614458",
"2.911392405", "2.917525773", "3", "3.034090909", "3.038461538",
"3.083333333", "3.119402985", "3.125", "3.154929577", "3.175438596",
"3.1875", "3.220338983", "3.254237288", "3.263157895", "3.314606742",
"3.341463415", "3.35", "3.435483871", "3.5", "3.6", "3.606557377",
"3.666666667", "3.6875", "3.694214876", "3.797619048", "3.813953488",
"3.833333333", "3.875", "3.909090909", "3.916666667", "4.045454545",
"4.047169811", "4.111111111", "4.333333333", "4.40625", "4.444444444",
"4.529411765", "4.617021277", "4.620689655", "4.666666667",
"4.714285714", "4.732283465", "4.821428571", "4.823529412",
"4.846153846", "4.851851852", "4.855263158", "4.884615385",
"4.956521739", "5", "5.115384615", "5.230769231", "5.343283582",
"5.45", "5.464285714", "5.484848485", "5.538461538", "5.551724138",
"5.970588235", "6", "6.2"), class = "factor")), row.names = c(NA,
6L), class = "data.frame")
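Just modify your first line (the read.csv) to specify the type of each variable during the import, for example through the colClasses and na.strings arguments. Your dput output shows that Weighted.average contains the non-numeric entries "" and "#DIV/0!", which is why read.csv treats the whole column as a factor.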
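A minimal sketch of that idea, assuming the only non-numeric entries are the empty string and "#DIV/0!" seen in the dput output. The inline csv_text is made-up stand-in data, not the real RENA.csv, and only uses two of the original columns.

```r
# Made-up stand-in for RENA.csv (illustrative values only)
csv_text <- "Diet,Weighted.average
Control,4.5
Rena,#DIV/0!
Control,
Rena,3.25
"

# Treat blanks and spreadsheet errors as NA so the column is read as numeric
RENA <- read.csv(text = csv_text, na.strings = c("", "#DIV/0!", "NA"))
str(RENA$Weighted.average)

# aggregate() with a formula drops the NA rows by default (na.action = na.omit)
RENAVG <- aggregate(Weighted.average ~ Diet, data = RENA, FUN = sum)
RENAVG
```

If the data frame is already loaded with the column as a factor, as.numeric(as.character(RENA$Weighted.average)) converts the labels rather than the internal level codes; the "#DIV/0!" and empty entries become NA with a coercion warning.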
I'm attempting to produce an attractive graph of bandwidth data across a number of machines and tests. My attempts seem to work for small manually entered amounts of data, but when I feed the "full" 1773 entries, I get results in my graph that don't seem to exist in the input data.
I believe this is likely because the different tests are each of different duration, but I can't seem to prove this. If I use the following input data as csv (sorry, off-site because of size) I end up with a strange upwards-curve on my geom_smooth line, and additional data points that I can't actually see in my .csv input data. (I have much more data in real life, this is a subset that produces the strange behaviour)
I would expect the first four tries (try01-try04) to flat-line at zero, and try05 to carry on at around 1GBit/sec. Here's my code
library("ggplot2")
library("RColorBrewer")
speed = read.csv(file="data.csv")
svg("all_results.svg",width=24)
ggplot(speed,
aes(x = Second, y = Bandwidth, group=Test, colour=Test)) +
scale_fill_brewer(palette="Paired") +
geom_point() +
geom_smooth()
dev.off()
Here's the image produced
@Gregor seems to be exactly right: the seconds are interpreted as text, when they should represent the number of seconds since the start of that test.
Here's some example input data - please note the times are not always on a .00 second boundary due to the output of iperf.
structure(list(Machine = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "valhalla", class = "factor"),
User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "alice", class = "factor"),
Test = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "try01", class = "factor"),
Second = structure(c(1L, 2L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("0.00-1.00",
"1.00-2.00", "10.00-11.00", "11.00-12.00", "12.00-13.00",
"13.00-14.00", "14.00-15.00", "15.00-16.00", "16.00-17.00",
"17.00-18.00", "18.00-19.00", "19.00-20.00", "2.00-3.00",
"3.00-4.00", "4.00-5.00", "5.00-6.00", "6.00-7.00", "7.00-8.00",
"8.00-9.00", "9.00-10.00"), class = "factor"), Bandwidth = c(937,
943, 944, 943, 943, 943, 943, 944, 658, 943, 944, 943, 944,
644, 943, 943, 943, 944, 943, 943)), row.names = c(NA, 20L
), class = "data.frame")
I'll try casting (or whatever R calls it) those to a float now.
Points have a single x value, not a range of x-values, so we'll separate your Second column into the beginning and end of the interval and plot the points at the beginning. Calling your data dd:
library(tidyr)
library(dplyr)
dd = dd %>%
separate(Second, into = c("sec_start", "sec_end"), sep = "-", remove = FALSE) %>%
mutate(sec_start = as.numeric(sec_start),
sec_end = as.numeric(sec_end))
After that the plotting should go just fine if you put sec_start or sec_end on the x-axis. (Or calculate the middle, whatever you want...)
If you want to visualize the durations, you could use geom_segment and aes(x = sec_start, xend = sec_end, y = Bandwidth, yend = Bandwidth), but since everything is just about the same duration, it doesn't seem like this would add much value.
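For comparison, here is a base-R sketch of the same split that also computes the interval midpoint mentioned above. It assumes every Second value has the form "start-end" with a single hyphen; the three-row dd is a toy stand-in for the real data, and sec_mid is a name introduced here for illustration.

```r
# Toy stand-in for the question's data (illustrative values only)
dd <- data.frame(
  Second    = factor(c("0.00-1.00", "1.00-2.00", "10.00-11.00")),
  Bandwidth = c(937, 943, 944)
)

# Second is a factor, so convert to character before splitting on the hyphen
parts <- strsplit(as.character(dd$Second), "-", fixed = TRUE)

# Pick the first and second piece of every split and make them numeric
dd$sec_start <- as.numeric(vapply(parts, `[`, character(1), 1))
dd$sec_end   <- as.numeric(vapply(parts, `[`, character(1), 2))
dd$sec_mid   <- (dd$sec_start + dd$sec_end) / 2

# Sorting by the numeric start puts 10.00 after 1.00, unlike the text ordering
dd[order(dd$sec_start), ]
```

Any of sec_start, sec_end or sec_mid can then go on the x-axis in place of Second.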
I'm trying to overlay a stat_smooth() line from one dataset over a bar plot of another. Both csv files draw from the same dataset, but I had to make a new one for the bar plot because I had to add a few columns (including error bars) that wouldn't make sense in the big csv. So, I have code for the bar plot, and code for the line made using stat_smooth, but can't figure out how to combine them. I just want a graph with the line on top of the bars. Here's the code for the bar plot:
library(ggplot2)
library(scales)  # provides percent for the y-axis labels
e <- read.csv("Retro Complex.csv", header=TRUE, sep=",")
e <- subset(e, Accuracy != 0)
limits <- aes(ymax = Confidence + SE, ymin = Confidence - SE)
e$Complexity <- factor(e$Complexity)
p <- ggplot(e, aes(Complexity, Confidence))
p +
geom_bar(position = "dodge", stat = "identity") +
geom_errorbar(limits, position = "dodge", width = 0.25) +
coord_cartesian(ylim=c(0,1)) +
scale_y_continuous(labels = percent) +
ggtitle("Retro")
And here's for the line
ggplot(retroacc, aes(x=Complexity.Sample, y=risk)) +
stat_smooth(aes(x=Complexity.Sample, y=risk), data=retroacc,
method="glm", method.args=list(family="binomial"), se=FALSE) +
ylim(0,1)
Here's what they both look like:
Stat_smooth() line:
Barplot:
Sample Data
For the bar plot:
structure(list(Complexity = structure(1:5, .Label = c("1", "2",
"3", "4", "5"), class = "factor"), Accuracy = c(1L, 1L, 1L, 1L,
1L), Risk = c(0.69297164, 0.695793434, 0.695891571, 0.746606335,
0.748717949), SE = c(0.003621776, 0.004254081, 0.00669456, 0.008114764,
0.021963804), Proportion = c(0.823475656, 0.809299751, 0.863727821,
0.94724695, 0.882352941), SEAcc = c(0.002716612, 0.003267882,
0.004639995, 0.004059001, 0.015325003)), .Names = c("Complexity",
"Accuracy", "Confidence", "SE", "Proportion", "SEAcc"), row.names = c(1L,
3L, 5L, 7L, 9L), class = "data.frame")
For the line:
structure(list(risk = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), acc = c(0L, 1L, 1L, 1L,
0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
Uniqueness = c(0.405166959, 0.407414244, 0.285123931, 0.248994487,
0.259019778, 0.334552913, 0.300580793, 0.354632526, 0.309841996,
0.331460876, 0.289981111, 0.362405881, 0.37389863, 0.253672193,
0.342903451, 0.294459829, 0.387447291, 0.519657612, 0.278964406
), Average.Similarity = c(0.406700667, 0.409547355, 0.275663862,
0.240909144, 0.251796956, 0.31827466, 0.240574971, 0.349093002,
0.34253811, 0.348084627, 0.290495997, 0.318312198, 0.404143605,
0.290789337, 0.293259599, 0.320214236, 0.382449298, 0.506295194,
0.335167223), Complexity.Sample = c(8521L, 11407L, 3963L,
2536L, 2327L, 3724L, 4005L, 5845L, 5770L, 5246L, 3629L, 3994L,
4285L, 1503L, 8222L, 3683L, 5639L, 10288L, 3076L)), .Names = c("risk",
"acc", "Uniqueness", "Average.Similarity", "Complexity.Sample"
), class = "data.frame", row.names = c(NA, -19L))
So yeah, if any of you guys know how to combine these into one plot please let me know!!
I am constructing GLMMs (using glmer() of "lme4" R package) and sometimes I get an error when estimating R2 values (using r.squaredGLMM() from "MuMIn" package).
The model I am trying to fit is similar to this one:
library(lme4)
lmA <- glmer(x~y+(1|w)+(1|w/k), data = data1, family = binomial(link="logit"))
Then, to estimate R2, I use:
library(MuMIn)
r.squaredGLMM(lmA)
And I get this:
The result is correct only if all data used by the model has not changed since model was fitted. Error in .rsqGLMM(fam = family(x),
varFx = var(fxpred), varRe = varRe, : 'names' attribute [2] must be the same length as the vector [0]
Do you have any idea why this error appears? For instance, if I use only a single random factor (in this case, (1|w)), this error does not appear.
Here is my dataset:
data1 <-
structure(list(w = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
1L, 2L, 1L), .Label = c("CA", "CB"), class = "factor"), k = structure(c(4L,
4L, 3L, 3L, 3L, 4L, 1L, 3L, 2L, 3L, 2L), .Label = c("CAF01-CAM01",
"CAM01", "CBF01-CBM01", "CBM01"), class = "factor"), x = c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L), y = c(-0.034973549,
0.671720643, 4.557044729, 5.347170897, 2.634240583, -0.555740207,
4.118277809, 2.599825716, 0.95853864, 4.327804344, 0.057331718
)), .Names = c("w", "k", "x", "y"), class = "data.frame", row.names = c(NA,
-11L))
Any thoughts?
This was a bug in MuMIn that has been fixed in versions >= 1.15.8 (soon on CRAN, currently on R-Forge).
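A quick way to tell whether an installed copy already includes the fix is to compare version numbers. The sketch below uses a hypothetical helper, has_fix, which is not part of MuMIn; the version strings passed to it are purely illustrative.

```r
# Hypothetical helper (not part of MuMIn): TRUE if `installed` is at least
# the version in which the fix is reported to have landed
has_fix <- function(installed, fixed_in = "1.15.8") {
  package_version(installed) >= package_version(fixed_in)
}

has_fix("1.15.6")  # FALSE: older than the fixed version
has_fix("1.16.4")  # TRUE: newer than the fixed version

# With MuMIn installed, the real check would be:
# has_fix(as.character(packageVersion("MuMIn")))
```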
If I have two data.frames with the same column names, I can use rbind to make a single data frame. However, if a column is a factor in one and an int in the other, I get a warning like this:
Warning message: In [<-.factor(*tmp*, ri, value = c(1L, 1L, 0L,
0L, 0L, 1L, 1L, : invalid factor level, NA generated
The following is a simplification of the problem:
t1 <- structure(list(test = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 2L), .Label = c("False", "True"), class = "factor")), .Names = "test", row.names = c(NA,
-10L), class = "data.frame")
t2 <- structure(list(test = c(1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L
)), .Names = "test", row.names = c(NA, -10L), class = "data.frame")
rbind(t1, t2)
With a single column, this is easy to understand, but when it is one of a dozen or more columns, it can be difficult. Is there anything in the warning message that tells me which column to look at? Barring that, what is a good technique for finding out which column is in error?
You could knock up a simple little comparison script using class and mapply, to compare where the rbind will break down due to non-matching data types, e.g.:
one <- data.frame(a=1,b=factor(1))
two <- data.frame(b=2,a=2)
common <- intersect(names(one),names(two))
mapply(function(x,y) class(x)==class(y), one[common], two[common])
# a b
# TRUE FALSE
Based on thelatemail's answer, here is a function to compare two data.frames for rbinding:
mergeCompare <- function(one, two) {
cat("Distinct items: ", setdiff(names(one),names(two)), setdiff(names(two),names(one)), "\n")
print("Non-matching items:")
common <- intersect(names(one),names(two))
print (mapply(function(x,y) {class(x)!=class(y)}, one[common], two[common]))
}
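As a variation on the two snippets above, the sketch below returns only the names of the offending columns. It uses identical() rather than ==, on the assumption that some columns may carry more than one class (an ordered factor has two), in which case == would yield several values per column. The helper name mismatched_cols and the small t1 and t2 are made up for illustration.

```r
# Hypothetical helper: names of shared columns whose classes differ
mismatched_cols <- function(one, two) {
  common <- intersect(names(one), names(two))
  same <- mapply(function(x, y) identical(class(x), class(y)),
                 one[common], two[common])
  names(same)[!same]
}

# Cut-down version of the question's data: `test` is a factor in t1, an int in t2
t1 <- data.frame(test = factor(c("False", "True")), n = 1:2)
t2 <- data.frame(test = c(1L, 0L), n = 3:4)

mismatched_cols(t1, t2)  # "test"
```

Running this before rbind points straight at the column that would generate the "invalid factor level" warning.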
Hopefully someone here will be able to help me with a problem that I'm having with a ggplot script I'm trying to get right. The script will be used many times with different data, so it needs to be relatively flexible. I've got it almost where I want it, but I've come across a problem I haven't been able to solve.
The script is for a line graph with labels for each line in the right hand margin. Sometimes the graph is faceted, other times it is not.
The piece I'm having trouble with is that I would like to color code the labels in the right margin as black if there was no significant change over time, green if there was positive change, and red if there was negative change. I've got a script that works to carry this out when I only have a single facet, but as soon as I have multiple facets in the graph, the color coding of the labels gives the following error
Error: Incompatible lengths for set aesthetics:
Below is the script with data with multiple facets. The problem seems to be in the way that I'm specifying color in the geom_text line. If I delete the color call in the geom_text line in the script, then I get the attributes printed in the correct place, just not colored. I'm really at a loss on this one. This is my first post here, so let me know if I've done anything wrong with my post.
WITH MULTIPLE FACETS (DOES NOT WORK)
require(ggplot2)
require(grid)
require(zoo)
require(reshape)
require(reshape2)
require(directlabels)
time.data<-structure(list(Attribute = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L), .Label = c("Taste 1", "Taste 2", "Taste 3",
"Use 1", "Use 2", "Use 3"), class = "factor"), Attribute.Category = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Nutritional/Usage",
"Taste/Quality"), class = "factor"), Attribute.Order = c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), Category.Order = c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), Color = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("#084594",
"#2171B5", "#4292C6", "#6A51A3", "#807DBA", "#9E9AC8"), class = "factor"),
value = c(75L, 78L, 90L, 95L, 82L, 80L, 43L, 40L, 25L, 31L,
84L, 84L), Date2 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L), .Label = c("1/1/2013", "9/1/2012"), class = "factor")), .Names = c("Attribute",
"Attribute.Category", "Attribute.Order", "Category.Order", "Color",
"value", "Date2"), class = "data.frame", row.names = c(NA, -12L
))
label.data<-structure(list(7:12, Attribute = structure(1:6, .Label = c("Taste 1",
"Taste 2", "Taste 3", "Use 1", "Use 2", "Use 3"), class = "factor"),
Attribute.Category = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Nutritional/Usage",
"Taste/Quality"), class = "factor"), Attribute.Order = 1:6,
Category.Order = c(1L, 1L, 1L, 2L, 2L, 2L), Color = structure(1:6, .Label = c("#084594",
"#2171B5", "#4292C6", "#6A51A3", "#807DBA", "#9E9AC8"), class = "factor"),
Significance = structure(c(2L, 3L, 1L, 1L, 3L, 2L), .Label = c("neg",
"neu", "pos"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1/1/2013", class = "factor"),
value = c(78L, 95L, 80L, 40L, 31L, 84L), Date2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "2013-01-01", class = "factor"),
label.color = structure(c(1L, 2L, 3L, 3L, 2L, 1L), .Label = c("black",
"forestgreen", "red"), class = "factor")), .Names = c("",
"Attribute", "Attribute.Category", "Attribute.Order", "Category.Order",
"Color", "Significance", "variable", "value", "Date2", "label.color"
), class = "data.frame", row.names = c(NA, -6L))
color.palette<-as.character(unique(time.data$Color))
time.data$Date2<-as.Date(time.data$Date2,format="%m/%d/%Y")
plot<-ggplot()+
geom_line(data=time.data,aes(as.numeric(time.data$Date2),time.data$value,group=time.data$Attribute,color=time.data$Color),size=1)+
geom_text(data=label.data,aes(x=Inf, y=label.data$value, label=paste(" ",label.data$Attribute)),
color=label.data$label.color,
size=4,vjust=0, hjust=0,na.rm=T)+
facet_grid(Attribute.Category~.,space="free")+
theme_bw()+
scale_x_continuous(breaks=as.numeric(unique(time.data$Date2)),labels=format(unique(time.data$Date2),format = "%b %Y"))+
theme(strip.background=element_blank(),
strip.text.y=element_blank(),
legend.text=element_blank(),
legend.title=element_blank(),
plot.margin=unit(c(1,5,1,1),"cm"),
legend.position="none")+
scale_colour_manual(values=color.palette)
gt3 <- ggplot_gtable(ggplot_build(plot))
gt3$layout$clip[gt3$layout$name == "panel"] <- "off"
grid.draw(gt3)
Some problems:
Inside your aesthetic declarations, you should not be referencing the data columns as time.data$Date2, but just as Date2. The data argument specifies where to look for that information (which needs to all be in the same data.frame for a given layer, but, as you take advantage of, can vary layer to layer).
In the geom_text call, color was not inside the aes call; if you are mapping it to data which is in the data.frame, you have to have it inside the aes call. This would throw a different error after fixing the first part because then it would not be able to find label.color anywhere because it would not know to look inside label.data.
Fixing those, then the scale_colour_manual complains that there are 9 colors and you have only supplied 6. That is because there are 6 colors from the lines and 3 from the text. Since you specified these as actual color names, you can just use scale_colour_identity.
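The count of 9 can be checked directly from the data in the question. A small base-R sketch, with the colour values copied from the Color and label.color levels shown in the dput output:

```r
# Levels of time.data$Color (used for the lines)
line_cols  <- c("#084594", "#2171B5", "#4292C6", "#6A51A3", "#807DBA", "#9E9AC8")
# Levels of label.data$label.color (used for the text)
label_cols <- c("black", "forestgreen", "red")

# Once both layers map to colour, the shared scale sees every distinct value
all_cols <- union(line_cols, label_cols)
length(all_cols)  # 9

# Every value is already a colour R understands, which is why an identity
# scale can use them as-is; col2rgb() would error on an invalid colour
ncol(grDevices::col2rgb(all_cols))  # 9
```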
Putting this all together:
plot <- ggplot()+
geom_line(data=time.data, aes(as.numeric(Date2), value,
group=Attribute, color=Color),
size=1)+
geom_text(data=label.data, aes(x=Inf, y=value,
label=paste(" ",Attribute),
color=label.color),
size=4,vjust=0, hjust=0)+
facet_grid(Attribute.Category~.,space="free") +
scale_x_continuous(breaks=as.numeric(unique(time.data$Date2)),
labels=format(unique(time.data$Date2),format = "%b %Y")) +
scale_colour_identity() +
theme_bw()+
theme(strip.background=element_blank(),
strip.text.y=element_blank(),
legend.text=element_blank(),
legend.title=element_blank(),
plot.margin=unit(c(1,5,1,1),"cm"),
legend.position="none")
gt3 <- ggplot_gtable(ggplot_build(plot))
gt3$layout$clip[gt3$layout$name == "panel"] <- "off"
grid.draw(gt3)
To get an idea how much you can strip down your example, this is much closer to minimal:
time.data <-
structure(list(Attribute = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L), .Label = c("Taste 1", "Taste 2", "Use 1", "Use 2"), class = "factor"),
Attribute.Category = structure(c(2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L), .Label = c("Nutritional/Usage", "Taste/Quality"), class = "factor"),
Color = c("#084594", "#084594", "#2171B5", "#2171B5", "#6A51A3",
"#6A51A3", "#807DBA", "#807DBA"), value = c(75L, 78L, 90L,
95L, 43L, 40L, 25L, 31L), Date2 = structure(c(15584, 15706,
15584, 15706, 15584, 15706, 15584, 15706), class = "Date")), .Names = c("Attribute",
"Attribute.Category", "Color", "value", "Date2"), row.names = c(NA,
-8L), class = "data.frame")
label.data <-
structure(list(value = c(78L, 95L, 40L, 31L), Attribute = structure(1:4, .Label = c("Taste 1",
"Taste 2", "Use 1", "Use 2"), class = "factor"), label.color = c("black",
"forestgreen", "red", "forestgreen"), Attribute.Category = structure(c(2L,
2L, 1L, 1L), .Label = c("Nutritional/Usage", "Taste/Quality"), class = "factor"),
Date2 = structure(c(15706, 15706, 15706, 15706), class = "Date")), .Names = c("value",
"Attribute", "label.color", "Attribute.Category", "Date2"), row.names = c(NA,
-4L), class = "data.frame")
ggplot() +
geom_line(data = time.data,
aes(x=Date2, y=value, group=Attribute, colour=Color)) +
geom_text(data = label.data,
aes(x=Date2, y=value, label=Attribute, colour=label.color),
hjust = 1) +
facet_grid(Attribute.Category~.) +
scale_colour_identity()
The theme stuff (and making the labels visible outside the plot) isn't relevant to the question, nor are the x-axis conversions from Date to numeric to handle having Inf. I also trimmed the data to just the needed columns, and reduced each categorical variable to only two categories.