I need some help. I'm using R to analyze some data, and I have a frequency table called mytable that I created like this:
mytable=table(cut(var1,12),cut(var2,12))
the table looks something like this:
    1-2 2-3 3-4
1-3   2   1   2
3-6   0   1   4
6-9   7   1   8
except that it is a 12 by 12 table.
I used boxplot.matrix(mytable), and the boxplot looks OK, with the 12 boxes corresponding to my 12 strata. However, the boxplot has the frequency on the y-axis, and I want the y-axis to show the values from var1. How can I do this?
I wanted to post a picture, but my rep wasn't high enough.
Use boxplot before you summarize your data.
boxplot(var1)
If you want to see the distribution per split, use the formula format:
boxplot(var1 ~ cut(var2, 12))
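To make the formula approach concrete, here is a minimal sketch on simulated data. The values of var1 and var2 below are invented purely for illustration (the question does not show the real data), and the axis labels are my own choice.

```r
# Simulated stand-ins for the asker's variables (assumption: both are numeric)
set.seed(1)
var1 <- rnorm(200, mean = 50, sd = 10)
var2 <- runif(200, min = 0, max = 100)

# Split var2 into 12 equal-width strata, as in the question
strata <- cut(var2, 12)

# One box per stratum; the y-axis now shows the raw var1 values,
# not the frequencies from the table
bp <- boxplot(var1 ~ strata,
              xlab = "var2 strata", ylab = "var1", las = 2)

# bp$n holds the number of observations behind each box
bp$n
```

The point is that the boxplot is drawn from the raw observations, so the table built with table(cut(...), cut(...)) is not needed for this plot at all.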
I would like to add a column to my dataframe that contains categorical data based on numbers in another column. I found a similar question at Create categorical variable in R based on range, but the answers there didn't give me what I need. Basically, I need a result like this:
x group
3 0-5
4 0-5
6 6-10
12 > 10
The solutions suggested using cut() and shingle(), and while those are useful for dividing the data based on ranges, they do not create the new categorical column that I need.
I have also tried using something like (please don't laugh)
data$group <- "0-5"==data[data$x>0 & data$x<5, ]
but that of course didn't work. Does anyone know how I might do this correctly?
Why didn't cut work? Did you not assign to a new column or something?
> data=data.frame(x=c(3,4,6,12))
> data$group = cut(data$x,c(0,5,10,15))
> data
x group
1 3 (0,5]
2 4 (0,5]
3 6 (5,10]
4 12 (10,15]
What you've created there is a factor object in a column of your data frame. The text displayed is the levels of the factor, and you can change them by assignment:
levels(data$group) = c("0-5","6-10",">10")
data
x group
1 3 0-5
2 4 0-5
3 6 6-10
4 12 >10
Read some basic R docs on factors and you'll get it.
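As a variation on the answer above, the labels can also be supplied directly to cut() through its labels argument, so the levels do not have to be renamed afterwards. This is only a sketch: using Inf as the upper break is my assumption about what "> 10" should cover.

```r
data <- data.frame(x = c(3, 4, 6, 12))

# Intervals are (0,5], (5,10] and (10,Inf]; labels name them in that order
data$group <- cut(data$x,
                  breaks = c(0, 5, 10, Inf),
                  labels = c("0-5", "6-10", ">10"))

data
```

Note that with the default right = TRUE a value of exactly 5 falls into "0-5" and a value of 0 becomes NA; add include.lowest = TRUE if 0 should be included.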
I have conducted a study with triplicates (SampleID) for each sample (Sample) at different time points.
Now, I want to plot the means of the triplicates for the characteristic "Aerobic".
For example, I want to plot how the amount of aerobic bacteria develops over time. To do so, I need to calculate the means (and the standard deviations) of the triplicates and then plot these means. I imagine using geom_line or geom_point for this.
SampleID Sample Aerobic Anaerobic Day
[Factor] [Factor] [num] [num] [num]
1 V1.1.K1 V1.1.K 0.610063430 0.05146154 1
2 V1.1.K2 V1.1.K 0.740887757 0.02115290 1
3 V1.1.K3 V1.1.K 0.683726217 0.04270182 1
4 V1.1.N1 V1.1.N 0.432019752 0.35722350 1
5 V1.1.N2 V1.1.N 0.515792694 0.41357935 1
6 V1.14.K16 V1.14.K 0.038141335 0.84496088 14
7 V1.14.K17 V1.14.K 0.042078682 0.76523093 14
8 V1.14.K18 V1.14.K 0.009594763 0.90767637 14
9 V1.14.N0 V1.14.N 0.513100502 0.10618731 14
10 V1.14.W16 V1.14.W 0.483710571 0.32765968 14
How should I do this?
I tried it with the following code:
plot <- mydata %>%
group_by(Sample) %>%
mutate(Mean=mean(Aerobic)) %>%
ggplot(aes(x=Day, y=Aerobic)) +
geom_point()
When I google the question, I only find information on how to calculate a mean on its own, not on how to build a new table with the means of the different variables.
Is there something like
calc_mean_by_group ??
You would help me a lot :)
Simple base-R solution for calculating the means:
tapply(X = foo$Aerobic, INDEX = foo$Sample, FUN = mean)
("foo" being the name of your data.frame)
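If a new table with one row per group is wanted (mean and standard deviation together), aggregate() is one base-R option. The sketch below uses a few rows copied from the question; the column names Aerobic_mean and Aerobic_sd are my own choice, not something from the original post.

```r
# A subset of the rows shown in the question
foo <- data.frame(
  Sample  = c("V1.1.K", "V1.1.K", "V1.1.K", "V1.14.K", "V1.14.K", "V1.14.K"),
  Aerobic = c(0.610063430, 0.740887757, 0.683726217,
              0.038141335, 0.042078682, 0.009594763),
  Day     = c(1, 1, 1, 14, 14, 14)
)

# Mean and standard deviation of the triplicates per Sample and Day
means <- aggregate(Aerobic ~ Sample + Day, data = foo, FUN = mean)
sds   <- aggregate(Aerobic ~ Sample + Day, data = foo, FUN = sd)

# Combine into one summary table; the suffixes distinguish the two Aerobic columns
summary_df <- merge(means, sds, by = c("Sample", "Day"),
                    suffixes = c("_mean", "_sd"))
summary_df
```

The resulting summary_df has one row per Sample/Day combination and can be handed to a plotting function, for example with Day on the x-axis and Aerobic_mean on the y-axis.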
I am trying to create a line plot in R. For each 'RuleID' in my data frame I want to plot the 'ErrorCount' at each 'ProcessorTimeStamp'
DQ_Counts= data.frame(RuleID=c(1,2,1,2),
ProcessorTimeStamp=as.Date(c('2016-08-04','2016-08-04','2016-08-08','2016-08-08')),
ErrorCount=c(6,8,3,4))
# RuleID ProcessorTimeStamp ErrorCount
# 1 1 2016-08-04 6
# 2 2 2016-08-04 8
# 3 1 2016-08-08 3
# 4 2 2016-08-08 4
This is a plot I found online that I would like the end result to look like, although I am obviously not talking about trees. The code for this plot is here Code for Tree Growth Plot but I don't understand it well enough to make it work for me.
For my plot 'ProcessorTimeStamp' would be my x and 'ErrorCount' would be my y. Each line would represent a different 'RuleID'.
One thing to note is that I have 'ErrorCounts' ranging from 0 to over 3 million (this is why I need to report on them to get them fixed!).
Thanks in advance.
This is probably the easiest way to get a basic plot like the one above with your data
lattice::xyplot(ErrorCount~ProcessorTimeStamp, DQ_Counts,
groups=RuleID, auto.key=T, type="l")
Which returns
or you could use ggplot2
library(ggplot2)
ggplot(DQ_Counts, aes(ProcessorTimeStamp, ErrorCount, color=factor(RuleID))) + geom_line()
to get
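For completeness, the same kind of plot can be sketched in base R without any extra packages, by splitting the data frame by RuleID and drawing one line per piece. The colours and the legend position are arbitrary choices of mine.

```r
DQ_Counts <- data.frame(
  RuleID = c(1, 2, 1, 2),
  ProcessorTimeStamp = as.Date(c("2016-08-04", "2016-08-04",
                                 "2016-08-08", "2016-08-08")),
  ErrorCount = c(6, 8, 3, 4)
)

# Empty plot that only sets up the axes over the full data range
plot(DQ_Counts$ProcessorTimeStamp, DQ_Counts$ErrorCount, type = "n",
     xlab = "ProcessorTimeStamp", ylab = "ErrorCount")

# One data frame per RuleID, each drawn as its own line in time order
by_rule <- split(DQ_Counts, DQ_Counts$RuleID)
for (i in seq_along(by_rule)) {
  d <- by_rule[[i]][order(by_rule[[i]]$ProcessorTimeStamp), ]
  lines(d$ProcessorTimeStamp, d$ErrorCount, col = i)
}

legend("topright", legend = names(by_rule),
       col = seq_along(by_rule), lty = 1, title = "RuleID")
```

With counts ranging from 0 to several million, a linear y-axis will flatten the small series; whether a transformed axis is appropriate depends on how the zeros should be shown.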
Consider the following frequency data:
> table(income)
income
3 5 6 7 8 5000
2 7 2 2 2 1
When I type >hist(income) I get the following histogram
As you can see, because most income values are concentrated around 5 and one value is quite distant from the others, the histogram does not look very good. MS Excel can treat the 5000 value as belonging to a separate category, so the data would look like this instead:
> table(income)
income
3 5 6 7 8 more
2 7 2 2 2 1
So plotting this as a histogram would look much better, so you can see the frequency within a shorter range:
Is there any way to do this with the hist() function or with other functions from lattice or ggplot2? However, I don't want to overwrite the values that exceed a certain threshold, so that I don't lose any information.
Thanks a lot!
Data generation:
income <- c(rep(3,2), rep(5,7), rep(6,2), rep(7,2), rep(8,2), 5000)
Function for preparing data for plotting:
nice.data <- function(x, threshold=10){
x[x>threshold] <- "More"
x
}
Plotting:
library(ggplot2)
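# nice.data() returns character labels, so a bar chart of counts is used;
# current ggplot2 versions reject a discrete x in geom_histogram()
ggplot() + geom_bar(aes(x=nice.data(income))) + xlab("Income")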
Result:
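A base-R sketch of the same idea is given below. It builds a factor with explicit levels so that the bars stay in numeric order with "More" last, which plain character labels would not guarantee (for example "10" sorts before "3"). The threshold of 10 follows the answer's default; the variable names are mine. The original income vector is left untouched.

```r
income <- c(rep(3, 2), rep(5, 7), rep(6, 2), rep(7, 2), rep(8, 2), 5000)
threshold <- 10

# Label everything above the threshold as "More"; income itself is not modified
labels <- ifelse(income > threshold, "More", as.character(income))

# Levels in numeric order, with the overflow category at the end
lvls <- c(sort(unique(income[income <= threshold])), "More")
binned <- factor(labels, levels = lvls)

counts <- table(binned)
barplot(counts, xlab = "Income", ylab = "Frequency")
```

Because this is a bar chart of categories rather than a true histogram, the spacing between bars does not reflect the numeric distance between income values.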
I cannot seem to figure out how to get a nice barplot that contains the data from two tables that have a different number of columns.
The tables in question are something like (snipped some data from the end):
> tab1
1 2 3 6 8 31
5872 1525 831 521 299 4
> tab2
1 2 3 4 22
7874 422 2 5 1
Note the column names and sizes are different. When I just do barplot() on one of these tables it comes out with the plot I'd like (showing the column names as the X-axis, frequencies on Y-axis). But, I would like these two side by side.
I've gotten as far as creating a data frame containing both variables as columns and the different row names in the first column (with data.frame() and merge()), but when I plot this the X-axis seems to be all wrong. Attempting to reorder the columns gives me an exception about lengths differing.
Code:
combined <- merge(data.frame(tab1), data.frame(tab2), by = c('Var1'), all=T)
barplot(t(combined[,2:3]), names.arg = combined[,1], beside=T)
This shows a plot, but not all labels are present and the value for position 26 is plotted after 33.
Is there any simple way to get this plot working? A ggplot2 solution would be nice.
You can put all your data in one data frame (as in example).
df<-data.frame(group=rep(c("A","B"),times=c(2,3)),
values=c(23,56,345,6,7),xval=c(1,2,1,2,8))
group values xval
1 A 23 1
2 A 56 2
3 B 345 1
4 B 6 2
5 B 7 8
Then ggplot() with geom_bar() can be used to plot the data.
library(ggplot2)
ggplot(df,aes(xval,values,fill=group))+
geom_bar(stat="identity",position="dodge")
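To connect this back to the question, here is a sketch of how the two tables could be stacked into one long data frame of that shape. The tables are rebuilt from the (truncated) values shown in the question, and the names long and wide are my own. The final plot uses base barplot() so that the example needs no extra packages; the long data frame has the same group/values/xval columns as df above.

```r
# The two frequency tables, rebuilt from the values shown in the question
tab1 <- as.table(c("1" = 5872, "2" = 1525, "3" = 831, "6" = 521, "8" = 299, "31" = 4))
tab2 <- as.table(c("1" = 7874, "2" = 422, "3" = 2, "4" = 5, "22" = 1))

# Stack both tables into one long data frame with a group column
long <- rbind(
  data.frame(group = "tab1", xval = as.numeric(names(tab1)), values = as.vector(tab1)),
  data.frame(group = "tab2", xval = as.numeric(names(tab2)), values = as.vector(tab2))
)

# Treat xval as a category, with levels sorted numerically rather than as text
long$xval <- factor(long$xval, levels = sort(unique(long$xval)))

# Wide matrix (groups x categories); missing combinations become 0
wide <- xtabs(values ~ group + xval, data = long)
barplot(wide, beside = TRUE, legend.text = TRUE)
```

Sorting the factor levels numerically is what keeps the x-axis in the expected order, which a plain merge on the text labels does not do.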