Does ggplot2 exclude some data? - r

I want to create some basic grouped barplots with ggplot2 but it seems to exclude some data. If I review my input data everything is there, but some bars are missing and it is also messing with the error bars. I tried to convert into multiple variable types, regrouped, loaded again, saved everything in .csv and loaded all new... I just don't know what is wrong.
Here is my code:
library(ggplot2)
limits <- aes(ymax = DataCm$mean + DataCm$sd,
ymin = DataCm$mean - DataCm$sd)
p <- ggplot(data = DataCm, aes(x = factor(DataCm$Zeit), y = factor(DataCm$mean)
) )
p + geom_bar(stat = "identity",
position = position_dodge(0.9),fill =DataCm$group) +
geom_errorbar(limits, position = position_dodge(0.9),
width = 0.25) +
labs(x = "Time [min]", y = "Individuals per foodsource")
This is DataCm:
Zeit mean sd group
1 30 0.1 0.3162278 1
2 60 0.0 0.0000000 2
3 90 0.1 0.3162278 3
4 120 0.0 0.0000000 4
5 150 0.1 0.3162278 5
6 180 0.1 0.3162278 6
7 240 0.3 0.6749486 1
8 300 0.3 0.6749486 2
9 360 0.3 0.6749486 3
10 30 0.1 0.3162278 4
11 60 0.1 0.3162278 5
12 90 0.2 0.4216370 6
13 120 0.3 0.4830459 1
14 150 0.3 0.4830459 2
15 180 0.4 0.5163978 3
16 240 0.3 0.4830459 4
17 300 0.4 0.5163978 5
18 360 0.4 0.5163978 6
19 30 1.2 1.1352924 1
20 60 1.8 1.6865481 2
21 90 2.2 2.0976177 3
22 120 2.2 2.0976177 4
23 150 2.0 1.8856181 5
24 180 2.3 1.9465068 6
25 240 2.4 2.0655911 1
26 300 2.1 1.8529256 2
27 360 2.0 2.1602469 3
28 30 0.2 0.4216370 4
29 60 0.1 0.3162278 5
30 90 0.1 0.3162278 6
31 120 0.1 0.3162278 1
32 150 0.0 0.0000000 2
33 180 0.1 0.3162278 3
34 240 0.1 0.3162278 4
35 300 0.1 0.3162278 5
36 360 0.1 0.3162278 6
37 30 1.3 1.5670212 1
38 60 1.5 1.5811388 2
39 90 1.5 1.7159384 3
40 120 1.5 1.9002924 4
41 150 1.9 2.1317703 5
42 180 1.9 2.1317703 6
43 240 2.2 2.3475756 1
44 300 2.4 2.3190036 2
45 360 2.2 2.1499354 3
46 30 2.1 2.1317703 4
47 60 3.0 2.2110832 5
48 90 3.3 2.1628171 6
49 120 3.2 2.1499354 1
50 150 3.4 2.6331224 2
51 180 3.5 2.4152295 3
52 240 3.7 2.6267851 4
53 300 3.7 2.4060110 5
54 360 3.8 2.6583203 6
The output is:
Maybe you can help me. Thanks in advance!
Best wishes,
Benjamin
Solved it:
I reshaped everything in Excel and exported it another way. The group variable was also not the way I wanted it. Now it is fixed, but I can't really tell you why.

Your data looks malformed. I guess you wanted to have 6 different group values for each time point, but now the group variable just loops over, and you have:
1 30 0.1 0.3162278 1
...
10 30 0.1 0.3162278 4
...
19 30 1.2 1.1352924 1
...
28 30 0.2 0.4216370 4
geom_bar then probably omits rows that have identical mean and time. Although I am not sure why it chooses to do so, you should solve the group problem first anyway.

Related

Roll join with no duplicates in R

I have two tables coming from devices that gather data with different sampling frequencies. One device samples every 30 seconds, the other is roughly 30 and sometimes drops measurements (example sequence might be 31, 61, 95, 151, notice how it missed the sample around ~120). My original data.frame contains a datetime instead of the number of seconds but the toy data should work to illustrate.
q1 <-
read.table(text="
A 0 1.1
A 30 1.2
A 90 1.3
A 120 1.4
B 15 -5
B 45 -3
B 75 -3.5
C 10 0
C 40 -1.4
C 70 -1")
q2 <-
read.table(text="
A 10 10.1
A 40 10.2
A 110 10.4
B 30 -50
B 90 -30
C 5 0
C 35 -10.4
C 76 -10")
names(q1) <- c("key","datetime","x")
names(q2) <- c("key","timepoint","y")
# create a joint_time to keep the originals in place
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
If I try to join by nearest, I get
# set the keys
data.table::setkey(data.table::setDT(q1), key, joint_time)
data.table::setkey(data.table::setDT(q2), key, joint_time)
q2[q1, roll="nearest"]
Notice the duplicates on row 4 and 6.
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A 110 10.4 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B 30 -50.0 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
My ideal output would join by nearest but fill with NA instead of duplicate on y values.
key timepoint y joint_time datetime x
1: A 10 10.1 0 0 1.1
2: A 40 10.2 30 30 1.2
3: A 110 10.4 90 90 1.3
4: A NA NA 120 120 1.4
5: B 30 -50.0 15 15 -5.0
6: B NA NA 45 45 -3.0
7: B 90 -30.0 75 75 -3.5
8: C 5 0.0 10 10 0.0
9: C 35 -10.4 40 40 -1.4
10: C 76 -10.0 70 70 -1.0
I'm fine with doing the join first and then finding the duplicates and changing them to NA. I will later try to interpolate the y variable there. Not sure if there's a direct way to do the join and fill with NA or if it has to be done a posteriori.
Here's what I ended up doing, I don't think it's awesome but as far as I can see, it is working as expected.
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
# create a sample id using the key since the data is grouped
q2$sample_id <- paste0(q2$key, as.character(1:nrow(q2)))
# Join
res <- q2[q1, roll="nearest"]
# fill with NAs
res %>% mutate_at(vars(y,timepoint), ~ifelse(duplicated(sample_id), NA, .))
Which produces
key timepoint y joint_time sample_id datetime x
1: A 10 10.1 0 A1 0 1.1
2: A 40 10.2 30 A2 30 1.2
3: A 110 10.4 90 A3 90 1.3
4: A NA NA 120 A3 120 1.4
5: B 30 -50.0 15 B4 15 -5.0
6: B NA NA 45 B4 45 -3.0
7: B 90 -30.0 75 B5 75 -3.5
8: C 5 0.0 10 C6 10 0.0
9: C 35 -10.4 40 C7 40 -1.4
10: C 76 -10.0 70 C8 70 -1.0

make grouping bar plot ggplot2

I'm new to ggplot2 but trying to use it. I have two variables: SA( with 4 levels :0, 1000,2000 and 3000) and GA (with 4 levels:0, 0.5,1 and 2). I would like to group these by SA (like this Figure)
> G<- read.table("k.csv", sep=";",header = TRUE)
> G
SA GA PH
1 0 0.0 41
2 0 0.0 27
3 0 0.0 28
4 0 0.0 25
5 0 0.5 35
6 0 0.5 45
7 0 0.5 35
8 0 0.5 55
9 0 1.0 45
10 0 1.0 35
11 0 1.0 38
12 0 1.0 46
13 0 2.0 52
14 0 2.0 40
15 0 2.0 40
16 0 2.0 35
17 1000 0.0 30
18 1000 0.0 30
19 1000 0.0 30
20 1000 0.0 30
21 1000 0.5 28
22 1000 0.5 33
23 1000 0.5 31
24 1000 0.5 42
25 1000 1.0 38
26 1000 1.0 30
27 1000 1.0 27
28 1000 1.0 25
29 1000 2.0 30
30 1000 2.0 22
31 1000 2.0 31
32 1000 2.0 44
33 2000 0.0 18
34 2000 0.0 25
35 2000 0.0 24
36 2000 0.0 31
37 2000 0.5 24
38 2000 0.5 22
39 2000 0.5 36
40 2000 0.5 40
41 2000 1.0 27
42 2000 1.0 29
43 2000 1.0 42
44 2000 1.0 33
45 2000 2.0 20
46 2000 2.0 40
47 2000 2.0 30
48 2000 2.0 25
49 3000 0.0 0
50 3000 0.0 0
51 3000 0.0 0
52 3000 0.0 0
53 3000 0.5 24
54 3000 0.5 20
55 3000 0.5 25
56 3000 0.5 NA
57 3000 1.0 37
58 3000 1.0 NA
59 3000 1.0 38
60 3000 1.0 25
61 3000 2.0 24
62 3000 2.0 15
63 3000 2.0 20
64 3000 2.0 32
> ggplot(G, aes(x=SA, y=PH, fill=factor(GA))) +
stat_summary(geom="bar",positiGon=position_dodge(1))
but it does not give me what I need. It gives me something different (here it is)
Also, I would like to add error bar to the bars.
Any ideas?
Solution where I use data.table to calculate standard error and mean.
library(data.table)
library(ggplot2)
setDT(G)
pd <- G[, .(SE = sd(PH, na.rm = TRUE) / sqrt(.N),
MN = mean(PH, na.rm = TRUE)),
.(SA, GA)]
ggplot(pd, aes(factor(SA), fill = factor(GA))) +
geom_bar(aes(y = MN),
stat = "identity", position = "dodge") +
geom_errorbar(aes(ymin = MN - SE, ymax = MN + SE),
position = "dodge") +
labs(x = "SA",
y = "PH",
fill = "GA") +
theme_classic()
library(tidyr)
std.error <- function(x) sd(x)/sqrt(length(x))
means <- function(x)mean(x, na.rm=TRUE)
df2 <- G %>%
group_by(SA, GA) %>%
mutate(error=std.error(PH)) %>%
summarise_at(vars(PH:error), funs(means))
ggplot(df2, aes(as.factor(SA), PH, fill=as.factor(GA))) +
geom_bar(stat="identity", position="dodge") +
geom_errorbar(aes(ymin=PH-error, ymax=PH+error),
width=.2, position=position_dodge(.9))

Wrong Fit using nls function

When I try to fit an exponential decay and my x axis has decimal number, the fit is never correct. Here's my data below:
exp.decay = data.frame(time,counts)
time counts
1 0.4 4458
2 0.6 2446
3 0.8 1327
4 1.0 814
5 1.2 549
6 1.4 401
7 1.6 266
8 1.8 182
9 2.0 140
10 2.2 109
11 2.4 83
12 2.6 78
13 2.8 57
14 3.0 50
15 3.2 31
16 3.4 22
17 3.6 23
18 3.8 20
19 4.0 19
20 4.2 9
21 4.4 7
22 4.6 4
23 4.8 6
24 5.0 4
25 5.2 6
26 5.4 2
27 5.6 7
28 5.8 2
29 6.0 0
30 6.2 3
31 6.4 1
32 6.6 1
33 6.8 2
34 7.0 1
35 7.2 2
36 7.4 1
37 7.6 1
38 7.8 0
39 8.0 0
40 8.2 0
41 8.4 0
42 8.6 1
43 8.8 0
44 9.0 0
45 9.2 0
46 9.4 1
47 9.6 0
48 9.8 0
49 10.0 1
fit.one.exp <- nls(counts ~ A*exp(-k*time),data=exp.decay, start=c(A=max(counts),k=0.1))
plot(exp.decay, col='darkblue',xlab = 'Track Duration (seconds)',ylab = 'Number of Particles', main = 'Exponential Fit')
lines(predict(fit.one.exp), col = 'red', lty=2, lwd=2)
I always get this weird fit. Seems to me that the fit is not recognizing the right x axis, because when I use a different set of data, with only integers in the x axis (time) the fit works! I don't understand why it's different with different units.
You need one small modification:
lines(predict(fit.one.exp), col = 'red', lty=2, lwd=2)
should be
lines(exp.decay$time, predict(fit.one.exp), col = 'red', lty=2, lwd=2)
This way you make sure to plot against the desired values on your abscissa.
I tested it like this:
data = read.csv('exp_fit_r.csv')
A0 <- max(data$count)
k0 <- 0.1
fit <- nls(data$count ~ A*exp(-k*data$time), start=list(A=A0, k=k0), data=data)
plot(data)
lines(data$time, predict(fit), col='red')
which gives me the following output:
As you can see, the fit describes the actual data very well, it was just a matter of plotting against the correct abscissa values.

comparing recent averaged values to a current value in R

I am using Rstudio (version .99.903), have a PC (windows 8). I have a follow up question from yesterday as the problem became more complicated. Here is what the data looks like:
Number Trial ID Open date Enrollment rate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0
What I need to do is compare the enrollment rate of the most current date within a given ID to the average of those values that are up to one year prior to it. For instance, for ID 53, the date of 1/19/2011 has an enrollment rate of 0.2 and I would want to compare this against the average of 8/17/2010 and 5/12/2010 enrollment rates (e.g., 0.15).
If there are no other dates within the ID prior to the current one, then the comparison should not be made. For instance, for ID 26, there would be no comparison. Similarly, for ID 53, there would be no comparison for 5/12/2010.
When I say "compare" I am not doing any analysis or visualization. I simply want a new column that takes the average value of those enrollment rates up to one year prior to the current one (I will be plotting them and percentile ranking them later). There are >20,000 data points. Any help would be much appreciated.
Verbose but possibly high performance way of doing this. No giant for loops looping over all the rows of the data frame. The two sapply loops only operate on a big numeric vector, which should be relatively quick regardless of your data row count. But I'm sure someone will waltz in with a trivial dplyr solution soon enough.
Approach assumes that your data is first sorted by ID then by Opendata. If they are not sorted, you need to sort them first.
# Find indices where the same ID is above and below it
A = which(unlist(sapply(X = rle(df$ID)$lengths,
FUN = function(x) {if(x == 1) return(F)
if(x == 2) return(c(F,F))
if(x >= 3) return(c(F,rep(T, x-2),F))})))
# Store list of date, should speed up code a tiny bit
V_opendate = df$Opendate
# Further filter on A, where the date difference < 365 days
B = A[sapply(A, function(x) (abs(V_opendate[x]-V_opendate[x-1]) < 365) & (abs(V_opendate[x]-V_opendate[x+1]) < 365))]
# Return actual indices of rows - 1, rows +1
C = sapply(B, function(x) c(x-1, x+1), simplify = F)
# Actually take the mean of these cases
D = sapply(C, function(x) mean(df[x,]$Enrollment))
# Create new column rate and fill in with value of C. You can do the comparison from here.
df[B,"Rate"] = D
Number Trial ID Opendate Enrollmentrate Rate
1 420 NCT00091442 9 2005-01-28 0.2 NA
2 1476 NCT00301457 26 2008-02-22 1.0 NA
3 10559 NCT01307397 34 2011-07-28 0.6 NA
4 6794 NCT00948675 53 2010-05-12 0.0 NA
5 6451 NCT00917384 53 2010-08-17 0.3 0.10
6 8754 NCT01168973 53 2011-01-19 0.2 1.35
7 8578 NCT01140347 53 2011-12-30 2.4 0.25
8 11655 NCT01358877 53 2012-04-02 0.3 NA
9 428 NCT00091442 55 2005-09-07 0.1 NA
10 112 NCT00065325 62 2003-10-15 0.2 NA
11 477 NCT00091442 62 2005-11-11 0.1 NA
12 16277 NCT01843374 62 2013-12-16 0.2 NA
13 17386 NCT01905657 62 2014-01-08 0.6 NA
14 411 NCT00091442 66 2005-01-12 0.0 NA
14 411 NCT00091442 66 1/12/2005 0.00 NA
The relevant rows are calculated. You can do your comparison with the newly created Rate column.
You might have to change the code a little since I changed removed the space in the column names
df = read.table(text = " Number Trial ID Opendate Enrollmentrate
420 NCT00091442 9 1/28/2005 0.2
1476 NCT00301457 26 2/22/2008 1
10559 NCT01307397 34 7/28/2011 0.6
6794 NCT00948675 53 5/12/2010 0
6451 NCT00917384 53 8/17/2010 0.3
8754 NCT01168973 53 1/19/2011 0.2
8578 NCT01140347 53 12/30/2011 2.4
11655 NCT01358877 53 4/2/2012 0.3
428 NCT00091442 55 9/7/2005 0.1
112 NCT00065325 62 10/15/2003 0.2
477 NCT00091442 62 11/11/2005 0.1
16277 NCT01843374 62 12/16/2013 0.2
17386 NCT01905657 62 1/8/2014 0.6
411 NCT00091442 66 1/12/2005 0", header = T)

Adding information on a graph using R

I would like to add some information on my graph which was plotted from this data set:
EDITTED:
#data set:
day <- c(0:28)
ndied <- c(342,335,240,122,74,64,49,60,51,44,35,48,41,34,38,27,29,23,20,15,20,16,17,17,14,10,4,1,2)
pdied <- c(19.1,18.7,13.4,6.8,4.1,3.6,2.7,3.3,2.8,2.5,2.0,2.7,2.3,1.9,2.1,1.5,1.6,1.3,1.1,0.8,1.1,0.9,0.9,0.9,0.8,0.6,0.2,0.1,0.1)
pmort <- data.frame(day,ndied,pdied)
> pmort
day ndied pdied
1 0 342 19.1
2 1 335 18.7
3 2 240 13.4
4 3 122 6.8
5 4 74 4.1
6 5 64 3.6
7 6 49 2.7
8 7 60 3.3
9 8 51 2.8
10 9 44 2.5
11 10 35 2.0
12 11 48 2.7
13 12 41 2.3
14 13 34 1.9
15 14 38 2.1
16 15 27 1.5
17 16 29 1.6
18 17 23 1.3
19 18 20 1.1
20 19 15 0.8
21 20 20 1.1
22 21 16 0.9
23 22 17 0.9
24 23 17 0.9
25 24 14 0.8
26 25 10 0.6
27 26 4 0.2
28 27 1 0.1
29 28 2 0.1
I have put together this script and still trying to improve on it so that the rest of the information can be added:
> barplot(pmort$pdied,xlab="Age(days)",ylab="Percent",xlim=c(0,28),ylim=c(0,20),legend="Mortality")
I am trying to insert the numbers 0 to 28 (age in days) on the x-axis but could not and I know that it could be a simple script. Secondly, I would like to add the number died or ndied (342 to 2) below each day(0 to 28) along the x-axis.
Example:
0 1 2 3 4 5 and so on...
(N=342) (N=335) (N=240) (N=122) (N=74) (N=64)
Graph:
Any help would be appreciated.
Baz
I gave you two ways to plot the info: one above the bars and one below. You can tweak it to meet your needs.
barX <- barplot(pmort$pdied,xlab="Age(days)",
ylab="Percent", names=pmort$day,
xlim=c(0,28),ylim=c(0,20),legend="Mortality")
text(cex=.5, x=barX, y=pmort$pdied+par("cxy")[2]/2, pmort$ndied, xpd=TRUE)
barX <- barplot(pmort$pdied,xlab="Age(days)",
ylab="Percent", names=pmort$day,
xlim=c(0,28),ylim=c(0,20),legend="Mortality")
text(cex=.5, x=barX, y=-.5, pmort$ndied, xpd=TRUE)

Resources