Using a for loop with ggplot2 to plot multiple graphs within a data frame - r

I was wondering if somebody could help with a problem I am having in R with a for loop using ggplot2. I have carried out some clustering to find patterns in data that change over time. There are 38 pattern graphs in total. The clustering output places all 38 graphs side by side, which is nice for visualisation.
But I want to zoom in on individual graphs for presentation and a clearer view of each pattern. This is easy to do manually; however, writing 38 versions of the same script with just a different cluster in each one is very tedious, so I would like to use a for loop to do it in one quick chunk of code. I have written the code below (with some help online), but I am unable to get the output of the 38 individual graphs. The code itself works, in that I can specify one cluster and get the plot for that specific cluster, but I want code that will create the plots for all 38 clusters.
The data frame is called dfllgc, within which dfllgc$cluster contains information on the individual clusters. The for loop I am attempting is as follows, but it does not work. Any help would really be appreciated!
for(cluster in dfllgc$cluster){
df<-subset(dataframAMIRllgc,cluster == 1:38)
df$Time_point<-factor(df.s$Time_point, levels = c("p3", "p15", "p30","p60"))
g<-ggplot(df, aes(x=Time_point, y=abundance, group=llgc, colour=llgc))+
geom_line(size=1.5)+
geom_point(size=4)+
ggtitle("Cluster 29: Patterns over time (5 genes) \n") +
xlab("\nAge") + ylab("Expression(CPM)\n")
print(g) }
Changing df<-subset(dataframAMIRllgc,cluster == 1:38) to == 1, or 15 etc, or any other cluster does indeed produce that one cluster, but not all 38 with 1:38.
Finally, is there a way to automate the title (ggtitle) as well, so that I have a template and the cluster number and the number of genes are filled in automatically for each cluster?
Thank you so much! Any help would be much appreciated :)
example data
merge cluster Time_point llgc abundance
1 High[26-50%]p15 1 p15 High[26-50%] 166.5400335
38 High[26-50%]p3 1 p3 High[26-50%] 255.5007952
75 High[26-50%]p30 1 p30 High[26-50%] 122.1110473
112 High[26-50%]p60 1 p60 High[26-50%] 78.84340532
149 Low[0-10%]p15 1 p15 Low[0-10%] 86.40962037
186 Low[0-10%]p3 1 p3 Low[0-10%] 205.9750297
223 Low[0-10%]p30 1 p30 Low[0-10%] 60.23843127
260 Low[0-10%]p60 1 p60 Low[0-10%] 56.64259547
297 Medium[11-25%]p15 1 p15 Medium[11-25%] 165.2372227
334 Medium[11-25%]p3 1 p3 Medium[11-25%] 223.3891249
371 Medium[11-25%]p30 1 p30 Medium[11-25%] 155.1325448
408 Medium[11-25%]p60 1 p60 Medium[11-25%] 176.8285175
2 High[26-50%]p15 2 p15 High[26-50%] 85.21789981
39 High[26-50%]p3 2 p3 High[26-50%] 211.5359752
76 High[26-50%]p30 2 p30 High[26-50%] 35.7475454
113 High[26-50%]p60 2 p60 High[26-50%] 12.87995477
150 Low[0-10%]p15 2 p15 Low[0-10%] 77.20608808
187 Low[0-10%]p3 2 p3 Low[0-10%] 43.04550979
224 Low[0-10%]p30 2 p30 Low[0-10%] 34.88976766
261 Low[0-10%]p60 2 p60 Low[0-10%] 9.791146582
298 Medium[11-25%]p15 2 p15 Medium[11-25%] 46.21377697
335 Medium[11-25%]p3 2 p3 Medium[11-25%] 34.89603178
372 Medium[11-25%]p30 2 p30 Medium[11-25%] 14.18668175
409 Medium[11-25%]p60 2 p60 Medium[11-25%] 7.360330065
3 High[26-50%]p15 3 p15 High[26-50%] 47.75793997
40 High[26-50%]p3 3 p3 High[26-50%] 62.3529071
77 High[26-50%]p30 3 p30 High[26-50%] 17.8348889
114 High[26-50%]p60 3 p60 High[26-50%] 14.26366778
151 Low[0-10%]p15 3 p15 Low[0-10%] 138.1451371
188 Low[0-10%]p3 3 p3 Low[0-10%] 185.1184602
225 Low[0-10%]p30 3 p30 Low[0-10%] 63.52332626
262 Low[0-10%]p60 3 p60 Low[0-10%] 39.40566363
299 Medium[11-25%]p15 3 p15 Medium[11-25%] 26.32551336
336 Medium[11-25%]p3 3 p3 Medium[11-25%] 49.72067928
373 Medium[11-25%]p30 3 p30 Medium[11-25%] 8.288553629
410 Medium[11-25%]p60 3 p60 Medium[11-25%] 5.385031193

I'm not sure I 100% understand what you are trying to do but I think there is a problem with your subset and then you need to add a save function to the end. Hopefully this does what you want:
dfllgc$Time_point<-factor(dfllgc$Time_point, levels = c("p3", "p15", "p30","p60"))
for(cluster in unique(dfllgc$cluster)) {
g<-ggplot( dfllgc[ dfllgc$cluster == cluster, ],
aes(x=Time_point, y=abundance, group=llgc, colour=llgc)) +
geom_line(size=1.5) +
geom_point(size=4) +
ggtitle( paste0("Cluster ", cluster,": Patterns over time (5 genes)") ) +
xlab("Age") + ylab("Expression(CPM)")
ggsave(paste0("Cluster_", cluster,".png"), g)
}
Changes made:
removed the subset line and added the cluster subset/filter to ggplot line but it could just as easily be separate.
moved the factor conversion outside the for loop so it only needs to be applied once.
set the title and file name to change with each cluster
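On why the original subset misbehaved, and on the title question: cluster == 1:38 compares the column with 1:38 position by position (recycling the shorter vector), so it does not mean "any of clusters 1 to 38". The sketch below shows this on a toy vector and then fills a title template per cluster. The genes_per_cluster lookup and the make_title() helper are hypothetical; the example data has no gene counts, so those numbers would have to come from your clustering output.

```r
# Toy stand-in for dfllgc$cluster: three clusters, four rows each
cluster <- rep(1:3, each = 4)

# `==` recycles 1:3 along the column and compares position by position,
# so only the rows that happen to line up are TRUE
sum(cluster == 1:3)    # 6 of the 12 rows
# %in% tests membership, which is what "any of these clusters" means
sum(cluster %in% 1:3)  # all 12 rows

# Hypothetical lookup (assumption): number of genes in each cluster
genes_per_cluster <- c("1" = 5L, "2" = 12L, "3" = 8L)

# Hypothetical helper: fill the title template for one cluster
make_title <- function(cl) {
  sprintf("Cluster %s: Patterns over time (%d genes)",
          cl, genes_per_cluster[[as.character(cl)]])
}

for (cl in unique(cluster)) {
  print(make_title(cl))
}
```

Inside the loop above you could then write ggtitle(make_title(cluster)) instead of the hard-coded "(5 genes)".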

Related

how to add regression lines for each factor on a plot

I've created a model and I'm trying to add curves that fit the two parts of the data, insulation and no insulation. I was thinking about using the insulation coefficient as a true/false term, but I'm not sure how to translate that into code. Entries 1:56 are "w/o" and 57:101 are "w/". I'm not sure how to include the data I'm using but here's the head and tail:
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
1 8 2003 476 21 a 33.32 69 -8 22.66667 1 w/o
2 9 2003 1052 30 e 112.33 73 -1 35.05172 2 w/o
3 10 2003 981 28 a 24.98 60 -6 35.05172 3 w/o
4 11 2003 1094 32 a 73.51 53 2 34.18750 4 w/o
5 12 2003 1409 32 a 93.23 44 6 44.03125 5 w/o
6 1 2004 1083 32 a 72.84 34 3 33.84375 6 w/o
month year kwh days est cost avgT dT.yr kWhd.1 id insulation
96 7 2011 551 29 e 55.56 72 0 19.00000 96 w/
97 8 2011 552 27 a 61.17 78 1 20.44444 97 w/
98 9 2011 666 34 e 73.87 71 -2 19.58824 98 w/
99 10 2011 416 27 a 48.03 64 0 15.40741 99 w/
100 11 2011 653 31 e 72.80 53 1 21.06452 100 w/
101 12 2011 751 33 a 83.94 45 2 22.75758 101 w/
bill$id <- seq(1:101)
bill$insulation <- as.factor(ifelse(bill$id > 56, c("w/"), c("w/o")))
m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data=bill)
with(bill, plot(kWhd.1 ~ avgT, xlab="Average Temperature (F)",
ylab="Daily Energy Use (kWh/d)", col=insulation))
no_ins <- data.frame(bill$avgT[1:56], bill$insulation[1:56])
curve(predict(m1, no_ins=x), add=TRUE, col="red")
ins <- data.frame(bill$avgT[57:101], bill$insulation[57:101])
curve(predict(m1, ins=x), add=TRUE, lty=2)
legend("topright", inset=0.01, pch=21, col=c("red", "black"),
legend=c("No Insulation", "Insulation"))
ggplot2 makes this a lot easier than base plotting. Something like this should work:
ggplot(bill, aes(x = avgT, y = kWhd.1, color = insulation)) +
geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
geom_point()
In base, I'd create a data frame with the points you want to predict on. Note that the grid has to be over the predictor (avgT), not the response, something like
pred_data = expand.grid(
avgT = seq(min(bill$avgT), max(bill$avgT), length.out = 100),
insulation = c("w/", "w/o")
)
pred_data$prediction = predict(m1, newdata = pred_data)
And then use lines to add the predictions to your plot. My base graphics is pretty rusty, so I'll leave that to you (or another answerer) if you want it.
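For completeness, here is a minimal sketch of that last step with lines(). Since the real bill data is only partly shown, it uses a simulated stand-in (bill_sim, with made-up coefficients), so treat the numbers as illustrative only.

```r
set.seed(42)
# Simulated stand-in for `bill` (assumption: the real data is not available)
n <- 100
bill_sim <- data.frame(
  avgT = runif(n, 30, 80),
  insulation = factor(rep(c("w/o", "w/"), each = n / 2))
)
bill_sim$kWhd.1 <- 60 - 1.2 * bill_sim$avgT + 0.01 * bill_sim$avgT^2 -
  8 * (bill_sim$insulation == "w/") + rnorm(n, sd = 2)

m1 <- lm(kWhd.1 ~ avgT + insulation + I(avgT^2), data = bill_sim)

# Grid over the predictor, one copy per insulation level
pred_data <- expand.grid(
  avgT = seq(min(bill_sim$avgT), max(bill_sim$avgT), length.out = 100),
  insulation = levels(bill_sim$insulation)
)
pred_data$prediction <- predict(m1, newdata = pred_data)

with(bill_sim, plot(kWhd.1 ~ avgT, col = insulation,
                    xlab = "Average Temperature (F)",
                    ylab = "Daily Energy Use (kWh/d)"))
# One fitted curve per level, coloured like the points (factor code = colour)
for (i in seq_along(levels(bill_sim$insulation))) {
  lev <- levels(bill_sim$insulation)[i]
  with(pred_data[pred_data$insulation == lev, ],
       lines(avgT, prediction, col = i))
}
```

Because expand.grid varies avgT fastest, the rows within each insulation level are already sorted by avgT, so lines() draws a clean curve.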
In base R it's important to order the x-values. Since this is to be done on multiple factors, we can do this with by, resulting in a list L.
Since your example data is not complete, here's an example with iris where we consider Species as the "factor".
L <- by(iris, iris$Species, function(x) x[order(x$Petal.Length), ])
Now we can do the plot and add loess predictions as lines with a sapply.
with(iris, plot(Sepal.Width ~ Petal.Length, col=Species))
sapply(seq(L), function(x)
lines(L[[x]]$Petal.Length,
predict(loess(Sepal.Width ~ Petal.Length, L[[x]], span=1.1)), # span=1.1 for smoothing
col=x))

Reading unknown file type with strange entries into R

I am completely new at this and here, so please have mercy.
I want to open an ASCII data file in R.
After several different attempts, I have tried df=read.csv("C:MyDirectory" ,header=FALSE, sep="").
This has produced a table with several variables, but some rows clearly contain the wrong information, some cells are blank, some contain NA values.
Any ideas what has gone wrong? I got the file from an official Spanish research institute:
http://www.cis.es/cis/opencm/ES/2_bancodatos/estudios/listaTematico.jsp?tema=1&todos=si
Then select "BARÓMETRO DE OCTUBRE 2017"; to the right is a small link entitled "fichero de datos", which lets you download the files after providing some information. The file giving me trouble is DA3191. If anyone could take the trouble to help me with this, it would be awesome. Thank you.
Part 1
This looks like a fixed width format, so you need read.fwf instead of read.csv and friends. I made a screen shot of an almost random place of that file: my hypothesis is that the 99's and 98's etc are missing data codes, so the first 99 marked in yellow would belong to the same column with 4, 2, 0, etc, and the immediately following 99 (not marked) is in the same column with 0, 5, 7, etc.
Part 2
And then look at the file ES3191 -- this looks like SPSS code (pardon my French!) containing the rules about reading in the data file. You can probably figure out the width of each column and what's in there from that file:
DATA LIST FILE= 'DA3191'
/ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
I'm not an SPSS expert but I would guess that what it is trying to tell us is that
columns 1-4 contain the variable "ESTU"
columns 5-9 contain the variable "CUES"
etc
For read.fwf you have to calculate each variable's "width" i.e. 4 characters for ESTU (if my reading was right) 5 characters for CUES etc.
Part 3
Using the guesses above, I used the following code to read in your data, and it looks like it works:
# this is copy/pasted SPSS code from file "ES3191"
txt <- "ESTU 1-4 CUES 5-9 CCAA 10-11 PROV 12-13 MUN 14-16 TAMUNI 17 CAPITAL 18 DISTR 19-20 SECCION 21-23
ENTREV 24-27 P0 28 P0A 29-31 P1 32 P2 33 P3 34 P4 35 P5 36 P6 37 P701 38-39 P702 40-41 P703 42-43
P801 44-45 P802 46-47 P803 48-49 P901 50-51 P902 52-53 P903 54-55 P904 56-57 P905 58-59 P906 60-61
P907 62-63 P1001 64 P1002 65 P1003 66 P1101 67 P1102 68 P1103 69 P1104 70 P1201 71 P1202 72
P1203 73 P1204 74 P1205 75 P1206 76 P1207 77 P1208 78 P1209 79 P13 80-81 P13A 82-83 P1401 84-85
P1402 86-87 P1403 88-89 P1404 90-91 P1405 92-93 P1406 94-95 P1407 96-97 P1408 98-99 P1409 100-101
P1410 102-103 P1411 104-105 P1412 106-107 P1413 108-109 P1414 110-111 P1415 112-113 P1416 114-115
P1501 116-117 P1502 118-119 P1503 120-121 P1504 122-123 P1505 124-125 P1506 126-127 P1507 128-129
P1508 130-131 P1509 132-133 P1510 134-135 P1511 136-137 P1512 138-139 P1513 140-141 P1514 142-143
P1515 144-145 P1516 146-147 P16 148 P17 149 P1801 150-151 P1802 152-153 P1803 154-155 P1804 156-157
P1805 158-159 P1806 160-161 P1807 162-163 P1808 164-165 P1809 166-167 P1810 168-169 P1811 170-171
P1812 172-173 P1813 174-175 P19 176 P20 177 P21 178-179 P22 180-181 P23 182-183 P2401 184-185
P2402 186-187 P2403 188-189 P2404 190-191 P2405 192-193 P2406 194-195 P2407 196-197 P2408 198-199
P2409 200-201 P2410 202-203 P2411 204-205 P2412 206-207 P2413 208-209 P2414 210-211 P2415 212-213
P2416 214-215 P25 216 P26 217 P27 218 P27A 219-220 P28 221-222 P29 223 P30 224-225 P31 226 P31A 227-228
P32 229 P32A 230 P33 231 P34 232 P35 233 P35A 234 P36 235 P37 236 P37A 237 P37B 238 P38 239-241
P39 242 P39A 243 P40 244-246 P41 247-248 P42 249-250 P43 251 P43A 252 P43B 253 P44 254 P4501 255
P4502 256 P4503 257 P4504 258 P4601 259-261(A) P4602 262-264(A) P4603 265-267(A) P4604 268-270(A)
P4605 271-273(A) P4701 274-276(A) P4702 277-279(A) P4703 280-282(A) P4704 283-285(A) P4705 286-288(A)
P48 289 P49 290 P50 291 P51 292 I1 293-295 I2 296-298 I3 299-301 I4 302-304 I5 305-307 I6 308-310
I7 311-313 I8 314-316 I9 317-319 E101 320-321 E102 322-323 E103 324-325 E2 326 E3 327-329 E4 330
C1 331 C1A 332-333 C2 334 C2A 335 C2B 336-337 C3 338 C4 339-340 P21R 341-342 P22R 343-344 VOTOSIMG 345-346
P27AR 347-348 RECUERDO 349-350 ESTUDIOS 351 OCUMAR11 352-353 RAMA09 354 CONDICION11 355-356
ESTATUS 357 "
# making a 2-column matrix (name = left column, position = right column)
m <- matrix(scan(text=txt, what=""), ncol=2, byrow=TRUE)
m <- as.data.frame(m, stringsAsFactors=FALSE)
names(m) <- c("Var", "Pos")
pos <- sub("(A)", "", m$Pos, fixed = TRUE) # some entries contain '(A)' - no idea what it means so deleting it
pos <- strsplit(pos, "-")
starts <- as.numeric(sapply(pos, head, 1)) # get the first element from left
ends <- as.numeric(sapply(pos, tail, 1)) # get the last element (same as the first for single-column variables)
w <- ends - starts +1
MyData <- read.fwf("R/MD3191/DA3191", widths = w)
names(MyData) <- m$Var
head(MyData)
# ESTU CUES CCAA PROV MUN TAMUNI CAPITAL DISTR SECCION ENTREV P0 P0A P1 P2 P3 P4 P5 P6
# 1 3191 1 16 1 59 5 1 0 0 0 1 0 3 2 2 5 1 2
# 2 3191 2 16 1 59 5 1 0 0 0 1 0 4 2 3 5 2 3
# 3 3191 3 16 1 59 5 1 0 0 0 1 0 4 2 2 4 2 2
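If the hypothesis from Part 1 holds (98 and 99 being missing-data codes), those values could be recoded to NA after reading. Below is a small self-contained sketch on toy fixed-width records. The layout (4 + 1 + 2 characters) and the choice of codes are assumptions for illustration, not the real DA3191 layout, so check the codebook before recoding real columns.

```r
# Toy fixed-width records (NOT the real DA3191 layout): 4 + 1 + 2 characters
toy_lines <- c("3191199", "3191204", "3191398")

con <- textConnection(toy_lines)
toy <- read.fwf(con, widths = c(4, 1, 2),
                col.names = c("ESTU", "P1", "P701"))
close(con)

# Assumption: 98 and 99 are missing-data codes for this variable
toy$P701[toy$P701 %in% c(98, 99)] <- NA
toy
```

Recoding column by column is deliberate here: a value like 99 may be a missing code in one variable and a legitimate value in another.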

cluster analysis with weight

I have a data frame 'heat' demonstrating people's performance across time.
'Var1' represents the code of persons.
'Var2' represents a time line (measured by number of days from the starting point).
'value' is the score they get at a given time point.
Var1 Var2 value
1 1 36 -0.6941826
2 2 36 -0.5585414
3 3 36 0.8032384
4 4 36 0.7973031
5 5 36 0.7536959
6 6 36 -0.5942059
....
54 10 73 0.7063218
55 11 73 -0.6949616
56 12 73 -0.6641516
57 13 73 0.6890433
58 14 73 0.6310124
59 15 73 -0.6305091
60 16 73 0.6809655
61 17 73 0.8957870
....
101 13 110 0.6495796
102 14 110 0.5990869
103 15 110 -0.6210600
104 16 110 0.6441960
105 17 110 0.7838654
....
Now I want to cluster their performance and reflect it on a heatmap. So I used the functions dist() and hclust() to cluster the data frame and plotted it with ggplot2:
ggplot(data = heat) + geom_tile(aes(x = Var2, y = Var1 %>% as.character(),
fill = value)) +
scale_fill_gradient(low = "yellow",high = "red") +
geom_vline(xintercept = c(746, 2142, 2917))
It looks like this:
However, I am more interested in what happened around day 746, day 2142 and day 2917 (the black lines). I would like the scores around these days bearing more weight in the clustering. I want people demonstrating similar performance around these days to have more priority to be clustered together. Is there a way of doing this?
As long as your weights are integers, you can presumably just replicate those days artificially.
If you want more control, just compute the distance matrix yourself, with whatever weighted distance you want to use.
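Both suggestions can be sketched in a few lines. This assumes the long data frame has been reshaped into a wide matrix (one row per person, one column per day); the toy scores and the weight of 3 for the days of interest are made up for illustration.

```r
set.seed(1)
# Toy wide matrix: one row per person, one column per day
days <- c(36, 73, 746, 2142)
scores <- matrix(rnorm(5 * length(days)), nrow = 5,
                 dimnames = list(paste0("person", 1:5), as.character(days)))

# Hypothetical integer weights: the days of interest count three times
w <- ifelse(days %in% c(746, 2142, 2917), 3, 1)

# Option 1: replicate the weighted columns, then use plain dist()
d_rep <- dist(scores[, rep(seq_along(days), times = w)])

# Option 2: weighted Euclidean distance, sqrt(sum(w * (x - y)^2)),
# obtained by scaling each column by sqrt(w); also works for non-integer w
d_w <- dist(sweep(scores, 2, sqrt(w), `*`))

all.equal(as.vector(d_rep), as.vector(d_w))  # both routes agree
hc <- hclust(d_w)
```

To weight days "around" a date rather than the exact date, w could be any decreasing function of the distance to the nearest day of interest; option 2 handles that without rounding to integers.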

R ggplot ordering bars within groups

I'm attempting to format a grouped bar plot in R with ggplot such that bars are in decreasing order per group. This is my current plot:
based on this data frame:
> top_categories
Category Count Community
1 Singer-Songwriters 151 1
2 Adult Alternative 147 1
3 Dance Pop 95 1
4 Folk 89 1
5 Adult Contemporary 88 1
6 Pop Rap 473 2
7 Gangsta & Hardcore 413 2
8 Soul 175 2
9 East Coast 170 2
10 West Coast 135 2
11 Album-Oriented Rock (AOR) 253 3
12 Singer-Songwriters 217 3
13 Soft Rock 196 3
14 Folk 145 3
15 Adult Contemporary 106 3
16 Soul 278 4
17 Blues 137 4
18 Funk 119 4
19 Quiet Storm 76 4
20 Dance Pop 74 4
21 Indie & Lo-Fi 235 5
22 Indie Rock 234 5
23 Adult Alternative 114 5
24 Alternative Rock 49 5
25 Singer-Songwriters 47 5
created with this code:
ggplot(
top_categories,
aes(
x=Community,
y=Count,
group=Category,
label=Category
)
) +
geom_bar(
stat="identity",
color="black",
fill="#9C27B0",
position="dodge"
) +
geom_text(
angle=90,
position=position_dodge(width=0.9),
hjust=-0.05
) +
ggtitle("Number of Products in each Category in Each Community") +
guides(fill=FALSE)
Based on suggestions from related posts, I've attempted to use the reorder function and turn the Count into a factor, both with results that seem to break the ordering of the bars vs. the text or rescale the plot in a nonsensical way such as this (with factors):
Any tips on how I might accomplish this in-group ordering? Thanks!
When you group by Category, the bars are ordered according to the order in which the Categories first appear in the data frame. This works fine for Community 1 and 2, as your rows are already ordered by decreasing Count. But in Community 3, "Singer-Songwriters" is put first because it is the first Category to occur in the data frame (row 1), even though it only has the second-highest Count there.
Grouping instead by an Id variable resolves the problem:
top_categories$Id=rep(c(1:5),5)
ggplot(
top_categories,
aes(
x=Community,
y=Count,
group=Id,
label=Category
)
) +
geom_bar(
stat="identity",
color="black",
fill="#9C27B0",
position="dodge"
) +
geom_text(
angle=90,
position=position_dodge(width=0.9),
hjust=-0.05
) +
ggtitle("Number of Products in each Category in Each Community") +
guides(fill="none")
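Note that rep(c(1:5),5) relies on the rows already being sorted by decreasing Count within each Community. If that is not guaranteed, the Id could be computed as a within-group rank instead. A minimal sketch on made-up, deliberately unsorted rows:

```r
# Toy data, deliberately not pre-sorted within each Community
toy <- data.frame(
  Category  = c("Folk", "Soul", "Blues", "Funk"),
  Count     = c(89, 151, 137, 278),
  Community = c(1, 1, 2, 2)
)

# Rank of each row within its Community, 1 = largest Count;
# ties.method = "first" keeps the ranks unique when Counts are tied
toy$Id <- ave(-toy$Count, toy$Community,
              FUN = function(x) rank(x, ties.method = "first"))
toy
```

The resulting Id can then be used as group=Id exactly as in the plot code above.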

ggplot each group consists of only one observation

I'm trying to make a plot similar to this answer: https://stackoverflow.com/a/4877936/651779
My data frame looks like this:
df2 <- read.table(text='measurements samples value
1 4hours sham1 6
2 1day sham1 175
3 3days sham1 417
4 7days sham1 163
5 14days sham1 37
6 90days sham1 134
7 4hours sham2 8
8 1day sham2 402
9 3days sham2 482
10 7days sham2 67
11 14days sham2 16
12 90days sham2 31
13 4hours sham3 185
14 1day sham3 402
15 3days sham3 482
16 7days sham3 85
17 14days sham3 29
18 90days sham3 10',header=T)
And plot it with
ggplot(df2, aes(measurements, value)) + geom_line(aes(colour = samples))
No lines show in the plot, and I get the message
geom_path: Each group consist of only one observation.
Do you need to adjust the group aesthetic?
I don't see where what I'm doing is different from the answer I linked above. What should I change to make this work?
Add group = samples to the aes of geom_line. This is necessary because you want one line per sample rather than one per data point.
ggplot(df2, aes(measurements, value)) +
geom_line(aes(colour = samples, group = samples))
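The reason for the message: when group is not set, ggplot2 groups by the combination of all discrete variables in the plot, and here both measurements (on the x axis) and samples are discrete. The following base R sketch, on a cut-down copy of the data, shows what that default grouping amounts to.

```r
# Cut-down version of df2: three time points, two samples
df2_small <- data.frame(
  measurements = rep(c("4hours", "1day", "3days"), times = 2),
  samples      = rep(c("sham1", "sham2"), each = 3),
  value        = c(6, 175, 417, 8, 402, 482)
)

# Default grouping: every measurements x samples combination is its own
# group, so each group holds a single row and no line can be drawn
default_groups <- table(interaction(df2_small$measurements, df2_small$samples))
default_groups

# With group = samples, each group holds all time points of one sample
sample_groups <- table(df2_small$samples)
sample_groups
```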
