In R, plotting wide form data with ggplot2 or base plot. Is there a way to use ggplot2 without melting a wide form data frame?

I have a data frame that looks like this (though thousands of times larger).
df<-data.frame(sample(1:100,10,replace=F),sample(1:100,10,replace=F),runif(10,0,1),runif(10,0,1),runif(10,0,1), rep(c("none","summer","winter","spring","allyear"),2))
names(df)<-c("Mother","ID","Wavelength1","Wavelength2","Wavelength3","WaterTreatment")
df
   Mother ID Wavelength1 Wavelength2 Wavelength3 WaterTreatment
1       2 34   0.9143670  0.03077356  0.82859497           none
2      24 75   0.6173382  0.05958151  0.66552338         summer
3      62 77   0.2655572  0.63731302  0.30267893         winter
4      30 98   0.9823510  0.45690437  0.40818031         spring
5       4 11   0.7503750  0.93737900  0.24909228        allyear
6      55 76   0.6451885  0.60138475  0.86044856           none
7      97 21   0.5711019  0.99732068  0.04706894         summer
8      87 14   0.7699293  0.81617911  0.18940531         winter
9      92 30   0.5855559  0.70152698  0.73375917         spring
10     93 44   0.1040359  0.85259166  0.37882469        allyear
I want to plot wavelength values on the y axis, and wavelength on the x. I have two ways of doing this:
First method which works, but uses base plot and requires more code than should be necessary:
colors=c("red","blue","green","orange","yellow")
plot(0,0,xlim=c(1,3),ylim=c(0,1),type="l")
for (i in 1:10) {
  if (df$WaterTreatment[i]=="none"){
    a<-1
  } else if (df$WaterTreatment[i]=="allyear") {
    a<-2
  } else if (df$WaterTreatment[i]=="summer") {
    a<-3
  } else if (df$WaterTreatment[i]=="winter") {
    a<-4
  } else if (df$WaterTreatment[i]=="spring") {
    a<-5
  }
  lines(seq(1,3,1),df[i,3:5],type="l",col=colors[a])
}
Second method: I attempt to melt the data to put it in long form, then use ggplot2. The plot it produces is not correct because there is a line for each water treatment, rather than a line for each "Mother" "ID" (the unique identifier, what were the rows in the original data frame).
require(reshape2)
require(data.table)
df_m<-melt(df,id.var=c("Mother","ID","WaterTreatment"))
df_m$variable<-as.numeric(df_m$variable) #sets wavelengths to numeric
qplot(x=df_m$variable,y=df_m$value,data=df_m,color=df_m$WaterTreatment,geom = 'line')
There is probably something simple I'm missing about ggplot2 that would fix the plotting of the lines. I'm a newbie with ggplot, but am working to get more familiar with it and would like to use it in this application.
But more broadly, is there an efficient way to plot this type of wide form data in ggplot2? The time it takes to transform/melt the data is enormous and I'm wondering if it is worth it, or if there is some kind of work around that can eliminate the redundant cells created when melting.
Thanks for your help, if you need more clarity on this question please let me know and I can edit.

I'd like to point out that you are basically re-inventing an existing base plotting function, namely matplot. This could replace your plot and for-loop:
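# factor() is needed when WaterTreatment is a character column (the default since R 4.0);
# colours then follow the alphabetical order of the factor levels
matplot(1:3, t( df[ ,3:5] ), type="l",col=colors[ as.numeric(factor(df$WaterTreatment))] )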
With that in mind you might want to search SO for: [r] matplot ggplot2, as I did, and see if any of the hits are effective.

It looks like you want a separate line for each ID, but you want the lines colored based on the value of WaterTreatment. If so, you can do it like this in ggplot:
ggplot(df_m, aes(x=variable, y=value, group=ID, colour=WaterTreatment)) +
geom_line() + geom_point()
You can also use faceting to make it easier to see the different levels of WaterTreatment
ggplot(df_m, aes(x=variable, y=value, group=ID, colour=WaterTreatment)) +
geom_line() + geom_point() +
facet_grid(WaterTreatment ~ .)
To answer your general question: ggplot is set up to work most easily and powerfully with a "long" (i.e., melted) data frame. I guess you could work with a "wide" data frame and plot separate layers for each combination of factors you want to plot. But that would be a lot of extra work compared to a single melt command to get your data into the right format.
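As a rough illustration of the melt-then-group approach described above, here is a minimal sketch on made-up data. It assumes the data.table package is available (its melt() is usually quicker than reshape2's on large tables, though that is not measured here), and it groups by interaction(Mother, ID) on the assumption that the pair, not ID alone, identifies a row. The object names wide, long and p are illustrative only.

```r
library(data.table)
library(ggplot2)

set.seed(1)  # toy data; the values are made up for illustration
wide <- data.frame(
  Mother = c(1, 1, 2, 2),
  ID = c(1, 2, 1, 2),
  Wavelength1 = runif(4),
  Wavelength2 = runif(4),
  Wavelength3 = runif(4),
  WaterTreatment = c("none", "summer", "none", "summer")
)

# One melt call: every Wavelength column becomes rows of (wavelength, value)
long <- melt(
  as.data.table(wide),
  id.vars = c("Mother", "ID", "WaterTreatment"),
  variable.name = "wavelength",
  value.name = "value"
)

# The melted column is a factor; its level index (1, 2, 3) serves as the x position
long[, wavelength := as.integer(wavelength)]

# group = one line per original row; colour = treatment
p <- ggplot(long, aes(x = wavelength, y = value,
                      group = interaction(Mother, ID),
                      colour = WaterTreatment)) +
  geom_line()
```

Printing p draws four lines (one per original row) in two colours.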

Related

Simple barplot displaying voting of a county

I'm fairly new to R and making plots, so sorry about that. I have a dataset of the voting for counties and I want to make a barplot showing how many mandates each county voted for.
What I've done so far is to extract one row, which includes the name of the county and the number of mandates it voted for the different parties (which are headers).
Fylker AP FRP H KrF SP
Ostlandet 3 2 2 0 1
Sorry for the bad display of code, whenever I paste the code, it looks really weird, despite indenting.
The data frame (called Ostfold in the code below) has only 1 row, for the county Ostlandet. So as I tried to explain above, I want to make some sort of barplot out of this. The idea is to have the different parties on the x-axis and number of votes on y. I've tried this so far:
ggplot(Ostfold, aes(x = Ostfold[1,])) +
geom_histogram(binwidth = 20)
Which just gave me tons of errors.
I've also tried using barplot, but I just can't seem to figure this out.
Sorry, this is probably super easy, but I'm just getting into coding.
You have a few issues. First, there's no need for extracting rows. Second, the data are in "wide" format (mandates in columns) instead of "long format" (a column named "mandate" with values). Third, you want to plot counts so geom_col() is better than geom_histogram().
The gather() function from the tidyr package will get your data from wide into long:
library(tidyr)
library(ggplot2)
Ostfold %>%
gather(Mandate, Votes, -Fylker)
That should generate something like this:
Fylker Mandate Votes
1 Ostlandet AP 3
2 Ostlandet FRP 2
3 Ostlandet H 2
4 Ostlandet KrF 0
5 Ostlandet SP 1
You can pass that to ggplot:
Ostfold %>%
gather(Mandate, Votes, -Fylker) %>%
ggplot(aes(Mandate, Votes)) + geom_col()
Result for your one row:
For a dataset with multiple counties, you might want to add + facet_wrap(~Fylker) to facet the plot by county, depending on how many there are.
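In current tidyr versions gather() is superseded by pivot_longer(). A minimal sketch of the same reshaping, assuming the one-row data frame from the question is rebuilt by hand as Ostfold (the name long is illustrative):

```r
library(tidyr)
library(ggplot2)

# The single row from the question, typed in by hand
Ostfold <- data.frame(Fylker = "Ostlandet", AP = 3, FRP = 2, H = 2, KrF = 0, SP = 1)

# Wide to long: every column except Fylker becomes a (Mandate, Votes) pair
long <- pivot_longer(Ostfold, cols = -Fylker,
                     names_to = "Mandate", values_to = "Votes")

# One bar per party, bar height = number of mandates
p <- ggplot(long, aes(x = Mandate, y = Votes)) + geom_col()
```

The resulting long table has the same five rows as the gather() output shown above.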

R column dataframe names number

I have a dataframe like this
geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13
As the names of the 2nd and 3rd columns are treated as numbers, I'm not able to get a bar chart with ggplot2. Is there any way to have the column names treated as text?
data
pivot_dat <- read.table(text="geo 2001 2002
Spain 21 23
Germany 34 50
Italy 57 89
France 19 13", stringsAsFactors=FALSE, header=TRUE)
pivot_dat <- setNames(pivot_dat,c("geo","2001","2002"))
Here's how to do it :
library(ggplot2)
ggplot(pivot_dat, aes(x = geo, y = `2002`)) + geom_col()+ coord_flip()
By using backticks instead of single or double quotes you make sure you pass a name to the function and not a string.
If you use quotes, ggplot treats "2002" as a constant character value and recycles it, so all bars have the same length of 1 and carry the label "2002".
Note 1 :
You might want to learn the difference between geom_col and geom_bar :
?ggplot2::geom_bar
In short geom_col is geom_bar with stat = "identity", which is what you want here since you want to show on your plot the raw values from your table.
Note 2:
aes_string can be used to give strings instead of names, but a plain "2002" doesn't work here because the string is evaluated as a number, so it needs backticks inside the string:
ggplot(pivot_dat, aes_string(x = "geo", y = "2002")) +
geom_col()+ coord_flip() # incorrect output
ggplot(pivot_dat, aes_string(x = "geo", y = "`2002`")) +
geom_col()+ coord_flip() # correct output
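Since ggplot2 3.0.0, aes_string() is soft-deprecated and the .data pronoun is the suggested way to refer to a column by a string. A minimal sketch, assuming the same four-row table rebuilt by hand (year_col is an illustrative variable name):

```r
library(ggplot2)

# Same table as above; check.names = FALSE keeps "2001"/"2002" as column names
pivot_dat <- data.frame(
  geo = c("Spain", "Germany", "Italy", "France"),
  `2001` = c(21, 34, 57, 19),
  `2002` = c(23, 50, 89, 13),
  check.names = FALSE
)

year_col <- "2002"  # column chosen by a string held in a variable

# .data[[...]] looks the string up as a column name, so no backticks are needed
p <- ggplot(pivot_dat, aes(x = geo, y = .data[[year_col]])) +
  geom_col() +
  coord_flip()
```

This form is handy when the year is picked programmatically, for example inside a loop over column names.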
Without an example to see exactly what your problem is, and what you want, it is hard to give you a perfect answer. But here's the thing.
You can do a geom_bar with numeric data. There are 3 possible ways I see that you could have problems (but I may not be able to guess every way).
First, let's set up R for plotting.
library(readr)
library(ggplot2)
test <- read_csv("geo,2001,2002
Spain,21,23
Germany,34,50
Italy,57,89
France,19,13")
Next, let's make the first mistake: incorrectly calling the column name. In the next example I tell ggplot to make a bar of the number 2001, not the column 2001! R cannot know that we mean the column, because an unquoted 2001 is parsed as a numeric literal, so it always picks the number instead of the column.
ggplot(test) +
geom_bar(aes(x=2001))
OK, that just gives you a bar at 2001, because you gave it a single number instead of a column. Let's fix that. Use backticks (`) to identify the column name 2001 instead of the number 2001.
ggplot(test) +
geom_bar(aes(x=`2001`))
This creates a perfectly workable bar chart. But maybe you don't want the spaces? That's the only possible reason you would use text instead of a number. But you want text so I'm going to show you how to use as.factor to do something similar (and more powerful).
ggplot(test) +
geom_bar(aes(x=as.factor(`2001`)))

R: Split data in ggplot based on other factor

I am a beginner with R so I don't have much experience. I ran into a problem when trying to split my scatterplot into groups based on infection status. My dataset consists of log-transformed antibody levels (logapfhap2 in this example). Infection status (any_Pf_inf) is coded as Yes or No and indicates whether someone was infected during the follow-up period. I am plotting time points (x) against antibody levels (y). For time points 1 and 14 I would like to make 2 groups based on infection status.
This is the main part of the code I use to plot the data without splitting in groups:
ggplot() +
geom_jitter(data=data2, aes(x='1', y=logapfhap2, colour='PfHAP2A')) +
geom_jitter(data=data2,aes(x='14', y=logbpfhap2, colour='PfHAP2B')) +
geom_jitter(data=TRC, aes(x='C', y=PfHAP2, colour='PfHAP2C'))
which results in this graph:
Then I tried to split it (I only show the first time point here) which returns an error.
ggplot() +
geom_jitter(data=data2[data2$any_Pf_inf=='Yes'],
aes(x='1inf', y=logapfhap2[data2$any_Pf_inf=='Yes'],
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No'],
aes(x='1un', y=logapfhap2[data2$any_Pf_inf=='No'],
colour='PfHAP2B'))
I wanted to create this graph but I get this error:
Error: Length of logical index vector must be 1 or 55, got: 482
Hope this is clear! Could anyone help me with this problem? Thanks!
EDIT
Not sure if this makes it clearer, but this is what my data looks like:
I just tried some other things and I have solved it now!
ggplot()+
geom_jitter(data=data2[data2$any_Pf_inf=='Yes',],
aes(x='1inf', y=logapfhap2,
colour='PfHAP2A')) +
geom_jitter(data=data2[data2$any_Pf_inf=='No',],
aes(x='1un', y=logbpfhap2,
colour='PfHAP2B'))
Apparently you have to add a comma inside the brackets, as in data2[data2$any_Pf_inf=='Yes',], to extract rows instead of columns.
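The difference between the two indexing forms can be seen on a tiny made-up data frame. This is only a sketch: the values below are invented, and the reading of the error message (55 columns versus 482 rows) is an inference from the numbers it reports, not something stated in the question.

```r
# Toy stand-in for data2; values are made up
data2 <- data.frame(
  any_Pf_inf = c("Yes", "No", "Yes", "No"),
  logapfhap2 = c(1.2, 0.8, 1.5, 0.4)
)

is_inf <- data2$any_Pf_inf == "Yes"  # one TRUE/FALSE per row

# With a comma: the logical vector selects ROWS, all columns are kept
infected <- data2[is_inf, ]

# Without a comma: a data frame is indexed like a list, so the
# logical vector is matched against COLUMNS instead
first_col <- data2[c(TRUE, FALSE)]   # keeps only the first column
```

A row-length logical vector used without the comma therefore does not match the number of columns, which fits the reported "must be 1 or 55, got: 482". Inside aes() the column can then be named directly (y=logapfhap2), because the layer's data has already been subset.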

In R, how can I use different colors for each range in my scatterplot?

I'm trying to plot with R a series of 120 numbers where the first 40 are of one type, the next 40 items are of a second type and the last 40 items are of a third type.
Right now I'm just plotting it as a scatter-plot and its hard to tell the three sections apart:
data <- read.table("mydata.txt")
plot(data[,1])
Is there a way to distinguish the three sections, as in this following mockup that I made?
You could provide a colour vector if the data are already ordered.
mydata <- runif(120)
plot(mydata, col = rep(rainbow(3), each = 40))
rainbow(3) makes a colour vector of 3 colours, and rep with each = 40 makes 40 copies of each.
A longer answer, not as nice as the one from Mark O'Connell, but it has the merit of being more flexible (I think).
data<-data.frame(y=seq(1,1000)+rnorm(1000,0,100),index=seq(1,1000))
data.blue<-data[data$index<200,]
data.green<-data[data$index>=200&data$index<400, ]
data.red<-data[data$index>=400&data$index<600, ]
data.purple<-data[data$index>=600&data$index<1000, ]
plot(data.blue,col='blue',xlim=c(-200,1300),ylim=c(0,1000))
points(data.green$index,data.green$y,col='green')
points(data.red$index,data.red$y,col='red')
points(data.purple$index,data.purple$y,col='purple')
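The two answers can be combined: compute a group index per point and use it to look up a colour, which keeps a single plot() call while allowing unequal ranges. A minimal sketch, assuming the 120-point layout from the question with breaks at 40 and 80 (the names y, grp, palette3 and cols are illustrative):

```r
set.seed(42)       # made-up data in place of mydata.txt
y <- runif(120)

# cut() with labels = FALSE returns the interval number (1, 2 or 3)
# for each position: (0,40], (40,80], (80,120]
grp <- cut(seq_along(y), breaks = c(0, 40, 80, 120), labels = FALSE)

# Look up one colour per point from a small palette
palette3 <- c("blue", "green", "red")
cols <- palette3[grp]

plot(y, col = cols)
```

Changing the breaks vector is enough to handle sections of different lengths.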

How to indicate factors in ggplot with horizontal line and Text

My data looks like this example:
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance1",
"Substance1","Substance2","Substance2","Substance2",
"Substance2","Substance1","Substance1"))
dataExample
Time Data1 Data2 Application
1 1 6.511573 5.385265 Substance1
2 2 5.870173 4.512775 Substance1
3 3 6.822132 5.109790 Substance1
4 4 5.940528 6.281412 Substance1
5 5 7.269394 4.680380 Substance2
6 6 6.122454 6.015899 Substance2
7 7 5.660429 6.113362 Substance2
8 8 6.649749 4.344978 Substance2
9 9 7.252656 4.764667 Substance1
10 10 7.204440 5.835590 Substance1
I would like to indicate at which time any Substance was applied that is different from dataExample$Application[1].
Here I show you the way I get this plotted, but I assume that there is a much easier way to do it with ggplot.
library(reshape2)
library(ggplot2)
plotDataExample<-function(DataFrame){
  longDF<-melt(DataFrame,id.vars=c("Time","Application"))
  p=ggplot(longDF,aes(Time,value,color=variable))+geom_line()
  maxValue=max(longDF$value)
  minValue=min(longDF$value)
  yAppLine=maxValue+((maxValue-minValue)/20)
  xAppLine1=min(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  xAppLine2=max(longDF$Time[which(longDF$Application!=longDF$Application[1])])
  lineData=data.frame(x=c(xAppLine1,xAppLine2),y=c(yAppLine,yAppLine))
  xAppText=xAppLine1+(xAppLine2-xAppLine1)/2
  yAppText=yAppLine+((maxValue-minValue)/20)
  appText=longDF$Application[which(longDF$Application!=longDF$Application[1])[1]]
  textData=data.frame(x=xAppText,y=yAppText,appText=appText)
  p=p+geom_line(data=lineData,aes(x=x, y=y),color="black")
  p=p+geom_text(data=textData,aes(x=x,y=y,label = appText),color="black")
  return(p)
}
plotDataExample(dataExample)
Question:
Do you know a better way to get a similar result, so that I could possibly indicate more than one factor (e.g. Substance3, Substance4 ...)?
First, I made new sample data with more than 2 levels and with Substance2 applied twice.
dataExample<-data.frame(Time=seq(1:10),
Data1=runif(10,5.3,7.5),
Data2=runif(10,4.3,6.5),
Application=c("Substance1","Substance1","Substance2",
"Substance2","Substance1","Substance1","Substance2",
"Substance2","Substance3","Substance3"))
I didn't wrap this in a function, so that each step is visible.
Add a new column groups to the original data frame. It contains an identifier for runs of consecutive identical Application values: whenever the substance changes, a new group is formed.
dataExample$groups<-c(cumsum(c(1,tail(dataExample$Application,n=-1)!=head(dataExample$Application,n=-1))))
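The same run identifier can also be derived with base R's rle(), which some readers may find easier to follow than the cumsum() comparison. A minimal sketch on the Application vector from the sample data above (the names application, runs and groups are illustrative):

```r
# Application column from the sample data above
application <- c("Substance1", "Substance1", "Substance2", "Substance2",
                 "Substance1", "Substance1", "Substance2", "Substance2",
                 "Substance3", "Substance3")

# rle() finds runs of consecutive identical values and their lengths
runs <- rle(application)

# Number the runs 1, 2, 3, ... and repeat each number for the length of its run
groups <- rep(seq_along(runs$lengths), times = runs$lengths)
```

Here groups comes out as 1 1 2 2 3 3 4 4 5 5, so the two separate Substance2 periods get different identifiers.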
Convert to long format data for lines of data.
longDF<-melt(dataExample,id.vars=c("Time","Application","groups"))
Calculate positions for the substance labels. I used the function ddply() from the plyr package. Only rows whose Application differs from the first Application value are used (that's the subset() call). Application and groups are then used to group the data. For each group, the starting, middle and ending positions on the x axis are calculated, and the y position is taken as the maximal value + 0.3.
library(plyr)
lineData<-ddply(subset(dataExample,Application != dataExample$Application[1]),
.(Application,groups),
summarise,minT=min(Time),maxT=max(Time),
meanT=mean(Time),ypos=max(longDF$value)+0.3)
Now plot longDF data with ggplot() and geom_line() and add segments above plot with geom_segment() and text with annotate() using new data frame lineData.
ggplot(longDF,aes(Time,value,color=variable))+geom_line()+
geom_segment(data=lineData,aes(x=minT,xend=maxT,y=ypos,yend=ypos),inherit.aes=FALSE)+
annotate("text",x=lineData$meanT,y=lineData$ypos+0.1,label=lineData$Application)

Resources