Where is this extra data coming from? (in R plot) - r

I'm a biologist, but I had to teach myself python and R working different places a few years ago. A situation came up at my current job that R would be really useful for, and so i cobbled together a program. Surprisingly, it does just what I'd like EXCEPT the graphs it's generating have an extra bar at the beginning. !
I've entered no data to correspond to that first bar:
I'm hoping this is some simple error in how I've set the plot parameters. Could it be because I'm using plot instead of boxplot? Is it plotting the headings?
More worrisome is the possibility that while reading in and merging my 3 data frames I'm creating some sort of artifact data, which would also affect the statistical tests and make me very sad, though I don't see anything like this when I have it write the matrix to a file.
I greatly appreciate any help!
Here's what it looks like, and then the function it calls (in another script).
(I'm really not a programmer, so I apologize if the following code is miserable.) The goal is to compare our data (which is in columns 10-17 of a csv) to all of the data in a big sheet of clinical data in turn. Then, if there is a significant correlation (the p value is less than .05), to graph the two against each other. This gives me a fast way to find if there's something worth looking further into in this big data set.
first <- read.csv(labdata)
second <- read.csv(mrntoimacskey)
third <- read.csv(imacsdata)
firsthalf<-merge(first,second)
mp <-merge(firsthalf, third, by="PATIENTIDNUMBER")
setwd(aplaceforus)
pfile2<- sprintf("%spvalues", todayis)
setwd("fulldataset")
for (m in 10:17) {
n<-m-9
pretty= pretties[n]
for (i in 1:length(colnames(mp))) {
tryCatch(sigsearchA(pfile2,mp, m, i, crayon=pretty), error= function(e)
{cat("ERROR :", conditionMessage(e), "\n")})
tryCatch(sigsearchC(pfile2,mp, m, i, crayon=pretty), error= function(e)
{cat("ERROR :", conditionMessage(e), "\n")})
}
}
sigsearchA<-function(n, mp, y, x, crayon="deepskyblue"){
#anova, plots if significant. takes name of file, name of database,
#and the count of the columns to use for x and y
stat<-oneway.test(mp[[y]]~mp[[x]])
pval<-stat[3]
heads<-colnames(mp)
a<-heads[y]
b<-heads[x]
ps<-c(a, b, pval)
write.table(ps, file=n, append= TRUE, sep =",", col.names=FALSE)
feedback<- paste(c("Added", b, "to", n), collapse=" ")
if (pval <= 0.05 & pval>0) {
#horizontal lables
callit<-paste(c(a,b,".pdf"), collapse="")
val<-sprintf("p=%.5f", pval)
pdf(callit)
plot(mp[[x]], mp[[y]], ylab=a, main=b, col=crayon)
mtext(val, adj=1)
dev.off()
#with vertical lables, in case of many groups
callit<-paste(c(a,b,"V.pdf"), collapse="")
pdf(callit)
plot(mp[[x]], mp[[y]], ylab=a, main=b,las=2,cex.axis=0.7, col=crayon)
mtext(val, adj=1)
dev.off()
}
print(feedback) }
graphics.off()

I can't be absolutely certain without a reproducible example, but it looks like the x-variable in your plot (let's call it x and let's assume your data frame is called df) has at least one row with an empty string ("") or maybe a space character (" ") and x is also coded as a factor. Even if you remove all of the "" values from the data frame, the level for that value will still be part of the factor coding and will show up in plots. To remove the level, do df$x = droplevels(df$x) and then run your plot again.
For illustration, here's an analogous example with the built-in iris data frame:
# Shows that Species is coded as a factor
str(iris)
# Species is a factor with three levels
levels(iris$Species)
# There are 50 rows for each level of Species
table(iris$Species)
# Three boxplots, one for each level of Species
boxplot(iris$Sepal.Width ~ iris$Species)
# Now let's remove all the rows with Species = "setosa"
iris = iris[iris$Species != "setosa",]
# The "setosa" rows are gone, but the factor level remains and shows up
# in the table and the boxplot
levels(iris$Species)
table(iris$Species)
boxplot(iris$Sepal.Width ~ iris$Species)
# Remove empty levels
iris$Species = droplevels(iris$Species)
# Now the "setosa" level is gone from all plots and summaries
levels(iris$Species)
table(iris$Species)
boxplot(iris$Sepal.Width ~ iris$Species)

Related

Set common y axis limits from a list of ggplots

I am running a function that returns a custom ggplot from an input data (it is in fact a plot with several layers on it). I run the function over several different input data and obtain a list of ggplots.
I want to create a grid with these plots to compare them but they all have different y axes.
I guess what I have to do is extract the maximum and minimum y axes limits from the ggplot list and apply those to each plot in the list.
How can I do that? I guess its through the use of ggbuild. Something like this:
test = ggplot_build(plot_list[[1]])
> test$layout$panel_scales_x
[[1]]
<ScaleContinuousPosition>
Range:
Limits: 0 -- 1
I am not familiar with the structure of a ggplot_build and maybe this one in particular is not a standard one as it comes from a "custom" ggplot.
For reference, these plots are created whit the gseaplot2 function from the enrichplot package.
I dont know how to "upload" an R object but if that would help, let me know how to do it.
Thanks!
edit after comments (thanks for your suggestions!)
Here is an example of the a gseaplot2 plot. GSEA stands for Gene Set Enrichment Analysis, it is a technique used in genomic studies. The gseaplot2 function calculates a running average and then plots it and another bar plot on the bottom.
and here is the grid I create to compare the plots generated from different data:
I would like to have a common scale for the "Running Enrichment Score" part.
I guess I could try to recreate the gseaplot2 function and input all of the datasets and then create the grid by facet_wrap, but I was wondering if there was an easy way of extracting parameters from a plot list.
As a reproducible example (from the enrichplot package):
library(clusterProfiler)
data(geneList, package="DOSE")
gene <- names(geneList)[abs(geneList) > 2]
wpgmtfile <- system.file("extdata/wikipathways-20180810-gmt-Homo_sapiens.gmt", package="clusterProfiler")
wp2gene <- read.gmt(wpgmtfile)
wp2gene <- wp2gene %>% tidyr::separate(term, c("name","version","wpid","org"), "%")
wpid2gene <- wp2gene %>% dplyr::select(wpid, gene) #TERM2GENE
wpid2name <- wp2gene %>% dplyr::select(wpid, name) #TERM2NAME
ewp2 <- GSEA(geneList, TERM2GENE = wpid2gene, TERM2NAME = wpid2name, verbose=FALSE)
gseaplot2(ewp2, geneSetID=1, subplots=1:2)
And this is how I generate the plot list (probably there is a much more elegant way):
plot_list = list()
for(i in 1:3) {
fig_i = gseaplot2(ewp2,
geneSetID=i,
subplots=1:2)
plot_list[[i]] = fig_i
}
ggarrange(plotlist=plot_list)

Scatter plot in R doesn't use the x values in the variable indicated in the plot statement

I am trying to make a scatter plot in R between two numeric variables, and it uses the observation number as the x variable. This is the problem I'm trying to fix: I would like to have a scatter plot that uses the values of the x variable I indicated in the plot statement.
Yes, both the X variable and the Y variable are numeric.
I've attached a screenshot showing the data setup (Galton height data), the fact that the father and son variables are both numeric, and the resulting plot.
Here's the code that sets up the data and runs the scatter plot:
#install.packages("dplyr")
library('dplyr')
#tidyverse is name of package used for class
library(tidyverse)
remove.packages('HistData')
install.packages('HistData')
library(HistData)
data("GaltonFamilies")
childNum <- galton_heights[,6]
gender <- galton_heights[,8]
#Different code to get son height
#If we wanted to follow the lesson exactly, we would
#use the following
son_data <- GaltonFamilies[GaltonFamilies$gender == "male" & GaltonFamilies$childNum == 1,]
son <- son_data$childHeight
#Now we can compare the oldest child's height (if they happen to be male) with that of the father:
GaltonFamilies %>% summarize(mean(father), sd(father), mean(son), sd(son))
GaltonFamilies$father2 <- as.numeric(GaltonFamilies$father)
#galton_heights$father <- as.numeric(levels(galton_heights$father))[galton_heights$father]
plot(GaltonFamilies$father,GaltonFamilies$son)
plot(GaltonFamilies$father2, GaltonFamilies$son, main="Scatterplot Example",
xlab="Father ", ylab="Son ")
Edit: the filter statement creating son_data wasn't working when I ran the above code fresh. I don't know why. I've replaced it with a way to get son_data without the filter.
son_data <- GaltonFamilies[GaltonFamilies$gender == "male" & GaltonFamilies$childNum == 1,]
There is no GaltonFamilies$son. See also: Random data added when using `plot` in R

Combining dotplot R

Im trying to combine two plots into the same plot in R.
My code looks like this:
#----------------------------------------------------------------------------------------#
# RING data: Mikkel
#----------------------------------------------------------------------------------------#
# Set working directory
setwd("/Users/mikkelastrup/Dropbox/Master/RING R")
#### Read data & Converting factors ####
dat <- read.table("R SUM kopi.txt", header=TRUE)
str(dat)
dat$Vial <- as.factor(dat$Vial)
dat$Line <- as.factor(dat$Line)
dat$rep <- as.factor(dat$rep)
dat$fly <- as.factor(dat$fly)
str(dat)
mtdata <- droplevels(dat[dat$Line=="20",])
mt1data <- droplevels(mtdata[mtdata$rep=="1",])
tdata <- melt(mt1data, id=c("rep","Conc","Sex","Line","Vial", "fly"))
tdata$variable <- as.factor(tdata$variable)
tfdata <- droplevels(tdata[tdata$Sex=="f",])
tmdata <- droplevels(tdata[tdata$Sex=="m",])
####Plotting####
d1 <- dotplot(tfdata$value~tdata$variable|tdata$Conc,
main="Y Position over time Line 20 Female",
xlab="Time", ylab="mm above buttom")
d2 <- dotplot(tmdata$value~tdata$variable|tdata$Conc,
main="Y Position over time Line 20 Male",
xlab="Time", ylab="mm above buttom")
grid.arrange(d1,d2,ncol=2)
And that looks like this:
Im trying to combine it into one plot, with two different colors for male and female, i have tried to write it into one dotplot separated by a , and or () but that dosen't work and when i dont split the data and use tdata instead of tfdata and tfmdata i get all the dots in the same color. Im open to suggestions, using another package or another way of plotting the data that still looks somewhat like this since im new to R
All you need to do is to use the group parameter.
dotplot(value~variable|Conc, group=Sex, data=tdata,
main="Y Position over time Line 20 All",
xlab="Time", ylab="mm above buttom")
Also, don't use the $ notation in these functions; notice that you're using value from tfdata but value and variable from tdata. This is a problem because there's twice as many rows in tdata! Instead, use the data argument to specify which data frame to get the variables from.

Gantt plot in base r - modifying plot properties

I would like to ask a follow-up question related to the answer given in this post [Gantt style time line plot (in base R) ] on Gantt plots in base r. I feel like this is worth a new question as I think these plots have a broad appeal. I'm also hoping that a new question would attract more attention. I also feel like I need more space than the comments of that question to be specific.
The following code was given by #digEmAll . It takes a dataframe with columns referring to a start time, end time, and grouping variable and turns that into a Gantt plot. I have modified #digEmAll 's function very slightly to get the bars/segments in the Gantt plot to be contiguous to one another rather than having a gap. Here it is:
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col])
maxval <- max(data[,end.col])
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(op) # reset the plotting margins
}
Here are some sample data. You will notice that I have four groups 1-4. However, not all dataframes have all four groups. Some only have two, some only have 3.
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,1,2,2,2,1,1,1))
mydf2 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,2,2,3,4,3,2,1,1))
mydf3 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(4,4,4,4,4,4,3,2,3,3))
mydf4 <- data.frame(startyear=2000:2009, endyear=2001:2010, group=c(1,1,1,2,3,3,3,2,1,1))
Here I run the above function, but specify four colors for plotting:
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf2, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf3, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
plotGantt(mydf4, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
These are the plots:
What I would like to do is modify the function so that:
1) it will plot on the y-axis all four groups regardless of whether they actually appear in the data or not.
2) Have the same color associated with each group for every plot regardless of how many groups there are. As you can see, mydf2 has four groups and all four colors are plotted (1-red, 2-orange, 3-yellow, 4-gray). These colors are actually plotted with the same groups for mydf3 as that only contains groups 2,3,4 and the colors are picked in reverse order. However mydf1 and mydf4 have different colors plotted for each group as they do not have any group 4's. Gray is still the first color chosen but now it is used for the lowest occurring group (group2 in mydf1 and group3 in mydf3).
It appears to me that the main thing I need to work on is the vector 'resources' inside the function, and have that not just contain the unique groups but all. When I try manually overriding to make sure it contains all the groups, e.g. doing something as simple as resources <-as.factor(1:4) then I get an error:
'Error in rect(start, yBottom, end, yTop, col = color) : cannot mix zero-length and non-zero- length coordinates'
Presumably the for loop does not know how to plot data that do not exist for groups that don't exist.
I hope that this is a replicable/readable question and it's clear what I'm trying to do.
EDIT: I realize that to solve the color problem, I could just specify the colors for the 3 groups that exist in each of these sample dfs. However, my intention is to use this plot as an output to a function whereby it wouldn't be known ahead of time if all of the groups exist for a particular df.
I slightly modified your function to account for NA in start and end dates :
plotGantt <- function(data, res.col='resources',
start.col='start', end.col='end', res.colors=rainbow(30))
{
#slightly enlarge Y axis margin to make space for labels
op <- par('mar')
par(mar = op + c(0,1.2,0,0))
minval <- min(data[,start.col],na.rm=T)
maxval <- max(data[,end.col],na.rm=T)
res.colors <- rev(res.colors)
resources <- sort(unique(data[,res.col]),decreasing=T)
plot(c(minval,maxval),
c(0.5,length(resources)+0.5),
type='n', xlab='Duration',ylab=NA,yaxt='n' )
axis(side=2,at=1:length(resources),labels=resources,las=1)
for(i in 1:length(resources))
{
yTop <- i+0.5
yBottom <- i-0.5
subset <- data[data[,res.col] == resources[i],]
for(r in 1:nrow(subset))
{
color <- res.colors[((i-1)%%length(res.colors))+1]
start <- subset[r,start.col]
end <- subset[r,end.col]
rect(start,yBottom,end,yTop,col=color)
}
}
par(mar=op) # reset the plotting margins
invisible()
}
In this way, if you simply append all your possible group values to your data you'll get them printed on the y axis. e.g. :
mydf1 <- data.frame(startyear=2000:2009, endyear=2001:2010,
group=c(1,1,1,1,2,2,2,1,1,1))
# add all the group values you want to print with NA dates
mydf1 <- rbind(mydf1,data.frame(startyear=NA,endyear=NA,group=1:4))
plotGantt(mydf1, res.col='group', start.col='startyear', end.col='endyear',
res.colors=c('red','orange','yellow','gray99'))
About the colors, at the moment the ordered res.colors are applied to the sorted groups; so the 1st color in res.colors is applied to 1st (sorted) group and so on...

Variable class different within function?

In order to streamline future data analysis, I'm trying to write a script that will identify the different self-report scales included in a data.frame and perform routine analyses on each scale's items. Currently, I want it to identify which scales are present, find the responses for each of the scale's items, and then calculate the Cronbach's Alphas for each scale.
Everything seems to be working except when I run my function that should produce a list of alpha() outputs for each scale I get the following error:
> Cronbach.Alphas(scales.data, scale.names)
Error in alpha(data[, responses[[i]]]) :
Data must either be a data frame or a matrix
Obviously I know that this is saying the information being given to the alpha() function is not a data.frame or matrix. The reason I'm so confused though is that when I do these calculations manually step-by-step outside of my Cronbach.Alphas() function, it clearly tells me that it is a data.frame and seems to work like a charm:
> class(scales.data[,responses[[1]]])
[1] "data.frame"
This is driving me crazy and I'll be extremely appreciative of any help with figuring this out. My full code is pasted below. (Note: I'm pretty new to programming functions in R so the way I'm doing things is probably not optimal. Any additional advice is welcome as well.)
Also, it might help to mention that my code is designed to identify scale names based on the presence of an underscore in a column name. That is, "rsq_12" indicates the scale as rsq and the column as responses to item 12 of the scale.
require(psych)
##### Function for identifying names of scales present in the data file #####
GetScales <- function(x) {
find.scale.names <- regexec("^(([^_]+)_)", colnames(x))
scales <- do.call(rbind, lapply(regmatches(colnames(x), find.scale.names), `[`, 3L))
colnames(scales) <- "scale"
na.find <- ifelse(is.na(scales[,1]), 0, 1)
scales <- cbind(scales, na.find)
output <- scales[scales[,2] == 1,]
output[,1]
}
##### Function for calculating cronbach's alpha for each scale #####
Cronbach.Alphas <- function(data, scales){
for(i in 1:length(scales)){
if(i == 1) {
responses <- list(grep(scales[i], colnames(data)))
alphas <- list(alpha(data[,responses[[i]]]))
} else {
responses <- append(responses, list(grep(scales[i], colnames(data))))
alphas <- append(alphas, list(alpha(data[,responses[[i]]])))
}
}
return(alphas)
}
### Import data from .csv file ###
scales.data <- data.frame(read.csv(file.choose()))
### Identify each item's scale ###
scale.items <- GetScales(scales.data)
### Reduce to names of scales ###
scale.names <- cbind(scale.items, !duplicated(scale.items))
scale.names <- scale.names[scale.names[,2] == TRUE, 1]
scale.names
### Calculate list of alphas ###
Cronbach.Alphas(scales.data, scale.names)
Thank you to anyone who has taken the time to look over my code. I appreciate your help. I was working off of the suggestions left here when I realized a simple mistake on my part...
One of the scales in the dataset that I've been using as a test while working on this script had only one item in it. Thus, data[,responses[[i]]] in my Cronbach.Alphas() function was passing a vector (rather than a data.frame or matrix) to the alpha() function at that point in the for loop. It is impossible to calculate cronbach alpha for a single item scale because it is an index of inter-item reliability...
Sooooo, all my code needed was a way to identify scales with just one item:
Cronbach.Alphas <- function(data, scales){
for(i in 1:length(scales)){
if(i == 1) {
responses <- list(grep(scales[i], colnames(data)))
if(length(responses[[i]]) > 1){
alphas <- list(alpha(data[,responses[[i]]]))
}
} else {
responses <- append(responses, list(grep(scales[i], colnames(data))))
if(length(responses[[i]]) > 1){
alphas <- append(alphas, list(alpha(data[,responses[[i]]])))
}
}
}
return(alphas)
}
Sorry for wasting anyone's time with my mistake. On the plus side, by substituting this new Cronbach.Alphas() function into the script above, I've now posted a script that will automatically identify scales and produce a list of cronbach's alphas (provided the columns are named with an underscore after the scale names) for anyone who might interested. Thanks again!

Resources