How to plot specific data points in a column in R script - r

Imagine there are two columns, one for p-value and the other representing slope. I want to find a way to plot only the slope data points that have a significant p-value. Here is my code:
print("State the file name (include .csv)")
filename <- readline()
file <- read.csv(filename)
print ("Only include trials with p-value < .05? (enter yes or no)")
pval_filter <- readline()
if (pval_filter == "yes"){
i <- 0
count <- 0
filtered <- NULL
while (i > length(file$pval)){
if (file$pval[i] < .05){
filtered[count] <- i
count <- count + 1
}
i <- i + 1
}
x <- 0
while (x != -1){
print("State the variable to be plotted")
temp_var <- readline()
counter <- 0
var <- NULL
while (counter > length(filtered)){
var[counter] = file [, temp_var][filtered[counter]]
counter <- counter + 1
}
print ("State the title of the histogram")
title <- readline()
hist(var, main = title, xlab = var)
print("Enter -1 to exit or any other number to plot another variable")
x <- readline()
}
}

Isn't this much shorter and produces roughly the same:
df = read.csv('file.csv')
df = df[df$pval < 0.05,]
hist(df$value)
This should at least get you started.
Some remarks regarding the code:
You use a lot of reserved names (var, file) as an object name, that is a bad idea.
If you want the program to work with user input, you need to check it before doing anything with it.
There is no need to explicitly loop over rows in a data.frame, R is vectorized (e.g. see how I subsetted df above). This style looks like Fortran, there is no need for it in R.

It is hard to tell exactly what you want. It is best if an example is reproducible (we can copy/paste and run, we don't have your data so that does not work) and is minimal (there is a lot in your code that I don't think deals with your question).
But some pointers that may help.
First, the readline function has a prompt argument that will give you better looking interaction than the print statements.
If all your data is in a data frame with columns p and b for p-value and slope then you can include only the b values for which p<=0.05 with simple subsetting like:
hist( mydataframe$b[ mydataframe$p <= 0.05 ] )
or
with( mydataframe, hist(b[p<=0.05]) )
Is that enough to answer your question?

Given that data = cbind(slopes, pvalues) (so col(data) == 2)
Like this:
plot(data[data[ ,2] < 0.05 , ])
Explanation:
data[ ,2] < 0.05 will return a vector of TRUE/FALSE with the length of the columns.
so then you will get:
data[c(TRUE, FALSE....), ]
From there on, only the data will be selected where it says TRUE.
You will thus plot only those x's and y's where the pvalue is lower than 0.05.

Here is the code to plot only the slope data points with significant p-value:
Assuming the column names of the file will be pval and slope.
# Prompt a message on the Terminal
filename <- readline("Enter the file name that have p-value and slopes (include .csv)")
# Read the filename from the terminal
file <- read.csv(filename, header = TRUE)
# Prompt a message again on the Terminal and read the acceptance from user
pval_filter <- readline("Only include trials with p-value < .05? (enter yes or no)")
if (to-lower(pval_filter) == "yes"){
# Create a filtered file that contain only rows with the p-val less than that of siginificatn p-val 0.05
file.filtered <- file[file$pval < 0.05, ]
# Get the title of the Histogram to be drawn for the slopes (filtered)
hist.title <- readline("State the title of the histogram")
# Draw histogram for the slopes with the title
# las = 2 parameter in the histogram below makes the slopes to be written in parpendicular to the X-axis
# so that, the labels will not be overlapped, easily readable.
hist(file.filtered$slope, main = hist.title, xlab = Slope, ylab = frequency, las = 2)
}
Hope this would help.

Related

Way to progressively overlap line plots in R

I have a for loop from which I call a function grapher() which extracts certain columns from a dataframe (position and w, both continuous variables) and plots them. My code changes the Y variable (called w here) each time it runs and so I'd like to plot it as an overlay progressively. If I run the grapher() function 4 times for example, I'd like to have 4 plots where the first plot has only 1 line, and the 4th has all 4 overlain on each other (as different colours).
I've already tried points() as suggested in other posts, but for some reason it only generates a new graph.
grapher <- function(){
position.2L <- data[data$V1=='2L', 'V2']
w.2L <- data[data$V1=='2L', 'w']
plot(position.2L, w.2L)
points(position.2L, w.2L, col='green')
}
# example of my for loop #
for (t in 1:200){
#code here changes the 'w' variable each iteration of 't'
if (t%%50==0){
grapher()
}
}
Not knowing any details about your situation I can only assume something like this might be applicable.
# Example data set
d <- data.frame(V1=rep(1:2, each=6), V2=rep(1:6, 2), w=rep(1:6, each=2))
# Prepare the matrix we will write to.
n <- 200
m <- matrix(d$w, nrow(d), n)
# Loop progressively adding more noise to the data
set.seed(1)
for (i in 2:n) {
m[,i] <- m[,i-1] + rnorm(nrow(d), 0, 0.05)
}
# We can now plot the matrix, selecting the relevant rows and columns
matplot(m[d$V1 == 1, seq(1, n, by=50)], type="o", pch=16, lty=1)

r - how to find the first empty bin of a histogram

I have a large dataset with response times. I need to make reference to the first empty bin of the histogram (with x being milliseconds), and exclude all data that comes after that.
I think
Can anybody help?
If you capture the return of hist it contains all of the information that you need.
set.seed(3)
x = rnorm(20)
H = hist(x)
min(which(H$counts == 0))
[1] 5
To exclude the data that bin and above
MIN = min(which(H$counts == 0))
x = x[x<H$breaks[MIN]]

Where is this extra data coming from? (in R plot)

I'm a biologist, but I had to teach myself python and R working different places a few years ago. A situation came up at my current job that R would be really useful for, and so i cobbled together a program. Surprisingly, it does just what I'd like EXCEPT the graphs it's generating have an extra bar at the beginning. !
I've entered no data to correspond to that first bar:
I'm hoping this is some simple error in how I've set the plot parameters. Could it be because I'm using plot instead of boxplot? Is it plotting the headings?
More worrisome is the possibility that while reading in and merging my 3 data frames I'm creating some sort of artifact data, which would also affect the statistical tests and make me very sad, though I don't see anything like this when I have it write the matrix to a file.
I greatly appreciate any help!
Here's what it looks like, and then the function it calls (in another script).
(I'm really not a programmer, so I apologize if the following code is miserable.) The goal is to compare our data (which is in columns 10-17 of a csv) to all of the data in a big sheet of clinical data in turn. Then, if there is a significant correlation (the p value is less than .05), to graph the two against each other. This gives me a fast way to find if there's something worth looking further into in this big data set.
first <- read.csv(labdata)
second <- read.csv(mrntoimacskey)
third <- read.csv(imacsdata)
firsthalf<-merge(first,second)
mp <-merge(firsthalf, third, by="PATIENTIDNUMBER")
setwd(aplaceforus)
pfile2<- sprintf("%spvalues", todayis)
setwd("fulldataset")
for (m in 10:17) {
n<-m-9
pretty= pretties[n]
for (i in 1:length(colnames(mp))) {
tryCatch(sigsearchA(pfile2,mp, m, i, crayon=pretty), error= function(e)
{cat("ERROR :", conditionMessage(e), "\n")})
tryCatch(sigsearchC(pfile2,mp, m, i, crayon=pretty), error= function(e)
{cat("ERROR :", conditionMessage(e), "\n")})
}
}
sigsearchA<-function(n, mp, y, x, crayon="deepskyblue"){
#anova, plots if significant. takes name of file, name of database,
#and the count of the columns to use for x and y
stat<-oneway.test(mp[[y]]~mp[[x]])
pval<-stat[3]
heads<-colnames(mp)
a<-heads[y]
b<-heads[x]
ps<-c(a, b, pval)
write.table(ps, file=n, append= TRUE, sep =",", col.names=FALSE)
feedback<- paste(c("Added", b, "to", n), collapse=" ")
if (pval <= 0.05 & pval>0) {
#horizontal lables
callit<-paste(c(a,b,".pdf"), collapse="")
val<-sprintf("p=%.5f", pval)
pdf(callit)
plot(mp[[x]], mp[[y]], ylab=a, main=b, col=crayon)
mtext(val, adj=1)
dev.off()
#with vertical lables, in case of many groups
callit<-paste(c(a,b,"V.pdf"), collapse="")
pdf(callit)
plot(mp[[x]], mp[[y]], ylab=a, main=b,las=2,cex.axis=0.7, col=crayon)
mtext(val, adj=1)
dev.off()
}
print(feedback) }
graphics.off()
I can't be absolutely certain without a reproducible example, but it looks like the x-variable in your plot (let's call it x and let's assume your data frame is called df) has at least one row with an empty string ("") or maybe a space character (" ") and x is also coded as a factor. Even if you remove all of the "" values from the data frame, the level for that value will still be part of the factor coding and will show up in plots. To remove the level, do df$x = droplevels(df$x) and then run your plot again.
For illustration, here's an analogous example with the built-in iris data frame:
# Shows that Species is coded as a factor
str(iris)
# Species is a factor with three levels
levels(iris$Species)
# There are 50 rows for each level of Species
table(iris$Species)
# Three boxplots, one for each level of Species
boxplot(iris$Sepal.Width ~ iris$Species)
# Now let's remove all the rows with Species = "setosa"
iris = iris[iris$Species != "setosa",]
# The "setosa" rows are gone, but the factor level remains and shows up
# in the table and the boxplot
levels(iris$Species)
table(iris$Species)
boxplot(iris$Sepal.Width ~ iris$Species)
# Remove empty levels
iris$Species = droplevels(iris$Species)
# Now the "setosa" level is gone from all plots and summaries
levels(iris$Species)
table(iris$Species)
boxplot(iris$Sepal.Width ~ iris$Species)

Identify spikes/peaks in density plot by group

I created a density plot with ggplot2 package for R. I would like to identify the spikes/peaks in the plot which occur between 0.01 and 0.02. There are too many legends to pick it out so I deleted all legends. I tried to filter my data out to find most number of rows that a group has between 0.01 and 0.02. Then I filtered out the selected group to see whether the spike/peak is gone but no, it is there plotted still. Can you suggest a way to identify these spikes/peaks in these plots?
Here is some code :
ggplot(NumofHitsnormalized, aes(NumofHits_norm, fill = name)) + geom_density(alpha=0.2) + theme(legend.position="none") + xlim(0.0 , 0.15)
## To filter out the data that is in the range of first spike
test <- NumofHitsnormalized[which(NumofHitsnormalized$NumofHits_norm > 0.01 & NumofHitsnormalized$NumofHits_norm <0.02),]
## To figure it out which group (name column) has the most number of rows ##thus I thought maybe I could get the data that lead to spike
testMatrix <- matrix(ncol=2, nrow= length(unique(test$name)))
for (i in 1:length(unique(test$name))){
testMatrix[i,1] <- unique(test$name)[i]
testMatrix[i,2] <- nrow(unique(test$name)[i])}
Konrad,
This is the new plot made after I filtered my data out with extremevalues package. There are new peaks and they are located at different intervals and it also says 96% of the initial groups have data in the new plot (though number of rows in filtered data reduced to 0.023% percent of the initial dataset) so I cant identify which peaks belong to which groups.
I had a similar problem to this.
How i did was to create a rolling mean and sd of the y values with a 3 window.
Calculate the average sd of your baseline data ( the data you know won't have peaks)
Set a threshold value
If above threshold, 1, else 0.
d5$roll_mean = runMean(d5$`Current (pA)`,n=3)
d5$roll_sd = runSD(x = d5$`Current (pA)`,n = 3)
d5$delta = ifelse(d5$roll_sd>1,1,0)
currents = subset(d5,d5$delta==1,na.rm=TRUE) # Finds all peaks
my threshold was a sd > 1. depending on your data you may want to use mean or sd. for slow rising peaks mean would be a better idea than sd.
Without looking at the code, I drafted this simple function to add TRUE/FALSE flags to variables indicating outliers:
GenerateOutlierFlag <- function(x) {
# Load required packages
Vectorize(require)(package = c("extremevalues"), char = TRUE)
# Run check for ouliers
out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iLeft,
TRUE,FALSE)
out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iRight,
TRUE,out_flg)
return(out_flg)
}
If you care to read about the extremevalues package you will see that it provides some flexibility in terms of identifying outliers but broadly speaking it's a good tool for finding various peaks or spikes in the data.
Side point
You could actually optimise it significantly by creating one object corresponding to getOutliers(x, method = "I") instead of calling the method twice.
More sensible syntax
GenerateOutlierFlag <- function(x) {
# Load required packages
require("extremevalues")
# Outliers object
outObj <- getOutliers(x, method = "I")
# Run check for ouliers
out_flg <- ifelse(1:length(x) %in% outObj$iLeft,
TRUE,FALSE)
out_flg <- ifelse(1:length(x) %in% outObj$iRight,
TRUE,out_flg)
return(out_flg)
}
Results
x <- c(1:10, 1000000, -99099999)
table(GenerateOutlierFlag(x))
FALSE TRUE
10 2

Utilise Surv object in ggplot or lattice

Anyone knows how to take advantage of ggplot or lattice in doing survival analysis? It would be nice to do a trellis or facet-like survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code in taking the list elements into a dataframe, but I couldnt figure out another way.
Note: It only works with two levels of strata. If anyone know how I can use x<-length(stratum) to do this please let me know (in Stata I could append to a macro-unsure how this works in R).
ggkm<-function(time,event,stratum) {
m2s<-Surv(time,as.numeric(event))
fit <- survfit(m2s ~ stratum)
f$time <- fit$time
f$surv <- fit$surv
f$strata <- c(rep(names(fit$strata[1]),fit$strata[1]),
rep(names(fit$strata[2]),fit$strata[2]))
f$upper <- fit$upper
f$lower <- fit$lower
r <- ggplot (f, aes(x=time, y=surv, fill=strata, group=strata))
+geom_line()+geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.3)
return(r)
}
I have been using the following code in lattice. The first function draws KM-curves for one group and would typically be used as the panel.group function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) | !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names - R will reuse the vector so that equal numbers of rows
#are labeled with each group.
curves$group <- unique(group)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}

Categories

Resources