Remove outliers by condition from list of data frames - r

I am trying to write a function that removes multiple outliers, identified via Cook's distance, from a list of data frames.
There are some problems at the moment:
Part 1: Can I formulate this part as a function? I tried several things that did not work out. I want to use several different variables for the lm, so it would be great if I could use column numbers and the regular expression syntax of data frames as input arguments.
Part 2: The file names of the plots are not correct. The first observation in each data frame from the list is used as the file name. How can I correct this?
Part 3: The data frames without the outliers are not created. The function ends after the message is printed. I can't find my mistake.
data(iris)
iris.lst <- split(iris[, 1:2], iris$Species)
new_names <- c(paste0(unlist(levels(iris$Species)),"_data"))
for (i in 1:length(iris.lst)) {
assign(new_names[i], iris.lst[[i]])
}
# Part 1: Then cooks distances
fit <- lapply(mget(ls(pattern = "_data")),
function(x) lm(x[,1] ~ x[,2], data = x)) # each data frame has only two columns
cooksd <-lapply(fit,cooks.distance)
# Part 2: Plot each data frame with suspected outlier
plots <- function(x){
jpeg(file=paste0(names(x),".jpeg")) # file names are numbers
#par(mfrow=c(2,1))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance") # plot cook's distance
abline(h = 3*mean(x, na.rm=T), col="red") # add cutoff line
text(x=1:length(x)+1, y=x, labels=ifelse(x > 3*mean(x, na.rm=T),
names(x),""), col="red")
dev.off()
}
myplots <- lapply(cooksd, plots)
# Part 3: give me new data frames without influential cases
show_influential_cases <- function(x){
# invisible(cooksd[["n_OG"]] <- lapply(cooksd, length)
influential <- lapply(x,function(x) names(x)[x > 3*mean(x, na.rm=T)])
test <- as.data.frame(unlist(influential))[,1]
test <- as.numeric(test)
}
tested <- show_influential_cases(result)
cleaned_data <- add_new[-tested,] # removing outliers by indexing
Could someone please help me to improve my code?
Many thanks,
Nadine

In general, it is not good practice to create multiple data frames in the global environment. Lists are usually a better option because they are easier to manage.
Part 1 -
You can combine multiple steps in one lapply function. Here in part 1 we apply lm and cooks.distance function together in the same lapply call.
master_data <- split(iris[, 1:2], iris$Species)
data <- lapply(master_data, function(x) {
cooks.distance(lm(Sepal.Length ~ Sepal.Width, data = x))
})
new_names <- paste0(levels(iris$Species),"_data")
names(data) <- new_names
Part 2 -
lapply passes only the list elements to the function, not their names. Pass the names separately and use Map to call the plots function.
plots <- function(x, y){
jpeg(file=paste0(y,".jpeg"))
plot(x, pch="*", cex=2, main="Influential cases by Cooks distance")
abline(h = 3*mean(x, na.rm=TRUE), col="red") # add cutoff line
# label only the influential observations with their row names
text(x=1:length(x)+1,y=x,labels=ifelse(x > 3*mean(x, na.rm=TRUE),names(x),""), col="red")
dev.off()
}
Map(plots, data, names(data))
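A minimal illustration of the difference, using a small made-up list (`lst` and its values are assumptions for this sketch, not the real Cook's distances):

```r
# Hypothetical list of named vectors standing in for the Cook's distances
lst <- list(setosa_data = c(`1` = 0.1, `2` = 0.5),
            virginica_data = c(`101` = 0.2, `102` = 0.3))

# Inside lapply, x is a single element, so names(x) are the observation labels
inside <- lapply(lst, function(x) names(x)[1])

# Map walks over the elements and the list names in parallel
file_names <- Map(function(x, nm) paste0(nm, ".jpeg"), lst, names(lst))
```

This is why the original plots were saved under observation numbers: names(x) never sees the name of the list element.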
Part 3 -
I am not exactly sure how you want to perform Part 3, but for now the function below drops the influential rows (Cook's distance above 3 times the mean) from each data frame.
remove_influential_cases <- function(x, y){
inds <- x > 3*mean(x, na.rm=TRUE)
y[!inds, ]
}
result <- Map(remove_influential_cases, data, master_data)
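On the Part 1 question about choosing the model variables by column number, one possible sketch is shown here. The helper name `cooks_by_cols` and its arguments are made up for illustration; it builds the formula from column names with `reformulate()` and otherwise follows the same 3 * mean cutoff as above.

```r
# Hypothetical helper: fit lm() on columns chosen by position and
# return the Cook's distances of that fit
cooks_by_cols <- function(df, y_col, x_cols) {
  # reformulate() builds a formula such as Sepal.Length ~ Sepal.Width + Petal.Length
  f <- reformulate(names(df)[x_cols], response = names(df)[y_col])
  cooks.distance(lm(f, data = df))
}

master_data <- split(iris[, 1:4], iris$Species)
cooksd <- lapply(master_data, cooks_by_cols, y_col = 1, x_cols = 2:3)

# Drop the rows whose Cook's distance exceeds 3 times the mean
cleaned <- Map(function(d, df) df[!(d > 3 * mean(d, na.rm = TRUE)), ],
               cooksd, master_data)
sapply(cleaned, nrow)
```

Because the formula is built from names rather than expressions like x[,1], the fitted model keeps readable coefficient names.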

Related

Can someone explain what these lines of code mean?

I have been trying to find a way to make a scatter plot with colour intensity that is indicative of the density of points plotted in the area (it's a big data set with lots of overlap). I found these lines of code which allow me to do this but I want to make sure I actually understand what each line is actually doing.
Thanks in advance :)
get_density <- function(x, y, ...){
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
Below is the function with some explanatory comments, let me know if anything is still confusing:
# The function "get_density" takes two arguments, called x and y
# The "..." allows you to pass other arguments
get_density <- function(x, y, ...){
# The "MASS::" means it comes from the MASS package, but makes it so you don't have to load the whole MASS package and can just pull out this one function to use.
# This is where the arguments passed as "..." (above) would get passed along to the kde2d function
dens <- MASS::kde2d(x, y, ...)
# These lines use the base R function "findInterval" to find, for each point, the index of the grid interval (in dens$x and dens$y) that its x and y value falls into
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
# This command "cbind" binds the two sets of indices together, each as one column
ii <- cbind(ix, iy)
# Indexing the density matrix dens$z with this two-column matrix returns one density value per point (row index ix, column index iy)
return(dens$z[ii])
}
# The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
set.seed(1)
dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)
If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!
It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.
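The lookup step inside get_density can be tried on a tiny hand-made grid. Everything here (`grid_x`, `grid_y`, `z`, the two points) is invented for illustration and stands in for the dens$x, dens$y and dens$z returned by kde2d:

```r
# Hypothetical 3 x 3 grid: z[i, j] belongs to grid_x[i] and grid_y[j]
grid_x <- c(0, 1, 2)
grid_y <- c(0, 10, 20)
z <- matrix(1:9, nrow = 3)

# Two made-up points
px <- c(0.5, 2.0)
py <- c(15, 0)

# Index of the grid interval each coordinate falls into
ix <- findInterval(px, grid_x)
iy <- findInterval(py, grid_y)

# A two-column matrix picks one cell per row: z[ix[1], iy[1]], z[ix[2], iy[2]]
point_values <- z[cbind(ix, iy)]
```

The first point lands in row 1, column 2 of `z` and the second in row 3, column 1, so each point gets the value of the grid cell it sits in.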

Way to progressively overlap line plots in R

I have a for loop from which I call a function grapher() which extracts certain columns from a dataframe (position and w, both continuous variables) and plots them. My code changes the Y variable (called w here) each time it runs and so I'd like to plot it as an overlay progressively. If I run the grapher() function 4 times for example, I'd like to have 4 plots where the first plot has only 1 line, and the 4th has all 4 overlain on each other (as different colours).
I've already tried points() as suggested in other posts, but for some reason it only generates a new graph.
grapher <- function(){
position.2L <- data[data$V1=='2L', 'V2']
w.2L <- data[data$V1=='2L', 'w']
plot(position.2L, w.2L)
points(position.2L, w.2L, col='green')
}
# example of my for loop #
for (t in 1:200){
#code here changes the 'w' variable each iteration of 't'
if (t%%50==0){
grapher()
}
}
Not knowing any details about your situation I can only assume something like this might be applicable.
# Example data set
d <- data.frame(V1=rep(1:2, each=6), V2=rep(1:6, 2), w=rep(1:6, each=2))
# Prepare the matrix we will write to.
n <- 200
m <- matrix(d$w, nrow(d), n)
# Loop progressively adding more noise to the data
set.seed(1)
for (i in 2:n) {
m[,i] <- m[,i-1] + rnorm(nrow(d), 0, 0.05)
}
# We can now plot the matrix, selecting the relevant rows and columns
matplot(m[d$V1 == 1, seq(1, n, by=50)], type="o", pch=16, lty=1)
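If the plots have to be drawn inside the loop, another option is to keep every checkpoint of w and redraw all of them each time. This is only a sketch with made-up data (`x`, `w` and `snapshots` are assumptions, not the asker's objects):

```r
x <- 1:10                      # stand-in for the position column
set.seed(1)
w <- rnorm(10)                 # stand-in for the w column
snapshots <- NULL              # one column per checkpoint

pdf(NULL)                      # discards the graphics; remove this line to see the plots
for (t in 1:200) {
  w <- w + rnorm(10, 0, 0.05)  # placeholder for the code that changes w
  if (t %% 50 == 0) {
    snapshots <- cbind(snapshots, w)
    # Redraw every stored curve: 1 line at t = 50, 4 lines at t = 200
    matplot(x, snapshots, type = "l", lty = 1,
            col = seq_len(ncol(snapshots)), ylab = "w")
  }
}
invisible(dev.off())
```

Each call to matplot starts a new plot, so the first plot has one line and the fourth has all four in different colours.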

Plotting subsets of an AffyRNAdeg {affy} object with plotAffyRNAdeg?

library(affy)
microarrays <- ReadAffy() # 98 CEL files are read into the same object
RNAdeg <- AffyRNAdeg(microarrays)
Now I want to plot subsets of RNAdeg
plotAffyRNAdeg(RNAdeg[.......?]) # What can I do?
I've tried various 'for' loops without success.
But if plot line colors are specified then plotAffyRNAdeg plots a subset of 1:(number of colors specified), but I haven't thought of a way to use that effectively. For example, below plots the first through the sixth AffyRNAdeg'd set of microarray data (first through sixth .CEL file read in by ReadAffy() )
plotAffyRNAdeg(RNAdeg,col=c(2,2,2,3,3,3))
OK, one way was found by running AffyRNAdeg() on subsets of the object the CEL files are in and putting the resulting data in a list of lists organized by experiment, then plotting the list elements. Maybe there is an easier way, but this worked (I'm quite new to R).
library(affy)
library(RColorBrewer)
# Example of the sample names (console output):
# > sampleNames(ARTHwoundMA[,11:14])
# [1] "GSE18960_05_GSM469416_trt_rep2.CEL" "GSE18960_06_GSM469418_trt_rep3.CEL"
# [3] "GSE5525_GSM128715_ctrl12h.CEL" "GSE5525_GSM128716_ctrl24h.CEL"
# RNA DEG
# Indices to subset by experiment
cel_names <- substr(sampleNames(ARTHwoundMA),1,7)
unique_exp <- unique(substr(sampleNames(ARTHwoundMA),1,7))
exp_ind <- list()
for (i in 1:length(unique_exp))
{
tempvec <- vector()
for (j in 1:length(cel_names))
{
if (cel_names[j]==unique_exp[i])
{
tempvec <- append(tempvec,j)
}
}
exp_ind[[(length(exp_ind)+1)]] <- tempvec
}
# Calculating
RNAdeg_exp <- list()
for(i in 1:length(exp_ind))
{
RNAdeg_exp[[i]] <- AffyRNAdeg(ARTHwoundMA[,exp_ind[[i]]])
}
# Plotting (the colour palette is built per experiment inside the loop below)
pdf(file="C:\\R working directory\\TEST\\RNAdeg_plots.pdf")
for(i in 1:length(exp_ind))
{
par(bg="gray")
colors <- colorRampPalette(rev(brewer.pal(9, "Reds")))(length(exp_ind[[i]]))
plotAffyRNAdeg(RNAdeg_exp[[i]], col=colors)
plot.new()
legend("topleft", lty=1, lwd=2,col=colors,
legend=paste(sampleNames(ARTHwoundMA[,exp_ind[[i]]])))
}
dev.off()
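The nested loops that build the per-experiment indices can probably be replaced by split(). The sketch below uses only the four sample names shown above, typed in by hand, rather than a real AffyBatch object:

```r
# Sample names copied from the console output above (an AffyBatch is not needed here)
sample_names <- c("GSE18960_05_GSM469416_trt_rep2.CEL",
                  "GSE18960_06_GSM469418_trt_rep3.CEL",
                  "GSE5525_GSM128715_ctrl12h.CEL",
                  "GSE5525_GSM128716_ctrl24h.CEL")
cel_names <- substr(sample_names, 1, 7)

# split() groups the positions by experiment prefix; fixing the factor levels
# keeps the experiments in order of first appearance, as unique() does
exp_ind <- split(seq_along(cel_names),
                 factor(cel_names, levels = unique(cel_names)))
```

The result is a named list of index vectors that can be used in the same way as exp_ind in the loops above.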

How to include variable values in histogram titles in R - using by()

I want to produce histograms using by(), how can I access the values of the factors, to include in histogram headings, for example...
a <- runif(500, 0, 10)
b <- LETTERS[1:5]
c <- c("Condition1", "Condition2")
x <- data.frame("Variable1" = b, "Variable2"= c, "Value"=a)
head(x)
by(x$Value, x$Variable2, hist)
or using two variables
by(x$Value, list(x$Variable2, x$Variable1), hist)
Is there a way of passing the variable value (eg Condition1) to the title of the histogram using the options within hist(), eg putting function(x) hist(x, main=...) into by()?
Pass the split up dataframe rather than just the Values. Then you will have more to work with:
by(x, x$Variable2, function(x) hist(x$Value, main=unique(x$Variable2) ) )
This produces two plots labeled Condition1 and Condition2.
This doesn't really answer your question, since you're specifying the use of by(), but I usually use split() and lapply() for these types of problems. My approach is usually along the lines of:
temp <- split(x$Value, list(x$Variable2, x$Variable1))
lapply(names(temp), function(x) hist(temp[[x]], main = x, xlab = "Value"))
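The by() approach extends to the two-factor case from the question by pasting the two values together. A sketch with regenerated example data (the seed and the " / " separator are arbitrary choices):

```r
set.seed(42)
x <- data.frame(Variable1 = LETTERS[1:5],
                Variable2 = c("Condition1", "Condition2"),
                Value = runif(500, 0, 10))

pdf(NULL)  # discards the graphics; remove this line to see the histograms
titles <- by(x, list(x$Variable2, x$Variable1), function(d) {
  # Each subset d holds exactly one value of each grouping variable
  title <- paste(unique(d$Variable2), unique(d$Variable1), sep = " / ")
  hist(d$Value, main = title, xlab = "Value")
  title
})
invisible(dev.off())
```

Returning the title from the function is only there to make the result easy to inspect; the histograms are drawn as a side effect.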

Utilise Surv object in ggplot or lattice

Does anyone know how to take advantage of ggplot or lattice when doing survival analysis? It would be nice to produce trellis- or facet-style survival graphs.
So in the end I played around and sort of found a solution for a Kaplan-Meier plot. I apologize for the messy code that takes the list elements into a data frame, but I couldn't figure out another way.
Note: It only works with two levels of strata. If anyone knows how I can use x<-length(stratum) to do this, please let me know (in Stata I could append to a macro; I am unsure how this works in R).
ggkm<-function(time,event,stratum) {
# Requires the survival and ggplot2 packages to be loaded
m2s<-Surv(time,as.numeric(event))
fit <- survfit(m2s ~ stratum)
# Collect the pieces of the survfit object in a data frame
f <- data.frame(time = fit$time, surv = fit$surv)
f$strata <- c(rep(names(fit$strata[1]),fit$strata[1]),
rep(names(fit$strata[2]),fit$strata[2]))
f$upper <- fit$upper
f$lower <- fit$lower
# The "+" must end the line so that R continues the expression
r <- ggplot (f, aes(x=time, y=surv, fill=strata, group=strata)) +
geom_line()+geom_ribbon(aes(ymin=lower,ymax=upper),alpha=0.3)
return(r)
}
I have been using the following code in lattice. The first function draws KM-curves for one group and would typically be used as the panel.group function, while the second adds the log-rank test p-value for the entire panel:
km.panel <- function(x,y,type,mark.time=T,...){
na.part <- is.na(x)|is.na(y)
x <- x[!na.part]
y <- y[!na.part]
if (length(x)==0) return()
fit <- survfit(Surv(x,y)~1)
if (mark.time){
cens <- which(fit$time %in% x[y==0])
panel.xyplot(fit$time[cens], fit$surv[cens], type="p",...)
}
panel.xyplot(c(0,fit$time), c(1,fit$surv),type="s",...)
}
logrank.panel <- function(x,y,subscripts,groups,...){
lr <- survdiff(Surv(x,y)~groups[subscripts])
otmp <- lr$obs
etmp <- lr$exp
df <- (sum(1 * (etmp > 0))) - 1
p <- 1 - pchisq(lr$chisq, df)
p.text <- paste("p=", signif(p, 2))
grid.text(p.text, 0.95, 0.05, just=c("right","bottom"))
panel.superpose(x=x,y=y,subscripts=subscripts,groups=groups,...)
}
The censoring indicator has to be 0-1 for this code to work. The usage would be along the following lines:
library(survival)
library(lattice)
library(grid)
data(colon) #built-in example data set
xyplot(status~time, data=colon, groups=rx, panel.groups=km.panel, panel=logrank.panel)
If you just use 'panel=panel.superpose' then you won't get the p-value.
I started out following almost exactly the approach you use in your updated answer. But the thing that's irritating about the survfit is that it only marks the changes, not each tick - e.g., it will give you 0 - 100%, 3 - 88% instead of 0 - 100%, 1 - 100%, 2 - 100%, 3 - 88%. If you feed that into ggplot, your lines will slope from 0 to 3, rather than remaining flat and dropping straight down at 3. That might be fine depending on your application and assumptions, but it's not the classic KM plot. This is how I handled the varying numbers of strata:
groupvec <- c()
for(i in seq_along(x$strata)){
groupvec <- append(groupvec, rep(x = names(x$strata[i]), times = x$strata[i]))
}
f$strata <- groupvec
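Since rep() is vectorised over its times argument, the same labels can likely be built without a loop. A sketch using the `lung` data set that ships with the survival package (the data set and the grouping by sex are choices made for this example only):

```r
library(survival)

# Example fit with two strata
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# fit$strata is a named vector of row counts per stratum, so each name
# is repeated as many times as its stratum has rows in the fit
strata_labels <- rep(names(fit$strata), times = fit$strata)
```

This handles any number of strata, because nothing refers to a specific position in fit$strata.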
For what it's worth, this is how I ended up doing it - but this isn't really a KM plot, either, because I'm not calculating out the KM estimate per se (although I have no censoring, so this is equivalent... I believe).
survcurv <- function(surv.time, group = NA) {
#Must be able to coerce surv.time and group to vectors
if(!is.vector(as.vector(surv.time)) || !is.vector(as.vector(group))) {stop("surv.time and group must be coercible to vectors.")}
#Make sure that the surv.time is numeric
if(!is.numeric(surv.time)) {stop("Survival times must be numeric.")}
#Group can be just about anything, but must be the same length as surv.time
if(length(surv.time) != length(group)) {stop("The vectors passed to the surv.time and group arguments must be of equal length.")}
#What is the maximum number of ticks recorded?
max.time <- max(surv.time)
#What is the number of groups in the data?
n.groups <- length(unique(group))
#Use the number of ticks (plus one for t = 0) times the number of groups to
#create an empty skeleton of the results.
curves <- data.frame(tick = rep(0:max.time, n.groups), group = NA, surv.prop = NA)
#Add the group names - each group is repeated once per tick so that every
#tick/group combination appears exactly once.
curves$group <- rep(unique(group), each = max.time + 1)
#For each row, calculate the number of survivors in group[i] at tick[i]
for(i in seq_len(nrow(curves))){
curves$surv.prop[i] <- sum(surv.time[group %in% curves$group[i]] > curves$tick[i]) /
length(surv.time[group %in% curves$group[i]])
}
#Return the results, ordered by group and tick - easier for humans to read.
return(curves[order(curves$group, curves$tick), ])
}
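The core of this calculation can be checked on a small made-up data set. The sketch below is not the function above: it builds the tick/group skeleton with expand.grid() and assumes no censoring, and the survival times and groups are invented.

```r
# Hypothetical survival times (in ticks) for two groups, no censoring
surv.time <- c(1, 3, 3, 2, 4, 4)
group <- c("a", "a", "a", "b", "b", "b")

# One row per tick/group combination
curves <- expand.grid(tick = 0:max(surv.time), group = unique(group),
                      stringsAsFactors = FALSE)

# Proportion of each group still alive after each tick
curves$surv.prop <- mapply(function(tk, g) mean(surv.time[group == g] > tk),
                           curves$tick, curves$group)
```

Every tick appears for every group, so a line drawn through these rows stays flat between events and a step plot drops at the right tick.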
