Delete missing data points (NAs) from multiple vectors - r

So I am working with biological data at a hospital (I won't disclose anything here, but I won't need to in order to ask this question). We are looking at concentrations of antibodies taken over a certain amount of time. There are, for one reason or another, missing data points all over our data set. What I am trying to do is remove the missing data points along with their corresponding times. Right now the basic goal is just to get some basic graphs and charts up and running, but eventually we're going to want to create some logistic models and nonlinear dynamics models, which we'll do in another language.
1) First I put my data into a vector along with its corresponding time:
data <- read.csv("blablabla.csv", header = TRUE)
Biomarker <- data[,2]
time <- data[,1]
2) Then I sort the data:
Biomarker <- Biomarker[order(time)]
time <- sort(time, decreasing = F)
3) Then I put the indexes of the NA values into a vector:
NA_Index <- which(is.na(Biomarker))
4) Then I try to remove the data points at those indexes from both the biomarker and time vectors:
i <- 1
n <- length(NA_Index)
for(i:n){
  Biomarker[[NA_Index[i]]] <- NULL
  time[[NA_Index[i]]] <- NULL
}
Also, I have tried a few things other than the one above:
1)
Biomarker <- Biomarker[-NA_Index[i]]
2)
Biomarker <- Biomarker[!= "NA"]
My question is: "How do I remove NA values from my vectors, and remove the time values at the same indexes?"
So obviously I am very new to R and might be going about this in a completely wrong way. I just ask that you explain what all the functions do if you post some code. Thanks for the help.

First I'd recommend storing your data in a data.frame instead of two vectors; since the entries in the vectors correspond to cases, this is a more appropriate data structure.
my_table <- data.frame(time=time, Biomarker=Biomarker)
Then you can simply subset the whole data.frame. The first dimension indexes rows and the second indexes columns, as usual; leave the second dimension empty to keep all columns.
my_table <- my_table[!is.na(my_table$Biomarker), ]

For illustration, here is how is.na() behaves, step by step, on a small example vector:
> BioMarker
[1] 1 2 NA 3 NA 5
> is.na(BioMarker)
[1] FALSE FALSE TRUE FALSE TRUE FALSE
> BioMarker[is.na(BioMarker)]
[1] NA NA
> BioMarker[! is.na(BioMarker)]
[1] 1 2 3 5
> BioMarker <- BioMarker[! is.na(BioMarker)]
> BioMarker
[1] 1 2 3 5
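Putting the pieces from the first answer together, a minimal end-to-end sketch (the file name is the placeholder from the question; complete.cases() drops rows with an NA in either column, a slight generalization of the !is.na() subset above):
data <- read.csv("blablabla.csv", header = TRUE)      # read the raw data
my_table <- data.frame(time = data[, 1], Biomarker = data[, 2])
my_table <- my_table[order(my_table$time), ]          # sort rows by time
my_table <- my_table[complete.cases(my_table), ]      # drop rows with any NA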

Related

Outlier detection for grouped (clustered) data

I want to find and eliminate outliers in my data (named "df"):
> head(df)
cluster machine.code age Good.Times repair.price
1 1 13010132 23 58.54 198170000
2 1 13010129 23 105.25 390847500
3 1 13010131 23 20.50 20701747
4 1 13010072 18 14.30 22340000
5 1 13010101 18 57.63 13220000
6 1 13010106 27 49.96 254450000
where my data has 65 clusters, and I want to run the outlier detection within each cluster separately.
I had used the code below for outlier detection on one cluster before, and it was fine:
library("ggstatsplot")
df<- read.csv("C:/Users/gadmin/Desktop/dataE.csv",header = TRUE)
ggbetweenstats(df, cluster, repair.price, outlier.tagging = TRUE)
Q <- quantile(df$repair.price, probs=c(.25, .75), na.rm = FALSE)
iqr <- IQR(df$repair.price)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
eliminated<- subset(df, df$repair.price > (Q[1] - 1.5*iqr) & df$repair.price < (Q[2]+1.5*iqr))
ggbetweenstats(eliminated, cluster, repair.price, outlier.tagging = TRUE)
now I want to do the same thing for all 65 clusters using a "for" loop, something like this:
for(i in 1:length(unique(df$cluster))) {
...
}
but I don't know how. (I mean the part after detecting the outliers in the first cluster: how should they be removed (subset out), and how do I continue the process with the next cluster?)
Core question
There are various ways to detect outliers. As for the core of your question, I understand it as "How do I subset the data so I can apply a for-loop to remove the outliers for each cluster?"
# maybe insert a column id that assigns an id (identical to the row number) to identify individual entries
df$id <- seq(1, nrow(df))
# make a list to store the outlier ids for each cluster
outlrs <- list()
# loop through the clusters
for(clust in unique(df$cluster)){
  subset <- df[df$cluster == clust, ]
  outlrs[[clust]] <- [INSERT YOUR OUTLIER DETECTION FUNCTION HERE*]
}
# remove the outliers (first combine the per-cluster id vectors into one vector)
outlier_ids <- unlist(outlrs)
df <- df[!df$id %in% outlier_ids, ]
* The outlier detection function you use should ultimately output the ids of the rows containing the outliers. This part has to be adapted to your method of outlier identification.
I didn't test it since I have insufficient data. You could use e.g. dput(df) to output a version of your data you can copy and paste to make it accessible to people who want to test their proposed solutions.
Edit: one (of many) alternative approaches
Alternatively, you could apply the functions you included in your question to a subset of the data within the loop, store the cleaned-up output in a list, and subsequently apply do.call(rbind.data.frame, your_list) to the list; see the sketch below.
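A minimal sketch of this alternative, reusing the IQR rule from your question (df, cluster and repair.price are from the question; kept, sub and df_clean are new names introduced here for illustration):
kept <- list()                                   # cleaned-up subset per cluster
for (clust in unique(df$cluster)) {
  sub <- df[df$cluster == clust, ]
  Q   <- quantile(sub$repair.price, probs = c(.25, .75), na.rm = TRUE)
  iqr <- IQR(sub$repair.price, na.rm = TRUE)
  # keep only rows inside the 1.5 * IQR fences for this cluster
  kept[[as.character(clust)]] <- subset(sub,
    repair.price > (Q[1] - 1.5 * iqr) & repair.price < (Q[2] + 1.5 * iqr))
}
df_clean <- do.call(rbind.data.frame, kept)      # recombine the clusters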
Note
As Phil pointed out, it is questionable whether outliers should be removed at all, especially when you're just applying a loop that "takes care of them". While we can provide the means by which "outliers" can be removed programmatically, whether you should actually remove them in a given situation is another question (probably better suited to CrossValidated). It should also be noted that there are many algorithms for determining which values differ "significantly" from the bulk of the values, and the border between "significant" and "not significant" is arbitrary.

Build loop to use increasing part of dataframe in R as input to function

I'm using the first principal component from a PCA analysis as an explanatory variable in a forecasting model that forecasts recursively using Kalman filtering. In other words, at each point in time the model updates and produces a new forecast based on the new observation included in the model. Since PCA uses data from all observations included in the model for its calculations, I also need to run the PCAs recursively, using only the observations prior to the point in time that I am forecasting (otherwise, the PCA result could reveal information about the future and help the model produce a more accurate answer than it would have otherwise). I think a loop might be the solution, but I am struggling with how to formulate the code.
As a more specific example, consider if I have the following data.frame
data <- as.data.frame(rbind(c(6,15,23),c(9,11,22), c(7,13,23), c(6,12,25),c(7,13,23)))
names(data) <- c("V1","V2","V3")
> data
V1 V2 V3
1 6 15 23
2 9 11 22
3 7 13 23
4 6 12 25
5 7 13 23
At each observation date, I wish to run a PCA (function prcomp() from the stats package) for all observations up to, and including, that observation. So I first want to run PCA for the first two observations:
pca2 <- prcomp(data[1:2,], scale = TRUE)
next I want to run PCA with the first, second and third observations as input:
pca3 <- prcomp(data[1:3,], scale = TRUE)
next I want to run PCA with the first, second, third and fourth observations as input:
pca4 <- prcomp(data[1:4,], scale = TRUE)
and so on, until the last run of the PCA, which includes all observations in the dataframe. For each of these "runs" of the PCA, I wish to extract the last value (though for pca2, I use both the first and second values) of the first principal component (PC1), and merge these into a final dataframe, where each monthly observation is the last value of the first principal component from each of the runs.
The principal component outputs are:
> my_pca2 <- as.data.frame(pca2$x)
> my_pca2
PC1 PC2
1 -1.224745 -5.551115e-17
2 1.224745 5.551115e-17
> my_pca3 <- as.data.frame(pca3$x)
> my_pca3
PC1 PC2 PC3
1 -1.4172321 -0.2944338 6.106227e-16
2 1.8732448 -0.1215046 3.330669e-16
3 -0.4560127 0.4159384 4.163336e-16
> my_pca4 <- as.data.frame(pca4$x)
> my_pca4
PC1 PC2 PC3
1 -1.03030993 -1.10154914 0.015457199
2 2.00769890 0.07649216 0.011670433
3 0.03301806 -0.24226508 -0.033461874
4 -1.01040702 1.26732205 0.006334242
So I want my final output to be a dataframe to look like
>final.output
PC1
1 -1.224745
2 1.224745
3 -0.4560127
4 -1.01040702
Comment: yes, it looks a bit weird with the first two values, but please don't pay too much attention to that. My point is that I wish to build a dataframe that consists of the last calculated value of the first principal component for each of the PCA runs.
I am thinking that a for loop might be the best solution here, but I have not been successful in finding any threads that might guide me closer to a coding solution. How can I make the loop use an increasing amount of the dataframe in its calculations? Does anyone have any suggestions/tips/links? Any help on this is much appreciated!
I had a very similar approach.
PCA <- vector("list", length = nrow(data) - 1)   # one slot per PCA run
for (i in 1:(nrow(data) - 1)) {
  # the first run keeps its first two scores; every later run keeps only its last
  if (i == 1) j <- 1:2 else j <- i + 1
  PCA[[i]] <- as.data.frame(prcomp(data[1:(1 + i), ], scale = TRUE)$x)[j, 1]
}
unlist(PCA)
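With the example data from the question, this should print the four values shown in the desired final.output (-1.224745, 1.224745, -0.4560127, -1.01040702); note that the sign of a principal component is arbitrary, so your signs may come out flipped.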
You can use a for loop. It's maybe not the most efficient solution, but it will work.
First, you create an empty list to store your results:
all_results <- list()
Next, you iterate from 2 to the number of rows of data with a loop. For each iteration of the loop, run prcomp on data[1:i,]. You can directly create your PCA data frame and extract PC1 from it as a vector, then store it in the list at index i - 1:
for (i in 2:nrow(data)) {
  all_results[[i - 1]] <- as.data.frame(prcomp(data[1:i, ], scale = TRUE)$x)$PC1
}
Now to extract all the results, you use lapply (list apply) to extract only the last element from each PC1 vector:
PC1 <- lapply(all_results, function(pca) pca[length(pca)] )
Now you convert these from a list of single elements to a vector:
PC1 <- do.call("c", PC1)
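(unlist(PC1) would accomplish the same thing here.)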
Finally, you want to stick the first value of the first analysis back on to the front of this vector:
PC1 <- c(all_results[[1]][1], PC1)
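If you want the combined vector as a one-column data frame like the final.output shown in the question, one last step:
final.output <- data.frame(PC1 = PC1)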

Looping a Student T-Test and Chi-Squared with Missing Data in R

I am trying to use R to run a Student's t-test and a chi-squared test with large data sets. Since I am fairly new to R, my inexperience has been preventing much success with my own code.
Both data sets have missing data and look something like this:
AA           assayX activity   assayY1 activity   assayY2 activity
chemical 1   TRUE              0                  12.2
chemical 2   TRUE              0
chemical 3                     45.2               35.6
chemical 4   FALSE             0                  0

AB           assayX activity   assayY1 activity   assayY2 activity
chemical 1   TRUE              FALSE              TRUE
chemical 2   TRUE              FALSE
chemical 3                     TRUE               TRUE
chemical 4   FALSE             FALSE              FALSE
Since it is a large data set, I am trying to create code where I can compare assayX to all the assayYs. I'm hoping to create a Student's t-test loop for the first data set, and a chi-squared loop for the second data set. I had previously been successful creating a loop for a correlation analysis, so I based my code on that idea.
x<- na.omit(mydata1[, c(assayX)])
y<- na.omit(mydata1[, c(assayY1:assayYend)])
lapply(y, function(x)t.test(y~x))
x<-na.omit(mydata2[, c(assayX)])
y<- na.omit(mydata2[, c(assayY1:assayYend)]
lapply(y, x=x, chisq.test)
Problem with the first code is:
Invalid variable y
Problem with the second code is:
x and y must have the same length
I've made small tweaks here and there and have just gotten different types of errors, like "not enough 'y' observations" and so on. I've been primarily using this site to figure out how to work with R, so I'm hoping you guys will have a clever little solution for a new guy.
After a long time and after gaining experience in R, I can answer my own question. First, read the data file so that blanks are changed to NA:
df1 <- read.csv("data2.csv", header=T, na.strings=c("","NA"))
Then for the Student's t-test:
df1.p <- rep(NA, length(df1$Assays))
for (i in 1:length(df1$Assays)) {
  test <- t.test(df1[, c(i)] ~ df1$assay.activity)
  current.p.val <- test$p.value
  df1.p[i] <- current.p.val
}
Then to add a Pearson's correlation or chi-squared test (not actually appropriate for this dataset, but just as an example):
df1.p.2 <- rep(NA, length(df1$Assays))
df1.r.2 <- rep(NA, length(df1$Assays))
for (i in 1:length(df1$Assays)) {
  test2 <- cor.test(df1$assay.activity, df1[, c(i)], method = 'pearson')
  current.p.val2 <- test2$p.value
  current.rval <- test2$estimate
  df1.p.2[i] <- current.p.val2
  df1.r.2[i] <- current.rval
}
df2 <- cbind(df1$Assays, df1.p, df1.p.2, df1.r.2)
I then filtered it to keep only assays significant at the 0.1 level, but that wasn't the question here. If you want to know how, just ask a question and I'll post an answer there :)
I don't think your data is being passed correctly to the test. t.test has arguments for whether the data is paired or not (the default is FALSE, i.e. unpaired) and for how to handle NAs, should you want to change from the default. You should probably use those rather than omitting NAs up front. An example with NAs in the data:
set.seed(1)
y <- runif(30, 0, 1)
y.NA <- c(3,24,27)
y[y.NA] <- NA
x <- runif(30, 0, 1)
x.NA <- c(1,3,8,12,21)
x[x.NA] <- NA
t.test(x,y)
For chisq.test you can use the table function.
chisq.test(table(x,y))$p.value
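To loop this over many assay columns, a minimal sketch; the data frame mydata1 and the column name assayX are placeholders standing in for your real names:
# assume mydata1 has a grouping column 'assayX' (TRUE/FALSE) followed by
# numeric assay columns; these names are placeholders, not your real ones
y_cols <- names(mydata1)[-1]
# t.test drops NA observations itself (na.action = na.omit by default),
# so there is no need to na.omit() the columns up front
p_vals <- sapply(y_cols, function(col) t.test(mydata1[[col]] ~ mydata1$assayX)$p.value)
p_vals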

R: collapse two time series to create vectors containing only the points where both exist

Using R, I wish to:
Take two time series:
a<-c(NA,1,2,3,NA,5)
b<-c(0,NA,6,7,NA,NA)
I would like to end up with
aa<-c(2,3), bb<-c(6,7)
or alternately
aa<-c(NA,NA,2,3,NA,NA)
The genesis of this question lies in a 'feature' of the ccf/acf functions in R. The mean of a series is calculated prior to checking for the existence of mutual data points. The default fails on NA values, but if na.action = na.pass, this can result in correlation coefficients greater than one.
Although my actual data are time series, I am not currently interested in time-lagged ACF; I am only interested in spatial cross-correlation between disparate data sets, so the loss of absolute temporal data inherent in this approach is not important. I wish to run the CCF with vectors from which the unusable data has already been 'knocked out'.
The actual data sets are ~ 10,000 points each x 20 sets
Thank you in advance for advice
You can use standard subsetting and is.na to find where both have non-NA elements.
a[!is.na(a)&!is.na(b)]
[1] 2 3
b[!is.na(a)&!is.na(b)]
[1] 6 7
The question has already been answered by @James. You can create the alternative version with the following commands. The trick: idx is TRUE where both series are non-NA and NA elsewhere (unary ! has lower precedence than *, so ! (a + b) * 0 parses as !((a + b) * 0)), and indexing a vector with NA returns NA:
idx <- ! (a + b) * 0
a[idx]
# [1] NA NA 2 3 NA NA
b[idx]
# [1] NA NA 6 7 NA NA
index = !is.na(a) & !is.na(b)
aa = a[index]
bb = b[index]
1) Combine them both into a data.frame, remove rows with any NAs in them and then extract back out what is left:
both <- data.frame(a, b)
both <- na.omit(both) ##
aa <- both$a
bb <- both$b
To get the representation with NAs in it replace the second line (marked with ##) with:
both[is.na(rowSums(both)), ] <- NA
2) Alternately consider a time series representation:
library(zoo)
z <- zoo(cbind(a, b))
z <- na.omit(z) ##
aa.z <- z$a
bb.z <- z$b
To get the representation with NAs in it, replace the second line (marked with ##) with:
z[is.na(rowSums(z)), ] <- NA
3) The representation with NAs in the output could also be done using "ts" class:
tt <- ts(cbind(a, b))
tt[is.na(rowSums(tt)), ] <- NA
aa.ts <- tt[, "a"]
bb.ts <- tt[, "b"]
Note: Depending on how you use them afterwards, you might not need to extract the individual series out at the end, i.e. you might not need the last two lines in each solution.

Exclude data based on the number of non NA observations for each value of key

I have a dataset consisting of monthly observations of returns of US companies. I am trying to exclude from my sample all companies which have fewer than a certain number of non-NA observations.
I managed to do what I want using foreach, but my dataset is very large and this takes a long time. Here is a working example which shows how I accomplished what I wanted and hopefully makes my goal clear:
#load required packages
library(data.table)
library(foreach)
#example data
myseries <- data.table(
  X = sample(letters[1:6], 30, replace = TRUE),
  Y = sample(c(NA, 1, 2, 3), 30, replace = TRUE))
setkey(myseries,"X") #so X is the company identifier
#here I create another data table with each company identifier and its number
#of non NA observations
nobsmyseries <- myseries[,list(NOBSnona = length(Y[complete.cases(Y)])),by=X]
# then I select the companies which have fewer than 3 non-NA observations
comps <- nobsmyseries[NOBSnona <3,]
#finally I exclude all companies which are in the list "comps",
#that is, I exclude companies which have fewer than 3 non-NA observations,
#but I do it for each of the companies in the list, one by one,
#and this is what makes it slow.
for (i in 1:dim(comps)[1]){
  myseries <- myseries[X != comps$X[i],]
}
How can I do this more efficiently? Is there a data.table way of getting the same result?
If you have more than one column you wish to consider for NA values, you can use complete.cases(.SD); however, as you want to test a single column, I would suggest something like
naCases <- myseries[, list(nonNA = sum(!is.na(Y))), by = X]
you can then join, keeping only the companies whose number of non-NA values reaches a threshold,
e.g.
threshold <- 3
myseries[naCases[nonNA >= threshold]]
you could also use a "not join" to get the cases you have excluded
myseries[!naCases[nonNA >= threshold]]
As noted in the comments, something like
myseries[, nonNA := sum(!is.na(Y)), by = X][nonNA >= 3]
would work; however, in this case you are performing a vector scan on the entire data.table, whereas the previous solution performs the vector scan on a data.table that has only length(unique(myseries[['X']])) rows.
Given that this is a single vector scan, it will be efficient regardless (and perhaps the binary join plus small vector scan may even be slower than the larger vector scan). However, I doubt there will be much difference either way.
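For completeness, an untested sketch of an equivalent one-liner using .SD: grouping by X returns each company's rows only when its count of non-NA Y values reaches the threshold.
threshold <- 3
myseries[, if (sum(!is.na(Y)) >= threshold) .SD, by = X]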
How about aggregating the number of non-NA values in Y over X, and then subsetting?
# Aggregate the number of non-NA values per company
num_obs <- as.data.table(aggregate(Y ~ X, data = myseries, FUN = function(x) sum(!is.na(x))))
# Subset: keep only companies with at least 3 non-NA observations
myseries[X %in% num_obs$X[num_obs$Y >= 3], ]
