Making a data frame of only outliers of a large data set - r

Instead of trying to remove outliers from a data set, I am trying to create a new data frame consisting only of the rows that have outliers in them.
I was able to column-bind the averages and standard deviations of the different groups onto the end of the data set. Now, I have tried this code to produce a table of outlier data:
Outliers <- Sample[((Sample$x - Sample$Averages)/Sample$StDevs) > 2.00,]
This process runs, but produces an empty table for Outliers. I tested some individual values from the data to make sure outliers existed, and they do. If I specify a row, the above calculation indeed produces a Boolean argument. It is when I try to collect these outliers in a table that I have problems. I also tried initializing Outliers as a data.frame or data.table, but was unsuccessful here as well (probably just because I am new to R).
ex:
When I run
((Sample$x[3] - Sample$Averages[3])/Sample$StDevs[3]) > 2
it returns TRUE. This is good. Why, then, do I get an empty table of outliers when I simply want to KEEP everything in Sample where this condition is true? I do not feel that this should be a difficult problem, but I cannot for the life of me get it to work.
Any suggestions? Thanks in advance!

Sample[ 0, ] should get you an empty dataframe with no rows and the same column names.
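For comparison, here is a minimal sketch of the subsetting the question describes, run on made-up data. The column names x, Averages and StDevs come from the question; the group column and all values are assumptions for illustration. It does not diagnose why the original call returned no rows. It only shows the pattern working, with which() used so that any NA comparisons are dropped rather than returned as NA rows.

```r
# Hypothetical data: two groups of ten values; group "a" contains one extreme value.
Sample <- data.frame(
  group = rep(c("a", "b"), each = 10),
  x     = c(10, 11, 9, 10, 11, 10, 9, 11, 10, 30,
            5, 6, 5, 4, 6, 5, 4, 6, 5, 5)
)

# Per-group mean and standard deviation, repeated on every row of the group.
Sample$Averages <- ave(Sample$x, Sample$group, FUN = mean)
Sample$StDevs   <- ave(Sample$x, Sample$group, FUN = sd)

# z-score of each row relative to its own group.
z <- (Sample$x - Sample$Averages) / Sample$StDevs

# Keep only rows more than 2 standard deviations above the group mean.
# Use abs(z) > 2 instead to catch low outliers as well.
Outliers <- Sample[which(z > 2), ]
print(Outliers)
```

If the real result is still empty, checking summary(z) and class(Sample$x) is a reasonable first step, since a non-numeric column or a mismatch in units would change the comparison.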

Related

R - Can I have a matrix with different number of columns for rows?

This might be a stupid question. I have some 'NA' values in a matrix that I need to pass to a JAGS model, and I want to remove those NA values. Can I remove only the NA values but keep the rest of the data?
My data looked like the picture below. Can I have rows with different column numbers?
You cannot.
You need to impute these missing values or remove either the column or the row entirely.
Imputing missing values is as complicated as you want it to be. You'd be best off looking into the first few Google results on the topic or just using the mean value of the column.
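A minimal sketch of the two options above, on a made-up matrix (the values and the names m, m_complete and m_imputed are assumptions for illustration): dropping every row that contains an NA, and replacing each NA with the mean of its column.

```r
# Hypothetical 3 x 2 matrix with one NA in each column.
m <- matrix(c(1, NA, 3,
              4, 5, NA), nrow = 3)

# Option 1: drop every row that contains at least one NA.
m_complete <- m[complete.cases(m), , drop = FALSE]

# Option 2: replace each NA with the mean of its column.
m_imputed <- m
col_means <- colMeans(m, na.rm = TRUE)
na_pos    <- which(is.na(m), arr.ind = TRUE)   # row/col positions of the NAs
m_imputed[na_pos] <- col_means[na_pos[, "col"]]

print(m_complete)
print(m_imputed)
```

Mean imputation is the simplest choice, not necessarily a good one for a given model; it shrinks the variance of the imputed column.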

R empty data frame after subsetting by factor

I need to subset my data depending on the content of one factor variable.
I tried to do it with subset:
new <- subset(data, original$Group1=="SALAD")
data is already a subset from a bigger data frame, in original I have the factor variable which should identify the wanted rows.
This works perfectly for one level of the factor variable, but (and I really don't understand why!!) when I do it with the other factor level "BREAD" it creates the data frame but says "no data available", so it is empty. I've imported the data from SPSS, if this matters. I've already checked the factor levels, and the naming should be right!
I would be really grateful for help. I spent 3 hours on this problem and wasn't able to find a solution.
I've also tried other ways to subset my data (e.g. split), but I want a data frame as output.
Do you have general advice on the best way to subset a data frame if I want, for example, 3 columns of the data frame, extracted depending on the level of a factor? (Most code examples are only for one or all columns.)
The entire point of the subset function (as I understand it) is to look inside the data frame for the right variable - so you can type
subset(data, var1 == "value")
instead of
data[data$var1 == "value", ]
Anyone, please correct me if that is incorrect.
Now, in your case, you are explicitly taking Group1 from the data frame original and using that to subset data, which you say is a subset of original. Based on this, I see no reason to believe (and every reason not to believe) that the elements of original$Group1 will align with the rows of data. If Group1 is defined within data, why not just use the copy defined there, which is aligned correctly? If not, you need to be very explicit about what you are trying to accomplish, so that you can ensure that things are aligned correctly.
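A minimal sketch of the point above, assuming Group1 is a column of data itself. The data frames and the columns a, b and c are made up for illustration. The select argument of subset() also covers the question about extracting only a few columns.

```r
# Hypothetical data: 'data' is a subset of 'original', so its rows
# no longer line up one-to-one with original$Group1.
original <- data.frame(
  Group1 = factor(c("SALAD", "BREAD", "SALAD", "BREAD")),
  a = 1:4, b = 5:8, c = 9:12
)
data <- original[original$a > 1, ]   # 3 of the 4 rows

# Refer to the column inside 'data', not to original$Group1,
# and pick the wanted columns with 'select'.
bread <- subset(data, Group1 == "BREAD", select = c(a, b, c))
print(bread)

# Equivalent bracket form.
bread2 <- data[data$Group1 == "BREAD", c("a", "b", "c")]
```

With original$Group1 in the condition, the logical vector has 4 elements for a 3-row data frame, so the rows selected are not the ones intended.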

function to remove all observations that contain a "prohibited" value - R

I have a large dataset looking like:
There are overall 43 different values for PID. I have identified PIDs that need to be removed and summarized them in a vector:
I want to remove all observations (rows) from my data set that contain one of the PIDs from the vector NullNK. I have tried writing a function for it, but I get an error (I have never written functions before):
for (i in length(NullNK)){
SR_DynUeber_einfam <- SR_DynUeber_einfam [-which(SR_DynUeber_einfam$PID == NullNK(i)),]
}
How can I efficiently remove the observations from my original data set that contain PIDs from the NullNK vector?
What is wrong with my function?
Thanks!
For basic operations like this, for loops are often not needed. This does what you are looking for:
SR_DynUeber_einfam[!SR_DynUeber_einfam$PID %in% NullNK,]
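One mistake in your function is NullNK(i): parentheses call a function, so you should index the vector with NullNK[i] instead. Note also that for (i in length(NullNK)) runs only once, with i set to the length of the vector; seq_along(NullNK) iterates over every element.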
Hope this helps!
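A minimal sketch of both versions on made-up data (the data frame df and its values are assumptions; only the names PID and NullNK come from the question). The vectorised %in% form and a corrected loop give the same rows.

```r
# Hypothetical data standing in for SR_DynUeber_einfam.
df <- data.frame(PID = c(1, 2, 3, 2, 4), value = c(10, 20, 30, 40, 50))
NullNK <- c(2, 4)   # PIDs to remove

# Vectorised: keep rows whose PID is not in NullNK.
kept <- df[!df$PID %in% NullNK, ]

# Corrected loop, for comparison: seq_along() visits every element,
# and NullNK[i] indexes the vector.
kept_loop <- df
for (i in seq_along(NullNK)) {
  kept_loop <- kept_loop[kept_loop$PID != NullNK[i], ]
}

print(kept)
```

The loop here uses a logical condition rather than -which(...), because negative indexing with an empty which() result would drop every row when a PID has no match.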

R rows unanalysed

So I'm trying to format my xls data in a way that the first row will be seen in R, but it won't be analysed as in this example: http://bowtie-bio.sourceforge.net/recount/ExpressionSets/bodymap_eset.RData
When you open the exprs(bm) expression data in this file, the first row gives you the gene names, but these aren't being log transformed, for example.
I formatted my own data into a similar table, but cannot figure out how to omit the first table from showing up in R and more importantly being used in calculations, which of course results in error codes all the way.
Hope that makes sense?
Cheers

Cluster PAM in R - How to ignore a Column/variable but still keep it

I would like to use the Cluster PAM algorithm in R to cluster a dataset of around 6000 rows.
I want the PAM algorithm to ignore a column called "ID" (not use it in the clustering), but I do not want to delete that column. I want to use that column later on to combine my clustered data with the original dataset.
Basically, what I want is to add a cluster column to the original dataset.
I want to use PAM as a data compression/variable reduction method. I have 220 variables, and I would like to cluster some of the variables and reduce the dimensionality of my dataset so I can apply a classification algorithm (most likely a tree) to classify a problem that I am trying to solve.
If anyone knows a way around this or a better approach, please let me know.
Thank you
import data
library(cluster)  # provides pam()
data <- read.table("sampleiris.txt")
execution
result <- pam(data[2:4], 3, diss = FALSE, metric = "euclidean")
Here the subset [2:4] assumes that the ID is the first column. The code below should fetch the cluster values from PAM; you can then add them as a column to your data.
result$silinfo[[1]][1:nrow(result$silinfo[[1]])]
There is a small problem in the above code.
You should not use the silhouette information because it re-orders the rows as a preparation for the plot.
If you want to extract the cluster assignment while preserving the original dataset order and adding just a column of cluster assignment you should use $cluster. I tried it and it works like a charm.
This is the code:
library(cluster)  # provides pam()
data <- swiss[4:6]
result <- pam(data, 3)
summary(result)
export <- result$cluster
swiss[, "Clus"] <- export
View(export)
View(swiss)
Cheers
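A minimal sketch of the same idea with an explicit ID column, as asked in the question. The data frame df is built from the built-in iris data purely for illustration, and the choice of k = 3 is an assumption. The ID column is left out of the clustering input by name and stays in the data frame. Note that the component is named clustering in full; result$cluster in the code above works because $ matches list names partially.

```r
library(cluster)  # ships with standard R installations; provides pam()

# Hypothetical data: an ID column followed by four numeric variables.
df <- data.frame(ID = seq_len(nrow(iris)), iris[, 1:4])

# Cluster on every column except ID; df itself is not modified here.
fit <- pam(df[, names(df) != "ID"], k = 3)

# The assignments are in the original row order, so they can be
# attached directly as a new column next to ID.
df$cluster <- fit$clustering

head(df)
```

Because ID is still present, df can later be joined back to other tables with merge(..., by = "ID").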
