I have a dataset with 50 variables (columns), and 30 of them have missing values in more than half of their observations.
I want to subset the dataset so that those 30 variables with too many missing values are dropped. I could do it one by one, but I was wondering if there is a quicker way to do it in R.
Logic: first iterate through each column with sapply and check which columns have fewer than half of their values missing. The output of the first line is a logical vector, which is then used to subset the data.
ind <- sapply(colnames(df), function(x) sum(is.na(df[[x]])) < nrow(df)/2)
df <- df[colnames(df)[ind]]
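The same filter can be written more compactly: colMeans(is.na(df)) gives the fraction of missing values per column directly. A minimal sketch on made-up data:

```r
# Toy data: columns a and c are mostly NA
df <- data.frame(a = c(1, NA, NA, NA), b = 1:4,
                 c = c(NA, NA, NA, 4), d = letters[1:4])
# Keep only columns where fewer than half the values are missing
df_clean <- df[, colMeans(is.na(df)) < 0.5, drop = FALSE]
names(df_clean)  # "b" "d"
```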
In R, I want to select all the variables from a dataset where the most frequent value of each column accounts for less than 40% of that column.
I am applying sapply, but not getting the correct output.
Note: all the column values are numeric.
train = train[, sapply(train, function(col) length(unique(col))) < 0.4*nrow(train)]
Please suggest how to proceed.
By playing around with a toy dataset, I found this code, which works:
train[, sapply(train, function(x) {(sort(table(x), decreasing = TRUE)/nrow(train))[[1]] < 0.4})]
Basically, I create the table of relative frequencies (sorted in decreasing order) for each numeric column in train, and then I check whether the most frequent value of each column occurs less than 40% of the time. If yes, the column is selected; otherwise it is discarded.
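A quick sanity check of that filter on a tiny made-up dataset (the column names x1 and x2 are invented for the example):

```r
# x1: a single value fills 100% of the rows -> should be dropped
# x2: the most frequent value (2) fills only 30% of the rows -> should be kept
train <- data.frame(x1 = rep(1, 10),
                    x2 = c(rep(2, 3), 4:10))
train_kept <- train[, sapply(train, function(x) {
  (sort(table(x), decreasing = TRUE) / nrow(train))[[1]] < 0.4
}), drop = FALSE]
names(train_kept)  # "x2"
```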
I have two data frames, one containing the predictors and one containing the different categories I want to predict. Both data frames contain a column named geoid. Some of the rows of my predictors contain NA values, and I need to remove these.
After extracting the geoid values of the rows containing NA values and removing those rows from the predictors data frame, I need to remove the corresponding rows from the categories data frame as well.
It seems like a rather basic operation, but the code won't work.
categories <- as.data.frame(read.csv("files/cat_df.csv"))
predictors <- as.data.frame(read.csv("files/radius_100.csv"))
NA_rows <- predictors[!complete.cases(predictors),]
geoids <- NA_rows['geoid']
clean_categories <- categories[!(categories$geoid %in% geoids),]
None of the rows in categories/clean_categories are removed.
A typical geoid value is US06140231. typeof(categories$geoid) returns integer.
I can't say this is it, but a very basic slip won't be doing what you want: geoids is a one-column data frame, not a vector, so %in% never finds a match. Extract the column first:
clean_categories <- categories[!(categories$geoid %in% geoids$geoid),]
Almost certainly this is what you meant to happen in that line: the %in% comparison should run against a vector of geoid values. You don't include a reproducible example, so I can't say whether the whole thing will do as you want.
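A minimal reproducible sketch of the fix; the real files aren't available, so the geoid values and data here are invented:

```r
categories <- data.frame(geoid = c("US01", "US02", "US03"),
                         cat   = c("a", "b", "c"),
                         stringsAsFactors = FALSE)
predictors <- data.frame(geoid = c("US01", "US02", "US03"),
                         x     = c(1, NA, 3),
                         stringsAsFactors = FALSE)
NA_rows <- predictors[!complete.cases(predictors), ]
geoids  <- NA_rows$geoid   # $ extracts a vector; NA_rows['geoid'] keeps a data frame
clean_categories <- categories[!(categories$geoid %in% geoids), ]
clean_categories$geoid  # "US01" "US03"
```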
No good example here, since the datasets I am working with are huge.
But if I have a 200- or 300-column dataset, I want some sort of rule to quickly classify and convert some of these columns to factors. Is there some quick R code to do it?
The reason is that I don't have time to go column by column to completely understand or interpret the data, but if I see there are just 4 unique values out of 5,000 rows, I assume that the column is categorical.
Does anyone have any quick code snippets or ways to go about doing this?
Assuming that df refers to your dataframe:
## Find all columns with fewer than 5 unique values
## (sapply works column by column on the data frame, avoiding the matrix coercion that apply(df, 2, ...) performs)
cols <- sapply(df, function(x) length(unique(x))) < 5
## Convert columns with fewer than 5 unique values to factors
df[cols] <- lapply(df[cols], factor)
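A toy run of the snippet above (the 5-unique-values threshold is as in the answer; the column names are invented):

```r
set.seed(7)
df <- data.frame(grp = sample(c("a", "b", "c", "d"), 50, replace = TRUE),
                 val = rnorm(50),
                 stringsAsFactors = FALSE)
cols <- sapply(df, function(x) length(unique(x))) < 5
df[cols] <- lapply(df[cols], factor)
sapply(df, class)  # grp becomes "factor", val stays "numeric"
```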
I have a matrix called LungData with gene names and patient samples. There are ~26,000 genes and for each gene there are 41 samples. The gene names are in the first column, and the samples in the subsequent columns.
> dim(LungData)
[1] 26002 42
I have a subset of ~2,000 genes that I'm interested in. This subset is a list called GeneSubset.
> dim(GeneSubset)
[1] 1999 1
How can I get the 2000x42 sub-matrix which only contains the genes from GeneSubset? I'm not interested in the other genes, and dealing with a smaller sub-matrix will make the computations go a lot faster.
We can use either %in% or match. If the first column of 'LungData' holds the gene names and the dataset is a matrix, %in% gives a logical vector of TRUE/FALSE by comparing that column with 'GeneSubset', and this vector can be used to filter the rows of 'LungData'.
LungData[LungData[,1] %in% GeneSubset[,1],]
Store the subset of desired genes in a character vector instead of a data frame.
GeneSubset <- GeneSubset[, 1]  # as.vector() on a data frame would still return a list
Also,
rownames(LungData) <- LungData[,1] # assign the gene names as row names of the matrix
LungData <- LungData[,-1]          # remove the 1st column, since the row names now hold the gene names
ReqdData <- LungData[GeneSubset,]  # subset the data by row name
You might also want to use the subset function in base R or the code given by @akrun.
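For illustration, a tiny sketch of the %in% approach with made-up gene names (the real matrix has ~26,000 rows and 41 sample columns):

```r
# A character matrix, as in the question: gene names in column 1, samples after
LungData <- cbind(gene = c("TP53", "EGFR", "KRAS"),
                  s1 = c(1, 2, 3),
                  s2 = c(4, 5, 6))
GeneSubset <- data.frame(gene = c("TP53", "KRAS"), stringsAsFactors = FALSE)
# Keep only the rows whose gene name appears in GeneSubset
sub <- LungData[LungData[, 1] %in% GeneSubset[, 1], ]
sub[, "gene"]  # "TP53" "KRAS"
```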
I'm trying to update a bunch of columns by adding the column's standard deviation (SD) to, and subtracting it from, each value of the column.
Below is the reproducible code that I came up with, but I feel this is not the most efficient way to do it. Could someone suggest a better way?
Essentially, there are 20 rows and 9 columns. I just need two separate data frames: one where each column's values are adjusted by adding that column's SD, and the other by subtracting it.
##Example
##data frame containing 9 columns and 20 rows
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
##Standard deviation calculated for each column and stored in an object - I don't know what this object is (a vector, list, or data frame?)
Hi_SD<-apply(Hi,2,sd)
#data frame converted to matrix to allow addition of SD to each value
Hi_Matrix<-as.matrix(Hi,rownames.force=FALSE)
#a new object created that will store values(original+1SD) for each variable
Hi_SDValues<-NULL
#variables re-created - each contains the sum of a column of the matrix and the matching element of the SD vector. I have only done this for 2 columns for the purposes of this example; however, all columns would need to be recreated
Hi_SDValues$X1<-Hi_Matrix[,1]+Hi_SD[1]
Hi_SDValues$X2<-Hi_Matrix[,2]+Hi_SD[2]
#convert the object back to a dataframe
Hi_SDValues<-as.data.frame(Hi_SDValues)
##Repeat for one SD less
Hi_SDValues_Less<-NULL
Hi_SDValues_Less$X1<-Hi_Matrix[,1]-Hi_SD[1]
Hi_SDValues_Less$X2<-Hi_Matrix[,2]-Hi_SD[2]
Hi_SDValues_Less<-as.data.frame(Hi_SDValues_Less)
This is a job for sweep (type ?sweep in R for the documentation). Note that sweep's default FUN is "-", so the bare call subtracts; pass "+" to add.
Hi <- data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD <- apply(Hi,2,sd)
Hi_SD_subtracted <- sweep(Hi, 2, Hi_SD)      # default FUN is "-"
Hi_SD_added <- sweep(Hi, 2, Hi_SD, "+")
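A quick check, on toy data, that sweep matches the column-wise arithmetic done by hand:

```r
set.seed(1)
Hi    <- data.frame(replicate(9, sample(0:20, 20, replace = TRUE)))
Hi_SD <- apply(Hi, 2, sd)
Hi_minus <- sweep(Hi, 2, Hi_SD)        # default FUN is "-"
Hi_plus  <- sweep(Hi, 2, Hi_SD, "+")
# Any column of the result equals the original column shifted by its own SD
all.equal(Hi_minus[[3]], Hi[[3]] - Hi_SD[[3]])  # TRUE
```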
You don't need to convert the dataframe to a matrix in order to add the SD
Hi<-data.frame(replicate(9,sample(0:20,20,rep=TRUE)))
Hi_SD<-apply(Hi,2,sd) # Hi_SD is a named numeric vector
Hi_SDValues<-Hi # Creating a new dataframe that we will add the SDs to
# Loop through all columns (there are many ways to do this)
for (i in 1:9){
  Hi_SDValues[,i] <- Hi_SDValues[,i] + Hi_SD[i]
}
# Do pretty much the same thing for the next dataframe
Hi_SDValues_Less <- Hi
# Note: this loop must update Hi_SDValues_Less, not Hi_SDValues
for (i in 1:9){
  Hi_SDValues_Less[,i] <- Hi_SDValues_Less[,i] - Hi_SD[i]
}
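As an alternative to the explicit loop, Map pairs each column of the data frame with the matching element of the SD vector in one pass; a sketch under the same setup:

```r
Hi    <- data.frame(replicate(9, sample(0:20, 20, replace = TRUE)))
Hi_SD <- apply(Hi, 2, sd)
# Map(`+`, Hi, Hi_SD) walks the 9 columns and the 9 SDs in parallel
Hi_SDValues      <- as.data.frame(Map(`+`, Hi, Hi_SD))
Hi_SDValues_Less <- as.data.frame(Map(`-`, Hi, Hi_SD))
```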