Calculate the mode of all non-numeric columns in a dataframe - r

I would like to calculate the mode of each column in a dataframe. I have found similar posts on how to determine the mode of a vector of rows in a dataframe, but most of them deal with numeric data.
df <- data.frame(c("A","B","C","C"), c("A","A","B","C"),c("A","B","B","C"))
colnames(df) <- c("V1","V2","V3")
rownames(df) <- c(1,2,3,4)
df
I am using the following function:
modefunc <- function(x){
  tabresult <- tabulate(x)
  themode <- which(tabresult == max(tabresult))
  if(sum(tabresult == max(tabresult))>1) themode <- NA
  return(themode)
}
mode.vector <- apply(df, 1, modefunc)
Since my dataframe is not numeric, I unfortunately get the following error:
Error in tabulate(x) : 'bin' must be numeric or a factor
Any assistance with this would be helpful. Thanks in advance.
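One possible way around the error, sketched under the assumption that a tie should still yield NA as in the function above: table() counts character or factor values directly, and sapply() walks over the columns of a data frame, whereas apply(df, 1, ...) walks over its rows. The body of modefunc below is an illustrative rewrite, not code taken from the thread.

```r
# Example data from the question, kept as character columns
df <- data.frame(V1 = c("A", "B", "C", "C"),
                 V2 = c("A", "A", "B", "C"),
                 V3 = c("A", "B", "B", "C"),
                 stringsAsFactors = FALSE)

# Mode of a character (or factor) vector; returns NA when there is a tie
modefunc <- function(x) {
  counts <- table(x)                           # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]  # value(s) with the highest count
  if (length(top) > 1) NA_character_ else top
}

# sapply() iterates over the columns of a data frame
mode.vector <- sapply(df, modefunc)
mode.vector
```

Returning NA_character_ rather than a plain NA keeps the result a character vector even when some columns have tied modes.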

Related

undefined columns selected and cannot xtfrm data frame error

I am trying to write code that checks for outliers based on the IQR and changes the respective values to NA. So I wrote this:
dt <- rnorm(200)
dg <- rnorm(200)
dh <- rnorm(200)
l <- c(1,3) #List of relevant columns
df <- data.frame(dt,dg,dh)
To check if the column contains any outliers and change their value to NA:
vector.is.empty <- function(x) return(length(x) ==0)
#Checks for empty values in vector and returns booleans.
for (i in 1:length(l)){
  IDX <- l[i]
  BP <- boxplot.stats(df[IDX])
  OutIDX <- which(df[IDX] %in% BP$out)
  if (vector.is.empty(OutIDX)==FALSE){
    for (u in 1:length(OutIDX)){
      IDX2 <- OutIDX[u]
      df[IDX2,IDX] <- NA
    }
  }
}
When I run this code, I get the error messages quoted in the title ("undefined columns selected" and "cannot xtfrm data frame").
I've searched online for good answers, but I'm not sure why R claims that the column is unspecified. Any clues?
I would do something like this to replace the outliers:
# Set a seed (to make the example reproducible)
set.seed(31415)
# Generate the data.frame
df <- data.frame(dt = rnorm(100), dg = rnorm(100), dh = rnorm(100))
# A list to save the result of boxplot.stats()
l <- list()
for (i in seq_len(ncol(df))){
  l[[i]] <- boxplot.stats(df[,i])
  # %in% (rather than ==) compares every value against the whole set of outliers
  df[df[,i] %in% l[[i]]$out, i] <- NA
}
# Which values have been replaced?
lapply(l, function(x) x$out)
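As for why the loop in the question fails, a likely cause is the single-bracket indexing: df[IDX] returns a one-column data frame, while boxplot.stats() expects a numeric vector, and %in% treats a data frame as a list of whole columns rather than as a vector of values. The sketch below keeps the question's idea of processing only selected columns but extracts each one with [[ ]]. The names cols and n_out are illustrative, not from the thread.

```r
set.seed(1)
df <- data.frame(dt = rnorm(200), dg = rnorm(200), dh = rnorm(200))
cols <- c(1, 3)        # positions of the columns to check, as in the question

n_out <- integer(0)    # number of outliers found per processed column
for (idx in cols) {
  # [[ ]] extracts the column as a numeric vector; df[idx] would be a data frame
  out <- boxplot.stats(df[[idx]])$out
  n_out[names(df)[idx]] <- length(out)
  # replace every value flagged as an outlier with NA
  df[[idx]][df[[idx]] %in% out] <- NA
}
n_out
```

Column dg is not listed in cols, so it is left untouched.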

R aggregate function unexpected NA

When I use the aggregate function on a data.frame that contains both character and numeric columns, aggregate fails and returns only NAs for every column. How can I solve this? My first idea was to check the class of the values, but that did not work.
name <- rep(LETTERS[1:5],each=2)
feat <- paste0("Feat",name)
valuesA <- runif(10)*10
valuesB <- runif(10)*10
daf <- data.frame(ID=name,feature=feat,valueA=valuesA,valueB=valuesB, stringsAsFactors = FALSE)
aggregate(.~ID, data=daf,FUN=mean)
aggregate(.~ID, data=daf,FUN=function(x){
  if(is.character(x)){
    return(NA)
  }else{ return(mean(x))}
})
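A likely explanation, stated here as an assumption rather than something confirmed in the thread: with . on the left-hand side, the formula method combines all remaining columns with cbind(), so the character column feature turns the numeric columns into character as well, and mean() then returns NA for every column. A minimal sketch of one workaround is to name only the numeric columns on the left-hand side:

```r
set.seed(42)  # seed added only to make the sketch reproducible
name <- rep(LETTERS[1:5], each = 2)
daf <- data.frame(ID = name,
                  feature = paste0("Feat", name),
                  valueA = runif(10) * 10,
                  valueB = runif(10) * 10,
                  stringsAsFactors = FALSE)

# Aggregate only the numeric columns, grouped by ID
res <- aggregate(cbind(valueA, valueB) ~ ID, data = daf, FUN = mean)
res
```

If feature should be kept in the result, it can be added as a second grouping variable on the right-hand side (ID + feature), since it is constant within each ID in this example.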

Replacing all negative values from a dataset

I have a dataframe with mixed data: some variables (columns) hold numerical values and others hold factors.
I would like to use R to replace all negative values with NA and subsequently remove an entire variable if more than 99% of its observations are NA.
The first part should make sure there is no problem when encountering strings.
Would it be possible to simply start with:
mydata$v1[mydata$v1<0] <- NA
but without being specific to v1, and only if the observation is not a string?
Follow up:
This is how far I got with the explanation provided by @stas g. However, it does not seem that any variable was dropped from the df.
#mixed data
df <- data.frame(WVS_Longitudinal_1981_2014_R_v2015_04_18)
dat <- df[,sapply(df, function(x) {class(x)== "numeric" | class(x) == "integer"})]
foo <- function(dat, p){
  ind <- colSums(is.na(dat))/nrow(dat)
  dat[dat < 0] <- NA
  dat[, ind < p]
}
#process numeric part of the data separately
ii <- sapply(df, class) == "numeric" | sapply(df, class) == "integer"
dat.num <- foo(as.matrix(df[, ii]), 0.99)
#then stick the two parts back together again
WVS <- data.frame(df[, !ii], dat.num)
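One possible reason that nothing is dropped: in foo as written in this follow-up, ind is computed before the negative values are replaced, so the NA proportion only reflects values that were already missing. A minimal sketch with the two steps swapped, on a small made-up matrix (the name foo_fixed and the toy data are illustrative assumptions):

```r
# Replace negatives first, then measure the proportion of NAs per column
foo_fixed <- function(dat, p) {
  dat[dat < 0] <- NA
  ind <- colSums(is.na(dat)) / nrow(dat)
  dat[, ind < p, drop = FALSE]   # drop = FALSE keeps a matrix even if one column remains
}

# Toy matrix: column b is entirely negative, column c has a single negative value
m <- cbind(a = c(1, 2, 3, 4),
           b = c(-1, -2, -3, -4),
           c = c(-1, 2, 3, 4))
res <- foo_fixed(m, 0.99)
res
```

Here column b ends up all NA and is removed, while column c keeps its single NA.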
It is impossible to know exactly how to help you without a minimal reproducible example, but assuming you have sample data like the below:
#matrix of random normal observations, 20 samples, 5 variables
dat <- matrix(rnorm(100), nrow = 20)
#if entry is negative, replace with 'NA'
dat[dat < 0] <- NA
#threshold for dropping a variable
p <- 0.99
#check how many NAs in each column (proportionally)
ind <- colSums(is.na(dat))/nrow(dat)
#only keep columns where threshold is not exceeded
dat <- dat[, ind < p]
If you have non-numeric variables and you are dealing with a data.frame, you could do something like this (assuming you don't care about the order of the columns):
#generate mixed data
dat <- matrix(rnorm(100), nrow = 20) #20 x 5 matrix of numeric values
df <- data.frame(letters[1 : 20], dat) #combined with one character column
foo <- function(dat, p){
  #replace negatives first, so that the NA proportion below accounts for them
  dat[dat < 0] <- NA
  ind <- colSums(is.na(dat))/nrow(dat)
  dat[, ind < p]
}
#process numeric part of the data separately
ii <- sapply(df, class) == "numeric" #ind of numeric columns
dat.num <- foo(as.matrix(df[, ii]), 0.99) #feed numeric part of data to foo
#then stick the two parts back together again
data.frame(df[, !ii], dat.num)
This approach, suggested by @YOLO, finally solved the issue:
cleanFun <- function(df){
  # select numeric columns
  num_cols <- names(df)[sapply(df, is.numeric)]
  # set negative values as NA, in the numeric columns only
  for (col in num_cols) df[[col]][which(df[[col]] < 0)] <- NA
  # get names of numeric columns with 99% or more NA values
  col_to_remove <- num_cols[colMeans(is.na(df[num_cols])) >= 0.99]
  # drop those columns
  return (df[setdiff(colnames(df),col_to_remove)])
}
your_df <- cleanFun(your_df)

Apply a user defined function to a list of data frames

I have a series of data frames structured similarly to this:
df <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',11:21))
df2 <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',50:60))
In order to clean them I wrote a user defined function with a set of cleaning steps:
clean <- function(df){
colnames(df) <- df[2,]
df <- df[grep('^[0-9]{4}', df$year),]
return(df)
}
I'd now like to put my data frames in a list:
df_list <- list(df,df2)
and clean them all at once. I tried
lapply(df_list, clean)
and
for(df in df_list){
clean(df)
}
But with both methods I get the error:
Error in df[2, ] : incorrect number of dimensions
What's causing this error and how can I fix it? Is my approach to this problem wrong?
You are close, but there is one problem in your code. Since you have text in your data frame's columns, the columns are created as factors rather than characters (the default behaviour of data.frame() before R 4.0.0). Thus your column naming does not give the expected result.
# need to set stringsAsFactors = FALSE
df <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',11:21), stringsAsFactors = FALSE)
df2 <- data.frame(x = c('notes','year',1995:2005), y = c(NA,'value',50:60), stringsAsFactors = FALSE)
clean <- function(df){
colnames(df) <- df[2,]
#need to specify the column to select the rows
df <- df[grep('^[0-9]{4}', df$year),]
#convert the columns to numeric values
df[, 1:ncol(df)] <- apply(df[, 1:ncol(df)], 2, as.numeric)
return(df)
}
df_list <- list(df,df2)
lapply(df_list, clean)
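Note that neither lapply() nor the for loop modifies df_list in place; the cleaned data frames have to be assigned to something. The sketch below shows both variants. The names cleaned and cleaned_loop, the list names a and b, and the small changes inside clean (unlist() for the header row, lapply() for the numeric conversion) are illustrative choices rather than part of the answer above.

```r
df  <- data.frame(x = c('notes', 'year', 1995:2005), y = c(NA, 'value', 11:21),
                  stringsAsFactors = FALSE)
df2 <- data.frame(x = c('notes', 'year', 1995:2005), y = c(NA, 'value', 50:60),
                  stringsAsFactors = FALSE)

clean <- function(df) {
  colnames(df) <- as.character(unlist(df[2, ]))  # second row holds the header
  df <- df[grep('^[0-9]{4}', df$year), ]         # keep rows whose year is 4 digits
  df[] <- lapply(df, as.numeric)                 # convert every column to numeric
  rownames(df) <- NULL
  df
}

df_list <- list(a = df, b = df2)

# Variant 1: lapply returns a new list, which must be stored
cleaned <- lapply(df_list, clean)

# Variant 2: a loop over indices that writes each result back
cleaned_loop <- df_list
for (i in seq_along(df_list)) {
  cleaned_loop[[i]] <- clean(df_list[[i]])
}
```

Looping over seq_along(df_list) rather than over the elements themselves is what makes it possible to write each result back into the list.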

Simpler method to insert dataframe variable and name when creating many dataframes from raster type

Is there a simpler way to designate new dataframe rows and rownames in the creation of a data frame from raster data?
rastA <- raster("rasterA.txt")
rastB <- raster("rasterB.txt")
rastC <- raster("rasterC.txt")
rastD <- raster("rasterD.txt")
rastE <- raster("rasterE.txt")
dfA <- as.data.frame(rastA)
dfB <- as.data.frame(rastB)
dfC <- as.data.frame(rastC)
dfD <- as.data.frame(rastD)
dfE <- as.data.frame(rastE)
# Renaming column in dataframe
names(dfA)[1] <- 'values'
names(dfB)[1] <- 'values'
names(dfC)[1] <- 'values'
names(dfD)[1] <- 'values'
names(dfE)[1] <- 'values'
# Adding new column with classifier 'X'
dfA$type <- 'X'
dfB$type <- 'X'
dfC$type <- 'X'
dfD$type <- 'X'
dfE$type <- 'X'
df_AB <- rbind.data.frame(dfA, dfB)
df_AC <- rbind.data.frame(dfA, dfC)
df_AD <- rbind.data.frame(dfA, dfD)
The final combined data frames are then fed into ggplot to generate various histogram and density plots. This line-by-line method is easy enough, but I am wondering what efficiencies can be gained by using different methods.
Here is an approach that simplifies part of this:
f <- system.file("external/test.grd", package="raster")
fls <- c(f, f, f, f, f)
s <- stack(fls) * 1:5
names(s) <- LETTERS[1:5]
df <- as.data.frame(s)
df <- na.omit(df)
I would expect that for most plots, df is what you want to use, and that you would not need to create all the separate objects that you do. However, if that is what you want, perhaps do
x <- reshape(df, varying=colnames(df), v.names='values', timevar='group', times=colnames(df), direction='long', new.row.names=NULL)
# see http://www.ats.ucla.edu/stat/r/faq/reshape.htm
rownames(x) <- NULL
x$id <- NULL
x$type <- 'X'
df_AB <- x[x$group %in% c('A', 'B'), ]
# etc
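If separate data frames per raster are still wanted, the repeated renaming and labelling steps from the question can also be collapsed with a named list and lapply(). The sketch below uses small made-up data frames in place of as.data.frame(raster(...)) so that it runs without the raster package. The helper prep and the names dfs, others and combined are hypothetical.

```r
# Toy stand-ins for as.data.frame(rastA), as.data.frame(rastB), ...
dfs <- list(A = data.frame(layer = c(1, 2, 3)),
            B = data.frame(layer = c(4, 5)),
            C = data.frame(layer = c(6, 7, 8)))

# Rename the first column and add the classifier column in one step
prep <- function(d, type = 'X') {
  names(d)[1] <- 'values'
  d$type <- type
  d
}
dfs <- lapply(dfs, prep)

# Combine A with each of the other data frames
others <- setdiff(names(dfs), 'A')
combined <- lapply(others, function(nm) rbind(dfs$A, dfs[[nm]]))
names(combined) <- paste0('df_A', others)
names(combined)
```

Each element of combined, for example combined$df_AB, can then be passed to ggplot in the same way as the hand-built df_AB.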
