I would like to remove the columns that have all zeros. But, some of the columns appear to have non numeric values. How can I remove the non numeric columns, and the columns with all zeros. It would be helpful if the non numeric column name was printed, or the column number, so I can determine if it was ok to remove the column.
Here's what I'm trying, but it doesn't work when the data table has non-numeric values (as.matrix() coerces the whole table to character as soon as any column is non-numeric, so abs() and colSums() fail).
removeColsAllZeros = function(ddt) {
m <- as.matrix(ddt)
# isNumericColList <- lapply(1:ncol(m), function(ii,mm){is.numeric(mm[,ii])}, mm=m)
# indexNonNumericCols <- which(!unlist(isNumericColList))
mnz <- m[, colSums(abs(m),na.rm = TRUE) != 0]
return(mnz)
}
Here's a simple function that can be applied to all columns in your data frame, returning just the ones that are numeric and not all zero:
# Fake data
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
dat
x y z w
1 0.5450570 0 0 B
2 0.5292899 0 0 B
3 -0.2142306 0 1 C
4 -0.7246841 0 0 C
5 -0.7567683 0 1 A
# Remove columns with all zeros or that are not numeric
dat[, !sapply(names(dat), function(col) {all(dat[,col]==0) |
!is.numeric(dat[,col])})]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
To unpack this, the function checks, for a single column of dat, whether it has all zeros or is not numeric. sapply then "applies" this function to every column in the data frame, returning a logical vector with TRUE for columns of dat with all zeros or that are non-numeric, and FALSE for columns that are numeric and not all zeros. The ! ("NOT") before sapply just reverses the FALSE and TRUE values:
!sapply(names(dat), function(col) {
all(dat[, col]==0) | !is.numeric(dat[, col])
})
x y z w
TRUE FALSE TRUE FALSE
Then we use this logical vector to return only those columns of dat that are TRUE.
dat[ , c(TRUE, FALSE, TRUE, FALSE)]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
Finally, to check the non-numeric columns that were removed, do the following, which will return all non-numeric columns:
dat[, sapply(names(dat), function(col) {!is.numeric(dat[,col])})]
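One caveat with that last line: if only one column is non-numeric (as here), the indexing drops the result to a bare vector. Adding base R's standard drop argument keeps it as a data frame:
dat[, sapply(names(dat), function(col) {!is.numeric(dat[,col])}), drop=FALSE]
  w
1 B
2 B
3 C
4 C
5 A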
This is not compact, but it works on a data.table after modifying @eipi10's code.
# toy data
set.seed(1)
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
# code for a data table
library(data.table)
setDT(dat)
idx = sapply(dat, function(x){ !(all(x==0) | !is.numeric(x)) })
dat[, .SD, .SDcols = idx]
# x z
# 1: -0.6264538 1
# 2: 0.1836433 1
# 3: -0.8356286 0
# 4: 1.5952808 1
# 5: 0.3295078 0
Both of the other answers were helpful, but they didn't totally answer the question. Here's a function to identify and remove the non-numeric and all-zero columns from a data table. This was helpful and provided additional insight into the data set.
removeColsAllZeros = function(ddt) {
# Identify and remove nonnumeric cols and cols with all zeros
idx_all_zeros = ddt[, lapply(.SD, function(x){ (is.numeric(x) & all(x==0)) })]
idx_not_numeric = ddt[, lapply(.SD, function(x){ (!is.numeric(x)) })]
idx_all_zeros = which(unlist(idx_all_zeros))
idx_not_numeric = which(unlist(idx_not_numeric))
# Print bad column names
if (length(idx_all_zeros)>0) {
cat('Numeric columns with all zeros are\n',paste(names(ddt)[idx_all_zeros],collapse='\n'),'\n')
flush.console()
}
if (length(idx_not_numeric)>0) {
cat('Nonnumeric columns are\n',paste(names(ddt)[idx_not_numeric],collapse='\n'),'\n')
flush.console()
}
# Determine the numeric columns that have nonzero values
idx_bad = union(idx_all_zeros, idx_not_numeric)
idx_good = setdiff(seq(1,ncol(ddt)), idx_bad)
# Return nonzero numeric data
ddt[, .SD, .SDcols = idx_good]
}
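As a quick usage sketch (assuming the toy data.table dat built in the previous answer, i.e. after setDT(dat)), the function should report y as an all-zero numeric column and w as non-numeric, and return just x and z:
removeColsAllZeros(dat)
# Numeric columns with all zeros are
#  y
# Nonnumeric columns are
#  w
# ...followed by the returned data.table with columns x and z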
In an R data frame, how can I add margin columns/rows that COUNT numeric cells, just like Excel?
For instance, for the data frame below, I want to add a column and a row that count the cells with numeric content in each row/column.
mdata <- data.frame(
"CATEGORY"=c("A","B","C","D","E","F"),
"SALES"=c(1,20,2,2,0,3),
"QTY"=c(0,4,0,0,0,2),
"RETURN"=c(0,3,1,0,9,1)
)
There are a couple of things that make this not straightforward for your example:
You're using 0 both for values that are genuinely zero in the Excel spreadsheet and for empty cells - I've replaced the empty cells with NA in the example
For the column counts, the first column stores labels, so you don't want to include it in the count, whereas the first row is part of the data
Taking this into account, one solution would be:
mdata <- data.frame(
"CATEGORY"=c("A","B","C","D","E","F"),
"SALES"=c(1,20,2,2,0,3),
"QTY"=c(NA,4,NA,NA,NA,2),
"RETURN"=c(NA,3,1,NA,9,1)
)
mdata <- rbind(mdata, c("rowCount", colSums(!is.na(mdata))[-1]))
mdata <- cbind(mdata, columnCount = c(head(rowSums(!is.na(mdata[,-1])),-1),NA))
mdata
# CATEGORY SALES QTY RETURN columnCount
# 1 A 1 <NA> <NA> 1
# 2 B 20 4 3 3
# 3 C 2 <NA> 1 2
# 4 D 2 <NA> <NA> 1
# 5 E 0 <NA> 9 2
# 6 F 3 2 1 3
# 7 rowCount 6 2 4 NA
The main trick is to use colSums(!is.na())/rowSums(!is.na()) to count the number of non-NA values in each row/column.
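As a minimal standalone illustration of that idiom (a toy matrix, separate from the data above):
m <- matrix(c(1, NA, 3, NA, 5, 6), nrow = 2)
colSums(!is.na(m))  # 1 1 2 -- non-NA count per column
rowSums(!is.na(m))  # 3 1   -- non-NA count per row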
A base R solution. Note that unless this is purely an output for display, this is not an advisable data structure:
# Function to resolve numeric column vectors:
# resolve_num_vecs => function()
resolve_num_vecs <- function(df){
# Resolve the numeric vectors: num_cols => logical vector
num_cols <- vapply(
df,
is.numeric,
logical(1)
)
# Explicitly define the returned object:
# logical vector => env
return(num_cols)
}
# Apply the function: num_cols => logical vector
num_cols <- resolve_num_vecs(mdata)
# Create data.frame as required, note this is not an
# advisable data structure: res => data.frame
res <- rbind(
transform(
mdata,
colCOUNT = rowSums(
Negate(is.na)(mdata[,num_cols]) & mdata[,num_cols] != 0
)
),
c(
"rowCOUNT",
colSums(
Negate(is.na)(mdata[,num_cols]) & mdata[,num_cols] != 0
),
NA_integer_
)
)
I want to use a list external to my data.table to inform what a new column of data should be in that data.table. In this case, the new column should hold the length of the list element corresponding to a data.table attribute:
# dummy list. I am interested in extracting the vector length of each list element
l <- list(a=c(3,5,6,32,4), b=c(34,5,6,34,2,4,6,7), c = c(3,4,5))
# dummy dt; the underscore number in Attri2 is the element of the list I want the length of
library(data.table)
dt <- data.table(Attri1 = c("t","y","h","g","d","e","d"),
Attri2 = c("fghd_1","sdafsf_3","ser_1","fggx_2","sada_2","sfesf_3","asdas_2"))
# extract that number to a new attribute, just for clarity
dt[, list_gp := tstrsplit(Attri2, "_", fixed=TRUE, keep=2)]
# then calculate the lengths of the vectors in the list, and attempt to subset by the index taken above
dt[,list_len := '[['(lapply(l, length),list_gp)]
Error in lapply(l, length)[[list_gp]] : no such index at level 1
I envisaged the list_len column to be 5,3,5,8,8,3,8
A couple of things:
tstrsplit gives you a string; convert it to a number.
I'm not quite sure about the [[ construct there; see the proposed solution:
dt[, list_gp := as.numeric( tstrsplit(Attri2, "_", fixed=TRUE, keep=2)[[1]] )]
dt[, list_len := sapply( l[ list_gp ], length ) ]
Output:
> dt
Attri1 Attri2 list_gp list_len
1: t fghd_1 1 5
2: y sdafsf_3 3 3
3: h ser_1 1 5
4: g fggx_2 2 8
5: d sada_2 2 8
6: e sfesf_3 3 3
7: d asdas_2 2 8
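A small aside on the last line: base R's lengths() (available since R 3.2.0) is an equivalent, slightly more idiomatic way to get the element lengths:
dt[, list_len := lengths( l[ list_gp ] )]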
I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
>dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from #Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but then when I try to lapply to multiple columns, these lists get somehow coerced into string-rows and recycled, which is baffling me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=toDummy]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full one-hot dummy variables but rather contrast variables (n-1 of them for n levels, since I'm using this for regression), and I have multiple numeric variables that I do not want to OHE.
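To make the distinction concrete, here is what R's default treatment contrasts produce for a 3-level factor - n-1 = 2 columns, with the first level as the reference:
contrasts(factor(letters[1:3]))
#   b c
# a 0 0
# b 1 0
# c 0 1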
I have a data frame with 200 rows and 150 columns. Out of those, I wish to change the NAs in about 50 rows and 100 columns.
Below is an example (a small part) of my data frame:
>df
Bird Mammal Type
1 NA 1 A
2 1 0 B
3 1 0 A
4 0 NA A
5 NA 1 A
6 0 0 B
7 0 0 A
8 NA NA A
9 1 1 B
10 1 1 A
What I want is to change all the NAs to 0 ONLY for type "A", but not for type "B". For type "B", I want everything to remain the same.
I have tried to do this with various ifelse options, but I think I still don't have the hang of it. Here are some of the things I've tried:
a) Subsetting only the columns as a list:
try <- c(1,2)
for(i in 1:length(try)){
df[,try[i]] <- ifelse(df[,is.na(try[i])],0,df[,try[i]])
}
b) Subsetting both rows and columns (this gave me a data frame, so of course the ifelse didn't run)
Here is a very simple one-liner that does exactly what you want. No loops or apply needed.
df[is.na(df) & df$Type=='A'] <- 0
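In case the recycling is not obvious: is.na(df) is a logical matrix, and the length-nrow(df) vector df$Type=='A' is recycled down each of its columns, so the condition lines up with the rows. A tiny hypothetical example (not the question's data):
m <- matrix(c(NA, 1, NA, 2, 3, NA, 4, NA), nrow = 4)
type <- c("A", "B", "A", "B")
is.na(m) & type == "A"  # TRUE only where the cell is NA and the row is type "A"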
You can use a combination of lapply and ifelse.
Assuming you have a vector of indices or names of the columns with the NAs stored as cols, you can do the following:
df[ ,cols] <- as.data.frame(lapply(cols,
FUN = function(x) ifelse(df$Type == "A" & is.na(df[,x]), 0, df[, x])))
Here is an option using set from data.table, considering all columns other than 'Type'. set is fast, and it changes the values in place without first building a logical matrix.
library(data.table)
setDT(df)
nm1 <- setdiff(names(df), 'Type')
for(j in nm1){
set(df, i= which(is.na(df[[j]]) & df$Type=='A'), j=j, value=0)
}
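Working through the question's example data, the result should be that the NAs (which all fall in type "A" rows here) become 0 and type "B" rows are untouched:
df
#     Bird Mammal Type
#  1:    0      1    A
#  2:    1      0    B
#  3:    1      0    A
#  4:    0      0    A
#  5:    0      1    A
#  6:    0      0    B
#  7:    0      0    A
#  8:    0      0    A
#  9:    1      1    B
# 10:    1      1    A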
I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me, and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from and the columns we're filling to have the same names; the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2: I realized I had logic flaws in my example, which I've fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
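If you're on a recent data.table version, note that clashing columns brought in by the join are prefixed with i. rather than suffixed with .1, so the two subassignments above would become:
# outDT[is.na(f), f := i.f]
# outDT[is.na(g), g := i.g]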
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
My preference would be to pull out the code from merge that does the matching and do it myself, so that I can keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops. To do so, I create a new data frame with the revised fillCols and replace the original's columns with it; I thought I could fill them in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names were faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
  # Build a single key string per row from the merge columns
  fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
  naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
  # Logical matrix marking the NA cells to be filled
  na.ind <- is.na(naDf[,fillCols])
  # Two-column matrix index into fillDf: matched row for each NA cell, same column
  fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
  naX <- naDf[,fillCols]
  fillX <- fillDf[,fillCols]
  # Fill the NA cells in one vectorized assignment, then put the columns back
  naX[na.ind] <- fillX[fill.ind]
  naDf[,colnames(naX)] <- naX
  naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
  # Build a single key string per row from the merge columns
  fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
  naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
  # For each row of naDf, the matching row of fillDf
  m <- match(naB, fillB)
  for(col in fillCols) {
    # Replace this column's NAs with the values from the matched rows
    fix <- which(is.na(naDf[,col]))
    naDf[fix, col] <- fillDf[m[fix],col]
  }
  naDf
}
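As a quick sanity check (using the test case from the question): both versions preserve naDf's row order, so they should agree with each other:
out1 <- fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g"))
out2 <- fillNaDf2(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g"))
identical(out1, out2)  # expected TRUE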