In an R data frame, how can I add margin rows/columns of counts, like Excel's COUNT function?
For instance, given the data frame below (the red area of my spreadsheet), I want to add a column and a row (the yellow area) that count the cells with numeric content in each row and column.
mdata <- data.frame(
"CATEGORY"=c("A","B","C","D","E","F"),
"SALES"=c(1,20,2,2,0,3),
"QTY"=c(0,4,0,0,0,2),
"RETURN"=c(0,3,1,0,9,1)
)
There are a couple of things that make this not straightforward for your example:
You're using 0 both for values that are genuinely zero in the Excel spreadsheet and for empty cells - I've replaced the empty cells with NA in the example
The labels are stored in the first column, so you don't want to include that column in the counts, whereas the first row is part of the data
Taking this into account, one solution would be:
mdata <- data.frame(
"CATEGORY"=c("A","B","C","D","E","F"),
"SALES"=c(1,20,2,2,0,3),
"QTY"=c(NA,4,NA,NA,NA,2),
"RETURN"=c(NA,3,1,NA,9,1)
)
mdata <- rbind(mdata, c("rowCount", colSums(!is.na(mdata))[-1]))
mdata <- cbind(mdata, columnCount = c(head(rowSums(!is.na(mdata[,-1])),-1),NA))
mdata
# CATEGORY SALES QTY RETURN columnCount
# 1 A 1 <NA> <NA> 1
# 2 B 20 4 3 3
# 3 C 2 <NA> 1 2
# 4 D 2 <NA> <NA> 1
# 5 E 0 <NA> 9 2
# 6 F 3 2 1 3
# 7 rowCount 6 2 4 NA
The main trick is to use colSums(!is.na())/rowSums(!is.na()) to count the number of non-NA values in each row/column.
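To see the idiom in isolation (on the NA-coded mdata above, before the margin row and column are appended):
# !is.na() gives a logical matrix of "cell has a value"; the label column is excluded
rowSums(!is.na(mdata[, -1]))  # 1 3 2 1 2 3  (per-row counts)
colSums(!is.na(mdata[, -1]))  # SALES 6, QTY 2, RETURN 4  (per-column counts)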
Base R solution; note that unless this is purely display output, mixing summary rows into the data is not an advisable data structure:
# Function to resolve numeric column vectors:
# resolve_num_vecs => function()
resolve_num_vecs <- function(df){
# Resolve the numeric vectors: num_cols => logical vector
num_cols <- vapply(
df,
is.numeric,
logical(1)
)
# Explicitly define the returned object:
# logical vector => env
return(num_cols)
}
# Apply the function: num_cols => logical vector
num_cols <- resolve_num_vecs(mdata)
# Create data.frame as required, note this is not an
# advisable data structure: res => data.frame
res <- rbind(
transform(
mdata,
colCOUNT = rowSums(
Negate(is.na)(mdata[,num_cols]) & mdata[,num_cols] != 0 # element-wise &, not &&
)
),
c(
"rowCOUNT",
colSums(
Negate(is.na)(mdata[,num_cols]) & mdata[,num_cols] != 0 # element-wise &, not &&
),
NA_integer_
)
)
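With the original zero-coded mdata, the != 0 condition treats zeros as empty (like the blank Excel cells), so the counts differ slightly from the NA-based answer above. The expected print is roughly:
res
#   CATEGORY SALES QTY RETURN colCOUNT
# 1        A     1   0      0        1
# 2        B    20   4      3        3
# 3        C     2   0      1        2
# 4        D     2   0      0        1
# 5        E     0   0      9        1
# 6        F     3   2      1        3
# 7 rowCOUNT     5   2      4     <NA>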
This is my first post on Stack Overflow and English is not my first language, so I apologize in advance for any mistakes in both grammar and programming.
I need to replace values in one column of my data frame based on part of values which are in another data frame. My question is similar to this post here, but in their example they have all the possible errors mapped out. In my case, I only need a part of the string to know if I need to replace a value or not.
I have already tried if_else and grepl with dplyr. grepl works as long as the second data frame has only one row; when I add another entry I get an error.
Right now my real DF has around 30k rows and 33 variables, and the second DF with the right values may grow every month, so I'm trying to run away from loops as much as I can.
I made a mock table with random data to simulate my need:
library(dplyr)
df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
Value = c(100,200,300,400,200, 100,200,40,150,70))
df2 <- data.frame(Supplier =c("CC","EE","GG"),
New_Supplier = c("Red","Blue","Green"))
# Example 1: unfortunately this won't work unless there is an exact match:
df1$Supplier <- if_else(df1$Supplier %in% df2$Supplier, df2$New_Supplier, df1$Supplier)
# Example 2: only works when df2 has a single row:
df1$Supplier <- if_else(grepl(df2$Supplier, df1$Supplier), df2$New_Supplier, df1$Supplier)
So I have this on the first data frame:
Supplier Value
1 AAA 100
2 CCC 200
3 CCE 300
4 DDD 400
5 EEE 200
6 EED 100
7 GGG 200
8 HHH 40
9 III 150
10 JJJ 70
And this on the second data frame:
Supplier New_Supplier
1 CC Red
2 EE Blue
3 GG Green
My end goal is to have something like this:
Supplier Value
1 AAA 100
2 Red 200
3 Red 300
4 DDD 400
5 Blue 200
6 Blue 100
7 Green 200
8 HHH 40
9 III 150
10 JJJ 70
Thanks in advance!
This seems to be a case for fuzzyjoin's regex_left_join. After the regex_left_join, coalesce the columns together so that it returns the first non-NA element in each row.
library(fuzzyjoin)
library(dplyr)
regex_left_join(df1, df2, by = 'Supplier') %>%
transmute(Supplier = coalesce(New_Supplier, Supplier.x), Value)
Output:
Supplier Value
1 AAA 100
2 Red 200
3 Red 300
4 DDD 400
5 Blue 200
6 Blue 100
7 Green 200
8 HHH 40
9 III 150
10 JJJ 70
A Base R approach:
# Coerce 0-length vectors to NA values of the appropriate type:
# zero_to_nas => function()
zero_to_nas <- function(x){
if(identical(x, character(0))){
res <- NA_character_
}else if(identical(x, integer(0))){
res <- NA_integer_
}else if(identical(x, numeric(0))){
res <- NA_real_
}else if(identical(x, complex(0))){
res <- NA_complex_
}else if(identical(x, logical(0))){
res <- NA
}else{
res <- x
}
# If the result is NULL, return the original vector:
if(is.null(res)){
res <- x
}else{
invisible()
}
# Explicitly define returned object: vector => Global Env
return(res)
}
# Unlist handling 0 length vectors: list_2_vec => function()
list_2_vec <- function(lst){
# Unlist cleaned list: res => vector
res <- unlist(lapply(lst, zero_to_nas))
# Explicitly define return object: vector => GlobalEnv()
return(res)
}
# Function to perform a fuzzy match:
# fuzzy_match => function()
fuzzy_match <- function(vec_to_match_to, vec_to_match_on){
# Perform a fuzzy match: res => character vector:
res <- list_2_vec(
regmatches(
vec_to_match_to,
gregexpr(
paste0(
vec_to_match_on,
collapse = "|"
),
vec_to_match_to
)
)
)
# Explicitly define returned object:
# character vector => Global Env
return(res)
}
# Function to coalesce vectors: br_coalesce => function()
br_coalesce <- function(vec, ..., to_vec = TRUE){
# Coalesce the vectors: res_ir => list
res_ir <- apply(
cbind(
as.list(...),
as.list(vec)
),
1,
function(x){
head(zero_to_nas(x[!(is.na(x))]), 1)
}
)
# If the result is null return the original vector:
if(is.null(unlist(res_ir))){
res_ir <- vec
}else{
invisible()
}
# If we want the result to be a vector rather than a list:
if(isTRUE(to_vec)){
# Unlist the resultant list: res => vector
res <- unlist(res_ir)
# Otherwise
}else{
# Deep copy the list: res => list
res <- res_ir
}
# Explicitly define returned object:
# list or vector => Global Env
return(res)
}
# Apply the fuzzy match and coalesce functions:
# clean_df => data.frame
clean_df <- transform(
df1,
Supplier = br_coalesce(
df1$Supplier,
df2$New_Supplier[
match(
fuzzy_match(
df1$Supplier,
df2$Supplier
),
df2$Supplier
)
]
)
)
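If the helpers behave as intended, clean_df reproduces the goal output from the question:
clean_df
#    Supplier Value
# 1       AAA   100
# 2       Red   200
# 3       Red   300
# 4       DDD   400
# 5      Blue   200
# 6      Blue   100
# 7     Green   200
# 8       HHH    40
# 9       III   150
# 10      JJJ    70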
Data:
df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
Value = c(100,200,300,400,200, 100,200,40,150,70))
df2 <- data.frame(Supplier =c("CC","EE","GG"),
New_Supplier = c("Red","Blue","Green"))
After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
The vec contains the subcategories that were scored, and each "stop" marks the end of a reviewer's ratings. As such, I would like to organize the result into a data frame with this structure (expected output):
   a  b  c
1  2  5  1
2  1  3 NA
3 NA NA NA
4 NA NA  2
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
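To unpack why this works, here are the intermediate vectors for the sample data:
stops <- vec == "stop"
cumsum(stops)                    # 0 0 0 1 1 1 2 3 3 4  (reviewers completed so far)
i <- cumsum(stops)[!stops] + 1L  # 1 1 1 2 2 4  (row index for each rating)
j <- vec[!stops]                 # "a" "b" "c" "a" "b" "c"  (column for each rating)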
Base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for(i in vec) {
if(i=='stop') {row <- row+1
} else { df[row,i] <- ratings[[rating]]; rating <- rating+1}
}
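The resulting df matches the expected output:
df
#    a  b  c
# 1  2  5  1
# 2  1  3 NA
# 3 NA NA NA
# 4 NA NA  2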
Here is one option
library(data.table)
library(reshape2)
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
library(plyr)
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
> dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
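For reference, on the sample data the helper gives:
categorToMatrix(d$V2, name_prefix = 'V2')
#      V2.b V2.c
# [1,]    0    0
# [2,]    1    0
# [3,]    0    1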
In addition to the above, I have tried the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from #Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but when I lapply this over multiple columns, the lists somehow get coerced into string rows and recycled, which baffles me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=toDummy]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table-optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables (because I'm using this for regression) but rather contrast variables (n-1 of these for n levels) and I have multiple numeric variables that I do not want to OHE.
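(As a small base R illustration of that distinction, using a hypothetical 3-level factor:)
x <- factor(c("a", "b", "c"))
contrasts(x)           # n-1 contrast columns; reference level "a" is dropped
#   b c
# a 0 0
# b 1 0
# c 0 1
model.matrix(~ x - 1)  # full one-hot encoding: one column per level
#   xa xb xc
# 1  1  0  0
# 2  0  1  0
# 3  0  0  1
# (attribute lines of the model matrix omitted)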
I would like to remove the columns that have all zeros. However, some of the columns appear to have non-numeric values. How can I remove the non-numeric columns as well as the columns with all zeros? It would be helpful if the non-numeric column names, or the column numbers, were printed, so I can decide whether it was OK to remove them.
Here's what I'm trying, but it doesn't work when the data table has non-numeric values.
removeColsAllZeros = function(ddt) {
m <- as.matrix(ddt)
# isNumericColList <- lapply(1:ncol(m), function(ii,mm){is.numeric(mm[,ii])}, mm=m)
# indexNonNumericCols <- which(!unlist(isNumericColList))
mnz <- m[, colSums(abs(m),na.rm = TRUE) != 0]
return(mnz)
}
Here's a simple function that can be applied to all columns in your data frame, returning just the ones that are numeric and not all zero:
# Fake data
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
dat
x y z w
1 0.5450570 0 0 B
2 0.5292899 0 0 B
3 -0.2142306 0 1 C
4 -0.7246841 0 0 C
5 -0.7567683 0 1 A
# Remove columns with all zeros or that are not numeric
dat[, !sapply(names(dat), function(col) {all(dat[,col]==0) |
!is.numeric(dat[,col])})]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
To unpack this, the function checks, for a single column of dat, whether it has all zeros or is not numeric. sapply then "applies" this function to every column in the data frame, returning a logical vector with TRUE for columns of dat with all zeros or that are non-numeric, and FALSE for columns that are numeric and not all zeros. The ! ("NOT") before sapply just reverses the FALSE and TRUE values:
!sapply(names(dat), function(col) {
all(dat[, col]==0) | !is.numeric(dat[, col])
})
x y z w
TRUE FALSE TRUE FALSE
Then we use this logical vector to return only those columns of dat that are TRUE.
dat[ , c(TRUE, FALSE, TRUE, FALSE)]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
Finally, to check the non-numeric columns that were removed, do the following, which will return all non-numeric columns:
dat[, sapply(names(dat), function(col) {!is.numeric(dat[,col])})]
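Note that with only one non-numeric column this drops to a plain vector; add drop=FALSE if you want to keep a data frame:
dat[, sapply(names(dat), function(col) {!is.numeric(dat[,col])}), drop = FALSE]
#   w
# 1 B
# 2 B
# 3 C
# 4 C
# 5 A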
This is not compact, but it works on a data.table after modifying @eipi10's code.
# toy data
set.seed(1)
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
# code for a data table
library(data.table)
setDT(dat)
idx = sapply(dat, function(x){ !(all(x==0) | !is.numeric(x)) })
dat[, .SD, .SDcols = idx]
# x z
# 1: -0.6264538 1
# 2: 0.1836433 1
# 3: -0.8356286 0
# 4: 1.5952808 1
# 5: 0.3295078 0
Both of the other answers were helpful, but they didn't totally answer the question. Here's a function to identify and remove the non-numeric and all-zero columns from a data table. This was helpful and provided additional insight into the data set.
removeColsAllZeros = function(ddt) {
# Identify and remove nonnumeric cols and cols with all zeros
idx_all_zeros = ddt[, lapply(.SD, function(x){ (is.numeric(x) & all(x==0)) })]
idx_not_numeric = ddt[, lapply(.SD, function(x){ (!is.numeric(x)) })]
idx_all_zeros = which(unlist(idx_all_zeros))
idx_not_numeric = which(unlist(idx_not_numeric))
# Print bad column names
if (length(idx_all_zeros)>0) {
cat('Numeric columns with all zeros are\n',paste(names(ddt)[idx_all_zeros],collapse='\n'),'\n')
flush.console()
}
if (length(idx_not_numeric)>0) {
cat('Nonnumeric columns are\n',paste(names(ddt)[idx_not_numeric],collapse='\n'),'\n')
flush.console()
}
# Determine the numeric columns that have nonzero values
idx_bad = union(idx_all_zeros, idx_not_numeric)
idx_good = setdiff(seq(1,ncol(ddt)), idx_bad)
# Return nonzero numeric data
ddt[, .SD, .SDcols = idx_good]
}
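A quick usage sketch on a toy data.table (column values are illustrative):
library(data.table)
dat <- data.table(x = rnorm(5), y = rep(0, 5), z = c(1, 0, 1, 0, 1), w = LETTERS[1:5])
removeColsAllZeros(dat)
# Prints:
#   Numeric columns with all zeros are
#    y
#   Nonnumeric columns are
#    w
# and returns the data.table with only columns x and z.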
How do I create a fixed-size data frame of size [40, 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of random characters of length k. rstr(...) does not guarantee that all strings are unique, but the probability of any duplication is small for reasonable k (on the order of n^2/(2*26^k) by the birthday approximation).
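If uniqueness must be guaranteed, one simple safeguard is base R's make.unique:
ids <- make.unique(rstr(40, 5))  # appends .1, .2, ... to any duplicated strings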
Create the data.frame and define its columns with the values. The recycling rule repeats the strings to match the 40 rows defined by the second column:
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
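The two strings then alternate down the 40 rows:
head(df$string_col, 4)
# [1] "unique_string 1" "unique_string 2" "unique_string 1" "unique_string 2"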
I find this way of creating data frames in R extremely productive and easy: create a raw vector of values, then convert it into a matrix of the required dimensions, and finally name the columns and rows:
dataframe.values = c(value1, value2, .......)
dataframe = matrix(dataframe.values, nrow = number_of_rows, byrow = TRUE)
colnames(dataframe) = c("column1", "column2", ........)
row.names(dataframe) = c("row1", "row2", ............)
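A concrete instance of this template (values and names are illustrative); note that matrix() returns a matrix, so wrap it in as.data.frame() if you need an actual data frame:
dataframe.values = c(10, 20, 30, 40)
m = matrix(dataframe.values, nrow = 2, byrow = TRUE)
colnames(m) = c("column1", "column2")
row.names(m) = c("row1", "row2")
dataframe = as.data.frame(m)
dataframe
#      column1 column2
# row1      10      20
# row2      30      40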
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5