I was looking for a clear explanation of 'labels are constructed using "(a,b]" interval notation', as described in the cut help file, which seemed to lack one.
So I tested cut on some simple examples as follows:
df <- data.frame(c(1,2,3,4,5,6,7,99))
names(df) <- 'x'
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE)
df
x cut
# 1 <NA>
# 2 <NA>
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So the '(' means x > the break on the left, and the ']' means x <= the (next) break on the right. If a value is lower than or equal to the lowest break it is flagged as NA; similarly, if a value exceeds the highest break it is also flagged as NA.
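To double-check that reading, here is a small sketch that rebuilds the bin index by hand with the comparisons described above and sets it next to what cut returns. The names manual and manual_vs_cut are mine, purely for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)
breaks <- c(2, 4, 6, 8)

bins <- cut(x, breaks = breaks, right = TRUE)

# By hand: bin i holds the values with breaks[i] < x <= breaks[i + 1];
# anything that falls in no bin gets NA.
manual <- sapply(x, function(v) {
  i <- which(v > head(breaks, -1) & v <= tail(breaks, -1))
  if (length(i) == 0) NA_integer_ else i
})

# Side-by-side view: the integer code behind each factor level vs. the manual index
manual_vs_cut <- data.frame(x = x, cut = bins, code = as.integer(bins), manual = manual)
manual_vs_cut
```

The code and manual columns should agree row by row, including the NA rows for 1, 2 and 99.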
Next testing the option include.lowest = TRUE
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE, include.lowest = TRUE)
df
x cut
# 1 <NA>
# 2 [2,4]
# 3 [2,4]
# 4 [2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So here for the first bin between the first two breaks, the '[' on left means >=(first break) and the ']' means <=(second) break. Subsequent breaks are treated as above.
Next the NA values can be addressed by using -Inf and/or +Inf in the breaks as follows:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE)
df
x cut
# 1 [-Inf,2]
# 2 [-Inf,2]
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 (8, Inf]
Setting the right = FALSE option swaps around the sense of the thresholds as per the example below:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = FALSE)
df
# x cut
# 1 [-Inf,2)
# 2 [2,4)
# 3 [2,4)
# 4 [4,6)
# 5 [4,6)
# 6 [6,8)
# 7 [6,8)
# 99 [8, Inf)
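As a cross-check on the two conventions, the sketch below compares the integer codes from cut with those from findInterval, which uses left-closed intervals by default and right-closed ones with left.open = TRUE (available in R 3.3.0 and later, as far as I know). The variable names are only illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 99)
breaks <- c(-Inf, 2, 4, 6, 8, Inf)

# Integer bin codes from cut() under both conventions
left_closed  <- as.integer(cut(x, breaks = breaks, right = FALSE))  # [a,b)
right_closed <- as.integer(cut(x, breaks = breaks, right = TRUE))   # (a,b]

# findInterval() returns the same codes without building a factor
fi_left_closed  <- findInterval(x, breaks)                    # [a,b) by default
fi_right_closed <- findInterval(x, breaks, left.open = TRUE)  # (a,b]

rbind(x, left_closed, fi_left_closed, right_closed, fi_right_closed)
```

If you only need the bin numbers rather than the labels, findInterval may be the lighter option.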
Finally the labels option allows custom names for the thresholds should you so wish ...
lbls <- c('x<=2','2<x<=4','4<x<=6','6<x<=8','x>8')
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE, labels = lbls)
df
x cut
# 1 x<=2
# 2 x<=2
# 3 2<x<=4
# 4 2<x<=4
# 5 4<x<=6
# 6 4<x<=6
# 7 6<x<=8
# 99 x>8
I built this custom "winsorize" function that does what it should, unless there are NA's in the data.
How it works:
winsor1 <- function(x, probability){
numWin <- ceiling(length(x)*probability)
# Replace first lower, then upper
x <- pmax(x, sort(x)[numWin+1])
x <- pmin(x, sort(x)[length(x)-numWin])
return(x)
}
x <- 0:10
winsor1(x, probability=0.01)
[1] 1 1 2 3 4 5 6 7 8 9 9
So it replaces the top (and bottom) 1% of the data (rounded up to the next value, since there are only 11 values in the example). If there are, e.g., 250 values then the bottom 3 and top 3 values would be replaced by the bottom 4th and top 4th respectively.
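The 250-value case can be checked directly. This is just the function from above run on 1:250; the names x250 and w250 are made up for the example.

```r
winsor1 <- function(x, probability){
  numWin <- ceiling(length(x) * probability)
  # Replace first lower, then upper
  x <- pmax(x, sort(x)[numWin + 1])
  x <- pmin(x, sort(x)[length(x) - numWin])
  return(x)
}

x250 <- 1:250
w250 <- winsor1(x250, probability = 0.01)

ceiling(250 * 0.01)   # 3 values are replaced at each end
head(w250, 5)         # bottom 3 pulled up to the 4th smallest value
tail(w250, 5)         # top 3 pulled down to the 4th largest value
sum(w250 != x250)     # number of values that changed
```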
The whole thing breaks down when there are NA's in the data, causing an error. However, if I set na.rm = TRUE in the pmax() and pmin() then the NA's themselves are replaced by the bottom value.
x[5] <- NA
winsor1(x, probability=0.01)
[1] 1 1 2 3 1 5 6 7 8 9 9
What can I do so that the NA's are preserved but do not cause an error? This is the output I want for the last line:
winsor1(x, probability=0.01)
[1] 1 1 2 3 NA 5 6 7 8 9 9
The issue is with sort(), which drops NA values by default. The alternative is to specify na.last = TRUE, which keeps them at the end, but that may not be what we need either. One option is order():
winsor1 <- function(x, probability){
numWin <- ceiling(length(x)*probability)
# Replace first lower, then upper
x1 <- x[order(x)]
x <- pmax(x, x1[numWin+1])
x1 <- x1[order(x1)]
x <- pmin(x, x1[length(x)-numWin]) # no na.rm here, so NA entries stay NA
return(x)
}
Testing:
x <- 0:10
winsor1(x, probability=0.01)
#[1] 1 1 2 3 4 5 6 7 8 9 9
x[5] <- NA
winsor1(x, probability=0.01)
#[1] 1 1 2 3 NA 5 6 7 8 9 10
Or with na.last = TRUE in sort():
winsor1 <- function(x, probability){
numWin <- ceiling(length(x)*probability)
# Replace first lower, then upper
x <- pmax(x, sort(x, na.last = TRUE)[numWin+1])
x <- pmin(x, sort(x, na.last = TRUE)[length(x)-numWin]) # no na.rm here, so NA entries stay NA
return(x)
}
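Note that in both versions the NA still counts towards length(x), so the upper cutoff can land one position too high (the 10 in the test output stays 10). A possible variant, sketched here under the assumption that the cutoffs should be based on the non-missing values only, computes both bounds from the NA-free data and then clamps without na.rm. The name winsor_na is hypothetical, and unlike the original it takes both bounds from the untouched data rather than re-sorting after the lower replacement.

```r
winsor_na <- function(x, probability) {
  s <- sort(x)                      # sort() drops NA by default
  n <- length(s)                    # number of non-missing values
  numWin <- ceiling(n * probability)
  lower <- s[numWin + 1]            # smallest value that is kept
  upper <- s[n - numWin]            # largest value that is kept
  # pmax()/pmin() without na.rm propagate NA, so missing entries stay missing
  pmin(pmax(x, lower), upper)
}

x <- 0:10
x[5] <- NA
winsor_na(x, probability = 0.01)
```

For this example the result matches the output asked for in the question. It does not guard against a probability so large that the lower cutoff exceeds the upper one.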
Is there a more elegant way to solve this problem?
For every TRUE value I'm looking for the positions of the closest previous and following FALSE values.
data:
vec <- c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)
desired outcome: (something like)
pos start end
[1,] 2 1 4
[2,] 3 1 4
[3,] 5 4 6
explanation of the first row of the outcome:
pos = 2, position of the first TRUE,
start = 1, position of the closest FALSE in front of pos = 2
end = 4, position of the closest FALSE after pos = 2.
Already working solution:
pos = which(vec)
f_pos = which(!vec)
t(
sapply(pos, function(x){ s <- rev(f_pos[f_pos < x])[1]; e <- f_pos[x < f_pos][1]; return(data.frame(pos = x, start = s, end = e)) })
)
Using findInterval
pos <- which(vec)
b <- which(!vec)
ix <- findInterval(pos, b)
cbind(pos, from = b[ix], to = b[ix + 1])
# pos from to
# [1,] 2 1 4
# [2,] 3 1 4
# [3,] 5 4 6
If we stretch your "something like" slightly, a simple cut will do:
data.frame(pos, rng = cut(pos, b))
# pos rng
# 1 2 (1,4]
# 2 3 (1,4]
# 3 5 (4,6]
If the vector ends with TRUE, the findInterval solution will give NA in the 'to' column. With cut, the last 'interval' is then coded as NA.
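A quick sketch of that edge case, using a made-up vector vec2 that ends with TRUE:

```r
vec2 <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

pos <- which(vec2)
b <- which(!vec2)
ix <- findInterval(pos, b)

# The last TRUE has no FALSE after it, so b[ix + 1] indexes past the end of b
res <- cbind(pos, from = b[ix], to = b[ix + 1])
res

# cut() has no interval covering position 5 either, so it returns NA there
rng <- cut(pos, b)
rng
```

A vector that starts with TRUE needs separate handling, because findInterval then returns 0 for the leading positions and b[0] silently drops them.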
You can treat the FALSE positions as interval boundaries and use data.table::foverlaps to find the interval each TRUE position falls in:
library(data.table)
# put your objects in data.tables:
f_pos_inter <- data.table(start=head(f_pos, -1), end=tail(f_pos, -1))
pos_inter <- data.table(start=pos, end=pos)
# define the keys:
setkeyv(pos_inter, c("start", "end")); setkeyv(f_pos_inter, c("start", "end"))
res <- foverlaps(pos_inter, f_pos_inter)
# start end i.start i.end
#1: 1 4 2 2
#2: 1 4 3 3
#3: 4 6 5 5
You can further reorder the columns and keep only the ones you need:
res[, i.end:=NULL]
setcolorder(res, c(3, 1, 2))
setnames(res, "i.start", "pos")
res
# pos start end
#1: 2 1 4
#2: 3 1 4
#3: 5 4 6
N.B: this will give NA in both columns start and end if vec ends with TRUE
I would like to create bins for the variable numbers per category name inside a function. But I am having difficulties related to using the category name provided as parameter inside the function. Maybe a data.table approach would be better.
set.seed(10)
b<-(rnorm(10, sd=1,mean=10))
y<-runif(3)
pr<-y/sum(y)
names<-unlist(lapply(mapply(rep, LETTERS[1:3], 1:3), function (x) paste0(x, collapse = "") ) )
x <- sample(names, 10, replace=TRUE, prob=pr)
df<-data.frame(name=x,numbers=b)
df
#working without bin limits per category (not desired)
#and using "numbers" in cut (not desired)
binfunction1 <- function(df, colgroup1, varcount,binsize) {
new<-df %>%
group_by_(colgroup1) %>%
mutate(bin = cut(numbers, breaks <- c(seq(7, 15, by = binsize)), # limits by colgroup not implemented
labels = 1:(length(breaks)-1) ) )
return(new)
}
binfunction1(df,"name","numbers",0.5)
name numbers bin
<fctr> <dbl> <fctr>
1 BB 10.018746 7
2 A 9.815747 6
3 CCC 8.628669 4
4 CCC 9.400832 5
5 BB 10.294545 7
6 CCC 10.389794 7
7 A 8.791924 4
8 A 9.636324 6
9 A 8.373327 3
10 A 9.743522 6
Not the most elegant solution, but is the outcome something you are after? (I didn't quite understand your question)
binfunction3 <- function(x, colgroup1, varcount, binsize) {
tmp <- split(x, x[[colgroup1]], drop = TRUE)
tp <- lapply(tmp, function(k) {
breaks <- c(seq(min(k[[varcount]])*0.9, max(k[[varcount]])*1.1, by = binsize))
cbind(k, data.frame(bin = cut(k[[varcount]], breaks, labels = 1:(length(breaks)-1))))
})
tp <- do.call(rbind, tp)
rownames(tp) <- gsub("[[:alpha:]]*\\.", "", rownames(tp))
return(tp[rownames(x),])
}
binfunction3(df,"name","numbers",0.5)
# name numbers bin
# 1 A 10.018746 5
# 2 CCC 9.815747 5
# 3 CCC 8.628669 2
# 4 BB 9.400832 2
# 5 A 10.294545 6
# 6 BB 10.389794 4
# 7 A 8.791924 3
# 8 CCC 9.636324 4
# 9 A 8.373327 2
# 10 A 9.743522 5
My answer is based on Mikko's, but allows better control over the min and max limits of the breaks and over the bin size.
binfunctionnew <- function(x, colgroup, varcount, binexp) {
tmp <- split(x, x[colgroup], drop = TRUE)
tp <- lapply(tmp, function(k) {
bin<-cut(k[,varcount],
breaks=c(seq(min(k[,varcount])*(1-10^(-(binexp+1))),
max(k[,varcount])*(1+10^(-(binexp-2))),
by = 10^(-(binexp))) ) , labels=F)
cbind (k, data.frame(bin = bin))
} )
tp <- do.call(rbind, tp)
return(tp)
}
# example: binexp = 1 (left) or binexp = 0 (right), outputs shown side by side
# binfunctionnew(df,"name","numbers",1) binfunctionnew(df,"name","numbers",0)
# name numbers bin name numbers bin
# A.1 A 10.018746 18 A.1 A 10.018746 3
# A.5 A 10.294545 21 A.5 A 10.294545 3
# A.7 A 8.791924 6 A.7 A 8.791924 2
# A.9 A 8.373327 1 A.9 A 8.373327 1
# A.10 A 9.743522 15 A.10 A 9.743522 3
# BB.4 BB 9.400832 1 BB.4 BB 9.400832 1
# BB.6 BB 10.389794 11 BB.6 BB 10.389794 2
# CCC.2 CCC 9.815747 13 CCC.2 CCC 9.815747 3
# CCC.3 CCC 8.628669 1 CCC.3 CCC 8.628669 1
# CCC.8 CCC 9.636324 11 CCC.8 CCC 9.636324 2
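For comparison, here is a shorter sketch of the same per-category idea that uses ave so the original row order is kept. The function name bin_by_group, the toy data and the choice to start each group's breaks at floor(min) are my own assumptions, not part of the answers above.

```r
bin_by_group <- function(df, group_col, value_col, binsize) {
  df$bin <- ave(df[[value_col]], df[[group_col]], FUN = function(v) {
    from <- floor(min(v))
    # enough steps of width binsize to cover the group's maximum (at least one bin)
    n_steps <- max(1, ceiling((max(v) - from) / binsize))
    breaks <- from + binsize * 0:n_steps
    cut(v, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  })
  df
}

toy <- data.frame(name = c("A", "A", "A", "B", "B"),
                  numbers = c(8.2, 8.7, 9.9, 10.1, 10.4))
binned <- bin_by_group(toy, "name", "numbers", 0.5)
binned
```

Because the column names are passed as strings and accessed with [[ ]], nothing inside the function is hard-coded to "numbers" or "name".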
I need to loop over a data frame and calculate functions over the variable that is being looped.
A table example:
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
I create a list of variables:
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
And the table that I will populate:
planF<-data.frame(deciles=c(1), min=c(1), max=c(1), pos=c(1))
planF<-planF[-1,]
And the loop:
library(plyr)
for (i in 1:length(nums)){
table$deciles<-ntile(table[,nums[i]],5)
plan<-ddply(table, 'deciles', summarize, min=min(nums[i]),
max=max(nums[i]),pos=sum(target))
planF<-rbind(planF,plan)
}
I need to get the min and max of the variable for each decile. But instead I get:
deciles min max pos
1 1 num1 num1 4
2 2 num2 num2 4
3 3 <NA> <NA> 2
4 4 <NA> <NA> 0
5 5 <NA> <NA> 0
6 1 num1 num1 4
7 2 num2 num2 4
8 3 <NA> <NA> 2
9 4 <NA> <NA> 0
10 5 <NA> <NA> 0
For variable num1 I need to get the result of:
ddply(table, 'deciles', summarize, min=min(num1),
max=max(num1),pos=sum(target))
deciles min max pos
1 5.736842 7.157895 0
2 7.631579 9.052632 0
3 1.000000 10.000000 2
4 1.947368 3.368421 4
5 3.842105 5.263158 4
And below the result of doing the same with num2.
I understand that I need to introduce the variable with the following form:
num1
but the code is writing
'num1'
I tried with:
min=min(as.name(nums[i]))
But I get an error:
Error in min(as.name(nums[i])) : 'type' (symbol) not valid argument
How can I calculate a function over the variable that is being looped?
The gist of your question is to apply a list of functions using the split-apply-combine method, so here is one way you can do this in base R.
## your data
table<-data.frame(num1=seq(1,10,len=20), num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10)))
nums<-colnames(table)[sapply(table, class)=='numeric']
nums<-nums[nums!='target']
library(dplyr) # ntile() comes from dplyr
table$deciles <- ntile(table[, nums[1]], 5)
FUNS <- list(min = min, max = max, mean = mean)
## split the variable num1 by deciles
## apply each function to each piece
x <- with(table, tapply(num1, deciles, function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS))))
## combine results
do.call('rbind', x)
# min max mean
# 1 1.000000 2.421053 1.710526
# 2 2.894737 4.315789 3.605263
# 3 4.789474 6.210526 5.500000
# 4 6.684211 8.105263 7.394737
# 5 8.578947 10.000000 9.289474
Instead of using a loop, since we have the above which works and is fairly simple, put it into a function like below
f <- function(num, data = table) {
FUNS <- list(min = min, max = max, mean = mean)
x <- tapply(data[, num], data[, 'deciles'], function(x)
setNames(sapply(FUNS, function(y) y(x)), names(FUNS)))
cbind(deciles = as.numeric(names(x)), do.call('rbind', x))
}
This way we have the method generalized so it can use any column you have with any data you have. You can call it for individual columns like
f('num1')
f('num2')
Or use a loop to get everything at once
lapply(c('num1','num2'), f)
# [[1]]
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# [[2]]
# deciles min max mean
# 1 1 20.00000 21.57895 20.78947
# 2 2 22.10526 23.68421 22.89474
# 3 3 24.21053 25.78947 25.00000
# 4 4 26.31579 27.89474 27.10526
# 5 5 28.42105 30.00000 29.21053
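If a single stacked table like the planF in the question is preferred, something along these lines might work. It is only a sketch: summarise_one is a hypothetical helper, the rank-based grouping is a simple stand-in for ntile() (it matches here because 20 rows split evenly into 5 groups), and tbl is a copy of the example data without the character column.

```r
tbl <- data.frame(num1 = seq(1, 10, len = 20),
                  num2 = seq(20, 30, len = 20),
                  target = c(rep(1, 10), rep(0, 10)))
nums <- c("num1", "num2")

summarise_one <- function(v, data, n_bins = 5) {
  # equal-count groups from ranks; data[[v]] looks the column up by its string name
  grp <- ceiling(rank(data[[v]], ties.method = "first") * n_bins / nrow(data))
  data.frame(variable = v,
             deciles  = sort(unique(grp)),
             min      = as.vector(tapply(data[[v]], grp, min)),
             max      = as.vector(tapply(data[[v]], grp, max)),
             pos      = as.vector(tapply(data$target, grp, sum)))
}

# one block of rows per variable, stacked into a single data frame
planF <- do.call(rbind, lapply(nums, summarise_one, data = tbl))
planF
```

The key point is data[[v]]: indexing with a string returns the column itself, which is what min(nums[i]) in the original loop was missing.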
If you don't like lapply, you can Vectorize the function to make it a little easier:
Vectorize(f, SIMPLIFY = FALSE)(c('num1', 'num2'))
Which you would more commonly use like this (SIMPLIFY = FALSE to retain the list structures)
v <- Vectorize(f, SIMPLIFY = FALSE)
v(c('num1','num1'))
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
#
# $num1
# deciles min max mean
# 1 1 1.000000 2.421053 1.710526
# 2 2 2.894737 4.315789 3.605263
# 3 3 4.789474 6.210526 5.500000
# 4 4 6.684211 8.105263 7.394737
# 5 5 8.578947 10.000000 9.289474
I would strictly prefer to use dplyr for this, even though there is some ugliness in handling string variable names in the call to summarize_ (note the trailing _):
library(lazyeval)
library(dplyr)
# create the data.frame
dfX = data.frame(num1=seq(1,10,len=20),
num2=seq(20,30,len=20),
char1=c(rep('a',10), rep('b',10)),
target=c(rep(1,10), rep(0,10))
)
# select the numeric columns
numericCols = names(dfX)[sapply(dfX, is.numeric)]
numericCols = setdiff(numericCols, "target")
# cycle over numeric columns, creating summary data.frames
liDFY = setNames(
lapply(
numericCols, function(x) {
# compute the quantiles
quantiles = quantile(dfX[[x]], probs = seq(0, 1, 0.2))
# create quantile membership
dfX[["quantile_membership"]] =
findInterval(dfX[[x]], vec = quantiles,
rightmost.closed = TRUE,
all.inside = TRUE)
# summarize variables by decile
dfX %>%
group_by(quantile_membership) %>%
summarize_(min = interp( ~ min(x_name), x_name = as.name(x)),
max = interp( ~ max(x_name), x_name = as.name(x)),
mean = interp( ~ mean(x_name), x_name = as.name(x)))
}),
numericCols
)
# inspect the output
liDFY[[numericCols[1]]]
I've generated the following data frame:
random <- data.frame(replicate(10, sample(1:12, 564, replace=TRUE)))
It contains 10 columns and 564 rows.
Each number in the column pertains to a day (1-12).
To this data frame I appended a new column containing the words "Green", "Pink" and "Red" in no particular order and filling all 564 rows using existing data I have:
random$color <- fruit$color
Here's what I'd like to do:
For each column 1-10, create the following table of counts:
Day Green Pink Red
1 # # #
2 # # #
3 # # #
4 # # #
... # # #
12 # # #
So, I should be able to know from this table the number of Day 1 Greens from column 1, for example. It is important that color counts from different columns are distinguishable from one another.
And there's a fun twist!
Counts for Day 9 and 10 need to be added for each Color, so each table should look like:
Day Green Pink Red
1 # # #
2 # # #
3 # # #
4 # # #
5 # # #
6 # # #
7 # # #
8 # # #
9 - 10 # # #
11 # # #
12 # # #
So far, I've tried to work with ddply and cast to do this and loop over each column, but I'm not familiar with loops. Here's the bologna I've got so far:
for(i in names(random)) {
random_counts <- ddply(random, c('color', i), function(x) c(count=nrow(x)))
random_counts <- cast(random_counts, i ~ color, mean, value='count')
random_counts
}
Help with this would be much much appreciated!
Thanks
Here is a base alternative:
# slightly smaller toy data
random <- data.frame(replicate(2, sample(1:5, 20, replace = TRUE)))
color <- sample(c("Green", "Pink", "Red"), nrow(random), replace = TRUE)
# use cut to put e.g. 3 and 4 in the same interval
random[] <- lapply(random, function(x) cut(x, breaks = c(0, 1, 2, 4, 5)))
# count
lapply(random, function(x) table(x, color))
# $X1
# color
# x Green Pink Red
# (0,1] 2 0 1
# (1,2] 3 0 2
# (2,4] 3 4 2
# (4,5] 1 1 1
#
# $X2
# color
# x Green Pink Red
# (0,1] 3 0 1
# (1,2] 4 0 1
# (2,4] 1 3 1
# (4,5] 1 2 3
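Scaled up to the 12 days in the question, the same idea might look like the sketch below, where days 9 and 10 share one interval and the labels argument gives the rows the names from the desired table. The seed, the sizes and the names day_breaks, day_labels, binned and counts are arbitrary choices for illustration.

```r
set.seed(1)
random <- data.frame(replicate(3, sample(1:12, 50, replace = TRUE)))
color <- sample(c("Green", "Pink", "Red"), nrow(random), replace = TRUE)

# (8,10] covers days 9 and 10; every other day gets its own interval
day_breaks <- c(0:8, 10, 11, 12)
day_labels <- c(1:8, "9 - 10", 11, 12)

binned <- lapply(random, function(x) cut(x, breaks = day_breaks, labels = day_labels))

# one Day x color table per original column, kept separate in a named list
counts <- lapply(binned, function(d) table(Day = d, color))
counts$X1
```

Each list element is named after its source column (X1, X2, ...), so the counts from different columns stay distinguishable.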