Quantile cuts despite duplicates

I have a dataset with > 900,000 rows with many duplicates:
> sum(duplicated(df$colB))
[1] 904515
So when I try to quantile cut into ten equally large subsets, I get an error
> df$colC <- cut(df$colB, quantile(df$colB,c(0:10)/10), labels=FALSE,
+ include.lowest=TRUE)
Error in cut.default(df$colB, quantile(df$colB, :
'breaks' are not unique
Using unique(quantile(df$colB,c(0:10)/10)) doesn't give equally sized subsets. There must be an easy solution to make quantile cuts which also considers the number of rows, in addition to the values in colB. Starting a loop sequence would probably take forever as I have a high number of rows. Any ideas?
Dummy dataset:
set.seed(10)
B <- round(runif(100, 0, 0.4), digits=2) # gives 63 duplicates
df <- data.frame(colB = B)

There might be a neater solution than this, but this will do it:
df$colC <- ceiling((1:nrow(df))*10/nrow(df))[rank(df$colB, ties.method = 'first')]
table(df$colC)
#>
#> 1 2 3 4 5 6 7 8 9 10
#> 10 10 10 10 10 10 10 10 10 10
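If you already use dplyr, its ntile() function does essentially the same rank-based split (ties appear to be broken by order of appearance, like ties.method = 'first'); a minimal sketch, assuming dplyr is installed:
library(dplyr)
df$colC <- ntile(df$colB, 10)  # 10 near-equal-sized groups despite duplicates
table(df$colC)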

With that many duplicates, there must be a range of values in df$colB that is so common that quantile returns the same value for two (or more) of its breakpoints.
A contrived example:
set.seed(42)
vec <- c(rep(10,20), sample(100, size=80))
brks <- quantile(vec, (0:10)/10)
brks
# 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# 2.0 10.0 10.0 14.7 25.6 36.5 47.4 58.9 72.4 88.1 100.0
The cut function requires that there be no repeated values in its breaks= argument. It should be informative to look at just the quantiles of your data to confirm this.
One way around this is to use .bincode, which does not enforce unique breaks.
cut(vec, brks, include.lowest = TRUE)
# Error in cut.default(vec, brks, include.lowest = TRUE) :
# 'breaks' are not unique
.bincode(vec, brks, include.lowest = TRUE)
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 8 4 9 4
# [26] 10 6 4 8 10 6 4 5 1 6 5 5 1 5 9 7 6 10 5 6 4 4 9 1 9
# [51] 8 10 1 7 10 9 8 1 8 1 7 9 7 4 8 7 6 1 6 9 5 8 6 10 6
# [76] 9 1 5 3 10 6 5 9 4 5 7 10 7 8 9 4 5 7 3 8 4 10 7 8 10
(Note that there is no "2" in the return values with this data, because brks[2] is the same as brks[3], so appears to be ignored.)
One side effect of this is that you don't get factor labels by default; if you want them, you can construct them yourself.
labels <- sprintf("(%0.01f-%0.01f]", brks[-length(brks)], brks[-1])
substr(labels[1], 1, 1) <- "["
labels
# [1] "[2.0-10.0]" "(10.0-10.0]" "(10.0-14.7]" "(14.7-25.6]"
# [5] "(25.6-36.5]" "(36.5-47.4]" "(47.4-58.9]" "(58.9-72.4]"
# [9] "(72.4-88.1]" "(100.0-100.0]"
head(labels[ .bincode(vec, brks, include.lowest = TRUE) ])
# [1] "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]" "[2.0-10.0]"
(The %0.01f format is where you may want to customize the label precision.)
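If you want those labels attached as a factor (roughly what cut() would have returned), a small sketch building on the objects above:
codes <- .bincode(vec, brks, include.lowest = TRUE)
binned <- factor(codes, levels = seq_along(labels), labels = labels)
head(binned)  # the first six observations all fall in the lowest bin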

Related

shorten vectors by using certain means

I have a simple question, but I can't find the right solution: I have a list (let's call it "list") consisting of about 2000 ordinary vectors (list[[1]], list[[2]], etc.). Each of those vectors contains 50399 numbers. Now what I want is to shorten each vector so that it consists of 840 numbers in the end.
So I want the first number to be the mean of the first 60 numbers of the original vector (mean(list[[i]][1:60])), the second number shall be the mean of the next 60 numbers etc. That should work 839 times (for 50399 numbers altogether). So the last number should be the mean of the last 59 (not 60) numbers of the original vector.
That should work for each vector (list[[i]]) in "list"!
Do you guys have an idea how that works?
You can work it out like this:
set.seed(1)
(list <- replicate(3, sample(1:10, 10, T), simplify = FALSE))
# [[1]]
# [1] 3 4 6 10 3 9 10 7 7 1
#
# [[2]]
# [1] 3 2 7 4 8 5 8 10 4 8
#
# [[3]]
# [1] 10 3 7 2 3 4 1 4 9 4
n <- 5 # crunch vectors of 10 into 5 means
lapply(list, function(x) sapply(split(x, ceiling(seq_along(x)/(length(x)/n))), mean))
# [[1]]
# 1 2 3 4 5
# 3.5 8.0 6.0 8.5 4.0
#
# [[2]]
# 1 2 3 4 5
# 2.5 5.5 6.5 9.0 6.0
#
# [[3]]
# 1 2 3 4 5
# 6.5 4.5 3.5 2.5 6.5
i.e., in your case:
list <- replicate(2000, sample(1:10, 50399, T), simplify = FALSE)
res <- lapply(list, function(x) sapply(split(x, ceiling(seq_along(x)/(length(x)/840))), mean))
sapply(res, length) # check
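If you want the exact grouping described in the question (839 groups of 60 followed by one group of 59), you can build the group index explicitly instead; a sketch, assuming every vector really has 50399 elements:
grp <- rep(seq_len(840), times = c(rep(60, 839), 59))  # length 50399
res <- lapply(list, function(x) tapply(x, grp, mean))  # 840 means per vector
sapply(res, length)  # check: all 840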

Finding the minimum positive value

I guess I don't know which.min as well as I thought.
I'm trying to find the occurrence in a vector of a minimum value that is positive.
TIME <- c(0.00000, 4.47104, 6.10598, 6.73993, 8.17467, 8.80862, 10.00980, 11.01080, 14.78110, 15.51520, 16.51620, 17.11680)
I want to know for the values z of 1 to 19, the index of the above vector TIME containing the value that is closest to but above z. I tried the following code:
vec <- sapply(seq(1,19,1), function(z) which.min((z-TIME > 0)))
vec
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 1 1
To my mind, the last two values of vec should be '12, 12'. The reason it's doing this seems to be that it thinks '0.0000' is closest to 0.
So, I thought that maybe it was because I exported the data from external software and that 0.0000 wasn't really 0. But,
TIME[1]==0 #TRUE
Then I got further confused. Why do these give the answer of index 1, when really they should be an ERROR?
which.min(0 > 0 ) #1
which.min(-1 > 0 ) #1
I'll be glad to be put right.
EDIT:
I guess in a nutshell, what is the better way to get this result:
#[1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
which shows the index of TIME that gives the smallest possible positive value, when subtracting each element of TIME from the values of 1 to 19.
The natural function to use here (both to limit typing and for efficiency) is actually not which.min + sapply but the cut function, which will determine which range of times each of the values 1:19 falls into:
cut(1:19, breaks=TIME, right=FALSE)
# [1] [0,4.47) [0,4.47) [0,4.47) [0,4.47) [4.47,6.11) [4.47,6.11) [6.74,8.17)
# [8] [6.74,8.17) [8.81,10) [8.81,10) [10,11) [11,14.8) [11,14.8) [11,14.8)
# [15] [14.8,15.5) [15.5,16.5) [16.5,17.1) <NA> <NA>
# 11 Levels: [0,4.47) [4.47,6.11) [6.11,6.74) [6.74,8.17) [8.17,8.81) ... [16.5,17.1)
From this, you can easily determine what you're looking for, which is the index of the smallest element in TIME greater than the cutoff:
(x <- as.numeric(cut(1:19, breaks=TIME, right=FALSE))+1)
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 NA NA
The last two entries appear as NA because there is no element in TIME that exceeds 18 or 19. If you wanted to replace these with the largest element in TIME, you could do so with replace:
replace(x, is.na(x), length(TIME))
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
Here's one way:
x <- t(outer(TIME,1:19,`-`))
max.col(ifelse(x<0,x,Inf),ties="first")
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12
It's computationally wasteful to take all the differences in this way, since both vectors are ordered.
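Since TIME is sorted, findInterval exploits that ordering directly; a sketch of the same result (the +1 steps up to the next element of TIME above z, and pmin caps the index at the last element when nothing in TIME exceeds z):
idx <- pmin(findInterval(1:19, TIME) + 1, length(TIME))
idx
# [1] 2 2 2 2 3 3 5 5 7 7 8 9 9 9 10 11 12 12 12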

Subset columns using logical vector

I have a dataframe that I want to drop those columns with NA's rate > 70% or there is dominant value taking over 99% of rows. How can I do that in R?
I find it easier to select rows with logic vector in subset function, but how can I do the similar for columns? For example, if I write:
isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)
Then how can I continue to use this vector to subset dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols<-c(T,F,F,T,F,F,T)
dd[mycols]
dd[, mycols]
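One caveat: if the logical vector happens to select only a single column, dd[, mycols] silently drops the result to a plain vector; adding drop = FALSE keeps it a data.frame:
dd[, c(TRUE, rep(FALSE, 6)), drop = FALSE]  # still a one-column data.frame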
There's really no need to write a function when we have colMeans (thanks @MrFlick for the advice to change from colSums()/nrow(); it's shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
+                 z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
Maybe this will help too. The 2 in apply() means the function is applied column-wise over the data.frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
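The answers above handle the NA-rate condition, but the question also asks to drop columns where a single value covers more than 99% of rows. A sketch combining both conditions (keep is a hypothetical helper vector, and "99% of rows" is interpreted over all rows including NAs; adjust as needed):
keep <- sapply(dataframe, function(x) {
  na_ok  <- mean(is.na(x)) <= 0.7     # NA rate at most 70%
  tab    <- table(x)                  # counts of non-NA values
  dom_ok <- length(tab) == 0 ||       # guard against all-NA columns
    max(tab) / length(x) <= 0.99      # no single value covers > 99% of rows
  na_ok && dom_ok
})
dataframe[keep]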

Using ddply to aggregate over irregular time periods in longitudinal data

I'm looking for help adapting two existing scripts.
I am working with a longitudinal dataset, aggregating a key variable over time periods. I have variables for both weeks and months. I'm able to aggregate over either weeks or months - but my goal is to aggregate by week for the first six weeks, and then switch to aggregating by month beyond week 6.
Aggregating by weeks and months is easy enough...
df.summary_week <- ddply(df, .(weeks), summarise,
var.mean = mean(var,na.rm=T))
Which yields something like:
weeks var.mean
1 3.99
2 5.44
3 6.7
4 8.100
5 2.765
6 2.765
7 3.765
8 4.765
9 1.765
10 4.765
11 1.765
And then aggregating by month would yield something similar:
df.summary_months <- ddply(df, .(months), summarise,
var.mean = mean(var,na.rm=T))
months var.mean
1 5.00
2 3.001
3 4.7
4 7.100
My initial idea was to simply subset the two datasets with cut points and then bind them together, but I don't know how to do that when the 1-month aggregation starts at 6 weeks rather than 8.
Thoughts, R wizards?
Basic example data.
dat <- data.frame(var=1:24,weeks=1:24,months=rep(1:6,each=4))
The means for the first 6 groups should be just 1:6; after that, the means cover successive 4-week periods, e.g. mean(7:10) = 8.5, etc.
Make a suitable group identifier going from weeks to months:
dat$grp <- findInterval(dat$weeks,seq(7,max(dat$weeks),4)) + 6
dat$grp <- ifelse(dat$grp==6,dat$weeks,dat$grp)
#[1] 1 2 3 4 5 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 11 11
Group the data:
ddply(dat, .(grp), summarise, var.mean = mean(var,na.rm=T))
grp var.mean
1 1 1.0
2 2 2.0
3 3 3.0
4 4 4.0
5 5 5.0
6 6 6.0
7 7 8.5
8 8 12.5
9 9 16.5
10 10 20.5
11 11 23.5
How about just creating a new grouping column?
set.seed(1618)
dat <- data.frame(week = sample(1:26, 200, replace = TRUE),
value = rpois(200, 2))
dat <- within(dat, {
idx <- cut(week, c(0, 6, seq(10, max(week), by = 4)))
})
# head(dat)
# week value idx
# 1 6 1 (0,6]
# 2 16 2 (14,18]
# 3 9 1 (6,10]
# 4 13 2 (10,14]
# 5 8 2 (6,10]
# 6 16 2 (14,18]
library(plyr)
ddply(dat, .(idx), summarise,
mean = mean(value, na.rm = TRUE))
# idx mean
# 1 (0,6] 1.870968
# 2 (6,10] 2.259259
# 3 (10,14] 2.171429
# 4 (14,18] 1.931034
# 5 (18,22] 1.560000
# 6 (22,26] 1.954545
# checking a couple values
mean(dat[dat$week %in% 1:6, 'value'])
# [1] 1.870968
mean(dat[dat$week %in% 7:10, 'value'])
# [1] 2.259259
mean(dat[dat$week %in% 23:26, 'value'])
# [1] 1.954545
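For what it's worth, plyr has largely been superseded; the same summary can be written with dplyr (a sketch, assuming dplyr is installed):
library(dplyr)
dat %>%
  mutate(idx = cut(week, c(0, 6, seq(10, max(week), by = 4)))) %>%
  group_by(idx) %>%
  summarise(var.mean = mean(value, na.rm = TRUE))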

Performing calculations on binned counts in R

I have a dataset stored in a text file in the format of bins of values followed by counts, like this:
var_a 1:5 5:12 7:9 9:14 ...
indicating that var_a took on the value 1 5 times in the dataset, 5 12 times, etc. Each variable is on its own line in that format.
I'd like to be able to perform calculations on this dataset in R, like quantiles, variance, and so on. Is there an easy way to load the data from the file and calculate these statistics? Ultimately I'd like to make a box-and-whisker plot for each variable.
Cheers!
You could use readLines to read in the data file
.x <- readLines(datafile)
I will create some dummy data, as I don't have the file. This should be the equivalent of the output of readLines
## dummy
.x <- c("var_a 1:5 5:12 7:9 9:14", 'var_b 1:5 2:12 3:9 4:14')
I split on spaces to get the pieces of each line:
#split by space
space_split <- strsplit(.x, ' ')
# get the variable names (first in each list)
variable_names <- lapply(space_split,'[[',1)
# get the variable contents (everything but the first element in each list)
variable_contents <- lapply(space_split,'[',-1)
# a function to do the appropriate replicates
do_rep <- function(x){rep.int(x[1],x[2])}
# recreate the variables
variables <- lapply(variable_contents, function(x){
.list <- strsplit(x, ':')
unlist(lapply(lapply(.list, as.numeric), do_rep))
})
names(variables) <- variable_names
You could get the variance for each variable using
lapply(variables, var)
## $var_a
## [1] 6.848718
##
## $var_b
## [1] 1.138462
or get boxplots
boxplot(variables)
Not knowing the actual form that your data is in, I would probably use something like readLines to get each line in as a vector, then do something like the following:
# Some sample data
temp = c("var_a 1:5 5:12 7:9 9:14",
"var_b 1:7 4:9 3:11 2:10",
"var_c 2:5 5:14 6:6 3:14")
# Extract the names
NAMES = gsub("[0-9: ]", "", temp)
# Extract the data
temp_1 = strsplit(temp, " |:")
temp_1 = lapply(temp_1, function(x) as.numeric(x[-1]))
# "Expand" the data
temp_1 = lapply(1:length(temp_1),
function(x) rep(temp_1[[x]][seq(1, length(temp_1[[x]]), by=2)],
temp_1[[x]][seq(2, length(temp_1[[x]]), by=2)]))
names(temp_1) = NAMES
temp_1
# $var_a
# [1] 1 1 1 1 1 5 5 5 5 5 5 5 5 5 5 5 5 7 7 7 7 7 7 7 7 7 9 9 9 9 9 9 9 9 9 9 9 9 9 9
#
# $var_b
# [1] 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
#
# $var_c
# [1] 2 2 2 2 2 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 3 3 3 3 3 3 3 3 3 3 3 3 3 3
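Either way, once the data are expanded back into plain numeric vectors, the statistics and plots from the question follow directly:
sapply(temp_1, var)       # variance per variable
sapply(temp_1, quantile)  # quartiles per variable
boxplot(temp_1)           # one box-and-whisker per variable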
