getting from histogram counts to cdf - r

I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
result <- numeric(nrow(x))
for (i in seq_along(result)) {
result[i] = sum(x[x$value <= x$value[i], ]$counts)
}
x$cumulative <- result
x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.

You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
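The one-liner above relies on the rows already being sorted by value within each type. A slightly more defensive sketch sorts first; the small data frame here is made up for illustration, with column names following the question.

```r
# Toy data, deliberately out of order (values are invented for illustration)
df <- data.frame(
  value  = c(2, 0, 1, 1, 0, 2),
  counts = c(1, 6, 3, 5, 5, 10),
  type   = c("a", "a", "a", "b", "b", "b")
)

# Sort by type, then by value, so cumsum() accumulates in the right order
df <- df[order(df$type, df$value), ]

# Within each type: running total of counts divided by the group total
df$cdf <- ave(df$counts, df$type, FUN = function(x) cumsum(x) / sum(x))
df
```

The last row of each type should have a cdf of 1, which is a quick sanity check on real data.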

If your data is in a data.frame DF (sorted by value), the following should work:
do.call(rbind, lapply(split(DF, DF$type), function(d) transform(d, cdf = cumsum(counts) / sum(counts))))

The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.
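If installing a package is not an option, base R's stepfun can build a similar callable CDF directly from values and counts. This is only a sketch with invented counts, not the HistogramTools approach, and cdf_fun is a hypothetical name.

```r
# Invented values and counts for a single type
values <- c(0, 1, 2)
counts <- c(6, 3, 1)

# stepfun() needs one more height than there are jump points:
# 0 to the left of the first value, then the cumulative proportions
cdf_fun <- stepfun(values, c(0, cumsum(counts) / sum(counts)))

# Evaluate the CDF at arbitrary points
cdf_fun(c(-1, 0, 0.5, 1, 5))
```

The resulting function is right-continuous, like the one returned by ecdf(), and can be passed to plot().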

Related

Why is the for loop returning NA vectors in some positions (in R)?

Following a youtube tutorial, I have created a vector x [-3,6,2,5,9].
Then I create an empty variable of length 5 with the function 'numeric(5)'
I want to store the squares of my vector x in 'Storage2' with a for loop.
When I do the for loop and update my variable, it returns a very strange thing:
[1] 9 4 0 9 25 36 NA NA 81
I can see all numbers in x have been squared, but the order is so random, and there's more than 5.
Also, why are there NAs?? If it's because the last number of x is 9 (and so this number defines the length??), and there's no 7 and 8 position, I would understand, but then I'm also missing positions 1, 3 and 4, so there should be more NAs...
I'm just starting with R, so please keep it simple, and correct me if I'm wrong during my thought process! Thank you!!
x <- c(-3,6,2,5,9)
Storage2 <- numeric(5)
for(i in x){
Storage2[i] <- i^2
}
Storage2
# [1] 9 4 0 9 25 36 NA NA 81
You're looping over the elements of x not over the positions as probably intended. You need to change your loop like so:
for(i in 1:length(x)) {
Storage2[i] <- x[i]^2
}
Storage2
# [1] 9 36 4 25 81
(Note: 1:length(x) can also be written as seq_along(x), as pointed out by @NelsonGon in the comments; it might be faster, and it also behaves correctly when x is empty.)
However, R is a vectorized language so you can simply do that:
Storage2 <- x^2
Storage2
# [1] 9 36 4 25 81
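To see where the strange output in the question comes from, the original loop can be unrolled by hand. The sketch below repeats each assignment the loop makes, using s as a stand-in name for Storage2.

```r
x <- c(-3, 6, 2, 5, 9)
s <- numeric(5)      # 0 0 0 0 0

# The original loop uses each element of x as an index:
s[-3] <- (-3)^2      # negative index means "all but position 3": 9 9 0 9 9
s[6]  <- 6^2         # position 6 does not exist, so the vector grows
s[2]  <- 2^2         # overwrites position 2
s[5]  <- 5^2         # overwrites position 5
s[9]  <- 9^2         # grows again; positions 7 and 8 are padded with NA
s
```

So the NAs are padding for positions that were never assigned, and positions 1 and 4 hold 9 only because of the negative index.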

Calculate Total Sum of Square Inconsistency

I am attempting to write my own function for total sum of square, within sum of square, and between sum of square in R Studio for my own implementation of k-means.
I've successfully written the function for within sum of square, but I'm having difficulty with total sum of square (and thus bss). The result I get is significantly larger than what R's own kmeans function computes. I'm confused because I am following exactly what formulas provide. Here is my data:
A =
36 3
73 3
30 3
49 3
47 11
47 11
0 7
46 5
16 3
52 4
0 8
21 3
0 4
57 6
31 5
0 6
40 3
31 5
38 4
0 5
59 4
61 6
48 7
29 2
0 4
19 4
19 3
48 9
48 4
21 5
where each column is a feature. This is the function I've created thus far for tss:
tot_sumoSq <- function(data){
avg = mean( as.matrix(data) )
r = matrix(avg, nrow(data), ncol(data))
tot_sumoSq = sum( (data - r)^2 )
}
I receive the result 24342.4, but R gives 13244.8. Am I completely missing something?
The latter value is calculated using the column means. If you use this for calculating the means, you'll get the same answer.
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow=T)
[1] 13244.8
There may be something wrong in your program: you subtract a matrix from a data frame. Use the following instead:
tot_sumoSq <- function(data){
data = as.matrix(data)
x = sum((data - mean(data))^2)
return(x)
}
From my side it gives the correct answer.
I found a solution to my issue by combining solutions provided by the first two commentators. I see what my previous mistake was and would like to clear any confusion for future scientists.
tot_sumoSq <- function(data){
avg = colMeans(data)
r = matrix(avg, nrow(data), ncol(data), byrow = T)
data = as.matrix(data)
return( sum( (data - r)^2 ) )
}
Each column holds the entire sample for one feature, so the mean of a column is the mean of that feature over the whole sample. My conceptual mistake earlier was to pool both features and compute a single overall mean.
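The same column-wise centering can also be written with scale(), which avoids building the matrix of means by hand. This is a sketch on the first six rows of the data above; tss and tss_check are names chosen here for illustration.

```r
# First six rows of the data from the question, one feature per column
A <- matrix(c(36, 73, 30, 49, 47, 47,
               3,  3,  3,  3, 11, 11), ncol = 2)

# Subtract each column's own mean, then sum the squared deviations
centered <- scale(A, center = TRUE, scale = FALSE)
tss <- sum(centered^2)

# Cross-check: (n - 1) times each column's variance, summed over columns
tss_check <- sum((nrow(A) - 1) * apply(A, 2, var))
c(tss, tss_check)
```

On the full data set, this quantity is what should be compared with the totss component returned by kmeans().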

r join two lists and sum their values

I have two lists: x, y
> x
carlo monte simulation model quantum
31 31 9 6 6
> y
model system temperature quantum simulation problem
15 15 15 13 13 12
What function should I use to obtain:
simulation model quantum
22 21 19
I tried to merge them like in example but it gives me an error:
merge(x,y,by=intersect(names(x),names(y))) produces:
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
That function has no argument specifying what to do with the values. What would be the best function to use?
intersect(names(x),names(y)) gives the names of the resulting list, but how do I sum the values together?
You can use Map in base R to return a list.
Map("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
$simulation
[1] 22
$model
[1] 21
$quantum
[1] 19
or mapply to return a named vector which may be more useful.
mapply("+", x[intersect(names(x),names(y))], y[intersect(names(x),names(y))])
simulation model quantum
22 21 19
Using [intersect(names(x), names(y))] not only subsets x and y to the elements with intersecting names, but also puts those elements in the same order for the operation.
data
x <- list(carlo=1, monte=2, simulation=9, model=6, quantum=6)
y <-list(model=15, system=8, temperature=10, quantum=13, simulation=13, problem="no")
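If x and y are actually named numeric vectors rather than lists (an assumption; the printed output in the question looks like named vectors), plain vectorized addition on the shared names is enough. The values below are copied from the question.

```r
# Named numeric vectors with the values shown in the question
x <- c(carlo = 31, monte = 31, simulation = 9, model = 6, quantum = 6)
y <- c(model = 15, system = 15, temperature = 15,
       quantum = 13, simulation = 13, problem = 12)

# Names present in both, in the order they appear in x
common <- intersect(names(x), names(y))

# Indexing both vectors by the same names aligns them before adding
total <- x[common] + y[common]
total
```

For lists, the Map or mapply versions above are still needed, because + does not work on list elements directly.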
Simple name matching does the trick:
# subset those from x which have names in y also
x1 = x[names(x)[names(x) %in% names(y)]]
# x1
# simulation model quantum
# 9 6 6
# similarly do it for y. note the order of names might be different from that in x
y1 = y[names(y)%in%names(x1)]
# y1
# model quantum simulation
# 15 13 13
# now order the names in both and then add.
x1[order(names(x1))]+y1[order(names(y1))]
# model quantum simulation
# 21 19 22
Base function merge() should do this with no issue so long as your fields make sense, but you need to include merge(..., all=TRUE), as in:
y <- data.frame(rbind(c(15,15,15,13,13,12)))
names(y) <- c("model","system","temperature","quantum","simulation","problem")
x <- data.frame(rbind(c(31,31,9,6,6)))
names(x) <- c("carlo","monte","simulation","model","quantum")
merge(x, y, by = c("simulation","model","quantum"), all = TRUE)
results in:
simulation model quantum carlo monte system temperature problem
1 9 6 6 31 31 NA NA NA
2 13 15 13 NA NA 15 15 12
Here you actually have data frames of length 1, not lists.

Get 2D table (6x6) for dataframe containing two continuous variables, by binning

I am trying to partition observations in a data frame into 36 groups, based on two continuous variables. More specifically, I am trying to cut each of the two variables into six groups, and then group the observations in one of the 36 different possible groups.
My attempt is below, which works. But is there a faster way to do this that avoids the double for loops?
Also, this isn't necessary, but how could I visualize the total number of observations in each group in a 6 by 6 grid? I know table() would produce a list of the 36 possible groups and their totals, but not in grid format.
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
data <- data.frame(x1,x2)
labs1 <- levels(cut(x1, 6))
ints1 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs1)),
upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs1)))
labs2 <- levels(cut(x2, 6))
ints2 <- cbind(lower = as.numeric(sub("\\((.+),.*", "\\1", labs2)),
upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", labs2)))
tmp <- expand.grid(labs1, labs2)
groups <- cbind(lower1 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,1])),
upper1 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,1])),
lower2 = as.numeric(sub("\\((.+),.*", "\\1", tmp[,2])),
upper2 = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", tmp[,2])))
for (i in 1:1000){
for (j in 1:36){
if (x1[i] >= groups[j,1] & x1[i] <= groups[j,2] &
x2[i] >= groups[j,3] & x2[i] <= groups[j,4]){
data$group[i] <- j
}
}
}
You can use a mix of apply(), which iterates through the rows of your data.frame, and which(), which searches your groups array:
data$group <- apply(data, 1, FUN=function(dataRow)
which(
dataRow[1] >= groups[,1] &
dataRow[1] <= groups[,2] &
dataRow[2] >= groups[,3] &
dataRow[2] <= groups[,4]))
You're overthinking things. Getting your 6x6 tables is a one-liner with table(). (Directly use the helpful factor variable created by cut(..., 6), don't just throw away the factor then manually reapply its levels and bin your variables) :
with(data, table(cut(x1, 6), cut(x2, 6)))
(-3.05,-1.97] (-1.97,-0.902] (-0.902,0.171] (0.171,1.24] (1.24,2.32] (2.32,3.4]
(-2.82,-1.8] 2 10 11 7 3 0
(-1.8,-0.793] 1 26 67 49 19 3
(-0.793,0.216] 12 57 140 146 31 3
(0.216,1.22] 11 49 109 95 36 6
(1.22,2.23] 0 10 31 34 15 0
(2.23,3.25] 0 3 5 6 2 1
# and to get the wide lines, you may need...
options('width'=199)
# or if you want more compact labels to keep it all narrow, use `cut(..., dig.lab)`
with(data, table(cut(x1, 6, dig.lab=2), cut(x2, 6, dig.lab=2)))
(-3.1,-2] (-2,-0.9] (-0.9,0.17] (0.17,1.2] (1.2,2.3] (2.3,3.4]
(-2.8,-1.8] 2 10 11 7 3 0
(-1.8,-0.79] 1 26 67 49 19 3
(-0.79,0.22] 12 57 140 146 31 3
(0.22,1.2] 11 49 109 95 36 6
(1.2,2.2] 0 10 31 34 15 0
(2.2,3.2] 0 3 5 6 2 1
Admittedly, the docs for table() and cut() do not say so directly, and could use a 2D example like this. => Doc/Enhance-bug
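If a single group number from 1 to 36 per observation is still wanted, it can be derived from the integer codes of the two cut() factors without any loop. This is a sketch; b1, b2 and group are names chosen here, and the numbering is assumed to follow the row order of expand.grid(labs1, labs2) in the question, where the first variable varies fastest.

```r
set.seed(123)
x1 <- rnorm(1000)
x2 <- rnorm(1000)

# Bin each variable into six intervals; the result is a factor
b1 <- cut(x1, 6)
b2 <- cut(x2, 6)

# Combine the two bin indices (1..6 each) into one id in 1..36
group <- (as.integer(b2) - 1L) * 6L + as.integer(b1)

# The 6 x 6 grid of counts, for comparison
counts <- table(b1, b2)
```

Because cut() rounds the interval labels, observations very close to a bin edge may land in a different group than in the loop version, which compares against the rounded bounds.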

Avoid using a loop to get sum of rows in R, where I want to start and stop the sum on different columns for each row

I am relatively new to R from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, stop value, and 100+ columns of numerical values. The goal is to get the sum of each row from the column that corresponds to the start value to the column that corresponds to the stop value. This is direct enough to do in a loop, that looks like this (data.frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
df$out[i] <- rowSums(df[i,df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
If you are dealing with values of all the same types, you typically want to do things in matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. apply avoids this by generating the whole answer (the $out column) first, which you can then cbind back to your data frame, so the data frame is modified just once.
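Another loop-free option is a logical mask over the value matrix combined with rowSums(). This is a sketch on a small made-up matrix; here start and end are assumed to index the columns of the value matrix itself, not the columns of the full data frame.

```r
# Small example: 5 rows, 10 value columns
m     <- matrix(1:50, nrow = 5, ncol = 10)
start <- 1:5   # first column to include, per row
end   <- 2:6   # last column to include, per row

# col(m) holds each cell's column index; start and end are recycled
# down the rows, so cell [i, j] is compared with start[i] and end[i]
mask <- col(m) >= start & col(m) <= end

# Cells outside the range are multiplied by FALSE (0) and drop out
out <- rowSums(m * mask)
out
```

This builds one extra matrix the size of the data, so it trades memory for speed; if the data contain NA values outside the selected range, they would need to be handled before multiplying.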

Resources