cut that returns guaranteed number of bins - r

I'd like to do a cut with a guaranteed number of levels returned, so I'd like to take any vector of cumulative percentages and cut it into deciles. I've tried using cut and it works well in most situations, but in cases where a single decile covers a large share of the values it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, defined in such a way that I am assured of getting 10. cut et al. give 9 bins with this example; I want 10. So I'm looking for an algorithm that would recognize that the gap between [58.03,72.46] and 73.4 is large. Instead of assigning these cases to bins 6, 8, 8 it would assign them to bins 6, 7, 8.
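One way to guarantee exactly ten non-empty bins is to cut the ranks of the values rather than the values themselves, trading equal-width intervals for (approximately) equal counts per bin. A minimal sketch using the x defined above (dplyr::ntile() does the same thing in one call):
# bin on ranks so every decile is occupied; intervals are no longer equally wide
bins <- ceiling(10 * rank(x, ties.method = "first") / length(x))
table(bins)   # ten groups of roughly equal size
# equivalently: library(dplyr); bins <- ntile(x, 10)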

xx <- cut(x, breaks=quantile(x, (1:10)/10, na.rm=TRUE) )
table(xx)
#------------------------
xx
(0.256,0.58] (0.58,0.718] (0.718,6.76] (6.76,20.5]
4 4 4 4
(20.5,35.7] (35.7,49.7] (49.7,75.1] (75.1,85.5]
3 4 4 4
(85.5,100]
4

numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins+1))
Output:
...
...
...
10 Levels: (0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced. Note that by changing the numBins variable, you can obtain any number of approximately equally spaced bins.

Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].

What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?


Creating a percentage column based on the sums of a column grouped by a different column? [duplicate]

I am wrangling with a huge dataset and my R skills are very new. I am really trying to understand the terminology and the processes, but I find it a struggle as the R documentation often makes no sense to me. So apologies if this is a dumb question.
I have data for plant species at different sites with different percentages of ground cover. I want to create a new column PROP-COVER which gives each species' cover as a percentage of the total cover of all species at a particular site. This is slightly different from calculating percentage cover by site area, as it disregards bare ground with no vegetation. This is an easy calculation with just one site, but I have over a hundred sites and need to perform the calculation on species ground cover grouped by site. The desired output column is PROP-COVER.
SPECIES SITE COVER PROP-COVER(%)
1 1 10 7.7
2 1 20 15.4
3 1 10 7.7
4 1 20 15.4
5 1 30 23.1
6 1 40 30.8
2 2 20 22.2
3 2 50
5 2 10
6 2 10
1 3 5
2 3 25
3 3 40
5 3 10
I have looked at for loops and repeat but I can't see where the arguments should go. Every attempt I make returns NULL.
Below is an example of something I tried, which I am sure is totally wide of the mark, but I just can't work out where to begin or whether it is even possible.
a<- for (i in data1$COVER) {
sum(data1$COVER[data1$SITE=="i"],na.rm = TRUE)
}
a
NULL
I have a major brain-blockage when it comes to how 'for' loops work; no amount of reading about them seems to help, but perhaps what I am trying to do isn't possible? :(
Many thanks for looking.
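As an aside on the loop attempt above: a for loop itself returns NULL (the sums are computed and then discarded), and "i" in quotes is the literal string rather than the loop variable. A loop-based sketch that stores its results, assuming the data frame is called data1 and naming the new column PROP_COVER (hyphens are awkward in R column names), could look like this:
# total COVER per SITE, stored as a named vector
site_totals <- numeric(0)
for (s in unique(data1$SITE)) {
  site_totals[as.character(s)] <- sum(data1$COVER[data1$SITE == s], na.rm = TRUE)
}
# each row's COVER as a percentage of its site's total
data1$PROP_COVER <- 100 * data1$COVER / site_totals[as.character(data1$SITE)]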
In Base R:
merge(df, prop.table(xtabs(COVER~SPECIES+SITE, df), 2)*100)
SPECIES SITE COVER Freq
1 1 1 10 7.692308
2 1 3 5 6.250000
3 2 1 20 15.384615
4 2 2 20 22.222222
5 2 3 25 31.250000
6 3 1 10 7.692308
7 3 2 50 55.555556
8 3 3 40 50.000000
9 4 1 20 15.384615
10 5 1 30 23.076923
11 5 2 10 11.111111
12 5 3 10 12.500000
13 6 1 40 30.769231
14 6 2 10 11.111111
In tidyverse you can do:
library(dplyr)
df %>%
  group_by(SITE) %>%
  mutate(n = proportions(COVER) * 100)
# A tibble: 14 x 4
# Groups: SITE [3]
SPECIES SITE COVER n
<int> <int> <int> <dbl>
1 1 1 10 7.69
2 2 1 20 15.4
3 3 1 10 7.69
4 4 1 20 15.4
5 5 1 30 23.1
6 6 1 40 30.8
7 2 2 20 22.2
8 3 2 50 55.6
9 5 2 10 11.1
10 6 2 10 11.1
11 1 3 5 6.25
12 2 3 25 31.2
13 3 3 40 50
14 5 3 10 12.5
The code could also be written as n = COVER/sum(COVER) or even n = prop.table(COVER)
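In base R alone, ave() adds the same per-site proportion as a new column; a short sketch, assuming the column names from the question:
# ave() computes sum(COVER) within each SITE and recycles it to row length
df$PROP_COVER <- 100 * df$COVER / ave(df$COVER, df$SITE, FUN = sum)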

Simulate unbalanced clustered data

I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would like to first create clusters with 10% more observations than specified (i.e., 33 rather than 30) and then randomly exclude an appropriate number of observations (i.e., 60) to arrive at the specified average number of observations per cluster (i.e., 30). The probability of excluding an observation within each cluster is not uniform (i.e., some clusters have no cases removed and others have more excluded). Therefore in the end I still have 600 observations in total. Does anyone know how to realize that in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the condition specified above; I just used it to convey my idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data looks like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on some randomly selected clusters (e.g., using df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
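A sketch of that last step, a cluster bootstrap, assuming a simulated data set df with columns cluster, x and y; the y ~ x model and the 95% level are assumptions, not part of the question:
set.seed(1)
B <- 1000
cluster <- split(seq_len(nrow(df)), df$cluster)   # row indices per cluster
k <- length(cluster)
boot_coefs <- t(replicate(B, {
  # draw k clusters with replacement and stack their rows
  rows <- unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = FALSE)
  coef(lm(y ~ x, data = df[rows, ]))
}))
# percentile confidence intervals from the 1000 fits
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))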
As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.
#First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each=6)
df <- data.frame(id=z,cluster=w,x=x,y=y)
#Then just slice_sample to arrive at the desired sample size (requires dplyr)
library(dplyr)
df %>% slice_sample(n = 20) %>%
  arrange(cluster)
#Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence within tolerance tol around the mean number of observations per cluster mnobs. From that we repeatedly draw a random sample of size 1 to obtain a list of clusters CL. If the sum of the cluster lengths equals ncl*mnobs we break the loop, add random data to the clusters and rbind the result.
FUN <- function(ncl=20, mnobs=30, tol=.1) {
  ## sampling space: admissible cluster sizes within +/- tol around mnobs
  S <- do.call(seq.int, as.list(mnobs*(1 + tol*c(-1, 1))))
  repeat({
    ## draw a size for every cluster; retry until the total is exactly ncl*mnobs
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace=T)))
    if (sum(lengths(CL)) == ncl*mnobs) break
  })
  ## fill each cluster with random x and y data
  L <- lapply(seq.int(CL), function(i) {
    id <- seq.int(CL[[i]])
    cbind(id, cluster=i,
          matrix(rnorm(max(id)*2), , 2, dimnames=list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN() ## using the argument defaults defined above
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180

Replacing conditional values with previous values in r

I have some data on organism survival as a function of time. The data is constructed from the averages of many replicates at each time point, which can yield a forward time step with an increase in survival. Occasionally, this results in a survivorship greater than 1, which is impossible. How can I conditionally change values greater than 1 to the value preceding them in the same column?
Here's what the data looks like:
>df
Generation Treatment time lx
1 0 1 0 1
2 0 1 2 1
3 0 1 4 0.970
4 0 1 6 0.952
5 0 1 8 0.924
6 0 1 10 0.913
7 0 1 12 0.895
8 0 1 14 0.729
9 0 2 0 1
10 0 2 2 1
I've tried mutating the column of interest as such, which still yields values above 1:
df1 <- df %>%
group_by(Generation, Treatment) %>%
mutate(lx_diag = as.numeric(lx/lag(lx, default = first(lx)))) %>% #calculate running survival
mutate(lx_diag = if_else(lx_diag > 1.000000, lag(lx_diag), lx_diag)) #substitute values >1 with previous value
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 1.01
10 12 1 19 0.76 1.04
I expect the results to look something like:
>df1
Generation Treatment time lx lx_diag
1 12 1 0 1 1
2 12 1 2 1 1
3 12 1 4 1 1
4 12 1 6 0.996 0.996
5 12 1 8 0.988 0.992
6 12 1 10 0.956 0.968
7 12 1 12 0.884 0.925
8 12 1 14 0.72 0.814
9 12 1 15 0.729 0.814
10 12 1 19 0.76 0.814
I know you can conditionally change the values to a specific value (i.e., ifelse with no else), but I haven't found any solutions that conditionally change a value in a column to the value in the previous row. Any help is appreciated.
EDIT: I realized that mutate and if_else are quite efficient when it comes to converting values. Instead of replacing values in sequence from first to last, as I would have expected, the commands replace all values at the same time, so in a series of values > 1 some are left behind. Thus, if you just run the command:
SurvTot1$lx_diag <- if_else(SurvTot1$lx_diag > 1, lag(SurvTot1$lx_diag), SurvTot1$lx_diag)
over and over again, you can get rid of the values > 1. Not the most elegant solution, but it works.
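An alternative that avoids re-running the replacement is to mark the impossible values as NA and carry the last valid value forward. A sketch assuming the tidyr package is available (zoo::na.locf would do the same job):
library(dplyr)
library(tidyr)
df1 <- df %>%
  group_by(Generation, Treatment) %>%
  mutate(lx_diag = lx / lag(lx, default = first(lx)),            # running survival
         lx_diag = if_else(lx_diag > 1, NA_real_, lx_diag)) %>%  # flag impossible values
  fill(lx_diag, .direction = "down") %>%                         # carry last valid value forward
  ungroup()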
This looks like a very ugly solution to me, but I couldn't think of anything else:
library(dplyr)
df = data.frame(
  "Generation" = rep(12, 10),
  "Treatent" = rep(1, 10),
  "Time" = c(seq(0, 14, by = 2), 15, 19),
  "lx_diag" = c(1, 1, 1, 0.996, 0.992, 0.968, 0.925, 0.814, 1.04, 1.04)
)
update_lag = function(x){
  k <<- k + 1   # bump the global lag counter as a side effect
  x
}
k = 1
df %>%
  mutate(
    lx_diag2 = ifelse(lx_diag <= 1, update_lag(lx_diag), lag(lx_diag, n = k))
  )
Using the data from #Fino, here is my vectorized solution using base R
vals.to.replace <- which(df$lx_diag > 1)
vals.to.substitute <- sapply(vals.to.replace, function(x) tail( df$lx_diag[which(df$lx_diag[1:x] <= 1)], 1) )
df$lx_diag[vals.to.replace] = vals.to.substitute
df
Generation Treatent Time lx_diag
1 12 1 0 1.000
2 12 1 2 1.000
3 12 1 4 1.000
4 12 1 6 0.996
5 12 1 8 0.992
6 12 1 10 0.968
7 12 1 12 0.925
8 12 1 14 0.814
9 12 1 15 0.814
10 12 1 19 0.814

Compute values relative to specific factor in R data frame

I have a data frame in R of the following form:
BC solvopt istrng tSolv EPB
1 10 1 0 0.10 -78.1450
2 10 1 1 0.15 -78.7174
3 10 1 10 0.14 -78.7175
4 10 1 100 0.12 -78.7184
5 10 1 1000 0.09 -78.7232
6 10 1 2 0.15 -78.7175
7 10 1 20 0.14 -78.7176
8 10 1 200 0.12 -78.7192
30 10 2 0 0.10 -78.1450
31 10 2 1 0.11 -78.7174
32 10 2 10 0.11 -78.7175
33 10 2 100 0.10 -78.7184
34 10 2 1000 0.13 -78.7232
35 10 2 2 0.11 -78.7174
36 10 2 20 0.10 -78.7176
37 10 2 200 0.10 -78.7192
59 10 3 0 0.16 -78.1450
60 10 3 1 0.23 -78.7174
61 10 3 10 0.21 -78.7175
62 10 3 100 0.19 -78.7184
63 10 3 1000 0.17 -78.7232
64 10 3 2 0.22 -78.7175
65 10 3 20 0.21 -78.7176
66 10 3 200 0.18 -78.7192
88 10 4 0 0.44 -78.1450
89 10 4 1 14.48 -78.7162
90 10 4 10 12.27 -78.7175
91 10 4 100 1.23 -78.7184
92 10 4 1000 0.44 -78.7232
93 10 4 2 14.52 -78.7172
94 10 4 20 6.16 -78.7176
95 10 4 200 0.62 -78.7192
I want to add a column to this frame which shows the relative error in EPB, for each value of BC and istrng, relative to solvopt=3.
For example, to compute the relative difference in EPB for a given row, I would subtract from its EPB the EPB of the row with the same BC and istrng but with solvopt=3.
Is there an easy way to do this short of splitting this into multiple data frames (one per solvopt) and then merging them back together?
The end goal is to generate plots of relative error vs istrng for each value of BC using qplot.
If you merge the subset where solvopt==3 against the main data on both BC and istrng, and take the difference, you should get the result you want, e.g.:
newdat <- merge(dat,dat[dat$solvopt==3,c("BC","istrng","EPB")], by=c("BC","istrng"))
newdat$diff <- with(newdat, EPB.x - EPB.y)
...or do it all in one fell swoop using match and interaction:
dat$diff <- dat$EPB - dat[dat$solvopt==3, "EPB"][match(
  with(dat, interaction(BC, istrng)),
  with(dat[dat$solvopt==3, ], interaction(BC, istrng))
)]
A similar option with data.table
library(data.table)
res <- setkey(setDT(dat), BC, istrng)[dat[solvopt==3, c(1,3,5),
  with=FALSE]][, diff := EPB - i.EPB][]
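A grouped mutate is another option; a sketch assuming the dplyr and ggplot2 packages and exactly one solvopt==3 row per BC/istrng combination, which also feeds the qplot mentioned in the question:
library(dplyr)
library(ggplot2)
dat2 <- dat %>%
  group_by(BC, istrng) %>%
  mutate(diff = EPB - EPB[solvopt == 3]) %>%   # difference relative to solvopt == 3
  ungroup()
# relative error vs istrng, one panel per BC
qplot(istrng, diff, data = dat2, colour = factor(solvopt)) + facet_wrap(~ BC)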

R ignores 0 values

I am trying to create some descriptive statistics and histograms out of ordered variables (range 0 to 10). I used the following commands:
class(data$var1)
describe(as.numeric(data$var1))
But R starts from 1 and counts the "Refusal" values as a further numeric value.
How can I make R start from 0 and ignore the "Refusal" values?
Thank you.
Edit: I was able to let R ignore "refusal" value using the following command:
is.na (data$var1[data$var1=="Refusal"]) <- TRUE
But when I search for possible solution about the 0 values I am only finding suggestion on how to ignore/remove 0 values...
Edit2: This is a sample of my data,
[1] 5 8 8 8 Refusal 10 8 Refusal 7
[10] 7 8 7 8 8 8 8 8 8
[19] 8 0 9 Refusal 6 10 7 7 9
As you can see, the range is from 0 to 10, but using the R library "psych" and the command "describe" the output range is always 1 to 11, and this invalidates the whole statistics.
> class(data$var1)
[1] "factor"
> describe(as.numeric(data$var1), na.rm=TRUE)
vars n mean sd median trimmed mad min max range skew kurtosis se
1 1 1115 8.38 1.94 9 8.57 1.48 1 11 10 -1.06 1.42 0.06
Sorry for the ongoing editing, but I am new to stackoverflow.com.
Have a look at how factors work with ?factor, or look at the example question here. In essence, each level is given a number starting at 1, hence ending at 11 if you have 11 unique values. Conversion of a factor to numeric returns these codes rather than the underlying numbers they relate to. To recover the underlying values, first convert to character, then to numeric. See the difference between these code snippets:
#create data
set.seed(0)
a <- factor(sample(c(0:10,"refusal"),50,T)) #Some dummy data
class(a)
# [1] "factor"
snippet 1 - how you're doing it
describe(as.numeric(a),na.rm=TRUE)
#as.numeric(a)
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#50 0 11 6.28 2.00 2.00 4.00 6.00 8.75 10.00 11.00
#
#1 2 3 4 5 6 7 8 9 10 11
#Frequency 2 5 5 4 2 8 6 5 3 6 4
#% 4 10 10 8 4 16 12 10 6 12 8
snippet 2 - correct way
describe(as.numeric(as.character(a)),na.rm=TRUE)
#as.numeric(as.character(a))
#n missing unique Mean .05 .10 .25 .50 .75 .90 .95
#46 4 10 5.304 1.0 1.0 3.0 5.0 8.0 9.5 10.0
#
#0 1 2 3 4 5 7 8 9 10
#Frequency 2 5 4 2 8 6 5 3 6 5
#% 4 11 9 4 17 13 11 7 13 11
#Warning message:
# In describe(as.numeric(as.character(a)), na.rm = TRUE) :
# NAs introduced by coercion
Note the difference in range (even if my describe function isn't the same as yours). The warning refers to the "Refusal" values, which are converted to NAs as they don't represent a number.
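Applied to the data in the question, the conversion might look like this (a sketch, assuming the describe() in use comes from the psych package):
library(psych)
# factor -> character -> numeric; "Refusal" cannot be parsed and becomes NA
var1_num <- as.numeric(as.character(data$var1))
describe(var1_num)   # range now runs 0 to 10; the NAs (refusals) are dropped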
