I know that the title doesn't describe what I want to do, but believe me, it's hard to explain in just one sentence.
I have a data frame in the form shown below:
ID V_tour
<dbl> <dbl>
1 1206818 0
2 1238530 0
3 1238530 0
4 1241498 0
5 1228139 1
6 1228139 1
7 1334957 0
8 1328103 1
9 1206818 1
10 1334957 0
11 1239023 0
12 1241498 0
13 1152361 1
What I want to do is this: for each specific ID (IDs may be repeated more than once), I want to calculate the ratio of the number of 1's in column "V_tour" among the other IDs to the total number of rows for those other IDs. To elaborate, take ID 1206818. For this ID, I look at all other rows: the number 1 appears 4 times in V_tour among them, and there are 11 such rows, so the ratio is 4/11. Please note that since ID 1206818 appears twice, 4/11 has to be shown on both of its rows.
So the expected result would be:
ID V_tour ratio
<dbl> <dbl> <chr>
1 1206818 0 4/11
2 1238530 0 5/11
3 1238530 0 5/11
4 1241498 0 5/11
5 1228139 1 3/11
6 1228139 1 3/11
7 1334957 0 5/11
8 1328103 1 4/12
9 1206818 1 4/11
10 1334957 0 5/11
11 1239023 0 5/12
12 1241498 0 5/11
13 1152361 1 4/12
We can count the total number of 1's in the data, subtract the number of 1's for that ID, and divide the result by (total number of rows - number of rows in the group).
library(dplyr)

total_one <- sum(df$V_tour)   # total number of 1's in the data
n <- nrow(df)                 # total number of rows

df %>%
  group_by(ID) %>%
  mutate(formula = paste(total_one - sum(V_tour), n - n(), sep = '/'),
         ratio = (total_one - sum(V_tour)) / (n - n()))
# ID V_tour formula ratio
# <int> <int> <chr> <dbl>
# 1 1206818 0 4/11 0.364
# 2 1238530 0 5/11 0.455
# 3 1238530 0 5/11 0.455
# 4 1241498 0 5/11 0.455
# 5 1228139 1 3/11 0.273
# 6 1228139 1 3/11 0.273
# 7 1334957 0 5/11 0.455
# 8 1328103 1 4/12 0.333
# 9 1206818 1 4/11 0.364
#10 1334957 0 5/11 0.455
#11 1239023 0 5/12 0.417
#12 1241498 0 5/11 0.455
#13 1152361 1 4/12 0.333
I added an additional formula column to show how the ratio is calculated; you can remove that column.
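For reference, the same calculation can be written in base R with ave(); this is just a sketch and assumes the same df and column names as above:

total_one <- sum(df$V_tour)                        # total number of 1's
n         <- nrow(df)                              # total number of rows
ones_id   <- ave(df$V_tour, df$ID, FUN = sum)      # 1's within each ID
rows_id   <- ave(df$V_tour, df$ID, FUN = length)   # rows within each ID
df$formula <- paste(total_one - ones_id, n - rows_id, sep = "/")
df$ratio   <- (total_one - ones_id) / (n - rows_id)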
I want to simulate some unbalanced clustered data. The number of clusters is 20 and the average number of observations per cluster is 30. However, I would like to start each cluster with 10% more observations than specified (i.e., 33 rather than 30), and then randomly exclude the appropriate number of observations (i.e., 60) to arrive at the specified average of 30 observations per cluster. The probability of excluding an observation should not be uniform across clusters (i.e., some clusters have no cases removed and others have more excluded). Therefore, in the end I still have 600 observations in total. Does anyone know how to do this in R? Here is a smaller example dataset. The number of observations per cluster doesn't follow the conditions specified above; I just used it to convey my idea.
> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> df <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset
> df
id cluster x y
1 1 1 0.30003855 0.65325768
2 2 1 -1.00563626 -0.12270866
3 3 1 0.01925927 -0.41367651
4 4 1 -1.07742065 -2.64314895
5 5 1 0.71270333 -0.09294102
6 1 2 1.08477509 0.43028470
7 2 2 -2.22498770 0.53539884
8 3 2 1.23569346 -0.55527835
9 4 2 -1.24104450 1.77950291
10 5 2 0.45476927 0.28642442
11 1 3 0.65990264 0.12631586
12 2 3 -0.19988983 1.27226678
13 3 3 -0.64511396 -0.71846622
14 4 3 0.16532102 -0.45033862
15 5 3 0.43881870 2.39745248
16 1 4 0.88330282 0.01112919
17 2 4 -2.05233698 1.63356842
18 3 4 -1.63637927 -1.43850664
19 4 4 1.43040234 -0.19051680
20 5 4 1.04662885 0.37842390
After randomly adding and deleting some data, the unbalanced data looks like this:
id cluster x y
1 1 1 0.895 -0.659
2 2 1 -0.160 -0.366
3 1 2 -0.528 -0.294
4 2 2 -0.919 0.362
5 3 2 -0.901 -0.467
6 1 3 0.275 0.134
7 2 3 0.423 0.534
8 3 3 0.929 -0.953
9 4 3 1.67 0.668
10 5 3 0.286 0.0872
11 1 4 -0.373 -0.109
12 2 4 0.289 0.299
13 3 4 -1.43 -0.677
14 4 4 -0.884 1.70
15 5 4 1.12 0.386
16 1 5 -0.723 0.247
17 2 5 0.463 -2.59
18 3 5 0.234 0.893
19 4 5 -0.313 -1.96
20 5 5 0.848 -0.0613
EDIT
This part of the problem is solved (credit goes to jay.sf). Next, I want to repeat this process 1000 times and run a regression on each generated dataset. However, I don't want to run the regression on the whole dataset but rather on some randomly selected clusters (this can be done with df[unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = TRUE), ]). In the end, I would like to get confidence intervals from those 1000 regressions. How should I proceed?
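For what it's worth, here is a minimal sketch of that loop, assuming a data frame df with columns cluster, x and y as in the examples here and a simple y ~ x regression (the resampling line follows the snippet above); the percentile interval at the end is one way to summarise the 1000 fits:

cluster <- split(seq_len(nrow(df)), df$cluster)   # row indices per cluster
k <- length(cluster)
boot_slopes <- replicate(1000, {
  idx <- unlist(cluster[sample.int(k, k, replace = TRUE)], use.names = FALSE)
  coef(lm(y ~ x, data = df[idx, ]))["x"]
})
quantile(boot_slopes, c(0.025, 0.975))   # 95% percentile interval for the slope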
As per Ben Bolker's request, I am posting my solution, but see jay.sf's answer for a more generalizable approach.
# First create an oversampled dataset:
y <- rnorm(24)
x <- rnorm(24)
z <- rep(1:6, 4)
w <- rep(1:4, each = 6)
df <- data.frame(id = z, cluster = w, x = x, y = y)

# Then use slice_sample() to arrive at the desired sample size
library(dplyr)
df %>%
  slice_sample(n = 20) %>%
  arrange(cluster)

# Or just use base R
a <- df[sample(nrow(df), 20), ]
df2 <- a[order(a$cluster), ]
Let ncl be the desired number of clusters. We may generate a sampling space S, which is a sequence within tolerance tol around the mean number of observations per cluster mnobs. From that we repeatedly draw a random sample of size 1 to obtain a list of clusters CL. If the sum of the cluster lengths meets ncl*mnobs we break the loop, add random data to the clusters, and rbind the result.
FUN <- function(ncl = 20, mnobs = 30, tol = .1) {
  # sampling space: candidate cluster sizes from mnobs*(1-tol) to mnobs*(1+tol)
  S <- do.call(seq.int, as.list(mnobs * (1 + tol * c(-1, 1))))
  repeat({
    # draw a size for every cluster; retry until the total hits ncl * mnobs
    CL <- lapply(1:ncl, function(x) rep(x, sample(S, 1, replace = TRUE)))
    if (sum(lengths(CL)) == ncl * mnobs) break
  })
  # fill each cluster with an id sequence and random x, y values
  L <- lapply(seq_along(CL), function(i) {
    id <- seq_along(CL[[i]])
    cbind(id, cluster = i,
          matrix(rnorm(max(id) * 2), ncol = 2,
                 dimnames = list(NULL, c("x", "y"))))
  })
  do.call(rbind.data.frame, L)
}
Usage
set.seed(42)
res <- FUN() ## using the argument defaults defined above
dim(res)
# [1] 600 4
(res.tab <- table(res$cluster))
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 29 29 31 31 30 32 31 30 32 28 28 27 28 31 32 33 31 30 27 30
table(res.tab)
# 27 28 29 30 31 32 33
# 2 3 2 4 5 3 1
sapply(c("mean", "sd"), function(x) do.call(x, list(res.tab)))
# mean sd
# 30.000000 1.747178
Displayable example
set.seed(42)
FUN(4, 5, tol=.3) ## tol needs to be adjusted for smaller samples
# id cluster x y
# 1 1 1 1.51152200 -0.0627141
# 2 2 1 -0.09465904 1.3048697
# 3 3 1 2.01842371 2.2866454
# 4 1 2 -1.38886070 -2.4404669
# 5 2 2 -0.27878877 1.3201133
# 6 3 2 -0.13332134 -0.3066386
# 7 4 2 0.63595040 -1.7813084
# 8 5 2 -0.28425292 -0.1719174
# 9 6 2 -2.65645542 1.2146747
# 10 1 3 1.89519346 -0.6399949
# 11 2 3 -0.43046913 0.4554501
# 12 3 3 -0.25726938 0.7048373
# 13 4 3 -1.76316309 1.0351035
# 14 5 3 0.46009735 -0.6089264
# 15 1 4 0.50495512 0.2059986
# 16 2 4 -1.71700868 -0.3610573
# 17 3 4 -0.78445901 0.7581632
# 18 4 4 -0.85090759 -0.7267048
# 19 5 4 -2.41420765 -1.3682810
# 20 6 4 0.03612261 0.4328180
What is this operation called and how do I achieve this? (I can't find an example.)
Given
temp1
Var1 Freq
1 (0,0.78] 0
2 (0.78,0.99] 0
3 (0.99,1.07] 0
4 (1.07,1.201] 1
5 (1.201,1.211] 0
6 (1.211,1.77] 2
How do I split the intervals in Var1 into two vectors for start and end?
Like this
df2
start end Freq
1 0.000 0.780 0
2 0.780 0.990 0
3 0.990 1.070 0
4 1.070 1.201 1
5 1.201 1.211 0
6 1.211 1.770 2
This is an XY problem: you shouldn't need to produce that format in the first place.
E.g.:
x <- 1:10
brks <- c(0,5,10)
data.frame(table(cut(x,brks)))
# Var1 Freq
#1 (0,5] 5
#2 (5,10] 5
data.frame(start=head(brks,-1), end=tail(brks,-1), Freq=tabulate(cut(x,brks)))
# start end Freq
#1 0 5 5
#2 5 10 5
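That said, if you really do have only the table() output and need to recover the endpoints, the interval labels can be parsed; a sketch, assuming temp1 as shown in the question:

lab <- as.character(temp1$Var1)
df2 <- data.frame(
  start = as.numeric(sub("\\((.+),.*", "\\1", lab)),   # text between "(" and ","
  end   = as.numeric(sub(".*,(.+)\\]", "\\1", lab)),   # text between "," and "]"
  Freq  = temp1$Freq
)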
I would like to check whether there is a correlation between "birds" and "wolfs" at different lags. Getting the correlation value is easy, but how can I address the lag issue (I need to check the correlation value for lags 1:4)? The output I am looking for is a data table that contains the lag value and the related correlation value.
df <- read.table(text = " day birds wolfs
0 2 21
1 8 4
2 2 5
3 2 4
4 3 6
5 1 12
6 7 10
7 1 9
8 2 12", header = TRUE)
Output(not real results):
Lag CorValue
0 0.9
1 0.8
2 0.7
3 0.9
If you do this:
corLag<-ccf(df$birds,df$wolfs,lag.max=max(df$day))
it will return this:
Autocorrelations of series ‘X’, by lag
-8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8
-0.028 0.123 -0.045 -0.019 0.145 -0.176 -0.082 -0.126 -0.296 0.757 -0.134 -0.180 0.070 -0.272 0.549 -0.170 -0.117
The first row is the lag, the second is the correlation value. You can check that cor(df$birds, df$wolfs) is indeed equal to -0.296.
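If you want the table asked for in the question, the values can be pulled out of the ccf object; a sketch for lags 0 to 4 (drop plot = FALSE if you also want the plot):

corLag <- ccf(df$birds, df$wolfs, lag.max = 4, plot = FALSE)
res <- data.frame(Lag = corLag$lag[, 1, 1], CorValue = corLag$acf[, 1, 1])
res[res$Lag >= 0, ]   # keep only the non-negative lags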
I'd like to do a cut with a guaranteed number of levels returned, i.e., take any vector of cumulative percentages and cut it into deciles. I've tried using cut and it works well in most situations, but when a decile covers a large stretch of the range with no data it fails to return the desired number of unique cuts, which is 10. Any ideas on how to ensure that the number of cuts is guaranteed to be 10?
In the included example there is no occurrence of decile 7.
> (x <- c(0.04,0.1,0.22,0.24,0.26,0.3,0.35,0.52,0.62,0.66,0.68,0.69,0.76,0.82,1.41,6.19,9.05,18.34,19.85,20.5,20.96,31.85,34.33,36.05,36.32,43.56,44.19,53.33,58.03,72.46,73.4,77.71,78.81,79.88,84.31,90.07,92.69,99.14,99.95))
[1] 0.04 0.10 0.22 0.24 0.26 0.30 0.35 0.52 0.62 0.66 0.68 0.69 0.76 0.82 1.41 6.19 9.05 18.34 19.85 20.50 20.96 31.85 34.33
[24] 36.05 36.32 43.56 44.19 53.33 58.03 72.46 73.40 77.71 78.81 79.88 84.31 90.07 92.69 99.14 99.95
> (cut(x,seq(0,max(x),max(x)/10),labels=FALSE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (as.integer(cut2(x,seq(0,max(x),max(x)/10))))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
> (findInterval(x,seq(0,max(x),max(x)/10),rightmost.closed=TRUE,all.inside=TRUE))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 4 4 4 4 5 5 6 6 8 8 8 8 8 9 10 10 10 10
I would like to get 10 approximately equally sized intervals, defined in such a way that I am assured of getting 10. cut et al. give 9 bins with this example; I want 10. So I'm looking for an algorithm that recognizes that the jump from 58.03 to 72.46 is large: instead of assigning the cases 58.03, 72.46, 73.4 to bins 6, 8, 8 it would assign them to bins 6, 7, 8.
xx <- cut(x, breaks=quantile(x, (1:10)/10, na.rm=TRUE) )
table(xx)
#------------------------
xx
(0.256,0.58] (0.58,0.718] (0.718,6.76] (6.76,20.5]
4 4 4 4
(20.5,35.7] (35.7,49.7] (49.7,75.1] (75.1,85.5]
3 4 4 4
(85.5,100]
4
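Note that with breaks at the (1:10)/10 quantiles, the smallest values fall below the first break (they become NA) and only 9 intervals appear above. Including the 0% quantile and closing the lowest interval is one way to guarantee 10 bins, e.g.:

xx <- cut(x, breaks = quantile(x, 0:10/10, na.rm = TRUE), include.lowest = TRUE)
length(levels(xx))   # now 10 levels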
numBins = 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins+1))
Output (factor values omitted):
10 Levels: (0.04,10] (10,20] (20,30] (30,40] (40,50] (50,60] ... (90,100]
This will make 10 bins that are approximately equally spaced. Note that by changing the numBins variable, you can obtain any number of bins that are approximately equally spaced.
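One caveat: min(x) sits exactly on the first break, so it comes back as NA unless the lowest interval is closed; a small sketch:

numBins <- 10
cut(x, breaks = seq(from = min(x), to = max(x), length.out = numBins + 1),
    include.lowest = TRUE)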
Not sure I understand what you need, but if you drop the labels=FALSE and use table to make a frequency table of your data, you will get the number of categories desired:
> table(cut(x, breaks=seq(0, 100, 10)))
(0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
17 2 2 4 2 2 0 5 1 4
Notice that there is no data in the 7th category, (60,70].
What is the problem you are trying to solve? If you don't want quantiles, then your cutpoints are pretty much arbitrary, so you could just as easily create ten bins by sampling without replacement from your original dataset. I realize that's an absurd method, but I want to make a point: you may be way off track but we can't tell because you haven't explained what you intend to do with your bins. Why, for example, is it so bad that one bin has no content?