Specifying directly factor levels and sizes - r

How would you create a factor with levels and corresponding sizes directly specified?
e.g. [0, 5) 6
[5, 7) 20
[7, 13) 4
Edit: This question is related to grouped frequency distributions. Sometimes (say in textbooks), you don't get access to original data but you're just given the count of the occurrences of values within each class. Later on, you'd want to compute cumulative count/frequency, you'd like to tell what count such or such class has and so on. So you just need to be able to enter the class table and hence my question.
Second edit:
Typical textbook example (it's already a summary, the original data set is not available):
[20, 30) 221890
[30, 35) 171050
[35, 40) 121400
[40, 45) 101050
[45, 60) 71620
# ... possibly many more but let's stop here.
Then typical questions are: what is the tally for the [30, 35) class? What is the cumulative count at 45? Plot the corresponding histogram, and so on and so forth.
So @thelatemail's first comment provided a workable answer, but I was worried about the 'size' of the resulting factor. That's why I asked for alternative solutions. @agstudy's answer also works along the same lines, but with the extra burden of recreating a (temporary, agreed) whole new data set. Still, it's an interesting answer in itself. I was particularly interested in the way @agstudy computed the temporary data set.
All in all, these solutions work but I would like some optimized approach if at all possible.
Theoretically, a factor would be the needed output, but a factor seems way too big to store that summary table.

For example using cut you can do this:
cut(rep(c(1,6,11),c(6,20,4)),c(0,5,7,13))
You can check using table
table(cut(rep(c(1,6,11),c(6,20,4)),c(0,5,7,13)))
 (0,5]  (5,7] (7,13]
     6     20      4
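The question writes its classes as left-closed intervals such as [0, 5), whereas cut() closes intervals on the right by default. A minimal sketch, assuming the same made-up representative values as above (1, 6 and 11 are only placeholders for "some value inside each class"), passes right = FALSE so that the labels match the question's notation:

```r
# One representative value per class, repeated by the class count.
x <- rep(c(1, 6, 11), times = c(6, 20, 4))

# right = FALSE makes the intervals left-closed: [0,5), [5,7), [7,13).
f <- cut(x, breaks = c(0, 5, 7, 13), right = FALSE)

levels(f)
table(f)
```

The factor itself only stores one integer code per observation plus the three level labels, so the labels are not repeated in memory.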
EDIT: to create the data from the intervals themselves, you can also do this (take each lower bound plus 1 as the representative value, dropping the last break):
cut(rep((c(0,5,7,13) + 1)[-4], c(6,20,4)), c(0,5,7,13))
EDIT: even after the clarification, it is still not clear to me what you have as input, especially the structure of your input data. Here is a straightforward method:
text='[20, 30) 221890
[30, 35) 171050
[35, 40) 121400
[40, 45) 101050
[45, 60) 71620'
dd <- do.call(rbind,strsplit(readLines(textConnection(text)),') '))
vv <- as.numeric(dd[,2])
names(vv) <- paste0(dd[,1],')')
vv
[20, 30) [30, 35) [35, 40) [40, 45) [45, 60)
  221890   171050   121400   101050    71620
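If the goal is to avoid expanding the summary into one value per observation, one possible alternative is to keep the class table itself as a small data frame. The following is only a sketch under that assumption; the names breaks, counts and freq and the column layout are hypothetical and do not come from the answers above:

```r
# Class boundaries and counts copied from the textbook example.
breaks <- c(20, 30, 35, 40, 45, 60)
counts <- c(221890, 171050, 121400, 101050, 71620)

lower <- head(breaks, -1)   # left end of each class
upper <- tail(breaks, -1)   # right end of each class

# One row per class, with a running total.
freq <- data.frame(class = sprintf("[%g, %g)", lower, upper),
                   lower = lower, upper = upper, count = counts)
freq$cum_count <- cumsum(freq$count)

# Tally for the [30, 35) class.
tally_30_35 <- freq$count[freq$class == "[30, 35)"]

# Cumulative count at 45: every class that ends at or before 45.
cum_at_45 <- sum(freq$count[freq$upper <= 45])

# Histogram from the summary alone: bar area is proportional to the count,
# so the height is a density and the bar width is the class width.
dens <- counts / (sum(counts) * diff(breaks))
barplot(dens, width = diff(breaks), space = 0,
        names.arg = freq$class, ylab = "Density")
```

The table has one row per class, so its size does not depend on how many observations the counts represent.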

Related

Cutting Stock Optimization:Finding all Possible combinations in R

I am looking at the cutting stock problem as described here. The starting point of the problem is the statement that, for the given possible cuts, namely 14, 31, 36 and 45, a plank of length 100 can be cut into 37 possible patterns. One pattern can be 1,0,1,1, while another can be 1,1,0,1 or 0,0,0,2, etc. Is there an existing algorithm that can be used in R to list all possible combinations for a given overall size and individual cuts (37 in this case)?
Here's a brute force approach. Create a vector that has the "max" for each cut. Then create a grid of possibilities. Then do matrix multiplication on the grid against the cuts to get the total "length" of each combination; anything less than or equal to (lteq) 100 is "legit". Note that there are 38 combinations because one case is 0,0,0,0, which you probably want to throw out.
cuts <- c(14, 31, 36, 45)
# Get the max number of each length of cut
max_of_each <- floor(100 / cuts)
possibilities <- lapply(max_of_each, function(i) seq(0, i))
grid_possibilities <- expand.grid(possibilities)
idx_lteq_100 <- as.matrix(grid_possibilities) %*% cuts <= 100
grid_possibilities[idx_lteq_100, ]
nrow(grid_possibilities[idx_lteq_100, ])
# [1] 38
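To throw out the empty 0,0,0,0 pattern mentioned above, one option is to add a second condition on the row sums. This is a small sketch of that idea; the names grid, total and patterns are introduced here for illustration only:

```r
cuts <- c(14, 31, 36, 45)

# Every count from 0 up to the maximum that fits, for each cut length.
grid <- expand.grid(lapply(floor(100 / cuts), function(i) seq(0, i)))

# Total length used by each candidate pattern, as a plain vector.
total <- drop(as.matrix(grid) %*% cuts)

# Keep patterns that fit on the plank and use at least one cut.
patterns <- grid[total <= 100 & rowSums(grid) > 0, ]
nrow(patterns)
# [1] 37
```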

Randomly pairing elements of a vector in R to count unique arrangements

Background:
On this combinatorics question, the issue is how to determine the sample space: the ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see comments OP) and 105 (see edit within OP and answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now erased answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, is the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of using unlist is from @Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
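The same numbers can be checked without relying on the simulation having visited every permutation. The sketch below first evaluates the closed-form counts, then counts distinct draws directly by reducing each shuffled vector to a canonical form; canonical_draw is a hypothetical helper written for this illustration, and the seed and the number of replicates are arbitrary choices:

```r
n_teams <- 8
n_pairs <- n_teams / 2

# Closed form: all orderings, divided by the orderings that describe the same draw.
exact_unordered <- factorial(n_teams) / (factorial(n_pairs) * 2^n_pairs)
exact_home_away <- factorial(n_teams) / factorial(n_pairs)

# Canonical form of one draw: sort inside each pairing, then sort the pairings.
canonical_draw <- function(perm) {
  pairs <- matrix(perm, nrow = 2)                     # each column is one pairing
  pairs <- apply(pairs, 2, sort)                      # LIV-BYN equals BYN-LIV
  pairs <- pairs[, order(pairs[1, ]), drop = FALSE]   # order of pairings is irrelevant
  paste(pairs, collapse = "-")
}

set.seed(1)
draws <- replicate(2e4, canonical_draw(sample(n_teams)))
n_distinct <- length(unique(draws))

c(exact_unordered, exact_home_away, n_distinct)
```

With this approach no division by 4! or 2^4 is needed afterwards, because equivalent draws collapse to the same string before counting.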

Set my own fixed X-axis value in a grid chart? Including symbols like "<" and ">" (QlikView)

So I'm creating this grid chart and I really want to have the following values on my X-axis:
"<10"
"<20"
">20"
I want my graph to look something like the following graph, in the link below:
Graph example
The nodes' X values do not contain the less-than (<) or greater-than (>) symbols; they are just numbers ranging from 1 to 30 with no extra characters. Choosing only that field as the x-axis doesn't do it, of course. I only want those three specified values, containing the symbols (< and >), on the X-axis.
I feel like this should be a simple thing to solve, but I've tried for a while now without any success...
Sorry about the poor example; hopefully you understand what I'm saying.
Any ideas?
Thanks in advance.
Have a look at the Class function.
The class function assigns the first parameter to a class interval. The result is a dual value with a<=x<b as the textual value, where a and b are the lower and upper limits of the bin.
You could create a grid chart and use PplWatched and Rating as dimensions with expression count(id) using the following testdata:
Data:
load
id, class(PplWatched, 10) as PplWatched, Rating
;
load * inline [
id, PplWatched, Rating
1, 14, 4
2, 2, 2
3, 19, 5
4, 30, 4
5, 9, 1
6, 45, 5
];

Finding the first significant figure of difference between two very similar values

I'm trying to reproduce the computations that led to a data set data.ref. I'd like to test how well my current implementation does by comparing the reference data to my computed results, data.my. Since each column of the data should have comparable magnitudes within the column, but not necessarily between columns, I've been looking at
(data.ref - data.my) / data.ref
to put errors on a comparable scale. However, since the data is ultimately going to be rounded off, what I'd really like to do is just run a quick and dirty check of how many significant figures worth of agreement the data has. That is, since I expect data.ref and data.my to be quite close to each other, I'd like the answer the question: what is the first significant figure at which each pair of corresponding entries differs?
Is there an R function that does this?
ceiling(log10(abs(data.ref - data.my))) seems to do the trick. Entries that agree exactly give -Inf.
Example:
> data.my <- c(20, 30, 32, 32.01, 32.012)
> data.ref <- rep(32, length(data.my))
> ceiling(log10(abs(data.my - data.ref)))
[1] 2 1 -Inf -2 -1
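Since the question scales the errors by data.ref, the same expression can also be applied to the relative difference, which makes the result comparable across columns of different magnitude. This is only a sketch of one possible variant, not part of the original answer; the name rel_pos is made up for the illustration:

```r
data.my  <- c(20, 30, 32, 32.01, 32.012)
data.ref <- rep(32, length(data.my))

# Position of the leading digit of the relative error, one value per entry.
# Exact agreement gives log10(0) = -Inf.
rel_pos <- ceiling(log10(abs((data.ref - data.my) / data.ref)))
rel_pos
```

More negative values mean closer agreement, and -Inf marks entries that are identical.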

Cut function in R - exclusive or am I double counting?

Based on a previous question I asked, which @Andrie answered, I have a question about the usage of the cut function and labels.
I'd like to get summary statistics based on the range of the number of times a user logs in.
Here is my data:
# Get random numbers
NumLogin <- round(runif(100,1,50))
# Set the login range
LoginRange <- cut(NumLogin,
c(0,1,3,5,10,15,20,Inf),
labels=c('1','2','3-5','6-10','11-15','16-20','20+')
)
Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc, while only including the user if they are in that range. Is the cut function including 3 twice (In the 2 bucket and the 3-5 bucket)? If I look in my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?
Also - As a usage question - should I attach the LoginRange to my data frame? If so, what's the best way to do so?
DF <- data.frame(NumLogin, LoginRange)
?
Thanks
The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
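A slightly larger sketch makes both points visible at once: values on a boundary go to exactly one bin, and values outside all bins become NA. The names closed_right and closed_left are only illustrative:

```r
x <- 0:3

# Default (right = TRUE): bins are (0,1] and (1,2]; 0 and 3 fall outside, so NA.
closed_right <- cut(x, breaks = c(0, 1, 2))

# right = FALSE: bins are [0,1) and [1,2); now 2 and 3 fall outside, so NA.
closed_left <- cut(x, breaks = c(0, 1, 2), right = FALSE)

data.frame(x, closed_right, closed_left)
```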
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below), reveals that the 2nd and 3rd bins were actually incorrectly named, which illustrates the point quite nicely!
LoginRange <- cut(NumLogin,
c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
# c(0,1,3,5,10,15,20,Inf) + 0.5,
labels=c('1','2-3','4-5','6-10','11-15','16-20','20+')
)
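One way to confirm that the labels now match the contents of each bin, and to answer the usage question about attaching LoginRange to a data frame, is to look at the smallest and largest NumLogin in every bin. This is a sketch only; the seed is arbitrary and bin_ranges is a name made up for the example:

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
NumLogin <- round(runif(100, 1, 50))

LoginRange <- cut(NumLogin,
                  c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
                  labels = c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+'))

# Keep the raw counts and their bin side by side.
DF <- data.frame(NumLogin, LoginRange)

# Smallest and largest NumLogin that landed in each bin.
bin_ranges <- aggregate(NumLogin ~ LoginRange, data = DF, FUN = range)
bin_ranges
```

If a label and its observed range disagree, the breaks or the labels need another look.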
