Cut function in R - exclusive or am I double counting?

Based on a previous question I asked, which @Andrie answered, I have a question about the usage of the cut function and labels.
I'd like to get summary statistics based on the range of the number of times a user logs in.
Here is my data:
# Get random numbers
NumLogin <- round(runif(100,1,50))
# Set the login range
LoginRange <- cut(NumLogin,
                  c(0, 1, 3, 5, 10, 15, 20, Inf),
                  labels = c('1', '2', '3-5', '6-10', '11-15', '16-20', '20+')
)
Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc., while only including each user in the one range they belong to. Is the cut function including 3 twice (in the '2' bucket and in the '3-5' bucket)? If I look at my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?
Also, as a usage question: should I attach the LoginRange to my data frame? If so, is this the best way to do so?
DF <- data.frame(NumLogin, LoginRange)
Thanks

The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:
cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]
As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
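For instance (a minimal illustration), values outside the breaks simply come back as NA:
cut(c(0, 2.5), breaks = c(0, 1, 2))
# [1] <NA> <NA>
# Levels: (0,1] (1,2]
Here 0 is excluded because the first interval is left-open, and 2.5 lies beyond the last break.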
When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below) reveals that the 2nd and 3rd bins were actually incorrectly labelled, which illustrates the point quite nicely!
LoginRange <- cut(NumLogin,
                  c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
                  # i.e. c(0, 1, 3, 5, 10, 15, 20, Inf) + 0.5
                  labels = c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+')
)
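As a quick sanity check (the counts will vary with your random draw), you can tabulate the result and confirm that every value lands in exactly one bin:
table(LoginRange)
sum(table(LoginRange)) == length(NumLogin)
# [1] TRUE  (nothing is double-counted or dropped, since these breaks cover 1-50)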

Related

How to use the function num_to_schoice()?

I would like to build a simple probability exercise whose solution is a single one-decimal number strictly between zero and one. I would like to use the function num_to_schoice, but if I write:
num_to_schoice(0.3,digits=1,range=c(0.1,0.9))
I get the error message:
NULL
Warning message:
In num_to_schoice(0.3, digits = 1, range = c(0.1, 0.9)) :
specified 'range' is too small for 'delta'
Could someone please explain how the function num_to_schoice should be properly used?
Let me add a couple of points to the existing answer by @Edward (+1):
If you generate a solution from the sequence 0.1, 0.2, ..., 0.9 and want four of the remaining eight numbers as distractors, I would recommend not using num_to_schoice(). I would only use num_to_schoice() when moving to correct solutions from a finer grid, say 0.10, 0.11, 0.12, ..., 0.90.
Without num_to_schoice() for one digit
You can set up an answerlist with all nine numbers from the sequence, sorting the correct solution into the first position, and then using the exshuffle meta-information tag to do the actual sampling.
For example, in the data-generation you need something like this:
sol <- 0.3
ans <- c(sol, setdiff(1:9/10, sol))
ans <- paste0("$", ans, "$")
In the question you can then include
answerlist(ans, markup = "markdown")
## Answerlist
## ----------
## * $0.3$
## * $0.1$
## * $0.2$
## * $0.4$
## * $0.5$
## * $0.6$
## * $0.7$
## * $0.8$
## * $0.9$
Finally, the meta-information needs:
exsolution: 100000000
exshuffle: 5
This will then use the correct solution and four of the eight false answers - all in shuffled order. (Note that the above uses .Rmd syntax, for .Rnw this needs to be adapted accordingly.)
With num_to_schoice() for two digits
For the scenario with one digit, num_to_schoice() tries to do too many things, but for more than one digit it might be useful. Specifically, num_to_schoice() assures that the rank of the correct solution is non-informative, i.e., the correct solution could be the smallest, second-smallest, ..., largest number in the displayed sequence with equal probability. This may be important if the distribution of the correct solution is not uniform across the possible range. It is also the reason why the following code sometimes fails:
num_to_schoice(0.3, digits = 1, delta = 0.1, range = c(0.1, 0.9))
Internally, this first decides how many of the four wrong answers should be to the left of the correct solution 0.3. Clearly, there is room for at most two wrong answers to the left, which may result in a warning and a NULL result if more are requested. Moving to two digits can resolve this, e.g.:
num_to_schoice(0.31, range = c(0.01, 0.99),
digits = 2, delta = 0.03, method = "delta")
Remarks:
Personally, I would only do this if the correct solution can potentially also have two digits. Otherwise students might pick up this pattern.
You need to ensure that there is at least 4 * delta to the left and to the right of the correct solution, so that there is enough room for the wrong answers (see the sketch after these remarks).
Using delta = 0.01 would certainly be possible, but if you want larger deltas then delta = 0.03 or delta = 0.07 are also often useful choices. This is because sampling from an equidistant grid with such a delta is typically not noticeable for most students. In contrast, deltas like 0.05, 0.1, 0.02, etc. are typically picked up quickly.
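As a small sketch of the second remark, using the example values from above, you can check that there is enough room on each side before calling num_to_schoice():
sol <- 0.31; rng <- c(0.01, 0.99); delta <- 0.03
# at least 4 * delta on each side, in case all four wrong answers
# land on the same side of the correct solution
stopifnot(sol - rng[1] >= 4 * delta, rng[2] - sol >= 4 * delta)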
Because your range is (0, 1), you have to specify a smaller delta than the default (1). The function generates four wrong answers (five displayed options in total), so each has to be within the range you give AND far enough away from the other answers, by an amount equal to delta. You should also use the "delta" method, since the package authors give the following advice:
Two methods can be used to generate the wrong solutions: either simply runif or otherwise a full equi-distant grid for the range with step size delta is set up, from which a discrete uniform sample is drawn. The former is preferred if the range is large enough, while the latter performs better if the range is small (as compared to delta).
So you can try the following:
num_to_schoice(0.3, digits=1, range=c(0.1, 0.9), delta=0.05, method="delta")
#$solutions
#[1] FALSE FALSE FALSE TRUE FALSE
#$questions
#[1] "$0.6$" "$0.5$" "$0.3$" "$0.4$" "$0.8$"
Note that this function incorporates randomness, so you may need to try a few times before a valid solution appears. Just keep ignoring the errors.
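If you'd rather automate the retries, a minimal sketch (reusing the call from above) is:
sq <- NULL
while (is.null(sq)) {
  # retry until num_to_schoice() returns a non-NULL result
  sq <- num_to_schoice(0.3, digits = 1, range = c(0.1, 0.9),
                       delta = 0.05, method = "delta")
}
(If the call itself can hang, as noted in the edit below, you may want to cap the number of attempts.)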
Edit:
I did try this a few times and every now and then I got a warning about the specified range being too small, with a NULL result returned. Other times the function didn't do anything and I had to abort. The help page also has this tidbit:
Exercise templates using num_to_schoice should be thoroughly tested in order to avoid problems with too small ranges or almost identical correct and wrong answers! This can potentially cause problems, infinite loops, etc.
Inspection of the num_to_schoice function revealed that there is a while loop near the end which may get stuck in the aforementioned "infinite loop". To cut a long story short, it looks like you need to increase the digits to at least 2, otherwise there's a chance that this loop will never end. I hope it's ok to have 2 digits in the answers.
num_to_schoice(0.3, digits=2, range=c(0.1, 0.9), delta=0.01, method="delta")
$solutions
[1] FALSE FALSE FALSE TRUE FALSE
$questions
[1] "$0.23$" "$0.42$" "$0.22$" "$0.30$" "$0.54$"
I tried this 10,000 times and it always returned a non-null result.
res <- vector("list", 10000)  # preallocate the result list
for (i in 1:10000) {
  res[[i]] <- num_to_schoice(0.3, digits = 2, range = c(0.1, 0.9),
                             delta = 0.01, method = "delta")
}
sum(sapply(res, function(x) any(is.null(x))))
# [1] 0
Hope that works now.

Finding the first significant figure of difference between two very similar values

I'm trying to reproduce the computations that led to a data set data.ref. I'd like to test how well my current implementation does by comparing the reference data to my computed results, data.my. Since each column of the data should have comparable magnitudes within the column, but not necessarily between columns, I've been looking at
(data.ref - data.my) / data.ref
to put errors on a comparable scale. However, since the data is ultimately going to be rounded off, what I'd really like to do is just run a quick and dirty check of how many significant figures' worth of agreement the data has. That is, since I expect data.ref and data.my to be quite close to each other, I'd like to answer the question: what is the first significant figure at which each pair of corresponding entries differs?
Is there an R function that does this?
ceiling(log10(abs(data.ref - data.my))) seems to do the trick.
Example:
> data.my <- c(20, 30, 32, 32.01, 32.012)
> data.ref <- rep(32, length(data.my))
> ceiling(log10(abs(data.my - data.ref)))
[1] 2 1 -Inf -2 -1
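If you need this check repeatedly, you could wrap the one-liner in a tiny helper (first_diff_pos is just a hypothetical name):
first_diff_pos <- function(ref, my) ceiling(log10(abs(ref - my)))
first_diff_pos(data.ref, data.my)
# [1]    2    1 -Inf   -2   -1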

How to compute for the mean and sd

I need help on 4b please
‘Warpbreaks’ is a built-in dataset in R. Load it using the function data(warpbreaks). It consists of the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. It has three variables namely, breaks, wool, and tension.
b. For the ‘AM.warpbreaks’ dataset, compute for the mean and the standard deviation of the breaks variable for those observations with breaks value not exceeding 30.
data(warpbreaks)
warpbreaks <- data.frame(warpbreaks)
AM.warpbreaks <- subset(warpbreaks, wool=="A" & tension=="M")
mean(AM.warpbreaks<=30)
sd(AM.warpbreaks<=30)
This is how I understood the problem, and I typed the code in the last two lines accordingly. However, I wasn't able to run the last two lines, while the first three ran successfully. Can anybody tell me what the error is here?
Thanks! :)
Another way to go about it: this way you aren't generating a bunch of datasets and then having to remember which is which. (This is more of a personal preference, though.)
data(warpbreaks)
mean(AM.warpbreaks[which(AM.warpbreaks$breaks<=30),"breaks"])
sd(AM.warpbreaks[which(AM.warpbreaks$breaks<=30),"breaks"])
There are two problems with your code. The first is that you are comparing to 30, but you're looking at the entire data frame, rather than just the "breaks" column.
AM.warpbreaks$breaks <= 30
is an expression that identifies the breaks that are at most thirty.
But mean(AM.warpbreaks$breaks <= 30) will not give the answer you want either, because R evaluates the inner expression as a vector of logical TRUE/FALSE values indicating whether each break is at most 30, and the mean of those logicals is a proportion, not the mean of the breaks themselves.
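A one-line illustration of that point:
mean(c(TRUE, TRUE, FALSE))
# [1] 0.6666667  (the mean of logicals is a proportion, not a mean of breaks)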
Generally, you just want to take another subset for an analysis like this.
AM.lt.30 <- subset(AM.warpbreaks, breaks <= 30)
mean(AM.lt.30$breaks)
sd(AM.lt.30$breaks)
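Equivalently, without creating the intermediate subset, a base-R sketch using logical indexing on the AM.warpbreaks defined in the question:
with(AM.warpbreaks, c(mean = mean(breaks[breaks <= 30]),
                      sd   = sd(breaks[breaks <= 30])))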

How to manage factors with mixed data types

I'm afraid this question has two sub-parts. My project is to determine which insurance carrier has the lowest cost based on CPT codes. Since there are so many CPT codes, I wanted to group them using cut, like this:
uCPTCode<- unique(data$CPTCode)
uCPTCode <- cut(uCPTCode,
breaks = c(-Inf, "01999", "69979", "79999", "89398", "99091", "99499", Inf),
labels = c("NA","Anesthesia", "Surgery", "Radiology", "Pathology&Laboratory", "Medicine","Evaluation&Management", "Temp"),
right = FALSE)
Not sure unique is required or wise, but it seemed to make sense to me. The issue is that some codes have leading zeros and terminating letters, like this:
2608 Levels: 0014F 0159T 0164T 0191T 0195T 0232T 0319T 0326T 0513F 0517F 0518F
So question 1 is: what is the process to convert these ranges into integers corresponding to the labels I have in the cut function, so I can graph the grouped results on the x axis?
Question 2 is that I expected the ranges to be continuous, but they are not. How do I manage what happens around codes 99000 through 99216, where previous groups (Medicine, Anesthesiology, and Evaluation and Management) get combined? Here is a link to the CPT grouper file: https://www.dropbox.com/s/wm55n17pufoacww/CPTGrouper.xlsx?dl=0
Here is a smattering of results to see where I am going with it
https://www.dropbox.com/s/h6sdnvm9yew6jdg/SampleStudyResults.xlsx?dl=0
Thanks very much for your time and attention

Calculate percentage over time on very large data frames

I'm new to R, and my problem is that I know what I need to do, just not how to do it in R. I have a very large data frame from a web services load test, ~20M observations. It has the following variables:
epochtime, uri, cache (hit or miss)
I'm thinking I need to do a couple of things. I need to subset my data frame to the top 50 distinct URIs, then for each observation in each subset calculate the % cache hit at that point in time. The end goal is a plot of cache hit/miss % over time by URI.
I have read, and am still reading, various posts here on this topic, but R is pretty new to me and I have a deadline. I'd appreciate any help I can get.
EDIT:
I can't provide exact data, but it looks like this; it's at least 20M observations I'm retrieving from a Mongo database. Time is epoch, and we're recording many thousands per second, so time has a lot of dupes; that's expected. There could be more than 50 URIs; I only care about the top 50. The end result would be a line plot over time of % TCP_HIT relative to the total occurrences, by URI. Hope that's clearer.
time uri action
1355683900 /some/uri TCP_HIT
1355683900 /some/other/uri TCP_HIT
1355683905 /some/other/uri TCP_MISS
1355683906 /some/uri TCP_MISS
You are looking for the aggregate function.
Call your data frame u:
> u
time uri action
1 1355683900 /some/uri TCP_HIT
2 1355683900 /some/other/uri TCP_HIT
3 1355683905 /some/other/uri TCP_MISS
4 1355683906 /some/uri TCP_MISS
Here is the ratio of hits for a subset (using the order of factor levels, TCP_HIT=1, TCP_MISS=2 as alphabetical order is used by default), with ten-second intervals:
ratio <- function(u) aggregate(u$action ~ u$time %/% 10,
FUN=function(x) sum((2-as.numeric(x))/length(x)))
Now use lapply to get the final result:
lapply(seq_along(levels(u$uri)),
function(l) list(uri=levels(u$uri)[l],
hits=ratio(u[as.numeric(u$uri) == l,])))
[[1]]
[[1]]$uri
[1] "/some/other/uri"
[[1]]$hits
u$time%/%10 u$action
1 135568390 0.5
[[2]]
[[2]]$uri
[1] "/some/uri"
[[2]]$hits
u$time%/%10 u$action
1 135568390 0.5
Or otherwise filter the data frame by URI before computing the ratio.
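That filtering variant could look like this, using the toy data above:
ratio(u[u$uri == "/some/uri", ])
#   u$time%/%10 u$action
# 1   135568390      0.5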
@MatthewLundberg's code is the right idea. Specifically, you want something that utilizes the split-apply-combine strategy.
Given the size of your data, though, I'd take a look at the data.table package.
For grouped operations on data this size, data.table is simply faster.
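For example, a minimal data.table version of the per-URI hit ratio might look like this (a sketch, assuming the same time/uri/action layout as in the example above):
library(data.table)
dt <- as.data.table(u)
# share of TCP_HIT per URI within ten-second buckets
dt[, .(hit_pct = mean(action == "TCP_HIT")), by = .(uri, bucket = time %/% 10)]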
Thought it would be useful to share my solution to the plotting part of the problem.
My R "noobness" may shine through here, but this is what I came up with. It makes a basic line plot of the actual values; I haven't done any conversions.
# h is the list returned by the lapply() call above: one element per
# URI, holding the URI name and its hit-ratio table
for (i in 1:length(h)) {
  name <- unlist(h[[i]][1])                           # the URI
  dftemp <- as.data.frame(do.call(rbind, h[[i]][2]))  # its time/ratio rows
  names(dftemp) <- c("time", "cache")
  plot(dftemp$time, dftemp$cache, type = "o")
  title(main = name)
}
