Modelling GLM in R with discrete non-binary response - r

I'm new to R and I would like to model the following in GLM:
A memory retention experiment where I ask each participant a similar question every X days. The question has either a correct or a wrong answer, and the participant must keep trying until the answer is correct. I want to find the probability of the participant answering the question correctly in 1 try, given the past data on the number of tries and the time offset between questions.
I'm following this tutorial to model it:
http://www.theanalysisfactor.com/r-tutorial-glm1/
Here's an example of a part of my table
0 3 -
1 1 2
0 2 1
0 5 4
1 1 2
The first value is a binary value whether he 'passes' or 'fails'. Answering in 1 attempt is pass and more than that is fail.
The second value is the number of attempts. We can see that if it is 1 then the first value is also 1, else, the first value is 0.
The third value is the number of days between the question and the previous question.
Right now I'm modelling it as
first ~ second + third
I was wondering whether there is a better way to do it, since the first value is directly determined by the second value. Something like only using the second and third values, and finding P(second = 1). Eventually I would also like to find P(second = 2).
Thanks for your help :)
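One possible direction, offered only as a sketch: because the pass/fail flag is computed from the number of attempts on the same row, using that count as a predictor leaks the outcome into the model. A logistic regression that uses only information available before the question is asked avoids this. The data below is simulated purely for illustration, and the column names (pass, prev_attempts, gap) and the data-generating process are assumptions, not the asker's actual table.

```r
# Toy data, invented for illustration only (not the asker's data).
set.seed(42)
n <- 200
gap <- sample(1:7, n, replace = TRUE)            # days since the previous question
prev_attempts <- sample(1:5, n, replace = TRUE)  # attempts needed on the previous question

# Assumed data-generating process: longer gaps and more earlier
# attempts lower the chance of passing on the first try.
p_pass <- plogis(1.5 - 0.3 * gap - 0.4 * prev_attempts)
pass <- rbinom(n, size = 1, prob = p_pass)
d <- data.frame(pass, prev_attempts, gap)

# Logistic regression: P(pass on first try) from past information only.
fit <- glm(pass ~ prev_attempts + gap, family = binomial, data = d)

# Predicted probability for a hypothetical new question.
new_q <- data.frame(prev_attempts = 2, gap = 3)
p_hat <- unname(predict(fit, newdata = new_q, type = "response"))
```

Here p_hat is the estimated P(second = 1) for that hypothetical question. Estimating P(second = 2) as well would need a model for the attempt count itself rather than the binary flag.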

Related

Finding the percentage of a specific value in the column of a data set

I have a dataset called college, and one of the columns is 'accepted'. There are two values for this column: 1 (which means the student was accepted) and 0 (which means the student was not accepted). I want to find the percentage of accepted students.
I did this...
table(college$accepted)
which gave me the frequency of 1 and 0. (1 = 44,224 and 0 = 75,166). I then manually added those two values together (119,390) and divided the 44,224/119,390. This is fine and gets me the value I was looking for. But I would really like to know how I could do this with R code, since I'm sure there is a way to do it that I just haven't thought of.
Thanks!
Perhaps you can use prop.table like below
prop.table(table(college$accepted))["1"]
If it's a simple 0/1 column, then you only need to take the column mean.
mean_accepted <- mean(college$accepted)
You could first sum the column and then divide by the number of entries in the column:
sum(college$accepted)/length(college$accepted)
To make the code more explicit and describe your intent better, I suggest using a condition to identify the cases that meet your criteria for inclusion. For example:
college$accepted == 1
Then take the average of the logical vector to compute the proportion (between 0 and 1), multiply by 100 to make it a percentage.
100 * mean(college$accepted == 1, na.rm = TRUE)
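As a quick check that the approaches above agree, here is a small sketch on a made-up data frame. The eight values are invented for illustration and only stand in for the asker's college data.

```r
# Hypothetical stand-in for the asker's data frame: 3 accepted out of 8.
college <- data.frame(accepted = c(1, 0, 0, 1, 0, 1, 0, 0))

p_table <- as.numeric(prop.table(table(college$accepted))["1"])
p_mean  <- mean(college$accepted)
p_sum   <- sum(college$accepted) / length(college$accepted)
p_cond  <- mean(college$accepted == 1, na.rm = TRUE)

# All four give the same proportion; multiply by 100 for a percentage.
pct <- 100 * p_cond   # 37.5
```

The variants only differ when the column contains missing values: mean() and sum() return NA unless na.rm = TRUE is passed.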

Return matching names instead of binary variables in R

I'm new here and diving into R, and I'm encountering a problem while trying to solve a knapsack problem.
For optimization purposes I wrote a dynamic program in R, however, now that I am at the point of returning the items, which I succeeded in, I only get the binary numbers saying whether the item has been selected or not (1 = yes). Like this:
Select
[1] 1 0 0 1
However, now I would like the Select function to return the names of values instead of these binary values. Underneath I created an example of what my problem looks like.
This would be the data and a related data frame.
items <- c("Glasses","gloves","shoes")
grams <- c(4,2,3)
value <- c(100,20,50)
data <- data.frame(items,grams,value)
Now, I created various functions, with the final one clarifying whether a product has been selected by 1 (yes) or 0 (no). Like above. However, I would really like for it to return the related name of the item. Is there a manner to go around this by linking back to the dataframe created?
So that, instead of (in case all products are selected)
Select
[1] 1 1 1
it would say
Select
[1] Glasses gloves shoes
I believe I would have to create a new function. But as I mentioned, is there a good way to refer back to the data frame to take related values from another column in the data frame in case of a 1 (yes)?
I really hope my question is more clear now and someone can direct me in the right direction.
Best, Berber
Let's say your binary vector is
idx <- c(1, 0, 1)
Then
items[as.logical(idx)]
will give you the names of the selected items, and
items[!as.logical(idx)]
will give you the names of the unselected items.
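The same indexing idea can also be applied to the whole data frame, so that the weights and values of the selected items come along with the names. This is a minimal sketch using the question's example data; the Select vector here is a hypothetical result, not the output of the asker's knapsack routine.

```r
items <- c("Glasses", "gloves", "shoes")
data  <- data.frame(items, grams = c(4, 2, 3), value = c(100, 20, 50))

# Hypothetical 0/1 result of the knapsack routine, one entry per item.
Select <- c(1, 0, 1)

chosen      <- items[Select == 1]      # names of the selected items
picked_rows <- data[Select == 1, ]     # full rows for the selected items
total_value <- sum(picked_rows$value)  # combined value of the selection
```

With this Select vector, chosen contains "Glasses" and "shoes", and total_value is 150.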

for loop in R using if & print [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 8 years ago.
Maybe I'm thinking too hard on this but I need to create a for loop & if statement to find the highest value in my data set. We also have to write a print statement that prints it out & the day. There's 93 rows & 4 columns in the initial matrix. Column 4 has the needed data. The days are in column 1.
I don't know programming at all. So far this is what I got:
I created a vector out of the column with the data:
only.data <- c(data[,4])
Here's my feeble attempt at a for & if statement:
for (counter in 1:93) {
if (only.data >= data[,4])
print (only.data)
}
How do I get it to spit out the highest value using this method? It prints the max value 93 times and that's not what I want. Do I need to create the only.data vector or can I use the original matrix? I also need to print out the corresponding date next to the highest value.
ps - I know I can use the max function which is much quicker but that's not the assignment.
It seems like you are cheating, thus I won't post a full solution here, but only point you in the right direction
data[,4] is already a vector and there is no reason whatsoever to use c() on it. There is also no reason to save it in a new object only.data, although it potentially can make your loop faster as it won't need to index in each loop.
The idea of a loop is that you will use an index in it (although you don't have to, but there is no real reason not to). Thus, you are specifying the index in for(). Although you specified an index (counter), you haven't used it, thus your loop prints only.data regardless of anything you are doing.
All your if does is check whether only.data >= only.data in every iteration (which is obviously unnecessary).
Calculating the maximum in a loop is not such an obvious thing, as you are comparing a single value in each iteration, so you'll need some strategy. For example, you could create a dummy variable which is compared in each iteration against only.data[counter] to check whether it's bigger, and is replaced in case it's not.
To illustrate my last point, consider a toy example
set.seed(1)
only.data <- sample(10,10)
only.data
#[1] 3 4 5 7 2 8 9 6 10 1
You can see that the maximum value is in the 9th position. Now we will assign the first value of this vector to a dummy variable and use a for loop to find the maximum.
dummy <- only.data[1]
dummy
## [1] 3
for (counter in only.data) {
if (counter > dummy) dummy <- counter
}
dummy
## [1] 10
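The question also asks for the day that goes with the highest value. One way to get it, sketched here on the same toy vector (typed out by hand so the result does not depend on the random number generator), is to loop over positions instead of values and remember where the current maximum was found. The names best and best.pos are made up for this illustration.

```r
# The toy vector from the example above, typed out explicitly.
only.data <- c(3, 4, 5, 7, 2, 8, 9, 6, 10, 1)

best     <- only.data[1]  # largest value seen so far
best.pos <- 1             # position of that value

for (i in seq_along(only.data)) {
  if (only.data[i] > best) {
    best     <- only.data[i]
    best.pos <- i
  }
}

print(paste("max", best, "at position", best.pos))
```

With a matrix like the asker's, the matching day would then be the column-1 entry in row best.pos, for example data[best.pos, 1].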

How to calculate the expected cost?

I am not good at probability and I know it's not a coding problem directly. But I wish you would help me with this. While I was solving a computation problem I found this difficulty:
Problem definition:
The Little Elephant from the Zoo of Lviv is going to the Birthday
Party of the Big Hippo tomorrow. Now he wants to prepare a gift for
the Big Hippo. He has N balloons, numbered from 1 to N. The i-th
balloon has the color Ci and it costs Pi dollars. The gift for the Big
Hippo will be any subset (chosen randomly, possibly empty) of the
balloons such that the number of different colors in that subset is at
least M. Help Little Elephant to find the expected cost of the gift.
Input
The first line of the input contains a single integer T - the number
of test cases. T test cases follow. The first line of each test case
contains a pair of integers N and M. The next N lines contain N pairs
of integers Ci and Pi, one pair per line.
Output
In T lines print T real numbers - the answers for the corresponding test cases. Your answer will be considered correct if it has at most 10^-6 absolute or relative error.
Example
Input:
2
2 2
1 4
2 7
2 1
1 4
2 7
Output:
11.000000000
7.333333333
So here I don't understand why the expected cost of the gift for the second case is 7.333333333. The expected cost equals Summation[xP(x)], and according to this formula I get 33/2.
Yes, it is a CodeChef question. But I am not asking for the solution or the algorithm (taking the algorithm from someone else would not improve my coding skills). I just don't understand their example, so I am not able to start thinking about the algorithm.
Please help. Thanks in advance!
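In the second case M = 1, so every non-empty subset qualifies (the empty subset has zero colors). There are three possible choices, {1}, {2} and {1, 2}, with costs 4, 7 and 11. Each is equally likely, so the expected cost is (4 + 7 + 11) / 3 = 22 / 3 = 7.33333.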
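To check both sample outputs by hand, a brute-force enumeration is enough. The sketch below lists every subset of the balloons, keeps those with at least m distinct colors, and averages their costs. The function name expected_cost is made up for this illustration, and the approach is exponential in the number of balloons, so it only confirms the reading of the example and is not a contest solution.

```r
# Brute-force check of the sample cases (illustrative only).
expected_cost <- function(colors, prices, m) {
  n <- length(prices)
  costs <- c()
  for (mask in 0:(2^n - 1)) {
    # Bit k of mask says whether balloon k is in the subset.
    picked <- (mask %/% 2^(seq_len(n) - 1)) %% 2 == 1
    if (length(unique(colors[picked])) >= m) {
      costs <- c(costs, sum(prices[picked]))
    }
  }
  mean(costs)  # every valid subset is equally likely
}

colors <- c(1, 2)
prices <- c(4, 7)
case1 <- expected_cost(colors, prices, 2)  # only {1, 2} qualifies
case2 <- expected_cost(colors, prices, 1)  # {1}, {2}, {1, 2} qualify
```

Here case1 is 11 and case2 is 22/3, matching the sample output.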

Running sum on a column conditional on value

I have a vector of binary variables which state whether a product is on promotion in the period. I'm trying to work out how to calculate the duration of each promotion and the duration between promotions.
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
So in other words: if promo.flag is same as previous period then running.total + 1, else running.total is reset to 1
I've tried playing with apply functions and cumsum but can't manage to get the conditional reset of running total working :-(
The output I need is:
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
rolling.sum = c(1,2,1,1,1,2,1,2,3,1,1,2,0)
Can anybody shed any light on how to achieve this in R?
It sounds like you need run length encoding (via the rle command in base R).
unlist(sapply(rle(promo.flag)$lengths,seq))
Gives you a vector 1 2 1 1 1 2 1 2 3 1 1 2 1. Not sure what you're going for with the zero at the end, but I assume it's a terminal condition and easy to change after the fact.
This works because rle() returns a list of two components: lengths, which holds how many times each value is repeated consecutively, and values, which holds the repeated values. seq, when fed a single integer, gives you a sequence from 1 to that number. sapply then calls seq on each number in rle()$lengths, generating a list of the mini sequences, and unlist turns that list into a vector.
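A slightly more compact variant, shown as a sketch: base R's sequence() builds the concatenated 1..k runs directly from the run lengths, and the rle() result also gives the promotion durations and the gaps between promotions that the question asks about. The names promo.durations and gap.durations are made up for this illustration.

```r
promo.flag <- c(1,1,0,1,0,0,1,1,1,0,1,1,0)

runs <- rle(promo.flag)
# runs$lengths: 2 1 1 2 3 1 2 1
# runs$values:  1 0 1 0 1 0 1 0

# Running count that resets whenever the flag changes.
rolling.sum <- sequence(runs$lengths)

# Length of each promotion and of each gap between promotions.
promo.durations <- runs$lengths[runs$values == 1]
gap.durations   <- runs$lengths[runs$values == 0]
```

For this vector the promotions last 2, 1, 3 and 2 periods, and the non-promotion runs last 1, 2, 1 and 1 periods.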
