I'm trying to assess the performance of a simple prediction model in R by discretizing the predictions into defined intervals (bins) and then comparing them with the correspondingly binned actual values.
I have two vectors, actual and predicted, as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First, the values of actual are discretized into levels of width five, say:
0-5: Level 1, 6-10: Level 2, ..., 41-45: Level 9
Now I have to bin the values of predicted into the same buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual, labels = 1:binCount, breaks = binCount)
binPred <- cut(predicted, labels = 1:binCount, breaks = binCount)
However, the second element of predicted (98.01) is labelled 5, even though it does not fall in the desired interval.
I feel that using a different binCount for predicted will not help. Can anyone please suggest a solution?
I'm not 100% sure of what you want to do. From what I understand, you want to return, for each element of both vectors, the class it falls in, given a single set of classes that covers every value from either actual or predicted.
If that is what you want, the problem is that your script creates one set of classes covering only the range of actual and classifies actual with it, then creates a new, different set of classes for predicted. The two classifications are no longer comparable.
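You can see the mismatch directly by printing the break points cut() actually constructs for each vector (the labels below are approximate):
levels(cut(actual, breaks = 5))     # intervals spanning roughly 0 to 41
levels(cut(predicted, breaks = 5))  # intervals spanning roughly 3 to 98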
Assuming that I understood what you want to do, I'd rather write:
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
# build one common set of breaks spanning both vectors
temporary <- c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
s <- seq(mini, maxi, length.out = binCount)  # binCount breakpoints, ascending
binActual <- cut(actual, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
binPred <- cut(predicted, breaks = s, include.lowest = TRUE, labels = 1:(length(s) - 1))
It gives:
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure it is what you're looking for, so let me know; I might be able to help further.
Best wishes.
Is this what you want?
# nine fixed intervals: [0,5), [5,10), ..., [40,45)
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
  sapply(x, function(x)
    ifelse(x < min(intervals) | x >= max(intervals), NA,       # outside all intervals -> NA
           which(x >= intervals[, 1] & x < intervals[, 2])))   # index of the matching interval
}
This gives the following result:
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
I am trying to combine several binary variables into one categorical variable. I have ten categorical variables, each describing a task of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (= Yes) of each binary variable becomes a separate level of the categorical variable.
To illustrate what I imagine (wrong code, obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers, and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into one categorical variable is to plot them in a single graphic (using ggplot). The graphic should display how many respondents report each of the above-mentioned tasks as part of their work.
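For that graphic itself, a minimal sketch (using only the three example vectors above as stand-ins for the ten real variables) might be:
library(ggplot2)
df <- data.frame(Personal_Help, PR, Fundraising)
counts <- data.frame(task = names(df),
                     n    = colSums(df == 1))  # respondents reporting each task
ggplot(counts, aes(x = task, y = n)) + geom_col()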
If you're interested only in the first occurrence of 1 among your variables:
# transpose so each respondent becomes a column, then take the first variable equal to 1
df <- data.frame(t(data.frame(Personal_Help, PR, Fundraising)))
result <- sapply(df, function(x) which(x == 1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1, as asked in the comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated with it.
One way to do this is to imagine each column as a digit in a three-digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each, would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
# column i contributes 2^(i - 1) whenever its value is 2 ("no")
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, with ten tasks there will be 2^10 = 1024 possible combinations of tasks, so New_var will range between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row, as long as you know the original column order.
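A hedged sketch of that reverse step (decode() is a made-up helper, assuming the column order above):
# peel off one binary digit per column: digit 0 -> 1 ("yes"), digit 1 -> 2 ("no")
decode <- function(code, n_cols) {
  sapply(seq_len(n_cols), function(i) (code %/% 2^(i - 1)) %% 2 + 1)
}
decode(2, 3)  # 1 2 1 -- row 1 of the example data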
As @ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row: since 1 < 2, it returns the position of the first 1.
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
max.col may help here (negating the values turns the smallest entry, 1, into the largest, and "first" picks the first column in case of ties):
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
In R, how do we pass a vector instead of a scalar to the k argument of the lag function, i.e. in Lag(x, k = 2)? Instead of 2 I want to use a vector, because I want to lag each row by a different amount: one row could have a lag of 3, another 0, etc.
Example:
a  lags  d
1   0    1
2   1    1
4   2    1
3   0    3
1   1    3
I think you may need to write your own function for this task. I wrote one that I think will do what you need, or at least point you in the right direction:
x1 <- c(75,98,65,45,78,94,123,54) #a fake data set for us to lag
y1 <- c(2,3,1,4,1,2,3,5) #vector of values to lag by
# the function below takes the data x and, for each lag in y,
# returns the element that many positions back from the end of x
dynlag <- function(x, y) {
  x[length(x) - y]
}
#test out the function
dynlag(x1,y1)
Hope this helps. :)
Here is a solution with index arithmetic:
D <- read.table(header=TRUE, text=
'a lags d
1 0 1
2 1 1
4 2 1
3 0 3
1 1 3')
i <- seq(length(D$a))     # row positions 1..n
erg <- D$a[i - D$lags]    # each element lagged by its own amount
all.equal(erg, D$d)
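One caveat worth adding (my own note, not part of the original answer): if any i - D$lags falls below 1, x[0] has length zero and silently shortens erg. Replacing out-of-range indices with NA keeps the alignment:
idx <- i - D$lags
idx[idx < 1] <- NA    # an NA index returns NA instead of dropping the element
erg <- D$a[idx]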
I want to calculate the choice-switching probability by group first (user in the code below). Then I will average the group-level probabilities to get a total probability. I have tens of thousands of groups, so the code needs to be fast. My current code is a for loop that takes more than 10 minutes to run; the same logic in Excel takes only a few seconds.
The switching probability from choice m to n for a particular user is defined as the share of observations whose choice is n at period t and m at period t-1.
My original code tags the first and last purchase with one for loop, then uses another for loop to build the switching matrix. I am only able to create the switching matrix for the whole data set, not by group. Even so, it is still very slow, and adding the user dimension would make it slower.
t<-c(1,2,1,1,2,3,4,5)
user<-c('A','A','B' ,'C','C','C','C','C')
choice<-c(1,1,2,1,2,1,3,3)
dt<-data.frame(t,user,choice)
t user choice
1 A 1
2 A 1
1 B 2
1 C 1
2 C 2
3 C 1
4 C 3
5 C 3
# **step one** create a flag column for later construction of the switching matrix
# (mark where a new user's purchase history starts: 1 when the user changes, 0 otherwise)
dt$newcol <- 0
for (i in 1:(nrow(dt) - 1))
{ ifelse(dt$user[i + 1] == dt$user[i], dt$newcol[i + 1] <- 0, dt$newcol[i + 1] <- 1) }
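(As an aside, a vectorized equivalent of that loop, assuming the flag should mark the first row of each user:)
dt$newcol <- as.integer(!duplicated(dt$user))  # 1 at each user's first row, 0 elsewhere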
# **step two** create the switching matrix
# switching.m is an empty matrix sized total.choice x total.choice (3x3 here)
total.choice <- 3
switching.m <- matrix(0, nrow = total.choice, ncol = total.choice)
for (i in 1:total.choice) {
  for (j in 1:total.choice) {
    # count transitions whose previous choice is i and current choice is j,
    # skipping pairs that straddle two users (newcol == 1 marks a new user)
    switching.m[i, j] <- sum(dt$choice[-nrow(dt)] == i &
                             dt$choice[-1] == j &
                             dt$newcol[-1] == 0)
  }
}
The desired output for a particular user/group looks like this. The output should keep the same matrix size even if the user never makes a particular choice:
# take user C
#output for switching matrix
second choice
first 1 2 3
1 0 1 1
2 1 0 0
3 0 0 1
#output for switching probability
second choice
first 1 2 3
1 0 0.5 0.5
2 1 0 0
3 0 0 1
We could use table and prop.table after splitting by 'user':
lst <- lapply(split(dt, dt$user), function(x)
table(factor(x$choice, levels= 1:3), factor(c(x$choice[-1], NA), levels=1:3)))
As @nicola mentioned, it is more compact to split the 'choice' column by 'user':
lst <- lapply(split(dt$choice, dt$user), function(x)
table(factor(x, levels = 1:3), factor(c(x[-1], NA), levels = 1:3)))
lst$C
# 1 2 3
#1 0 1 1
#2 1 0 0
#3 0 0 1
prb <- lapply(lst, prop.table, 1)  # convert each count matrix to row-wise proportions
prb$C
# 1 2 3
# 1 0.0 0.5 0.5
# 2 1.0 0.0 0.0
# 3 0.0 0.0 1.0
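To get the overall probability the question asks for, one hedged follow-up sketch: stack the per-user matrices and average cell by cell. Rows a user never visited come out of prop.table as NaN, so drop them from each cell's mean:
arr <- simplify2array(prb)                   # 3 x 3 x n_users array
overall <- apply(arr, c(1, 2), mean, na.rm = TRUE)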
I'm doing a failure analysis, for which I'd like to try some different scenarios and some random trials. So far I've done this with the mosaic package, and it is working out great.
In one specific scenario I want to generate a vector of (semi-)random numbers drawn from different distributions. No problem so far.
Now I want a defined number of negative numbers in this vector. For example, I want between 0 and 5 negative numbers in a vector of 25 numbers.
I thought I could get 5 random ones first with something like rbinom(n = 25, prob = 5/25, size = 1), but of course 25 draws with probability 5/25 can produce more than 5 ones. This seems a dead end.
I could get it done with some for loops, but probably something easier exists. I've tried all sorts of sample, seq, and shuffle combinations, but I cannot get it to work so far.
Does anyone have any ideas or suggestions?
If you have a vector x where all elements are >= 0, let's say drawn from a Poisson distribution:
x = rpois(25, lambda=3)
you can make a random 5 of them negative by doing
x * sample(rep(c(1, -1), c(length(x) - 5, 5)))
This works because
rep(c(1, -1), c(length(x) - 5, 5))
will be
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 -1 -1 -1 -1
and sample(rep(c(1, -1), c(length(x) - 5, 5))) simply shuffles them up randomly:
sample(rep(c(1, -1), c(length(x) - 5, 5)))
# [1] 1 1 -1 1 1 1 1 1 1 1 1 -1 1 1 1 -1 -1 1 1 1 -1 1 1 1 1
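One caveat (my note, not the answerer's): a Poisson draw can contain zeros, and flipping a zero's sign still leaves zero, so you may end up with fewer than 5 strictly negative values. If exactly 5 matter, flip signs only at nonzero positions (assuming at least 5 exist):
neg_idx <- sample(which(x > 0), 5)  # 5 random positions among the nonzero entries
x[neg_idx] <- -x[neg_idx]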
I can suggest a very straightforward solution guaranteeing exactly 5 negative values and working for any continuous distribution. The idea is to subtract each value from the 6th largest, so exactly the 5 values above it come out negative:
x <- rnorm(25)
res <- sort(x, decreasing = TRUE)[6] - x   # 6th largest minus each value
#### [1] 0.4956991 1.5799885 2.4207497 1.1639569 0.2161187 0.2443917 -0.4942884 -0.2627706 1.5188197
#### [10] 0.0000000 1.6081025 1.4922573 1.4828059 0.3320079 0.3552913 -0.6435770 -0.3106201 1.5074491
#### [19] 0.6042724 0.3707655 -0.2624150 1.1671077 2.4679686 1.0024573 0.2453597
sum(res<0)
#### [1] 5
It also works for discrete distributions, but only if there are no ties.
How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't reach 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried the apply family of functions: half of them do nothing, and the other half hit 16GB of RAM and crash my computer.
I've tried straight looping, and it takes 4+ hours (it's a big data set).
What bothers me is how easy this is in Excel. Usually R runs circles around Excel in both speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group by for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the total cumulative absences -- it "remembers" the kid's absences across the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This Excel formula is displayed without the 5-absence lock for readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you, everyone, for the ideas on how to identify whether a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need is the number of consecutive absences, by student, in the attendance record itself, for questions like: "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the previous element's result, you might want to check out Reduce with the accumulate parameter set to TRUE.
Here's an example:
## define your function of two parameters:
## the running total so far ('previous') and the current element
runSum <- function(sum, x) {
  if (x == 1) {
    sum + 1              # absent today: extend the streak
  } else if (x == 0 && sum < 5) {
    0                    # present and streak under 5: reset
  } else {
    sum                  # streak already locked at 5+: keep it
  }
}
# let's look at the absent values from subject B
x <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x = x, f = runSum, accumulate = TRUE)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
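To apply this per student, a small follow-up sketch (assuming the df from the question):
# ave() restarts the running total for each Student
df$Consecutive.Absences <- ave(df$Absent, df$Student,
    FUN = function(v) Reduce(f = runSum, x = v, accumulate = TRUE))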
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) { r <- rle(x); r$values[r$lengths >= 5] })
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) { r <- rle(x); 1 %in% r$values[r$lengths >= 5] })
A B
FALSE TRUE
I also struggled through to a Reduce solution (but second in priority to @kithpradhan):
ave(dfrm$Absent, dfrm$Student,
    FUN = function(XX)
      Reduce(function(x, y)
               if (x[1] >= 5) { y + x[1]     # locked at 5+: keep accumulating
               } else { x[1] * y + y },      # resets to 0 whenever y = 0
             XX, accumulate = TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce derivative that receives f and x and applies f to x repeatedly until x == f(x) or maxiter is reached:
ireduce <- function(f, x, maxiter = 50) {
  i <- 1
  while (!identical(f(x), x) && i <= maxiter) { x <- f(x); i <- i + 1 }
  x
}
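A hypothetical usage example (halve_until_small is made up, just to show the fixed-point behaviour):
# repeatedly halve x; once it is numerically small, f(x) returns x unchanged
halve_until_small <- function(x) if (abs(x) < 1e-8) x else x / 2
ireduce(halve_until_small, 1)  # stops once identical(f(x), x), at about 7.5e-09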