I want to calculate the choice-switching probability by group first (the group is user in the code below). Then I will average the group-level probabilities to get a total probability. I have tens of thousands of groups, so I need the code to be fast. My code is a for loop, which takes more than 10 minutes to run. The same logic in Excel takes only a few seconds.
The switching probability from choice m to choice n for a particular user is defined as the share of observations whose choice is n at period t and m at period t-1.
My original code first tags the first and last purchase with a for loop, then uses another for loop to build the switching matrix. I am only able to create the switching matrix for the whole data set, not by group. Even so, it is still very slow, and adding user would make it even slower.
t<-c(1,2,1,1,2,3,4,5)
user<-c('A','A','B' ,'C','C','C','C','C')
choice<-c(1,1,2,1,2,1,3,3)
dt<-data.frame(t,user,choice)
t user choice
1 A 1
2 A 1
1 B 2
1 C 1
2 C 2
3 C 1
4 C 3
5 C 3
# **step one** create a second choice column for later construction of the switching matrix
#Label first purchase and last purchase is zero
for (i in 1:nrow(dt))
{ ifelse (dt$user[i+1]==dt$user[i],dt$newcol[i+1]<-0,dt$newcol[i+1]<-1) }
# **step two** create switching matrix
# switching.m is an empty matrix with the size of total choice: 3x3 here
length(unique(dt$user))
total.choice<-3
switching.m<-matrix(0,nrow=total.choice,ncol=total.choice)
for (i in 1:total.choice)
{
for(j in 1:total.choice)
{
if(length(nrow(switching.m[switching.m[,1]==i& switching.m[,2]==j,])!=0))
{switching.m[i,j]=nrow(dt[dt[,1]==i&dt[,2]==j,])}
else {switching.m[i,j] <- 0}
}
}
The desired output for a particular user/group looks like this. The output should have the same matrix size even if the user never makes a particular choice.
# take user C
#output for switching matrix
second choice
first 1 2 3
1 0 1 1
2 1 0 0
3 0 0 1
#output for switching probability
second choice
first 1 2 3
1 0 0.5 0.5
2 1 0 0
3 0 0 1
We could use table and prop.table after splitting by 'user'
lst <- lapply(split(dt, dt$user), function(x)
table(factor(x$choice, levels= 1:3), factor(c(x$choice[-1], NA), levels=1:3)))
As mentioned by @nicola, it is more compact to split the 'choice' column by 'user'
lst <- lapply(split(dt$choice, dt$user), function(x)
table(factor(x, levels = 1:3), factor(c(x[-1], NA), levels = 1:3)))
lst$C
# 1 2 3
#1 0 1 1
#2 1 0 0
#3 0 0 1
prb <- lapply(lst, prop.table, 1)
prb$C
# 1 2 3
# 1 0.0 0.5 0.5
# 2 1.0 0.0 0.0
# 3 0.0 0.0 1.0
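To get the total probability the question asks for, the per-user matrices in prb can be averaged cell by cell. This is only a minimal sketch, not a definitive method. It assumes that a row with no observed transition for a user (which prop.table returns as NaN) should simply be left out of the average. The names arr and avg are introduced here for illustration.

```r
dt <- data.frame(
  t      = c(1, 2, 1, 1, 2, 3, 4, 5),
  user   = c("A", "A", "B", "C", "C", "C", "C", "C"),
  choice = c(1, 1, 2, 1, 2, 1, 3, 3)
)

# per-user transition counts and row-wise probabilities, as in the answer above
lst <- lapply(split(dt$choice, dt$user), function(x)
  table(factor(x, levels = 1:3), factor(c(x[-1], NA), levels = 1:3)))
prb <- lapply(lst, prop.table, 1)

# stack the 3 x 3 matrices into a 3 x 3 x n_users array
arr <- array(unlist(prb), dim = c(3, 3, length(prb)))

# cell-wise mean over users; NaN rows (assumption: no information) are skipped
avg <- apply(arr, c(1, 2), mean, na.rm = TRUE)
avg
```

With the sample data, only users A and C contribute to row 1, so that row averages (1, 0, 0) and (0, 0.5, 0.5). If a user without transitions should count as zeros instead, replace the NaN values before averaging.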
I am trying to combine several binary variables into one categorical variable. I have ten categorical variables, each describing a task of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (=Yes) of each binary variable will be a separate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers, and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to show them in one graphic (using ggplot). The graphic should display how many respondents report the above-mentioned tasks as part of their work.
If you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR,Fundraising)))
result <- sapply(df, function(x) which(x==1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
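The reverse direction can be sketched as follows. This is only an illustration of the idea: decode_row is a hypothetical helper name, and it assumes the same coding as above (1 = yes, 2 = no, with the first column as the lowest binary digit).

```r
decode_row <- function(code, n_cols) {
  # i-th binary digit of code: 0 means the column held 1 (yes), 1 means it held 2 (no)
  bits <- (code %/% 2^(seq_len(n_cols) - 1)) %% 2
  bits + 1
}

# row 4 of the example has New_var = 6
decode_row(6, 3)
```

For New_var = 6 this recovers Personal_Help = 1, PR = 2, Fundraising = 2, which matches row 4 of the table.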
As @ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row:
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
I have a dataframe which contains 100,000 rows. It looks like this:
Value
1
2
-1
-2
0
3
4
-1
3
I want to create an extra column (column B), which consists of 0s and 1s.
It is 0 by default, but when there are 5 data points in a row that are all positive OR all negative, it should be 1. This applies only if they are consecutive (e.g. when the run is positive and a negative number appears, the count starts again).
Value B
1 0
2 0
1 0
2 0
2 1
3 1
4 1
-1 0
3 0
I tried different loops, but they didn't work. I also tried to convert the whole DF to a list (and loop over the list), unfortunately without success.
Here's an approach that uses the rollmean function from the zoo package.
set.seed(1000)
df = data.frame(Value = sample(-9:9,1000,replace=T))
sign = sign(df$Value)
library(zoo)
rolling = rollmean(sign,k=5,fill=0,align="right")
df$B = as.numeric(abs(rolling) == 1)
1. I generated 1000 values with positive and negative sets.
2. Extract the sign of the values: this will be -1 for negative, 1 for positive and 0 for 0.
3. Calculate the right-aligned rolling mean of 5 values (it will average x[1:5], x[2:6], ...). This will be 1 or -1 if all the values in a row are positive or negative (respectively).
4. Take the absolute value and store the comparison against 1. This is a logical vector that turns into 0s and 1s based on your conditions.
Note - there's no need for loops. This can all be vectorised (once we have the rolling mean calculated).
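If installing zoo is not an option, the same idea can be sketched in base R. This swaps rollmean for stats::filter with a trailing window and compares a rolling sum against 5 instead of a rolling mean against 1. Treat it as an illustration under those assumptions; run_sum is a name made up for this example.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
s <- sign(Value)

# trailing sum of the last 5 signs (sides = 1); the first 4 positions are NA
run_sum <- as.numeric(stats::filter(s, rep(1, 5), sides = 1))

# 1 only where all five signs agree and are non-zero
B <- as.integer(!is.na(run_sum) & abs(run_sum) == 5)
B
```

Calling stats::filter with its namespace avoids clashes with other packages that also export a function named filter.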
This will work. Not the most efficient way to do it but the logic is pretty transparent -- just check if there's only one unique sign (i.e. +, -, or 0) for each sequence of five adjacent rows:
dat <- data.frame(Value=c(1,2,1,2,2,3,4,-1,3))
dat$new_col <- NA
dat$new_col[1:4] <- 0
for (x in 5:nrow(dat)){
if (length(unique(sign(dat$Value[(x-4):x])))==1){
dat$new_col[x] <- 1
} else {
dat$new_col[x] <- 0
}
}
Use the cumsum(...diff(...) <condition>) idiom to create a grouping variable, and ave to calculate the indices within each group.
d$B2 <- ave(d$Value, cumsum(c(0, diff(sign(d$Value)) != 0)), FUN = function(x){
as.integer(seq_along(x) > 4)})
# Value B B2
# 1 1 0 0
# 2 2 0 0
# 3 1 0 0
# 4 2 0 0
# 5 2 1 1
# 6 3 1 1
# 7 4 1 1
# 8 -1 0 0
# 9 3 0 0
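A closely related variant, shown here as a self-contained sketch, uses rle and sequence to number the positions inside each run of equal signs. It is an alternative to the cumsum/diff idiom, not the same code, and it assumes that a run of zeros should never be flagged.

```r
Value <- c(1, 2, 1, 2, 2, 3, 4, -1, 3)
s <- sign(Value)

# position of each element within its run of identical signs
run_pos <- sequence(rle(s)$lengths)

# flag the fifth and later elements of a run, excluding runs of zeros
B <- as.integer(run_pos >= 5 & s != 0)
B
```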
I'm trying to assess the performance of a simple prediction model in R by binning the predictions into defined intervals and then comparing them with the corresponding (binned) actual values.
I have two vectors actual and predicted as shown:
> actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
> predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
I need to perform binning here. First off, the values of 'actual' are factorized/discretized into different levels, say:
0-5: Level 1
6-10: Level 2
...
41-45: Level 9
Now, I've to bin the values of 'predicted' also into the above mentioned buckets.
I tried to achieve this using the cut() function in R:
binCount <- 5
binActual <- cut(actual,labels=1:binCount,breaks=binCount)
binPred <- cut(predicted,labels=1:binCount,breaks=binCount)
However, the second element in predicted (98.01) is labelled as 5, although it doesn't actually fall in the desired interval.
I feel that using a different binCount for predicted will not help. Can anyone please suggest a solution for this?
I'm not 100% sure of what you want to do.
However, from what I understand, you want to return, for each element of each vector, the class it falls into, given a single set of classes that covers every value in both vectors actual and predicted.
If that is what you want, note that your script first builds classes from the range of actual alone and uses them to classify your first vector.
Then it builds a new set of classes from the range of predicted.
The classification is not the same anymore.
Assuming that I understood what you want to do, I'd rather write:
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)
temporary = c(actual, predicted)
maxi <- max(temporary)
mini <- min(temporary)
binCount <- 5
s <- seq(maxi, mini, length.out = binCount)
s = sort(s)
binActual <- cut(actual,breaks=s, include.lowest = T, labels = 1:(length(s)-1))
binPred <- cut(predicted,breaks=s, include.lowest = T, labels = 1:(length(s)-1))
It gives :
> binActual
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
> binPred
[1] 1 4 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Levels: 1 2 3 4
I'm not sure it is what you're looking for, so let me know, I might be able to help you.
Best wishes.
Is this what you want?
intervals <- cbind(seq(0, 40, length = 9), seq(5, 45, length = 9))
cutFixed <- function(x, intervals) {
sapply(x, function(x) ifelse(x < min(intervals) | x >= max(intervals), NA, which(x >= intervals[,1] & x < intervals[,2])))
}
This gives the following result
> cutFixed(actual, intervals)
[1] 1 1 1 1 9 1 1 2 1 1 1 1 1 1 2 1 1 1 4 1
> cutFixed(predicted, intervals)
[1] 1 NA 1 1 7 1 1 1 1 1 1 3 1 2 1 1 1 2 1
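For comparison, cut() itself accepts a fixed vector of break points, so both vectors can be binned against the same intervals. The sketch below assumes right-closed intervals [0,5], (5,10], ..., (40,45], so that 5 lands in level 1 as in the question's "0-5: Level 1" (cutFixed above puts 5 in level 2). Values outside 0 to 45 become NA.

```r
actual <- c(0,2,0,0,41,1,3,5,2,0,0,0,0,0,6,1,0,0,15,1)
predicted <- c(3.38,98.01,3.08,4.89,31.46,3.88,4.75,4.64,3.11,3.15,3.42,10.42,3.18,5.73,4.20,3.34,3.95,5.94,3.99)

# the same fixed break points for both vectors: 0, 5, 10, ..., 45
breaks <- seq(0, 45, by = 5)

# labels = FALSE returns the integer level (1 to 9) instead of a factor
binActual <- cut(actual, breaks = breaks, include.lowest = TRUE, labels = FALSE)
binPred   <- cut(predicted, breaks = breaks, include.lowest = TRUE, labels = FALSE)
```

Here 98.01 is reported as NA rather than being forced into the top level, which makes out-of-range predictions easy to spot.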
How do I apply a function that can "see" the preceding result when operating by rows?
This comes up a lot, but my current problem requires a running total by student that resets if the total doesn't get to 5.
Example Data:
> df
row Student Absent Consecutive.Absences
1 A 0 0
2 A 1 1
3 A 1 2
4 A 0 0 <- resets to zero if under 5
5 A 0 0
6 A 1 1
7 A 1 2
8 A 1 3
9 B 1 1 <- starts over for new factor (Student)
10 B 1 2
11 B 0 0
12 B 1 1
13 B 1 2
14 B 1 3
15 B 1 4
16 B 0 0
17 B 1 1
18 B 1 2
19 B 1 3
20 B 1 4
21 B 1 5
22 B 0 5 <- gets locked at 5
23 B 0 5
24 B 1 6
25 B 1 7
I've tried doing this with a huge matrix of shifted vectors.
I've tried doing this with the apply family of functions and half of them do nothing, the other half hit 16GB of RAM and crash my computer.
I've tried straight looping and it takes 4+ hours (it's a big data set)
What bothers me is how easy this is in Excel. Usually R runs circles around Excel both in speed and writability, which leads me to believe I'm missing something elementary here.
Forgetting even the more challenging ("lock at 5") feature of this, I can't even get a cumsum that resets. There is no combination of factors I can think of to group for ave like this:
Consecutive.Absences = ave(Absent, ..., cumsum)
Obviously, grouping on Student will just give the Total Cumulative Absences -- it "remembers" the kid's absence over the gaps because of the split and recombine in ave.
So as I said, the core of what I don't know how to do in R is this:
How do I apply a function that can "see" the preceding result when operating by rows?
In Excel it would be easy:
C3 = IF($A3=$A2,$B3+$C2,$B3)*$B3
This Excel formula is shown without the 5-absence lock for readability.
Once I figure out how to apply a function that looks at previous results of the same function in R, I'll be able to figure out the rest.
Thank you in advance for your help--this will be very useful in a lot of my applications!
Genuinely,
Sam
UPDATE:
Thank you everyone for the ideas on how to identify if a student has 5 consecutive absences!
However, that's easy enough to do in the database at the STUDENTS table. What I need to know is the number of consecutive absences by student in the attendance record itself for things like, "Do we count this particular attendance record when calculating other summary statistics?"
If you're looking to apply a function to every element in a vector while making use of the result computed for the previous element, you might want to check out "Reduce", with the accumulate parameter set to TRUE.
Here's an example:
##define your function that takes two parameters
##these are the 'previous' and the 'current' elements
runSum <- function(sum, x){
res = 0
if (x == 1){
res = sum + 1
}
else if (x == 0 & sum < 5){
res = 0
}
else{
res = sum
}
res
}
#lets look at the absent values from subject B
x = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1)
Reduce(x=x, f=runSum, accumulate=T)
# [1] 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
It's fairly easy to identify the students with one or more runs of 5:
tapply(dfrm$Absent, dfrm$Student, function(x) { r <- rle(x); r$values[r$lengths >= 5] })
$A
integer(0)
$B
[1] 1
Look for any values of "1" in the result:
tapply(dfrm$Absent, dfrm$Student, function(x) { r <- rle(x); 1 %in% r$values[r$lengths >= 5] })
A B
FALSE TRUE
I also struggled through to a Reduce solution (but am second in priority to @kithpradhan):
ave(dfrm$Absent, dfrm$Student,
FUN= function(XX)
Reduce(function(x,y) if( x[1] >= 5){ y+x[1]
} else{ x[1]*y+y } , #Resets to 0 if y=0
XX, accumulate=TRUE)
)
#[1] 0 1 2 0 0 1 2 3 1 2 0 1 2 3 4 0 1 2 3 4 5 5 5 6 7
For the record, you can also create your own Reduce-derivative which receives f and x, and applies f(x) on its output until x == f(x) or maxiter is reached:
ireduce = function(f, x, maxiter = 50){
i = 1
while(!identical(f(x), x) & i <= maxiter) {x = f(x); i = i+1}; x
}
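For the simpler case raised in the question, a cumulative sum that resets at every 0 without the lock at 5, a fully vectorised sketch is possible by adding a second grouping variable to ave. The assumption is that cumsum(Absent == 0) is an acceptable run identifier: it increases at every 0, so each group is one 0 (if any) followed by the 1s after it. The column name Run is made up for this example.

```r
df <- data.frame(
  Student = rep(c("A", "B"), c(8, 17)),
  Absent  = c(0,1,1,0,0,1,1,1,
              1,1,0,1,1,1,1,0,1,1,1,1,1,0,0,1,1)
)

# group by student and by run id, then take the cumulative sum inside each group
df$Run <- ave(df$Absent, df$Student, cumsum(df$Absent == 0), FUN = cumsum)
df$Run
```

This reproduces the Consecutive.Absences column up to row 21. The rows after that differ because the lock at 5 depends on the previous result, which is where Reduce with accumulate = TRUE comes in.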
I have quite a big data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records that come before the fifth record after the first record with x>0.
So we are interested in only two columns, ID and x. The data frame is sorted by ID.
It is fairly easy to do this using loops, but loops don't perform well on such a big data frame.
How can I do it in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split, apply, combine problems - I like using plyr. There are alternatives if speed becomes an issue, but for most things - plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed that to ddply() to operate on each chunk of the data based on ID.
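fun <- function(x, column, threshold, numplus){
# index of the first row whose value exceeds the threshold (NA if there is none)
first <- which(x[[column]] > threshold)[1]
start <- first + numplus
# assumption: drop the whole group when no such row exists or too few rows follow it
if (is.na(start) || start > nrow(x)) return(x[0, ])
x[seq(from = start, to = nrow(x)), ]
}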
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
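Since the question asks for a 'vector style' solution on millions of rows, here is a hedged base R sketch that avoids plyr. It relies on ave to compute, per ID, each row's position and the position of the first x > 0. The names pos, first_pos and keep are introduced for illustration, and an ID without any x > 0 is assumed to be dropped entirely.

```r
dat <- data.frame(
  ID = rep(1:2, c(11, 9)),
  x  = c(0,0,5,0,3,4,0,9,0,0,6,
         0,1,0,4,5,8,0,1,3)
)

# position of each row within its ID
pos <- ave(seq_along(dat$ID), dat$ID, FUN = seq_along)

# position of the first x > 0 within each ID, repeated on every row of that ID
first_pos <- ave(dat$x, dat$ID, FUN = function(v) which(v > 0)[1])

# keep rows from the fifth record after that first positive one
keep <- !is.na(first_pos) & pos >= first_pos + 5
res <- dat[keep, ]
res
```

For very large data, a grouped operation in a package such as data.table may be faster still, but that is outside what this sketch shows.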