My dataframe consists of three columns:
customerID: a unique ID for each customer.
buy: "1" if the customer bought a certain product and "0" if they didn't.
loy: the customer's degree of "loyalty". The value changes depending on whether the customer has previously bought the product.
The same customer can buy the product multiple times.
The variable "loy" is constructed as follows for each observation [i]:
loy[i] = 0.9 * buy[i-1] + 0.1 * loy[i-1]
This is the piece of code I have until now:
dta <- data.frame(
customerID = c("10","10","10","11","11","11","12","12","12","13","13", "13", "14",
"14", "14", "15", "15", "15", "16", "16", "16"),
buy = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
0, 0, 1)
)
dta$loy <- 0
n=nrow(dta)
for(i in 2:n){
dta$loy[i] <- dta$buy[i-1]*0.9 + 0.1*dta$loy[i-1]
}
(The first value of "loy" for each customer is supposed to be fixed to "0")
I need to construct the variable "loy" for each customer. That means, the loop has to reset every time it passes to a new customerID, but I don't know how to do it.
With the code I have until now, the loop keeps going for all of the observations.
It's worth mentioning that this dataframe was created from scratch just to simplify the question. I have another data set with over 2000 customerIDs, to which I plan to apply the solution of this simplified problem.
You can run a loop inside another loop. This solution depends on your data being sorted by customerID and time.
ids <- unique(dta$customerID)
for (i in ids) {
  # location of ith customer's records
  index <- which(dta$customerID == i)
  # calculate running loyalty score
  for (j in index[-1]) {
    dta[j, "loy"] <- dta[j - 1, "buy"] * 0.9 + 0.1 * dta[j - 1, "loy"]
  }
}
dta
# customerID buy loy
# 1 10 1 0.00
# 2 10 1 0.90
# 3 10 1 0.99
# 4 11 1 0.00
# 5 11 0 0.90
# 6 11 0 0.09
# ...
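You could use dplyr. Note that this two-lag shortcut reproduces the recursion only when each customer has at most three rows, as in the example data; for longer histories the contribution of older purchases is dropped: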
library(dplyr)
dta %>%
  group_by(customerID) %>%
  mutate(tmp = 0.9 * lag(buy, default = 0),
         loy = tmp + 0.1 * lag(tmp, default = 0)) %>%
  select(-tmp) %>%
  ungroup()
which returns
# A tibble: 21 x 3
customerID buy loy
<chr> <dbl> <dbl>
1 10 1 0
2 10 1 0.9
3 10 1 0.99
4 11 1 0
5 11 0 0.9
6 11 0 0.09
7 12 0 0
8 12 1 0
9 12 1 0.9
10 13 1 0
# ... with 11 more rows
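For customers with more than three rows, the recursion itself has to be carried forward within each customer. Below is a minimal base R sketch, assuming the data are sorted by customer and time as in the question. The helper name loy_recursive is made up for this illustration and is not part of any package.

```r
# Example data from the question
dta <- data.frame(
  customerID = rep(c("10", "11", "12", "13", "14", "15", "16"), each = 3),
  buy = c(1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1)
)

# Hypothetical helper: loy[1] = 0 and
# loy[i] = 0.9 * buy[i-1] + 0.1 * loy[i-1] for i > 1
loy_recursive <- function(buy) {
  Reduce(
    function(prev_loy, prev_buy) 0.9 * prev_buy + 0.1 * prev_loy,
    head(buy, -1),      # the last purchase never feeds into a later row
    init = 0,           # first loyalty value of every customer
    accumulate = TRUE   # keep every intermediate value, not just the last
  )
}

# ave() applies the helper separately within each customerID
dta$loy <- ave(dta$buy, dta$customerID, FUN = loy_recursive)
dta
```

Because the helper only sees one customer's purchases at a time, the score restarts at 0 for every new customerID, however many rows that customer has.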
I am reproducing some Stata code in R and I would like to perform a multinomial logistic regression with the mlogit function, from the package of the same name (I know that there is a multinom function in nnet but I don't want to use this one).
My problem is that, to use mlogit, I need my data to be formatted using mlogit.data and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.
Indeed, the data I use is like:
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
ID type age dum1 dum2
1 1 2 28 1 1
2 2 3 31 0 0
3 3 4 12 0 1
4 4 2 1 0 1
5 5 1 49 0 0
6 6 1 80 1 0
7 7 4 36 0 1
8 8 1 53 1 0
9 9 3 22 1 1
10 10 2 10 0 0
whereas the data they use is like:
key altkey A B C D
1 201005131 1 2.6 118.17 117 0
2 201005131 2 1.4 117.11 115 0
3 201005131 3 1.1 117.38 122 1
4 201005131 4 24.6 NA 122 0
5 201005131 5 48.6 91.90 122 0
6 201005131 6 59.8 NA 122 0
7 201005132 1 20.2 118.23 113 0
8 201005132 2 2.5 123.67 120 1
9 201005132 3 7.4 116.30 120 0
10 201005132 4 2.8 118.86 120 0
11 201005132 5 6.9 124.72 120 0
12 201005132 6 2.5 123.81 120 0
As you can see, in their case, there is a column altkey that details every category for each key and there is also a column D showing which alternative is chosen by the person.
However, I only have one column (type), which shows the choice of the individual but does not show the other alternatives or the value of the other variables for each of these alternatives. When I try to apply mlogit, I get:
library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)
Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) :
row names supplied are of the wrong length
Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?
Edit: following the advice of @edsandorf, I modified my dataframe and mlogit.data works, but now all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, can somebody show me the procedure from where I am to the results of mlogit, because I don't see where I'm going wrong in the estimation?)
The data I show here (df) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (type), columns with dummies and age, etc.
Here's the procedure I've made so far (I did not set the alternatives to 0):
# create a dataframe with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 (chosen alternative) or 0 (not chosen)
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
ID choice type age dum1 dum2
1.1 1 1 FALSE 28 1 1
1.2 1 2 TRUE 28 1 1
1.3 1 3 FALSE 28 1 1
1.4 1 4 FALSE 28 1 1
2.1 2 1 FALSE 31 0 0
2.2 2 2 FALSE 31 0 0
If I run:
mlogit(type ~ age + dum1 + dum2, df3)
I get the error:
Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number
Your data doesn't lend itself well to being estimated with an MNL model unless we make more assumptions. In general, since all your variables are individual-specific and do not vary across alternatives (types), the model cannot be identified. All of your individual-specific characteristics will drop out unless we treat them as alternative-specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
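To see why such variables drop out, here is a small numeric sketch of the MNL choice probabilities. It is only an illustration: choice_prob is a name made up here, and the coefficient values are arbitrary assumptions, not estimates.

```r
# MNL probabilities for one individual, given the utility of each alternative
choice_prob <- function(utility) exp(utility) / sum(exp(utility))

age <- 28     # individual-specific: identical for all 4 alternatives
n_alt <- 4

# With one coefficient shared across alternatives, every alternative gets the
# same utility, so the probabilities are 1/4 whatever the value of beta
for (beta in c(-0.2, 0, 0.5)) {
  print(choice_prob(rep(beta * age, n_alt)))
}

# With alternative-specific coefficients (alternative 4 as reference, fixed at 0)
# the utilities differ, so age can affect the choice probabilities
beta_alt <- c(0.02, -0.01, 0.03, 0)   # made-up values for illustration
print(choice_prob(beta_alt * age))
```

Since the data cannot distinguish one shared beta from another, a shared coefficient on age is not identified; only the differences relative to a reference alternative are.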
library(mlogit)
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2 etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type as 1 (chosen alternative) or 0 (not chosen)
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
Now we can run the model. I include the dummies for each of the alternatives, keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second part of the formula (after the vertical bar), I remove the alternative-specific constants that the model would otherwise have created and add your individual-specific variables, treating them as alternative-specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
reflevel = 4, data = df3)
summary(model)
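As a side note, the long layout (one row per ID and alternative) can also be built without the row-wise loop. This is a base R sketch under the same toy data; the names long, alternatives and chosen are made up for this illustration, and the result would still need to go through mlogit.data as above.

```r
df <- data.frame(ID = 1:10,
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# One row per (ID, alternative) pair, with the individual's variables attached
alternatives <- sort(unique(df$type))
long <- merge(expand.grid(ID = df$ID, choice = alternatives), df, by = "ID")
long <- long[order(long$ID, long$choice), ]

# Vectorised replacement for the ifelse() loop: 1 for the chosen alternative
long$chosen <- as.integer(long$type == long$choice)
head(long, 4)
```

Keeping the original type column and storing the indicator in a separate column makes it easy to check that each ID has exactly one chosen row.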
I want to calculate a moving sum of one column (populated with ones and zeroes), but only when the value in a corresponding column (time) is within a (moving) range of values.
My data looks like this:
values <- c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
seconds <- c(0.0, 1.0, 2.5, 3.0, 5.5, 6.0, 6.5, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.5, 16.0, 17.0, 18.0, 19.0, 20.0)
data <- data.frame(values, seconds)
Say I want to sum every 5 seconds worth of data in the 'values' column.
Then my first 5-second sum (seconds >=0 & seconds <= 5) would be:
1 (because it corresponds to a 'seconds', 0.0, within the interval of interest)
+
0 (corresponds to 1.0 in 'seconds')
+
0 (2.5)
+
0 (3.0)
= 1
STOPs here because the next value (1) corresponds to 5.5 seconds, outside of the interval.
The next 5-second interval (seconds >= 1 & seconds <= 6) would equal:
0 + 0 + 0 + 1 + 1 = 2
3rd interval:
(seconds >= 2.5 & seconds <= 7.5) = 0 + 0 + 1 + 1 + 0 + 1 = 3
and so on.
I'm an R noob, so this is the method I'm using to calculate it (and it is super slow, so I know there must be a better way):
movsum <- numeric(nrow(data))
for(i in 1:20){movsum[i] <- sum(subset(data, seconds >= (seconds[i] - 5.0) & seconds <= seconds[i])$values)}
Thanks for your help. Let me know if there's anything I should clarify.
Here's a possible data.table::foverlaps solution. The idea is to create a lookup table of 5-second intervals and then find which rows of data fall within each interval.
Choose an interval
int <- 5 ## 5 seconds
Then load the package, add a second (identical) column to data to serve as the interval end, key data on both columns to enable the binary join, create a lookup table holding the desired boundaries for each row, run foverlaps, look up the corresponding entries of data$values, and sum them per interval. Something like the following seems to work:
library(data.table)
setkey(setDT(data)[, seconds2 := seconds], seconds, seconds2)
lookup <- data[, .(seconds, seconds2 = seconds + int)]
res <- foverlaps(lookup, data, which = TRUE)[, values := data$values[yid]]
res[, .(SumValues = sum(values)), by = .(SecInterval = xid)]
# SecInterval SumValues
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 3
# 5: 5 3
# 6: 6 2
# 7: 7 1
# 8: 8 2
# 9: 9 1
# 10: 10 2
# 11: 11 3
# 12: 12 3
# 13: 13 2
# 14: 14 2
# 15: 15 1
# 16: 16 0
# 17: 17 0
# 18: 18 0
# 19: 19 0
# 20: 20 0
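The same forward-looking window (from each time stamp to 5 seconds later, as in the worked example in the question) can also be sketched in base R. The variable names below are made up for this illustration, and the faster variant assumes that seconds is sorted in increasing order.

```r
values <- c(1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0)
seconds <- c(0.0, 1.0, 2.5, 3.0, 5.5, 6.0, 6.5, 7.0, 8.0, 10.0,
             11.0, 12.0, 13.0, 14.0, 15.5, 16.0, 17.0, 18.0, 19.0, 20.0)
width_sec <- 5

# Direct version: one logical subset per starting time
movsum_direct <- vapply(
  seconds,
  function(t0) sum(values[seconds >= t0 & seconds <= t0 + width_sec]),
  numeric(1)
)

# Faster version for long series: cumulative sums plus findInterval()
cs <- c(0, cumsum(values))                                          # cs[k + 1] = sum of values[1:k]
first_idx <- findInterval(seconds, seconds, left.open = TRUE) + 1   # first row with time >= t0
last_idx <- findInterval(seconds + width_sec, seconds)              # last row with time <= t0 + width
movsum_fast <- cs[last_idx + 1] - cs[first_idx]

movsum_direct
```

The direct version does one pass over the data per row, while the cumulative-sum version needs only two binary searches per row, which matters once the series gets long.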
You may try some functions from the zoo package:
library(zoo)
# convert your data to a zoo time series
z <- read.zoo(data, index = "seconds")
# create an empty, regular time series,
# which contains the full time range, in steps of 0.5 sec
z0 <- zoo(, seq(from = start(z), to = end(z), by = 0.5))
# 'expand' the irregular, original data to a regular series, by merging it with z0
z2 <- merge(z, z0)
# apply the desired function (sum) to a rolling window of width 11
# (number of observations in each window)
# move the time frame in steps of 2 (by = 2) which correspond to 1 sec
# use partial = TRUE, to allow the window to pass outside the time range
rollapply(z2, width = 11, by = 2, FUN = sum, na.rm = TRUE,
align = "left", partial = TRUE)
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# 1 2 3 3 3 3 2 2 1 2 2 3 3 2 2 1 0 0 0 0 0
I have a dataframe d like this:
ID Value1 Value2 Value3
1 20 25 0
2 2 0 0
3 15 32 16
4 0 0 0
What I would like to do is calculate the variance for each person (ID), based only on non-zero values, and to return NA where this is not possible.
So for instance, in this example the variance for ID 1 would be var(c(20, 25)),
for ID 2 it would return NA because you can't calculate a variance on just one entry, for ID 3 the variance would be var(c(15, 32, 16)), and for ID 4 it would again return NA because it has no numbers at all to calculate a variance on.
How would I go about this? I currently have the following (incomplete) code, but this might not be the best way to go about it:
len=nrow(d)
variances = numeric(len)
for (i in 1:len){
#get all nonzero values in ith row of data into a vector nonzerodat here
currentvar = var(nonzerodat)
variances[i]=currentvar
}
Note this is a toy example, but the dataset I'm actually working with has over 40 different columns of values to calculate variance on, so something that easily scales would be great.
Data <- data.frame(ID = 1:4, Value1=c(20,2,15,0), Value2=c(25,0,32,0), Value3=c(0,0,16,0))
var_nonzero <- function(x) var(x[x != 0])
apply(Data[, -1], 1, var_nonzero)
[1] 12.5 NA 91.0 NA
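If the result should stay attached to the IDs and scale to many value columns, the same idea can be written as follows. This is a sketch that assumes every column other than ID holds values; the explicit length check is an addition of mine to make the NA cases visible rather than relying on how var() handles short input.

```r
d <- data.frame(ID = 1:4,
                Value1 = c(20, 2, 15, 0),
                Value2 = c(25, 0, 32, 0),
                Value3 = c(0, 0, 16, 0))

# Assumption: every column except ID is a value column
value_cols <- setdiff(names(d), "ID")

# Variance of the non-zero entries, NA when fewer than two remain
var_nonzero <- function(x) {
  nz <- x[x != 0]
  if (length(nz) < 2) NA_real_ else var(nz)
}

# apply() over rows (MARGIN = 1) returns one variance per ID
d$variance <- apply(d[value_cols], 1, var_nonzero)
d
```

Selecting the columns by name with setdiff() means nothing has to change when the real data has 40 or more value columns.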
This seems overwrought, but it works, and it gives you back an object with the ids attached to the statistics:
library(reshape2)
library(dplyr)
variances <- df %>%
melt(., id.var = "id") %>%
group_by(id) %>%
summarise(variance = var(value[value!=0]))
Here's the toy data I used to test it:
df <- data.frame(id = seq(4), X1 = c(3, 0, 1, 7), X2 = c(10, 5, 0, 0), X3 = c(4, 6, 0, 0))
> df
id X1 X2 X3
1 1 3 10 4
2 2 0 5 6
3 3 1 0 0
4 4 7 0 0
And here's the result:
id variance
1 1 14.33333
2 2 0.50000
3 3 NA
4 4 NA
I would like to add one new row for each of the subjects in my dataframe, which looks something like this:
Subject = c("1","5","10")
time = c("2", "2.25", "2.5")
value = c("3", "17", "9")
DF <- data.frame(Subject, time, value)
Subject time value
1 1 2 3
2 5 2.25 17
3 10 2.5 9
I want to add a new row for each subject with a time = 0 and value = 0, giving this:
Subject = c("1","1","5","5","10","10")
time = c("0","2","0", "2.25","0", "2.5")
value = c("0","3","0", "17","0", "9")
DF2 <- data.frame(Subject, time, value)
Subject time value
1 1 0 0
2 1 2 3
3 5 0 0
4 5 2.25 17
5 10 0 0
6 10 2.5 9
I have a lot of subjects with a lot of gaps in their subject numbers, and want to do this for all of them in a reasonable way. Any suggestions?
Thank you in advance.
Sincerely,
ykl
I would just rbind in the new values (not sure why you specified all your values as character; here I changed them to numeric):
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
DF2 <- rbind(
  DF,
  data.frame(Subject = unique(DF$Subject), time = 0, value = 0)
)
This puts them at the bottom, but you could re-sort if you like:
DF2[order(DF2$Subject, DF2$time), ]
You can also use interleave from the "gdata" package:
library(gdata)
interleave(DF, data.frame(Subject = 0, time = 0, value = 0))
# Subject time value
# 1 1 2.00 3
# 11 0 0.00 0
# 2 5 2.25 17
# 1.1 0 0.00 0
# 3 10 2.50 9
# 1.2 0 0.00 0
This uses @MrFlick's sample data.
DF <- data.frame(
Subject = c(1,5,10),
time = c(2, 2.25, 2.5),
value = c(3, 17, 9)
)
I have the following sample code, which builds one data frame containing information for more than one ID.
I want to sort the IDs into defined categories.
Specifically, I want to compute the percentage change at a given time (e.g. here t=10) with respect to
the baseline value, and return the category that this change falls into.
The detailed steps of my calculation are explained below.
a=c(100,105,126,130,150,100,90,76,51,40)
t=c(0,5,10,20,30)
t=rep(t,2)
ID=c(1,1,1,1,1,2,2,2,2,2)
data=data.frame(ID,t,a)
My desired calculation
1) For every ID, the "a" value at t=0 is the baseline.
2) Computation
e.g. at a given t=10 (which has to be provided), take the corresponding "a" value
%Change (answer) = (a value - baseline) / baseline
3) Compare the answer against the following defined categories:
#category
1 - if answer > 0.25
2 - if -0.30 < answer < 0.25
3 - if -1.0 < answer < -0.30
4 - if answer == -1.0
4) Return the category.
Desired Output
ID My_Answer
1 1
2 3
Can anybody help me with this? I understand the flow of my computation, but I am not aware of an efficient way of doing it, as I have many IDs in that data frame.
Thank you
It's easier to do math with columns than with rows. So the first step is to move baseline numbers into their own columns, then use cut to define these groups:
library(dplyr)
library(tidyr)
foo <- data %>%
filter(t == 0) %>%
left_join(data %>%
filter(t != 0),
by = "ID") %>%
mutate(percentchange = (a.y - a.x) / a.x,
My_Answer = cut(percentchange, breaks = c(-1, -0.3, 0.25, Inf),
right = FALSE, include.lowest = TRUE, labels = c("g3","g2","g1")),
My_Answer = as.character(My_Answer),
My_Answer = ifelse(percentchange == -1, "g4", My_Answer)) %>%
select(ID, t = t.y, My_Answer)
foo
ID t.x a.x t.y a.y percentchange My_Answer
1 1 0 100 5 105 0.05 g2
2 1 0 100 10 126 0.26 g1
3 1 0 100 20 130 0.30 g1
4 1 0 100 30 150 0.50 g1
5 2 0 100 5 90 -0.10 g2
6 2 0 100 10 76 -0.24 g2
7 2 0 100 20 51 -0.49 g3
8 2 0 100 30 40 -0.60 g3
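You can see that this lets us calculate My_Answer for all values at once (the table above was printed before the final select() step, which is why it still shows the intermediate columns). If you want to find out the values for t == 10, you can just pull out those rows: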
foo %>%
filter(t == 10)
ID t My_Answer
1 1 10 g1
2 2 10 g2
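For comparison, here is a base R sketch that returns the numeric category for one chosen time. The function name categorise_change is made up for this illustration. The question does not say where values exactly equal to 0.25 or -0.30 belong, so this sketch assumes they fall into the lower category (2 and 3 respectively).

```r
a <- c(100, 105, 126, 130, 150, 100, 90, 76, 51, 40)
t <- rep(c(0, 5, 10, 20, 30), 2)
ID <- rep(1:2, each = 5)
data <- data.frame(ID, t, a)

# Hypothetical helper: category of the relative change from t = 0 to at_time
categorise_change <- function(data, at_time) {
  baseline <- data[data$t == 0, c("ID", "a")]
  target <- data[data$t == at_time, c("ID", "a")]
  both <- merge(baseline, target, by = "ID", suffixes = c("_base", "_at"))
  change <- (both$a_at - both$a_base) / both$a_base
  # Assumption: boundary values 0.25 and -0.30 go to the lower category
  category <- ifelse(change > 0.25, 1L,
              ifelse(change > -0.30, 2L,
              ifelse(change > -1, 3L, 4L)))
  data.frame(ID = both$ID, change = change, My_Answer = category)
}

categorise_change(data, at_time = 10)
```

Under the rules as stated, ID 2 has a change of -0.24 at t = 10 and therefore lands in category 2, in line with the g2 shown above.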