Possible combinations of -1 0 & +1 - r

Surveys are offered to customers to provide feedback on the consultant they dealt with. The customer gives a score of 1 to 10.
Scores are grouped as:
1-6 = -1 (a detractor)
7-8 = 0 (neutral)
9-10 = +1
So although there are 10 possible responses from the customer (1 to 10), each response ends up being either -1, 0 or +1, so there are only 3 possible outcomes per survey.
Say 10 surveys are done; the results may look like this:
-1,+1,0,0,+1,+1,-1,+1,0,0 = +2
This +2 is then divided by the survey count and multiplied by 100: 2/10 = 0.2, and 0.2 * 100 = 20.
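In R this score calculation is straightforward; a minimal sketch for the batch above:
responses <- c(-1, 1, 0, 0, 1, 1, -1, 1, 0, 0)
sum(responses) / length(responses) * 100  # 2/10 * 100 = 20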
Now to the tricky part.
I want to figure out the possible sum totals (like the +2 calculated above). This is best explained with a simple example:
1 survey = either a -1, 0 or +1, so 3 possible outcomes.
2 surveys have these possibilities:
-1, -1 equals -2
-1, 0 equals -1
-1, +1 equals 0
0, 0 equals 0
0, +1 equals +1
+1, +1 equals +2
So although there are 6 possible combinations, that is not quite what I am after. The order and individual totals don't matter; it's just the possible sums of n surveys. In this case there is only one -2, -1, +1 and +2. That's 4. And while two combinations sum to zero, from a sum perspective zero is only one outcome. So with 2 surveys there are only 5 possible summed amounts.
I can do these simple ones by hand, but what if there were 20 surveys? Doing that by hand would obviously be silly. Is there an algorithm I can use in R to solve this for as many surveys as I need to examine?

As written in the comments above:
Obviously, the sum can range from -n to +n where n is the number of surveys, a total of 2n + 1 possible sums. – Gassa
This is a way you can reach this conclusion:
Start by creating the full logic table of outcomes for 2 surveys:
+1 +1 2
+1 0 1
+1 -1 0
0 +1 1
0 0 0
0 -1 -1
-1 +1 0
-1 0 -1
-1 -1 -2
So with 2 surveys you have 3^2 = 9 possible outcomes, with a few duplicates boiling down to the unique set 2, 1, 0, -1, -2.
Combine this with the outcome of the next survey:
+1 +2 3
+1 +1 2
+1 0 1
[..]
-1 0 -1
-1 -1 -2
-1 -2 -3
For every new survey you add, you will end up with:
+1 +(N-1) N
+1 +(N-2) N-1
[..]
-1 -(N-2) -(N-1)
-1 -(N-1) -N
The conclusion is that the maximum and minimum outcomes are +N and -N, where N is the number of surveys. Every integer in between is also reachable (for a sum of k, take |k| surveys of the matching sign and the rest 0), giving the 2N + 1 possible sums.
Maybe this Wikipedia article about three-valued logic is what you are looking for:
http://en.wikipedia.org/wiki/Three-valued_logic
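To verify this by brute force in R for small N, here is a minimal sketch that enumerates all 3^N outcomes and collects the distinct sums:
n <- 2
grid <- expand.grid(rep(list(c(-1, 0, 1)), n))  # all 3^n combinations
sums <- sort(unique(rowSums(grid)))
sums                  # -2 -1  0  1  2
length(sums)          # 5, i.e. 2*n + 1
setequal(sums, -n:n)  # TRUE: every integer from -n to n occurs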

Related

Regression with before and after

I have a dataset with four variables (df):
household  group  income  post
1          0      20'000  0
1          0      22'000  1
2          1      10'000  0
2          1      20'000  1
3          0      20'000  0
3          0      21'000  1
4          1      9'000   0
4          1      16'000  1
5          1      8'000   0
5          1      18'000  1
6          0      22'000  0
6          0      26'000  1
7          1      12'000  0
7          1      24'000  1
8          0      24'000  0
8          0      27'000  1
group is a binary variable that is 1 when the household received support from the state, and post is also binary and is 1 for observations after the support was given.
Now I would like to run a before-vs-after regression that estimates the group effect by comparing the post period with the before period for the supported group. I would like to put the dependent variable in logs to get the effect in percentages, i.e. the impact of state support on income.
I used this code, but I don't know whether it is the right way to get the answer:
library("fixest")
feols(log(income) ~ group + post,data=df) %>% etable()
Is there another way?
If you are looking for the classic 2x2 design, your code was almost correct: change '+' to '*'. The interaction term tells us that the supported group increased its income by 7,250 more than the group which did not receive support.
comparing     <- feols(income ~ group * post, data = df)
comparing_log <- feols(log(income) ~ group * post, data = df)
etable(comparing, comparing_log)
PS: Interpreting the coefficient directly as a percentage change is a good approximation only for small coefficients. The exact formula for the % change is exp(beta) - 1; here that is exp(0.5859) - 1 ≈ 0.7966, so the change is about 79.7%.
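As a sanity check, the interaction coefficient of this saturated 2x2 log regression equals the difference-in-differences of the four cell means; a small sketch, assuming the df from the question:
# DiD of cell means; matches the group:post coefficient of the log model
m <- with(df, tapply(log(income), list(group = group, post = post), mean))
(m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])  # ~0.586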

Calculate the length of an interval if data are equal to zero

I have a dataframe with time points and the corresponding measure of activity for different subjects. Each time point is a 5-minute interval.
time      Subject1   Subject2
06:03:00  6,682129   8,127075
06:08:00  3,612061   20,58838
06:13:00  0          0
06:18:00  0,9030762  0
06:23:00  0          0
06:28:00  0          0
06:33:00  0          0
06:38:00  0          7,404663
06:43:00  0          11,55835
...
I would like to calculate the length of each interval that contains zero activity, as in the example below:
            Subject1  Subject2
Interval_1  1         5
Interval_2  5
I have the impression that I should solve this using loops and conditions, but as I am not very experienced with loops I do not know where to start. Do you have any ideas? Any help is really appreciated!
You can use rle() to find runs of consecutive values and the length of the runs. We need to filter the results to only runs where the value is 0:
# for each subject column (dropping time), keep lengths of the runs of zeros
result <- lapply(df[-1], \(x) with(rle(x), lengths[values == 0]))
result
# $Subject1
# [1] 1 5
#
# $Subject2
# [1] 5
As different subjects can have different numbers of zero runs, the results make more sense in a list than in a rectangular data frame.
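To see what rle() returns on its own, here is a small standalone illustration with made-up values (decimal commas read in as dots):
x <- c(6.68, 3.61, 0, 0.9, 0, 0, 0, 0)
r <- rle(x)
r$values                   # 6.68 3.61 0.00 0.90 0.00
r$lengths                  # 1 1 1 1 4
r$lengths[r$values == 0]   # 1 4: a single zero, then a run of four zeros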

Cavs vs. Warriors - probability of Cavs winning the series includes combinations like "0,1,0,0,0,1,1" - but the series is over after game 5

There is a problem on DataCamp about computing the probability of winning an NBA series. The Cavs and the Warriors are playing a seven-game championship series. The first team to win four games wins the series. They each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
Here is how DataCamp computed the probability using Monte Carlo simulation:
B <- 10000
set.seed(1)
results <- replicate(B, {
  x <- sample(0:1, 6, replace = TRUE)  # 0 when a game is lost, 1 when won
  sum(x) >= 4
})
mean(results)
Here is a different way they computed the probability using simple code:
# Assign a variable 'n' as the number of remaining games.
n<-6
# Assign a variable `outcomes` as a vector of possible game outcomes: 0 indicates a loss and 1 a win for the Cavs.
outcomes<-c(0,1)
# Assign a variable `l` to a list of all possible outcomes in all remaining games. Use the `rep` function on `list(outcomes)` to create list of length `n`.
l<-rep(list(outcomes),n)
# Create a data frame named 'possibilities' that contains all combinations of possible outcomes for the remaining games.
possibilities<-expand.grid(l) # My comment: note how this produces 64 combinations.
# Create a vector named 'results' that indicates whether each row in the data frame 'possibilities' contains enough wins for the Cavs to win the series.
rowSums(possibilities)
results<-rowSums(possibilities)>=4
# Calculate the proportion of 'results' in which the Cavs win the series.
mean(results)
Question/Problem:
They both produce approximately the same probability of winning the series, ~0.34. However, there seems to be a flaw in the concept and the code design. For example, the code (sampling six times) allows combinations such as the following:
G2 G3 G4 G5 G6 G7 rowSums
0 0 0 0 0 0 0 # Series over after G4 (Cavs lose). No need for game G5-G7.
0 0 0 0 1 0 1 # Series over after G4 (Cavs lose). Double counting!
0 0 0 0 0 1 1 # Double counting!
...
1 1 1 1 0 0 4 # No need for game G6 and G7.
1 1 1 1 0 1 5 # Double counting! This is the same as 1,1,1,1,0,0.
0 1 1 1 1 1 5 # No need for game G7.
1 1 1 1 1 1 6 # Series over after G5 (Cavs win). Double counting!
> rowSums(possibilities)
[1] 0 1 1 2 1 2 2 3 1 2 2 3 2 3 3 4 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 1 2 2 3 2 3 3 4 2 3 3 4 3 4 4 5 2 3 3 4 3 4 4 5 3 4 4 5 4 5 5 6
As you can see, these are never possible. After winning the first four of the remaining six games, no more games should be played. Similarly, after losing the first three games of the remaining six games, no more games should be played. So these combinations shouldn't be included in the computation of the probability of winning the series. There is double counting for some of the combinations.
Here is what I did to omit some of the combinations that are not possible in real life.
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities <- possibilities %>%
  mutate(rowsums = rowSums(possibilities)) %>%
  filter(rowsums <= 4)
But then I am not able to omit the other unnecessary combinations. For example, I want to remove two of these three: (a) 1,0,0,0,0,0 (b) 1,0,0,0,0,1 (c) 1,0,0,0,1,1. This is because no more games will be played after losing three times in a row. And they are basically double counting.
There are too many conditions for me to be able to filter them individually. There has to be a more efficient and intuitive way to do this. Can someone provide me with some hints on how to solve this whole mess?
Here is a way:
library(dplyr)
outcomes <- c(0, 1)
l <- rep(list(outcomes), 6)
possibilities <- expand.grid(l)
possibilities %>%
  mutate(rowsums  = rowSums(cur_data()),
         anti_sum = rowSums(!cur_data())) %>%
  filter(rowsums <= 4, anti_sum <= 3)
We use the fact that R coerces numbers to logical, where 0 becomes FALSE, so rowSums(!cur_data()) counts the losses in each row; see sum(!0) as a short example.
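For reference, the play-all-six-games enumeration also yields the exact probability in closed form, which confirms the ~0.34 figure above:
# P(Cavs win >= 4 of the 6 remaining games), fair coin for each game
sum(choose(6, 4:6)) / 2^6  # 22/64 = 0.34375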

How to calculate similarity of numbers (in list)

I am looking for a method of calculating a similarity score for a list of numbers. Ideally the method should give a result in a fixed range, for example from 0 to 1, where 0 means not similar at all and 1 means all numbers are identical.
For clarity let me provide a few examples:
0 1 2 3 4 5 6 7 8 9 10 => the similarity should be 0 or close to zero as all numbers are different
1 1 1 1 1 1 1 => 1
10 9 11 10.5 => close to 1
1 1 1 1 1 1 1 1 1 1 100 => score should be still pretty high as only the last value is different
I have tried to calculate the similarity based on normalization and average, but that gives me really bad results when there is one 'bad number'.
Thank you.
Similarity tests are always incredibly subjective, and the right one to use depends heavily on what you're trying to use it for. There are already three typical measures of central tendency (mean, median, mode). It's hard to say which test will work for you, because different measures that all do what you're asking can behave wildly differently on other lists (like [1]*7 + [100]*7). Here's one solution:
import statistics as stats

def tester(ell):
    # share of duplicated values: 0 when all distinct, approaches 1 when all equal
    mode_measure = 1 - len(set(ell)) / len(ell)
    # 1 minus the coefficient of variation (stdev relative to the mean)
    avg_measure = 1 - stats.stdev(ell) / stats.mean(ell)
    return max(avg_measure, mode_measure)
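If you prefer to stay in R, here is an equivalent sketch of the same idea (the name tester and both measures are carried over from the Python version above, so treat it as illustrative):
tester <- function(x) {
  mode_measure <- 1 - length(unique(x)) / length(x)  # share of duplicates
  avg_measure  <- 1 - sd(x) / mean(x)                # 1 - coefficient of variation
  max(mode_measure, avg_measure)
}
tester(c(10, 9, 11, 10.5))  # ~0.92, close to 1
tester(rep(1, 7))           # 1: sd is 0
tester(0:10)                # ~0.34: all distinct, large spread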

Reverse cumsum with breaks with non-sequential numbers

Looking to fill a matrix with a reverse cumsum. There are multiple breaks that must be maintained.
I have provided a sample matrix showing what I want to accomplish. The first column is the data; the second column is what I want. You will see that column 2 is updated to reflect the number of items that are left; when there are 0s, the previous number must be carried through.
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)
I have tried multiple ways to create a sequence and have looped using numerous packages (e.g. zoo). What makes it difficult is that the numbers in column 1 can be anywhere in 0, 1, ..., X, but always less than the value in column 2.
Any help or tips would be appreciated
EDIT: Column 2 starts with a given value, which can represent any starting value (i.e. inventory at the beginning of a month). Column 1 would then represent "purchases" made; thus, column 2 should reflect the total number of remaining items available.
The following will report the purchase and inventory balance as described:
starting_inventory <- 100
df <- data.frame(purchases=c(rep(0,4),rep(1,2),2,rep(0,2),1,3))
df$cum_purchases <- cumsum(df$purchases)
df$remaining_inventory <- starting_inventory - df$cum_purchases
Result:
purchases cum_purchases remaining_inventory
1 0 0 100
2 0 0 100
3 0 0 100
4 0 0 100
5 1 1 99
6 1 2 98
7 2 4 96
8 0 4 96
9 0 4 96
10 1 5 95
11 3 8 92
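Applied to the sample matrix from the question, the same idea reconstructs column 2 exactly; a small check, assuming the starting balance can be read off the first row:
update <- matrix(c(rep(0, 4), rep(1, 2), 2, rep(0, 2), 1, 3,
                   rep(10, 4), 9, 8, 6, rep(6, 2), 5, 2), ncol = 2)
start <- update[1, 2]                     # starting balance (10 here; works because the first purchase is 0)
remaining <- start - cumsum(update[, 1])  # balance carries through the runs of 0s
all(remaining == update[, 2])             # TRUE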
