Transform a dataset to a summary table in R

I am learning data mining, specifically market basket analysis, and would like to transform the raw data into a summary table for further calculation of support and confidence.
Below is an example with 4 transactions indicating which items each customer purchased; the same data appears as the data frame M in the answer below.
Afterwards I would like to have all possible item sets. For the above example, the total is 24 item sets.

It sounds like you're looking for the crossprod function:
M <- data.frame(ID = 1:4,
                A = c(1, 0, 1, 0),
                B = c(1, 1, 0, 0),
                C = c(0, 1, 1, 0),
                D = c(0, 0, 1, 1))
crossprod(as.matrix(M[-1]))
# A B C D
# A 2 1 1 1
# B 1 2 1 0
# C 1 1 2 1
# D 1 0 1 2
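From that co-occurrence matrix you can go on to support and confidence. A minimal sketch (not part of the original answer, and limited to single items and item pairs):
counts <- crossprod(as.matrix(M[-1]))
n <- nrow(M)                          # number of transactions
support <- counts / n                 # diagonal: single-item support; off-diagonal: pair support
confidence <- counts / diag(counts)   # entry [i, j]: confidence of the rule i -> j
support
confidence
Larger item sets would need a different enumeration (e.g. via combn() or the arules package).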


Creating a table with proportional values

I have got a data set that looks like this:
COMPANY DATABREACH CYBERBACKGROUND
A 1 2
B 0 2
C 0 1
D 0 2
E 1 1
F 1 2
G 0 2
H 0 2
I 0 2
J 0 2
Now I want to create the following: in 40% of the cases where the column DATABREACH has the value 1, I want CYBERBACKGROUND to take the value 2. I figure there must be a function to do this, but I cannot find it.
ind <- which(df$DATABREACH == 1)
ind <- ind[rbinom(length(ind), 1, prob = 0.4) > 0]
df$CYBERBACKGROUND[ind] <- 2
The above is a bit more efficient in that it only draws random numbers for as many rows as strictly required. If you aren't concerned about that (11000 rows doesn't seem too many), you can reduce it to
df$CYBERBACKGROUND <-
  ifelse(df$DATABREACH == 1 & rbinom(nrow(df), 1, prob = 0.4) > 0,
         2, df$CYBERBACKGROUND)
We may use replace() within mutate():
library(dplyr)
df <- df %>%
  mutate(CYBERBACKGROUND = replace(CYBERBACKGROUND,
                                   sample(which(DATABREACH == 1),
                                          ceiling(sum(DATABREACH) * 0.4)),
                                   2))
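Note that the rbinom() approaches flip an independent 40% coin for each row, so the realised share varies around 0.4, whereas the dplyr answer picks exactly 40% of the rows via sample(). A base R sketch of the exact-count version (assuming the data frame is named df, as in the first answer):
ind <- which(df$DATABREACH == 1)
picked <- sample(ind, ceiling(length(ind) * 0.4))  # exactly 40% of the eligible rows, rounded up
df$CYBERBACKGROUND[picked] <- 2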

R: how to format my data for multinomial logit?

I am reproducing some Stata code on R and I would like to perform a multinomial logistic regression with the mlogit function, from the package of the same name (I know that there is a multinom function in nnet but I don't want to use this one).
My problem is that, to use mlogit, I need my data to be formatted using mlogit.data and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.
Indeed, the data I use is like:
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
ID type age dum1 dum2
1 1 2 28 1 1
2 2 3 31 0 0
3 3 4 12 0 1
4 4 2 1 0 1
5 5 1 49 0 0
6 6 1 80 1 0
7 7 4 36 0 1
8 8 1 53 1 0
9 9 3 22 1 1
10 10 2 10 0 0
whereas the data they use is like:
key altkey A B C D
1 201005131 1 2.6 118.17 117 0
2 201005131 2 1.4 117.11 115 0
3 201005131 3 1.1 117.38 122 1
4 201005131 4 24.6 NA 122 0
5 201005131 5 48.6 91.90 122 0
6 201005131 6 59.8 NA 122 0
7 201005132 1 20.2 118.23 113 0
8 201005132 2 2.5 123.67 120 1
9 201005132 3 7.4 116.30 120 0
10 201005132 4 2.8 118.86 120 0
11 201005132 5 6.9 124.72 120 0
12 201005132 6 2.5 123.81 120 0
As you can see, in their case, there is a column altkey that details every category for each key and there is also a column D showing which alternative is chosen by the person.
However, I only have one column (type) which shows the choice of the individual but does not show the other alternatives or the value of the other variables for each of these alternatives. When I try to apply mlogit, I get:
library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)
Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) :
row names supplied are of the wrong length
Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?
Edit: following the advice of @edsandorf, I modified my data frame and mlogit.data works, but now all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, can somebody show me the procedure from where I am to the results of the mlogit call, because I don't see where I'm going wrong in the estimation?)
The data I show here (df) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (type), columns with dummies and age, etc.
Here's the procedure I've followed so far (I did not set the alternatives to 0):
# create a dataframe with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type to 1 (chosen alternative) or 0
for (i in 1:length(df2$ID)){
  df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
ID choice type age dum1 dum2
1.1 1 1 FALSE 28 1 1
1.2 1 2 TRUE 28 1 1
1.3 1 3 FALSE 28 1 1
1.4 1 4 FALSE 28 1 1
2.1 2 1 FALSE 31 0 0
2.2 2 2 FALSE 31 0 0
If I do:
mlogit(type ~ age + dum1 + dum2, df3)
I get the error:
Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number
Your data doesn't lend itself well to being estimated with an MNL model unless we make more assumptions. In general, since all your variables are individual specific and do not vary across alternatives (types), the model cannot be identified. All of your individual-specific characteristics will drop out unless we treat them as alternative specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
library(mlogit)
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2 etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))
# merge both data frames
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type to 1 (chosen alternative) or 0
for (i in 1:length(df2$ID)){
  df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
Now we can run the model. I include the dummies for each of the alternatives keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second half of the formula (after the pipe operator), I make sure that I remove all alternative specific constants that the model would have created and I add your individual specific variables, treating them as alternative specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
                reflevel = 4, data = df3)
summary(model)
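As a side note (not part of the original answer), the row-by-row loop that recodes type can be replaced by a single vectorised comparison, which is considerably faster on larger data:
# vectorised equivalent of the for loop: 1 where the row's alternative
# matches the originally chosen type, 0 otherwise
df2$type <- as.integer(df2$type == df2$choice)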

Create a vector of counts

I wanted to create a vector of counts if possible.
For example: I have a vector
x <- c(3, 0, 2, 0, 0)
How can I create a frequency vector for all integers between 0 and 3? Ideally I wanted to get a vector like this:
> 3 0 1 1
which gives me the counts of 0, 1, 2, and 3 respectively.
Much appreciated!
You can do
table(factor(x, levels=0:3))
Simply using table(x) is not enough.
Or with tabulate, which is faster:
tabulate(factor(x, levels = min(x):max(x)))
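For reference, wrapping the table() result in as.vector() strips the names and gives the plain count vector from the question (a small extra step, not in the original answers):
x <- c(3, 0, 2, 0, 0)
as.vector(table(factor(x, levels = 0:3)))
# [1] 3 0 1 1
tabulate(factor(x, levels = 0:3))
# [1] 3 0 1 1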
You can do this using rle (I made this in minutes, so sorry if it's not optimized enough).
x = c(3, 0, 2, 0, 0)
r = rle(x)
f = function(x) sum(r$lengths[r$values == x])
s = sapply(FUN = f, X = as.list(0:3))
data.frame(x = 0:3, freq = s)
#> data.frame(x = 0:3, freq = s)
# x freq
#1 0 3
#2 1 0
#3 2 1
#4 3 1
You can just use table():
a <- table(x)
a
x
#0 2 3
#3 1 1
Then you can subset it:
a[names(a)==0]
#0
#3
Or convert it into a data.frame if you're more comfortable working with that:
u<-as.data.frame(table(x))
u
# x Freq
#1 0 3
#2 2 1
#3 3 1
Edit 1:
For levels:
a<- as.data.frame(table(factor(x, levels=0:3)))

Imputing labels based on a comparison of columns

I don't think this question has been asked on this board before. I have two columns of 1s and 0s in a dataframe. Let's call these columns X and Y, respectively. In a comparison of X and Y for any row, one of four combinations is obviously possible:
A: 1, 0
B: 0, 1
C: 1, 1
D: 0, 0
Imagine the dataframe has m columns total, but we're interested only in X and Y. I'd like to write a function that compares only X and Y and then characterizes the particular combination with the corresponding labels A, B, C, or D in a new column (let's call it Z).
So say the data looks like:
X Y
1 1
0 1
0 0
1 1
The function will output:
X Y Z
1 1 C
0 1 B
0 0 D
1 1 C
I imagine this would be trivial but I'm an R newbie. Thanks for any guidance!
We create a key/value dataset with the unique combinations and then merge it with the input dataset based on the 'X' and 'Y' columns:
merge(df1, keyDat, by = c("X", "Y"), all.x = TRUE)
# X Y Z
#1 0 0 D
#2 0 1 B
#3 1 1 C
#4 1 1 C
Or, to keep the output in the same row order as the input, use left_join:
library(dplyr)
left_join(df1, keyDat)
#Joining by: c("X", "Y")
# X Y Z
#1 1 1 C
#2 0 1 B
#3 0 0 D
#4 1 1 C
data
keyDat <- data.frame(X = c(1, 0, 1, 0), Y = c(0, 1, 1, 0),
                     Z = c("A", "B", "C", "D"), stringsAsFactors = FALSE)
df1 <- data.frame(X = c(1, 0, 0, 1), Y = c(1, 1, 0, 1))
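An alternative that skips the merge entirely (a small sketch, not part of the original answers) is a named-vector lookup on the pasted X/Y combination:
# lookup vector: "XY" string -> label
key <- c("10" = "A", "01" = "B", "11" = "C", "00" = "D")
df1$Z <- unname(key[paste0(df1$X, df1$Y)])
df1
#   X Y Z
# 1 1 1 C
# 2 0 1 B
# 3 0 0 D
# 4 1 1 C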

How to visualize change in binary/categorical data over time?

>dput(data)
structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3), Dx = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1), Month = c(0,
6, 12, 18, 24, 0, 6, 12, 18, 24, 0, 6, 12, 18, 24), score = c(0,
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)), .Names = c("ID",
"Dx", "Month", "score"), row.names = c(NA, -15L), class = "data.frame")
>data
ID Dx Month score
1 1 1 0 0
2 1 1 6 0
3 1 1 12 0
4 1 1 18 1
5 1 1 24 1
6 2 1 0 1
7 2 1 6 1
8 2 2 12 1
9 2 2 18 0
10 2 2 24 1
11 3 1 0 0
12 3 1 6 0
13 3 1 12 0
14 3 1 18 0
15 3 1 24 0
Suppose I have the above data.frame. I have 3 patients (ID = 1, 2 or 3). Dx is the diagnosis (Dx = 1 is normal, Dx = 2 is diseased). There is a month variable, and last but not least a test score variable. The participants' test score is binary, and it can change from 0 to 1 or revert back from 1 to 0. I am having trouble coming up with a way to visualize this data. I would like an informative graph that looks at:
The trend of the participants' test scores over time.
How that trend compares to the participants' diagnosis over time.
In my real dataset I have over 800 participants, so I do not want to construct 800 separate graphs ... I think the test score variable being binary really has me stumped. Any help would be appreciated.
With ggplot2 you can make faceted plots with subplots for each patient (see my solution for dealing with the large number of plots below). An example visualization:
library(ggplot2)
ggplot(data, aes(x = Month, y = score, color = factor(Dx))) +
  geom_point(size = 5) +
  scale_x_continuous(breaks = c(0, 6, 12, 18, 24)) +
  scale_color_discrete("Diagnosis", labels = c("normal", "diseased")) +
  facet_grid(. ~ ID) +
  theme_bw()
which gives a faceted plot with one panel per patient and the points colored by diagnosis.
Including 800 patients in one plot might be a bit too much as already mentioned in the comments of the question. There are several solutions to this problem:
Aggregate the data.
Create patient subgroups and make a plot for each subgroup.
Filter out all the patients who have never been ill.
With regard to the last suggestion, you can do that with the following code (which I adapted from an answer to one of my own questions):
deleteable <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
data2 <- data[deleteable==0,]
You can use this as well to create a new variable identifying patients who have never been ill:
data$neverill <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
Then you can, for example, aggregate the data using several grouping variables (e.g. Month, neverill), as in the sketch below.
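A possible aggregation (a sketch, assuming the neverill flag from above has been added to data) plots the share of positive test scores per month for each group:
# proportion of positive test scores per Month, split by the neverill flag
agg <- aggregate(score ~ Month + neverill, data = data, FUN = mean)
ggplot(agg, aes(x = Month, y = score, linetype = factor(neverill))) +
  geom_line() +
  labs(y = "Proportion of positive scores", linetype = "Never ill")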
Note: most of the following data manipulation is needed for point 2 (comparing the score trend to the diagnosis); point 1 is less complex, and you can see where it fits in below.
Uses
library(data.table)
library(ggplot2)
library(reshape2)
To Compare
First, recode Dx from 1/2 to 0/1 (assuming that a score of 0 corresponds to a Dx of 1, i.e. normal):
data$Dx <- data$Dx - 1
Now, create a matrix that returns a 1 for a 1 diagnosis with a 0 test, and a -1 for a 1 test with a 0 diagnosis.
compare <- matrix(c(0,1,-1,0),ncol = 2,dimnames = list(c(0,1),c(0,1)))
> compare
0 1
0 0 -1
1 1 0
Now, let's score every event. This simply looks up the matrix above for every row of your data:
data$calc <- diag(compare[as.character(data$Dx),as.character(data$score)])
Note: this builds an n-by-n matrix and takes its diagonal, so it can be sped up for large datasets using matching/matrix indexing (see the sketch below), but it is a quick fix for smaller sets like yours.
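One way to do that (a sketch, not from the original answer) is two-column matrix indexing, which performs a single lookup per row instead of building the full n-by-n matrix:
# index 'compare' with a (row name, column name) pair for each observation
data$calc <- compare[cbind(as.character(data$Dx), as.character(data$score))]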
To allow us to use data.table aggregation:
data <- data.table(data)
Now we need to create our variables:
tograph <- melt(data[, list(ScoreTrend = sum(score)/.N,
                            Type = sum(calc)/length(calc[calc != 0]),
                            Measure = sum(abs(calc))),
                     by = Month],
                id.vars = c("Month"))
ScoreTrend: the proportion of positive scores in each month; shows the trend of scores over time.
Type: the balance of -1 vs 1 over time. If this returns -1, all mismatched events were score = 1, diag = 0; if it returns 1, all were diag = 1, score = 0; zero means a balance between the two.
Measure: the raw number of incorrect (mismatched) events.
We melt this data frame along month so that we can create a facet graph.
If there are no incorrect events in a month, we will get a NaN for Type. To set those to 0:
tograph[is.nan(value), value := 0]
Finally, we can plot
ggplot(tograph, aes(x = Month, y = value)) + geom_line() + facet_wrap(~variable, ncol = 1)
We can now see, in one plot:
The number of positive scores by month
The proportion of under vs. over diagnosis
The number of incorrect diagnoses.
