How to visualize change in binary/categorical data over time? - r

>dput(data)
structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3), Dx = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1), Month = c(0,
6, 12, 18, 24, 0, 6, 12, 18, 24, 0, 6, 12, 18, 24), score = c(0,
0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)), .Names = c("ID",
"Dx", "Month", "score"), row.names = c(NA, -15L), class = "data.frame")
>data
ID Dx Month score
1 1 1 0 0
2 1 1 6 0
3 1 1 12 0
4 1 1 18 1
5 1 1 24 1
6 2 1 0 1
7 2 1 6 1
8 2 2 12 1
9 2 2 18 0
10 2 2 24 1
11 3 1 0 0
12 3 1 6 0
13 3 1 12 0
14 3 1 18 0
15 3 1 24 0
Suppose I have the above data.frame. I have 3 patients (ID = 1, 2 or 3). Dx is the diagnosis (Dx = 1 is normal, Dx = 2 is diseased). There is a month variable and, last but not least, a test score variable. The participants' test score is binary, and it can change from 0 to 1 or revert from 1 to 0. I am having trouble coming up with a way to visualize this data. I would like an informative graph that looks at:
The trend of the participants' test scores over time.
How that trend compares to the participants' diagnosis over time
In my real dataset I have over 800 participants, so I do not want to construct 800 separate graphs ... I think the test score variable being binary really has me stumped. Any help would be appreciated.

With ggplot2 you can make faceted plots with subplots for each patient (see my solution for dealing with the large number of plots below). An example visualization:
library(ggplot2)
ggplot(data, aes(x=Month, y=score, color=factor(Dx))) +
geom_point(size=5) +
scale_x_continuous(breaks=c(0,6,12,18,24)) +
scale_color_discrete("Diagnosis",labels=c("normal","diseased")) +
facet_grid(.~ID) +
theme_bw()
which gives:
Including 800 patients in one plot might be a bit too much as already mentioned in the comments of the question. There are several solutions to this problem:
Aggregate the data.
Create patient subgroups and make a plot for each subgroup.
Filter out all the patients who have never been ill.
With regard to the last suggestion, you can do that with the following code (which I adapted from an answer to one of my own questions):
deleteable <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
data2 <- data[deleteable==0,]
You can use the same approach to create a new variable identifying patients who have never been ill:
data$neverill <- with(data, ave(Dx, ID, FUN=function(x) all(x==1)))
Then you can, for example, aggregate the data by several grouping variables (e.g. Month, neverill).
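As a rough sketch of that aggregation step, the following computes the proportion of positive scores per Month within each neverill group using base R's aggregate. The choice of mean as the summary and the object name agg are assumptions for illustration, not part of the original answer.

```r
# Example data from the question
data <- data.frame(
  ID    = rep(1:3, each = 5),
  Dx    = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1),
  Month = rep(c(0, 6, 12, 18, 24), times = 3),
  score = c(0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
)

# 1 = patient was never diseased, 0 = patient was diseased at least once
data$neverill <- with(data, ave(Dx, ID, FUN = function(x) all(x == 1)))

# Proportion of positive scores per month within each group
# (agg is a hypothetical name; mean of a 0/1 variable is a proportion)
agg <- aggregate(score ~ Month + neverill, data = data, FUN = mean)
agg
```

The resulting agg has one row per Month and neverill combination, so it could be plotted as one line per group instead of one panel per patient.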

Note: A lot of the following data manipulation needs to be done for part 2. Part 1 is less complex, and you can see it fit in below.
Uses
library(data.table)
library(ggplot2)
library(reshape2)
To Compare
First, recode Dx from 1/2 to 0/1 so that it is on the same scale as score (assuming that a 0 in score corresponds to a 1 in the original Dx):
data$Dx <- data$Dx - 1
Now, create a matrix that returns a 1 for a 1 diagnosis with a 0 test, and a -1 for a 1 test with a 0 diagnosis.
compare <- matrix(c(0,1,-1,0),ncol = 2,dimnames = list(c(0,1),c(0,1)))
> compare
   0  1
0  0 -1
1  1  0
Now, let's score every event. This simply looks up the matrix above for every row of your data (rows of compare are Dx, columns are score):
data$calc <- diag(compare[as.character(data$Dx),as.character(data$score)])
*Note: This can be sped up for large matrices using matching, but it is a quick fix for smaller sets like yours
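One way to do the faster lookup mentioned in the note is to index the matrix with a two-column character matrix, which picks one cell per row instead of building the full n-by-n table and taking its diagonal. This is a sketch under the assumption that Dx has already been recoded to 0/1; the names calc_fast and calc_diag are illustrative.

```r
# Lookup table: rows are Dx (0/1), columns are score (0/1)
compare <- matrix(c(0, 1, -1, 0), ncol = 2,
                  dimnames = list(c(0, 1), c(0, 1)))

# Dx (already recoded to 0/1) and score from the question's data
Dx    <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0)
score <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)

# Each row of the index matrix names one (row, column) cell of compare
calc_fast <- compare[cbind(as.character(Dx), as.character(score))]

# Same values as the diag() approach, without the n x n intermediate
calc_diag <- diag(compare[as.character(Dx), as.character(score)])
all(calc_fast == calc_diag)
```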
To allow us to use data.table aggregation:
data <- data.table(data)
Now we need to create our variables:
tograph <- melt(data[, list(ScoreTrend = sum(score)/.N,
Type = sum(calc)/length(calc[calc != 0]),
Measure = sum(abs(calc))),
by = Month],
id.vars = c("Month"))
ScoreTrend: The proportion of positive scores in each month. Shows the trend of scores over time.
Type: The proportion of -1 vs 1 over time. If this returns -1, all events were score = 1, diag = 0. If it returns 1, all events were diag = 1, score = 0. A zero would mean a balance between the two.
Measure: The raw number of incorrect events.
We melt this data frame along month so that we can create a facet graph.
If there are no incorrect events, we will get a NaN for Type. To set this to 0:
tograph[is.nan(value), value := 0]
Finally, we can plot
ggplot(tograph, aes(x = Month, y = value)) + geom_line() + facet_wrap(~variable, ncol = 1)
We can now see, in one plot:
The proportion of positive scores by month
The proportion of under vs. over diagnosis
The number of incorrect diagnoses.

Related

Identify first occurrence of vector where 12 of 15 values are 1

I have a vector like so:
test = c(NA, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA, 1, 1, 1, 1, NA, NA, 1)
and within this vector I want to identify the first position at which 12 of 15 consecutive values are equal to 1.
I have started by using rle to count the consecutive values:
#get counts of sequences
count = rle(test)
and then getting a sequence based on this:
#make a sequence of the counts
new <- sequence(count$lengths)
I will then turn any values in new to 0 where a test value is equal to NA:
#when the value was na make the count 0
new[is.na(test)] <- 0
and lastly I will change all other values to 1:
#make all other counts 1
new[new !=0] <- 1
which will return:
[1] 0 1 1 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 1
This is where I am stuck. I know that the first index at which 12 of the next 15 values are 1 is idx = 6, but I do not know how to retrieve it with an algorithm.
We can use zoo::rollsum to find the first point where the next 15 values sum to 12 or more. Since rollsum does not handle NA values, apply it to the 0/1 vector new built in the question:
which(zoo::rollsum(new, 15, align = "left", fill = 0) >= 12)[1]
#> [1] 6
Created on 2020-07-17 by the reprex package (v0.3.0)
We could use rollapply to do the sum of logical vector, and get the index of first match where the sum is 12 with match
library(zoo)
match(12, rollapply(test, width = 15, FUN = function(x) sum(x== 1, na.rm = TRUE)))
#[1] 6
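For comparison, here is a minimal base R sketch of the same rolling-window idea without zoo, using differences of a cumulative sum. The variable names (ones, width, window_sums, first_idx) are illustrative, and the threshold is read as "at least 12", which is an assumption about the intended condition.

```r
test <- c(NA, 1, 1, 1, NA, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, NA,
          1, 1, 1, 1, NA, NA, 1)

# 1 where the value is exactly 1, 0 otherwise (including NA)
ones  <- as.integer(!is.na(test) & test == 1)
width <- 15

# Sum of each window of 15 values, indexed by the window's start position
cs          <- c(0, cumsum(ones))
starts      <- seq_len(length(ones) - width + 1)
window_sums <- cs[starts + width] - cs[starts]

# First start position whose window contains at least 12 ones
first_idx <- which(window_sums >= 12)[1]
first_idx
```

If no window reaches the threshold, first_idx is NA, which is easy to check for before using the index.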

R: how to format my data for multinomial logit?

I am reproducing some Stata code on R and I would like to perform a multinomial logistic regression with the mlogit function, from the package of the same name (I know that there is a multinom function in nnet but I don't want to use this one).
My problem is that, to use mlogit, I need my data to be formatted using mlogit.data and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.
Indeed, the data I use is like:
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
ID type age dum1 dum2
1 1 2 28 1 1
2 2 3 31 0 0
3 3 4 12 0 1
4 4 2 1 0 1
5 5 1 49 0 0
6 6 1 80 1 0
7 7 4 36 0 1
8 8 1 53 1 0
9 9 3 22 1 1
10 10 2 10 0 0
whereas the data they use is like:
key altkey A B C D
1 201005131 1 2.6 118.17 117 0
2 201005131 2 1.4 117.11 115 0
3 201005131 3 1.1 117.38 122 1
4 201005131 4 24.6 NA 122 0
5 201005131 5 48.6 91.90 122 0
6 201005131 6 59.8 NA 122 0
7 201005132 1 20.2 118.23 113 0
8 201005132 2 2.5 123.67 120 1
9 201005132 3 7.4 116.30 120 0
10 201005132 4 2.8 118.86 120 0
11 201005132 5 6.9 124.72 120 0
12 201005132 6 2.5 123.81 120 0
As you can see, in their case, there is a column altkey that details every category for each key and there is also a column D showing which alternative is chosen by the person.
However, I only have one column (type) which shows the choice of the individual but does not show the other alternatives or the value of the other variables for each of these alternatives. When I try to apply mlogit, I have:
library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)
Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) :
row names supplied are of the wrong length
Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?
Edit: following the advice of @edsandorf, I modified my dataframe and mlogit.data works, but now all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, can somebody show me the procedure from where I am to the results of mlogit, because I don't see where I went wrong in the estimation?)
The data I show here (df) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (type), columns with dummies and age, etc.
Here's the procedure I've made so far (I did not set the alternatives to 0):
# create a dataframe with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type to 1 if it matches the alternative in this row, else 0
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
ID choice type age dum1 dum2
1.1 1 1 FALSE 28 1 1
1.2 1 2 TRUE 28 1 1
1.3 1 3 FALSE 28 1 1
1.4 1 4 FALSE 28 1 1
2.1 2 1 FALSE 31 0 0
2.2 2 2 FALSE 31 0 0
If I do :
mlogit(type ~ age + dum1 + dum2, df3)
I have the error:
Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number
Your data doesn't lend itself well to estimation with an MNL model unless we make more assumptions. In general, since all your variables are individual specific and do not vary across alternatives (types), the model cannot be identified. All of your individual-specific characteristics will drop out unless we treat them as alternative specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
library(mlogit)
df <- data.frame(ID = seq(1, 10),
type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))
Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2 etc.
qqch <- data.frame(ID = rep(df$ID, each = 4),
choice = rep(1:4, 10))
# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")
# recode type to 1 if it matches the alternative in this row, else 0
for (i in 1:length(df2$ID)){
df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}
# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)
# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
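The expansion to one row per individual and alternative can also be written without the row-by-row loop. The following is a base R sketch of that reshaping step only (it does not call mlogit); the object name long is an assumption for illustration.

```r
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# One row per (ID, alternative) pair, joined with the individual's data
long <- merge(expand.grid(ID = df$ID, choice = 1:4), df, by = "ID")
long <- long[order(long$ID, long$choice), ]

# Vectorized recode: 1 where the row's alternative is the chosen one
long$type <- as.integer(long$type == long$choice)

head(long, 4)
```

Each ID then has exactly four rows, one of which has type equal to 1, which is the long shape used in the steps above.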
Now we can run the model. I include the dummies for each of the alternatives keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second half of the formula (after the pipe operator), I make sure that I remove all alternative specific constants that the model would have created and I add your individual specific variables, treating them as alternative specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.
model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
reflevel = 4, data = df3)
summary(model)

exchange two columns and remove the duplicates in a data frame using R

Here is an example to explain what I want to do. I have a data frame like:
X Y
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
I want to change it to another format:
X1 Y1 X2 Y2
1 1 1 1
1 2 2 1
1 3 3 1
......
Take two rows in the first table, say X=1, Y=2 and X=2, Y=1. They just exchange each other's values. So I want to put such rows in one row, as shown in the second table, and then remove the duplicates. So, the 'thin and long' table is turned into a 'short and fat' one. I know how to do it using two for loops. But in R, such an operation takes forever. So, can anyone help me with a quick way?
Here is a smallest example:
The original table is:
X Y
1 2
2 1
The transferred table that I want is like:
X1 Y1 X2 Y2
1 2 2 1
So, the rows in the first table that just exchanges values are integrated into one row in the second table and the extra row in the first table is removed.
Maybe the code below in base R can work
dfout <- `names<-`(cbind(r <- subset(df,df$Y>=df$X),rev(r)),
c("X1","Y1","X2","Y2"))
such that
> dfout
X1 Y1 X2 Y2
1 1 1 1 1
2 1 2 2 1
3 1 3 3 1
5 2 2 2 2
6 2 3 3 2
9 3 3 3 3
DATA
df <- structure(list(X = c(1, 1, 1, 2, 2, 2, 3, 3, 3), Y = c(1, 2,
3, 1, 2, 3, 1, 2, 3)), class = "data.frame", row.names = c(NA,
-9L))
library(tidyverse)
df <- tibble(x1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
y1 = c(1, 2, 3, 1, 2, 3, 1, 2, 3))
# keep one row per unordered pair, then add the swapped columns
df <- df %>% filter(x1 <= y1) %>% mutate(x2 = y1, y2 = x1) %>% distinct()
I think this does the trick.
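If the table is not guaranteed to contain both orientations of every pair, filtering on Y >= X could drop pairs that only appear as (larger, smaller). A possible base R sketch for that case builds an order-independent key with pmin and pmax and keeps the first row per key. The names key, keep and out are illustrative.

```r
df <- data.frame(X = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 Y = c(1, 2, 3, 1, 2, 3, 1, 2, 3))

# Order-independent key: (1, 2) and (2, 1) both map to "1 2"
key <- paste(pmin(df$X, df$Y), pmax(df$X, df$Y))

# Keep the first occurrence of each unordered pair
keep <- df[!duplicated(key), ]

# Put the pair and its swapped counterpart side by side
out <- data.frame(X1 = keep$X, Y1 = keep$Y, X2 = keep$Y, Y2 = keep$X)
out
```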

collapse/aggregate some parts of an adjacency matrix simultaneously on rows and columns

I have a matrix, which represents mobility between various jobs:
jobnames <- c("job 1","job 2","job 3","job 4","job 5","job 6","job 7")
jobdat <- matrix(c(
5, 5, 5, 0, 0, 5, 5,
5, 5, 2, 5, 5, 1, 5,
1, 5, 5, 5, 0, 0, 1,
1, 0, 5, 5, 8, 0, 1,
0, 5, 0, 0, 5, 5, 1,
0, 0, 5, 5, 0, 5, 5,
0, 1, 0, 0, 5, 1, 5
),
nrow = 7, ncol = 7, byrow = TRUE,
dimnames = list(jobnames,jobnames
))
This is treated as a directed, weighted adjacency matrix in a social network analysis. The direction of the network is from rows to columns: So mobility is defined as going from a job-row to a job-column. The diagonal is relevant, since it is possible to change to the same job in another firm.
I need to collapse this matrix according to a prefigured list containing the index of the jobs that should be combined:
group.list <- list(grp1=c(1,2) ,grp2 =c(3,4))
Now, since it is an adjacency matrix, it's a bit different from the other answers about how to collapse a matrix that I've found here and elsewhere. The collapse has to be simultaneous on both the rows and the columns. And some jobs aren't grouped at all. So the result in this example should be like this:
group.jobnames <- c("job 1 and 2","job 3 and 4","job 5","job 6","job 7")
group.jobdat <- matrix(c(
20,12,5,6,10,
7,17,8,0,2,
5,0,5,5,1,
0,10,0,5,5,
1,0,5,1,5
),
nrow = 5, ncol = 5, byrow = TRUE,
dimnames = list(group.jobnames,group.jobnames
))
This example groups the two first jobs and then the next two, but in my actual data it could be any combination of (indexes of) jobs, and any number of jobs in each group. So job [1,7] could be one group, and job [2,3,6] could be another group, while job 4 or 5 wasn't grouped. Or any other combination.
Thank you for your time,
I believe there are some typos in the intended output and in the group.list definition. If I am correct in my interpretation, here is a solution.
Here is a new group.list to conform with the names of the desired output. In this version, job 2 is mapped to job 1 and job 4 is mapped to job 3, which conforms with the text in group.jobnames.
group.list <- list(grp1=c(1, 3), grp2=c(2, 4))
Given this list, construct a grouping vector
# initial grouping
groups <- seq_len(ncol(jobdat))
# map elements of second list item to values of first list item
groups[match(group.list[["grp2"]], groups)] <- group.list[["grp1"]]
groups
[1] 1 1 3 3 5 6 7
So, now jobs 1 and 2 share a group, as do jobs 3 and 4. Now, we use rowsum and a couple of transposes to calculate the output.
myMat <- t(rowsum(t(rowsum(jobdat, groups)), groups))
# add the group names
dimnames(myMat) <- list(group.jobnames,group.jobnames)
myMat
job 1 and 2 job 3 and 4 job 5 job 6 job 7
job 1 and 2 20 12 5 6 10
job 3 and 4 7 20 8 0 2
job 5 5 0 5 5 1
job 6 0 10 0 5 5
job 7 1 0 5 1 5
In response to the OP's comments below, the grouping was intended to be within list elements, rather than between corresponding positions of the list elements as I had originally interpreted. To form this grouping, repeatedly feed replace to Reduce.
With group.list as in the question,
group.list <- list(grp1=c(1, 2), grp2=c(3, 4))
# start from one group per job, then merge the members of each list element
groups <- Reduce(function(x, y) replace(x, x %in% y, min(y)),
c(list(seq_len(ncol(jobdat))), unname(group.list)))
groups
[1] 1 1 3 3 5 6 7
Here, replace takes the original grouping, finds the elements in the grouping that are in one of the vectors in group.list, and replaces these with the minimum value of that vector. The Reduce function repeatedly applies this operation on the original group variable, except modifying it in each iteration.
With this result, we use the above transposes and rowsum to get
myMat
job 1 and 2 job 3 and 4 job 5 job 6 job 7
job 1 and 2 20 12 5 6 10
job 3 and 4 7 20 8 0 2
job 5 5 0 5 5 1
job 6 0 10 0 5 5
job 7 1 0 5 1 5
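The two steps (building the grouping vector, then summing rows and columns) can be wrapped into one helper. This is a sketch, assuming the groups in the list do not overlap; the function name collapse_adjacency is hypothetical and not from the original answer.

```r
jobnames <- c("job 1", "job 2", "job 3", "job 4", "job 5", "job 6", "job 7")
jobdat <- matrix(c(
  5, 5, 5, 0, 0, 5, 5,
  5, 5, 2, 5, 5, 1, 5,
  1, 5, 5, 5, 0, 0, 1,
  1, 0, 5, 5, 8, 0, 1,
  0, 5, 0, 0, 5, 5, 1,
  0, 0, 5, 5, 0, 5, 5,
  0, 1, 0, 0, 5, 1, 5
), nrow = 7, ncol = 7, byrow = TRUE, dimnames = list(jobnames, jobnames))

# Hypothetical helper: collapse rows and columns of a square matrix
# simultaneously. Assumes the index vectors in group.list are disjoint.
collapse_adjacency <- function(mat, group.list) {
  groups <- seq_len(ncol(mat))
  for (members in group.list) {
    groups[members] <- min(members)  # label each group by its smallest index
  }
  # Sum rows by group, then sum columns by group via a transpose
  t(rowsum(t(rowsum(mat, groups)), groups))
}

res <- collapse_adjacency(jobdat, list(grp1 = c(1, 2), grp2 = c(3, 4)))
res
```

The row and column names of res are the smallest job index in each group, so descriptive labels would still need to be assigned with dimnames afterwards. A call such as collapse_adjacency(jobdat, list(c(1, 7), c(2, 3, 6))) handles non-adjacent groupings the same way.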

Transform a dataset to summarize table in R

I am learning about market basket analysis in data mining and would like to transform the raw data into a summary table for further calculation of support and confidence.
Below is an example with 4 transactions, each indicating which items the customer has purchased.
The example looks like the following:
Afterwards I would like to have all possible item sets. For the above example, the total is 24 item sets.
It sounds like you're looking for the crossprod function:
M <- data.frame(ID = 1:4, A = c(1, 0, 1, 0),
B = c(1, 1, 0, 0), C = c(0, 1, 1, 0),
D = c(0, 0, 1, 1))
crossprod(as.matrix(M[-1]))
# A B C D
# A 2 1 1 1
# B 1 2 1 0
# C 1 1 2 1
# D 1 0 1 2
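Since the question mentions support and confidence, here is a sketch of how the pairwise versions of those measures could be derived from the co-occurrence matrix. It covers single items and pairs only, not larger item sets, and the names items, co, support and confidence are illustrative.

```r
M <- data.frame(ID = 1:4, A = c(1, 0, 1, 0),
                B = c(1, 1, 0, 0), C = c(0, 1, 1, 0),
                D = c(0, 0, 1, 1))

items <- as.matrix(M[-1])
co    <- crossprod(items)   # co[i, j] = transactions containing both i and j
n     <- nrow(items)        # number of transactions

# Support of each item (diagonal) and each pair (off-diagonal)
support <- co / n

# Confidence of the rule "row item => column item":
# pair count divided by the count of the row item
confidence <- co / diag(co)

support["A", "B"]
confidence["A", "B"]
```

Dividing by diag(co) recycles the item counts down the rows, so each row of confidence is scaled by the number of transactions containing that row's item.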

Resources