Migration (or conversion) path map in R [closed]

I need to make a migration (or conversion) path map in R.
Here is an example of my data.frame:
ID order state
1 1 a
1 2 b
1 3 b
2 1 b
2 2 b
2 3 c
3 1 b
3 2 c
4 1 a
4 2 b
5 1 c
In this data.frame, ID 1 has moved a -> b -> b according to the order.
In the same way, ID 2 has moved b -> b -> c, ID 3 has moved b -> c, ID 4 has moved a -> b, and ID 5 did not move.
At the aggregate level, we can draw a migration (or conversion) path map like the one below.
In this map, the arrows carry the frequency of each path, and the circles carry the frequency of each state.
How can I make this path map in R? Are there any packages for this?

Here's a possibility using the diagram package. Most of the work here is just reshaping the data into a nice format. There may be more efficient ways, but this at least seems to work. First, your data. I also want to make sure that we treat the order column as a factor rather than a numeric value.
# sample input data
dd <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L,
5L), order = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L),
state = structure(c(1L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 1L, 2L, 3L),
.Label = c("a", "b", "c"), class = "factor")),
.Names = c("ID", "order", "state"),
class = "data.frame", row.names = c(NA, -11L))
dd$order <- factor(dd$order)
Now we begin the transformation. We need to create an adjacency matrix between all the state/order positions.
ss <- interaction(dd$state, dd$order)  # one factor level per state/order position
Embed <- function(x) if (length(x) > 1) embed(x, 2) else numeric(0)
adj <- do.call(rbind, lapply(split(as.numeric(ss), dd$ID), Embed))  # consecutive position pairs per ID
tf <- function(x) factor(levels(ss)[x], levels = levels(ss))
tt <- table(tf(adj[, 1]), tf(adj[, 2]))  # transition counts (rows = to, cols = from, as plotmat expects)
Then we rename the rows of the matrix (because that's what is used as labels on the plot).
rownames(tt) <- paste(levels(dd$state), table(dd$state, dd$order), sep="/")
And now we focus on the layout. We assign positions to each circle, then plot the diagram with the transitions, and finally add the text at the top.
xpos <- cbind(rep(1:nlevels(dd$order), each = nlevels(dd$state)),
rev(rep(1:nlevels(dd$state), nlevels(dd$order))))
xpos <- (xpos - 1)/2 * 0.7 + 0.15
library(diagram)  # provides plotmat()
plotmat(tt, pos = xpos)
text(paste("order", levels(dd$order)), x = unique(xpos[, 1]), y = 1, xpd = NA)
Running this produces the migration path diagram (image not shown here).
I tried to make it as robust as possible to different numbers of states/orders, but I didn't fully test it, so be sure to double-check the results with your real data.
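If you just want the transition frequencies in tabular form, for instance to cross-check the arrow labels in the diagram, a small dplyr sketch along these lines should work (next_state is a column name I made up; everything else matches the sample data above):
library(dplyr)
# count transitions between consecutive orders per ID; this only
# verifies the counts, it is not part of the plotting solution above
dd %>%
  arrange(ID, order) %>%
  group_by(ID) %>%
  mutate(next_state = lead(state)) %>%
  ungroup() %>%
  filter(!is.na(next_state)) %>%
  count(order, state, next_state)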

Related

group_by edit distance between rows over multiple columns

I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class, it contains the pairwise distance or difference with every other student's id, stored as (id1, id2 = difference).
cheating:
This column lists any IDs from the previously created difference column where the difference was zero (or below some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions, but none of them address the exact problem that I am facing, e.g.:
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate differences; the stringdist function may be better, or another option may be out there to try.
df<- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
library(dplyr)  # needed for filter(), arrange() and %>% below

# apply the standard distance function to the question columns
scores <- hclust(dist(df[, 3:6]))
plot(scores)
# divide into groups based on how closely the scores match
groups <- cutree(scores, h = 0.1)
# summary table
summarytable <- data.frame(class = df$class, id = df$id, groupings = groups)
# select groups with 2 or more people in them
suspectgroups <- table(groups)[table(groups) >= 2]
potential_cheaters <- summarytable %>%
  filter(groupings %in% names(suspectgroups)) %>%
  arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
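Since the question asks specifically about dplyr: below is a rough sketch of the grouped pairwise differences. It produces one row per id pair rather than the exact (id1, id2 = difference) string format requested, and it assumes every class has at least two students:
library(dplyr)
# absolute (Manhattan) distance between every pair of students
# within each class
df %>%
  group_by(class) %>%
  group_modify(~ {
    prs <- t(combn(.x$id, 2))  # all id pairs within the class
    d <- apply(prs, 1, function(p) {
      a <- .x[.x$id == p[1], c("q1", "q2", "q3", "q4")]
      b <- .x[.x$id == p[2], c("q1", "q2", "q3", "q4")]
      sum(abs(unlist(a) - unlist(b)))
    })
    data.frame(id1 = prs[, 1], id2 = prs[, 2], difference = d)
  }) %>%
  ungroup()
Pairs with a difference of 0 (or below your threshold) are the candidates for the cheating flag.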

Converting a data frame to a modified list

Although there are a lot of questions concerning this topic, I cannot seem to find the right answer, so I am directing this question to you guys.
The context:
I've got a data set with a lot of rows (150K+) and 32 corresponding columns. The second column is a document number. The document number is not a unique ID, so the data contains multiple rows with the same document number. I would like to create a list of the document numbers, where each element contains another list with the corresponding rows for that document number.
For example:
Here is an example of the data (I included a dput output of the example below).
Document Number Col.A Col.B
A random_56681 random_24984
A random_78738 random_23098
A random_48640 random_32375
B random_96243 random_96927
B random_72045 random_52583
C random_19367 random_20441
C random_96778 random_22161
C random_48038 random_95644
C random_62999 random_44561
Now here is what I am looking for. I need a list that contains the 3 documents (A, B, C). Each of these lists needs to contain another list containing the corresponding rows. For example, the main list (let's say my_list) should have 3 lists A, B and C; each of these lists should contain respectively 3, 2 and 4 lists.
Hope I was clear enough in asking the question (if not please let me know).
Here you can find the example data:
structure(list(Document_Number = structure(c(1L, 1L, 1L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Col.A = structure(c(4L, 7L, 3L, 8L, 6L, 1L, 9L, 2L, 5L), .Label = c("random_19367",
"random_48038", "random_48640", "random_56681", "random_62999",
"random_72045", "random_78738", "random_96243", "random_96778"
), class = "factor"), Col.B = structure(c(4L, 3L, 5L, 9L,
7L, 1L, 2L, 8L, 6L), .Label = c("random_20441", "random_22161",
"random_23098", "random_24984", "random_32375", "random_44561",
"random_52583", "random_95644", "random_96927"), class = "factor")), class = "data.frame", row.names = c(NA,
-9L))
You can use split like:
split(x, x$Document_Number)
#$A
# Document_Number Col.A Col.B
#1 A random_56681 random_24984
#2 A random_78738 random_23098
#3 A random_48640 random_32375
#
#$B
# Document_Number Col.A Col.B
#4 B random_96243 random_96927
#5 B random_72045 random_52583
#
#$C
# Document_Number Col.A Col.B
#6 C random_19367 random_20441
#7 C random_96778 random_22161
#8 C random_48038 random_95644
#9 C random_62999 random_44561
An option is group_split():
library(dplyr)
df1 %>%
  group_split(Document_Number)
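Both approaches return a list of data frames, one per document number. If you literally need each element to be a list of row-lists, as described in the question, here is a small sketch building on split() (assuming the question's data frame is named x, as in the first answer):
# turn each per-document data frame into a list of rows, giving a
# list (one element per document) of lists (one element per row)
my_list <- lapply(split(x, x$Document_Number), function(d) {
  lapply(seq_len(nrow(d)), function(i) as.list(d[i, ]))
})
str(my_list, max.level = 2)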

Overwrite levels of factor columns in one dataframe using another

I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.
Consider this example:
base <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Non-Compounding and Standard Non-Compounding",
"OCR based Call", "Offsale Call", "Offsale Savings",
"Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)
final <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Standard Non-Compounding", "OCR based Call",
"Offsale Call", "Offsale Savings","Offsale Transactional",
"Out of Scope","Personal Call", "You Money"))
final$product <- as.factor(final$product)
What I now want is for the final data frame to have the same levels as base, removing the levels which do not exist in base at all (like "You Money"), whereas "Monthly Standard Non-Compounding" should be fuzzy-matched.
Eg:
levels(base$var1) <- "a" "b" "c"
levels(final$var1) <- "Aa" "Bb" "Cc"
Is there a way to overwrite the levels in base data using the final data using some kind of fuzzy match?
Like I want the final levels for both data to be the same. i.e.
levels(base$var1) <- "Aa" "Bb" "Cc"
levels(final$var1) <- "Aa" "Bb" "Cc"
We could build our own fuzzyMatcher.
First, we'll need a kind of vectorized agrep() function,
agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))
on which we build our fuzzyMatcher.
fuzzyMatcher <- function(from, to) {
  mc <- mapply(function(y)
    which(mapply(function(x) agrepv(y, x), Map(levels, to))),
    Map(levels, from))
  # apply the matched 'from' level sets onto the columns of 'to'
  return(Map(function(x, y) `levels<-`(x, y), to,
             Map(levels, from)[mc]))
}
The final labels applied to the base labels (note that I've shifted the columns to make it a little more sophisticated):
base1[] <- fuzzyMatcher(final1, base1)
base1
# X1 X2
# 1 Aa Xx
# 2 Aa Xx
# 3 Aa Yy
# 4 Aa Yy
# 5 Bb Yy
# 6 Bb Zz
# 7 Bb Zz
# 8 Aa Xx
# 9 Cc Xx
# 10 Cc Zz
Update
Based on the newly provided data above, it makes sense to use another vectorized function, agrepv2(), which, used with outer(), lets us apply agrep() to all combinations of the levels of both vectors. Columns whose sums equal zero then give us the non-matching levels, and which.max gives the matching levels of the target data frame final. We can use these two resulting vectors on the one hand to delete unused rows of final, and on the other hand to subset the desired levels of the base data frame in order to rebuild the factor column.
# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))
# some abbrevations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)
# agrep
agrepv2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y], ...)) # new vectorized agrep; pass ... through
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9)) # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max)) # find positive matches
lvl <- as.character(p2[lvl]) # get the labels
# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
# product x
# 1 Business Call 1
# 2 Business Transactional 2
# 4 OCR based Call 4
# 5 Offsale Call 5
# 6 Offsale Savings 6
# 7 Offsale Transactional 7
# 8 Out of Scope 8
# 9 Personal Call 9
Data
base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L,
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L,
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"),
X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA,
-15L), class = "data.frame")
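For reference: if approximate matching of the level names is all that's needed, a much smaller sketch built directly on agrep() may be easier to adapt. match_level() here is a hypothetical helper, not part of the answer above, and max.distance will likely need tuning for your data:
# map each level of final$product to its closest agrep() match among
# the levels of base$product; levels with no match come back as NA
match_level <- function(lvl, candidates) {
  hit <- agrep(lvl, candidates, max.distance = 0.3, value = TRUE)
  if (length(hit) > 0) hit[1] else NA_character_
}
sapply(levels(final$product), match_level, candidates = levels(base$product))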

How to calculate the number of different 'combinations of classes' of a variable depending on another variable?

CustomerID MarkrtungChannel OrderID
1 A 1
2 B 2
3 A 3
4 B 4
5 C 5
1 C 6
1 A 7
2 C 8
3 B 9
3 B 10
Hi, I want to know which combinations of marketing channels are used by how many customers.
How can I calculate this with R?
E.g. the combination of marketing channels A and C is used by 1 customer (ID 1),
the combination of marketing channels C and B is also used by 1 customer (ID 2),
and so on...
And here's a tidyverse way,
library(tidyverse)
data.df %>%
  group_by(CustomerID) %>%
  summarize(combo = paste0(sort(unique(MarkrtungChannel)), collapse = "")) %>%
  ungroup() %>%
  group_by(combo) %>%
  summarize(n.users = n())
counting the number of people using each combo at the end.
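For comparison, a base-R sketch of the same idea (assuming the data frame is named data.df, as above):
# collapse each customer's sorted unique channels into one string,
# then tabulate how many customers share each combination
combos <- aggregate(MarkrtungChannel ~ CustomerID, data = data.df,
                    FUN = function(x) paste(sort(unique(x)), collapse = ""))
table(combos$MarkrtungChannel)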
You can do it multiple ways. Here is a data.table way:
# Here is your data
df<-structure(list(CustomerID = c(1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L,
3L, 3L), MarkrtungChannel = structure(c(1L, 2L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor"),
OrderID = 1:10), .Names = c("CustomerID", "MarkrtungChannel",
"OrderID"), class = "data.frame", row.names = c(NA, -10L))
df[] <- lapply(df, as.character)
# Here is the combination field
library(data.table)
setDT(df)
df[,Combo:=.(list(unique(MarkrtungChannel))), by=CustomerID]
# Or (to get the combination counts)
df[,list(combo=(list(unique(MarkrtungChannel)))), by=CustomerID][,uniqueN(CustomerID),by=combo]

How do I, in R, transform a dataset which has a response per row into a matrix where columns are answers? [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Let's start with:
df <- structure(list(user.id = c(2L, 3L, 1L, 3L, 1L, 4L), questions = structure(c(3L,
3L, 3L, 1L, 3L, 2L), .Label = c("Do you own an xbox?", "How many game consoles do you own?",
"which game did you buy recently?"), class = "factor"), answers = structure(c(2L,
5L, 3L, 6L, 4L, 1L), .Label = c("3", "DOOM", "Fallout 3", "Ghost Recon",
"Mario", "yes"), class = "factor")), .Names = c("user.id", "questions",
"answers"), row.names = c(NA, -6L), class = "data.frame")
This gives us the data.frame
> df
user.id questions answers
1 2 which game did you buy recently? DOOM
2 3 which game did you buy recently? Mario
3 1 which game did you buy recently? Fallout 3
4 3 Do you own an xbox? yes
5 1 which game did you buy recently? Ghost Recon
6 4 How many game consoles do you own? 3
I'd like to transform this to a data.frame or equivalent where:
> matrixed
user.id q_1 q_2 q_3
1 1 Ghost Recon
2 2 DOOM
3 3 yes Mario
4 4 3
Right now I'm using this primitive piece of code:
questions <- sort(unique(df$questions))
user.id <- unique(sort(df$user.id))
matrixed <- data.frame(user.id)
sapply(1:length(questions), function(i) matrixed[, paste0("q_", i)] <<- rep("", length(user.id)))
sapply(1:nrow(df), function(j) matrixed[df[j, ]$user.id, paste0("q_", which(df[j, ]$questions == questions))] <<- as.character(df[j, ]$answers))
Are there more elegant ways to do this -- perhaps libraries that help handle this type of data?
The standard tidyverse solution to this:
library(dplyr)
library(tidyr)
df %>% tidyr::spread(questions, answers)
Doesn't quite work in your case because user 1 answered "which game did you buy recently?" twice according to your data.
If you wanted to take the first answer for each user.id x question combination, you could use
df %>%
  distinct(user.id, questions, .keep_all = TRUE) %>%
  tidyr::spread(questions, answers)
Careful: The result of this will depend on how your data is sorted.
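In current tidyr (1.0.0+), pivot_wider() supersedes spread() and lets you resolve the duplicate explicitly; a sketch of the same approach, where values_fn = first keeps the first answer per user/question pair:
library(dplyr)
library(tidyr)
# names_from/values_from replace spread()'s key/value arguments
df %>%
  pivot_wider(names_from = questions, values_from = answers,
              values_fn = first)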
