I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating:
This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!
You could try to clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, the cut the tree at the threshold of expected cheating.
This example I am using the standard dist() function to calculate differences, the stringdist function may be better or maybe another option is out there to try.
df<- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with more than 2 people in them
suspectgroups <- table(groups)[table(groups) >=2]
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
Related
CustomerID MarkrtungChannel OrderID
1 A 1
2 B 2
3 A 3
4 B 4
5 C 5
1 C 6
1 A 7
2 C 8
3 B 9
3 B 10
Hi, I want to know which combinations of marketing channels are used by how many customers .
How can I calculate this with R?
E.g. The combination of Marketing channels A and C is used by 1 customer (ID 1)
the combination of Marketing channels C and B is also used by 1 customer (ID 2)
And so on...
and here's a tidyverse way.
library(tidyverse)
data.df%>%
group_by(CustomerID)%>%
summarize(combo=paste0(sort(unique(MarkrtungChannel)),collapse=""))%>%
ungroup()%>%
group_by(combo)%>%
summarize(n.users=n())
counting the number of people using each combo at the end.
You can do it multiple ways. Here is data.table way:
# Here is your data
df<-structure(list(CustomerID = c(1L, 2L, 3L, 4L, 5L, 1L, 1L, 2L,
3L, 3L), MarkrtungChannel = structure(c(1L, 2L, 1L, 2L, 3L, 3L,
1L, 3L, 2L, 2L), .Label = c("A", "B", "C"), class = "factor"),
OrderID = 1:10), .Names = c("CustomerID", "MarkrtungChannel",
"OrderID"), class = "data.frame", row.names = c(NA, -10L))
df[]<-lapply(df[],as.character)
# Here is the combination field
library(data.table)
setDT(df)
df[,Combo:=.(list(unique(MarkrtungChannel))), by=CustomerID]
# Or (to get the combination counts)
df[,list(combo=(list(unique(MarkrtungChannel)))), by=CustomerID][,uniqueN(CustomerID),by=combo]
I'm creating a shiny application that will have a checkboxGroupInput, where each box checked will add another line to a frequency plot. I'm trying to wrap my head around reshape2 and ggplot2 to understand how to make this possible.
data:
head(testSet)
date store_id product_id count
1 2015-08-15 3 1 8
2 2015-08-15 3 3 1
3 2015-08-17 3 1 7
4 2015-08-17 3 2 3
5 2015-08-17 3 3 1
6 2015-08-18 3 3 2
class level information:
dput(droplevels(head(testSet, 10)))
structure(list(date = structure(c(16662, 16662, 16664,
16664, 16664, 16665, 16665, 16665, 16666, 16666), class = "Date"),
store_id = c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), product_id = c(1L,
3L, 1L, 2L, 3L, 3L, 1L, 2L, 1L, 2L), count = c(8L, 1L, 7L,
3L, 1L, 2L, 18L, 1L, 0L, 2L)), .Names = c("date", "store_id",
"product_id", "count"), row.names = c(NA, 10L), class = "data.frame")
The graph should have an x-axis that corresponds to date, and a y-axis that corresponds to count. I would like to have a checkbox group input where for each box representing a product checked, a line corresponding to product_id will be plotted on the graph. The data is already filtered to store_id.
My first thought was to write a for loop inside of the plot to render a new geom_line() per each returned value of the input$productId vector. -- however after some research it seems that's the wrong way to go about things.
Currently I'm trying to melt() the data down to something useful, and then aes(...group=product_id), but getting errors on whatever I try.
Attempting to melt the data:
meltSet <- melt(testSet, id.vars="product_id", value.name="count", variable.name="date")
head of meltSet
head(meltSet)
product_id date count
1 1 date 16662
2 3 date 16662
3 1 date 16664
4 2 date 16664
5 3 date 16664
6 3 date 16665
tail of meltSet
tail(meltSet)
product_id date count
76 9 count 5
77 1 count 19
78 2 count 1
79 3 count 39
80 8 count 1
81 9 count 4
Plotting:
ggplot(data=meltSet, aes(x=date, y=count, group = product_id, colour = product_id)) + geom_line()
So my axis and values are all wonky, and not what I'm expecting from setting the plot.
If I'm understanding it correctly you don't need any melting, you just need to aggregate your data, summing up count by date and product_id. you can use data.table for this purpose:
testSet = data.table(testSet)
aggrSet = testSet[, .(count=sum(count)), by=.(date, product_id)]
You can do your ggplot stuff on aggrSet. It has three columns now: date, product_id, count.
When you melt like you did you merge two variables with different types into date: date(Date) and store_id(int).
This question already has answers here:
Finding maximum value of one column (by group) and inserting value into another data frame in R
(3 answers)
Closed 7 years ago.
I have this data frame which consists of two vectors and it runs into million of rows. I used loop but it takes a day to compare the value.
Can some one suggest any apply functions??
Names Sales
A 1
A 2
A 3
B 1
B 5
B 6
.
.
what I want is unique list of names along with the maximum element in sales against that particular name. like A has 3 rows and highest sales is 3.
Output should be in data frame.
Names Sales
A 3
B 6
You can try with aggregate()
aggregate(V2 ~ ., df1 , max)
# V1 V2
#1 A 3
#2 B 6
data
df1 <- structure(list(V1 = structure(c(1L, 1L, 1L, 2L, 2L, 2L),
.Label = c("A", "B"), class = "factor"), V2 = c(1L, 2L, 3L, 1L, 5L, 6L)),
.Names = c("V1","V2"), class = "data.frame", row.names = c(NA, -6L))
I have a long list of people receiving drugs coded in the variable ATC. I want to find out how many people have used 4 specific drugs. For example, I want to count how many people have used this particular pattern of drugs "C07ABC" & "C09XYZ" &"C08123" &"C03ZOO". Some people may have used some agents (eg C07 or C08) more than once, thats ok, I just want to count how many unique people had the regimen I'm interested in. I don't care how many times they had the unique drugs. However, because I have various patterns that I want to look up - I would like to use the grepl function. To explain this further, my first attempt at this problem tried a sum command:
sum(df[grepl('^C07.*?'|'^C09.*?'|'^C08.*?|C03.*?', as.character(df$atc)),])
However this doesn't work because I think the sum command needs a boolean function. ALso, I think the | sign isn't correct here either (I want an &) but I'm just showing the code so that you know what I'm after. Maybe an ave function is what I need - but am unsure of how I would code this?
Thanks in advance.
df
names fruit dates atc
4 john kiwi 2010-07-01 C07ABC
7 john apple 2010-09-01 C09XYZ
9 john banana 2010-11-01 C08123
13 john orange 2010-12-01 C03ZOO
14 john apple 2011-01-01 C07ABC
2 mary orange 2010-05-01 C09123
5 mary apple 2010-07-01 C03QRT
8 mary orange 2010-07-01 C09ZOO
10 mary apple 2010-09-01 C03123
12 mary apple 2010-11-01 C09123
1 tom apple 2010-02-01 C03897
3 tom banana 2010-03-01 C02CAMN
6 tom apple 2010-06-01 C07123
11 tom kiwi 2010-08-01 C02DA12
You might consider avoiding the use of regular expressions, and instead derive some set of meaningful columns from column atc. For combinations, you probably want a 2-way table of person and drug, and then compute on the matrix to count combinations.
For example:
tab <- xtabs(~ names + atc, df)
combo <- c("C07ABC", "C09XYZ", "C08123", "C03ZOO")
haveCombo <- rowSums(tab[,combo] > 0) == length(combo)
sum(haveCombo)
The last two lines could easily be turned into a function for each combination.
EDIT: This approach can be applied to other, derived columns, so if you're interested in the prefix then,
df$agent <- substring(df$atc, 1, 3)
tab <- xtabs(~ names + agent, df)
combo <- c("C07", "C09", "C08", "C03")
and proceed as before.
In addition to not needing to deliver entire dataframe lines to sum you also had extra quote marks in that pattern:
> sum( grepl('^C07.*|^C09.*|^C08.*|C03.*', df$atc) )
[1] 12
I think this is easier to read:
> sum( grepl('^(C07|C09|C08|C03).*', df$atc) )
[1] 12
But now I read that you want all of thos used and to do the calculation within a patient id. That might have requiree using & as the connector but I decide to try a different route and use unique and then count then number of unique matches while doing it within an aggregate operation.
> aggregate(atc ~ names, data=df,
function(drgs) length(unique(grep('^(C07|C09|C08|C03)', drgs))))
names atc
1 john 5
2 mary 5
3 tom 2
Although that's the number of matching items but not the number of unique items, because I forgot to put value=TRUE in the grep call (and also need to use substr to avoid separately counting congeners with different trailing ATC codes):
> aggregate(atc ~ names, data=df, function(drgs) length(unique(grep('^C0[7983]', substr(drgs,1,3), value=TRUE))))
names atc
1 john 4
2 mary 2
3 tom 2
This would be somewhat similar to #MichaelLawrence's matrix/table approach, but I think it would scale better since the "tables" being created would be much smaller:
combo <- c("C07", "C09", "C08", "C03")
tapply(df$atc, df$names, function(drgs) sum(combo %in% substr(drgs,1,3)) )
#------
john mary tom
4 2 2
you can try this
drugs <- c("C07ABC","C09XYZ", "C08123", "C03ZOO")
table(unique(df[df$atc %in% drugs, c("names", "atc")])$names)
# john mary tom
# 4 0 0
names(which(table(unique(df[df$atc %in% drugs, c("names", "atc")])$names) > 3))
# [1] "john"
Data
df <- structure(list(names = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("john", "mary", "tom"
), class = "factor"), fruit = structure(c(3L, 1L, 2L, 4L, 1L,
4L, 1L, 4L, 1L, 1L, 1L, 2L, 1L, 3L), .Label = c("apple", "banana",
"kiwi", "orange"), class = "factor"), dates = structure(c(5L,
7L, 8L, 9L, 10L, 3L, 5L, 5L, 7L, 8L, 1L, 2L, 4L, 6L), .Label = c("2010-02-01",
"2010-03-01", "2010-05-01", "2010-06-01", "2010-07-01", "2010-08-01",
"2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01"), class = "factor"),
atc = structure(c(8L, 11L, 9L, 6L, 8L, 10L, 5L, 12L, 3L,
10L, 4L, 1L, 7L, 2L), .Label = c("C02CAMN", "C02DA12", "C03123",
"C03897", "C03QRT", "C03ZOO", "C07123", "C07ABC", "C08123",
"C09123", "C09XYZ", "C09ZOO"), class = "factor")), .Names = c("names",
"fruit", "dates", "atc"), class = "data.frame", row.names = c("4",
"7", "9", "13", "14", "2", "5", "8", "10", "12", "1", "3", "6",
"11"))
This is just a continuation of #Michael Lawrence's answer. I changed the drugs to what #user2363642 wanted, and I also substringed the atc column to only use the three first characters, which again, I believe is what #user2363642 wanted. Also, for the rowSums, I first changed all non-zero quantities to 1, to ensure we don't double count drugs.
drugs <- c("C07", "C09", "C08", "C03")
df$atc.abbr <- substring(df$atc, 1, 3)
xt <- xtabs(~ names + atc.abbr, df)
xt[xt>0] <- 1
rowSums(xt[,drugs]) >= length(drugs)
Output:
john mary tom
TRUE FALSE FALSE
I am working with migration data, and I want to produce three summary tables from a very large dataset (>4 million). An example of which is detailed below:
migration <- structure(list(area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("leeds",
"london", "plymouth"), class = "factor"), area.new = structure(c(7L,
13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L, 7L, 11L, 12L, 9L,
1L, 10L, 11L), .Label = c("bath", "bristol", "cambridge", "glasgow",
"harrogate", "leeds", "london", "manchester", "newcastle", "oxford",
"plymouth", "poole", "york"), class = "factor"), persons = c(6L,
3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L, 3L, 4L,
9L, 4L)), .Names = c("area.old", "area.new", "persons"), class = "data.frame", row.names = c(NA,
-19L))
Summary table 1: 'area.within'
The first table I wish to create is called 'area.within'. This will detail only areas where people have moved within the same area (i.e. it will count the total number of persons where 'london' is noted down in 'area.old' and 'area.new'). There will probably be multiple occurrences of this within the data table. It will then do this for all of the different areas, so the summary would be:
area.within persons
1 london 13
2 leeds 5
3 plymouth 5
Using the data table package, I have as far as:
setDT(migration)[as.character(area.old)==as.character(area.new)]
... but this doesn't get rid of duplicates...
Summary table 2: 'moved.from'
The second table will summarise areas which have experienced people moving out (i.e. those unique values in 'area.old'). It will identify areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.from persons
1 london 24
2 leeds 17
3 plymouth 19
Summary table 3: 'moved.to'
The third table summarises which areas have experienced people moving to (i.e. those unique values in 'area.new'). It will identify all the unique areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved between areas - in summary table 1). The resulting table should be:
moved.to persons
1 london 5
2 york 3
3 cambridge 2
4 bristol 5
5 glasgow 6
6 leeds 8
7 york 6
8 harrogate 3
9 manchester 4
10 plymouth 0
11 poole 2
12 newcastle 3
13 bath 4
14 oxford 9
Most importantly, a sum of all the persons detailed in tables 2 and 3 should be the same. And then this value, combined with the persons total for table 1 should equal the sum of the all the persons in the original table.
If anyone could help me sort out how to structure my code using the data table package to produce my tables, I should be most grateful.
Using data.table is a good choice i think.
setDT(migration) #This has to be done only once
1.
To avoid duplicates just sum them up by city as follows
migration[as.character(area.old)==as.character(area.new),
.(persons = sum(persons)),
by=.(area.within = area.new)]
2.
This is very similar to the 1. one but uses != in the i-Argument
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.from = area.old)]
3.
Same as 2.
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.to = area.new)]
Alternative
As 2. and 3. are very similar you can also do:
moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]
Thus only once the selection of the right rows has to be done.