Producing summary tables for very large datasets


I am working with migration data, and I want to produce three summary tables from a very large dataset (>4 million rows). A small example is given below:
migration <- structure(list(area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("leeds",
"london", "plymouth"), class = "factor"), area.new = structure(c(7L,
13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L, 7L, 11L, 12L, 9L,
1L, 10L, 11L), .Label = c("bath", "bristol", "cambridge", "glasgow",
"harrogate", "leeds", "london", "manchester", "newcastle", "oxford",
"plymouth", "poole", "york"), class = "factor"), persons = c(6L,
3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L, 3L, 4L,
9L, 4L)), .Names = c("area.old", "area.new", "persons"), class = "data.frame", row.names = c(NA,
-19L))
Summary table 1: 'area.within'
The first table I wish to create is called 'area.within'. This will detail only areas where people have moved within the same area (i.e. it will count the total number of persons where 'london' is noted down in 'area.old' and 'area.new'). There will probably be multiple occurrences of this within the data table. It will then do this for all of the different areas, so the summary would be:
area.within persons
1 london 13
2 leeds 5
3 plymouth 5
Using the data.table package, I have got as far as:
setDT(migration)[as.character(area.old)==as.character(area.new)]
... but this doesn't get rid of duplicates...
Summary table 2: 'moved.from'
The second table will summarise areas which have experienced people moving out (i.e. those unique values in 'area.old'). It will identify areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved within the same area, who are counted in summary table 1). The resulting table should be:
moved.from persons
1 london 24
2 leeds 17
3 plymouth 19
Summary table 3: 'moved.to'
The third table summarises which areas have experienced people moving in (i.e. those unique values in 'area.new'). It will identify all the unique areas for which column 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved within the same area, who are counted in summary table 1). The resulting table should be:
moved.to persons
1 london 5
2 york 3
3 cambridge 2
4 bristol 5
5 glasgow 6
6 leeds 8
7 york 6
8 harrogate 3
9 manchester 4
10 plymouth 0
11 poole 2
12 newcastle 3
13 bath 4
14 oxford 9
Most importantly, the sum of all the persons in table 2 should be the same as the sum in table 3. This value, combined with the persons total for table 1, should then equal the sum of all the persons in the original table.
If anyone could help me sort out how to structure my code using the data table package to produce my tables, I should be most grateful.

Using data.table is a good choice, I think.
setDT(migration) #This has to be done only once
1.
To avoid duplicates, just sum them up by city as follows:
migration[as.character(area.old)==as.character(area.new),
.(persons = sum(persons)),
by=.(area.within = area.new)]
2.
This is very similar to 1., but uses != in the i argument:
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.from = area.old)]
3.
Same as 2., but grouped by area.new instead of area.old:
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.to = area.new)]
Alternative
As 2. and 3. are very similar you can also do:
moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]
This way the selection of the right rows has to be done only once.
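The consistency check the question asks for (tables 2 and 3 share the same total, and that total plus table 1 equals the overall total) could be sketched roughly as below. This is only a sketch under assumptions: the sample data is retyped compactly with character columns instead of factors (which makes the as.character() calls unnecessary), and the object names are illustrative.

```r
library(data.table)

# Sample data from the question, retyped with character columns (assumption:
# same values as the dput() output, just written more compactly)
migration <- data.table(
  area.old = rep(c("london", "leeds", "plymouth"), times = c(7, 5, 7)),
  area.new = c("london", "york", "cambridge", "bristol", "glasgow", "london", "leeds",
               "london", "leeds", "york", "harrogate", "manchester",
               "london", "plymouth", "poole", "newcastle", "bath", "oxford", "plymouth"),
  persons  = c(6L, 3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L,
               1L, 1L, 2L, 3L, 4L, 9L, 4L)
)

# Table 1: moves within the same area
area.within <- migration[area.old == area.new,
                         .(persons = sum(persons)),
                         by = .(area.within = area.new)]

# Tables 2 and 3: moves between different areas, row selection done once
moved      <- migration[area.old != area.new]
moved.from <- moved[, .(persons = sum(persons)), by = .(moved.from = area.old)]
moved.to   <- moved[, .(persons = sum(persons)), by = .(moved.to = area.new)]

# Consistency checks described in the question
total.within <- sum(area.within$persons)
total.from   <- sum(moved.from$persons)
total.to     <- sum(moved.to$persons)
stopifnot(total.from == total.to)
stopifnot(total.within + total.from == sum(migration$persons))
```

Note that grouping in this way only returns areas that actually occur in the selected rows, so an area with no incoming movers does not appear with a zero count.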

Related

group_by edit distance between rows over multiple columns

I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated.
For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
difference:
For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating:
This column lists any IDs, based on the previously created column, where the difference was zero (or below some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!

You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.
In this example I am using the standard dist() function to calculate the differences; the stringdist function may be better, or there may be another option to try.
library(dplyr) # needed for %>%, filter() and arrange() below
df <- structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))
plot(scores)
#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)
#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)
#select groups with 2 or more people in them
suspectgroups <- table(groups)[table(groups) >=2]
potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings)
potential_cheaters
This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
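Splitting by class first, as suggested above, could look roughly like the following base R sketch. It assumes that the "difference" in the question is the sum of absolute differences across q1 to q4 (Manhattan distance, which matches the numbers in the expected output), and the threshold value and the object names are illustrative only.

```r
df <- data.frame(
  class = c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
  id = c(12L, 16L, 18L, 24L, 35L, 36L),
  q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
  q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
  q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
  q4 = c(3L, 2L, 4L, 3L, 3L, 2L)
)

threshold <- 0  # assumed cut-off: identical answers on all questions

# For each class, compute all pairwise distances between students
pairs_by_class <- lapply(split(df, df$class), function(d) {
  m <- as.matrix(dist(d[, c("q1", "q2", "q3", "q4")], method = "manhattan"))
  idx <- which(upper.tri(m), arr.ind = TRUE)  # each pair once
  data.frame(
    class = rep(d$class[1], nrow(idx)),
    id1 = d$id[idx[, "row"]],
    id2 = d$id[idx[, "col"]],
    difference = m[idx]
  )
})
pairs <- do.call(rbind, pairs_by_class)
rownames(pairs) <- NULL

# Pairs at or below the threshold are flagged for the teacher
suspects <- pairs[pairs$difference <= threshold, ]
print(pairs)
print(suspects)
```

The long table of pairs is easier to filter than a text column like "(12,24 = 2)", and it can be joined back to the original rows by id if a per-student flag is needed.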

making a presence/absence timeline in r for multiple y objects

This is my first time using SO and I am an R newbie; sorry if this is a little basic or unclear (or if the question has already been answered... I'm struggling with coding and need pretty specific answers to understand)
I would like to produce an image similar to this one:
Except I would like it to be oriented horizontally on a timeline, and with two vertical lines drawn from the x-axis.
I can set the data up simply, and there are only two variables - date and Tag.
Tag Date
1 1 1/1/2014
2 3 1/1/2014
3 1 1/3/2014
4 2 1/3/2014
5 3 1/3/2014
6 5 1/3/2014
7 2 1/4/2015
8 3 1/4/2015
9 4 1/4/2015
10 6 1/4/2015
11 1 1/5/2014
12 2 1/5/2014
13 4 1/5/2014
14 6 1/5/2014
15 1 1/6/2014
16 2 1/6/2014
17 3 1/6/2014
18 4 1/6/2014
19 6 1/6/2014
20 2 1/7/2014
21 4 1/7/2014
22 1 1/8/2014
23 2 1/8/2014
24 6 1/8/2014
Here is a drawn image of what I would like to accomplish:
To recap - I want to take this data, which shows the dates of detection of animals at a certain location, and plot it on a timeline with two vertical lines on two dates. If an animal (say, tag 2) was detected on consecutive days, I would like to connect those dates with a line, and if a detection was not followed by a detection on the next day, a simple dot will suffice. I imagine the y-axis is stacked with each individual Tag, and the x-axis is a date scale - for each date, if a tag ID was detected, then its corresponding x,y coordinate will be marked; if a tag was not detected on a certain date, the corresponding x,y coordinate will remain blank.
Here's a follow-up question:
I want to add a shaded background to some of the dates. I figured that I can do this using geom_rect, but I keep getting the following error:
Error: Invalid input: date_trans works with objects of class Date only
Using the code you wrote, this is what I added that produces the error:
geom_rect(aes(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf), fill="red", alpha=0.25)
This code will plot, but the rectangle is not transparent, and so it becomes fairly useless:
geom_rect(xmin=16075, xmax=16078, ymin=-Inf, ymax=Inf)

You first need to change your date format into Date. Then you need to figure out if dates are consecutive. And finally you need to plot them. Below is a possible solution using the packages dplyr and ggplot2.
# needed packages
require(ggplot2)
require(dplyr)
# input your data (changed to 2014 dates)
dat <- structure(list(Tag = c(1L, 3L, 1L, 2L, 3L, 5L, 2L, 3L, 4L, 6L, 1L, 2L, 4L, 6L, 1L, 2L, 3L, 4L, 6L, 2L, 4L, 1L, 2L, 6L), Date = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 7L, 7L, 7L), .Label = c("1/1/2014", "1/3/2014", "1/4/2014", "1/5/2014", "1/6/2014", "1/7/2014", "1/8/2014"), class = "factor")), .Names = c("Tag", "Date"), class = "data.frame", row.names = c(NA, -24L))
# change date to Date format
dat[, "Date"] <- as.Date(dat[, "Date"], format='%m/%d/%Y')
# flag each detection that is followed by a detection of the same Tag on the next day
dat <- dat %>% group_by(Tag) %>% mutate(consecutive=c(diff(Date), 2)==1)
# plotting command
ggplot(dat, aes(Date, Tag)) + geom_point() + theme_bw() +
geom_line(aes(alpha=consecutive, group=Tag)) +
scale_alpha_manual(values=c(0, 1), breaks=c(FALSE, TRUE), guide='none')
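For the follow-up about shading and the two vertical lines, one possible approach is sketched below. As far as I can tell, the error comes from passing plain numbers through aes() to a date scale, so the sketch gives the limits as Date objects instead (16075 and 16078 correspond to 2014-01-05 and 2014-01-08). The small data frame and the two vertical-line dates are made up for illustration.

```r
library(ggplot2)

# Small stand-in for the detection data (Date already of class Date)
dat <- data.frame(
  Tag  = c(1L, 3L, 1L, 2L, 2L, 3L, 1L, 2L),
  Date = as.Date(c("2014-01-01", "2014-01-01", "2014-01-03", "2014-01-03",
                   "2014-01-04", "2014-01-04", "2014-01-05", "2014-01-05"))
)

# Hypothetical dates for the two vertical lines
marks <- data.frame(Date = as.Date(c("2014-01-02", "2014-01-06")))

p <- ggplot(dat, aes(Date, Tag)) +
  # annotate() draws the rectangle once, so alpha stays visibly transparent;
  # a geom_rect() with aes() would be drawn once per row of the data
  annotate("rect",
           xmin = as.Date("2014-01-05"), xmax = as.Date("2014-01-08"),
           ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.25) +
  # vertical lines taken from their own small data frame of Dates
  geom_vline(data = marks, aes(xintercept = Date), linetype = "dashed") +
  geom_point() +
  theme_bw()
print(p)
```

The same annotate() and geom_vline() layers can be added to the plotting command above, since it also uses a Date column on the x-axis.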

R: iterate over columns and plot

My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., boxplot) that would highlight the difference between Groups.
# Get unique Feature and Group
Features<-unique(colnames(df[,-1]))
Group<-unique(df$Group)
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.

I would put my data in the long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id (the question's data frame is called df)
dat.m <- melt(df, id.vars='Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)

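This is easy to do. I do this all the time.
The code below will generate the charts using ggplot and save them as objects named ch_Feature_A, ch_Feature_B, and so on.
You can also wrap the plotting in pdf() and dev.off() calls to send the charts to a PDF file (inside a loop, each chart has to be print()ed explicitly).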
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
Or even simpler, if you do not want separate charts:
library(reshape2)
df1 <- melt(df)
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)
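Sending one boxplot per feature to a PDF could be sketched as follows. This is an illustration rather than the code from the answers above: it builds the plots in a helper function (make_boxplot is a made-up name) using the .data pronoun, and the output path in tempdir() is just a placeholder.

```r
library(ggplot2)

df <- data.frame(
  Group     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Feature_A = c(0L, 5L, 9L, 5L, 2L, 3L, 1L, 1L),
  Feature_B = c(3L, 2L, 8L, 7L, 6L, 8L, 5L, 4L),
  Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L),
  Feature_D = c(4L, 8L, 5L, 8L, 1L, 4L, 5L, 4L)
)
df$Group <- factor(df$Group)

# Every column except Group is treated as a feature
features <- setdiff(names(df), "Group")

# Hypothetical helper: one boxplot of a single feature, split by Group
make_boxplot <- function(feature) {
  ggplot(df, aes(x = Group, y = .data[[feature]], fill = Group)) +
    geom_boxplot() +
    labs(title = feature, y = feature)
}

plots <- lapply(features, make_boxplot)
names(plots) <- features

# Placeholder output path; each print() call adds one page to the PDF
out_file <- file.path(tempdir(), "feature_boxplots.pdf")
pdf(out_file)
for (p in plots) print(p)
dev.off()
```

Keeping the plots in a named list avoids creating many separate objects with assign(), and a single plot can still be retrieved with plots[["Feature_A"]].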

Checking row format of csv

I am trying to import some data (below) and check whether I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.

Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueName does not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"'.) You could edit the repexampleEdit line based on your needs.

This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8Tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.
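If the row order cannot be relied on, a base R variant that compares sets of values instead of a pasted string might look like the sketch below. The reduced data frame only keeps the two columns needed for the check, and the names required, has_all, complete and kept are illustrative.

```r
# Reduced version of the example data: only the columns needed for the check
repexample <- data.frame(
  QueueName = rep(c(" Overall", "usci_helpdesk", "CCM4.usci_retention_eng"),
                  times = c(9, 4, 9)),
  X8Tile = c(" Average", 1:8, " Average", 1:3, " Average", 1:8),
  stringsAsFactors = TRUE
)

# The nine values every QueueName is expected to have
required <- c("Average", as.character(1:8))

# TRUE for each QueueName that has exactly the required values, in any order
has_all <- tapply(
  trimws(as.character(repexample$X8Tile)),
  repexample$QueueName,
  function(x) length(x) == length(required) && setequal(x, required)
)

# Priority 1: which QueueName values are incomplete?
incomplete <- names(has_all)[!has_all]
print(incomplete)

# Priority 2: keep only the rows of complete QueueName values
complete <- names(has_all)[has_all]
kept <- repexample[repexample$QueueName %in% complete, ]
```

Checking both the length and the set guards against a QueueName that has nine rows but a duplicated or missing X8Tile value.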

R - Select rows for random sample of column values?

How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.

Is this what you want?
subset(weights_df, tag%in%sample(levels(tag),2))

If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility... you want 100 distinct tags at random, with every tag equally likely regardless of how often it occurs:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
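Putting the two steps together (draw two tags at random, then keep all of their rows) could be sketched as below. The data is retyped from the question's dput() output with a character tag column, and the names tag_sample and sampled_rows are illustrative.

```r
# Data from the question's dput(), retyped with tag as a character column
weights_df <- data.frame(
  tag = rep(c("R007", "J102", "M942", "M054", "V671", "Z990"),
            times = c(3, 4, 1, 2, 4, 2)),
  value = c(10L, 11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L,
            12L, 14L, 5L, 12L, 11L, 15L)
)

set.seed(5464)  # only to make the draw reproducible

# Draw two distinct tags, each equally likely regardless of its row count
tag_sample <- sample(unique(weights_df$tag), 2)

# Keep every row belonging to one of the sampled tags
sampled_rows <- weights_df[weights_df$tag %in% tag_sample, ]
print(sampled_rows)
```

The same %in% subset works with the tag_sample data frame built with ddply in the question, by using tag_sample$tag on the right-hand side.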