I am working with migration data and want to produce three summary tables from a very large dataset (>4 million rows). An example is detailed below:
migration <- structure(list(area.old = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("leeds",
"london", "plymouth"), class = "factor"), area.new = structure(c(7L,
13L, 3L, 2L, 4L, 7L, 6L, 7L, 6L, 13L, 5L, 8L, 7L, 11L, 12L, 9L,
1L, 10L, 11L), .Label = c("bath", "bristol", "cambridge", "glasgow",
"harrogate", "leeds", "london", "manchester", "newcastle", "oxford",
"plymouth", "poole", "york"), class = "factor"), persons = c(6L,
3L, 2L, 5L, 6L, 7L, 8L, 4L, 5L, 6L, 3L, 4L, 1L, 1L, 2L, 3L, 4L,
9L, 4L)), .Names = c("area.old", "area.new", "persons"), class = "data.frame", row.names = c(NA,
-19L))
Summary table 1: 'area.within'
The first table I wish to create is called 'area.within'. This will detail only areas where people have moved within the same area (i.e. it will count the total number of persons where 'london' is noted in both 'area.old' and 'area.new'). There will probably be multiple occurrences of this within the data table. It will then do this for all of the different areas, so the summary would be:
area.within persons
1 london 13
2 leeds 5
3 plymouth 5
Using the data.table package, I have got as far as:
setDT(migration)[as.character(area.old)==as.character(area.new)]
... but this doesn't get rid of duplicates...
Summary table 2: 'moved.from'
The second table will summarise areas which have experienced people moving out (i.e. those unique values in 'area.old'). It will identify areas for which columns 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved within areas - in summary table 1). The resulting table should be:
moved.from persons
1 london 24
2 leeds 17
3 plymouth 19
Summary table 3: 'moved.to'
The third table summarises which areas have experienced people moving in (i.e. those unique values in 'area.new'). It will identify all the unique areas for which columns 1 and 2 are different and add together all the people that are detailed (i.e. excluding those who have moved within areas - in summary table 1). The resulting table should be:
moved.to persons
1 london 5
2 york 9
3 cambridge 2
4 bristol 5
5 glasgow 6
6 leeds 8
7 harrogate 3
8 manchester 4
9 poole 2
10 newcastle 3
11 bath 4
12 oxford 9
Most importantly, the sum of all the persons in table 2 and the sum of all the persons in table 3 should be the same. And this value, combined with the persons total for table 1, should equal the sum of all the persons in the original table.
If anyone could help me sort out how to structure my code using the data table package to produce my tables, I should be most grateful.
Using data.table is a good choice, I think.
setDT(migration) #This has to be done only once
1.
To avoid duplicates, just sum them up by city as follows:
migration[as.character(area.old)==as.character(area.new),
.(persons = sum(persons)),
by=.(area.within = area.new)]
2.
This is very similar to 1. but uses != in the i argument:
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.from = area.old)]
3.
Same as 2.
migration[as.character(area.old)!=as.character(area.new),
.(persons = sum(persons)),
by=.(moved.to = area.new)]
Alternative
As 2. and 3. are very similar you can also do:
moved <- migration[as.character(area.old)!=as.character(area.new)]
#2
moved[,.(persons = sum(persons)), by=.(moved.from = area.old)]
#3
moved[,.(persons = sum(persons)), by=.(moved.to = area.new)]
Thus the selection of the right rows only has to be done once.
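As a sanity check on the totals mentioned in the question, here is a minimal sketch (rebuilding the example migration data compactly) verifying that tables 2 and 3 sum to the same value, and that together with table 1 they account for every person:

```r
library(data.table)

# the example data from the question, rebuilt compactly
migration <- data.table(
  area.old = rep(c("london", "leeds", "plymouth"), c(7, 5, 7)),
  area.new = c("london", "york", "cambridge", "bristol", "glasgow", "london", "leeds",
               "london", "leeds", "york", "harrogate", "manchester",
               "london", "plymouth", "poole", "newcastle", "bath", "oxford", "plymouth"),
  persons  = c(6, 3, 2, 5, 6, 7, 8, 4, 5, 6, 3, 4, 1, 1, 2, 3, 4, 9, 4)
)

within <- migration[area.old == area.new, sum(persons)]   # table 1 total: 23
moved  <- migration[area.old != area.new]                 # rows for tables 2 and 3

# tables 2 and 3 group the same subset of rows, so their totals must agree
from_total <- moved[, sum(persons), by = .(moved.from = area.old)][, sum(V1)]
to_total   <- moved[, sum(persons), by = .(moved.to   = area.new)][, sum(V1)]
stopifnot(from_total == to_total)                             # both 60
stopifnot(within + from_total == migration[, sum(persons)])   # 23 + 60 == 83
```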
My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., a boxplot) that would highlight differences between Groups.
# Get unique Feature and Group
Features<-unique(colnames(df[,-1]))
Group<-unique(colnames(df$Group))
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.
I would put my data in the long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id
dat.m <- melt(df, id = 'Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
This is easy to do; I do this all the time.
The code below will generate the charts using ggplot and save them as ch_Feature_A, and so on.
You can wrap the code in a pdf() ... dev.off() pair to send the charts to a PDF as well.
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
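To make the pdf remark above concrete, here is a minimal sketch (the file name feature_boxplots.pdf is just an illustration) that writes one boxplot per feature to a single PDF, one page each:

```r
library(ggplot2)

# the example data from the question
df <- data.frame(
  Group     = factor(rep(1:3, c(3, 3, 2))),
  Feature_A = c(0, 5, 9, 5, 2, 3, 1, 1),
  Feature_B = c(3, 2, 8, 7, 6, 8, 5, 4),
  Feature_C = c(2, 2, 6, 8, 8, 6, 3, 3),
  Feature_D = c(4, 8, 5, 8, 1, 4, 5, 4)
)

pdf("feature_boxplots.pdf")  # hypothetical file name, one page per feature
for (i in 2:ncol(df)) {
  ch <- ggplot(df, aes_string(x = "Group", y = names(df)[i], fill = "Group")) +
    geom_boxplot() +
    ggtitle(names(df)[i])
  print(ch)  # inside a loop, ggplot objects must be printed explicitly
}
dev.off()
```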
or even simpler, if you do not want separate charts
library(reshape2)
df1 <- melt(df, id.vars = "Group")
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)
I am working with a matrix set_onco of 206 rows x 196 cols and I have a vector, genes_100 (it's a matrix but I take only the first col), with 101 names.
here's a snippet of how they look
> set_onco[1:10,1:10]
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
GLI1_UP.V1_DN COPZ1 C10orf46 C20orf118 TMEM181 CCNL2 YIPF1 GTDC1 OPN3 RSAD2 SLC22A1
GLI1_UP.V1_UP IGFBP6 HLA-DQB1 CCND2 PTH1R TXNDC12 M6PR PPT2 STAU1 IGJ TMOD3
E2F1_UP.V1_DN TGFB1I1 CXCL5 POU5F1 SAMD10 KLF2 STAT6 ENTPD6 VCAN HMGCS1 ANXA8
E2F1_UP.V1_UP RRP1B HES1 ADCY6 CHAF1B VPS37B GRSF1 TLX2 SSX2IP DNA2 CMA1
EGFR_UP.V1_DN NPY1R PDZK1 GFRA1 GREB1 MSMB DLC1 MYB SLC6A14 IFI44 IFI44L
EGFR_UP.V1_UP FGG GBP1 TNFRSF11B FGB GJA1 DUSP6 S100A9 ADM ITGB6 DUSP4
ERB2_UP.V1_DN NPY1R PDZK1 ANXA3 GREB1 HSPB8 DLC1 NRIP1 FHL2 EGR3 IFI44 FAM18B1
ERB2_UP.V1_UP CYP1A1 CEACAM5 FAM129A TNFRSF11B DUSP4 CYP1B1 UPK2 DAB2 CEACAM6 KIAA1199
GCNP_SHH_UP_EARLY.V1_DN SRRM2 KIAA1217 DEFA1 DLK1 PITX2 CCL2 UPK3B SEZ6 TAF15 EMP1
> genes_100[1:10,1]
[1] AL591845.1 B3GALT6 RAP1GAP HSPG2 BX293535.1 RP1-159A19.1 IFI6 FAM76A FAM176B CSF3R
101 Levels: 5_8S_rRNA AC018470.1 AC091179.2 AC103702.3 AC138972.1 ACVR1B AL049829.5 AL137797.2 AL139260.2 AL450326.2 AL591845.1 AL607122.2 B3GALT6 BX293535.1 ... ZNF678
What I want to do is parse through the matrix and count the frequency at which each row contains the names in genes_100.
To do that I created 3 for loops: the first one moves down one row at a time, the second one moves along the row, and the third one loops over the list genes_100 checking for matches.
At the end I save in a matrix how many times genes_100 matched the terms in each row, saving also the row names from the matrix (so that I know which one is which).
The code works and gives me the correct output... but it's just really slow!!
A snippet of the output is:
head(result_matrix_100)
freq_100
[1,] "GLI1_UP.V1_DN" "0"
[2,] "GLI1_UP.V1_UP" "0"
[3,] "E2F1_UP.V1_DN" "0"
[4,] "E2F1_UP.V1_UP" "0"
[5,] "EGFR_UP.V1_DN" "0"
[6,] "EGFR_UP.V1_UP" "0"
I used system.time() and I get:
user system elapsed
525.38 0.06 530.34
which is way too slow, since I have even bigger matrices to parse, and in some cases I have to repeat this 10k times!!!
The code is:
result_matrix_100 <- matrix(nrow=0, ncol=2)
for (q in seq(1, nrow(set_onco), 1)) {
  freq_100 <- 0  # reset the counter for each row
  for (j in seq(1, length(set_onco[q,]), 1)) {
    for (x in seq(1, 101, 1)) {
      if (as.character(genes_100[x,1]) == as.character(set_onco[q,j])) {
        freq_100 <- freq_100 + 1
      }
    }
  }
  result_matrix_100 <- rbind(result_matrix_100, cbind(row.names(set_onco)[q], freq_100))
}
What would you suggest?
Thanks in advance :)
@joran's answer will possibly be faster, although it may not be "factor-safe". Your set_onco values are probably encoded as factor variables (because your genes_100 object clearly is). This will be safer:
set_onco[] <- lapply(set_onco, as.character)
# that converts a data.frame with factor columns to character valued
# at that point #joran's solution could be used safely
freq100 <- apply(set_onco, 1, function(x) sum(x %in% genes_100) )
# that does a row-by-row count of the number of matches to genes_100
freq100
GLI1_UP.V1_DN GLI1_UP.V1_UP E2F1_UP.V1_DN
0 0 0
E2F1_UP.V1_UP EGFR_UP.V1_DN EGFR_UP.V1_UP
0 0 0
ERB2_UP.V1_DN ERB2_UP.V1_UP GCNP_SHH_UP_EARLY.V1_DN
0 0 0
The size of your dataset (206 rows x 196 cols) is quite small, so this will be virtually immediate. These dput statements and their output can be used to reconstruct what I think your objects look like internally:
dput(set_onco)
structure(list(V2 = structure(c(1L, 4L, 8L, 6L, 5L, 3L, 5L, 2L,
7L), .Label = c("COPZ1", "CYP1A1", "FGG", "IGFBP6", "NPY1R",
"RRP1B", "SRRM2", "TGFB1I1"), class = "factor"), V3 = structure(c(1L,
6L, 3L, 5L, 8L, 4L, 8L, 2L, 7L), .Label = c("C10orf46", "CEACAM5",
"CXCL5", "GBP1", "HES1", "HLA-DQB1", "KIAA1217", "PDZK1"), class = "factor"),
V4 = structure(c(3L, 4L, 8L, 1L, 7L, 9L, 2L, 6L, 5L), .Label = c("ADCY6",
"ANXA3", "C20orf118", "CCND2", "DEFA1", "FAM129A", "GFRA1",
"POU5F1", "TNFRSF11B"), class = "factor"), V5 = structure(c(7L,
5L, 6L, 1L, 4L, 3L, 4L, 8L, 2L), .Label = c("CHAF1B", "DLK1",
"FGB", "GREB1", "PTH1R", "SAMD10", "TMEM181", "TNFRSF11B"
), class = "factor"), V6 = structure(c(1L, 8L, 5L, 9L, 6L,
3L, 4L, 2L, 7L), .Label = c("CCNL2", "DUSP4", "GJA1", "HSPB8",
"KLF2", "MSMB", "PITX2", "TXNDC12", "VPS37B"), class = "factor"),
V7 = structure(c(8L, 6L, 7L, 5L, 3L, 4L, 3L, 2L, 1L), .Label = c("CCL2",
"CYP1B1", "DLC1", "DUSP6", "GRSF1", "M6PR", "STAT6", "YIPF1"
), class = "factor"), V8 = structure(c(2L, 5L, 1L, 7L, 3L,
6L, 4L, 8L, 9L), .Label = c("ENTPD6", "GTDC1", "MYB", "NRIP1",
"PPT2", "S100A9", "TLX2", "UPK2", "UPK3B"), class = "factor"),
V9 = structure(c(4L, 8L, 9L, 7L, 6L, 1L, 3L, 2L, 5L), .Label = c("ADM",
"DAB2", "FHL2", "OPN3", "SEZ6", "SLC6A14", "SSX2IP", "STAU1",
"VCAN"), class = "factor"), V10 = structure(c(8L, 6L, 4L,
2L, 5L, 7L, 3L, 1L, 9L), .Label = c("CEACAM6", "DNA2", "EGR3",
"HMGCS1", "IFI44", "IGJ", "ITGB6", "RSAD2", "TAF15"), class = "factor"),
V11 = structure(c(8L, 9L, 1L, 2L, 6L, 3L, 5L, 7L, 4L), .Label = c("ANXA8",
"CMA1", "DUSP4", "EMP1", "IFI44", "IFI44L", "KIAA1199", "SLC22A1",
"TMOD3"), class = "factor")), .Names = c("V2", "V3", "V4",
"V5", "V6", "V7", "V8", "V9", "V10", "V11"), class = "data.frame", row.names = c("GLI1_UP.V1_DN",
"GLI1_UP.V1_UP", "E2F1_UP.V1_DN", "E2F1_UP.V1_UP", "EGFR_UP.V1_DN",
"EGFR_UP.V1_UP", "ERB2_UP.V1_DN", "ERB2_UP.V1_UP", "GCNP_SHH_UP_EARLY.V1_DN"
))
dput(factor(genes_100) )
structure(c(1L, 2L, 9L, 7L, 3L, 10L, 8L, 6L, 5L, 4L), .Label = c("AL591845.1",
"B3GALT6", "BX293535.1", "CSF3R", "FAM176B", "FAM76A", "HSPG2",
"IFI6", "RAP1GAP", "RP1-159A19.1"), class = "factor")
Something like this will probably be quite fast:
#Sample data
m <- matrix(sample(letters,206*196,replace = TRUE),206,196)
genes_100 <- letters[1:5]
m1 <- matrix(m %in% genes_100,206,196)
rowSums(m1)
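To get the counts back into the same two-column shape as the questioner's result_matrix_100 (assuming the matrix carries row names, as set_onco does), the result of rowSums can simply be bound to the row names, with no loops required:

```r
# sample data standing in for set_onco; row names invented for illustration
set.seed(42)
m <- matrix(sample(letters, 206 * 196, replace = TRUE), 206, 196,
            dimnames = list(paste0("geneset_", 1:206), NULL))
genes_100 <- letters[1:5]

# count matches per row in one vectorised step
freq <- rowSums(matrix(m %in% genes_100, 206, 196))

# same two-column layout as the loop-built result_matrix_100
result_matrix_100 <- cbind(rownames(m), freq)
```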
I am trying to import some data (below) and check that I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see, there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, i.e. one row for each of the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example, the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueName does not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"'.) You could edit the repexampleEdit line based on your needs.
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.
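For completeness, here is a base-R sketch of the aggregate + merge route mentioned above, shown on a cut-down stand-in for repexample (so the data here is illustrative, not the full example):

```r
# cut-down stand-in for the repexample data: one complete queue, one short one
repexample <- data.frame(
  QueueName = rep(c(" Overall", "usci_helpdesk"), c(9, 4)),
  X8Tile    = c(" Average", 1:8, " Average", 1:3),
  stringsAsFactors = TRUE
)

# count rows per QueueName
counts <- aggregate(X8Tile ~ QueueName, data = repexample, FUN = length)
names(counts)[2] <- "numRows"

# merge the counts back on and keep only queues with all 9 rows
merged   <- merge(repexample, counts, by = "QueueName")
complete <- merged[merged$numRows == 9, names(repexample)]
```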
How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag%in%sample(levels(tag),2))
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility... you want 100 distinct tags at random, but not with the probability at all weighted to the frequency:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
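Putting the two pieces together, the random sampling and the subsetting, here is a sketch using the weights_df data from the question (rebuilt compactly):

```r
# the example data from the question
weights_df <- data.frame(
  tag = factor(c("R007", "R007", "R007", "J102", "J102", "J102", "J102",
                 "M942", "M054", "M054", "V671", "V671", "V671", "V671",
                 "Z990", "Z990")),
  value = c(10, 11, 9, 11, 9, 13, 10, 3, 9, 12, 12, 14, 5, 12, 11, 15)
)

set.seed(5464)                                   # reproducible draw
chosen <- sample(levels(weights_df$tag), 2)      # two tags at random
sampled_rows <- weights_df[weights_df$tag %in% chosen, ]
```

This keeps every row belonging to the two chosen tags, matching the desired output in the question.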