Can dplyr perform chained summarise operations on a data.frame?
My data.frame has the structure:
data_df = tbl_df(data)
data_df %.%
group_by(col_1) %.%
summarise(number_of= length(col_2)) %.%
summarise(sum_of = sum(col_3))
This causes RStudio to crash with a fatal error ("R Session Aborted").
With plyr I would normally chain summarise calls like this without problems.
UPDATE
Data are here: https://raw.github.com/johnmarquess/some.data/master/orth0106.csv
Code is:
library(dplyr)
orth <- read.csv('orth0106.csv')
orth_df = tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure)) %.%
summarise(SSIs = sum(SSI))
I can reproduce the error on a Windows 7 machine running RStudio 0.97.551.
It may be because your second summarise is chaining onto a result that no longer contains the column it needs. You can summarise two different columns in a single call, as I've done here:
url <- "https://raw.github.com/johnmarquess/some.data/master/orth0106.csv"
library(dplyr)
orth <- read.csv(url)
orth_df <- tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure), SSIs = sum(SSI))
## Source: local data frame [18 x 3]
##
## Hospital Procs SSIs
## 1 A 865 80
## 2 B 1069 38
## 3 C 796 24
## 4 D 891 35
## 5 E 997 39
## 6 F 550 30
## 7 G 2598 128
## 8 H 373 27
## 9 I 1079 70
## 10 J 714 30
## 11 K 477 30
## 12 L 227 2
## 13 M 125 6
## 14 N 589 38
## 15 O 292 3
## 16 P 149 9
## 17 Q 1984 52
## 18 R 351 13
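As an aside, dplyr also provides n() for counting rows inside summarise(), so the same result should come from this small variation (not in the original answer):

orth_df %.%
group_by(Hospital) %.%
summarise(Procs = n(), SSIs = sum(SSI))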
In any event, this seems like either an RStudio or a dplyr bug. I'd open an issue with Hadley, as he probably cares either way: https://github.com/hadley/dplyr/issues
EDIT: This (your first call) also causes Rgui (Windows) and the terminal to crash on:
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
This indicates a dplyr problem that Hadley and Romain will want to know about.
To see my first point, we run:
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure))
Source: local data frame [18 x 2]
Hospital Procs
1 A 865
2 B 1069
3 C 796
4 D 891
5 E 997
6 F 550
7 G 2598
8 H 373
9 I 1079
10 J 714
11 K 477
12 L 227
13 M 125
14 N 589
15 O 292
16 P 149
17 Q 1984
18 R 351
Where is %.% summarise(SSIs = sum(SSI)) supposed to find SSI?
So the chaining you think is happening fails. To my understanding, %.% isn't exactly like ggplot2, though it's similar. In ggplot2, once you pass the data in the initial mapping, you can access its columns later on. Here, %.% takes the result of the left-hand chunk and operates on that. So you're grabbing:
Hospital Procs
1 A 865
2 B 1069
3 C 796
.
.
.
17 Q 1984
18 R 351
when you use %.% summarise(SSIs = sum(SSI)), and there is no SSI column left to be found. The analogy that comes to mind is wiring Christmas lights: %.% is serial, while ggplot2's + is parallel. This is a non-programmer's understanding of things, and the R gurus may correct me, but for now it's the best theory you've got.
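A quick way to convince yourself of this (a small check, using the same orth_df as above) is to store the intermediate result and inspect its columns:

step1 <- orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure))
names(step1)
## [1] "Hospital" "Procs"

SSI is gone, so the second summarise() has nothing to sum.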
What is the most efficient way to determine the maximum positive difference between the value (X) for each row and the subsequent values of the same variable (X) within group (Y) in data.table in R?
Example:
library(data.table)
set.seed(1)
dt <- data.table(X = sample(100:200, 500455, replace = TRUE),
Y = unlist(sapply(10:1000, function(x) rep(x, x))))
Here's my solution, which I consider inefficient and slow:
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
head(dt, 21)
X Y max_diff
1: 126 10 69
2: 137 10 58
3: 157 10 38
4: 191 10 4
5: 120 10 75
6: 190 10 5
7: 195 10 0
8: 166 10 0
9: 163 10 0
10: 106 10 0
11: 120 11 80
12: 117 11 83
13: 169 11 31
14: 138 11 62
15: 177 11 23
16: 150 11 50
17: 172 11 28
18: 200 11 0
19: 138 11 56
20: 178 11 16
21: 194 11 0
Can you advise a more efficient (faster) solution?
Here's a dplyr solution that is about 20x faster and gets the same results. I presume the data.table equivalent would be yet faster. (EDIT: see bottom - it is!)
The speedup comes from reducing how many comparisons need to be performed. The largest difference will always be found against the largest remaining number in the group, so it's faster to identify that number first and do only the one subtraction per row.
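To see the idea on a small example (the first five values of group Y = 10 from the data above): the running maximum taken from the right gives the largest remaining value at each position, and one subtraction per row does the rest.

x <- c(126, 137, 157, 191, 120)
rev(cummax(rev(x))) - x
## [1] 65 54 34  0  0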
First, the original solution takes about 4 sec on my machine:
tictoc::tic("OP data.table")
dt[, max_diff := vapply(1:.N, function(x) max(X[x:.N] - X[x]), numeric(1)), by = Y]
tictoc::toc()
# OP data.table: 4.594 sec elapsed
But in only 0.2 sec we can take that data.table, convert it to a data frame, add the orig_row row number, group by Y, reverse sort by orig_row, take the difference between X and the cumulative max of X, ungroup, and rearrange back into the original order:
library(dplyr)
tictoc::tic("dplyr")
dt2 <- dt %>%
as_data_frame() %>%
mutate(orig_row = row_number()) %>%
group_by(Y) %>%
arrange(-orig_row) %>%
mutate(max_diff2 = cummax(X) - X) %>%
ungroup() %>%
arrange(orig_row)
tictoc::toc()
# dplyr: 0.166 sec elapsed
all.equal(dt2$max_diff, dt2$max_diff2)
#[1] TRUE
EDIT: as @david-arenburg suggests in the comments, this can be done lightning-fast in data.table with one elegant line:
dt[.N:1, max_diff2 := cummax(X) - X, by = Y]
On my computer, that's about 2-4x faster than the dplyr solution above.
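You can sanity-check it against the original column the same way as before; this should print TRUE:

all.equal(dt$max_diff, dt$max_diff2)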
I want to divide a df into x roughly equal groups, sequentially.
I was basically doing it like this:
df_1 <- df[1:10,]
df_2 <- df[11:21,]
df_3..
Is there a simpler way to do this, using split or slice? The important thing is, I want to maintain the order of the df, not sample from it.
Imagine I had 7000 observations, and I wanted 19 roughly equal groups.
Best!
I don't know if this counts as roughly equal, but you can do it like this:
nobs <- 7000
ngroups <- 17
df <- data.frame(x = sample(nobs))
set.seed(1)
df$grp <- sort(sample(1:ngroups,nobs,T)) # added the sort so the order of your df is maintained
table(df$grp)
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# 436 407 410 369 417 411 440 401 431 411 356 398 390 414 443 418 448
Then split(df, df$grp) gives you the list of groups.
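If the groups must be strictly sequential (contiguous blocks of rows, order preserved), a base-R sketch using cut() on the row index would look like this; the group sizes come out roughly equal:

ngroups <- 19
df$grp <- cut(seq_len(nrow(df)), breaks = ngroups, labels = FALSE)
groups <- split(df, df$grp)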
I'm currently performing a multiple sequence alignment using the 'msa' package from Bioconductor. I'm using this to calculate the consensus sequence (msaConsensusSequence) and conservation score (msaConservationScore). This gives me outputs that are values ...
e.g.
ConsensusSequence:
i.llE etc (str = chr)
(lower case = 20%+ conservation, uppercase = 80%+ conservation, . = <20% conservation)
ConservationScore:
221 -296 579 71 423 etc (str = named num)
I would like to convert these into a table where the first row holds the letters of the consensus sequence, one per column, and the second row holds the corresponding conservation scores.
e.g.
i . l l E
221 -296 579 71 423
Could people please advise on the best way to go about this?
Thanks
Natalie
From what you have said in the comments, you can get a data frame like this:
library(msa)
data(BLOSUM62)
alignment <- msa(mySequences)  # mySequences as in your existing code
conservation <- msaConservationScore(alignment, BLOSUM62)
# Now create the data frame
df <- data.frame(consensus = names(conservation), conservation = conservation)
head(df)
consensus conservation
1 T 141
2 E 160
3 E 165
4 E 325
5 ? 179
6 ? 71
7 T 216
8 W 891
9 ? 38
10 T 405
11 L 204
If you prefer to transpose it you can:
df <- t(df)
colnames(df) <- 1:ncol(df)
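If you'd rather pair the letters of msaConsensusSequence() itself with the scores, here is a sketch (assuming the consensus string has one character per alignment column, i.e. the same length as the score vector):

cons <- msaConsensusSequence(alignment)
letters_vec <- strsplit(cons, "")[[1]]  # one letter per alignment column
# rbind() coerces the scores to character, which is fine for display
tab <- rbind(consensus = letters_vec, conservation = unname(conservation))
tab[, 1:5]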
I have such a difficult question (at least to me) that I spent 2 hours just writing it. It is completely impossible for me to program it by myself. I'll try to be very clear, and I'm sorry if I'm not. I'm doing this in a very crude way in Excel, but I really need to program it.
I have a data.frame like this:
id_pix id_lote clase f1 f2
45 4 Sg 2460 2401
46 4 Sg 2620 2422
47 4 Sg 2904 2627
48 5 M 2134 2044
49 5 M 2180 2104
50 5 M 2127 2069
83 11 S 2124 2062
84 11 S 2189 2336
85 11 S 2235 2162
86 11 S 2162 2153
87 11 S 2108 2124
with 17451 "id_pixel" (rows), 2080 "id_lote" and 9 "clase".
This is the "id_lote" count per "clase" (v1 is the id_lote count):
clase v1
1: S 1099
2: P 213
3: Sg 114
4: M 302
5: Alg 27
6: Az 77
7: Po 228
8: Cit 13
9: Ma 7
I need to split the "id_lote" randomly within each "clase". For example, I have 1099 "id_lote" for the "S" "clase", covering 9339 "id_pixel" (rows), and I want to randomly select 50% of those "id_lote" together with all of their "id_pixel" (rows). I want to do this for every "clase", keeping in mind that the size (number of "id_lote") of each "clase" is different. I would also like to be able to change the size of the selection (50%, 30%, etc.), and I want to keep the non-selected set of "id_lote" as well. I hope someone can help me with this!
Here is the reproducible example.
This is the data, with 2 clase (S and Az), 6 id_lote and 13 id_pixel:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
30 2 S 3021 2985
31 2 S 3020 2596
71 9 S 4725 4404
72 9 S 4759 4943
75 11 S 2728 2225
218 21 Az 4830 3007
219 21 Az 4574 2761
220 21 Az 5441 3092
1155 126 Az 7209 2449
1156 126 Az 7035 2932
One possible result could be:
id_pix id_lote clase f1 f2
1 1 S 2909 2381
2 1 S 2515 2663
3 1 S 2628 3249
75 11 S 2728 2225
1155 126 Az 7209 2449
1156 126 Az 7035 2932
where 50% of the id_lote were randomly selected within clase "S" (2 of 4 id_lote), but all the id_pixel in the selected id_lote were kept. The same goes for clase "Az": one id_lote was randomly selected (1 of 2 in this case) and all the id_pixel in the selected id_lote were kept.
What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but if I do
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
I get 30% of the data of each clase, but not grouped by id_lote like I need! I mean, 30% of the rows (id_pixel) were selected instead of 30% of the id_lote.
I hope this example helps to explain what I want to do and makes the question useful for everybody. I'm sorry if I wasn't clear enough the first time.
Thanks a lot!
At first glance I'd say the dplyr package is your friend here.
df %>%
group_by(clase, id_lote) %>%
sample_frac(.3, replace = FALSE)
So you first use group_by() to specify the grouping levels you want to sample from, then you use sample_frac() to sample the fraction of rows you want from each group.
As near as I can tell this is what you are asking for. If not, please consider re-stating your question to include either a reproducible example or clarify. Cheers.
to "keep" the not-selected members, I would add a column of unique ids, and use an anti-join anti_join()(also from the dplyr package) to find the id's that are not in common between the two data.frames (the results of the sampling and the original).
## Update ##
I'm understanding better now, I believe. Think about this as a two-step process...
1) You want to select x% (50 in the example) of the id_lote from each clase and return those id_lote numbers (I'm assuming that a given id_lote does not exist for multiple clase?).
2) You want to see all of the id_pixels that correspond to each selected id_lote, all in one data.frame.
I've broken this down into multiple steps for illustration, not because it is the fastest / prettiest.
Raw data (I couldn't read your data into R, so I simulated some):
df<-data.frame(id_pix = c(1:200),
id_lote = sample(1:20,200, replace = TRUE),
clase = sample(letters[seq_along(1:10)], 200, replace = TRUE),
f1 = sample(1000:2000,200, replace = TRUE),
f2 = sample(2000:3000,200, replace = TRUE))
1) Figure out which id_lote corresponds to which clase. For this we use the dplyr summarise() function with no arguments, which collapses the data to the unique group combinations, and store the result in a variable:
summary<-df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise()
returns:
Source: local data frame [125 x 2]
Groups: clase
clase id_lote
1 a 1
2 a 2
3 a 4
4 a 5
5 a 6
6 a 7
7 a 8
8 a 9
9 a 11
10 a 12
.. ... ...
Then we sample to get 30% of the id_lote for each clase:
sampled_summary <- summary %>%
group_by(clase) %>%
sample_frac(.3,replace = FALSE)
The result of this is a table with two columns (clase and id_lote), with 30% of the id_lote shown for each clase.
2) OK, so now we have the id_lote randomly selected from each clase, but not the id_pix that are associated with them. To get those we do a join that pulls in the corresponding full rows, including the id_pix, etc.:
result <- sampled_summary %>%
left_join(df)
The above copies the data set around a fair amount, so if you have a substantial data set you could just do it all in one go:
result <- df %>%
ungroup() %>%
group_by(clase, id_lote) %>%
summarise() %>%
group_by(clase) %>%
sample_frac(.5,replace = FALSE) %>%
left_join(df)
If this doesn't get you what you want, let me know and we'll take another crack at it.
I have written a function that compares the similarity of consecutive IP addresses, and lets the user select the level of detail by octet. For example, for the addresses 255.255.255.0 and 255.255.255.1, a user could specify that they only want to compare the first octet; the first and second; the first, second and third; etc.
The function is below:
did.change.ip = function(vec, detail){
  counter = 2
  result.vec = FALSE
  r.list = strsplit(vec, '.', fixed = TRUE)
  for(i in vec){
    if(counter > length(vec)){
      break
    }
    first = as.numeric(r.list[[counter - 1]][1:detail])
    second = as.numeric(r.list[[counter]][1:detail])
    if(sum(first == second) == detail){
      result.vec = append(result.vec, FALSE)
    } else {
      result.vec = append(result.vec, TRUE)
    }
    counter = counter + 1
  }
  return(result.vec)
}
It's really slow once the data starts getting larger. For a dataset of 500,000 rows, the system.time() results are:
user system elapsed
208.36 0.59 209.84
Are there any R power users who have insight into how to write this more efficiently? I know lapply() is the preferred method for looping over vectors/data frames, but I'm stumped as to how to access the previous element of a vector for this purpose. I've tried to sketch something out quickly, but it returns a syntax error:
test=function(vec, detail){
rlist=strsplit(vec, '.', fixed=TRUE)
r.value=vapply(rlist, function(x,detail) ifelse(x[1:detail]==x[1:detail] TRUE, FALSE))
}
I've created some sample data for testing purposes below:
stack.data=structure(list(V1 = c("247.116.209.66", "195.121.47.105", "182.136.49.12",
"237.123.100.50", "120.30.174.18", "29.85.72.70", "18.186.76.177",
"33.248.142.26", "109.97.92.50", "217.138.155.145", "20.203.156.2",
"71.1.51.190", "31.225.208.60", "55.25.129.73", "211.204.249.244",
"198.137.15.53", "234.106.102.196", "244.3.87.9", "205.242.10.22",
"243.61.212.19", "32.165.79.86", "190.207.159.147", "157.153.136.100",
"36.151.152.15", "2.254.210.246", "3.42.1.208", "30.11.229.18",
"72.187.36.103", "98.114.189.34", "67.93.180.224")), .Names = "V1", class = "data.frame", row.names = c(NA,
-30L))
Here's another solution just using base R.
did.change.ip <- function(vec, detail=4){
ipv <- scan(text=paste(vec, collapse="\n"),
what=c(replicate(detail, integer()), replicate(4-detail,NULL)),
sep=".", quiet=TRUE)
c(FALSE, rowSums(vapply(ipv[!sapply(ipv, is.null)],
diff, integer(length(vec)-1))!=0)>0)
}
Here we use scan() to break up the IP addresses into numbers. Then we look down each octet for differences using diff(). This seems to be faster than the original proposal, but slightly slower than @josilber's stringr solution (using microbenchmark with 3,000 IP addresses):
Unit: milliseconds
expr min lq median uq max neval
orig 35.251886 35.716921 36.019354 36.700550 90.159992 100
scan 2.062189 2.116391 2.170110 2.236658 3.563771 100
strngr 2.027232 2.075018 2.136114 2.200096 3.535227 100
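For reference, the timings above could be reproduced along these lines (a sketch; I've renamed the question's function did.change.ip.orig, since it shares its name with the scan() version):

library(microbenchmark)
set.seed(144)
ips <- sample(stack.data$V1, 3000, replace = TRUE)
microbenchmark(
orig = did.change.ip.orig(ips, 3),
scan = did.change.ip(ips, 3),
strngr = did.change.josilber(ips, 3),
times = 100
)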
The simplest way I can think of to do this is to build a transformed vector that only includes the parts of the IP you want. Then it's a one-liner to check if each element is equal to the one before it:
library(stringr)
did.change.josilber <- function(vec, detail) {
s <- str_extract(vec, paste0("^(\\d+\\.){", detail, "}"))
return(s != c(s[1], s[1:(length(s)-1)]))
}
This seems reasonably efficient for 500,000 rows:
set.seed(144)
big.vec <- sample(stack.data[,1], 500000, replace=T)
system.time(did.change.josilber(big.vec, 3))
# user system elapsed
# 0.527 0.030 0.554
The biggest issue with your code is that you call append() on each iteration, which requires reallocating your vector 500,000 times. You can read more about this in the second circle of The R Inferno. A sketch of a pre-allocated version of your loop is below.
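This keeps your loop as-is but allocates the result vector once up front (a minimal sketch; the function name is mine):

did.change.prealloc <- function(vec, detail){
  r.list <- strsplit(vec, '.', fixed = TRUE)
  result.vec <- logical(length(vec))  # allocated once; element 1 stays FALSE
  for(counter in seq_len(length(vec))[-1]){
    first <- as.numeric(r.list[[counter - 1]][1:detail])
    second <- as.numeric(r.list[[counter]][1:detail])
    result.vec[counter] <- !all(first == second)  # TRUE if any compared octet changed
  }
  result.vec
}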
I'm not sure if counts are all you want, but this is potentially a solution:
library(dplyr)
library(tidyr)
# split ip addresses into "octets"
octets <- stack.data %>%
separate(V1,c("first","second","third","fourth"))
# how many shared both their first and second octets?
octets %>%
group_by(first,second) %>%
summarize(n = n())
first second n
1 109 97 1
2 120 30 1
3 157 153 1
4 18 186 1
5 182 136 1
6 190 207 1
7 195 121 1
8 198 137 1
9 2 254 1
10 20 203 1
11 205 242 1
12 211 204 1
13 217 138 1
14 234 106 1
15 237 123 1
16 243 61 1
17 244 3 1
18 247 116 1
19 29 85 1
20 3 42 1
21 30 11 1
22 31 225 1
23 32 165 1
24 33 248 1
25 36 151 1
26 55 25 1
27 67 93 1
28 71 1 1
29 72 187 1
30 98 114 1