iteration, ddply, and counts

I've generated the following data frame:
random <- data.frame(replicate(10, sample(1:12, 564, replace=TRUE)))
It contains 10 columns and 564 rows.
Each number in the column pertains to a day (1-12).
To this data frame I appended a new column containing the words "Green", "Pink" and "Red" in no particular order and filling all 564 rows using existing data I have:
random$color <- fruit$color
Here's what I'd like to do:
For each column 1-10, create the following table of counts:
Day Green Pink Red
1 # # #
2 # # #
3 # # #
4 # # #
... # # #
12 # # #
So, I should be able to know from this table the number of Day 1 Greens from column 1, for example. It is important that color counts from different columns are distinguishable from one another.
And there's a fun twist!
Counts for Day 9 and 10 need to be added for each Color, so each table should look like:
Day Green Pink Red
1 # # #
2 # # #
3 # # #
4 # # #
5 # # #
6 # # #
7 # # #
8 # # #
9 - 10 # # #
11 # # #
12 # # #
So far, I've tried to work with ddply and cast to do this and loop over each column, but I'm not familiar with loops. Here's the bologna I've got so far:
for(i in names(random)) {
random_counts <- ddply(random, c('color', i), function(x) c(count=nrow(x)))
random_counts <- cast(random_counts, i ~ color, mean, value='count')
random_counts
}
Help with this would be much much appreciated!
Thanks

Here is a base alternative:
# slightly smaller toy data
random <- data.frame(replicate(2, sample(1:5, 20, replace = TRUE)))
color <- sample(c("Green", "Pink", "Red"), nrow(random), replace = TRUE)
# use cut to put e.g. 3 and 4 in the same interval
random[] <- lapply(random, function(x) cut(x, breaks = c(0, 1, 2, 4, 5)))
# count
lapply(random, function(x) table(x, color))
# $X1
# color
# x Green Pink Red
# (0,1] 2 0 1
# (1,2] 3 0 2
# (2,4] 3 4 2
# (4,5] 1 1 1
#
# $X2
# color
# x Green Pink Red
# (0,1] 3 0 1
# (1,2] 4 0 1
# (2,4] 1 3 1
# (4,5] 1 2 3
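The same idea can be carried over to the full 12-day layout from the question. The following is only a sketch under assumptions: the color vector is simulated here (the question takes it from fruit$color), and the names breaks, day_labels and counts are illustrative. Putting a break at 8 and the next one at 10 makes the interval (8,10] hold both day 9 and day 10, and labels gives that bin a readable name.

```r
# Simulated stand-ins for the data described in the question
set.seed(1)
random <- data.frame(replicate(10, sample(1:12, 564, replace = TRUE)))
color  <- sample(c("Green", "Pink", "Red"), nrow(random), replace = TRUE)

# Breaks at 0..8, then 10, 11, 12: the interval (8,10] merges days 9 and 10
breaks     <- c(0:8, 10, 11, 12)
day_labels <- c(1:8, "9 - 10", 11, 12)

# One Day-by-color table per column, kept apart in a named list (X1 ... X10)
counts <- lapply(random, function(x) {
  table(Day = cut(x, breaks = breaks, labels = day_labels), color)
})

counts$X1                       # table for the first column
counts$X1["9 - 10", "Green"]    # Greens on day 9 or 10 in column 1
```

Because the result is a named list, counts from different columns stay distinguishable, which the question asks for.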

Related

R cut function - where do breaks begin and end? [duplicate]

I was looking for a clear explanation of the 'labels are constructed using "(a,b]" interval notation' - as described in the cut help file, which seemed to lack an explanation.

So I tested cut on some simple examples as follows:
df <- data.frame(c(1,2,3,4,5,6,7,99))
names(df) <- 'x'
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE)
df
# x cut
# 1 <NA>
# 2 <NA>
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So the '(' means x > the break on the left, and the ']' means x <= the (next) break on the right. If a value is lower than or equal to the lowest break it is flagged as NA; similarly, if a value exceeds the highest break it is also flagged as NA.
Next testing the option include.lowest = TRUE
df$cut <- cut(df[ ,1], breaks = c(2,4,6,8), right = TRUE, include.lowest = TRUE)
df
# x cut
# 1 <NA>
# 2 [2,4]
# 3 [2,4]
# 4 [2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 <NA>
So for the first bin, between the first two breaks, the '[' on the left means x >= the first break and the ']' means x <= the second break. Subsequent bins are treated as above.
Next the NA values can be addressed by using -Inf and/or +Inf in the breaks as follows:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE)
df
# x cut
# 1 [-Inf,2]
# 2 [-Inf,2]
# 3 (2,4]
# 4 (2,4]
# 5 (4,6]
# 6 (4,6]
# 7 (6,8]
# 99 (8, Inf]
Setting right = FALSE flips which side of each interval is closed: bins become [a,b), so a value equal to a break falls into the bin that starts at that break, as in the example below:
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = FALSE)
df
# x cut
# 1 [-Inf,2)
# 2 [2,4)
# 3 [2,4)
# 4 [4,6)
# 5 [4,6)
# 6 [6,8)
# 7 [6,8)
# 99 [8, Inf)
Finally the labels option allows custom names for the thresholds should you so wish ...
lbls <- c('x<=2','2<x<=4','4<x<=6','6<x<=8','x>8')
df$cut <- cut(df[ ,1], breaks = c(-Inf,2,4,6,8,+Inf), right = TRUE, include.lowest = TRUE, labels = lbls)
df
# x cut
# 1 x<=2
# 2 x<=2
# 3 2<x<=4
# 4 2<x<=4
# 5 4<x<=6
# 6 4<x<=6
# 7 6<x<=8
# 99 x>8
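The rules worked out above can be condensed into a short check. This is a minimal sketch using the same x values and breaks as the examples; the variable names default, lowest and left are only illustrative.

```r
x   <- c(1, 2, 3, 4, 5, 6, 7, 99)
brk <- c(2, 4, 6, 8)

# Default (right = TRUE): bins are (a,b], so 2 itself falls outside the first bin
default <- cut(x, breaks = brk)

# include.lowest = TRUE closes the first bin on the left: [2,4]
lowest <- cut(x, breaks = brk, include.lowest = TRUE)

# right = FALSE: bins are [a,b), so 2 is in [2,4) and 4 moves up to [4,6)
left <- cut(x, breaks = brk, right = FALSE)

data.frame(x, default, lowest, left)
```

Values outside the range of the breaks (1 and 99 here) stay NA in all three variants unless -Inf and Inf are added to the breaks.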

How to increase performance when randomly selecting clusters and adding observations?

In a clustered dataset, I want to randomly pick some clusters and then add some simulated observations to the selected clusters. Then I want to create a dataset that combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters. I would also like to repeat this process many times and thus create many (maybe 1000) new datasets. I managed to do this using for loops, but I would like to know whether there is a more efficient and concise way to accomplish it. Here is an example dataset:
## simulate some data
y <- rnorm(20)
x <- rnorm(20)
z <- rep(1:5, 4)
w <- rep(1:4, each=5)
dd <- data.frame(id=z, cluster=w, x=x, y=y)
# id cluster x y
# 1 1 1 0.30003855 0.65325768
# 2 2 1 -1.00563626 -0.12270866
# 3 3 1 0.01925927 -0.41367651
# 4 4 1 -1.07742065 -2.64314895
# 5 5 1 0.71270333 -0.09294102
# 6 1 2 1.08477509 0.43028470
# 7 2 2 -2.22498770 0.53539884
# 8 3 2 1.23569346 -0.55527835
# 9 4 2 -1.24104450 1.77950291
# 10 5 2 0.45476927 0.28642442
# 11 1 3 0.65990264 0.12631586
# 12 2 3 -0.19988983 1.27226678
# 13 3 3 -0.64511396 -0.71846622
# 14 4 3 0.16532102 -0.45033862
# 15 5 3 0.43881870 2.39745248
# 16 1 4 0.88330282 0.01112919
# 17 2 4 -2.05233698 1.63356842
# 18 3 4 -1.63637927 -1.43850664
# 19 4 4 1.43040234 -0.19051680
# 20 5 4 1.04662885 0.37842390
cl <- split(dd, dd$cluster) ## split the data based on clusters
k <- length(dd$id)
l <- length(cl)
`%notin%` <- Negate(`%in%`) ## define "not in" to exclude unselected clusters so
## as to retain their original observations
The following code defines a clsamp function that contains two for loops. The first loop collects the unselected clusters, and the second simulates new observations and appends them to the selected clusters. Note that I randomly sample 2 clusters (10% of the total number of observations), without replacement.
clsamp <- function(cl, k) {
  a <- sample(cl, size=0.1*k, replace=FALSE)
  jud <- (names(cl) %notin% names(a))
  need <- names(cl)[jud]
  T3 <- NULL
  for (k in need) {
    T3 <- rbind(T3, cl[[k]])
  }
  subt <- NULL
  s <- a
  for (j in 1:2) {
    y <- rnorm(2)
    x <- rnorm(2)
    d <- cbind(id=nrow(a[[j]]) + c(1:length(x)),
               cluster=unique(a[[j]]$cluster), x, y)
    s[[j]] <- rbind(a[[j]], d)
    subt <- rbind(subt, s[[j]])
  }
  T <- rbind(T3, subt)
  return(T)
}
Finally, this creates a list of 5 datasets, each of which combines the simulated and original observations from the selected clusters with all the original observations from the unselected clusters:
Q <- vector(mode="list", length=5)
for (i in 1:length(Q)) {
Q[[i]] <- clsamp(cl, 20)
}
Does anyone know a shorter way to do this? Maybe using the replicate function? Thanks.

This generates a matrix of random values with two columns (x and y) and 2 * size rows, and cbinds sampled cluster names and consecutive ids to it. It starts directly from dd and also works when you convert dd to a matrix mm, which might be slightly faster. The output is a data frame, though. Instead of your k, I use a fraction f to calculate size, the number of rows added to each of the two selected clusters. If size is zero, the original data is returned as a data frame.
clsamp2 <- function(m, f=.1) {
  size <- round(nrow(m)*f)
  if (size == 0) as.data.frame(m)
  else {
    ids <- unique(m[,1])
    cls <- unique(m[,2])
    rd <- matrix(rnorm(size * 4), ncol=2, dimnames=list(NULL, c("x", "y")))
    out <- rbind.data.frame(m, cbind(id=rep(max(ids) + 1:size, each=2),
                                     cluster=sample(cls, 2), rd))
    `rownames<-`(out[order(out$cluster, out$id), ], NULL)
  }
}
Result
set.seed(42) ## same seed also used for creating `dd`
clsamp2(dd, .1)
## or
mm <- as.matrix(dd)
clsamp2(mm, .1)
# id cluster x y
# 1 1 1 -0.30663859 1.37095845
# 2 2 1 -1.78130843 -0.56469817
# 3 3 1 -0.17191736 0.36312841
# 4 4 1 1.21467470 0.63286260
# 5 5 1 1.89519346 0.40426832
# 6 1 2 -0.43046913 -0.10612452
# 7 2 2 -0.25726938 1.51152200
# 8 3 2 -1.76316309 -0.09465904
# 9 4 2 0.46009735 2.01842371
# 10 5 2 -0.63999488 -0.06271410
# 11 6 2 1.37095845 0.40426832
# 12 7 2 0.36312841 1.51152200
# 13 1 3 0.45545012 1.30486965
# 14 2 3 0.70483734 2.28664539
# 15 3 3 1.03510352 -1.38886070
# 16 4 3 -0.60892638 -0.27878877
# 17 5 3 0.50495512 -0.13332134
# 18 1 4 -1.71700868 0.63595040
# 19 2 4 -0.78445901 -0.28425292
# 20 3 4 -0.85090759 -2.65645542
# 21 4 4 -2.41420765 -2.44046693
# 22 5 4 0.03612261 1.32011335
# 23 6 4 -0.56469817 -0.10612452
# 24 7 4 0.63286260 -0.09465904
To create the list of five samples, you may use replicate.
replicate(5, clsamp2(dd, .1), simplify=FALSE)
Running time is negligible.
system.time(replicate(1000, clsamp2(dd, .1), simplify=FALSE))
# user system elapsed
# 0.44 0.03 0.44
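For comparison, here is a more explicit variant of the same idea, written as a sketch rather than as a replacement for clsamp2. The function name clsamp3 and its arguments n_clusters and n_new are hypothetical. It picks the clusters first and then builds the new rows per cluster, which makes the number of selected clusters and the number of added rows independent parameters.

```r
# Toy data with the same layout as dd in the question
set.seed(42)
dd <- data.frame(id = rep(1:5, 4), cluster = rep(1:4, each = 5),
                 x = rnorm(20), y = rnorm(20))

# Hypothetical helper: add n_new simulated rows to n_clusters random clusters
clsamp3 <- function(dd, n_clusters = 2, n_new = 2) {
  picked <- sample(unique(dd$cluster), n_clusters)
  new_rows <- do.call(rbind, lapply(picked, function(cl) {
    data.frame(id = max(dd$id[dd$cluster == cl]) + seq_len(n_new),
               cluster = cl, x = rnorm(n_new), y = rnorm(n_new))
  }))
  out <- rbind(dd, new_rows)
  out <- out[order(out$cluster, out$id), ]
  rownames(out) <- NULL
  out
}

# Five resampled datasets, as in the question
samples <- replicate(5, clsamp3(dd), simplify = FALSE)
```

Each element of samples keeps all 20 original rows and gains two new rows in each of the two selected clusters.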

apply similar variable to multiple dataset in r

I have 6 data frames named dat1 to dat6. I want to add a variable region to each of them and label it in a similar way, like this:
dat1$region <- paste("NE-1")
dat2$region <- paste("NE-2")
dat3$region <- paste("NE-3")
dat4$region <- paste("NE-4")
dat5$region <- paste("NE-5")
How can I write this code in a more concise way, for example using apply or a for-loop?
Thanks!!

One option is to use get and assign functions in a for-loop.
Sample data:
dat1 <- data.frame(id=1:4, region = letters[1:4])
dat2 <- data.frame(id=5:8, region = letters[5:8])
dat3 <- data.frame(id=9:12, region = letters[9:12])
dat4 <- data.frame(id=13:16, region = letters[13:16])
dat5 <- data.frame(id=17:20, region = letters[17:20])
dat1
# id region
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
Apply for-loop:
for(i in 1:5){
name = paste("dat",i,sep="")
temp <- get(name)
temp$region = paste("NE",i,sep = "-")
assign(name, temp)
}
Verify results:
dat1
# id region
# 1 1 NE-1
# 2 2 NE-1
# 3 3 NE-1
# 4 4 NE-1
dat5
# id region
# 1 17 NE-5
# 2 18 NE-5
# 3 19 NE-5
# 4 20 NE-5

Keep all dataframes in a list then use lapply:
# example dataframes
dat1 <- cars[1:2, ]
dat2 <- cars[3:4, ]
dat3 <- cars[5:6, ]
myList <- list(dat1, dat2, dat3)
# myList
# [[1]]
# speed dist
# 1 4 2
# 2 4 10
#
# [[2]]
# speed dist
# 3 7 4
# 4 7 22
#
# [[3]]
# speed dist
# 5 8 16
# 6 9 10
Then it is easier to do repetitive operations. Loop through the list, add region column:
res <- lapply(seq_along(myList), function(i){
x <- myList[[ i ]]
x$region <- paste0("NE-", i)
x
})
res
# [[1]]
# speed dist region
# 1 4 2 NE-1
# 2 4 10 NE-1
#
# [[2]]
# speed dist region
# 3 7 4 NE-2
# 4 7 22 NE-2
#
# [[3]]
# speed dist region
# 5 8 16 NE-3
# 6 9 10 NE-3
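If the data frames already exist as separate objects, they can be gathered into a named list first. The following is a sketch assuming objects called dat1 to dat3 in the current environment; dat_list and combined are illustrative names. mget collects the objects by name, and Map pairs each data frame with its index.

```r
# Small stand-in data frames
dat1 <- data.frame(id = 1:2)
dat2 <- data.frame(id = 3:4)
dat3 <- data.frame(id = 5:6)

# Collect the objects by name into a named list
nms <- paste0("dat", 1:3)
dat_list <- mget(nms, envir = environment())

# Add the region column, using the position in the list as the suffix
dat_list <- Map(function(d, i) {
  d$region <- paste0("NE-", i)
  d
}, dat_list, seq_along(dat_list))

# Optionally stack everything into one data frame
combined <- do.call(rbind, dat_list)
```

Keeping the list names (dat1, dat2, ...) means individual data frames can still be reached as dat_list$dat2.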

How about this (assuming all your items start with dat and end with a unique identifier string):
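# names of all objects that look like dat1, dat2, ...
dat_names <- ls(pattern = "^dat[0-9]+$")
dat_ID <- sub("^dat", "", dat_names)
for(d in seq_along(dat_names)) {
  tmp <- get(dat_names[d])                 # fetch the data frame by name
  tmp$region <- paste("NE-", dat_ID[d], sep="")
  assign(dat_names[d], tmp)                # write it back under the same name
}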

R data frame manipulation

Suppose I have a data frame that looks like this.
# start end motif
# 2 6 a
# 10 15 b
# 30 35 c
How would I create a data frame that fills in the remaining start and end locations like so up to a certain number Max_end:
Max_end <- 33
# start end motif
# 0 2 na # <- 0-2 are filled in because it is not in the original data frame
# 2 6 a # <- 2-6 are in the original
# 6 10 na # <- 6-10 is not
# 10 15 b # <- 10-15 is
# 15 30 na # and so on
# 30 33 c
And further, calculates the distance between the start and end locations and creates a one column data frame.
# Length motif
# 2 na
# 4 a
# 4 na
# 5 b
# 15 na
# 3 c
Currently this is how I am doing it, and it is very inefficient:
library(data.table)
library(stringi)
f <- fread('ABC.txt',header=F,skip=1)$V1
f <- paste(f, collapse = "")
motifs = c('GATC', 'CTGCAG', 'ACCACC', 'CC(A|T)GG', 'CCAC.{8}TGA(C|T)')
v <- na.omit(data.frame(do.call(rbind, lapply(stri_locate_all_regex(f, motifs), unlist))))
v <- v[order(v[,1]),]
v2difference <- "blah"
for(i in 2:nrow(v)){
if(v[i,1] > v[i-1,2]+2){v2difference[i] <- v[i,1]-v[i-1,2]-2}
}
v2difference[1] <- v[1,1]
v2 <- data.frame(Order=seq(1, 2*nrow(v), 2),Lengths=matrix(v2difference, ncol = 1),Motifs="na")
v1 <- data.frame(Order=seq(2, 2*nrow(v), 2),Lengths=(v$end-v$start+1),Motifs=na.omit(unlist(stri_extract_all_regex(f,motifs))))
V <- data.frame(Track=1,rbind(v1,v2))
V <- V[order(V$Order),]
B <- V[,!(names(V) %in% "Order")]

Max_end <- 33
breaks <- c(0, t(as.matrix(dat[,1:2])), Max_end) # get endpoints
breaks <- breaks[breaks <= Max_end]
merge(dat, data.frame(start=breaks[-length(breaks)], end=breaks[-1]), all=T)
# start end motif
# 1 0 2 <NA>
# 2 2 6 a
# 3 6 10 <NA>
# 4 10 15 b
# 5 15 30 <NA>
# 6 30 33 <NA>
# 7 30 35 c
To specify a start and endpoint, you could do
Max_end <- 33
Max_start <- 10
breaks <- unique(c(Max_start, t(as.matrix(dat[,1:2])), Max_end))
breaks <- breaks[breaks <= Max_end & breaks >= Max_start]
merge(dat, data.frame(start=breaks[-length(breaks)], end=breaks[-1]), all.y=T)
# start end motif
# 1 10 15 b
# 2 15 30 <NA>
# 3 30 33 <NA>
Note: this doesn't include "c" in the shortened final interval; you would need to decide whether that value gets included when the interval changes.
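One way to settle that decision is to clip the original intervals to Max_end before filling the gaps, so the motif stays attached to the shortened interval. This is a sketch under that assumption, with dat rebuilt from the question's example and the names gaps and full chosen for illustration. It also adds the Length column the question asks for.

```r
dat <- data.frame(start = c(2, 10, 30), end = c(6, 15, 35),
                  motif = c("a", "b", "c"), stringsAsFactors = FALSE)
Max_end <- 33

# Assumption: intervals that cross Max_end are kept and truncated
dat <- dat[dat$start < Max_end, ]
dat$end <- pmin(dat$end, Max_end)

# Gaps run from 0 (or the previous end) to the next start (or Max_end)
gaps <- data.frame(start = c(0, dat$end), end = c(dat$start, Max_end),
                   motif = NA_character_, stringsAsFactors = FALSE)
gaps <- gaps[gaps$end > gaps$start, ]   # drop empty gaps

full <- rbind(dat, gaps)
full <- full[order(full$start), ]
rownames(full) <- NULL
full$Length <- full$end - full$start
full[, c("Length", "motif")]
```

For the example data this yields the lengths 2, 4, 4, 5, 15, 3 with motifs NA, a, NA, b, NA, c, which matches the layout requested in the question.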

How to select/find coordinates within a distance from a list (X/Y) using R

I have a data frame with a list of X/Y locations (>2000 rows). What I want is to select or find all the rows/locations based on a max distance. For example, from the data frame select all the locations that are between 1-100 km from each other. Any suggestions on how to do this?

You need to somehow determine the distance between each pair of rows.
The simplest way is with a corresponding distance matrix:
# Assuming thresh is your threshold
thresh <- 10
# create some sample data
library(data.table)
set.seed(123)
DT <- data.table(X=sample(-10:10, 5, TRUE), Y=sample(-10:10, 5, TRUE))
# create the distance matrix (createTable and distance are defined further down)
distTable <- matrix(apply(createTable(DT), 1, distance), nrow=nrow(DT))
# remove the lower.triangle since we have symmetry (we don't want duplicates)
distTable[lower.tri(distTable)] <- NA
# Show which rows are above the threshold
pairedRows <- which(distTable >= thresh, arr.ind=TRUE)
colnames(pairedRows) <- c("RowA", "RowB") # clean up the names
Starting with:
> DT
X Y
1: -4 -10
2: 6 1
3: -2 8
4: 8 1
5: 9 -1
We get:
> pairedRows
RowA RowB
[1,] 1 2
[2,] 1 3
[3,] 2 3
[4,] 1 4
[5,] 3 4
[6,] 1 5
[7,] 3 5
These are the two functions used for creating the distance matrix
# pair-up all of the rows
createTable <- function(DT)
expand.grid(apply(DT, 1, list), apply(DT, 1, list))
# simple cartesian/pythagorean distance
distance <- function(CoordPair)
sqrt(sum((CoordPair[[2]][[1]] - CoordPair[[1]][[1]])^2, na.rm=FALSE))

I'm not entirely clear from your question, but assuming you mean you want to take each row of coordinates and find all the other rows whose coordinates fall within a certain distance:
# Create data set for example
set.seed(42)
x <- sample(-100:100, 10)
set.seed(456)
y <- sample(-100:100, 10)
coords <- data.frame(
"x" = x,
"y" = y)
# Loop through all rows
lapply(1:nrow(coords), function(i) {
dis <- sqrt(
(coords[i,"x"] - coords[, "x"])^2 + # insert your preferred
(coords[i,"y"] - coords[, "y"])^2 # distance calculation here
)
names(dis) <- 1:nrow(coords) # replace this part with an index or
# row names if you have them
dis[dis > 0 & dis <= 100] # change numbers to preferred threshold
})
[[1]]
2 6 7 9 10
25.31798 95.01579 40.01250 30.87070 73.75636
[[2]]
1 6 7 9 10
25.317978 89.022469 51.107729 9.486833 60.539243
[[3]]
5 6 8
70.71068 91.78780 94.86833
[[4]]
5 10
40.16217 99.32774
[[5]]
3 4 6 10
70.71068 40.16217 93.40771 82.49242
[[6]]
1 2 3 5 7 8 9 10
95.01579 89.02247 91.78780 93.40771 64.53681 75.66373 97.08244 34.92850
[[7]]
1 2 6 9 10
40.01250 51.10773 64.53681 60.41523 57.55867
[[8]]
3 6
94.86833 75.66373
[[9]]
1 2 6 7 10
30.870698 9.486833 97.082439 60.415230 67.119297
[[10]]
1 2 4 5 6 7 9
73.75636 60.53924 99.32774 82.49242 34.92850 57.55867 67.11930
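For plain Euclidean distances, base R's dist can produce the same information in one step. This is a sketch, assuming planar coordinates in the same units as the threshold; max_dist, idx and close_pairs are illustrative names, and the sample data is simulated.

```r
set.seed(42)
coords <- data.frame(x = sample(-100:100, 10), y = sample(-100:100, 10))
max_dist <- 100

# Full pairwise distance matrix (Euclidean by default)
d <- as.matrix(dist(coords))

# Keep each pair once and drop the zero self-distances on the diagonal
d[upper.tri(d, diag = TRUE)] <- NA

# Row/column indices of all pairs within the threshold (NA counts as FALSE)
idx <- which(d <= max_dist, arr.ind = TRUE)
close_pairs <- data.frame(rowA = unname(idx[, "col"]),
                          rowB = unname(idx[, "row"]),
                          dist = d[idx])
close_pairs
```

The distance matrix has n^2 entries, so for a few thousand rows this is still workable; for much larger data, or for longitude/latitude coordinates, a different distance calculation would be needed.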
