I have a large dataset (DF), a subset of which looks like this:
Site Event HardwareID Species Day1 Day2 Day3 Day4 Day5 Day6
1 1 16_11 x 0 0 0 0 0 0
1 1 29_11 y 0 0 6 2 0 1
1 1 36_11 d 0 0 0 0 0 1
1 1 41_11 y 0 0 2 4 1 1
1 1 41_11 x 0 0 0 0 0 1
1 1 58_11 a 0 0 1 0 0 0
1 1 62_11 y 0 0 0 1 0 0
1 1 62_11 z 0 0 0 0 0 0
1 1 62_11 x 0 0 0 0 0 1
2 1 40_AR b 0 0 0 0 0 0
2 1 12_11 z 0 0 1 0 0 0
I'd like to find the minimum number of HardwareIDs that yields the most Species in the shortest amount of time. To do that, I want to calculate species accumulation curves (which intrinsically incorporate the Day columns) for each HardwareID at each site, and bootstrap the HardwareID selection (so, look at accumulation curves using two HardwareIDs, then three, then four, etc., at each site).
I have written a function to create species accumulation curves (using specaccum) for a subset of these, such as:
Sites<-subset(DF,DF$Site==1)
samples<-function (x) {
specurve_sample<-(ddply(Sites[,4:length(colnames(Sites))],"Species",numcolwise(sum)))
specurve_sample<-specurve_sample[-1,]
n<-specurve_sample$Species
n<-drop.levels(n,reorder=FALSE)
specurve_sample<-specurve_sample[,-1]
specurve_sample <-t(specurve_sample)
colnames(specurve_sample)<-n
specurve_sample<-as.data.frame(specurve_sample)
sample_k<-specaccum(specurve_sample)
out<-rbind(sample_k$richness,sample_k$sd)
outnames<-c("Richness","SD")
st<-rep(Sites$Site[1],2)
out<-as.data.frame(cbind(outnames,st,out))
colnames(out)<-c("label","site","Days")
out
}
The function works fine if I subset my data beforehand, but the bootstrapping part does not work. I know I need to create a function(x, j), but I cannot figure out where to place the j in my function. Here is the rest of my code. Many thanks for any assistance. James
all_data<-c()
for (i in 1:length(unique(DF$Site))) {
Sites<-subset(DF,DF$Site==i)
boots<-boot(Sites,samples, strata=Sites$HardwareID,R=1000)
all_data<-rbind(all_data,boots)
all_data
}
One straightforward way to do this is to create a function of x and j (as you have started to do), and have the first line of that function identify the relevant bootstrap subset from the whole collection, bootsub <- x[j, ]. Then, you can refer to this subset, bootsub throughout the rest of the function, and you need not refer to j again.
In your case, you don't want your function to refer back to your original data frame, Sites. So, everywhere that you have Sites in your function, change it to bootsub. For example:
samples <- function(x, j) {
bootsub <- x[j, ]
specurve_sample <- (ddply(bootsub[, 4:length(colnames(bootsub))], "Species", numcolwise(sum)))
specurve_sample <- specurve_sample[-1, ]
n <- specurve_sample$Species
n <- drop.levels(n, reorder=FALSE)
specurve_sample <- specurve_sample[, -1]
specurve_sample <- t(specurve_sample)
colnames(specurve_sample) <- n
specurve_sample <- as.data.frame(specurve_sample)
sample_k <- specaccum(specurve_sample)
out <- rbind(sample_k$richness, sample_k$sd)
outnames <- c("Richness", "SD")
st <- rep(bootsub$Site[1], 2)
out <- as.data.frame(cbind(outnames, st, out))
colnames(out) <- c("label", "site", "Days")
out
}
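As a minimal, self-contained sketch of the x[j, ] pattern (not the full specaccum pipeline), here is a toy example. The data frame toy and the function richness_stat are made up for illustration. Note that boot() expects the statistic to return a numeric vector, so this sketch returns a single number (the count of species detected in the resampled rows) rather than a data frame.

```r
library(boot)

# Hypothetical toy data, loosely modelled on the layout in the question
toy <- data.frame(
  HardwareID = c("16_11", "29_11", "36_11", "41_11", "41_11", "58_11"),
  Species    = c("x", "y", "d", "y", "x", "a"),
  Day1       = c(0, 0, 0, 2, 0, 1),
  Day2       = c(0, 6, 0, 4, 0, 0),
  Day3       = c(0, 1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Statistic in the (x, j) form that boot() calls:
# x is the data, j holds the resampled row indices
richness_stat <- function(x, j) {
  bootsub <- x[j, ]
  day_cols <- grep("^Day", names(bootsub))
  detected <- bootsub$Species[rowSums(bootsub[, day_cols]) > 0]
  length(unique(detected))
}

set.seed(1)
b <- boot(toy, richness_stat, R = 200)
b$t0      # statistic computed on the original rows
dim(b$t)  # one row per bootstrap replicate
```

If the real statistic has to carry several numbers (for example richness and SD per day), one option under the same assumption is to return them concatenated as a single numeric vector.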
...
A follow up to the first two comments below. It's a little hard to troubleshoot without data, but this is my best guess. It may be that you have an issue with your subset() function, because you use i as an index of unique sites in the for() loop, but then refer to i as the value of the site in the call to subset(). Also, it is likely more efficient to run one call to do.call() after the for() loop, rather than multiple calls to rbind() inside the loop. Give this untested code a try.
# vector of unique sites
usite <- unique(DF$Site)
# empty list in which to put the bootstrap results
alldatlist <- vector("list", length(usite))
# loop through every site separately, save the bootstrap replicates ($t)
for(i in 1:length(usite)) {
Sites <- subset(DF, DF$Site==usite[i])
alldatlist[[i]] <- boot(Sites, samples, strata=Sites$HardwareID, R=1000)$t
}
# combine the list of results into a single matrix
all_data <- do.call(rbind, alldatlist)
I already asked a related question here, which was probably not sufficiently specified and thus the answers did not solve my problem. Will try to do it better this time.
I have created dataframe df:
df <- data.frame(names1=c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'), names2=c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'), values=c(11,5,41,25,101,78,12,41,6,77))
There are purposely no "birds" in either the names1 or the names2 column.
and the following names vector:
dims <- c('dog','mouse','bird','cat')
and an initially empty matrix:
my_matrix <- matrix(data=0, nrow=length(unique(dims)), ncol=length(unique(dims)))
rownames(my_matrix) <- dims
colnames(my_matrix) <- dims
So the empty matrix looks like this:
> dog mouse bird cat
> dog 0 0 0 0
> mouse 0 0 0 0
> bird 0 0 0 0
> cat 0 0 0 0
The Goal:
The goal is to populate the empty matrix with the information coming from dataframe df
Count Matrix
In the first matrix, I want to count the occurrences where the vector elements match. So the first matrix should look exactly like this one (based on df$names1 and df$names2)
> dog mouse bird cat
> dog 1 1 0 0
> mouse 0 1 0 2
> bird 0 0 0 0
> cat 1 3 0 1
Sum Matrix
In the second matrix, I want to show the sums provided by column df$values and based on the vector match element given by df$names1 and df$names2. So the matrix should look exactly like this one
> dog mouse bird cat
> dog 5 41 0 0
> mouse 0 77 0 112
> bird 0 0 0 0
> cat 41 43 0 78
Restrictions:
There is one restriction regarding the shape of the matrix: the order of rows and columns is given by the dims vector. The purpose is for the matrix to show that there are no birds in df$names1 and df$names2.
My approach:
I tried to place the count and sum into each element of the empty matrix by using a for loop which looks like this:
for(i in 1:nrow(my_matrix)){
for(j in 1:ncol(my_matrix)){
my_matrix[i,j] <- sum(df$values & df$names1[i] == df$names2[j] & df$names1[j] == df$names2[i])
}
}
and which provides me with this undesired result
dog mouse bird cat
dog 0 0 0 10
mouse 0 10 0 0
bird 0 0 0 0
cat 10 0 0 0
My intuition tells me that the approach with the for loop is ok but I am not sure how exactly to address each matrix element in the loop and how to define the restrictions within the sum function to get the counts (goal matrix1) and sums (goal matrix2).
Your help will be much appreciated and I am very open to other solutions considering the above-mentioned restrictions (still knowing how to loop through a matrix and assign values to each position would be cool).
To continue your for loop approach -
The code you posted only loops through the first 4 elements of the dataframe (i.e. up to nrow() or ncol() of the matrix), and it does not match the row and column names of the matrix against the names1 and names2 columns of the dataframe.
I'm not sure if the following is exactly what you want, but it does loop through all of the dataframe, relate dataframe and matrix names and count/take sums, so you can modify it as you need.
my_matrix_count <- my_matrix
for(i in 1:nrow(my_matrix)){
for(j in 1:ncol(my_matrix)){
my_matrix_count[i,j] <-
sum(df$names1==row.names(my_matrix)[i]
& df$names2==colnames(my_matrix)[j])
my_matrix[i,j] <-
sum(df$values[df$names1==row.names(my_matrix)[i]
& df$names2==colnames(my_matrix)[j] ])
}
}
It's good to know looping and indexing, in my opinion, but yes, it can be done in many better ways, such as the answer by jblood94.
For another version, you can also try to write:
xtabs(values ~names1 + names2, data=df)
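One caveat: on the character columns as given, xtabs() only tabulates names that actually occur, so 'bird' would be missing and the order would be alphabetical. One possible way around this, sketched below, is to convert both columns to factors with levels = dims first. The names df_f, count_tab and sum_tab are made up here.

```r
df <- data.frame(
  names1 = c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'),
  names2 = c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'),
  values = c(11,5,41,25,101,78,12,41,6,77)
)
dims <- c('dog','mouse','bird','cat')

# Fix the level order so unused names (here 'bird') still get a row and a column
df_f <- transform(df,
                  names1 = factor(names1, levels = dims),
                  names2 = factor(names2, levels = dims))

# occurrences per names1/names2 pair
count_tab <- table(df_f$names1, df_f$names2)
# sums of values per names1/names2 pair
sum_tab <- xtabs(values ~ names1 + names2, data = df_f)

count_tab
sum_tab
```

Both results keep the row and column order of dims, with all-zero entries for 'bird'.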
Regards, Lars
Using match and data.table grouping operations:
library(data.table)
n <- length(dims)
# initialize matrices
m_count <- m_sum <- matrix(0L, n, n, 0, list(dims, dims))
dt <- setDT(df)[
, .(
# the matrix index for the name1-name2 combination
idx = (match(names2, dims) - 1L)*n + match(names1, dims),
count = .N, # value for the count matrix
sum = sum(values) # value for the sum matrix
),
c("names1", "names2") # group by
]
# update matrices
m_count[dt$idx] <- dt$count
m_sum[dt$idx] <- dt$sum
m_count
#> dog mouse bird cat
#> dog 1 1 0 0
#> mouse 0 1 0 2
#> bird 0 0 0 0
#> cat 1 3 0 1
m_sum
#> dog mouse bird cat
#> dog 5 41 0 0
#> mouse 0 77 0 112
#> bird 0 0 0 0
#> cat 41 43 0 78
Data:
df <- data.frame(
names1=c('mouse','dog','cat','cat','mouse','cat','cat','dog','cat','mouse'),
names2=c('cat','dog','dog','mouse','cat','cat','mouse','mouse','mouse','mouse'),
values=c(11,5,41,25,101,78,12,41,6,77)
)
dims <- unique(c('dog','mouse','bird','cat'))
I'm working on a project that is looking at regrowth of trees after a deforestation event. To simplify the data set for this question, I have a matrix (converted from data frame), which has 10 columns corresponding to years 2001-2010.
-1 indicates a change point in the data, when a previously forested plot was deforested. 1 indicates when a previously deforested region became forested. 0's indicate no change in state.
I found this link which I think does what I need to do, except in python/c++. Since I did the rest of my analyses in R, I want to stick with it.
So I was trying to translate some of the code to R, but I've been having problems.
This is my sample data set. One of my alternative thoughts is that if I could identify the index of (-1) and then the index of 1, then I could subtract these two indices to get the difference (and then subtract 1 to account for factoring in the first index in the subtraction)
# Example data
head(tcc_change)
id 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
1 1 0 0 0 0 0 -1 0 0 1 0
2 2 0 0 0 -1 0 0 1 0 0 0
3 3 0 0 0 -1 0 0 0 1 0 0
4 4 0 -1 0 0 0 0 1 0 0 0
5 5 0 0 0 1 0 0 -1 1 0 0
# Indexing attempt
tcc_change$loss_init <- apply(tcc_change, 1, function(x) match(-1, x[1:10], nomatch = 99))
tcc_change$gain <- apply(tcc_change, 1, function(x) match(1, x[1:10], nomatch=99))
This method has a lot of problems though. What if there's a 1 before a (-1), for example. I'd like to figure out a better way to do this analysis, similar to the logical structure in the link above, but I don't know how to do this in R.
Ideally I'd like to identify points where there was deforestation (-1) and then regrowth (1) and then count the zeroes in between. The number of zeroes in between would be posted to a new column. This would give me a better idea of how long it takes for a plot to become forested after a deforestation event. If there are no zeroes in between (like row 5), I would want the code to output '0'.
Sorry, my function may only handle simple cases.
First, your code has an issue: when you search for the index, you include the id column as well (in x[1:10]), which also leaves out the last year column. If you want to exclude id, you can use x[-1] to drop the first column, so that the index counts from the first year column instead.
tcc_change$loss_init <- apply(tcc_change, 1, function(x) match(-1, x[1:10], nomatch = 99))
tcc_change$gain <- apply(tcc_change, 1, function(x) match(1, x[1:10], nomatch=99))
I adjusted your approach: first get the index of -1, then use match again to search for the index of 1, starting just after the index of -1. Once that is found, subtracting 1 gives the number of zeroes in between:
get_interval = function(x){
init = match(-1, x[-1])
interval = match(1, x[-(1:(init+1))]) - 1
return(interval)
}
> apply(tcc_change, 1, get_interval)
[1] 2 2 3 4 0
Hope that helps.
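For rows that have no -1 at all, or no 1 after the -1, get_interval will fail or return NA. A hedged sketch of a more defensive variant follows. The function name get_interval_safe and the extra sixth row are made up for illustration, and it assumes the input holds only the year columns (id already dropped).

```r
# Hypothetical variant: 'years' is one row containing only the year columns
get_interval_safe <- function(years) {
  loss <- match(-1, years)                      # first deforestation event
  if (is.na(loss) || loss == length(years)) return(NA_integer_)
  after <- years[(loss + 1):length(years)]      # years following the loss
  gain <- match(1, after)                       # first regrowth after the loss
  if (is.na(gain)) return(NA_integer_)
  gain - 1L                                     # zeroes between loss and gain
}

yrs <- rbind(
  c(0, 0, 0, 0, 0, -1, 0, 0, 1, 0),
  c(0, 0, 0, -1, 0, 0, 1, 0, 0, 0),
  c(0, 0, 0, -1, 0, 0, 0, 1, 0, 0),
  c(0, -1, 0, 0, 0, 0, 1, 0, 0, 0),
  c(0, 0, 0, 1, 0, 0, -1, 1, 0, 0),
  c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0)   # made-up row: gain but no loss event
)
colnames(yrs) <- 2001:2010

res <- apply(yrs, 1, get_interval_safe)
res
```

Like the original, this only looks at the first -1 in each row. Rows with several loss events would need a different approach.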
I am hoping to create all possible permutations of a vector containing two different values, in which I control the proportion of each of the values.
For example, if I have a vector of length three and I want all possible combinations containing a single 1, my desired output is a list looking like this:
list.1 <- list(c(1,0,0), c(0,1,0), c(0,0,1))
In contrast, if I want all possible combinations containing three 1s, my desired output is a list looking like this:
list.3 <- list(c(1,1,1))
To put it another way, the pattern of the 1 and 0 values matter, but all 1s should be treated as identical to all other 1s.
Based on searching here and elsewhere, I've tried several approaches:
expand.grid(0:1, 0:1, 0:1) # this includes all possible combinations of 1, 2, or 3 ones
permn(c(0,1,1)) # this does not treat the ones as identical (e.g. it produces (0,1,1) twice)
unique(permn(c(0,1,1))) # this does the job!
So, using the function permn from the package combinat seems promising. However, when I scale this up to my actual problem (a vector of length 20, with 50% 1s and 50% 0s), I run into problems:
unique(permn(c(rep(1,10), rep(0, 10))))
# returns the error:
Error in vector("list", gamma(n + 1)) :
vector size specified is too large
My understanding is that this is happening because, in the call to permn, it makes a list containing all possible permutations, even though many of them are identical, and this list is too large for R to handle.
Does anyone have a suggestion for how to work around this?
Sorry if this has been answered previously - there are many, many SO questions containing similar language but different problems and I have not been able to find a solution which meets my needs!
It should not be a dealbreaker that expand.grid includes all permutations. Just add a subset after:
combinations <- function(size, choose) {
d <- do.call("expand.grid", rep(list(0:1), size))
d[rowSums(d) == choose,]
}
combinations(size=10, choose=3)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10
# 8 1 1 1 0 0 0 0 0 0 0
# 12 1 1 0 1 0 0 0 0 0 0
# 14 1 0 1 1 0 0 0 0 0 0
# 15 0 1 1 1 0 0 0 0 0 0
# 20 1 1 0 0 1 0 0 0 0 0
# 22 1 0 1 0 1 0 0 0 0 0
...
The problem is indeed that you are initially computing all factorial(20) (~10^18) permutations, which will not fit in your memory.
What you are looking for is an efficient way to compute multiset permutations. The multicool package can do this:
library(multicool)
res <- allPerm(initMC(c(rep(0,10),rep(1,10) )))
This computation takes about two minutes on my laptop, but is definitely feasible.
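Another option that needs only base R, offered as a sketch rather than a benchmarked alternative: since all 1s are interchangeable, each arrangement is fully determined by which positions hold a 1, so combn() can enumerate those positions directly. The helper name binary_perms is made up here.

```r
# Hypothetical helper: all 0/1 vectors of length n containing exactly k ones
binary_perms <- function(n, k) {
  combn(n, k, FUN = function(idx) {
    v <- integer(n)   # start with all zeroes
    v[idx] <- 1L      # place ones at the chosen positions
    v
  }, simplify = FALSE)
}

list.1 <- binary_perms(3, 1)
list.3 <- binary_perms(3, 3)
big <- binary_perms(20, 10)
length(big) == choose(20, 10)
```

This returns a list, like the desired output in the question. Using do.call(rbind, big) would turn it into a matrix with one arrangement per row.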
I'm working on code to construct an option pricing matrix. What I have at the moment is the values along the diagonal part of the matrix. Currently I'm working in a matrix with 4 rows and 4 columns. What I'm attempting to do is to use the values in the diagonal part of the matrix to give values in the lower triangle of the matrix. So for my matrix Omat, Omat[1,1]+Omat[2,2] will give a value for [2,1], Omat[2,2]+Omat[3,3] will give a value for [3,2]. Then using these created values, Omat[2,1]+Omat[3,2] will give a value for [3,1].
My attempt:
Omat = diag(2, 4, 4)
Omat[j+i,j] <- Omat[i-1,j]+Omat[i,j+1]
Any ideas on how one could go about this?
What I currently have, a 4 row by 4 col matrix:
Omat
# 2 0 0 0
# 0 2 0 0
# 0 0 2 0
# 0 0 0 2
What I've been attempting to create, a 4 row by 4 col matrix:
0 0 0 0
4 0 0 0
8 4 0 0
16 8 4 0
You could try calculating successive diagonals underneath the main diagonal. Code could look like:
Omat = diag(2,4)
for(i in 1:(nrow(Omat)-1)) {
for( j in (i+1):nrow(Omat)) {
Omat[j,j-i] <- Omat[j,j-i+1] + Omat[j-1,j-i]
}
}
diag(Omat) <- 0
I am probably missing something, but why not do this:
n <- nrow(Omat)
for (i in 2:n){
# fill each row from the diagonal leftwards, so that Omat[i,j+1] is already computed
for (j in (i-1):1){
Omat[i,j] <- Omat[i-1,j] + Omat[i,j+1]
}
}
diag(Omat) <- 0
David.
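For reuse with other diagonals, the same recurrence can be wrapped in a function. This is only a sketch: fill_lower and its argument d (the vector of diagonal values) are names made up for illustration.

```r
# Hypothetical helper: d is the vector of values on the main diagonal
fill_lower <- function(d) {
  n <- length(d)
  m <- diag(d, n, n)
  for (k in seq_len(n - 1)) {      # k-th diagonal below the main one
    for (i in (k + 1):n) {
      j <- i - k
      # neighbour above plus neighbour to the right, both already filled
      m[i, j] <- m[i - 1, j] + m[i, j + 1]
    }
  }
  diag(m) <- 0                     # clear the seed values, as in the desired output
  m
}

Omat <- fill_lower(rep(2, 4))
Omat
```

Working diagonal by diagonal guarantees that both inputs to each sum have been computed before they are used.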
I need some help with data manipulation in R. I have a long code which does this as a series of steps, but I am looking for a shorter way to do it.
Here is a data frame which has two columns - the first one is an ID and the other has pipe delimited data in it as shown below:
ID DATA
1 a
2 a|b
3 b|c
4 d|e
I need to convert this to this form:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I am hoping there is a simpler way to do this than the lengthy code I have.
Thanks in advance for your help.
This works on the supplied data. First read in your data:
pipdat <- read.table(stdin(),header=TRUE,stringsAsFactors=FALSE)
ID DATA
1 a
2 a|b
3 b|c
4 d|e
# leave a blank line at the end so it stops reading
Now here goes:
nr <- dim(pipdat)[1]
chrs <- strsplit(pipdat[,2],"[|]")
af <- unique(unlist(chrs))
whichlet <- function(a,fac) as.numeric(fac %in% a)
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
(That can be done in fewer lines, but it's handy to see what some of those steps do)
It produces:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I guessed from your post that you wanted ID as row names; if you need it to be a column of data that last line needs to be different.
I'd have used sapply instead of lapply, but you end up with the transpose of the desired matrix. That works if you replace the last line with:
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(ID=1:nr,af)
res
but it might be slower.
---
If you don't follow the line
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
just break it up from the innermost function outward:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
matrix(vres,byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
---
If you need ID as a column of data instead of row names, one way to do it is:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
cbind(ID=1:nr,matrix(vres,byrow=TRUE,nr=nr,dimnames=list(1:nr,af)))
or you could do
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(1:nr,af)
cbind(ID=1:nr,res)
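A shorter route, sketched here as one possibility, is to reshape to long form and let table() do the counting. The names parts, long, tab and wide are made up for illustration, and the example rebuilds pipdat inline so it runs on its own.

```r
pipdat <- data.frame(ID = 1:4, DATA = c("a", "a|b", "b|c", "d|e"),
                     stringsAsFactors = FALSE)

# split on a literal pipe, then pair every token with its ID
parts <- strsplit(pipdat$DATA, "|", fixed = TRUE)
long <- data.frame(ID = rep(pipdat$ID, lengths(parts)),
                   value = unlist(parts))

# cross-tabulate ID against token: one row per ID, one column per token
tab <- table(long$ID, long$value)
wide <- cbind(ID = pipdat$ID, as.data.frame.matrix(tab))
wide
```

Two caveats with this sketch: a token repeated within one DATA entry would be counted more than once, and an ID with an empty DATA entry would drop out of the table.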