I am attempting to loop a command over a list of species (fish_species). While I've found plenty of examples, I haven't found one that also changes the column name as part of the loop. I have figured out how to get the desired result for an individual species (the four ifelse lines at the end of the code below), but the actual dataset has ~500 species, and I'd prefer not to repeat this command 500+ times. Is there a way to substitute the values from a list where it says variable?
fishdata$variable <- ifelse(fishdata$Species == "variable", fishdata$Number, 0)
I know how to do this in ArcGIS, but I am trying to expand my horizons and learn R. This is also my first post, so please excuse any screw-ups.
Thank you for any help you can provide.
# start from a data frame so that every column holds one value per row
fishdata <- data.frame(Site = c(1,1,1,2,2,2))
fishdata$Species <- c("one_fish", "two_fish", "two_fish", "red_fish", "blue_fish", "blue_fish")
fishdata$Number <- c(1,1,1,1,1,1)
fishdata$one_fish <-0
fishdata$two_fish <-0
fishdata$red_fish <-0
fishdata$blue_fish <-0
fish_species <- c("one_fish","two_fish", "red_fish", "blue_fish")
fishdata$one_fish <- ifelse(fishdata$Species=="one_fish",fishdata$Number,0)
fishdata$two_fish <- ifelse(fishdata$Species=="two_fish",fishdata$Number,0)
fishdata$red_fish <- ifelse(fishdata$Species=="red_fish",fishdata$Number,0)
fishdata$blue_fish <- ifelse(fishdata$Species=="blue_fish",fishdata$Number,0)
You can use sapply to iterate over the Species column:
sapply(fishdata$Species, function(i)ifelse(fishdata$Species== i, fishdata$Number,0))
#      one_fish two_fish two_fish red_fish blue_fish blue_fish
# [1,]        1        0        0        0         0         0
# [2,]        0        1        1        0         0         0
# [3,]        0        1        1        0         0         0
# [4,]        0        0        0        1         0         0
# [5,]        0        0        0        0         1         1
# [6,]        0        0        0        0         1         1
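A possible variation on the above, offered as a sketch rather than the answerer's exact method: iterating over unique(species) instead of every row gives one column per species rather than one column per row. The vectors species and number and the name ind are stand-ins for the question's columns, introduced only for this illustration.

```r
# Example data copied from the question (held in plain vectors here)
species <- c("one_fish", "two_fish", "two_fish", "red_fish", "blue_fish", "blue_fish")
number  <- c(1, 1, 1, 1, 1, 1)

# One column per distinct species: the count where the row matches, 0 elsewhere.
# sapply() names the columns after the character values it iterates over.
ind <- sapply(unique(species), function(s) ifelse(species == s, number, 0))
ind
```

The result is a 6 x 4 matrix, which could be attached to the original data with cbind() if needed.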
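$ is just shorthand for the [[ ]] operator with a fixed name, so these are equivalent:
a$x
a[["x"]]
Unlike $, [[ ]] also accepts a name stored in a variable, so you can do:
fishdata[[species]] <- ifelse(fishdata$Species == species, fishdata$Number, 0)
and wrap that in a loop over the species names:
for (species in fish_species) {
  fishdata[[species]] <- ifelse(fishdata$Species == species, fishdata$Number, 0)
}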
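If you would rather avoid the explicit loop, the same idea can probably be written with lapply, assigning all the new columns in one step. This is a minimal sketch that assumes fishdata is a data frame (the question builds it up piece by piece) and that fish_species holds the distinct species names.

```r
# Example data from the question, built as a data frame (an assumption)
fishdata <- data.frame(
  Site    = c(1, 1, 1, 2, 2, 2),
  Species = c("one_fish", "two_fish", "two_fish", "red_fish", "blue_fish", "blue_fish"),
  Number  = c(1, 1, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)
fish_species <- unique(fishdata$Species)

# lapply() returns one vector per species; assigning the list to
# fishdata[fish_species] creates one new column per name.
fishdata[fish_species] <- lapply(fish_species, function(sp) {
  ifelse(fishdata$Species == sp, fishdata$Number, 0)
})
fishdata
```

With ~500 species this still evaluates one ifelse per species, so it is mainly a tidier way of writing the loop, not a different algorithm.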
I'm working on populating a binary matrix based on values from a different table. I can create the matrix but am struggling with the looping needed to populate it. I think this is a pretty simple issue so I hope I can get some easy help.
Here's an example of my data:
start <- c(291, 291, 291, 702, 630, 768)
sequence <- c("chr9:103869456:103870456", "chr5:30823103:30824103", "chr11:49801703:49802703", "chr4:133865601:133866601", "chr12:55738034:55739034", "chr8:96569493:96570493")
motif <- c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1")
df <- data.frame(start, sequence, motif)
I have created a character vector for each unique motif+start values like so:
x <- sprintf("%s_%d", df$motif, df$start)
x <- unique(x)
Next I create a binary matrix with the sequences as rows and the values from x as columns:
binmat <- matrix(0, nrow = length(df$sequence), ncol = length(x))
rownames(binmat) <- df$sequence
colnames(binmat) <- x
And now I'm stuck. I want to iterate through columns and rows and put a 1 in each position that has a match. For example, the first sequence is "chr9:103869456:103870456" and it has motif "ARI5B" at starting position 291, so it should get a 1 while the rest of the values in that row remain at 0. The output of this example should look like this:
                         ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
chr9:103869456:103870456         1         0        0        0
chr5:30823103:30824103           1         0        0        0
chr11:49801703:49802703          1         0        0        0
chr4:133865601:133866601         0         1        0        0
chr12:55738034:55739034          0         0        1        0
chr8:96569493:96570493           0         0        0        1
But so far I am unsuccessful. I think I need a double for loop somewhere along these lines:
for (row in binmat){
for (col in binmat){
if (row && col %in% x){
1
} else { 0
}
}
}
But all I get are 0s.
Thanks in advance!
Aren't you just looking for table here? You can get the result as a vectorized one-liner, without loops, by doing:
table(factor(df$sequence, df$sequence), sprintf("%s_%d", df$motif, df$start))
                         ARI5B_291 ATOH1_702 EGR1_630 EGR1_768
chr9:103869456:103870456         1         0        0        0
chr5:30823103:30824103           1         0        0        0
chr11:49801703:49802703          1         0        0        0
chr4:133865601:133866601         0         1        0        0
chr12:55738034:55739034          0         0        1        0
chr8:96569493:96570493           0         0        0        1
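If you want to keep the pre-built binmat from the question rather than switch to table, one option is to fill it with a two-column index matrix. This is only a sketch under the assumption that each row of df corresponds to one row of binmat, as in the example; key is a helper name introduced here.

```r
start <- c(291, 291, 291, 702, 630, 768)
sequence <- c("chr9:103869456:103870456", "chr5:30823103:30824103",
              "chr11:49801703:49802703", "chr4:133865601:133866601",
              "chr12:55738034:55739034", "chr8:96569493:96570493")
motif <- c("ARI5B", "ARI5B", "ARI5B", "ATOH1", "EGR1", "EGR1")
df <- data.frame(start, sequence, motif, stringsAsFactors = FALSE)

# motif_start label for every row, and the distinct labels used as columns
key <- sprintf("%s_%d", df$motif, df$start)
x <- unique(key)

binmat <- matrix(0, nrow = nrow(df), ncol = length(x),
                 dimnames = list(df$sequence, x))

# Each row of the index matrix is a (row, column) pair to set to 1
binmat[cbind(seq_len(nrow(df)), match(key, x))] <- 1
binmat
```

match(key, x) gives the column position of each row's label, so no loop over rows or columns is needed.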
I have created a transition matrix as a 'from cluster' (rows) 'to cluster' (columns) frequency. Think Markov chain.
Assume I have 5 'from' clusters but only 3 'to' clusters; then I get a 5*3 transition matrix. How do I force it to be a 5*5 transition matrix? Effectively, how do I show the all-zero columns?
I'm after an elegant solution, as this will be applied to a much larger problem involving hundreds of clusters. I am quite unfamiliar with R matrices, and I don't know of an elegant way to force the number of columns to equal the number of rows and fill in zeros where there is no match, other than a for loop, which my hunch says is not the best solution.
Example code:
# example data
cluster_before <- c(1,2,3,4,5)
cluster_after <- c(1,2,4,4,1)
# Table output
table(cluster_before,cluster_after)
# ncol does not = nrows. I want to rectify that
# I want output to look like this:
what_I_want <- matrix(
c(1,0,0,0,0,
0,1,0,0,0,
0,0,0,1,0,
0,0,0,1,0,
1,0,0,0,0),
byrow=TRUE,ncol=5
)
# Possible solution. But for loop can't be best solution?
empty_mat <- matrix(0,ncol=5,nrow=5)
matrix_to_update <- empty_mat
for (i in 1:length(cluster_before)) {
val_before <- cluster_before[i]
val_after <- cluster_after[i]
matrix_to_update[val_before,val_after] <- matrix_to_update[val_before,val_after]+1
}
matrix_to_update
# What's the more elegant solution?
Thanks in advance for your help. It's much appreciated.
Make them factors and then table:
levs <- union(cluster_before, cluster_after)
table(factor(cluster_before,levs), factor(cluster_after,levs))
# 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 0 1 0
# 4 0 0 0 1 0
# 5 1 0 0 0 0
Another solution is to use matrix indices:
what_I_want <- matrix(0,ncol=5,nrow=5)
what_I_want[cbind(cluster_before,cluster_after)] <- 1
print(what_I_want)
## [,1] [,2] [,3] [,4] [,5]
##[1,] 1 0 0 0 0
##[2,] 0 1 0 0 0
##[3,] 0 0 0 1 0
##[4,] 0 0 0 1 0
##[5,] 1 0 0 0 0
The second line sets the elements corresponding to the row (cluster_before) and column (cluster_after) indices to 1.
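One difference between the two answers is worth keeping in mind: assigning 1 through an index matrix only flags a transition, while table counts how often it occurs. The small sketch below uses made-up cluster vectors (not the question's data) in which the pair (1, 2) occurs twice; counts and flags are names chosen for illustration.

```r
# Hypothetical data where the transition 1 -> 2 happens twice
cluster_before <- c(1, 1, 2, 3)
cluster_after  <- c(2, 2, 3, 3)
levs <- sort(union(cluster_before, cluster_after))

# table() with shared factor levels: square matrix of frequencies
counts <- table(factor(cluster_before, levs), factor(cluster_after, levs))

# index-matrix assignment: square matrix of 0/1 flags
flags <- matrix(0, nrow = length(levs), ncol = length(levs))
flags[cbind(cluster_before, cluster_after)] <- 1

counts
flags
```

For a frequency-based transition matrix, as in the question, the table version is the one that preserves the counts.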
Hope this helps.
I have a large dataset (DF), a subset of which looks like this:
Site Event HardwareID Species Day1 Day2 Day3 Day4 Day5 Day6
1 1 16_11 x 0 0 0 0 0 0
1 1 29_11 y 0 0 6 2 0 1
1 1 36_11 d 0 0 0 0 0 1
1 1 41_11 y 0 0 2 4 1 1
1 1 41_11 x 0 0 0 0 0 1
1 1 58_11 a 0 0 1 0 0 0
1 1 62_11 y 0 0 0 1 0 0
1 1 62_11 z 0 0 0 0 0 0
1 1 62_11 x 0 0 0 0 0 1
2 1 40_AR b 0 0 0 0 0 0
2 1 12_11 z 0 0 1 0 0 0
I'd like to examine the minimum number of HardwareIDs needed to produce the most Species over the shortest amount of time, by calculating species accumulation curves (which intrinsically incorporate the Days columns) for each HardwareID, at each different site, and bootstrapping the HardwareID selection part (so, look at accumulation curves using two HardwareIDs, then 3, then 4 etc., at each site).
I have written a function to create species accumulation curves (using specaccum) for a subset of these, such as:
Sites<-subset(DF,DF$Site==1)
samples<-function (x) {
specurve_sample<-(ddply(Sites[,4:length(colnames(Sites))],"Species",numcolwise(sum)))
specurve_sample<-specurve_sample[-1,]
n<-specurve_sample$Species
n<-drop.levels(n,reorder=FALSE)
specurve_sample<-specurve_sample[,-1]
specurve_sample <-t(specurve_sample)
colnames(specurve_sample)<-n
specurve_sample<-as.data.frame(specurve_sample)
sample_k<-specaccum(specurve_sample)
out<-rbind(sample_k$richness,sample_k$sd)
outnames<-c("Richness","SD")
st<-rep(Sites$Site[1],2)
out<-as.data.frame(cbind(outnames,st,out))
colnames(out)<-c("label","site","Days")
out
}
The function works fine if I subset my data beforehand, but the bootstrapping part does not work. I know I need to create a function(x, j) but cannot figure out where to place the j in my function. Here is the rest of my code. Many thanks for any assistance. James
all_data<-c()
for (i in 1:length(unique(DF$Site))) {
Sites<-subset(DF,DF$Site==i)
boots<-boot(Sites,samples, strata=Sites$HardwareID,R=1000)
all_data<-rbind(all_data,boots)
all_data
}
One straightforward way to do this is to create a function of x and j (as you have started to do), and have the first line of that function identify the relevant bootstrap subset from the whole collection, bootsub <- x[j, ]. Then you can refer to this subset, bootsub, throughout the rest of the function, and you need not refer to j again.
In your case, you don't want your function to refer back to your original data frame, Sites. So, everywhere that you have Sites in your function, change it to bootsub. For example:
samples <- function(x, j) {
  bootsub <- x[j, ]
  specurve_sample <- (ddply(bootsub[, 4:length(colnames(bootsub))], "Species", numcolwise(sum)))
  specurve_sample <- specurve_sample[-1, ]
  n <- specurve_sample$Species
  n <- drop.levels(n, reorder=FALSE)
  specurve_sample <- specurve_sample[, -1]
  specurve_sample <- t(specurve_sample)
  colnames(specurve_sample) <- n
  specurve_sample <- as.data.frame(specurve_sample)
  sample_k <- specaccum(specurve_sample)
  out <- rbind(sample_k$richness, sample_k$sd)
  outnames <- c("Richness", "SD")
  st <- rep(bootsub$Site[1], 2)
  out <- as.data.frame(cbind(outnames, st, out))
  colnames(out) <- c("label", "site", "Days")
  out
}
...
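To make the x[j, ] pattern concrete without the plyr and vegan dependencies, here is a stripped-down sketch. The function richness and the data frame toy are hypothetical stand-ins, not the asker's real statistic or data, and the resampling is done by hand with sample() to mimic how boot() passes a vector of row indices as the second argument.

```r
# Statistic in the (data, indices) form: all work is done on the resampled rows
richness <- function(x, j) {
  bootsub <- x[j, ]                                   # bootstrap subset
  day_cols <- grep("^Day", names(bootsub))            # the Day1, Day2, ... columns
  seen <- rowSums(bootsub[, day_cols, drop = FALSE]) > 0
  length(unique(bootsub$Species[seen]))               # species actually detected
}

# Small made-up data set in the same shape as the question's DF
toy <- data.frame(
  HardwareID = c("16_11", "29_11", "36_11", "41_11"),
  Species    = c("x", "y", "d", "y"),
  Day1       = c(0, 0, 0, 0),
  Day2       = c(0, 6, 0, 2),
  Day3       = c(0, 2, 1, 4),
  stringsAsFactors = FALSE
)

# Statistic on the original rows, then on 200 resamples drawn by hand
richness(toy, seq_len(nrow(toy)))
set.seed(1)
reps <- replicate(200, richness(toy, sample(nrow(toy), replace = TRUE)))
table(reps)
```

Note that the sketch returns a single number per resample; a statistic passed to boot() is expected to return a numeric vector, so a data frame result such as out above would likely need to be reduced to its numeric part first.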
A follow-up to the first two comments below. It's a little hard to troubleshoot without data, but this is my best guess. You may have an issue with your subset() call, because you use i as an index over the unique sites in the for() loop, but then treat i as the value of the site in the call to subset(). Also, it is likely more efficient to make one call to do.call() after the for() loop, rather than multiple calls to rbind() inside the loop. Give this untested code a try.
# vector of unique sites
usite <- unique(DF$Site)
# empty list in which to put the bootstrap results
alldatlist <- vector("list", length(usite))
# loop through every site separately, save the bootstrap replicates ($t)
for(i in 1:length(usite)) {
Sites <- subset(DF, DF$Site==usite[i])
alldatlist[[i]] <- boot(Sites, samples, strata=Sites$HardwareID, R=1000)$t
}
# combine the list of results into a single matrix
all_data <- do.call(rbind, alldatlist)
I would like to create a matrix of indicator variables. My initial thought was to use model.matrix, which was also suggested here: Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
However, model.matrix does not seem to work if a factor has only one level.
Here is an example data set with three levels to the factor 'region':
dat = read.table(text = "
reg1 reg2 reg3
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
0 1 0
0 1 0
0 1 0
0 0 1
0 0 1
0 0 1
0 0 1
", sep = "", header = TRUE)
# model.matrix works if there are multiple regions:
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
df.region <- as.data.frame(region)
df.region$region <- as.factor(df.region$region)
my.matrix <- as.data.frame(model.matrix(~ -1 + df.region$region, df.region))
my.matrix
# The following for-loop works even if there is only one level to the factor
# (one region):
# region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
for(i in 1:length(region)) {my.matrix[i,region[i]]=1}
my.matrix
The for-loop is effective and seems simple enough. However, I have been struggling to come up with a solution that does not involve loops. I can use the loop above, but have been trying hard to wean myself off of them. Is there a better way?
I would use matrix indexing. From ?"[":
A third form of indexing is via a numeric matrix with the one column for each dimension: each row of the index matrix then selects a single element of the array, and the result is a vector.
Making use of that nice feature:
my.matrix <- matrix(0, nrow=length(region), ncol=length(unique(region)))
my.matrix[cbind(seq_along(region), region)] <- 1
# [,1] [,2] [,3]
# [1,] 1 0 0
# [2,] 1 0 0
# [3,] 1 0 0
# [4,] 1 0 0
# [5,] 1 0 0
# [6,] 1 0 0
# [7,] 0 1 0
# [8,] 0 1 0
# [9,] 0 1 0
# [10,] 0 0 1
# [11,] 0 0 1
# [12,] 0 0 1
# [13,] 0 0 1
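The indexing above relies on the region codes being 1, 2, 3, ... so that they can be used directly as column numbers. If the codes are arbitrary, a sketch along these lines might help; the region values below are invented for the example and levs is a helper name. match() translates each code to its column position.

```r
# Hypothetical region codes that are not consecutive integers
region <- c(10, 10, 20, 35, 35)
levs <- sort(unique(region))

ind <- matrix(0, nrow = length(region), ncol = length(levs),
              dimnames = list(NULL, as.character(levs)))

# Row i gets a 1 in the column whose level equals region[i]
ind[cbind(seq_along(region), match(region, levs))] <- 1
ind
```

This also works when there is only one level, since the matrix then simply has a single column of ones.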
I came up with this solution by modifying an answer to a similar question here:
Reshaping a column from a data frame into several columns using R
region <- c(1,1,1,1,1,1,2,2,2,3,3,3,3)
site <- seq(1:length(region))
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
region <- c(1,1,1,1,1,1,1,1,1,1,1,1,1)
site <- seq(1:length(region))
df <- cbind(site, region)
ind <- xtabs( ~ site + region, df)
ind
EDIT:
The line below will extract the data frame of indicator variables from ind:
ind.matrix <- as.data.frame.matrix(ind)
I have a nested list, returned by a function, in which the top-level element names are repeated as the names of the sub-elements:
$`1`
$`1`$`1`
[1] 0 0 0 0 0 0 0 1 0
$`1`$`2`
[1] 0 0 0 0 0 0 0 0 0
$`2`
$`2`$`1`
[1] 0 0 0 1 1 0 0 0 0
$`2`$`2`
[1] 0 1 0 0 0 1 0 0 0
Is there a way to use an apply function (or anything else) to extract those vectors where the element and sub-element names match, e.g. $1$1 and $2$2? I have a huge list (4000 elements with 4000 sub-elements each), so efficiency is a must.
Alternatively - I have figured out a way out of this mess by using melt(), but it is too time-consuming for the size of my set. If anyone knows how to replicate the effect, giving a data frame with 3 columns (one for the element name, one for the sub-element name and one for the vector), that will also work.
Regards and thanks :)
This is a way to get a list of the vectors you want:
lapply(names(dat), function(x) dat[[x]][[x]])
In a data frame:
do.call("rbind",
lapply(names(dat),
function(x) data.frame(element = x,
subelement = x,
values = dat[[x]][[x]])
)
)
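If all the matched vectors have the same length, a wide layout (one row per element) may be cheaper to build than the long data frame above. This is a sketch on a small made-up list l, with diag_list and diag_mat as illustrative names; it assumes every top-level name also exists as a sub-element name.

```r
# Small stand-in for the nested list in the question
l <- list(
  `1` = list(`1` = c(0, 1, 0), `2` = c(0, 0, 0)),
  `2` = list(`1` = c(1, 1, 0), `2` = c(0, 1, 1))
)

# Pick l[[name]][[name]] for every top-level name, keeping the names
diag_list <- lapply(setNames(names(l), names(l)), function(nm) l[[nm]][[nm]])

# Stack the vectors into a matrix: one row per element name
diag_mat <- do.call(rbind, diag_list)
diag_mat
```

Only the matching sub-elements are ever touched, so the work grows with the number of top-level elements rather than with the full 4000 x 4000 structure.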
You can unlist them without recursion to remove the top level list structure, and then use regex-assisted subsetting on the names of this result.
l <- list(`1`=list(`1`=rpois(6,1),`2`=rep(0,6)),`2`=list(`1`=rep(0,6),`2`=rpois(6,1)))
l2 <- unlist(l, recursive=FALSE)
# anchor the pattern so that a name such as "1.12" is not matched as "1.1"
l2[grepl("^([0-9]+)[.]\\1$", names(l2))]
$`1.1`
[1] 2 0 2 4 1 0
$`2.2`
[1] 0 0 0 2 1 0