I would like to declare a column in a data.frame that is a multidimensional character array (3 characters in each row). I'm driving myself crazy trying to figure this out.
simulations <- 1000
data <- data.frame(nonsing = character(simulations))
for(i in 1:simulations){
data$nonsing[i] = letters[1:3]
}
You need to collapse the 3 characters into one string, which can be done with toString.
simulations <- 1000
data <- data.frame(nonsing = character(simulations))
for(i in 1:simulations){
data$nonsing[i] = toString(letters[sample(1:26, 3)])
}
letters[1:3] would always give 'a, b, c', so I used sample to pick 3 random letters.
You can also use replicate:
data$nonsing <- replicate(simulations, toString(letters[sample(1:26, 3)]))
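If you'd rather keep the three letters as separate elements instead of one collapsed string, a list column is another option. This is only a minimal sketch and not part of the answer above; the id column is a placeholder (an assumption) so that the data frame has rows before the list column is added:

```r
simulations <- 1000
# 'id' is a hypothetical placeholder column that fixes the number of rows
data <- data.frame(id = seq_len(simulations))
# simplify = FALSE keeps each draw as its own length-3 character vector
data$nonsing <- replicate(simulations, sample(letters, 3), simplify = FALSE)
data$nonsing[[1]]  # the three letters stored in row 1
# A string collapsed with toString() can be split back with strsplit()
parts <- strsplit(toString(c("a", "b", "c")), ", ")[[1]]
```

Each row of data$nonsing then holds a character vector of length 3, at the cost that list columns are less convenient to print and export than plain strings.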
I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to make into a function. The problem I am having is in how to step through the constraint vectors without a loop. Here's what I mean;
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))
## feature ranges by column number
frm <- c(438,563,953,963)
to <- c(548,803,1000,993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")
WL.ps <- list()
for (i in 1:length(frm)){
## finds the minimum value within the range constraints and returns the corresponding column name
WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]],1,which.min)]
WL.ps[[i]] <- WL
}
new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the 'frm' and 'to' vector values is what I'm having trouble with. How does one go from frm[1] to frm[2].. so-on in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function that returns the column name of the minimum value in each row for a given range of columns. I have used max.col on the negated data instead of apply(df, 1, which.min) to find the minimum in each row, since max.col is more efficient than apply.
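apply_fun <- function(data, x, y) {
cols <- x:y
# ties.method = "first" matches which.min; the max.col default breaks ties at random
names(data[cols])[max.col(-data[cols], ties.method = "first")]
}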
Apply this function using Map:
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
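Map returns an unnamed list here, so one more step is needed to get the data frame from the question. Below is a self-contained sketch on a smaller made-up data set; the helper name min_col, the column ranges and the band names are assumptions for illustration only:

```r
set.seed(11)
# Small stand-in for the spectral data: 20 samples, 30 measurement columns
df <- data.frame(Sample = 1:20, replicate(30, sample(0:1000, 20, replace = TRUE)))
frm <- c(2, 12)                 # hypothetical range starts (column numbers)
to  <- c(11, 31)                # hypothetical range ends
nm  <- c("band_a", "band_b")    # hypothetical output column names

# Column name of the row-wise minimum within columns from:to
min_col <- function(data, from, to) {
  block <- data[from:to]
  names(block)[max.col(-block, ties.method = "first")]
}

WL.ps  <- Map(min_col, frm, to, MoreArgs = list(data = df))
new.df <- as.data.frame(setNames(WL.ps, nm), stringsAsFactors = FALSE)
head(new.df)
```

Naming the list before calling as.data.frame avoids the auto-generated column names you would otherwise have to overwrite.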
I want to create 10 (at work: 50,000) random data-frames, setting a seed for the sake of reproducibility. The seed should be different for each data-frame, and the names should increase from df_01, df_02 ... to df_10. With the help of @akrun's answer I coded a loop like this:
# Number of data-frames to be created
n <- 10
# setting a seed vector
x <- 42
# loop
for (i in 1:10) {
set.seed(x)
a <- rnorm(10,.9,.05)
b <- sample(8:200,10,replace=TRUE)
c <- rnorm(10,80,30)
lst <- replicate(i, data.frame(a,b,c), simplify=FALSE)
x <- x+i
}
# name data-frames
names(lst) <- paste0('df', 1:10)
Now I have my data-frames, but it seems I can't get the random generation running. All data are the same. When I replace the lst-line with the following code, at least the seeded randomization works:
print(data.frame(a,b,c))
A crackajack extra would be a hint for leading zeros in the dfs-names in order to sort them.
Any help appreciated, thx!
You get the same results in all your list elements, because you create your list from scratch in every iteration using replicate and replace the previously created one. If you are using a for loop, you do not need replicate.
For the sake of reproducibility I would create a vector of seeds before the loop and then set the seed in each iteration. The leading zeros can be produced using sprintf:
## Number of random data frames to create:
n <- 10
## Sample vector of seeds:
initSeed <- 1234
set.seed(initSeed)
seedVec <- sample.int(n = 1e8, size = n, replace = FALSE)
## loop:
lst <- lapply(1:n, function(i){
set.seed(seedVec[i])
a <- rnorm(10,.9,.05)
b <- sample(8:200,10,replace=TRUE)
c <- rnorm(10,80,30)
data.frame(a,b,c)
})
## Set names with leading zeroes (2 digits). If you want
## three digits, change "%02d" to "%03d" etc.
names(lst) <- paste0('df_', sprintf("%02d", 1:n))
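With 50,000 data frames a fixed two-digit width no longer sorts correctly, so the padding width can be derived from n instead of being hard-coded. A small sketch using formatC; the variable names width and df_names are my own:

```r
n <- 50000L
# Number of digits needed for the largest index
width <- nchar(as.character(n))
# flag = "0" pads with leading zeros up to 'width' characters
df_names <- paste0("df_", formatC(seq_len(n), width = width, flag = "0"))
head(df_names, 3)
```

The resulting names all have the same length, so sorting them as text gives the same order as sorting by index.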
Given a list of 16 elements, where each element is a named numeric vector, I want to plot the length of the intersection of names between every 2 elements. That is; the intersection of element 1 with element 2, that of element 3 with element 4, etc.
Although I can do this in a very tedious, low-throughput manner, I'll have to repeat this sort of analysis, so I'd like a more programmatic way of doing it.
As an example, the first 5 entries of the first 2 list elements are:
topGenes[[1]][1:5]
3398 284353 219293 7450 54658
2.856363 2.654106 2.653845 2.635599 2.626518
topGenes[[2]][1:5]
1300 64581 2566 5026 146433
2.932803 2.807381 2.790484 2.739735 2.705030
Here, the first row of numbers are gene IDs & I want to know how many each pair of vectors (a treatment replicate) have in common, among, say, the top 100.
I've tried using lapply() in the following manner:
vectorOfIntersectLengths <- lapply(topGenes, function(x) lapply(topGenes, function(y) length(intersect(names(x)[1:100],names(y)[1:100]))))
This only seems to operate on the first two elements; topGenes[[1]] & topGenes[[2]].
I've also been trying to do this with a for() loop, but I'm unsure how to write this. Something along the lines of this:
lens <- c()
for(i in 1:length(topGenes)){
lens[i] <- length(intersect(names(topGenes[[i]][1:200]),
names(topGenes[[i+1]][1:200])))
}
This returns a 'subscript out of bounds' error, which I don't really understand.
Thanks a lot for any help!
Is this what you're looking for?
# make some fake data
set.seed(123)
some_list <- lapply(1:16, function(x) {
y <- rexp(100)
names(y) <- sample.int(1000,100)
y
})
# identify all possible pairs
pairs <- t( combn(length(some_list), 2) )
# note: you could also use: pairs <- expand.grid(1:length(some_list),1:length(some_list))
# but in addition to a-to-b, you'd get b-to-a, a-to-a, and b-to-b
# get the intersection of names of a pair of elements with given indices kept for bookkeeping
get_intersection <- function(a,b) {
list(a = a, b = b,
intersection = intersect( names(some_list[[a]]), names(some_list[[b]]) )
)
}
# get intersection for each pair
intersections <- mapply(get_intersection, a = pairs[,1], b = pairs[,2], SIMPLIFY=FALSE)
# print the intersections
for(indx in 1:length(intersections)){
writeLines(paste('Intersection of', intersections[[indx]]$a, 'and',
intersections[[indx]]$b, 'contains:',
paste( sort(intersections[[indx]]$intersection), collapse=', ') ) )
}
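The question actually asks only for consecutive pairs (1 with 2, 3 with 4, and so on), restricted to the top entries. The 'subscript out of bounds' error in the for loop comes from topGenes[[i+1]] on the last iteration, where i+1 is one past the end of the list. Here is a rough sketch of the pairwise version on made-up data; the fake topGenes list and the names top_n, odd and overlap are assumptions:

```r
set.seed(123)
# Fake stand-in for topGenes: 16 named numeric vectors, sorted by score
topGenes <- lapply(1:16, function(i) {
  y <- sort(rexp(300), decreasing = TRUE)
  names(y) <- sample.int(1000, 300)
  y
})

top_n <- 100
# First index of each pair: 1, 3, 5, ..., 15
odd <- seq(1, length(topGenes) - 1, by = 2)
overlap <- vapply(odd, function(i) {
  length(intersect(head(names(topGenes[[i]]), top_n),
                   head(names(topGenes[[i + 1]]), top_n)))
}, integer(1))
names(overlap) <- paste(odd, odd + 1, sep = "-")
overlap
```

Since overlap is a named integer vector, barplot(overlap) should be enough for the plot.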
I have a set of vectors of length n, say, for example that n=3:
vec1<-c(1,2,3)
vec2<-c(2,2,2)
And a multidimensional array of size n^n:
threeDarray<-array(0,dim=c(3,3,3))
I want to create a loop that goes through my set of vectors and adds 1 to the corresponding index in the array. After analysing the two vectors above the array should be like:
threeDarray[1,2,3]=1
threeDarray[2,2,2]=1
I'm trying to use the multidimensional array to store the number of occurrences of each vector (my vectors are patterns in a time series).
The community is right (and the noob is wrong). Multidimensional arrays are not the way to go about this.
An example of code working with lists:
freqPatterns<-function(timeSeries,dimension){
temp<-character()
for (i in 1:(length(timeSeries)-dimension+1)){
pattern<-paste(as.character(rank(timeSeries[i:(i+dimension-1)])-1),collapse=", ")
#print(pattern)
temp[[length(temp)+1]] <- pattern
}
freqTable=sort(table(temp),decreasing=T)
return(freqTable)
}
Thank you guys!
Like you found out yourself, I wouldn't use a multidimensional array either.
Here is a solution using a dataframe:
n <- 3 # dimension; must equal the length of each vector
ll <- rep(list(1:n), n) # build list of vectors (n * 1:n)
df_occurs <- expand.grid(ll, KEEP.OUT.ATTRS = FALSE) # get all n^n combinations
df_occurs$occurences <- 0
# for-loop for storing the occurences
for (v in list(vec1, vec2)) {
v_match <- apply(df_occurs[, 1:n], 1, function(x) all(x == v))
# add 1 so that repeated vectors are counted, not just flagged
df_occurs$occurences[v_match] <- df_occurs$occurences[v_match] + 1
}
Performance may be an issue with large n. If it's possible to build a character key out of your vector, e.g.
paste(vec1, collapse="")
the lookup in the dataframe would be easier:
df_occurs = data.frame(
key = apply(expand.grid(ll, KEEP.OUT.ATTRS=F), 1, paste, collapse=""),
occurences = 0
)
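for (v in list(vec1, vec2)) {
hit <- df_occurs$key == paste(v, collapse = "")
# add 1 so that repeated vectors are counted, not just flagged
df_occurs$occurences[hit] <- df_occurs$occurences[hit] + 1
}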
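If you only need the counts of patterns that actually occur, you can skip building the full grid and let table do the counting on the keys. And if you do want the array from the question, R lets you index an array with a one-row matrix. Both are sketched below on made-up input; the names vecs, keys, counts, idx and arr are mine:

```r
# Three example vectors; the first pattern occurs twice
vecs <- list(c(1, 2, 3), c(2, 2, 2), c(1, 2, 3))

# Option 1: one character key per vector, counted with table()
keys <- vapply(vecs, paste, character(1), collapse = ",")
counts <- table(keys)
counts

# Option 2: the array from the question, indexed by a one-row matrix
idx <- do.call(rbind, vecs)          # one row per vector
arr <- array(0, dim = c(3, 3, 3))
for (r in seq_len(nrow(idx))) {
  cell <- idx[r, , drop = FALSE]     # 1 x 3 matrix selects a single cell
  arr[cell] <- arr[cell] + 1
}
arr[1, 2, 3]
```

The table version only stores patterns that were seen, which matters because the array grows as n^n.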
I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If "frac.missing" and "maf" are vectors computed from "fd" (each with one value per row of fd), then this provides the split you are looking for. Convert the matrix to a data frame first, so that split divides it by rows rather than by individual elements:
spl.fd <- split(as.data.frame(fd), list(maf, frac.missing))
If you want to group the rows of fd by maf and frac.missing within the bands specified by your for-loop, then the same split construct may do what your current code is failing to accomplish (note that maf is the vector returned by MAF(fd), not a function):
lst <- split(as.data.frame(fd),
list(cut(maf, breaks = seq(0, 0.2, by = 0.005),
right = FALSE, include.lowest = TRUE),
cut(frac.missing, breaks = seq(0.2, 0.7, by = 0.2),
right = FALSE, include.lowest = TRUE)))
Setting right = FALSE makes each band closed on the left and open on the right, [a, b), which matches the "<" comparison in your loop; the default, right = TRUE, gives (a, b] bands instead. Rows whose values fall outside the breaks get NA from cut and are dropped by split. Also note that these bands are disjoint, whereas the thresholds in your loop produce nested subsets. The other function that provides a similar facility is by.
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
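Since the original goal was to store the sub-data sets in a list, the same subsetting can be written without assign by looping over a grid of thresholds. This is just a sketch on simulated data; the toy dff, the marker columns m1 and m2, and the coarser maf step are assumptions, not the real wheat data:

```r
set.seed(1)
# Toy stand-in for dff: fraction missing, maf, then two marker columns
dff <- data.frame(fm  = runif(50),
                  maf = runif(50, 0, 0.5),
                  m1  = rbinom(50, 1, 0.5),
                  m2  = rbinom(50, 1, 0.5))

# Every combination of the two thresholds
grid <- expand.grid(fm = seq(0.1, 0.6, by = 0.1),
                    maf = seq(0, 0.2, by = 0.05))

# One subset per threshold pair, with the fm and maf columns dropped
subsets <- Map(function(i, j) dff[dff$maf > j & dff$fm < i, -(1:2), drop = FALSE],
               grid$fm, grid$maf)
names(subsets) <- paste0("fm_", grid$fm, "maf_", grid$maf)
length(subsets)
```

Keeping everything in one named list makes it easy to run lapply over all subsets later, instead of looking up dozens of separate objects with get.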