I have a set of sequences of two variables, grouped by a third variable (device). I want to break the sequence for each device into sets of 300. dsl is a data frame where d is the device id and s is the number of sequences of length 300 for that device.
First, I label the sequences in a column Sid: rep(1,300) followed by rep(2,300) and so on up to rep(s,300). Whatever remains unlabelled, i.e. still has the initialised label 0, needs to be ignored. The actual labelling is done with the seqid vector, though.
I had to do this because I want to stack each set of 300 data points and then transpose it; that forms one row of my predata data frame. For each predata data frame I run k-means to generate 5 clusters, which I store in finaldata.
Essentially, for every device I will have 5 clusters that I can then pull by referencing the row number in finaldata (mapped to the device id).
# subset processed data by device
for (ds in 1:387){
  d <- dsl[ds, 1]
  s <- dsl[ds, 3]
  temp.data <- subset(data, data$Device == d)
  # mark the first s*300 rows; anything still labelled 0 is dropped below
  temp.data$Sid <- 0
  temp.data[1:(s*300), 4] <- rep(1:300, s)
  temp.data <- subset(temp.data, temp.data$Sid != "0")
  # real sequence ids: 1 repeated 300 times, then 2, ..., up to s
  seqid <- NA
  for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }
  temp.data$Sid <- seqid
  # one row per sequence: 300 values of variable 1 followed by 300 values of variable 2
  predata <- as.data.frame(matrix(numeric(0), s, 600))
  for (k in 1:s){
    temp.data2 <- subset(temp.data[, c(1, 2)], temp.data$Sid == k)
    predata[k, ] <- t(stack(temp.data2)[, 1])
  }
  ob <- kmeans(predata, 5, iter.max = 10, algorithm = "Hartigan-Wong")
  finaldata <- rbind(finaldata, unique(fitted(ob, method = "centers")))
}
Being new to R, I ended up with three nested loops (the code did work when the outermost loop was restricted to a single value). It has now been running for 5 hours and is still going. I need a faster way to go about this.
Any help will be appreciated.
Thanks
OK, I am going to suggest a radical simplification of the code inside your loop. However, without sample data it is hard to verify that I have assumed the right thing, so please check that my predata in fact equals yours.
First the code:
for (ds in 1:387){
  d <- dsl[ds, 1]
  s <- dsl[ds, 3]
  temp.data <- subset(data, data$Device == d)
  temp.data <- temp.data[1:(s*300), ]
  predata <- cbind(matrix(temp.data[, 1], byrow = TRUE, ncol = 300),
                   matrix(temp.data[, 2], byrow = TRUE, ncol = 300))
  ob <- kmeans(predata, 5, iter.max = 10, algorithm = "Hartigan-Wong")
  finaldata <- rbind(finaldata, unique(fitted(ob, method = "centers")))
}
What I understand you to be doing: take the first 300*s rows from subset(data, data$Device == d). This can easily be done with
temp.data <- temp.data[1:(s*300),]
Afterwards, you build a matrix whose first row is c(temp.data[1:300, 1], temp.data[1:300, 2]), and so on for all further rows. I do this with the two matrix() calls above.
I assume that your outer loop could also be turned into a call to tapply or something similar, but for that we would need more context.
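For example, a sketch along those lines (untested without your sample data) would wrap the loop body in a function, lapply over the rows of dsl, and bind the cluster centres once at the end instead of growing finaldata inside the loop:
cluster_device <- function(ds) {
  d <- dsl[ds, 1]
  s <- dsl[ds, 3]
  temp.data <- subset(data, data$Device == d)[1:(s * 300), ]
  predata <- cbind(matrix(temp.data[, 1], byrow = TRUE, ncol = 300),
                   matrix(temp.data[, 2], byrow = TRUE, ncol = 300))
  ob <- kmeans(predata, 5, iter.max = 10, algorithm = "Hartigan-Wong")
  unique(fitted(ob, method = "centers"))
}
finaldata <- do.call(rbind, lapply(1:387, cluster_device))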
I have responses to 7 survey questions, each in their own column, and I am trying to find the response among the first 6 that is numerically closest to the 7th. Some won't be exactly the same, so I want to create a new variable holding the difference between the closest number in the first 6 and the 7th. The example below would produce 0.
s <- c(1,2,3,4,5,6,3)
s <- t(s)
s <- as.data.frame(s)
s
Any help is deeply appreciated. I apologize for not including attempted code; nothing I have tried has gotten close.
How about this?
which.min( abs(s[1, 1:6] - s[1, 7]))
I'm assuming you want it generalized somehow, but you'd need to provide more info for that. Or just run it through a loop :-)
EDIT: added the loop from the comment and changed exactly 2 tiny things.
s <- c(1,2,3,4,5,6,3)
t <- c(1,2,3,4,5,6,7)
p <- c(1,2,3,4,5,6,2)
s <- data.frame(s,t,p)
k <- t(s)
k <- as.data.frame(k)
k$t <- NA   ### need to initialize the column
for(i in 1:3){
  ## need to refer to each line of k when populating the t column
  k[i,]$t <- which.min(abs(k[i, 1:6] - k[i, 7]))
}
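If you want the signed difference itself rather than the column index, a sketch along the same lines (assuming the six responses sit in columns 1 to 6 and the target in column 7 of k) would be:
k$diff <- apply(k[, 1:7], 1, function(row) {
  d <- row[1:6] - row[7]
  d[which.min(abs(d))]   # difference of the response closest to the 7th
})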
I want to extract from the online database OBIS all the species occurrence records for a group of polygons. The number of polygons is too large to query them all at once, so I thought I would use a loop. The problem I'm facing is that not all polygons have records, so the result is an empty data frame and the loop stops. I tried to use an "if" statement for control flow, but it is not working. Can I get what I need with a loop? Here is a shorter version of the shapefile I'm using.
library(robis)
library(maptools)
library(mregions)
library(plyr)
polygons <- readShapeSpatial("~/smaller.shp")
occurrence_list <- list()
for (i in 1:length(polygons)){
  wkt_polygons <- mr_as_wkt(polygons[i,])
  occur <- occurrence(geometry = wkt_polygons)
  if (is.null(occur)) next
  occur$i <- i
  occurrence_list[[i]] <- occur
}
data <- dplyr::bind_rows(occurrence_list)
I'm not sure what the required result should look like, but this might work:
occurrence_list <- list()
for (i in 1:length(polygons)){
  wkt_polygons <- mr_as_wkt(polygons[i,])
  occur <- occurrence(geometry = wkt_polygons)
  if (nrow(occur) > 0) {
    occur$i <- i
    occurrence_list[[length(occurrence_list) + 1]] <- occur
  }
}
data <- dplyr::bind_rows(occurrence_list)
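If occurrence() can also return NULL or throw an error for some polygons (I have not checked how the current robis version behaves), a slightly more defensive variant of the same idea would be:
occurrence_list <- list()
for (i in 1:length(polygons)){
  wkt_polygons <- mr_as_wkt(polygons[i,])
  # skip the polygon if the query errors out; treat NULL like an empty result
  occur <- tryCatch(occurrence(geometry = wkt_polygons), error = function(e) NULL)
  if (!is.null(occur) && nrow(occur) > 0) {
    occur$i <- i
    occurrence_list[[length(occurrence_list) + 1]] <- occur
  }
}
data <- dplyr::bind_rows(occurrence_list)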
I am a new R programmer. I have written code that needs to apply a function to a data frame split by a factor. The data frame contains 1,324,961 observations, and the factor variable that we use to slice it has 64,376 levels.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
  # Convert the occurrence counts needed for the Fuzzy-Blau index to numeric
  bod_sample <- as.data.frame(bod_sample)
  bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
  bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
  bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
  bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
  bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
  # Calculating the probabilities of a director belonging to a caste
  bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
  bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
  bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
  bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
  # Dropping the total occurrences column, as we do not need it anymore
  bod_sample$tot_occur <- NULL
  # Here we replace all the blanks with NA
  bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
  bod_sample <- as.data.frame(bod_sample)
  # Here we push all the NAs in the caste names and caste probabilities to the end of the row,
  # so if there are only two castes against a name, they become caste1 and caste2
  caste_list <- data.frame(bod_sample$caste1, bod_sample$caste2, bod_sample$caste3, bod_sample$caste4)
  caste_list <- as.data.frame(t(apply(caste_list, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
  caste_list_prob <- data.frame(bod_sample$caste1_occ, bod_sample$caste2_occ, bod_sample$caste3_occ, bod_sample$caste4_occ)
  caste_list_prob <- as.data.frame(t(apply(caste_list_prob, 1, function(x) c(x[!is.na(x)], x[is.na(x)]))))
  # Two helper functions:
  # gen_castelist: takes the row number (serial number of the director) and returns
  #                the names of all the castes for which he has a non-zero probability.
  # gen_caste_prob: takes the row number (serial number of the director) and returns
  #                 the probabilities with which he belongs to those castes.
  gen_castelist <- function(x){
    y <- caste_list[x,]
    y <- as.vector(y[!is.na(y)])
    return(y)
  }
  gen_caste_prob <- function(x){
    z <- caste_list_prob[x,]
    z <- z[!is.na(z)]
    z <- as.numeric(z)
    return(z)
  }
  caste_ls <- list()
  caste_prob_ls <- list()
  for(i in 1:nrow(bod_sample)){
    caste_ls[[i]] <- gen_castelist(i)
    caste_prob_ls[[i]] <- gen_caste_prob(i)
  }
  gridcaste <- expand.grid(caste_ls)
  gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors = FALSE)
  gridcasteprob <- expand.grid(caste_prob_ls)
  # Generating the joint probability
  gridcasteprob$JP <- apply(gridcasteprob, 1, prod)
  # Generating the similarity index
  gen_sim_index <- function(x){
    x <- t(x)
    a <- as.data.frame(table(x))
    sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
    return(sim_index)
  }
  gridcaste$sim_index <- apply(gridcaste, 1, gen_sim_index)
  # Generating fuzzyblau
  gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
  fuzzy_blau_index <- sum(gridcaste$fb)
  remove_list <- c("gridcaste","")
  return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a data frame bod_fb. I need to apply the gen_fuzzy_blau function to this data frame, slicing it by the levels of bod_fb$code_year.
Since the function is very heavy, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric fuzzy_blau_index for each code_year of the data frame, and I use by to apply the function to each slice. I would like to know whether there is a way to run this in parallel, so that multiple instances of the function run at once on different slices of the data frame. I did not find a by equivalent in the parallel package, and I did not know how to pass the data frames as iterators when using the foreach and doParallel packages.
I have an AMD A8 laptop with 4 GB RAM and Windows 7 SP1 Home Basic. I have set the page file to 20 GB (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still being spent in gen_sim_index; I am using proc.time() to gauge how long each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e. the sum over the unique elements of (number of occurrences of that element in the row / total number of elements in the row)^2.
I am unable to use the apply function directly on the row because each element in the row has different factor levels, so table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running regular-expression matching with gsub() over a ridiculous amount of data.
Take your code out of your gen_fuzzy_blau function; wrapping it all up in one function rather than running it independently provides no value. Then run it one line at a time, and if a step takes too long, reconsider your method. Your code is incredibly inefficient.
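As for the similarity index in your edit: converting the row to character first sidesteps the factor-level problem, and a compact version of the definition you describe (a sketch, not a drop-in replacement for your whole pipeline) is:
gen_sim_index <- function(x){
  p <- table(as.character(x)) / length(x)   # proportion of each unique element
  sum(p^2)
}
gen_sim_index(c("a", "a", "b", "c"))   # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375
And if, once the function is leaner, you still want to run the code_year slices in parallel, a sketch with the parallel package (assuming gen_fuzzy_blau as defined in your question) might look like:
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, "gen_fuzzy_blau")           # ship the function to the workers
slices <- split(bod_fb, bod_fb$code_year)     # one data frame per code_year
fuzzy_blau_output <- parLapply(cl, slices, gen_fuzzy_blau)
stopCluster(cl)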
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10,20,30,40,50,60,70,80)
b <- c("a","b","c","d","z","g","h","r")
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
values <- vector(length=columns)                        # columns is set below (columns <- 6)
solution <- matrix(nrow=nrow(myframe), ncol=columns+3)
myfunction <- function(frame, columns){
  athing <- 0
  if(columns == 5){
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns+1] <- athing
  return(values)
}
The problematic for-loop looks like this:
columns <- 6
for(i in 1:nrow(myframe)){
  values <- myfunction(as.matrix(myframe[i,]), columns)
  values[columns+2] <- i
  values[columns+3] <- myframe[i,3]
  # more columns added with simple operations (i.e. sum)
  solution <- rbind(solution, values)
  # solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind call. I frequently get error messages about the size of solution, which seems to grow too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really gotten further than this, although I did try to apply this very good introduction to parallel processing.
How do I reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to come straight from the 2nd circle of hell of the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that in each iteration you enlarge your solution object and then copy it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list whose ith component will store the output of the ith loop iteration. The final step is to call rbind on that list, just once at the end. This will look something like
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
  # Call all necessary commands to create values
  my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
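Applied to your loop, that could look roughly like this (a sketch that reuses myfunction and columns exactly as you defined them; the large pre-allocated solution matrix is no longer needed):
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
  values <- myfunction(as.matrix(myframe[i,]), columns)
  values[columns+2] <- i
  values[columns+3] <- myframe[i,3]
  # ...more columns added with simple operations...
  my.list[[i]] <- values
}
solution <- do.call(rbind, my.list)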
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
  athing <- 0
  if(columns == 5){
    athing <- 100
  } else {
    athing <- 1000
  }
  values[columns+1] <- athing
  return(values)
}
apply(myframe, 1, myfunction)
If columns is not available from the enclosing environment, you can use apply(myframe, 1, myfunction, columns) with your original myfunction definition.
So, I built a function called sort.song.
My goal with this function is to randomly sample the rows of a data frame (DATA) and then filter it (DATA.NEW) to analyse it. I want to do this multiple times (let's say 10 times). In the end, I want each object (mantel.something) produced by this function to be saved in my workspace with a name that relates it to each cycle (mantel.something1, mantel.something2, ..., mantel.something10).
I have the following code, so far:
sort.song <- function(DATA){
  require(ade4)
  for(i in 1:10){ # Am I using for correctly here?
    DATA.NEW <- DATA[sample(1:nrow(DATA), replace=FALSE),]
    DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
    coord.dist <- dist(DATA.NEW[,4:5], method="euclidean")
    num.notes.dist <- dist(DATA.NEW$Num_Notes, method="euclidean")
    songdur.dist <- dist(DATA.NEW$Song_Dur, method="euclidean")
    hfreq.dist <- dist(DATA.NEW$High_Freq, method="euclidean")
    lfreq.dist <- dist(DATA.NEW$Low_Freq, method="euclidean")
    bwidth.dist <- dist(DATA.NEW$Bwidth_Song, method="euclidean")
    hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote, method="euclidean")
    mantel.numnotes[i] <<- mantel.rtest(coord.dist, num.notes.dist, nrepet=1000)
    mantel.songdur[i] <<- mantel.rtest(coord.dist, songdur.dist, nrepet=1000)
    mantel.hfreq[i] <<- mantel.rtest(coord.dist, hfreq.dist, nrepet=1000)
    mantel.lfreq[i] <<- mantel.rtest(coord.dist, lfreq.dist, nrepet=1000)
    mantel.bwidth[i] <<- mantel.rtest(coord.dist, bwidth.dist, nrepet=1000)
    mantel.hfreqlnote[i] <<- mantel.rtest(coord.dist, hfreqlnote.dist, nrepet=1000)
  }
}
Could someone please help me to do it the right way?
I think I'm not assigning the cycles correctly to each mantel.something object.
Many thanks in advance!
The best way to implement what you are trying to do is with a list. You can even make it take two indices: the first for the iteration, the second for the type of analysis.
mantellist <- as.list(1:10) ## initiate list with some values
for (i in 1:10){
  ...
  mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist, num.notes.dist, nrepet=1000),
                          songdur=mantel.rtest(coord.dist, songdur.dist, nrepet=1000),
                          hfreq=mantel.rtest(coord.dist, hfreq.dist, nrepet=1000),
                          ...)
}
return(mantellist)
In this way you can index your specific analysis for each iteration in an intuitive way:
mantellist[[2]][['hfreq']]
mantellist[[2]]$hfreq ## alternative
EDIT by Mohr:
Just for clarification...
So, according to your suggestion the code should be something like this:
sort.song <- function(DATA){
  require(ade4)
  mantellist <- as.list(1:10)
  for(i in 1:10){
    DATA.NEW <- DATA[sample(1:nrow(DATA), replace=FALSE),]
    DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
    coord.dist <- dist(DATA.NEW[,4:5], method="euclidean")
    num.notes.dist <- dist(DATA.NEW$Num_Notes, method="euclidean")
    songdur.dist <- dist(DATA.NEW$Song_Dur, method="euclidean")
    hfreq.dist <- dist(DATA.NEW$High_Freq, method="euclidean")
    lfreq.dist <- dist(DATA.NEW$Low_Freq, method="euclidean")
    bwidth.dist <- dist(DATA.NEW$Bwidth_Song, method="euclidean")
    hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote, method="euclidean")
    mantellist[[i]] <- list(numnotes=mantel.rtest(coord.dist, num.notes.dist, nrepet=1000),
                            songdur=mantel.rtest(coord.dist, songdur.dist, nrepet=1000),
                            hfreq=mantel.rtest(coord.dist, hfreq.dist, nrepet=1000),
                            lfreq=mantel.rtest(coord.dist, lfreq.dist, nrepet=1000),
                            bwidth=mantel.rtest(coord.dist, bwidth.dist, nrepet=1000),
                            hfreqlnote=mantel.rtest(coord.dist, hfreqlnote.dist, nrepet=1000))
  }
  return(mantellist)
}
You can achieve your objective of repeating this exercise 10 (or more) times without using an explicit for-loop. Rather than having the function run the loop, write the sort.song function to run one iteration of the process; you can then use replicate to repeat that process however many times you desire.
It is generally good practice not to create a bunch of named objects in your global environment. Instead, you can hold the results of every iteration of this process in a single object. Called with simplify = FALSE, replicate returns a list (in the example below, a list of lists), so the list will have 10 elements (one for each iteration) and each element will itself be a list containing named elements corresponding to each result of mantel.rtest.
sort.song <- function(DATA){
  DATA.NEW <- DATA[sample(1:nrow(DATA), replace=FALSE),]
  DATA.NEW <- DATA.NEW[!duplicated(DATA.NEW$Point),]
  coord.dist <- dist(DATA.NEW[,4:5], method="euclidean")
  num.notes.dist <- dist(DATA.NEW$Num_Notes, method="euclidean")
  songdur.dist <- dist(DATA.NEW$Song_Dur, method="euclidean")
  hfreq.dist <- dist(DATA.NEW$High_Freq, method="euclidean")
  lfreq.dist <- dist(DATA.NEW$Low_Freq, method="euclidean")
  bwidth.dist <- dist(DATA.NEW$Bwidth_Song, method="euclidean")
  hfreqlnote.dist <- dist(DATA.NEW$HighFreq_LastNote, method="euclidean")
  return(list(
    numnotes = mantel.rtest(coord.dist, num.notes.dist, nrepet=1000),
    songdur = mantel.rtest(coord.dist, songdur.dist, nrepet=1000),
    hfreq = mantel.rtest(coord.dist, hfreq.dist, nrepet=1000),
    lfreq = mantel.rtest(coord.dist, lfreq.dist, nrepet=1000),
    bwidth = mantel.rtest(coord.dist, bwidth.dist, nrepet=1000),
    hfreqlnote = mantel.rtest(coord.dist, hfreqlnote.dist, nrepet=1000)
  ))
}
require(ade4)
results <- replicate(10, sort.song(DATA), simplify = FALSE)
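Each iteration can then be indexed by position and each test by name, much like the list approach in the other answer:
results[[2]]$hfreq    # hfreq result from the 2nd iteration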