I'm trying to count, for every variable in an SPSS file, how often each of its declared missing values occurs. I imported the file using the memisc package. Here is my actual code:
library(memisc)
#Takes about 70seconds
escc <- spss.system.file(file.choose(), to.lower=FALSE)
system.time({
esccMiss <- matrix(,length(escc),9)
esccMiss[,1] <- names(escc)
for (i in 1:length(escc)) {
x <- escc[i]
if(length(miss <- missing.values(x)) > 0) {
ifelse(length(miss@range)>0 , vals <- miss@range[1]:(miss@range[1]+3), vals <- miss@filter)
for (j in 1:length(vals)) {
esccMiss[i, 2*j] <- vals[j]
esccMiss[i,2*j+1] <- length(x[x == vals[j]])
}
}
}
})
I'm fairly new to R (which explains the C-like structure of my code) and I realise this is really slow, but I have trouble finding a way to do the same thing with lapply on objects from the memisc package.
Forget my other answer, this is much faster:
escc2 <- as.data.set(escc)
system.time(lis <- lapply(escc2,function(x) table(x[which(is.missing(x))])))
Should only take a few seconds now.
Explanation: The original dataset (escc) is of a class that simply does not work in the *apply family since there isn't a method written for it. However, memisc also includes as.data.set, which does work in *apply.
is.missing returns a logical vector flagging which values are marked as missing.
which finds the indices of those missings and x[] subsets x so you only have those missings.
table puts the values into a table.
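The same three steps can be tried without an SPSS file. The sketch below is a plain base-R analogue, not memisc code: the vector x, the codes in missing_codes, and the helper is_missing() are made up for illustration and only stand in for a memisc item and is.missing().

```r
# Toy stand-in for one survey variable; 97, 98 and 99 play the role of
# declared missing values (hypothetical codes).
x <- c(1, 2, 99, 3, 98, 99, 2, 97, 99)
missing_codes <- c(97, 98, 99)

# Hypothetical helper mimicking is.missing(): one logical flag per value.
is_missing <- function(v) v %in% missing_codes

flags  <- is_missing(x)   # TRUE where the value is a missing code
idx    <- which(flags)    # positions of those values
counts <- table(x[idx])   # frequency of each missing code
print(counts)
```

Wrapping the last three lines in a function and passing it to lapply gives one such table per variable, which is the shape of the one-liner above.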
I'm using the boot() function from the boot package to bootstrap means from a population. The used function is:
boot_mean <- function(data, i){
ds_m <- data[i]
return(mean(ds_m))
}
This works like a charm, but now I want to adapt the boot_mean function so that I can also get the samples that lead to each mean. I tried:
library('boot')
boot_mean <- function(data, i){
ds_m <- data[i]
ds_m_mean <- mean(ds_m)
rlist <- list("means" = ds_m_mean, "data" = ds_m)
return(rlist)
}
dummy_data <- rnorm(500)
dummy_boot <- boot(dummy_data, boot_mean, R = 1000)
Which results in an error:
Error in t.star[r, ] <- res[[r]] : incorrect number of subscripts
on matrix
What's wrong here? How can I get the corresponding dataset to the bootstrapped mean?
From the documentation ?boot, describing the statistic argument.
A function which when applied to data returns a vector containing the statistic(s) of interest. ...
The boot() function only wants to deal with functions that output a single vector. Modifying your code to return a list of two elements means it won't work anymore. There's actually a little interesting oddity in R and the boot() function which means the code almost works if you set R=1 in the boot() call, but it's still wrong.
Fortunately for your purpose, the authors have already provided the useful boot.array() function. It outputs a matrix with R rows and one column per observation, giving either how many times the jth observation was sampled in the ith bootstrap replicate or, with indices = TRUE, the indices of the sampled observations. The bootstrapped datasets can then be recovered by selecting those observations from the data. This can take a little while.
dats <- lapply(1:nrow(boot.array(dummy_boot)),
FUN = function(x) dummy_data[boot.array(dummy_boot, indices = TRUE)[x, ]])
If you have multiple columns of data you should add , , drop = FALSE
dats <- lapply(1:nrow(boot.array(dummy_boot)),
FUN = function(x) dummy_data[boot.array(dummy_boot, indices = TRUE)[x, ], , drop = FALSE])
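Part of the waiting time comes from calling boot.array() again for every replicate. A minimal sketch of the same idea with the index matrix computed once is shown below; the smaller data size, R = 200, and the names idx and dats are choices made here for illustration only.

```r
library(boot)

set.seed(1)
dummy_data <- rnorm(50)
boot_mean <- function(data, i) mean(data[i])
dummy_boot <- boot(dummy_data, boot_mean, R = 200)

# Regenerate the resampling indices once: one row per replicate.
idx <- boot.array(dummy_boot, indices = TRUE)

# One resampled dataset per bootstrap replicate.
dats <- lapply(seq_len(nrow(idx)), function(r) dummy_data[idx[r, ]])

# Sanity check: the means of the recovered samples should match
# the statistics that boot() stored in dummy_boot$t.
print(all.equal(vapply(dats, mean, numeric(1)), as.vector(dummy_boot$t)))
```

For a data frame or matrix, the subsetting inside lapply would become dummy_data[idx[r, ], , drop = FALSE].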
I am trying to create a vector or list of values based on the output of a function performed on individual elements of a column.
library(hpoPlot)
xyz_hpo <- c("HP:0003698", "HP:0007082", "HP:0006956")
getallancs <- function(hpo_col) {
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- list()
output[[length(anc) + 1]] <- append(output, anc)
}
return(anc)
}
all_ancs <- getallancs(xyz_hpo)
get.ancestors outputs a character vector of variable length depending on each term. How can I loop through hpo_col adding the length of each ancs vector to the output vector?
Welcome to Stack Overflow :) Great job on providing a minimal reproducible example!
As mentioned in the comments, you need to move the output <- list() outside of your for loop, and return it after the loop. At present it is being reset for each iteration of the loop, which is not what you want. I also think you want to return a vector rather than a list, so I have changed the type of output.
Also, in your original question, you say that you want to return the length of each anc vector in the loop, so I have changed the function to output the length of each iteration, rather than the whole vector.
getallancs <- function(hpo_col) {
output <- numeric()
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- append(output, length(anc))
}
return(output)
}
If you are only doing this for a few cases, such as your example, this approach will be fine. However, growing a vector inside a loop is typically quite slow in R, and it's better to vectorise this style of calculation if possible. This is especially important if you are running it for a large number of elements, where computation will take more than a few seconds.
For example, one way the function above could be vectorised is like so:
all_ancs <- sapply(xyz_hpo, function(x) length(get.ancestors(hpo.terms, x)))
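A slightly stricter variant of the same idea uses vapply(), which stops with an error if any result is not a single integer. The sketch below runs without hpoPlot: toy_terms and toy_ancestors() are hypothetical stand-ins for hpo.terms and get.ancestors, with made-up term IDs.

```r
# Hypothetical lookup table standing in for hpo.terms:
# each term maps to itself plus its ancestors.
toy_terms <- list(
  "T:1" = c("T:1", "ROOT"),
  "T:2" = c("T:2", "T:1", "ROOT"),
  "T:3" = c("T:3", "ROOT")
)

# Hypothetical stand-in for get.ancestors(): same argument order, toy data.
toy_ancestors <- function(terms, id) terms[[id]]

ids <- c("T:1", "T:2", "T:3")

# vapply() declares the type and length of each result up front.
anc_lengths <- vapply(ids, function(x) length(toy_ancestors(toy_terms, x)), integer(1))
print(anc_lengths)
```

With the real package, swapping toy_ancestors(toy_terms, x) for get.ancestors(hpo.terms, x) should give the same named vector of lengths as the sapply call.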
If in fact you did mean to output the whole vector of anc, not just the lengths, the original function would look like this:
getallancs <- function(hpo_col) {
output <- character()
for (i in 1:length(hpo_col)) {
anc <- get.ancestors(hpo.terms, hpo_col[i])
output <- c(output, anc)
}
return(output)
}
Or a vectorised version could be
all_ancs <- unlist(lapply(xyz_hpo, function(x) get.ancestors(hpo.terms, x)))
Hope that helps. If it solves your problem, please mark this as the answer.
I am writing a function which takes a directory of data, and reads them in, and (if it reaches the threshold of complete cases), calculates the correlation between two variables in the data ("sulfate" and "nitrate"). I want this to run in a for loop to create a numeric vector of the correlation values (one value for each file in the directory).
However, when I run the code, it only returns the last value.
I am quite new to R (so may be making simple mistakes, and have the newest version of R installed). Below is the code:
corr <- function(directory, threshold = 0) {
filenames3 <- list.files(directory, pattern = ".csv", full.names = TRUE)
loop_length <- length(filenames3)
correlation_values <- numeric()
for(i in loop_length) {
read_in_data3 <- read.csv(filenames3[i])
complete_boolean <- complete.cases(read_in_data3)
nobs2 <- sum(complete_boolean)
data_rmNA <- read_in_data3[complete_boolean, ]
if(nobs2 > threshold) {
correlation_values <- c(correlation_values,
cor(data_rmNA[["sulfate"]],
data_rmNA[["nitrate"]]))
}
}
correlation_values
}
corr("C:/Users/Danie/OneDrive/Documents/R/specdata")
I have tried specifying the length of the vector e.g. correlation_values <- numeric(length = loop_length). This returns a vector of the right length, but all the values are 0 excluding the last which runs properly. I have looked at similar questions, but still can't find a solution to my problem.
I assume I'm losing information in the loop somewhere (rewriting over a variable or something).
Thanks in advance for any help.
I think you need to say for(i in 1:loop_length) instead of for(i in loop_length).
R loops over each element of the vector you give it. Right now that vector is the single number loop_length, so the body runs exactly once, with i equal to the index of the last file, which is why only the last value is returned.
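The difference is easy to see on a toy value. The sketch below uses made-up names (n, seen_wrong, seen_right) and also shows seq_len(), which behaves like 1:n but yields an empty sequence when n is 0 (for example, when the directory contains no files).

```r
n <- 4

# for (i in n) iterates over the single value 4, so the body runs once.
seen_wrong <- integer()
for (i in n) seen_wrong <- c(seen_wrong, i)

# seq_len(n) expands to 1, 2, 3, 4, so the body runs n times.
seen_right <- integer()
for (i in seq_len(n)) seen_right <- c(seen_right, i)

print(seen_wrong)   # 4
print(seen_right)   # 1 2 3 4

# With n = 0, 1:n would give c(1, 0); seq_len(0) gives an empty vector.
print(length(seq_len(0)))   # 0
```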
I am a noob R programmer. I have written code that needs to apply a function to a data frame split by a factor. The data frame itself contains 1,324,961 observations, and the variable that we use to slice it has 64,376 levels.
The code is as follows:
library("readstata13")
# Reading the Stata Data file into R
bod_fb <- read.dta13("BoD_nonmissing_fb.dta")
gen_fuzzy_blau <- function(bod_sample){
# Here we drop the Variables that are not required in creating the Fuzzy-Blau index
bod_sample <- as.data.frame(bod_sample)
bod_sample$tot_occur <- as.numeric(bod_sample$tot_occur)
bod_sample$caste1_occ <- as.numeric(bod_sample$caste1_occ)
bod_sample$caste2_occ <- as.numeric(bod_sample$caste2_occ)
bod_sample$caste3_occ <- as.numeric(bod_sample$caste3_occ)
bod_sample$caste4_occ <- as.numeric(bod_sample$caste4_occ)
# Calculating the probabilities of a director belonging to a caste
bod_sample$caste1_occ <- (bod_sample$caste1_occ)/(bod_sample$tot_occur)
bod_sample$caste2_occ <- (bod_sample$caste2_occ)/(bod_sample$tot_occur)
bod_sample$caste3_occ <- (bod_sample$caste3_occ)/(bod_sample$tot_occur)
bod_sample$caste4_occ <- (bod_sample$caste4_occ)/(bod_sample$tot_occur)
# Dropping the total occurrences column, as we do not need it anymore
bod_sample$tot_occur<- NULL
# Here we replace all the blanks with NA
bod_sample <- apply(bod_sample, 2, function(x) gsub("^$|^ $", NA, x))
bod_sample <- as.data.frame(bod_sample)
# Here we push all the NAs in the caste names and caste probabilities to the end of the row
# So if there are only two castes against a name, then they become caste1 and caste2
caste_list<-data.frame(bod_sample$caste1,bod_sample$caste2,bod_sample$caste3,bod_sample$caste4)
caste_list = as.data.frame(t(apply(caste_list,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
caste_list_prob<-data.frame(bod_sample$caste1_occ,bod_sample$caste2_occ,bod_sample$caste3_occ,bod_sample$caste4_occ)
caste_list_prob = as.data.frame(t(apply(caste_list_prob,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
# Here we write two functions: 1. gen_castelist
# 2. gen_caste_prob
# gen_castelist: This function takes the row number (serial number of the director)
# and returns the names of all the castes for which he has a non-zero
# probability.
# gen_caste_prob: This function takes the row number (serial number of the director)
# and returns the probability with which he belongs to the caste
#
gen_castelist <- function(x){
y <- caste_list[x,]
y <- as.vector(y[!is.na(y)])
return(y)
}
gen_caste_prob <- function(x){
z <- caste_list_prob[x,]
z <- z[!is.na(z)]
z <- as.numeric(z)
return(z)
}
caste_ls <-list()
caste_prob_ls <- list()
for(i in 1:nrow(bod_sample))
{
caste_ls[[i]]<- gen_castelist(i)
caste_prob_ls[[i]]<- gen_caste_prob(i)
}
gridcaste <- expand.grid(caste_ls)
gridcaste <- data.frame(lapply(gridcaste, as.character), stringsAsFactors=FALSE)
gridcasteprob <- expand.grid(caste_prob_ls)
# Generating the Joint Probability
gridcasteprob$JP <- apply(gridcasteprob,1,prod)
# Generating the Similarity Index
gen_sim_index <- function(x){
x <- t(x)
a <- as.data.frame(table(x))
sim_index <- sum(a$Freq^2)/(sum(a$Freq))^2
return(sim_index)
}
gridcaste$sim_index <- apply(gridcaste,1,gen_sim_index)
# Generating fuzzyblau
gridcaste$fb <- gridcaste$sim_index * gridcasteprob$JP
fuzzy_blau_index <- sum(gridcaste$fb)
remove_list <- c("gridcaste","")
return(fuzzy_blau_index)
}
fuzzy_blau_output <- by(bod_fb,bod_fb$code_year,gen_fuzzy_blau)
# Saving the output as a dataframe with two columns
# Column 1 is the fuzzy blau index
# Column 2 is the code_year
code_year <- names(fuzzy_blau_output)
fuzzy_blau <- as.data.frame(as.vector(unlist(fuzzy_blau_output)))
names(fuzzy_blau) <- c("fuzzy_blau_index")
fuzzy_blau$code_year <- code_year
bod_fb <- merge(bod_fb,fuzzy_blau,by = "code_year")
save.dta13(bod_fb,"bod_fb_example.dta")
If the code is tl;dr, the summary is as follows:
I have a dataframe bod_fb. I need to apply the gen_fuzzy_blau function to this dataframe, sliced by the levels of bod_fb$code_year.
Since the function is very large, sequential processing takes more than a day and ends up running out of memory. The function gen_fuzzy_blau returns a numeric value fuzzy_blau_index for each code_year of the dataframe. I use by to apply the function to each slice. I wanted to know if there is a way to run this code in parallel, so that multiple instances of the function run at once on different slices of the dataframe. I did not find a by implementation in the parallel package, and I did not know how to pass the dataframes as iterators when using the foreach and doParallel packages.
I have an AMD A8 laptop with 4GB RAM and Windows 7 SP1 Home Basic. I have given it 20GB of page file memory (this was after I got the memory error).
Thank you
EDIT 1: @milkmotel I have eliminated the redundancy in the code and removed the for loops, but a huge amount of time is still being spent in gen_sim_index. I am using the proc.time() function to gauge the time that each part of the code takes.
The function is supposed to do the following to a row:
If we have a row (not a vector), say a a b c, the similarity index will be (2/4)^2 + (1/4)^2 + (1/4)^2, i.e. the sum, over the unique elements of the row, of (number of occurrences of the element / total number of elements in the row)^2.
I am unable to use the apply function directly on the row because each element in the row has different factor levels, and table() does not output the frequencies properly.
What is an efficient way to code the gen_sim_index function?
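One possible way to write this, sketched on toy data: convert the row to plain character first so that differing factor levels no longer matter, then count with tabulate() instead of table(). The name sim_index and the toy data frame are made up here, and the sketch assumes the rows contain no NA values.

```r
# Similarity index of one row: sum over distinct values of (count / row length)^2.
sim_index <- function(x) {
  x <- as.character(unlist(x, use.names = FALSE))  # keep the labels, drop factor levels
  counts <- tabulate(match(x, unique(x)))          # occurrences of each distinct value
  sum(counts^2) / length(x)^2
}

print(sim_index(c("a", "a", "b", "c")))   # (2/4)^2 + (1/4)^2 + (1/4)^2 = 0.375

# Applied row by row to a small all-character data frame (toy data).
toy <- data.frame(V1 = c("a", "x"), V2 = c("a", "y"),
                  V3 = c("b", "x"), V4 = c("c", "x"),
                  stringsAsFactors = FALSE)
row_index <- apply(as.matrix(toy), 1, sim_index)
print(row_index)
```

This avoids building a data frame from table() for every row, which is likely where much of the time in the original version goes.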
You're saving your data 6 times over in 6 different variables. Try not doing that.
And it takes a day because you're running character pattern matching with gsub() on a ridiculous amount of data.
Take your code out of your gen_fuzzy_blau function as it provides no value to wrap it up into one function rather than running it all independently. Then run it all one line at a time. If it takes too long to run, reconsider your method. Your code is incredibly inefficient.
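On the parallel part of the question: by() has no direct parallel counterpart that I know of, but split() produces the same slices and parLapply() from the parallel package can process them on several workers, including on Windows. The sketch below uses toy data and a hypothetical slice_stat() in place of gen_fuzzy_blau; the worker count of 2 is an arbitrary choice.

```r
library(parallel)

# Toy data: 'grp' plays the role of code_year, 'val' of the per-row inputs.
toy <- data.frame(grp = rep(c("a", "b", "c"), each = 4), val = 1:12)

# Hypothetical per-slice function standing in for gen_fuzzy_blau().
slice_stat <- function(d) sum(d$val)

# split() produces the same slices that by() would hand to the function.
pieces <- split(toy, toy$grp)

cl <- makeCluster(2)                       # two worker processes
res <- parLapply(cl, pieces, slice_stat)   # one task per slice
stopCluster(cl)

# Collect the results into the two-column layout used in the question.
out <- data.frame(code_year = names(res),
                  fuzzy_blau_index = unlist(res, use.names = FALSE))
print(out)
```

Each worker holds its own copy of the slices it processes, so this spreads the CPU work but does not reduce memory use; on a 4GB machine the per-slice memory footprint would still need to come down first.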
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
columns <- 6  # defined first because values and solution below depend on it
a <- c(10,20,30,40,50,60,70,80)
b <- c("a","b","c","d","z","g","h","r")
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
values <- vector(length=columns)
solution <- matrix(nrow=nrow(myframe),ncol=columns+3)
myfunction <- function(frame,columns){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
The problematic for-loop looks like this:
columns = 6
for(i in 1:nrow(myframe)){
values <- myfunction(as.matrix(myframe[i,]), columns)
values[columns+2] = i
values[columns+3] = myframe[i,3]
#more columns added with simple operations (i.e. sum)
solution <- rbind(solution,values)
#solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages regarding the size of solution, which seems to be too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really come further than this, although I tried applying this very good introduction to parallel processing.
How do I have to reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight from the 2nd circle of hell from the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that in each iteration you enlarge your solution object and copy it to a new one, which is very slow and can also lead to memory problems. One way around this is to create a list whose ith component stores the output of the ith loop iteration. The final step is to call rbind on that list (just once, at the end). This will look something like
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
# Call all necessary commands to create values
my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
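A small self-contained version of this pattern, with a made-up per-row calculation in place of the real one, might look as follows. The three values stored per row are purely illustrative.

```r
myframe <- data.frame(a = c(10, 20, 30), c = c(1, 2, 3))

# Pre-allocate one list slot per row.
my.list <- vector("list", nrow(myframe))
for (i in seq_len(nrow(myframe))) {
  # Hypothetical per-row result: row number, value of column c, and their sum.
  my.list[[i]] <- c(i, myframe[i, "c"], i + myframe[i, "c"])
}

# Bind all rows in a single call instead of growing a matrix inside the loop.
solution <- do.call(rbind, my.list)
print(dim(solution))   # 3 3
```

The loop body can also be moved into a function and passed to lapply(seq_len(nrow(myframe)), ...), which builds the same list without the explicit pre-allocation.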
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
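apply(myframe, 1, myfunction)
If columns is not given via the environment, you can use:
apply(myframe, 1, myfunction, columns) with your original myfunction definition. (MARGIN = 1 calls the function once per row, as the loop does. apply() then returns one column per input row, so transpose the result with t() if you want one row per input row.)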