Most of the time there is more than one way to implement a solution to a specific problem, and hence there are bad solutions and good solutions. I consider robust implementations to be the ones that use for loops and while statements, lists, or any other functions and built-in types that make our lives easier.
I am looking forward to seeing and understanding some examples of higher-level programming in R.
Assume a task like the following.
#IMPORT DATASET
Dataset <- read.table("blablabla\\dataset.txt", header=T, dec=".")
#TRAINING OF MODEL
Modeltrain <- lm(temperature~latitude+sea.distance+altitude, data=Dataset)
#COEFFICIENT VALUES FOR INDEPENDENT VARIABLES
Intercept <- summary(Modeltrain)$coefficients[1]
Latitude <- summary(Modeltrain)$coefficients[2]
Sea.distance <- summary(Modeltrain)$coefficients[3]
Altitude <- summary(Modeltrain)$coefficients[4]
#ASK FOR USER INPUT AND CALCULATE y
i <- 1
while (i == 1){
#LATITUDE (Xlat)
cat("Input latitude value please: ")
Xlat <- readLines(con="stdin", 1)
Xlat <- as.numeric(Xlat)
cat(Xlat, "is the latitude value. \n")
#LONGITUDE (Xlong)
#CALCULATE DISTANCE FROM SEA (Xdifs)
cat("Input longitude value please: ")
Xlong <- readLines(con="stdin", 1)
Xlong <- as.numeric(Xlong)
#cat(Xlong, "\n")
Xdifs <- min(4-Xlong, Xlat)
cat(Xdifs, "is the calculated distance from sea value. \n")
#ALTITUDE (Xalt)
cat("Input altitude value please: ")
Xalt <- readLines(con="stdin", 1)
Xalt <- as.numeric(Xalt)
cat(Xalt, "is the altitude value. \n")
y = Intercept + Latitude*Xlat + Sea.distance*Xdifs + Altitude*Xalt
cat(y, "is the predicted temperature value.\n")
}
First of all, I would like to ask how, instead of blablabla\\dataset.txt, I can set the path so that the script stays functional on other operating systems too.
My second question is how I can automate the above process to include additional X variables as well, without having to add them manually to the script.
I understand that the latter question probably means rewriting the whole thing, so I don't expect a full answer. As I said before, I am more interested in understanding how it could be done and doing it myself.
Thanks.
P.S. Please don't ask for a reproducible example; I can't provide much else.
For the first question, you may want to look at the file.path command. For the second, I would approach this by defining, outside the while loop, two lists, one to store the prompts (e.g. list(lat="Please enter Latitude")) and another, with identical names, to store the input values. Then another loop inside the while iterates through the names of the first list, produces the relevant prompt, and stores the response in the named slot in the second list.
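A minimal sketch of that idea, purely for illustration (the relative path, the prompt texts, and the list names are my own placeholders, not taken from your script):
#BUILD AN OS-INDEPENDENT PATH WITH file.path()
Dataset <- read.table(file.path("data", "dataset.txt"), header=T, dec=".")
#DEFINE THE PROMPTS AND A MATCHING LIST FOR THE ANSWERS
prompts <- list(lat="Input latitude value please: ",
                long="Input longitude value please: ",
                alt="Input altitude value please: ")
values <- setNames(vector("list", length(prompts)), names(prompts))
#INNER LOOP: SHOW EACH PROMPT AND STORE THE NUMERIC ANSWER UNDER THE SAME NAME
for (nm in names(prompts)){
  cat(prompts[[nm]])
  values[[nm]] <- as.numeric(readLines(con="stdin", 1))
}
#values$lat, values$long and values$alt can then be used for the prediction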
If your users are happy interacting with R in such a way, then you're lucky. Else, as @Roland suggests, delegate the UI to some other technology.
I'm trying to perform multiple imputation on a dataset in R where I have two variables, one of which needs to be the same or greater than the other one. I have set up the method and the predictive matrix, but I am having trouble understanding how to configure the post-processing. The manual (or main paper - van Buuren and Groothuis-Oudshoorn, 2011) states (section 3.5): "The mice() function has an argument post that takes a vector of strings of R commands. These commands are parsed and evaluated just after the univariate imputation function returns, and thus provide a way to post-process the imputed values." There are a couple of examples, of which the second one seems most useful:
R> post["gen"] <- "imp[[j]][p$data$age[!r[,j]]<5,i] <- levels(boys$gen)[1]"
This suggests to me that I could do:
R> ini <- mice(cbind(boys), max = 0, print = FALSE)
R> post["A"] <- "imp[[j]][p$data$B[!r[,j]]>p$data$A[!r[,j]],i] <- levels(boys$A)[boys$B]"
However, this doesn't work (when I plot A v B, I get random scatter rather than the points being confined to one half of the graph where A >= B).
I have also tried using the ifdo() function, as suggested in another Stack Exchange post:
post["A"] <- "ifdo(A < B), B"
However, it seems the ifdo() function is not yet implemented. I tried running the code suggested for inspiration, but I'm afraid my R programming skills are not that brilliant.
So, in summary, has anyone any advice about how to implement post-processing in mice such that value A >= value B in the final imputed datasets?
Ok, so I've found an answer to my own question - but maybe this isn't the best way to do it.
In FIMD, there is a suggestion to do this kind of thing outside the imputation process, which thus gives:
R> long <- mice::complete(imp, "long", include = TRUE)
R> long$A <- with(long, ifelse(B < A, B, A))
This seems to work, so I'm happy.
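If the result is needed as a mids object again afterwards (e.g. to pass to with() and pool()), converting the corrected long-format data back should be possible with as.mids(), which recent versions of mice provide:
R> imp2 <- as.mids(long)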
I understand that in the following
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(colnames(s)[c(21,259,330,380)], collapse='+'))))
I am missing x
but I really don't understand how and where to insert it to make the call correct.
Thank you for any help.
Making this an answer instead of a comment due to the amount of text.
If I understand you correctly, you're trying to iterate over a list of variables, which you want to add (each in turn) to a set of independent variables in a survival model. The issue in the code you gave is that you don't give x a place. There are several approaches to do so.
The first one is very similar to what you're doing, and creates the formulas. I demonstrate this using the 'cancer' dataset:
library(survival)
data(cancer)
myvars <- c("meal.cal","wt.loss")
a1 <- sapply(myvars, function(x){
  as.formula(sprintf("Surv(time, status)~age+sex+%s", x))
})
#then we can fit our models
lapply(a1,function(x){coxph(formula=x,data=cancer)})
In my opinion, this is a bit convoluted and can be done in one step:
models <- lapply(myvars, function(x){
  form <- as.formula(sprintf("Surv(time, status)~age+sex+%s", x))
  fit <- coxph(formula=form, data=cancer)
  return(fit)
})
Using the code you started with, we can simply add 'x' to the vector of independent variables. However, this is not very readable code and I'm always a bit nervous about feeding column indices to models. You might be safer using variable names instead.
aa <- sapply(c("BMI","KOL"),function(x) as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~', paste(c(x,colnames(s)[c(21,259,330,380)]), collapse='+'))))
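For example, a sketch of the names-based variant (the names in base_vars are placeholders for the actual column names of s):
base_vars <- c("var1", "var2", "var3", "var4")  # placeholders for colnames(s)[c(21,259,330,380)]
aa <- sapply(c("BMI","KOL"), function(x)
  as.formula(paste('Surv(BL_AGE,CVD_AGE,INCIDENT_CVD) ~',
                   paste(c(x, base_vars), collapse='+'))))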
I am working on a problem in which I have two data frames, data and abbreviations, and I would like to replace all the abbreviations present in data with their respective full forms. Until now I have been using for loops in the following manner:
abb <- c()
for(i in 1:length(data$text)){
  for(j in 1:length(AbbreviationList$Abb)){
    abb <- paste("(\\b", AbbreviationList$Abb[j], "\\b)", sep="")
    data$text[i] <- gsub(abb, AbbreviationList$Fullform[j], tolower(data$text[i]))
  }
}
The abbreviation data frame looks something like the following and can be generated using this code:
Abbreviation <- c(c("hru", "how are you"),
                  c("asap", "as soon as possible"),
                  c("bf", "boyfriend"),
                  c("ur", "your"),
                  c("u", "you"),
                  c("afk", "away from keyboard"))
Abbreviation <- data.frame(matrix(Abbreviation, ncol=2, byrow=T), row.names=NULL)
names(Abbreviation) <- c("abb","Fullform")
And data is merely a data frame with one column containing a text string in each row, which can also be generated using the following code.
data <- data.frame(unlist(c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?")))
names(data) <- "text"
Initially, I had a data frame with around 1000 observations and around 100 abbreviations, so I was able to run the analysis. But now the data has grown to almost 50,000 rows and I am having difficulty processing it, as the two nested for loops make the process very slow. Can you suggest some better alternatives to the for loops and explain with an example how to use them in this situation? If this problem can be solved faster via vectorization, please suggest how to do that as well.
Thanks for the help!
This should be faster, and without side effects.
mapply(function(x, y){
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, abriv$Abb, abriv$Fullform)
gsub is vectorized over its text argument, so you can give it a character vector in which matches are sought; here I give it data$text.
I use mapply to avoid the side effects of the for loop.
First of all, there is clearly no need to rebuild the regular expressions on each iteration of the loop. Also, there is no need to actually loop over data$text: in R you can very often use a vector where a single value would do, and R will go through all the elements of the vector and return a vector of the same length.
Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)
for(j in 1:length(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case=T)
}
The above code works with the example data.
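If a dependency on stringr is acceptable, here is a fully vectorized sketch of the same replacement; this is my own suggestion rather than part of either answer above, and it assumes the Abbreviation data frame from the question with columns abb and Fullform:
library(stringr)
# build a named vector: names are the regex patterns, values the replacements
repl <- setNames(as.character(Abbreviation$Fullform),
                 sprintf("\\b%s\\b", Abbreviation$abb))
data$text <- str_replace_all(tolower(data$text), repl)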
I'm using the GillespieSSA package for R, which allows you to call a single instance of a stochastic simulation easily and quickly. But a single realization isn't super useful; I'd like to be able to look at the variability that arises from chance, etc. Consider a very basic model:
library(GillespieSSA)
x0 <- c(S=499, I=1, R=0)
a <- c("0.001*{S}*{I}","0.1*{I}")
nu <- matrix(c(-1,  0,
               +1, -1,
                0, +1), nrow=3, byrow=T)
out <- ssa(x0, a, nu, tf=100)
That puts out a profoundly complicated list, the interesting bits of which are in out$data.
My question: I know I can grab out$data for a single call, tag it with a variable indicating which call of the function it came from, and then append that data to the previous data so that I end up with one big set at the end. So, in crude pseudo-R, something like:
nruns <- 10
for (i in 1:nruns){
  out <- ssa(x0, a, nu, tf=100)
  data <- out$data
  run <- rep(i, times=length(data[,2]))
  data <- cbind(data, run)
}
except that data shouldn't be overwritten on each pass through the loop. I feel like I'm close, but having hopped between several languages this week, my R loop fu, weak as it already is, is failing.
I am not sure if I understood you correctly. Do you want to do something like the following?
out <- lapply(X=1:10,FUN=function(x) ssa(x0, a, nu, tf=100)$data)
This would do 10 runs and put the resulting data in a list. You could then access, e.g., the data from the second run with out[[2]].
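Building on that, here is a sketch of the tagging-and-stacking step you describe; the run column name is my own choice:
runs <- lapply(1:10, function(i){
  d <- ssa(x0, a, nu, tf=100)$data
  cbind(d, run=i)   # tag each realization with its run number
})
alldata <- do.call(rbind, runs)   # one big object with all runs stacked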
So I have some lidar data that I want to calculate some metrics for (I'll attach a link to the data in a comment).
I also have ground plots that I have extracted the lidar points around, so that I have a couple hundred points per plot (19 plots). Each point has X, Y, Z, height above ground, and the associated plot.
I need to calculate a bunch of metrics on the plot level, so I created plotsgrouped with split(plotpts, plotpts$AssocPlot).
So now I have a data frame with a "page" for each plot, so I can calculate all my metrics by the "plot page". This works just dandy for individual plots, but I want to automate it. (yes, I know there's only 19 plots, but it's the principle of it, darn it! :-P)
So far, I've got a for loop going that calculates the metrics and puts the results in a data frame called Results. I pulled the names of the groups into a list called groups as well.
for(i in 1:length(groups)){
  Results$Plot[i] <- groups[i]
  Results$Mean[i] <- mean(plotsgrouped$PLT01$Z)
  Results$Std.Dev.[i] <- sd(plotsgrouped$PLT01$Z)
  Results$Max[i] <- max(plotsgrouped$PLT01$Z)
  Results$`75%Avg.`[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .75)])
  Results$`50%Avg.`[i] <- mean(plotsgrouped$PLT01$Z[plotsgrouped$PLT01$Z <= quantile(plotsgrouped$PLT01$Z, .50)])
...
and so on.
The problem arises when I try to do something like:
Results$mean[i] <- mean(paste("plotsgrouped", groups[i],"Z", sep="$")). mean() doesn't recognize the paste as a reference to the vector plotsgrouped$PLT27$Z, and instead fails. I've deduced that it's because it sees the quotes and thinks, "Oh, you're just some text, I can't get the mean of you." or something to that effect.
By the way, groups is a list of the 19 plot names: PLT01-PLT27 (non-consecutive sometimes) and FTWR, so I can't simply put a sequence for the numeric part of the name.
Anyone have an easier way to iterate across my test plots and get arbitrary metrics?
I feel like I have all the right pieces, but just don't know how they go together to give me what I want.
Also, if anyone can come up with a better title for the question, feel free to post it or change it or whatever.
Try with:
for(i in seq_along(groups)) {
  Results$Plot[i] <- groups[i]  # character names of the groups
  tempZ <- plotsgrouped[[groups[i]]][["Z"]]
  Results$Mean[i] <- mean(tempZ)
  Results$Std.Dev.[i] <- sd(tempZ)
  Results$Max[i] <- max(tempZ)
  Results$`75%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .75)])
  Results$`50%Avg.`[i] <- mean(tempZ[tempZ <= quantile(tempZ, .50)])
}
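An alternative sketch, not from the original answer, that builds the whole Results data frame in one go; it assumes every element of plotsgrouped has a numeric column Z:
rows <- lapply(names(plotsgrouped), function(g){
  z <- plotsgrouped[[g]]$Z
  data.frame(Plot=g, Mean=mean(z), Std.Dev.=sd(z), Max=max(z),
             `75%Avg.`=mean(z[z <= quantile(z, .75)]),
             `50%Avg.`=mean(z[z <= quantile(z, .50)]),
             check.names=FALSE)
})
Results <- do.call(rbind, rows)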