So I have a customer survey, and I need to determine if there are significant differences between the four areas. I obviously want to do a t-test on these, and here is my current R solution.
for (i in colnames(c_survey))
  assign(i, subset(c_survey, select=i))
elements <- list(quality, ease.of.use, price, service)
elements_alt <- list(service, price, ease.of.use, quality)
for (i in elements) {
  print(names(elements)[i])
  for (j in elements_alt) {
    print(t.test(i, j)$p.value)
  }
}
(Edit) I figured out the nested loop, but I still think there's a faster way to do what I want than this whole two-list, nested-loop nonsense. Also, my output has no names on it, so I have no idea what's being compared to what, and it includes all the duplicate inverse comparisons. I also can't save the results. I think my solution certainly lies elsewhere.
Also, would producing so many t-test p values even be the best statistical way to accomplish what I want? It seems like there should be something easier than this...
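A sketch of the kind of output I'm after - each pair compared once, with names, and the results saved (assuming c_survey holds just the four numeric columns):
pairs <- combn(names(c_survey), 2)   # each unordered pair of columns once
pvals <- apply(pairs, 2, function(p) t.test(c_survey[[p[1]]], c_survey[[p[2]]])$p.value)
names(pvals) <- paste(pairs[1,], pairs[2,], sep=" vs ")
pvals   # a named vector that can be saved or inspected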
This is my first question, so I'll try to go straight to the point.
I'm currently working with tables, and I chose R because it has no hard limit on data frame size and can perform many operations on the data within the tables. I am happy with that: I can manipulate the data at will, and merges, concatenations, and row and column manipulation all work fine. But I recently had to run a loop at roughly 0.00001 sec per instruction over a 6-million-row table, and it took over an hour.
Maybe the R approach was wrong to begin with, and I've tried to look for the most efficient ways to run some operations (using list assignment instead of c(list, new_element)). But as far as I can tell, this is not something you can optimize with an algorithm such as graphs or heaps (it's just tables; you have to iterate through them all), so I was wondering whether there are other instructions or basic ways of working with tables that I don't know (assign, extract, ...) that take less time, or some configuration of RStudio that improves performance.
This is the loop, in case it helps to understand the question:
library(dplyr)   # for %>% and pull()

# pre-allocate, then format each date one row at a time (the slow part)
my_list <- vector("list", nrow(table[,"Date_of_count"]))
for (i in 1:nrow(table[,"Date_of_count"])) {
  my_list[[i]] <- format(as.POSIXct(strptime(table[i,"Date_of_count"] %>% pull(1), "%Y-%m-%d")), format = "%Y-%m-%d")
}
As mentioned, the table has over 6 million rows and 25 variables. I want to fill the list and then append it to the table as a column once finished.
Please let me know if this lacks specificity or concreteness, or if it just does not belong here.
To improve performance (and work properly with R and tables), the answer was a mixture of the first comments:
use vectors
avoid repeated conversions
if possible, avoid loops and apply functions directly over list/vector
I just converted the table (which, I realized, had some tibbles inside) into a data frame and followed the points above.
df <- as.data.frame(table)
In this case, doing this converted the dates directly to character, so I did not have to apply any further conversions.
New execution time over 6 million rows: 25.25 sec.
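For reference, the loop itself can also be replaced by a single vectorized call that formats the whole column at once (a sketch, assuming Date_of_count parses with "%Y-%m-%d"):
df <- as.data.frame(table)
# one vectorized call over the whole column instead of millions of row-wise calls
df$Date_chr <- format(as.Date(df$Date_of_count, format = "%Y-%m-%d"), "%Y-%m-%d")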
I have this code, from Julian Faraway's linear models book:
round(cor(seatpos[,-9]),2)
I am unsure what the [,-9] and the 2 are doing - could someone please assist?
When you are learning new stuff, nested functions can be difficult to read. The same computation can be done in steps, which might make it easier to see what KeonV and MrFlick are suggesting.
Here is an alternative way of doing this with the same functions, but in separate, easier-to-follow steps with simple explanations.
sub_seatpos<- seatpos[,-9]
This says: take all rows and all columns EXCEPT column number nine, and save the result in sub_seatpos. (This subsetting was done in the initial code but not saved into a new variable; saving it here just makes it easier to see how each step works.) It corresponds to the seatpos[,-9] portion of the original call:
round(cor( seatpos[,-9] ), 2)
cor_seatpos <- cor(sub_seatpos)
This takes the correlation matrix of sub_seatpos and saves it in a variable named cor_seatpos. It corresponds to the cor(...) portion of the original call:
round( cor(seatpos[,-9]) , 2)
The final step just rounds the correlations to 2 decimal places, and would look like this as a separate line of code:
round(cor_seatpos, 2)
It corresponds to the outer round(..., 2) in the original call:
round( cor(seatpos[,-9]), 2 )
What makes this confusing is that all of the functions are nested. As you become more proficient, this becomes easier to read, but it can be puzzling when the functions are new to you.
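Putting the steps together (assuming seatpos comes from the faraway package, as in the book):
library(faraway)                  # provides the seatpos data
data(seatpos)
sub_seatpos <- seatpos[,-9]       # drop column 9
cor_seatpos <- cor(sub_seatpos)   # correlation matrix
round(cor_seatpos, 2)             # rounded to 2 decimal places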
I'm optimizing some more complex code, but I got stuck on this problem.
a<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
m<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
f<-array(sample(c(1:10),100,replace=TRUE),c(10,10))
g<-array(NA,c(10,10))
I need to use the values in a and m to index f and assign the resulting value from f to g,
i.e. g[1,1] <- f[a[1,1], m[1,1]], but for all the indexes, and as optimally/fast as possible.
I could obviously write a for loop to do this for me, but that seems rather dumb and slow. It seems like I should be able to use something in the apply family, but I've had no luck figuring out how to do that. I do need to keep the data structured as it is here so that I can use matrix operations in different parts of my code. I've been searching for an answer but haven't found anything particularly helpful yet.
g[] <- f[cbind(c(a), c(m))]
This takes advantage of the fact that a matrix can be indexed by a two-column matrix of (row, column) pairs; assigning to g[] fills g while preserving its dimensions.
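Just to illustrate the equivalence, here is a check against the explicit double loop (using the arrays from the question):
g_loop <- array(NA, c(10,10))
for (i in 1:10) for (j in 1:10) g_loop[i,j] <- f[a[i,j], m[i,j]]
identical(g, g_loop)   # TRUE: both fill g the same way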
I am trying to combine several for loops into a single loop or function. Each loop evaluates whether an individual is present at a protected site and, based on that, assigns a number (numbers represent sites) at each time step. After that, the results for each time step are stored in a matrix and later used in other analyses. The problem is that I am repeating the same loop several times to evaluate the different scenarios (10%, 50%, 100% of sites protected), and since I need to store the results for each scenario, I can't think of a better way to simplify this into a single loop or function. Any ideas or suggestions would be appreciated. This is a very small and simplified version of the problem. I would like to keep the structure of the loop, since my original loop uses several if statements; the only thing that changes is the proportion of sites that are protected.
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
sites10<-as.factor(sample(sites,N*1))
sites5<-as.factor(sample(sites,N*0.5))
sites1<-as.factor(sample(sites,N*0.1))
steps<-10
P.stay<-0.9
# storing results
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites1) {
    if (rbinom(1,1,P.stay)==1) {time.step$event[i+1] <- j[i+1] <- j[i]} else
      time.step$event[i+1] <- 0
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N,1)
}
results.sites1<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites5) {
    if (rbinom(1,1,P.stay)==1) {time.step$event[i+1] <- j[i+1] <- j[i]} else
      time.step$event[i+1] <- 0
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N,1)
}
results.sites5<-as.factor(result)
###
result<-matrix(0,nrow=steps)
time.step<-seq(1,steps)
time.step<-data.frame(time.step)
time.step$event<-0
j<-numeric(steps)
j[1]<-sample(1:N,1)
time.step$event[1]<-j[1]
for (i in 1:(steps-1)) {
  if (j[i] %in% sites10) {
    if (rbinom(1,1,P.stay)==1) {time.step$event[i+1] <- j[i+1] <- j[i]} else
      time.step$event[i+1] <- 0
  }
  time.step$event[i+1] <- j[i+1] <- sample(1:N,1)
}
results.sites10<-as.factor(result)
#
results.sites1
results.sites5
results.sites10
Instead of doing this:
sites10<-as.factor(sample(sites,N*1))
sites5<-as.factor(sample(sites,N*0.5))
sites1<-as.factor(sample(sites,N*0.1))
and running distinct loops for each of the three variables, you can write one general loop, put it in a function, and then use one of the -apply functions to call it with specific parameters. For example:
N<-10 # number of sites
sites<-factor(seq(from=1,to=N))
steps<-10
P.stay<-0.9
simulate.n.sites <- function(n) {
  n.sites <- sample(sites, n)
  result <- matrix(0, nrow=steps)
  time.step <- seq(1, steps)
  time.step <- data.frame(time.step)
  time.step$event <- 0
  j <- numeric(steps)
  j[1] <- sample(1:N, 1)
  time.step$event[1] <- j[1]
  for (i in 1:(steps-1)) {
    if (j[i] %in% n.sites) {
      ...etc...
  return(result)
}
results <- lapply(c(1, 5, 10), simulate.n.sites)
Now results will be a list, with three matrix elements.
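If you want to keep the original labels, you can also name the list elements (hypothetical names mirroring the question's variables):
names(results) <- c("sites1", "sites5", "sites10")
results$sites5   # the matrix for the 50% scenario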
The key is to identify places where you repeat yourself, and then refactor those areas into functions. Not only is this more concise, but it's easy to extend in the future. Want to sample 2 sites? Put a 2 in the vector you pass to lapply.
If you're unfamiliar with the -apply family of functions, definitely look into those.
I also suspect that much of the rest of your code could be simplified, but I think you've gutted it too much for me to make sense of it. For example, you define an element of time.step$event based on a condition, but then you overwrite that element. Surely this isn't what the actual code does?
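For illustration, one plausible reading of the intended control flow might be the following (purely a guess, since the posted code is simplified):
for (i in 1:(steps-1)) {
  # guess: stay put only when the current site is protected AND the
  # binomial draw says "stay"; otherwise move to a random site
  if (j[i] %in% n.sites && rbinom(1, 1, P.stay) == 1) {
    j[i+1] <- j[i]
  } else {
    j[i+1] <- sample(1:N, 1)
  }
  time.step$event[i+1] <- j[i+1]
}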
I'm trying to subset a dataframe within a function using a mixture of fixed variables and some variables which are created within the function (I only know the variable names, but cannot vectorise them beforehand). Here is a simplified example:
a<-c(1,2,3,4)
b<-c(2,2,3,5)
c<-c(1,1,2,2)
D<-data.frame(a,b,c)
subbing <- function(Data, GroupVar, condition) {
  g <- Data$c + 3
  h <- Data$c + 1
  NewD <- data.frame(a, b, g, h)
  subset(NewD, select=c(a, b, GroupVar), GroupVar %in% condition)
}
Keep in mind that in my application I cannot compute g and h outside of the function. Sometimes I'll want to make a selection according to the values of h (as above) and other times I'll want to use g. There's also the possibility I may want to use both, but even just being able to subset using 1 would be great.
subbing(D,GroupVar=h,condition=5)
This returns an error saying that the object h cannot be found. I've tried to amend subset using as.formula and all sorts of things but I've failed every single time.
Besides the convenience of the function, there is a further reason why I'd like to use subset.
In the function I'm actually working on I use subset twice. The first time it's the simple subset function. It's just been pointed out below that another blog explored why it's probably best to use the good old data[colnames()=="g",] indexing. Thanks for the suggestion, I'll have a go.
There is however another issue. I also use subset (or rather a variation) in my function because I'm dealing with several complex design surveys (see package survey), so subset.survey.design allows you to get the right variance estimation for subgroups. If I selected my group using [] I would get the wrong s.e. for my parameters, so I guess this is quite an important issue.
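For context, that design-aware subsetting looks something like this (a sketch using the toy D from above with made-up uniform weights; my real design is more complex):
library(survey)
D$w <- 1   # hypothetical uniform weights, just to build a design object
dsgn <- svydesign(ids = ~1, data = D, weights = ~w)
# subset.survey.design keeps the design information, so subgroup
# standard errors are estimated correctly:
sub_dsgn <- subset(dsgn, c %in% 2)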
Thank you
The error happens right when the function starts to evaluate GroupVar: R is looking for an object named h on its own (not within the data frame).
The best thing to do is refer to the column names in quotes in the subset function. But of course, then you'd have to sidestep the condition part:
subbing <- function(Data, GroupVar, condition) {
  ....
  DF <- subset(Data, select=c("a", "b", GroupVar))
  DF <- DF[DF[,3] %in% condition, ]
  DF   # return the filtered subset explicitly
}
That will do the trick, although it can be annoying to have one data frame indexing inside another.
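The call from the question then passes the column name as a string, e.g. (assuming the elided lines build the g and h columns, as in the original function):
subbing(D, GroupVar = "h", condition = 5)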