This is probably a basic question in R, but I am trying to loop data within subsets of a larger data frame. I have added the 'data=sub240' line within the 'while' command, but this leads to a brackets error, which I suspect is indicative of a larger problem. Can anyone tell me how to run a loop on a subset of a dataframe? I don't think it is relevant, but I am using the 'optiscale' package within the loop.
{while (rsquared.differ > .00001 && niter <= 30 && data=sub240) {
niter <- niter + 1
reg.os<- lm(dvar.os ~ index.os + educ.os)
rsquared.differ <- summary(reg.os)$r.squared - previous.rsquared
previous.rsquared <- summary(reg.os)$r.squared
record <- c(record, niter, summary(reg.os)$r.squared, rsquared.differ)
if (rsquared.differ > .00001) {
dvar.pred <- predict(reg.os)
opscaled.dvar <- opscale(intknow, dvar.pred, level = 1, process = 1)
dvar.os <- opscaled.dvar$os
intknow240.pred <- (dvar.os - (reg.os$coefficients[1] +
(reg.os$coefficients[4]*educ.os)
))*
(1/reg.os$coefficients[2])
opscaled.intknow240 <- opscale(anes$intknow, intknow240.pred, level = 1, process = 1)
intknow240.os <- opscaled.intknow240$os
}}
data = sub240 is an assignment statement. You can assign things on their own line or in function definitions and calls, but you can only provide logical statements in a while loop definition. If you want logical equality, you need ==. But unless data changes in the loop AND you would like that to be a condition for the loop to break, then you don't want that inside your while statement. But generally, there is no need for that statement anyway. The way scoping works in R, the loop should still be able to access data if it is defined outside the loop.
Also, I'm pretty sure you don't need the bracket before while, only after. As your code stands, it won't run because that bracket is unclosed: You have three open brackets (before while, after while, and after if), but only two close brackets (that I could find, anyway).
Try something like this:
data <- sub240
while(rsquared.differ > 0.00001 && niter <=30){
# do stuff
}
Related
Inspired by the leetcode challenge for two sum, I wanted to solve it in R. But while trying to solve it by brute-force I run in to an issue with my for loop.
So the basic idea is that given a vector of integers, which two integers in the vector, sums up to a set target integer.
First I create 10000 integers:
set.seed(1234)
n_numbers <- 10000
nums <- sample(-10^4:10^4, n_numbers, replace = FALSE)
The I do a for loop within a for loop to check every single element against eachother.
# ensure that it is actually solvable
target <- nums[11] + nums[111]
test <- 0
for (i in 1:(length(nums)-1)) {
for (j in 1:(length(nums)-1)) {
j <- j + 1
test <- nums[i] + nums[j]
if (test == target) {
print(i)
print(j)
break
}
}
}
My problem is that it starts wildly printing numbers before ever getting to the right condition of test == target. And I cannot seem to figure out why.
I think there are several issues with your code:
First, you don't have to increase your j manually, you can do this within the for-statement. So if you really want to increase your j by 1 in every step you can just write:
for (j in 2:(length(nums)))
Second, you are breaking only the inner-loop of the for-loop. Look here Breaking out of nested loops in R for further information on that.
Third, there are several entries in nums that gave the "right" result target. Therefore, your if-condition works well and prints all combination of nums[i]+nums[j] that are equal to target.
So I currently have several functions where I want to modify a matrix that was created outside the function, in order to use it as a counting variable for things that happen inside the function. The matrix is named cost_counter, and I want to add to it when certain events occur inside of multiple functions. However, I'd like the solution to still be able to be used in foreach() and be parallelizable.
I know that using <<- is not recommended, however I can't figure out how to use assign() to modify an existing matrix. Example code is below. I've defined the variable cost_counter at the beginning. The function below goes on for longer, but I'm just including the first part for an example of what is happening.
cost_counter <<- matrix(0,nrow = 2, ncol = 12*15)
I0 <- function(){
if (screen[i] == 1){
cost_counter[2,ages[i]] <<- 1 + cost_counter[2,ages[i]] + 1
if(HIV[i] == 1){
if(ages[i] > 35){
if(pv[(i-min_i+1),1] < (1-specP3)){
cost_counter[1,ages[i]] <<- cost_counter[1,ages[i]] + 1
if(contact[i] == 1){return(c(5,0))}
}
}
When I run, error message simply says
"Error in cost_counter[2, ages[i]] <<- cost_counter[2, ages[i]] + 1 :
object 'cost_counter' not found"
I would just like to be able to modify the matrix, and for it to be recognized.
Any help would be appreciated. Thanks!
just at the end f the function re assign the function's matrix to the global env using:
cost_counter <<- cost_counter
I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
a=list[i]
b=which(d1==a)
d2=data[,b]
for (j in 1:nrows){
d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
}
}
Someone told me that I am not storing d2. How can I create for loops to apply the conditions I want? I know that there are similar questions but i didnt get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
mean_x <- mean(x)
sd_x <- sd(x,na.rm=TRUE)
x_outside_3s <- abs(x - mean(x)) < 3 * sd_x
x[x_outside_3s] <- NA # no need for ifelse here
x
}
of course, you can choose any function name you want. More descriptive is better.
Then if you want to apply the function to very column, just loop over the columns. That function NA_outside_3s is already vectorized, i.e. it takes a logical vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
my_data[, j] <- NA_if_3_sd(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
I have got a column with different numbers (from 1 to tt) and would like to use looping to perform a count on the occurrence of these numbers in R.
count = matrix(ncol=1,nrow=tt) #creating an empty matrix
for (j in 1:tt)
{count[j] = 0} #initiate count at 0
for (j in 1:tt)
{
for (i in 1:N) #for each observation (1 to N)
{
if (column[i] == j)
{count[j] = count[j] + 1 }
}
}
Unfortunately I keep getting this error.
Error in if (column[i] == j) { :
missing value where TRUE/FALSE needed
So I tried:
for (i in 1:N) #from obs 1 to obs N
if (column[i] = 1) print("Test")
I basically got the same error.
Tried to do abit research on this kind of error and alot have to said about "debugging" which I'm not familiar with.
Hopefully someone can tell me what's happening here. Thanks!
As you progress with your learning of R, one feature you should be aware of is vectorisation. Many operations that (in C say) would have to be done in a loop, can be don all at once in R. This is particularly true when you have a vector/matrix/array and a scalar, and want to perform an operation between them.
Say you want to add 2 to the vector myvector. The C/C++ way to do it in R would be to use a loop:
for ( i in 1:length(myvector) )
myvector[i] = myvector[i] + 2
Since R has vectorisation, you can do the addition without a loop at all, that is, add a scalar to a vector:
myvector = myvector + 2
Vectorisation means the loop is done internally. This is much more efficient than writing the loop within R itself! (If you've ever done any Matlab or python/numpy it's much the same in this sense).
I know you're new to R so this is a bit confusing but just keep in mind that often loops can be eliminated in R.
With that in mind, let's look at your code:
The initialisation of count to 0 can be done at creation, so the first loop is unnecessary.
count = matrix(0,ncol=1,nrow=tt)
Secondly, because of vectorisation, you can compare a vector to a scalar.
So for your inner loop in i, instead of looping through column and doing if column[i]==j, you can do idx = (column==j). This returns a vector that is TRUE where column[i]==j and FALSE otherwise.
To find how many elements of column are equal to j, we just count how many TRUEs there are in idx. That is, we do sum(idx).
So your double-loop can be rewritten like so:
for ( j in 1:tt ) {
idx = (column == j)
count[j] = sum(idx) # no need to add
}
Now it's even possible to remove the outer loop in j by using the function sapply:
sapply( 1:tt, function(j) sum(column==j) )
The above line of code means: "for each j in 1:tt, return function(j)", an returns a vector where the j'th element is the result of the function.
So in summary, you can reduce your entire code to:
count = sapply( 1:tt, function(j) sum(column==j) )
(Although this doesn't explain your error, which I suspect is to do with the construction or class of your column).
I suggest to not use for loops, but use the count function from the plyr package. This function does exactly what you want in one line of code.
I know that I should avoid for-loops, but I'm not exactly sure how to do what I want to do with an apply function.
Here is a slightly simplified model of what I'm trying to do. So, essentially I have a big matrix of predictors and I want to run a regression using a window of 5 predictors on each side of the indexed predictor (i in the case of a for loop). With a for loop, I can just say something like:
results<-NULL
window<-5
for(i in 1:ncol(g))
{
first<-i-window #Set window boundaries
if(first<1){
1->first
}
last<-i+window-1
if(last>ncol(g)){
ncol(g)->last
}
predictors<-g[,first:last]
#Do regression stuff and return some result
results[i]<-regression stuff
}
Is there a good way to do this with an apply function? My problem is that the vector that apply would be shoving into the function really doesn't matter. All that matters is the index.
This question touches several points that are made in 'The R Inferno' http://www.burns-stat.com/pages/Tutor/R_inferno.pdf
There are some loops you should avoid, but not all of them. And using an apply function is more hiding the loop than avoiding it. This example seems like a good choice to leave in a 'for' loop.
Growing objects is generally bad form -- it can be extremely inefficient in some cases. If you are going to have a blanket rule, then "not growing objects" is a better one than "avoid loops".
You can create a list with the final length by:
result <- vector("list", ncol(g))
for(i in 1:ncol(g)) {
# stuff
result[[i]] <- #results
}
In some circumstances you might think the command:
window<-5
means give me a logical vector stating which values of 'window' are less than -5.
Spaces are good to use, mostly not to confuse humans, but to get the meaning directly above not to confuse R.
Using an apply function to do your regression is mostly a matter of preference in this case; it can handle some of the bookkeeping for you (and so possibly prevent errors) but won't speed up the code.
I would suggest using vectorized functions though to compute your first's and last's, though, perhaps something like:
window <- 5
ng <- 15 #or ncol(g)
xy <- data.frame(first = pmax( (1:ng) - window, 1 ),
last = pmin( (1:ng) + window, ng) )
Or be even smarter with
xy <- data.frame(first= c(rep(1, window), 1:(ng-window) ),
last = c((window+1):ng, rep(ng, window)) )
Then you could use this in a for loop like this:
results <- list()
for(i in 1:nrow(xy)) {
results[[i]] <- xy$first[i] : xy$last[i]
}
results
or with lapply like this:
results <- lapply(1:nrow(xy), function(i) {
xy$first[i] : xy$last[i]
})
where in both cases I just return the sequence between first and list; you would substitute with your actual regression code.