Problem with counting null values in 'if statement' in R

I am passing some data to a simple code block in R which counts the null values and then performs an ARIMA time series imputation. I have written a very simple 'if' statement which counts the null values in the time series and, if there are too many of them, skips that column and moves on to the next one (the ARIMA imputation requires a certain amount of non-null data to work, otherwise it returns an error). Counting the nulls seems to work fine, but the if statement is behaving very strangely and not working. I included a print statement to count the nulls both outside and inside the if statement, and the body of the if statement runs even when its condition should not be fulfilled. Here is the code and the output:
library(imputeTS)   # na_kalman() comes from the imputeTS package

stations <- c('BX1', 'BX2', 'BG3')            # each station has a different data file
pollutants <- c('nox', 'no2', 'pm10', 'pm25') # each station contains data on a number of pollutants

for (s in stations) {
  print(paste('starting imputation for station ', s, sep = " "))
  s_result <- read.csv(paste("/path/to/file", s, "_rescaled.csv", sep = ""))
  for (p in pollutants) {
    ts <- c()
    pcol <- paste0(p, "_iqr")               # find the right column
    ts <- s_result[[pcol]]                  # get the time series from the column
    print(pcol)                             # check which pollutant we're working on
    print(length(ts))                       # test the length of the time series
    print(sum(is.na(ts)))                   # test the number of nulls in the time series
    if (sum(is.na(ts) != length(ts))) {     # if the time series is not completely null
      print(sum(is.na(ts)))                 # print the number of nulls again for testing
      usermodel <- arima(ts, order = c(10, 1, 0))$model           # fit the arima model
      p_result <- na_kalman(ts, model = usermodel, maxgap = 24)   # impute using the arima model
      s_result <- cbind(s_result, p_result)                       # add the computed column to the dataframe
      names(s_result)[names(s_result) == "p_result"] <- paste0(p, "_imputed")
    } else {                                # otherwise add a null column
      p_result <- c(NA, length = length(ts))
      s_result <- cbind(s_result, p_result) # enter a null column
      names(s_result)[names(s_result) == "p_result"] <- paste0(p, "_imputed")
    }
  }
  filename <- paste0("/path/to/file", s, "_imputed_test.csv")
  write.csv(s_result, filename, row.names = TRUE)
  print(paste('completed imputation for station ', s, sep = " "))
}
The problem is that the if statement is not working correctly: it passes the data to the ARIMA imputation inside the if statement even when the number of nulls is equal to the length of the time series. Here's the output:
[1] "starting imputation for station BG1"
[1] "nox_iqr"
[1] 17520
[1] 4660
[1] 4660
[1] "no2_iqr"
[1] 17520
[1] 4664
[1] 4664
[1] "pm10_iqr"
[1] 17520
[1] 17520
[1] 17520
Error in arima(ts, order = c(10, 1, 0)) : 'x' must be numeric
Clearly something is wrong: for the pm10 pollutant there are 17520 nulls, the same as the length of the time series, so the if statement should bypass its body entirely rather than run the line counting the nulls again. I.e. for the column pm10_iqr the number of nulls is 17520, the length of the time series is 17520, and this would cause the ARIMA to fail, so the if statement should skip that block. But it does not.
Where am I going wrong? This should be very simple but it does not make any sense! I don't write a lot of R code, usually Python. Thanks for your help!

sum(is.na(ts) != length(ts))
should probably be
sum(is.na(ts)) != length(ts)
Explanation of what went wrong: in R, any number other than 0 used as a condition evaluates to TRUE, and 0 evaluates to FALSE. For example:
if (0) {print("evaluated to TRUE")} else {print("evaluated to FALSE")}
Returns:
[1] "evaluated to FALSE"
and:
if (5) {print("evaluated to TRUE")} else {print("evaluated to FALSE")}
Returns:
[1] "evaluated to TRUE"
Additionally, R accepts booleans (TRUE, FALSE) as arguments to sum (and other arithmetic functions) and treats them in these cases as 1 (TRUE) and 0 (FALSE).
is.na(ts) != length(ts)
evaluates to a vector of TRUEs and FALSEs
and
sum(is.na(ts) != length(ts))
happily sums them up ;)
That's why your code didn't raise any errors: it was kind of working, just not doing what we meant it to do... these are my most feared errors ;)
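To see the difference concretely, here is a minimal sketch (x is a hypothetical stand-in for a fully-null ts):
x <- c(NA, NA, NA)

# Buggy: compares each element of is.na(x) with length(x) and sums the resulting TRUEs
sum(is.na(x) != length(x))    # 3, which if() treats as TRUE

# Fixed: compare the total NA count with the length
sum(is.na(x)) != length(x)    # FALSE, so the ARIMA branch is skipped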

Related

function with vector R - argument is of length zero

I wrote this function lockdown_func(beta.hat_func).
First thing: I get an error "argument is of length zero".
Second thing: when I compute it without the date indices, it doesn't change the values as it should; the output vector contains the same value at every index.
date <- c(seq(from = 30, to = 165))
beta.hat_func <- c(rep(x = beta.hat, times = 135))
beta.hat <- beta0[which.min(SSE)]

# implement function for modeling
lockdown_func <- function(beta.hat_func, l) {
  h <- beta.hat_func
  for (i in 1:length(h)) {
    if (date[i] > 60 | date[i] < 110) {
      beta.hat_func[i] <- beta.hat_func[i] * exp(-l * (date[i] - date[i - 1]))
    } else {
      beta.hat_func[i] <- beta.hat_func[i]
    }
  }
  return(h)
}

lockdown_func(beta.hat_func, 0.03)
A few comments:
Did you mean to apply an AND rather than an OR to get dates between 60 and 110? That would be date[i] > 60 && date[i] < 110 (it's better to use the double && when you are computing a length-1 logical value).
Because you didn't, i = 1 satisfies the criterion, so date[i - 1] refers to date[0], which is a length-0 vector.
You might want something like:
l_dates <- date > 60 & date < 110   # single & here for a vectorized operation
beta.hat_func[l_dates] <- beta.hat_func[l_dates] * exp(-l * diff(date)[l_dates])
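If you prefer to keep the loop, here is a minimal sketch of a corrected version (assuming date and beta.hat_func are defined as above; it starts at i = 2 so that date[i - 1] always exists, and it returns the modified copy):
lockdown_func <- function(beta.hat_func, l) {
  h <- beta.hat_func
  for (i in 2:length(h)) {                  # start at 2 so date[i - 1] exists
    if (date[i] > 60 && date[i] < 110) {    # AND, not OR, for the 60-110 window
      h[i] <- h[i] * exp(-l * (date[i] - date[i - 1]))
    }
  }
  h                                         # return the modified vector
}

lockdown_func(beta.hat_func, 0.03)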

How can I make a loop skip over inputs that generate warnings?

I'm running a complicated function (multiple imputation with Amelia) over a list of datasets. Every so often, a dataset will trigger a long list of warnings that eventually result in an error. I would like R to give up as soon as the first warning is issued and move on to the next dataset. Here is a minimal working example:
df.list <- list(
  data.frame(1:4),
  data.frame(-1, -2, -4),
  data.frame(10:15)
)

for (df in df.list) {
  ans <- sum(sapply(df, sqrt))
  print(ans)
}
The script issues three warnings about NaNs and then prints:
[1] 6.146264
[1] NaN
[1] 21.1632
I would like it to produce one message, "input 2 failed", and then output only the valid results:
[1] 6.146264
[1] 21.1632
(The function I'm actually running, amelia(), issues warnings for 10 minutes before finally throwing an error, so I would like to cut it off at the first warning.)
What about this: sqrt() can never return -1, so I make tryCatch() return -1 when a warning occurs. The nested lapply() calls are needed to loop over the list elements to calculate the square roots (returned as a list) and then to loop over those results to sum them. A -1 in the result marks a failed calculation, which I can then test for.
result <- unlist(
  lapply(
    lapply(df.list, function(x) tryCatch(sqrt(x), warning = function(w) -1)),
    sum))
failed <- which(result == -1)
result <- result[-failed]
print(paste0("input ", failed, " failed"))
result
> print(paste0("input ", failed, " failed"))
[1] "input 2 failed"
> result
[1] 6.146264 21.163196
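A variant on the same idea, as a sketch (safe_sum_sqrt is an illustrative name, not from the original answer): use NA as the sentinel instead of -1, which is safer if the real computation could legitimately return -1, and note that result[-failed] would drop everything when failed is empty (a classic negative-indexing gotcha), which is another reason to test with is.na(). tryCatch() abandons the expression at the first caught warning, which is exactly the early cut-off the question asks for.
safe_sum_sqrt <- function(x) {
  tryCatch(sum(sapply(x, sqrt)), warning = function(w) NA_real_)
}

result <- vapply(df.list, safe_sum_sqrt, numeric(1))
failed <- which(is.na(result))
if (length(failed) > 0) message("input ", paste(failed, collapse = ", "), " failed")
result[!is.na(result)]
# [1]  6.146264 21.163196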

Calculating distance using latitude and longitude error [duplicate]

When working with R I frequently get the error message "subscript out of bounds". For example:
# Load necessary libraries and data
library(igraph)
library(NetData)
data(kracknets, package = "NetData")

# Reduce dataset to nonzero edges
krack_full_nonzero_edges <- subset(krack_full_data_frame,
                                   (advice_tie > 0 | friendship_tie > 0 | reports_to_tie > 0))

# Convert to graph data frame
krack_full <- graph.data.frame(krack_full_nonzero_edges)

# Set vertex attributes
for (i in V(krack_full)) {
  for (j in names(attributes)) {
    krack_full <- set.vertex.attribute(krack_full, j, index = i, attributes[i + 1, j])
  }
}

# Calculate reachability for each vertex
reachability <- function(g, m) {
  reach_mat = matrix(nrow = vcount(g),
                     ncol = vcount(g))
  for (i in 1:vcount(g)) {
    reach_mat[i, ] = 0
    this_node_reach <- subcomponent(g, (i - 1), mode = m)
    for (j in 1:(length(this_node_reach))) {
      alter = this_node_reach[j] + 1
      reach_mat[i, alter] = 1
    }
  }
  return(reach_mat)
}

reach_full_in <- reachability(krack_full, 'in')
reach_full_in
This generates the following error: Error in reach_mat[i, alter] = 1 : subscript out of bounds.
However, my question is not about this particular piece of code (even though it would be helpful to solve that too); it is more general:
What is the definition of a subscript-out-of-bounds error? What causes it?
Are there any generic ways of approaching this kind of error?
This happens because you are trying to access an array outside its bounds.
I will show you how you can debug such errors.
I set options(error=recover)
I run reach_full_in <- reachability(krack_full, 'in')
I get :
reach_full_in <- reachability(krack_full, 'in')
Error in reach_mat[i, alter] = 1 : subscript out of bounds
Enter a frame number, or 0 to exit
1: reachability(krack_full, "in")
I enter 1 and I get
Called from: top level
I type ls() to see my current variables
1] "*tmp*" "alter" "g"
"i" "j" "m"
"reach_mat" "this_node_reach"
Now I will check the values of my variables and the dimensions of reach_mat:
Browse[1]> i
[1] 1
Browse[1]> j
[1] 21
Browse[1]> alter
[1] 22
Browse[1]> dim(reach_mat)
[1] 21 21
You can see that alter is out of bounds (22 > 21) in the line:
reach_mat[i, alter] = 1
To avoid such errors, personally I do the following:
Try to use the *apply family of functions; they are safer than for loops.
Use seq_along(x) and not 1:n, which becomes the problematic 1:0 when n is 0 (see the short illustration below).
Try to think of a vectorized solution if you can, to avoid mat[i, j] index access.
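A quick illustration of the seq_along point, as a minimal sketch with a hypothetical empty vector x:
x <- integer(0)   # an empty vector, i.e. n = 0

1:length(x)       # 1 0          -> a loop over this runs twice, with invalid indices
seq_along(x)      # integer(0)   -> a loop over this is never entered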
EDIT: vectorize the solution
For example, here I see that you don't use the fact that set.vertex.attribute is vectorized.
You can replace:
# Set vertex attributes
for (i in V(krack_full)) {
  for (j in names(attributes)) {
    krack_full <- set.vertex.attribute(krack_full, j, index = i, attributes[i + 1, j])
  }
}
by this:
## set.vertex.attribute is vectorized!
## no need to loop over the vertices!
for (attr in names(attributes))
  krack_full <<- set.vertex.attribute(krack_full, attr,
                                      value = attributes[, attr])
It just means that either alter > ncol(reach_mat) or i > nrow(reach_mat); in other words, your indices exceed the array's bounds (i is greater than the number of rows, or alter is greater than the number of columns).
Just run the above tests to see what is happening and when.
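A minimal sketch reproducing the error on a small hypothetical matrix, so you can see exactly when R raises it:
m <- matrix(0, nrow = 2, ncol = 2)

m[1, 2] <- 1   # fine: both subscripts are within the 2 x 2 matrix
m[1, 3] <- 1   # error: subscript out of bounds (only 2 columns exist)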
Just an addition to the answers above: another possibility in such cases is that you are referring to an object which, for some reason, is no longer available to your query. For example, you may subset by row names or column names, and you will receive this error message when the requested row or column is no longer part of the data matrix or data frame.
Solution, as a short version of the answers above: find the last row name or column name that still works, and the next object you request should be the one that could not be found.
If you run parallel code such as foreach, convert it to a plain for loop first so that you can troubleshoot it.
In case this helps anybody: I encountered this while using purrr::map() with a function I wrote, which was something like this:
library(dplyr)   # for %>%, filter, mutate and left_join

find_nearby_shops <- function(base_account) {
  states_table %>%
    filter(state == base_account$state) %>%
    left_join(target_locations, by = c('border_states' = 'state')) %>%
    mutate(x_latitude = base_account$latitude,
           x_longitude = base_account$longitude) %>%
    mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
                                                 p2 = cbind(x_longitude, x_latitude)) / 1609.344)
}

nearby_shop_numbers <- base_locations %>%
  split(f = base_locations$id) %>%
  purrr::map_df(find_nearby_shops)
I would sometimes get this error with samples, but most of the time I wouldn't. The root of the problem was that some of the states in the base_locations table (PR) did not exist in states_table, so essentially I had filtered out everything and passed an empty table on to mutate(). The moral of the story is that you may have a data issue and not (just) a code problem, so you may need to clean your data.
Thanks for agstudy and zx8754's answers above for helping with the debug.
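If you want the function to fail fast with a clear message rather than pass an empty table downstream, a small sketch of a guard, under the assumption that the table and column names match the example above, could be:
library(dplyr)   # for %>%, filter, mutate and left_join

find_nearby_shops <- function(base_account) {
  candidates <- states_table %>%
    filter(state == base_account$state)

  # Guard: stop early instead of mutating an empty table
  if (nrow(candidates) == 0) {
    stop("No match in states_table for state: ", base_account$state)
  }

  candidates %>%
    left_join(target_locations, by = c('border_states' = 'state')) %>%
    mutate(x_latitude = base_account$latitude,
           x_longitude = base_account$longitude) %>%
    mutate(dist_miles = geosphere::distHaversine(p1 = cbind(longitude, latitude),
                                                 p2 = cbind(x_longitude, x_latitude)) / 1609.344)
}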
I sometimes encounter the same issue. I can only answer your second bullet point, because I am not as expert in R as I am in other languages. I have found that the standard for loop can produce unexpected results. Say x = 0:
for (i in 1:x) {
  print(i)
}
The output is
[1] 1
[1] 0
Whereas with python, for example
for i in range(x):
    print i
does nothing. The loop is not entered.
I expected that if x = 0 the loop in R would not be entered. However, 1:0 is a valid sequence (it counts down: 1, 0). I have not yet found a good workaround besides wrapping the for loop in an if statement.
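For what it's worth, seq_len() (like the seq_along() mentioned in an earlier answer) gives exactly the behaviour expected from Python's range(). A minimal sketch:
x <- 0

for (i in seq_len(x)) {   # seq_len(0) is integer(0), so the body is never entered
  print(i)
}

for (i in 1:x) {          # 1:0 is c(1, 0), so the body runs twice
  print(i)
}
# [1] 1
# [1] 0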
This came from Stanford's free SNA tutorial,
which states that ...
# Reachability can only be computed on one vertex at a time. To
# get graph-wide statistics, change the value of "vertex"
# manually or write a for loop. (Remember that, unlike R objects,
# igraph objects are numbered from 0.)
OK, so whenever using igraph (as in that tutorial), the first row/column is numbered 0 rather than 1, but an R matrix starts at 1, so for any calculation involving igraph indices you would need x - 1, as shown in:
this_node_reach <- subcomponent(g, (i - 1), mode = m)
but for the alter calculation, there is a typo here
alter = this_node_reach[j] + 1
delete the + 1 and it will work fine.
What did it for me was going back through the code, checking for errors or uncertain changes, and focusing on need-to-have over nice-to-have.

Why does for loop not iterate with (i in 1:1523)?

I have run into errors with my for loop. The code is as follows:
# finding IDs with >5% replicate variance

# initialize vectors
LS1repvariance = NULL
anomalylist = NULL

# open for loop iterating from 1 to end of dataset
for (i in 1:1523) {
  # call replicates, which start off as characters
  charrep1 = widesubdat[i, 2]
  charrep2 = widesubdat[i, 11]
  # convert to numeric
  rep1 = as.numeric(charrep1)
  rep2 = as.numeric(charrep2)
  # calculation
  repvariance = (rep1 - rep2) / ((rep1 + rep2) / 2) * 100
  # if statement for anomalous replicates
  if (abs(repvariance) >= 5)
    anomalylist[i] = widesubdat[i, 0]
}
The error I get says:
Error in if (abs(repvariance) >= 5) anomalylist[i] = widesubdat[i, 0] :
  missing value where TRUE/FALSE needed
I think the error is in the iteration, because it defines i as 336L and it does not read charrep correctly, but I have no idea why. I've written for loops in Python but never in R, and all of the for loop help pages seem to show the same structure. All of the lines that I can run outside of the for loop test out okay.
I've read that if statements also require curly brackets, but IDLE said unexpected "{" when I used them.
You could also drop the loop entirely:
pick <- abs(200 * (widesubdat[, 2] - widesubdat[, 11]) / (widesubdat[, 2] + widesubdat[, 11])) >= 5
anomalylist <- widesubdat[, 1]   # note the comment above about index 0: use column 1, not 0
anomalylist[!pick] <- NA
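One caveat on the vectorized version, since the original error ("missing value where TRUE/FALSE needed") points to repvariance being NA for some row: if as.numeric() produces NAs, pick will contain NAs too, and NAs are not allowed in a logical subscript assignment. A sketch that converts the character columns first and treats NA comparisons as not anomalous (column positions follow the question):
rep1 <- as.numeric(widesubdat[, 2])    # the question says these start out as character
rep2 <- as.numeric(widesubdat[, 11])

ratio <- abs(200 * (rep1 - rep2) / (rep1 + rep2))
pick  <- !is.na(ratio) & ratio >= 5    # rows where the conversion gave NA are not flagged

anomalylist <- widesubdat[, 1]         # column 1, not 0 (R is 1-indexed)
anomalylist[!pick] <- NA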

Recording the values returned by the sample function

I need to create a throw function that returns a random number between 1 and 6 every time the user calls throw(). But I need to record the values the user got in a variable that the user is not aware of. The idea is that I can cross-verify the reported values and the actual values at some later point in time.
I went about it like this
throw <- function() {
  dice <- c(1:6)
  A <- c(0, 0)
  x <- sample(dice, 1)
  fetches <- c(A, x)
  x
}
I thought that calling the fetches variable would get me the required vector, but I get the error "object not found" when calling fetches.
Thanks
I would use an "invisible" environment and pre-allocate for a bunch of values:
# objects with a name starting with . are invisible
.e <- new.env(parent = .GlobalEnv)

# pre-allocate
.e$fetches <- integer(100)

throw <- function() {
  dice <- c(1:6)
  x <- sample(dice, 1)
  # check if the vector has room for new values, grow it if necessary
  if (sum(.e$fetches == 0L) == 0L) .e$fetches <- c(.e$fetches, integer(100))
  .e$fetches[which.min(.e$fetches)] <- x
  return("complete")
}
set.seed(42)
throw()
#[1] "complete"
throw()
#[1] "complete"
.e$fetches[.e$fetches != 0L]
#[1] 6 6
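An alternative sketch uses a closure, so the record lives in the function's enclosing environment instead of a dot-named environment in the global workspace (make_thrower and the access via environment() are illustrative, not part of the original answer):
make_thrower <- function() {
  fetches <- integer(0)            # hidden inside the closure's environment
  function() {
    x <- sample(1:6, 1)
    fetches <<- c(fetches, x)      # record the throw without exposing the vector
    x
  }
}

throw <- make_thrower()
throw()
throw()

# later, to cross-check what was actually thrown:
environment(throw)$fetches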
