RStudio doesn't find objects in my function

I'm new to programming and am currently writing a function that goes through hundreds of CSV files in the working directory.
The files contain many NA values.
The function (which I call corr) has two parameters: the directory and a threshold value (a numeric vector of length 1 indicating the number of complete cases).
The purpose of the function is to take the complete cases for two columns, sulfate and nitrate (the second and third columns in the spreadsheet), and calculate the correlation between them if the number of complete cases is greater than the threshold parameter.
The function should return a vector with the correlations for the files that meet the threshold requirement (the default threshold value is 0).
When I run the code I get one of the following two results:
1. A + sign in the console
OR
2. The objects I created in the function can't be found.
Any help would be much appreciated. Thank you in advance!
corr <- function(directory, threshold=0){
  filelist2 <- data.frame(list.files(path=directory,
                                     pattern=".csv", full.names=TRUE))
  corvector <- numeric()
  for(i in 1:length(filelist2)){
    data <- data.frame(read.csv(filelist2[i]))
    removedNA <- complete.cases(data)
    newdata <- data[removedNA,2:3]
    if(nrow(removedNA) > threshold){
      corvector <- c(corvector, cor(data$sulfate, data$nitrate ))
    }
  }
  corvector
}

I don't think your nrow(removedNA) does what you think it does. To replicate the example I use the mtcars dataset.
data <- mtcars # create dataset
data[2:4, 2] <- NA # create some missings in column 2
data[15:17, 3] <- NA # create some missing in column 3
removedNA <- complete.cases(data)
table(removedNA) # 6 rows with missing values indeed
nrow(removedNA) # NULL: removedNA is a logical vector, not a data.frame, so nrow() doesn't work
newdata <- data[removedNA, 2:3] # this works though
nrow(newdata) # and this shows the rows in 'newdata'
#---- therefore instead of nrow(removedNA) try
if(nrow(data)-nrow(newdata) < threshold) {
...
}
NB: I changed the > to < in the line with threshold. It depends on whether you want the threshold to be an absolute minimum number of complete rows (in which case you could simply use nrow(newdata) > threshold), or whether you want it to reflect the difference in the number of rows between the original data and the 'new' data.
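Putting the pieces together, here is a minimal sketch of how the whole function could look. It is one possible reading of the question, not the asker's final code: the name corr_sketch, the demo directory and the demo file are made up for illustration, and it assumes every CSV has columns named sulfate and nitrate. It also sidesteps two other likely trouble spots in the original: wrapping list.files() in data.frame() means length(filelist2) counts columns rather than files, and a + prompt in the console usually just means R is still waiting for the rest of an incomplete expression.

```r
# Sketch only: corr_sketch is a hypothetical name, not the asker's function.
corr_sketch <- function(directory, threshold = 0) {
  # Keep the file names as a plain character vector
  files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  correlations <- numeric()
  for (file in files) {
    data <- read.csv(file)
    # Keep only the rows without any NA
    complete <- data[complete.cases(data), ]
    # nrow() works here because 'complete' is a data.frame
    if (nrow(complete) > threshold) {
      correlations <- c(correlations, cor(complete$sulfate, complete$nitrate))
    }
  }
  correlations
}

# Small made-up demo file in a temporary directory
demo_dir <- file.path(tempdir(), "corr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = 1:5,
                     sulfate = c(1, 2, NA, 4, 5),
                     nitrate = c(2, 4, 6, NA, 10)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)

result <- corr_sketch(demo_dir, threshold = 2)   # 3 complete rows > 2
no_result <- corr_sketch(demo_dir, threshold = 3) # 3 complete rows is not > 3
```

The correlation is computed on the filtered rows (complete), not on the original data, so the NA rows cannot leak back into cor().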

R - Looping with while always results in missing value where TRUE/FALSE is expected

EDIT: I implemented the solutions offered so far, and the code looks much cleaner now. This was the key to finally finding my error: a logical condition that I didn't check within the while loop. The iterator could exceed the number of elements in the vector and thus pass an NA to the while condition. Thanks!
I also changed the solution to store the results in vectors and recombine them after the for loop, as vector indexing seems to be much faster than data.table indexing and value assignment within the loop.
Please let me apologize first for any errors or missing information, as this is my first post. I have already read that this error can occur whenever there is an error in a computation and the condition then fails to evaluate to TRUE or FALSE, such as
if (TRUE & sqrt(-1))
It's been days and I am still getting this error. It really gives me a headache, because the logic behind the code is actually pretty straightforward and I still can't formalize it properly. It goes as follows. For each unique bond ID in a vector of size N (looped through with i), compare the static value of its maturity to the end dates of 7 periods, each with a distinct set of rules (looped through with k), to determine which periods the issue falls into. Then loop through the size thresholds of those periods (looped through with l) to find out whether the issue has violated any of the minimum size requirements. If a violation is found, I can assign the date of the violation. If l == k, I can conclude that the issue has passed the size requirement checks for all periods its maturity falls into and so hasn't violated any rules. I then store the result of the checks as a binary value in a new data.table column, together with the violation date. So far, I really can't determine what is causing this error.
My data looks as follows. I have a pretty large data.table containing bond issue identifiers and various other columns that describe those issues. It was initially imported with the read_dta() function and then converted to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I set up the eligibility criteria of an index as follows.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I write a loop that performs conditional checks to find out, for each issue ID stored in issue_IDs.vec, whether it violates the index eligibility criterion of the minimum issuance size during its maturity. I do this by using the value of the iterator variable i as a position in issue_IDs.vec.
# (3) Looping through a set of conditional check to find out if and if so when a particular issue violated the size requirement ---
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of isssues.dt extracted as vector
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
  id <- issue_IDs.vec[i]
  maturity <- maturity.vec[i]
  offering_atm <- issue_IDs.vec[i]
  k <- 1
  maturity_comp <- size_req_val_per.vec[k]
  while (maturity >= maturity_comp) {
    if (k < 7) {
      k <- k + 1
      maturity_comp <- size_req_val_per.vec[k]
    } else {
      break
    }
  }
  l <- 1
  offering_size_comp <- size_req_thresh.vec[l]
  for (l in 1:k) {
    if (offering_atm >= offering_size_comp) {
      offering_size_comp <- size_req_thresh.vec[l]
      next
    } else {}
  }
  if (l == k) {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 0,
                   SIZE_REQ_VIOLATION_DATE = NA)]
  } else {
    issues.dt[ISSUE_ID == id,
              `:=`(SIZE_REQ_VIOLATION = 1,
                   SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
  }
  i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (l in 1:7) {
  print(maturity >= maturity_comp)
  k <- k + 1
  maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smoothly and always prints TRUE or FALSE, depending on which ID I initially use to create the static maturity of the particular bond issue. At this stage, I have exhausted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl<-as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
[1] "Date"
##Not your preferred format, but it is a Date variable##
Exmpl
[1] "2020-08-25"
##Formatting changes it to a character
Exmpl2<-format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
[1] "character"
When you use them in the while() condition, R tries to make a comparison to decide whether the condition (i.e., maturity is greater than or equal to maturity_comp) is TRUE or FALSE (a logical value). Because you have character variables, R cannot make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them as Date values in the YYYY-mm-dd format.
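The cause named in the question's EDIT (the iterator running past the end of the vector and handing an NA to while) can be reproduced in a few lines. The following is a small sketch with made-up values: periods and maturity are illustrative stand-ins for size_req_val_per.vec and one bond's maturity, not the asker's data. It also shows that format() turns a Date into a character string, which is the point made in the answer above.

```r
# Made-up stand-ins for the period end dates and one bond's maturity
periods  <- as.Date(c("1992-01-01", "1994-01-01", "1999-07-01"))
maturity <- as.Date("2030-01-01")

# format() returns a character string, no longer a Date
formatted_class <- class(format(periods[1], "%Y-%m-%d"))

# Indexing one past the end of a vector gives NA, so the comparison is NA too;
# while (NA) is what triggers "missing value where TRUE/FALSE needed"
past_end   <- periods[length(periods) + 1]
comparison <- maturity >= past_end

# Guarded loop: check the index first; && stops before the NA is ever compared
k <- 1
while (k <= length(periods) && maturity >= periods[k]) {
  k <- k + 1
}
```

Because && short-circuits, the date comparison is skipped as soon as k is out of range, so the loop ends cleanly even when the maturity lies beyond the last period.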

How do you create a function that row reduces a matrix in R?

So far I've tried the following code, but it doesn't work in RStudio; it just hangs.
Am I doing something wrong? This is my first real R code project so I'd love suggestions!
new.rref <- function(M,fractions=FALSE)
{
#M is a matrix.
#Require numeric matrices.
if ((!is.matrix(M)) || (!is.numeric(M)))
stop("Sorry pal! Data not a numeric matrix.")
#Specify and differentiate between rows and columns.
r=nrow(M)
c=ncol(M)
#Now establish a continuous loop (*needed help on this one)
#According to the help documents I've read, this has to do with a
#computerized version of the Gaussian Reducing Algorithm
#While 1<r and 1<c, must set first column entries in which
#1:r < 1 equal to zero. This while loop is used to loop the
#algorithm until a specific condition is met -- in this case,
#until elements in the first column to which 1:r < 1
#are set to zero.
while((1<=r) & (1<=c))
new <- M[,1]
new[1:r < y.position] <- 0
# Now here's the fun part :)
#We need to find the maximum leading coefficient that lies
#at or below the current row.
new1 <- which.max(abs(new))
#We will assign these values to the vector "LC"
LC <- col[which]
#Now we need to allow for row exchange!
#Basically tells R that M[c(A,B),] = M[c(B,A),].
if (which > 1) { M[c(1,which),]<-A[c(which,1),] }
#Now we have to allow for the pivot, "sweep", and restoration
#of current row. I totally didn't know how to do this so I
#used and changed some code from different documentations.
#PIVOT (friends reference)
M[1,]<-M[1,]/LC
new2 <-M[1,]
#CLEAN
M <- M - outer(M[,x.position],new2)
#RESTORE
A[1,]<-new2
#Last, but certainly not least, we're going to round the matrix
#off to a certain value. I might have done this wrong.
round(M)
return(M)
print(M)
}
Edit: I added the first line, for some reason it got deleted.
Edit 2: Say you have a matrix M=matrix(c(2,3,4,7), nrow=2, ncol=2, byrow=TRUE); new.rref(M) needs to produce the reduced row echelon form of matrix M. I already did the math; new.rref(M) should be equal to matrix(c(1,0,0,1), nrow=2, ncol=2, byrow=T
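For comparison, here is a minimal sketch of Gauss-Jordan elimination with partial pivoting that terminates. It is a sketch under assumptions, not a repair of the code above: the name rref_sketch and the tol argument are made up for illustration. One likely reason the original hangs is that the while statement has no braces, so its body is only the single line new <- M[,1], and since r and c never change the condition stays TRUE forever. A for loop over the columns avoids that problem.

```r
# Sketch only: rref_sketch and tol are illustrative names, not from the question.
rref_sketch <- function(M, tol = 1e-10) {
  if (!is.matrix(M) || !is.numeric(M)) stop("M must be a numeric matrix.")
  n_rows <- nrow(M)
  pivot_row <- 1
  for (j in seq_len(ncol(M))) {
    if (pivot_row > n_rows) break
    # Partial pivoting: largest absolute entry at or below the current pivot row
    best <- which.max(abs(M[pivot_row:n_rows, j])) + pivot_row - 1
    if (abs(M[best, j]) < tol) next        # no usable pivot in this column
    # Row exchange
    if (best != pivot_row) {
      M[c(pivot_row, best), ] <- M[c(best, pivot_row), ]
    }
    # Pivot: scale the pivot row so its leading entry is 1
    M[pivot_row, ] <- M[pivot_row, ] / M[pivot_row, j]
    # Sweep: subtract multiples of the pivot row from every row,
    # then restore the pivot row itself (it was zeroed by the sweep)
    pivot_values <- M[pivot_row, ]
    M <- M - outer(M[, j], pivot_values)
    M[pivot_row, ] <- pivot_values
    pivot_row <- pivot_row + 1
  }
  M
}

M <- matrix(c(2, 3, 4, 7), nrow = 2, ncol = 2, byrow = TRUE)
reduced <- rref_sketch(M)
```

For the matrix from Edit 2, reduced should match the 2 x 2 identity matrix up to floating-point tolerance, which can be checked with all.equal(reduced, diag(2)).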

Need help in interpreting a warning in R

I wrote a function in R, which parses arguments from a dataframe, and outputs the old dataframe + a new column with stats from each row.
I get the following warning:
Warning message:
In `[[.data.frame`(xx, sxx[j]) :
named arguments other than 'exact' are discouraged
I am not sure what this means, to be honest. I did spot checks on the results and they seem OK to me.
The function itself is quite long, I will post it if needed to better answer the question.
Edit:
This is a sample dataframe:
my_df<- data.frame('ALT'= c('A,C', 'A,G'),
'Sample1'= c('1/1:35,3,0,35,3,35:1:1:0:0,1,0', './.:0,0,0,0,0,0:0:0:0:0,0,0'),
'Sample2'= c('2/2:188,188,188,33,33,0:11:11:0:0,0,11', '1/1:255,99,0,255,99,255:33:33:0:0,33,0'),
'Sample3'= c('1/1:219,69,0,219,69,219:23:23:0:0,23,0', '0/1:36,0,78,48,87,120:7:3:0:4,3,0'))
And this is the function:
multi_allelic_filter_v2 <- function(in_vcf, start_col, end_col, threshold = 1) {
  # Input: must have gone through biallelic_assessment first
  table0 <- in_vcf
  # ALT_alleles is the number of alt alleles with coverage > threshold across samples
  # The following function calculates coverage across samples for a single allele
  single_allele_tot_cov_count <- function(list_of_unparsed_cov, allele_pos) {
    single_allele_coverage_count <- 0
    for (i in 1:length(list_of_unparsed_cov)) { # i is each group of coverages/sample
      single_allele_coverage_count <- single_allele_coverage_count +
        as.numeric(strsplit(as.character(list_of_unparsed_cov[i]),
                            split = ',')[[1]])[allele_pos]
    }
    return(single_allele_coverage_count)
  }
  # Single-row function
  # Now we need to iterate over each ALT allele in the row
  single_row_assessment <- function(single_row) {
    # No. of alternative alleles over threshold
    alt_alleles0 <- 0
    if (single_row$is_biallelic == TRUE) {
      alt_alleles0 <- 1
    } else {
      alt_coverages <- numeric() # coverages across samples of each ALT allele
      altcovs_unparsed <- character() # unparsed coverages from each sample
      for (i in start_col:end_col) {
        # Let's fill altcovs_unparsed
        altcovs_unparsed <- c(altcovs_unparsed,
                              strsplit(x = as.character(single_row[1, i]), split = ':')[[1]][6])
      }
      # Now let's calculate alt_coverages
      for (i in 1:lengths(strsplit(as.character(single_row$ALT), ',', fixed = TRUE))) {
        alt_coverages <- c(alt_coverages, single_allele_tot_cov_count(
          list_of_unparsed_cov = altcovs_unparsed, allele_pos = i + 1))
      }
      # Now, let's see how many ALT alleles are over threshold
      alt_alleles0 <- sum(alt_coverages > threshold)
    }
    return(alt_alleles0)
  }
  # Now, let's iterate across each row:
  # ALT_alleles is no. of alt alleles with coverage > threshold across samples
  table0$ALT_alleles <- -99 # Just as a marker, to make sure function works
  for (i in 1:nrow(table0)) {
    table0[i, 'ALT_alleles'] <- single_row_assessment(single_row = table0[i, ])
  }
  # Now we know how many ALT alleles with coverage > threshold are in each SNP
  return(table0)
}
Basically, in the following line:
'1/1:219,69,0,219,69,219:23:23:0:0,23,0'
fields are separated by ":", and I am interested in the last two numbers of the last field (23 and 0); in each row I want to sum all the numbers in those positions (two separate sums), and output how many of the "sums" are over a threshold. Hope it makes sense...
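The parsing step described here can be sketched in a vectorized way, without the nested loops. This is only an illustration for the first row of the sample data frame: the variable names (samples, last_field, coverage, alt_sums, n_over) are made up, and it assumes every sample string has six ":"-separated fields with three comma-separated numbers in the last one.

```r
# The three sample strings from the first row of my_df
samples <- c("1/1:35,3,0,35,3,35:1:1:0:0,1,0",
             "2/2:188,188,188,33,33,0:11:11:0:0,0,11",
             "1/1:219,69,0,219,69,219:23:23:0:0,23,0")
threshold <- 1

# Sixth ":"-separated field of each sample, e.g. "0,23,0"
last_field <- vapply(strsplit(samples, ":", fixed = TRUE),
                     function(fields) fields[6], character(1))

# One row per sample, one column per number in that field
coverage <- do.call(rbind, lapply(strsplit(last_field, ",", fixed = TRUE),
                                  as.numeric))

# Drop the first number and sum the last two positions across samples
alt_sums <- colSums(coverage[, -1, drop = FALSE])

# How many of the sums are over the threshold
n_over <- sum(alt_sums > threshold)
```

Because every step works on whole vectors, the same lines could be wrapped in a function and applied to each row of the data frame.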
OK, I re-ran the script with the same dataset on the same computer (same project, then a new project), then ran it again on a different computer, and could not reproduce the warning in any case. I am not sure what happened, and the results seem correct. Never mind. Thanks anyway for the comments and advice.

Appending cor() values to vector while looping through files in R 3.1.1

I am trying to write a function that goes through a directory of 332 files. Each file contains a data frame of air pollution values. Two columns are nitrate and sulfate.
As I go through each file, I need to determine whether it meets a threshold criterion for complete cases. Basically, the code should omit any row with N/A or missing values. If the file does meet the criterion, I need to calculate the correlation between nitrate and sulfate using the cor() function.
This is my pseudocode:
corr <- function(directory, threshold = 0){
  setwd(paste("C:/...", directory, sep = "/"))
  # Writing from here on works fine
  correlations <- numeric()
  files <- list.files(getwd(), full.names = TRUE)
  for (i in 1:332) {
    read_file <- read.csv(files[i])
    complete_observation <- read_file[complete.cases(read_file), ]
    number_complete <- nrow(complete_observation)
    if (number_complete >= threshold) {
      attach(complete_observation)
      correlations <- c(correlations, cor(sulfate, nitrate))
      detach(complete_observation)
    }
  }
}
When I write this in manually (starting from the comment), it works fine and I get the results I am looking for, which is a correlation vector of a certain length.
However, when I try it by calling the function I created (starting from the beginning of the code), I get a NULL vector. I noticed that when I nested a return (correlation) after appending the cor data it returned a single value, which indicates that it is calculating a correlation value as it meets the criteria and is storing it at the end, but once it exits the loop it is re-initialized to 0.
This baffles me because it did not behave this way when it was typed in manually: this behavior only occurs when I run the function. There should be no difference in the behavior of the function between both cases. How can I fix this?
I was able to get my function to work the way I wanted by placing a return statement after the closing brace of the for loop, so that the vector is returned once the loop has finished:
corr <- function(...){
  correlation <- numeric()
  for(...){
    ...
    if (num_com >= threshold){
      correlation <- c(correlation, cor(data$nitrate, data$sulfate))
    }
  }
  return(correlation)
}
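The reason the function returned NULL before the fix is that an R function returns the value of its last evaluated expression, and a for loop itself evaluates to NULL (invisibly). The toy functions below are made up purely to show the difference; they have nothing to do with the pollution data.

```r
# Last expression is the for loop, so the function returns NULL
without_return <- function(values) {
  doubled <- numeric()
  for (v in values) {
    doubled <- c(doubled, v * 2)
  }
}

# Last expression is the accumulated vector, so that is what comes back
with_return <- function(values) {
  doubled <- numeric()
  for (v in values) {
    doubled <- c(doubled, v * 2)
  }
  doubled
}

lost <- without_return(1:3)
kept <- with_return(1:3)
```

Typing the same lines at the console appears to work because the accumulator stays in the global environment, whereas inside a function it is discarded unless it is returned.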

How to change values in raster inside loop

I have aerial images of forests for which I need to compute a fragmentation index. I know how to do it for an individual image, but I want to use a loop because there are a lot of them.
# required libraries
library(raster)
library(SDMTools)
The desired index value is element number 11. But before I extract this value I need to replace all values with 1 (the original range is 1-100).
# Individual raster can be done like this:
x <- raster(forest_cov[1])
x[x > 0] = 1
PatchStat(x)[11]
# I have tried this loop but it is not working
rast<-numeric(41)
for (i in 1:41) {
rast[i] <- PatchStat(raster(forest_cov[i][forest_cov[i] > 0 == 1]))[11]
}
The problem is that I do not know how to replace all values in raster to 1 (inside code). What am I doing wrong?
To work out why your code isn't returning the expected result, you should probably run chunks of your code from the inside out. For example, does forest_cov[i] > 0 == 1 return what you expect it to return for raster 1? (I suspect not, since according to your comments, forest_cov is a character vector and so the logical comparison of element i to 0 is not sensible.) But, if so, does forest_cov[i][forest_cov[i] > 0 == 1] return what you expect, and so on.
Here's how I would approach the problem.
Prepare some fake data:
# Write out three fake rasters to temp files
writeRaster(stack(replicate(3, raster(matrix(runif(100), nc=10)))),
{f <- tempfile()}, bylayer=TRUE, format='ascii')
# Filenames of these fake rasters
rasters <- paste0(f, '_', 1:3, '.asc')
Calculate the frac.dim.index (i.e. 11th element of PatchStat result) of each raster:
sapply(rasters, function(x) {
require(SDMTools)
PatchStat(raster(x) >= 0.1)[11]
})
Alternatively, if all the rasters referred to in your character vector have consistent extent and dimensions, then you can perform the operation on a stack as follows:
s <- stack(rasters) >= 0.1
sapply(seq_len(nlayers(s)), function(i) PatchStat(s[[i]])[11])
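The general pattern (binarize each object first, then compute the statistic) can be tried out in base R without any raster files. In the sketch below, plain matrices stand in for the rasters and mean() is only a hypothetical placeholder for PatchStat(x)[11]; none of these names come from the raster or SDMTools packages.

```r
# Plain matrices as stand-ins for rasters (values 0-100), made up for illustration
set.seed(1)
fake_rasters <- replicate(3, matrix(sample(0:100, 25, replace = TRUE), nrow = 5),
                          simplify = FALSE)

# Replace every value greater than 0 with 1, as in x[x > 0] = 1 from the question
binarize <- function(m) {
  m[m > 0] <- 1
  m
}

# Hypothetical placeholder for PatchStat(x)[11]
patch_stat_placeholder <- function(m) mean(m)

# One result per stand-in raster
results <- vapply(fake_rasters,
                  function(m) patch_stat_placeholder(binarize(m)),
                  numeric(1))
```

The key point is that the replacement happens on the object itself inside the function, not on the character vector of file names.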
