I'm working with a large dataset (3.5M lines and 40 columns) and I need to clean out some values so that I can calculate other parameters that are necessary when I start formulating a model around the data.
The problem is that the for loops I have been using take forever to run, so I wanted to try the ff package. The data frame is called data and consists of a bunch of customer information for a bank. It was imported from a .csv file. What I need to do is remove all customers (labeled Serial) if their AverageStanding variable is ever negative.
> ffd<-as.ffdf(data)
> lastserial = tail(ffd$Serial,1)
> for(k in 1:lastserial){
+ tempvecWith <- vector()
+ tempvecWith <- ffd[ffd$Serial==k, ]$AverageStanding
+ if(any(tempvecWith < 0)){
+ ffd_clean<- ffd[!ffd$Serial ==k, ]
+ }
+ }
This is the error that I am receiving:
Error in as.hi.integer(x, maxindex = maxindex, dim = dim, vw = vw, pack = pack) :
NAs in as.hi.integer
Any ideas on how I can avoid these errors?
The error comes from this part of your code: ffd[ffd$Serial==k, ]. Namely, ffd$Serial==k returns an ff logical vector, but if you want to index or subset an ff vector or ffdf, you need to supply the index numbers, not a vector of logicals. You can turn your ff vector of logicals into an ff vector of index numbers by using ffwhich from package ffbase.
So for your question, I believe you are looking for this kind of code (not tested, as you did not supply any data).
require(ffbase)
idx <- ffd$AverageStanding < 0
idx <- ffwhich(idx, idx==TRUE)
open(ffd)
serials.with.negative <- ffd$Serial[idx]
serials.with.negative <- unique(serials.with.negative)
ffd$is.customer.with.negative.avgstanding <- ffd$Serial %in% serials.with.negative
idx <- ffd$is.customer.with.negative.avgstanding == FALSE
idx <- ffwhich(idx, idx==TRUE)
open(ffd)
ffd_clean <- ffd[idx, ]
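To make the logic of the steps above easier to follow, here is a minimal sketch of the same idea on an ordinary in-memory data frame. It is not ff code: the toy data frame and its values are made up for illustration, and plain logical indexing is used because base R data frames accept it.

```r
# Toy stand-in for the customer data (hypothetical values, not the real file).
toy <- data.frame(
  Serial = c(1, 1, 2, 2, 3, 3),
  AverageStanding = c(10, -5, 20, 30, 0, 15)
)

# Serials that have at least one negative AverageStanding.
serials_with_negative <- unique(toy$Serial[toy$AverageStanding < 0])

# Keep only the rows of customers that never went negative.
toy_clean <- toy[!(toy$Serial %in% serials_with_negative), ]
print(toy_clean)
```

The point is that no loop over serials is needed: one pass finds the offending serials and one %in% test drops all of their rows.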
EDIT: I implemented the solutions offered so far, and the code looks much cleaner now. This was the key to finally finding my error: a logical condition that I didn't check within the while loop. The iterator could exceed the number of elements in the vector and thus pass an NA to the while condition. Thanks!
I also changed the solution to use vector assignments to store the results and then recombine after the for loop, as vector indexing seems to be way faster than data.table indexing and value assignment within the loop.
Please let me apologize in advance for any errors or missing troubleshooting information, as this is my first post. I have already read that this can happen accidentally whenever there is an error in a computation and the value of a condition results in an error, such as
if (TRUE & sqrt(-1))
It's been days and I am still receiving this error. It really gives me a headache, as the logic behind the code is actually pretty straightforward and I still can't properly formalize it. It goes as follows: for each unique bond ID contained in a vector of size N (looped through with i), compare the static value of its corresponding maturity to the end dates of 7 periods, each with a distinct set of rules (looped through with k), to determine which periods the respective issue falls into. Then loop through the size thresholds of those periods (looped through with l) to find out whether the issue has violated the minimum size requirements. If a violation is found, I can assign the date of the violation. If l == k, I can conclude that the issue has passed the size requirement checks for all periods its maturity falls into and so hasn't violated any rules. I then assign the result of the conditional checks as binary values in a new data.table column, along with the violation date. So far, I really can't determine what is causing this error.
My data looks as follows. I have a pretty large data.table containing bond issue identifiers and various other column variables that describe those issues. It was initially imported with the read_dta() function and then transformed to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I transform the eligibility criteria of an index as follows.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I write a loop that performs conditional checks to find out, for each issue ID stored in issue_IDs.vec, whether it violates the index eligibility criterion of the minimum issuance size during its maturity. I do this by passing the value of the iterator variable i as a position to issue_IDs.vec.
# (3) Looping through a set of conditional checks to find out if, and if so when, a particular issue violated the size requirement ----
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of issues.dt extracted as vector
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
id <- issue_IDs.vec[i]
maturity <- maturity.vec[i]
offering_atm <- offerings_atm.vec[i]
k <- 1
maturity_comp <- size_req_val_per.vec[k]
while (maturity >= maturity_comp) {
if (k < 7) {
k <- k + 1
maturity_comp <- size_req_val_per.vec[k]
} else {
break
}
}
l <- 1
offering_size_comp <- size_req_thresh.vec[l]
for (l in 1:k) {
if (offering_atm >= offering_size_comp) {
offering_size_comp <- size_req_thresh.vec[l]
next
} else {}
}
if (l == k) {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 0,
SIZE_REQ_VIOLATION_DATE = NA)]
} else {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 1,
SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
}
i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (l in 1:7) {
print(maturity >= maturity_comp)
k <- k + 1
maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smoothly and always prints TRUE or FALSE, depending on which ID I initially use to create the corresponding static maturity of the particular bond issue. At this stage, I have already exhausted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl<-as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
[1] "Date"
##Not your preferred format, but it is a Date variable##
Exmpl
"2020-08-25"
##Formatting changes it to a character
Exmpl2<-format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
[1] "character"
When you call them in the while() function, R is trying to make a comparison to decide whether the condition (i.e., maturity is greater than or equal to maturity_comp) is TRUE or FALSE (logical values). Because you have character variables, R cannot make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them in the YYYY-mm-dd format.
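For what it's worth, the asker's edit at the top of the question traces the error to an index running past the end of the vector, which hands an NA to the while condition. Below is a minimal sketch of that failure and of one possible guard. The three dates and the maturity value are made up, and the guard shown is only one way to do it, not necessarily what the asker ended up with.

```r
# Made-up period end dates and a maturity later than all of them.
periods <- as.Date(c("1992-01-01", "1994-01-01", "1999-07-01"))
maturity <- as.Date("2030-01-01")

# Guarding on k first: && short-circuits, so periods[k] is never
# evaluated once k has moved past the last element.
k <- 1
while (k <= length(periods) && maturity >= periods[k]) {
  k <- k + 1
}
print(k)

# Indexing one past the end yields NA, and a comparison with NA is NA.
past_end <- periods[length(periods) + 1]
print(is.na(past_end))

# An NA condition makes while() stop with an error; catch it to show the message.
res <- tryCatch(
  {
    while (maturity >= past_end) {}
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(res)
```

Putting the bounds check on the left of && matters, since the right-hand side is then skipped whenever the index is out of range.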
I am writing a function which takes a directory of data, and reads them in, and (if it reaches the threshold of complete cases), calculates the correlation between two variables in the data ("sulfate" and "nitrate"). I want this to run in a for loop to create a numeric vector of the correlation values (one value for each file in the directory).
However, when I run the code, it only returns the last value.
I am quite new to R (so I may be making simple mistakes) and have the newest version of R installed. Below is the code:
corr <- function(directory, threshold = 0) {
filenames3 <- list.files(directory, pattern = ".csv", full.names = TRUE)
loop_length <- length(filenames3)
correlation_values <- numeric()
for(i in loop_length) {
read_in_data3 <- read.csv(filenames3[i])
complete_boolean <- complete.cases(read_in_data3)
nobs2 <- sum(complete_boolean)
data_rmNA <- read_in_data3[complete_boolean, ]
if(nobs2 > threshold) {
correlation_values <- c(correlation_values,
cor(data_rmNA[["sulfate"]],
data_rmNA[["nitrate"]]))
}
}
correlation_values
}
corr("C:/Users/Danie/OneDrive/Documents/R/specdata")
I have tried specifying the length of the vector e.g. correlation_values <- numeric(length = loop_length). This returns a vector of the right length, but all the values are 0 excluding the last which runs properly. I have looked at similar questions, but still can't find a solution to my problem.
I assume I'm losing information in the loop somewhere (rewriting over a variable or something).
Thanks in advance for any help.
I think you need to say for(i in 1:loop_length) instead of for(i in loop_length).
R will loop over each element in the provided vector, but right now your vector has length 1, which is why only the last value is returned.
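A minimal sketch of the difference, using a made-up length of 3 in place of loop_length. seq_len(n) is used here instead of 1:n as a small extra precaution, since 1:n counts down to c(1, 0) when n is 0 (for example, an empty directory).

```r
n <- 3  # stand-in for loop_length

# for (i in n) iterates over the single value 3.
visited_wrong <- integer()
for (i in n) visited_wrong <- c(visited_wrong, i)

# for (i in seq_len(n)) iterates over 1, 2, 3.
visited_right <- integer()
for (i in seq_len(n)) visited_right <- c(visited_right, i)

print(visited_wrong)
print(visited_right)
```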
I am currently trying to write my first loop for lagged regressions on 30 variables. The variables are labeled rx1, rx2, ..., rx30, and the data frame is called my_num_data.
I have created a loop that looks like this:
z <- zoo(my_num_data)
for (i in 1:30)
{dyn$lm(my_num_data$rx[i] ~ lag(my_num_data$rx[i], 1)
+ lag(my_num_data$rx[i], 2))
}
But I received an error message:
Error in model.frame.default(formula = dyn(my_num_data$rx[i] ~ lag(my_num_data$rx[i], :
invalid type (NULL) for variable 'my_num_data$rx[i]'
Can anyone tell me what the problem is with the loop?
Thanks!
This produces a list, L, whose ith component has the name of the ith column of z and whose content is the regression of the ith column of z on its first two lags. Lag is the same as lag except for a reversal of the sign of argument k.
library(dyn)
z <- zoo(anscombe) # test input using builtin data.frame anscombe
Lag <- function(x, k) lag(x, -k)
L <- lapply(as.list(z), function(x) dyn$lm(x ~ Lag(x, 1:2)))
First problem: I'm pretty sure the function you're looking for is dynlm(), without the $ character. Second, using $rx[i] doesn't concatenate rx and the contents of i; it selects the (single) element in $rx with index i. Try this (I don't have your data, so I can't test it on my machine):
results <- list()
for (i in 1:30) {
results[[i]] <- dynlm(my_num_data[,i] ~ lag(my_num_data[,i], 1)
+ lag(my_num_data[,i], 2))
}
and then list element results[[1]] will be the results from the first regression, and so on.
Note that this assumes your my_num_data data.frame ONLY consists of columns rx1, rx2, etc.
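To see why $rx[i] comes back as NULL, and how a column name can be built from i instead, here is a small sketch on a made-up two-column data frame (a hypothetical stand-in for my_num_data):

```r
# Hypothetical stand-in with two of the rx columns.
df <- data.frame(rx1 = 1:3, rx2 = 4:6)
i <- 2

# There is no column called "rx", so df$rx is NULL, and NULL[i] is still NULL.
print(is.null(df$rx[i]))

# Build the name as a string and use [[ ]] to select the column.
col_name <- paste0("rx", i)
col <- df[[col_name]]
print(col)
```

Selecting by constructed name like this also works when the data frame contains other columns besides the rx ones.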
I am not super familiar with R, but it appears you are trying to increase the index of rx. Is rx a vector with values at different indices?
If not, the solution may be to concatenate a string:
for (i in 1:30){
# build the column name, e.g. "rx1", and select that column by name
varName <- paste0("rx", i)
dyn$lm(my_num_data[[varName]] ~ lag(my_num_data[[varName]], 1)
+ lag(my_num_data[[varName]], 2))
}
Again, I may be way off here, as this is my first post and R is still pretty new to me.
I am trying to write the following loop over an empirical data set where
each ID replicate has a different number of observations for each sample period.
Any suggestions would be greatly appreciated!
a <- unique(bma$ID)
t <- unique(bma$Sample.period)
# empty list to hold the data
dens.data <- vector(mode='list', length = length(a) * length(t))
tank1 <- double(length(a))
index = 0
for (i in 1:length(a)){
for (j in 1:length(t)){
index = index + 1
tank1[index] = a[index] ### building an ID column
temp.tank <- subset(bma, bma$ID == a[i])
time.tank <- subset(temp.tank, temp.tank$Sample.period == t[j])
temp1 <- unique(temp.tank$Sample.period)
temp.tank <- data.frame(temp.tank, temp1)
dens.1 <- density(time.tank$Biomass_.adults_mgC.mm.3, na.rm = T)
# extract the y-values from the pdf function - these need to be separated by each Replicate and Sample Period
dens.data[[index]] <- dens.1$y
}
}
#### extract the data and place into a dataframe
dens.new<- data.frame(dens.data)
dens.new
colnames(dens.new) <- c("Treatment","Sample Period","pdf/density for biomass")
all<- list(dens.new)
all
### create new spreadsheet with all the data from the loop
dens.new.data<- write.csv(dens.new, "New.density.csv") ## export file to excel spreadsheet
Calling dens.new <- data.frame(dens.data) yields the following error message:
Error in data.frame(c(...) :
arguments imply differing number of rows: 512, 0
The loop seems to work for dens.data[[1]] but returns NULL for
dens.data[[>1]]
As there isn't a minimal example, it is difficult for me to guess what the original data.frame looks like. However, as for the error message, it is clear that your for-loop fails to assign values to the list dens.data for indices greater than 1.
My guess is that the index didn't update by index = index + 1. Maybe you could try changing the equal sign = to the standard R assignment operator <- and see whether the whole list is updated.
I heard that using equal sign for assignment may cause some problems in an older version of R, but I'm not sure whether you are facing the same problem. Anyway, using <- to assign a value is always safer and recommended.
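As far as I know, = and <- behave the same for a plain top-level assignment like index = index + 1, so it may be more informative to check directly which list slots ended up empty. A minimal sketch with a made-up three-slot list, where one slot is deliberately left unfilled:

```r
res <- vector(mode = "list", length = 3)
index = 0
for (i in 1:3) {
  index = index + 1                     # `=` updates the counter here
  if (i != 2) res[[index]] <- i * 10    # slot 2 is deliberately left empty
}
print(index)

# Positions of the list elements that are still NULL.
empty_slots <- which(vapply(res, is.null, logical(1)))
print(empty_slots)
```

Running the same vapply() check on dens.data after the loop would show whether only the first slot was filled, which would suggest the loop stopped early rather than that the counter failed to update.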
I'm trying to count the number of missing values for each missing.value of all variables in an SPSS file. I imported the file using the memisc package. Here is my actual code:
library(memisc)
#Takes about 70seconds
escc <- spss.system.file(file.choose(), to.lower=FALSE)
system.time({
esccMiss <- matrix(,length(escc),9)
esccMiss[,1] <- names(escc)
for (i in 1:length(escc)) {
x <- escc[i]
if(length(miss <- missing.values(x)) > 0) {
ifelse(length(miss@range)>0 , vals <- miss@range[1]:(miss@range[1]+3), vals <- miss@filter)
for (j in 1:length(vals)) {
esccMiss[i, 2*j] <- vals[j]
esccMiss[i,2*j+1] <- length(x[x == vals[j]])
}
}
}
})
I'm fairly new to R (which explains the C-like structure of my code) and I realise this is really slow, but I have trouble finding a way to do the same thing with an lapply function in the memisc package.
Forget my other answer, this is much faster:
escc2 <- as.data.set(escc)
system.time(lis <- lapply(escc2,function(x) table(x[which(is.missing(x))])))
Should only take a few seconds now.
Explanation: The original dataset (escc) is of a class that simply does not work in the *apply family since there isn't a method written for it. However, memisc also includes as.data.set, which does work in *apply.
is.missing returns a logical vector flagging the values that are marked as missing.
which finds the indices of those missings and x[] subsets x so you only have those missings.
table puts the values into a table.
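The which / subset / table chain can be tried out without memisc. The sketch below is a base R analogue on a made-up vector: the missing codes are hypothetical, and %in% stands in for is.missing, which needs a memisc item to work on.

```r
x <- c(1, 2, 96, 97, 97, 3, 99)   # made-up responses
missing_codes <- c(96, 97, 99)    # hypothetical codes treated as missing

is_miss <- x %in% missing_codes   # plays the role of is.missing(x)

# Indices of the missings, then the values at those indices, then their counts.
counts <- table(x[which(is_miss)])
print(counts)
```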