Opening csv of specific sequences: NAs come out of nowhere? - r

I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)

You can avoid the loop for creating the filenames
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))

@akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character vector (implicitly converted using paste) while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(is.na(filenames[2:99]))
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
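If you want to keep a loop, one way around this is to index by position only and let sprintf do the zero-padding. This is a minimal sketch, not the original poster's code; the counter name k is just an illustrative choice:

```r
id <- 99:100

# Pre-allocate one slot per requested id, then fill by position (never by name),
# so no intervening NA entries can be created.
filenames <- character(length(id))
for (k in seq_along(id)) {
  filenames[k] <- sprintf("%03d.csv", id[k])
}

filenames
```

Because every assignment targets a position between 1 and length(id), the result has exactly one entry per requested id.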

Another approach, although less efficient than @akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for(i in seq_along(id)){
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)
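Growing a data frame with rbind inside a loop copies it on every iteration. A possible variant, sketched here under the assumption that all files share the same columns, reads everything into a list and binds once. The function name read_ids and its dir argument are hypothetical, not part of the answers above:

```r
# Hypothetical helper: read the CSVs for the given ids from `dir` and bind them once.
read_ids <- function(id, dir = ".") {
  paths <- file.path(dir, sprintf("%03d.csv", id))
  absent <- paths[!file.exists(paths)]
  if (length(absent) > 0) {
    stop("missing file(s): ", paste(absent, collapse = ", "))
  }
  do.call(rbind, lapply(paths, read.csv, header = TRUE))
}

# Usage with two throwaway files written to a temporary directory
tmp <- tempfile("csvdemo")
dir.create(tmp)
write.csv(data.frame(x = 1:2), file.path(tmp, "099.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(tmp, "100.csv"), row.names = FALSE)
dat <- read_ids(99:100, dir = tmp)
```

The file.exists check is optional; it only turns a cryptic read.csv failure into a message that names the missing files.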

Related

Creating multiple columns in data table with a for loop

My data table looks like:
head(data)
Date AI AGI ADI ASI ARI ERI NVRI SRI FRI IRI
1: 1991-09-06 NA 2094.19 NA NA NA NA NA NA NA NA
2: 1991-09-13 NA 2204.94 NA NA NA NA NA NA NA NA
3: 1991-09-20 NA 2339.10 NA NA NA NA NA NA NA NA
4: 1991-09-27 NA 2387.81 NA NA NA NA NA NA NA NA
5: 1991-10-04 NA 2459.94 NA NA NA NA NA NA NA NA
6: 1991-10-11 NA 2571.07 NA NA NA NA NA NA NA NA
Don't worry about the NAs. What I want to do is make a "percentage change" column for each of the columns apart from date.
What I've done so far is:
names_no_date <- unique(names(data))[!unique(names(data)) %in% "Date"]
for (i in names_no_date){
data_ch <- data[, paste0(i, "ch") := i/shift(i, n = 1, type = "lag")-1]}
I get the error:
Error in i/shift(i, n = 1, type = "lag") :
non-numeric argument to binary operator
I'm wondering how I get around this error?
i is a string, so you are trying to divide a string in i/shift(i, n = 1, type = "lag"):
> "AI"/NA
Error in "AI"/NA : non-numeric argument to binary operator
Instead, do
for (i in names_no_date){
data[, paste0(i, "ch") := get(i)/shift(get(i), n = 1, type = "lag")-1]
}
Also see Referring to data.table columns by names saved in variables.
Edit: @Frank writes in the comments that a more concise way to produce OP's output is
data[, paste0(names_no_date, "_pch") := .SD/shift(.SD) - 1, .SDcols=names_no_date]
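For a quick check of the idea on a tiny table, here is a sketch that applies the same lag-and-divide step to each column through lapply over .SD. It assumes the data.table package is installed; the table dt and its values are made up for illustration:

```r
library(data.table)

# Made-up three-row table with one value column
dt <- data.table(
  Date = as.Date("1991-09-06") + 7 * 0:2,
  AGI  = c(100, 110, 121)
)

# Every column except Date gets a "<name>ch" percentage-change column
cols <- setdiff(names(dt), "Date")
dt[, paste0(cols, "ch") := lapply(.SD, function(x) x / shift(x) - 1), .SDcols = cols]

dt
```

The first row of each new column is NA because there is no previous value to compare against.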

Subscript with matrix generated by assign()

I assigned a matrix to a name which varies with j:
j <- 2L
assign(paste0("pca", j,".FAVAR_fcst", sep=""), matrix(ncol=24, nrow=12))
This works very neatly. Then I try to access a column of that matrix
paste0("pca", j,".FAVAR_fcst", sep="")[,2]
and get the following error:
Error in paste0("pca", j, ".FAVAR_fcst", sep = "")[, 2] :
incorrect number of dimensions
I've tried several variations and combinations with cat(), print() and capture.output(), but nothing seems to work. I'm not sure what I have to search exactly for and couldn't find a solution. Can you help me?
You can use get:
get(paste0("pca", j,".FAVAR_fcst", sep="")) # for the matrix
get(paste0("pca", j,".FAVAR_fcst", sep=""))[,2] # for the column
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
Another solution would be to combine eval and as.symbol:
eval(as.symbol(paste0("pca", j,".FAVAR_fcst", sep="")))[,2]
# [1] NA NA NA NA NA NA NA NA NA NA NA NA
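A different technique from get and eval, offered only as a sketch, is to keep the matrices in a named list, so the computed name becomes a list key rather than a variable name. The list name fcst is hypothetical:

```r
j <- 2L
nm <- paste0("pca", j, ".FAVAR_fcst")

# Store the matrix under its computed name in an ordinary list
fcst <- list()
fcst[[nm]] <- matrix(NA_real_, nrow = 12, ncol = 24)

# Ordinary subsetting now works on the looked-up element
col2 <- fcst[[nm]][, 2]
```

This avoids assign and get entirely, and lapply or names(fcst) can then iterate over all stored matrices.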

Replacing each element of any object

Is there any clever way to replace each part of any object with some values (for example NAs)?
Let's take those objects
obj1 <- t.test(1:10)
obj2 <- matrix(1:9, 3)
obj3 <- 1:10
obj4 <- list(a = 1:10, b = letters[1:5], c = as.factor(1:10))
the expected output would be similar to
for (i in 1:length(obj1)) obj1[[i]] <- rep(NA, length(obj1[[i]]))
obj2 <- matrix(rep(NA, 9), 3)
obj3 <- rep(NA, 10)
obj4 <- list(a = rep(NA, 10), b = rep(NA, 5), c = rep(NA, 10))
So no matter if an object is a list, matrix, data.frame, vector etc. each part of the object is to be replaced with NA.
Is there any clever way to do so that does not need multiple loops, checking for object type every time and lots of exceptions (if (is.list(part)) ... etc.)?
You can take advantage of the fact that using an empty extraction index during assignment (i.e., x[] <- NA) replaces all elements with the right-hand side value. In your case, you could do something like this using rapply to attack all elements of all objects:
> rapply(mget(ls()), function(x) x[] <- rep(NA, length(x)), how = "replace")
$obj1
$obj1$statistic
[1] NA
$obj1$parameter
[1] NA
$obj1$p.value
[1] NA
$obj1$conf.int
[1] NA NA
$obj1$estimate
[1] NA
$obj1$null.value
[1] NA
$obj1$alternative
[1] NA
$obj1$method
[1] NA
$obj1$data.name
[1] NA
$obj2
[1] NA NA NA NA NA NA NA NA NA
$obj3
[1] NA NA NA NA NA NA NA NA NA NA
$obj4
$obj4$a
[1] NA NA NA NA NA NA NA NA NA NA
$obj4$b
[1] NA NA NA NA NA
$obj4$c
[1] NA NA NA NA NA NA NA NA NA NA
That's a very simple solution, though. You could probably complicate the function being passed to rapply so that it used S3 method dispatch to identify what class of object it was seeing and possibly return a different data structure (e.g., data.frame or matrix) accordingly, rather than just a vector of NAs.
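To see what the empty index does on its own, here is a small sketch. The helper name na_fill is hypothetical; it returns the modified object, so attributes such as dim survive:

```r
# Hypothetical helper: overwrite every element with NA but keep the object's shape
na_fill <- function(x) {
  x[] <- NA
  x
}

m <- na_fill(matrix(1:9, 3))   # still a 3 x 3 matrix, now all NA
v <- na_fill(1:10)             # still length 10

# rapply with how = "replace" applies the helper to every leaf of a nested list
l <- rapply(list(a = 1:10, b = letters[1:5]), na_fill, how = "replace")
```

Returning x rather than the value of the assignment is what keeps the matrix a matrix.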

Assign value to data frame in R to all elements conditionally

I am trying to assign a value to all cells in a data frame that have a specific value,
using this code:
train_data <- read.csv("train_set.csv",header=TRUE)
train_data[train_data == "<NA>"] <- 0
But it does not work; the values are unchanged. How can I change them? The data in the CSV looks like this:
spec1 spec2
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
SP-0013 SP-0063
SP-0013 SP-0063
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
As @akrun and others have mentioned we may need an actual copy of your data, but give this a shot:
train_data <- read.csv("train_set.csv", header=TRUE, na.strings = c("NA", "<NA>"), stringsAsFactors=FALSE)
train_data[is.na(train_data)] <- 0
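The reason is.na() is the reliable test is that comparing a missing value with == yields NA rather than TRUE. A self-contained sketch using inline text instead of train_set.csv (the three rows are made up to mirror the data shown in the question):

```r
# Inline stand-in for the CSV file; "<NA>" and "NA" are both read as missing
csv_text <- "spec1,spec2\nNA,NA\nSP-0013,SP-0063\n<NA>,NA\n"
train_data <- read.csv(text = csv_text,
                       na.strings = c("NA", "<NA>"),
                       stringsAsFactors = FALSE)

n_missing <- sum(is.na(train_data))   # count missing cells before replacing
train_data[is.na(train_data)] <- 0
```

Note that because the columns are character, the replacement is stored as the string "0", not the number 0.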

R Loop Script to Create Many, Many Variables

I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.
I just learnt this neat little trick from @eddi
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
# assuming that you now need to add more columns from a4 to a200
# first, creating the sequence from 4 to 200
v <- 4:200
# then using that sequence to add the 197 more columns
a[, paste0("a", v) := NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
