Saving numeric data in a file in R

I am generating a huge amount of data and would like to store only selected values during my run, but my code always saves just the last result. For example, the sample code below ends up storing only the final value satisfying my condition. Remember that I have huge data and don't want to accumulate it in a vector or list; I would like to write each value to a file right away. I need your help.
Thanks.
f <- function(x) (x - 1) * (x - 5) * (x - 10)
fileE <- file("E.txt")
for (i in seq(1, 100, 0.1)) {
  if (f(i) > 0 && f(i) < 10)
    writeLines(paste0(i, " ", f(i)), fileE)
}
close(fileE)

The problem is that file("E.txt") creates an unopened connection, so each writeLines() call opens the file, writes one line, and closes it again, truncating what was there before; only the last line survives. One fix is write() with append = TRUE:
unlink("E.txt")
for (i in seq(1, 100, 0.1)) {
  res <- f(i)
  if (res > 0 && res < 10)
    write(x = paste0(i, " ", res), file = "E.txt", append = TRUE)
}
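Alternatively, the original connection-based approach works if the connection is opened for writing once, kept open for the whole loop, and closed only at the end. A minimal sketch:
f <- function(x) (x - 1) * (x - 5) * (x - 10)
con <- file("E.txt", open = "w")  # open once for writing
for (i in seq(1, 100, 0.1)) {
  res <- f(i)
  if (res > 0 && res < 10)
    writeLines(paste0(i, " ", res), con)  # lines accumulate on the open connection
}
close(con)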

While loop nested in for loop R

I am trying to pull out individual subject data in R using a for and a while loop. I would like the loop to pull the data accordingly and save it as its own data file. The issue is that the for loop is not counting Subjects properly and not returning the proper value for b. My while loop works perfectly: I can manually set the value of b, run the loop, and produce the correct data files.
Subjects = (1:2)
r = (1)
ST = (d$onReadyTime) # need to get ST to read not just the first number in onReadyTime for each trial
ST = strsplit(ST, split = "a")
for (b in 1:length(Subjects)) {
  b <- Subjects[r]
  while (r == Subjects[b]) {
    STSubject = (ST[[b]])
    ST2 = (STSubject)
    # ST2 = `colnames<-`(ST2, Subjects[i])
    write.table(ST2, file = paste("ST_Subject_", b, ".csv", sep = ""),
                row.names = FALSE, col.names = TRUE)
    r = r + 1
  }
}
Without a minimal working example, only general guidance can be offered. The pattern below keeps a single counter n, initialized once before both loops (models, meters, glances, and make_line stand in for your own objects):
n <- 0
for (i in seq_along(models)) {
  for (j in seq_along(meters)) {
    n <- n + 1
    glances[[n]] <- make_line(i, j)
  }
}
The key point here, as remarked, is to initialize the counter outside the inner loop.

Random extraction from a list with NO REPLACEMENT

I am wondering how to randomly extract a string from a list in R with NO REPLACEMENT until the list is empty.
To write
sample(x, size=1, replace=FALSE)
is not helping me, since strings are extracted more than once before the list gets empty.
Kind regards
In every iteration one list element is picked, and one value is removed from that element. If the element has only one value left, the whole list element is removed.
x <- list(a = "bla", b = c("ble", "bla"), c = "bli")
while (length(x) > 0) {
  s <- sample(x, size = 1)  # pick one list element
  column <- x[[names(s)]]
  value <- sample(unlist(s, use.names = FALSE), size = 1)
  list_element_without_value <- subset(column, column != value)
  x[[names(s)]] <- if (length(list_element_without_value) == 0) {
    NULL  # assigning NULL removes the emptied element
  } else {
    list_element_without_value
  }
}
sample(x)
You can't use size = 1 on repeated calls and expect sample() to know not to grab values it selected previously. You have to grab all the values you want at one time. This call shuffles your data; then you grab the first element when you need one, the second the next time, and so on.
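A minimal sketch of that pattern, assuming x is the list from the answer above (unlist() flattens it to a character vector first):
shuffled <- sample(unlist(x, use.names = FALSE))  # shuffle once: no value repeats
for (value in shuffled) {
  print(value)  # consume each string exactly once
}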

Why use two variables in a loop in R (code given below)?

I am looking at the DIGRE model code in R and there is a loop as follows:
idx <- 1
for (i in 1:length(drugName)) {
  if (drugName[i] != "Neg_control") {
    cat(idx, ". ", drugName[i], "\n", sep = "")
    idx <- idx + 1
  }
}
My question is: is there a particular reason for using separate variables (i and idx) for the loop and the counter? Wouldn't this loop work fine with just one variable? I am new to R and therefore curious.
The variable idx only gets incremented if drugName isn't "Neg_control". So i indexes all the observations of drugName while idx counts the occurrences. Depending on what the data looks like and what the goal of the function is, this could be done without a loop.
How about this?
controlTF <- drugName != "Neg_control"
idx <- sum(controlTF)
paste0(1:idx, ". ", drugName[controlTF])
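A quick check with made-up data (the drugName values here are invented for illustration):
drugName <- c("DrugA", "Neg_control", "DrugB", "DrugC")
controlTF <- drugName != "Neg_control"
paste0(seq_len(sum(controlTF)), ". ", drugName[controlTF])
# "1. DrugA" "2. DrugB" "3. DrugC"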

Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. But the JSON files are not a simple key-value parse, and they require a lot of pre-processing. The pre-processing is coded and performs reasonably well. The challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data: I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical, and numeric; I do know the mode of each variable.
I thought the best way to avoid R copying the object on every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. But doing this with a data frame is extremely slow; moving to a data.table is much better, but I'm still hoping to reduce the time to minutes instead of hours. See my example below:
library(data.table)

timeMe <- function() {
  set.seed(1)
  names <- paste0("A", seq(1:1200))
  # try with a data frame
  # outdf <- data.frame(matrix(NA, nrow = 100, ncol = 1200, dimnames = list(NULL, names)))
  # try with a data table
  outdf <- data.table(matrix(NA, nrow = 100, ncol = 1200, dimnames = list(NULL, names)))
  for (i in seq(100)) {
    # generate 100 columns (real data is in json)
    sparse.cols <- sample(1200, 100)
    # Each record is coming in as a list
    # Each column is either a character, logical, or numeric
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401) { # logical
        sample(c(TRUE, FALSE), 1)
      } else if (i < 801) { # numeric
        sample(seq(10), 1)
      } else { # character
        sample(LETTERS, 1)
      }
    }) # now we have a list with values to populate
    names(sparse.val) <- paste0("A", sparse.cols)
    # and here is the challenge and what takes a long time.
    # want to assign the ith row and the named column with each value
    for (x in names(sparse.val)) {
      val <- sparse.val[[x]]
      # this is where the bottleneck is.
      # for data frame
      # outdf[i, x] <- val
      # for data table
      outdf[i, x := val]
    }
  }
  outdf
}
I thought the mode of each column might be getting set and reset with each update, but pre-setting each column type didn't help either.
For me, running this example with a data.frame (commented out above) takes around 22 seconds; converting to a data.table brings it down to 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data.table here.
I follow your code, except for the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right while trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of the columns, it's important to generate the correct types up front. Otherwise, when you do DT[, LHS := RHS] and the RHS type is not equal to the LHS type, the RHS will be coerced to the type of the LHS. In your case, all your numeric and character values will be converted to logical, since all columns start out as logical. This is not what you want.
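A tiny illustration of that coercion (the exact warning text varies across data.table versions):
library(data.table)
DT <- data.table(a = rep(NA, 3))  # column 'a' starts out logical
DT[1L, a := 5]    # RHS is numeric, column is logical: RHS is coerced, with a warning
class(DT$a)       # still "logical"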
Creating a matrix won't help here (all columns of a matrix must be the same type), and it's also slow. Instead, I'd do it like this:
rows <- 100L
cols <- 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
  if (i < 401L) rep(NA, rows)              # logical columns 1..400
  else if (i < 801L) rep(NA_real_, rows)   # numeric columns 401..800
  else rep(NA_character_, rows)            # character columns 801..1200
}))
Now we've got the right types set up front. Because these are chained else if branches, each column gets exactly one type: i < 401L covers columns 1-400 (logical), the else if (i < 801L) then covers columns 401-800 (numeric), and everything else (801-1200) is character, matching the values you generate later in the loop.
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for (x in names(sparse.val)) {
  val <- sparse.val[[x]]
  outdf[i, x := val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the column whose name is stored in x. Instead it's (over)writing, on every iteration, a column literally named x. Check your output.
This is not part of the optimisation; it's just to point out what you actually want here:
for (x in names(sparse.val)) {
  val <- sparse.val[[x]]
  outdf[i, (x) := val]
}
Note the ( around x. Now it'll be evaluated, and the value contained in x will name the column to which val is assigned. It's a bit subtle, I understand, but the parentheses are necessary because the plain form DT[, x := val] is reserved for when you actually want to create a column named x.
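A minimal demo of the difference (the toy column names here are invented for illustration):
library(data.table)
DT <- data.table(a = 1:3)
x <- "b"
DT[, x := 0]    # creates a column literally named "x"
DT[, (x) := 0]  # evaluates x: creates (or updates) column "b"
names(DT)       # "a" "x" "b"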
Coming back to the optimisation, the good news is, your time consuming for-loop is simply:
set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
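For reference, set() behaves like DT[i, j := value] but skips the overhead of [.data.table, and j can be a vector of column names with value a matching list of replacements. A small sketch with made-up columns:
library(data.table)
DT <- data.table(a = rep(NA, 3), b = rep(NA_character_, 3))
set(DT, i = 2L, j = c("a", "b"), value = list(TRUE, "x"))  # fill row 2 of both columns by reference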
Putting it all together:
Your final function looks like this:
timeMe2 <- function() {
  set.seed(1L)
  rows <- 100L
  cols <- 1200L
  outdf <- as.data.table(lapply(seq_len(cols), function(i) {
    if (i < 401L) rep(NA, rows)
    else if (i < 801L) rep(NA_real_, rows)
    else rep(NA_character_, rows)
  }))
  setnames(outdf, paste0("A", seq_len(cols)))
  for (i in seq(100)) {
    sparse.cols <- sample(1200L, 100L)
    sparse.val <- lapply(sparse.cols, function(i) {
      if (i < 401L) sample(c(TRUE, FALSE), 1)
      else if (i < 801L) sample(seq(10), 1)
      else sample(LETTERS, 1)
    })
    # sub-assign all 100 columns of row i by reference in one call
    set(outdf, i = i, j = paste0("A", sparse.cols), value = sparse.val)
  }
  outdf
}
By doing this, your solution takes 9.84 seconds on my system, whereas the function above takes 0.34 seconds, a ~29x improvement. I think this is the result you're looking for. Please verify it.
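To reproduce the comparison on your machine (both functions are defined above):
system.time(out1 <- timeMe())   # original: row-by-row := assignment
system.time(out2 <- timeMe2())  # typed columns + set()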
HTH

combine results from loop in one file in R (some results were missing)

I want to combine the results from a for loop into one txt file, and I have written my code based on the suggestion from this link:
combine results from a loop in one file
There is one problem. I am supposed to get 8 results (rows) but I ended up with only 5. Somehow the other results did not get into the file. I think the problem is with the if statement, but I don't know how to fix it.
Here is my code
prob <- c(0.10, 0.20)
for (j in seq(prob)) {
  range <- c(2, 3)
  for (i in seq(range)) {
    sample <- c(10, 20)
    for (k in seq(sample)) {
      data <- Simulation(X = 1, Y = range[i], Z = sample[k], p = prob[j])
      filename <- paste('file', i, 'txt')
      if (j == 1) {
        write.table(data, "Desktop/file2.txt", col.names = TRUE)
      } else {
        write.table(data, "Desktop/file2.txt", append = TRUE, col.names = FALSE)
      }
    }
  }
}
That's because the if (j == 1) check is meant to test whether this is the first time you've written to the file.
If it is the first time, it writes the column names (i.e. X, Y, Z, p) into the file (see the col.names = TRUE?).
If it isn't the first time, it doesn't write the column names, and just appends the data.
Since you have multiple nested loops, that condition won't work so well for you: while j == 1 (i.e. for prob = 0.1), the four inner iterations all take the first branch, which has no append = TRUE, so the file is overwritten each time.
I'd recommend initialising a variable count that counts how many times you've called Simulation, and then changing that line to if (count == 1):
count <- 1
prob <- c(0.10, 0.20)
for (j in seq(prob)) {
  range <- c(2, 3)
  for (i in seq(range)) {
    sample <- c(10, 20)
    for (k in seq(sample)) {
      data <- Simulation(X = 1, Y = range[i], Z = sample[k], p = prob[j])
      if (count == 1) {
        # first write: include the header
        write.table(data, "Desktop/file2.txt", col.names = TRUE)
      } else {
        # subsequent writes: append rows, no header
        write.table(data, "Desktop/file2.txt", append = TRUE, col.names = FALSE)
      }
      # increment count
      count <- count + 1
    }
  }
}
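As an alternative to keeping a counter, you can let the file itself tell you whether this is the first write. A sketch, assuming the same Simulation() call as above:
outfile <- "Desktop/file2.txt"
unlink(outfile)  # before the loops: start fresh so the first write gets the header
# ... inside the innermost loop, in place of the if/else:
write.table(data, outfile,
            append = file.exists(outfile),
            col.names = !file.exists(outfile))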
