Create new columns in data table with for loop - r

Is there a simple way to create columns within a for loop? I know this question has been asked here multiple times and I have tried this solution adjusted to my case
for (i in 1:100) {
eval(parse(text = paste0('a$colum', i, ' <- whatever_you_want_your_column_to_contain')))
}
from one of the posts, but it did not help. I have an existing data table data and I am trying to create columns P_1 to P_30 within a for loop and then assign them NULL (I am just trying to pre-define the columns). I have tried this:
for (i in 1:30) {
eval(parse(text = paste0('data$P_', i, ' <- NULL')))
}
but without any success. Can you please suggest any approach that would work?
A related question - how to refer to those columns in another loop, if I have again column P_i where i is from 1 to 30, how to refer to data$P_i within a loop?
Edit:
I have this data table to make an example:
library(data.table)
customer_id <- c("1","1","1","2","2","2","2","3","3","3")
account_id <- as.character(c(11,11,11,55,55,55,55,38,38,38))
obs_date <- c(as.Date("2017-01-01","%Y-%m-%d"), as.Date("2017-02-01","%Y-%m-%d"), as.Date("2017-03-01","%Y-%m-%d"),
as.Date("2017-12-01","%Y-%m-%d"), as.Date("2018-01-01","%Y-%m-%d"), as.Date("2018-02-01","%Y-%m-%d"),
as.Date("2018-03-01","%Y-%m-%d"), as.Date("2018-04-01","%Y-%m-%d"), as.Date("2018-05-01","%Y-%m-%d"),
as.Date("2018-06-01","%Y-%m-%d"))
variable <- c(87,90,100,120,130,150,12,13,15,14)
data <- data.table(customer_id,account_id,obs_date,variable)
and I found out that the problem is really in assigning that NULL to those columns, because when I am doing this based on the post's advice:
for (i in 1:30) {
eval(parse(text = paste0('data$P_', i, ' <- 1')))
}
it really works; only with NULL instead of 1 does it fail. So it is not bad advice, it just does not work with NULL.

Here's a data.table answer - I think you were close, you just didn't have the right syntax to append a column to a data table:
for (i in 1:30) {
data[, paste0("P_", i) := "whatever_you_want_your_column_to_contain"]
}
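As for why the NULL version fails: assigning NULL is how a column is deleted in R, so it cannot be used to create an empty one. A minimal sketch of pre-defining the columns with a typed NA instead, and then referring to them by name in a second loop. The small table and the `variable * i` fill are made up for illustration only:

```r
library(data.table)

# Hypothetical three-row table standing in for the real data
data <- data.table(customer_id = c("1", "2", "3"), variable = c(87, 120, 13))

new_cols <- paste0("P_", 1:30)

# Pre-define all 30 columns at once; NA_real_ keeps them numeric
data[, (new_cols) := NA_real_]

# Refer to column P_i through its name held in a variable
for (i in 1:30) {
  col <- paste0("P_", i)
  set(data, j = col, value = data[["variable"]] * i)
}
```

Inside `data[...]` the same reference is written `(col) := value`, and for reading a column `data[[col]]` works.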

Related

User defined function - issue with return values

I regularly come up against the issue of how to categorise dataframes from a list of dataframes according to certain values within them (E.g. numeric, factor strings, etc). I am using a simplified version using vectors here.
After writing messy for loops for this task a bunch of times, I am trying to write a function to repeatedly solve the problem. The code below returns a subscripting error (given at the bottom), however I don't think this is a subscripting problem, but to do with my use of return.
As well as fixing this, I would be very grateful for any pointers on whether there are any cleaner / better ways to code this function.
library(plyr)
library(dplyr)
#dummy data
segmentvalues <- c('1_P', '2_B', '3_R', '4_M', '5_D', '6_L')
trialvec <- vector()
for (i in 1:length(segmentvalues)){
for (j in 1:20) {
trialvec[i*j] <- segmentvalues[i]
}
}
#vector categorisation
vcategorise <- function(categories, data) {
#categorises a vector into a list of vectors
#requires plyr and dplyr
assignment <- list()
catlength <- length(categories)
for (i in 1:length(catlength)){
for (j in 1:length(data)) {
if (any(contains(categories[i], ignore.case = TRUE,
as.vector(data[j])))) {
assignment[[i]][j] <- data[j]
}
}
}
return (assignment)
}
result <- vcategorise(categories = segmentvalues, data = trialvec)
Error in *tmp*[[i]] : subscript out of bounds
The statement assignment[[i]][j] <- data[j] first has to read assignment[[i]] and then index into what it finds there. Because assignment starts out as an empty list, there is no element i to read, and that is what raises "subscript out of bounds". In other words, you haven't allocated the list to be the right size.
In any case, I don't think it is necessary for you to allocate a table. You are already using a flat indexing structure in your test data generation, so why not do the same with assignment and then set its dimensions afterwards?
Something like this, perhaps?
vcategorise <- function(categories, data) {
assignment <- vector("list", length = length(data) * length(categories))
n <- length(data)
for (i in 1:length(categories)){
for (j in 1:length(data)) {
assignment[(i-1)*n + j] <-
if (any(contains(categories[i],
ignore.case = TRUE,
as.vector(data[j])))) {
data[j]
} else {
NA
}
}
}
dim(assignment) <- c(length(data), length(categories))
assignment
}
It is not the prettiest code, but without fully understanding what you want to achieve, I don't know how to go further.
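If the goal is simply one vector per category, a loop-free variant is possible in base R. This is only a sketch under that assumption: `vcategorise2` is a hypothetical name, and it uses a case-insensitive substring match via `grepl` instead of dplyr's `contains`:

```r
# Return a named list with one vector per category, holding the
# elements of `data` that contain that category (case-insensitive).
vcategorise2 <- function(categories, data) {
  out <- lapply(categories, function(category) {
    hit <- grepl(tolower(category), tolower(data), fixed = TRUE)
    data[hit]
  })
  names(out) <- categories
  out
}

# Small dummy data: each segment value repeated twice
segmentvalues <- c("1_P", "2_B", "3_R")
trialvec <- rep(segmentvalues, each = 2)
result <- vcategorise2(categories = segmentvalues, data = trialvec)
```

Because the list is built by `lapply`, nothing has to be pre-allocated and no index can run out of bounds.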

Why using two variables in loop in R (code given below)?

I am looking at DIGRE model code in R and there is a loop as follow:
idx <- 1
for (i in 1:length(drugName)) {
if (drugName[i] != "Neg_control") {
cat(idx, ". ", drugName[i], "\n", sep = "")
idx <- idx + 1
}
}
My question is whether there is a particular reason for using separate variables (i and idx) for the loop and the counter. Wouldn't this loop work fine with just one variable? I am new to R and therefore curious.
The variable idx only gets incremented if drugName[i] isn't "Neg_control". So i indexes all the observations of drugName, while idx counts only the occurrences that are actually printed. Depending on what the data looks like and what the goal of the function is, this could be done without a loop.
How about this?
controlTF <- drugName != "Neg_control"
idx <- sum(controlTF)
paste0(seq_len(idx), ". ", drugName[controlTF])
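To see the difference between the two counters, here is the vectorised version run on a made-up drugName vector (the drug names are placeholders, not taken from the DIGRE code):

```r
# Hypothetical input: the second entry is a negative control
drugName <- c("DrugA", "Neg_control", "DrugB", "DrugC")

# TRUE for every entry that should be listed
keep <- drugName != "Neg_control"

# Number the kept entries 1..k, skipping the control
labels <- paste0(seq_len(sum(keep)), ". ", drugName[keep])
cat(labels, sep = "\n")
```

With a single variable, "DrugB" would be printed as number 3, because i also counts the skipped control; idx keeps the printed numbering consecutive.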

Delete data.frame columns and loop through data.frame assignment function

I found the following piece of code here at stackoverflow:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[,!names(x) %in% column.D]
}
return(x)
}
df <- columnFunction(df)
So I wanted to use it for my own purposes, but it did not work out as planned.
What I am trying to achieve is to use it in a for loop or with lapply across multiple data.frames. Among others, I tried:
d.frame1 <- iris
d.frame2 <- cars
l.frames <- c("d.frame1","d.frame2")
for (b in l.frames){
columnFunction(b)
}
but it yields the following error message:
Error in dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res :
$ operator is invalid for atomic vectors
Additionally, I need to be able to loop through that function so that I can iterate over different data.frames.
Last but not least I would need something like:
for (xyz in l.frames){
xyz <- columnFunction(xyz)
}
to automate the saving step.
Does anyone have any idea how I could loop through that function, or how I could change the function so that it performs all those steps and is loopable?
I'm quite new to R, so perhaps I'm missing something obvious.
lapply was designed for this task:
l.frames <- list(d.frame1, d.frame2)
l.frames <- lapply(l.frames, columnFunction)
If you insist on using a for loop:
for (i in seq_along(l.frames)) l.frames[[i]] <- columnFunction(l.frames[[i]])
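The for loop in the question fails because l.frames there holds the names of the data frames as strings, so columnFunction receives "d.frame1" rather than the data frame itself. A sketch of the named-list version follows. To make it runnable without svDialogs it assumes a non-interactive stand-in, `dropColumns`, a hypothetical helper that takes the columns to drop as an argument instead of asking through a dialog:

```r
# Hypothetical stand-in for columnFunction: drops the named columns.
# drop = FALSE keeps the result a data frame even if one column remains.
dropColumns <- function(x, cols) {
  x[, !names(x) %in% cols, drop = FALSE]
}

# A named list keeps track of which result belongs to which data frame
l.frames <- list(d.frame1 = iris, d.frame2 = cars)
l.frames <- lapply(l.frames, dropColumns, cols = c("Species", "dist"))

names(l.frames$d.frame1)
names(l.frames$d.frame2)
```

The results stay in the list and are accessed as `l.frames$d.frame1`, which takes care of the "saving step" without assigning to separate variables.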

Direct update (replace) of sparse data frame is slow and inefficient

I'm attempting to read in a few hundred thousand JSON files and eventually get them into a dplyr object. The JSON files are not a simple key-value parse and require a lot of pre-processing. The pre-processing is coded and is fairly efficient. The challenge I am having is loading each record into a single object (data.table or dplyr object) efficiently.
This is very sparse data, I'll have over 2000 variables that will mostly be missing. Each record will have maybe a hundred variables set. The variables will be a mix of character, logical and numeric, I do know the mode of each variable.
I thought the best way to avoid R copying the object for every update (or adding one row at a time) would be to create an empty data frame and then update the specific fields after they are pulled from the JSON file. Doing this in a data frame is extremely slow. Moving to a data table or dplyr object is much better, but I am still hoping to reduce the run time to minutes instead of hours. See my example below:
timeMe <- function() {
set.seed(1)
names = paste0("A", seq(1:1200))
# try with a data frame
# outdf <- data.frame(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
# try with data table
outdf <- data.table(matrix(NA, nrow=100, ncol=1200, dimnames=list(NULL, names)))
for(i in seq(100)) {
# generate 100 columns (real data is in json)
sparse.cols <- sample(1200, 100)
# Each record is coming in as a list
# Each column is either a character, logical, or numeric
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401) { # logical
sample(c(TRUE, FALSE), 1)
} else if (i < 801) { # numeric
sample(seq(10), 1)
} else { # character
sample(LETTERS, 1)
}
}) # now we have a list with values to populate
names(sparse.val) <- paste0("A", sparse.cols)
# and here is the challenge and what takes a long time.
# want to assign the ith row and the named column with each value
for(x in names(sparse.val)) {
val=sparse.val[[x]]
# this is where the bottleneck is.
# for data frame
# outdf[i, x] <- val
# for data table
outdf[i, x:=val]
}
}
outdf
}
I thought the mode of each column might have been set and reset with each update, but I have also tried this by pre-setting each column type and this didn't help.
For me, running this example with a data.frame (commented out above) takes around 22 seconds, converting to a data.table is 5 seconds. I was hoping someone knew what was going on under the covers and could provide a faster way to populate the data table here.
I follow your code except the part where you construct sparse.val. There are minor errors in the way you assign columns. Don't forget to check that the answer is right in trying to optimise :).
First, the creation of data.table:
Since you say that you already know the type of the columns, it's important to generate the correct type up front. Else, when you do: DT[, LHS := RHS] and RHS type is not equal to LHS, RHS will be coerced to the type of LHS. In your case, all your numeric and character values will be converted to logical, as all columns are logical type. This is not what you want.
Creating a matrix won't help therefore (all columns will be of the same type) + it's also slow. Instead, I'd do it like this:
rows = 100L
cols = 1200L
outdf <- setDT(lapply(seq_len(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 401L && i < 801L) rep(NA_real_, rows)
else rep(NA_character_, rows)
}))
Now we have the right types set. I've written the numeric range out explicitly as i >= 401L && i < 801L. The else-if chain already implies the lower bound, but spelling it out makes the split clear: columns 1 to 400 are logical, 401 to 800 numeric, and 801 to 1200 character, given that you know the type of each column up front.
Second, doing names(.) <-:
The line:
names(sparse.val) <- paste0("A", sparse.cols)
will create a copy and is not really necessary. Therefore we'll delete this line.
Third, the time consuming for-loop:
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, x:=val]
}
is not actually doing what you think it's doing. It's not assigning the values from val to the name assigned to x. Instead it's (over)writing (each time) to a column named x. Check your output.
This is not a part of optimisation. This is just to let you know what you're actually wanting to do here.
for(x in names(sparse.val)) {
val=sparse.val[[x]]
outdf[i, (x) := val]
}
Note the ( around x. Now, it'll be evaluated and the value contained in x will be the column to which val's value will be assigned to. It's a bit subtle, I understand. But, this is necessary because it allows for the possibility to create column x as DT[, x := val] where you actually want val to be assigned to x.
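The difference is easy to check on a tiny table. A minimal sketch, using a made-up one-row data.table with columns A1 and A2:

```r
library(data.table)

DT <- data.table(A1 = NA_real_, A2 = NA_real_)
x <- "A2"

# Without parentheses: creates (or overwrites) a column literally named "x"
DT[1L, x := 5]

# With parentheses: x is evaluated, so the value goes into column A2
DT[1L, (x) := 7]

names(DT)
```

After both calls the table has three columns, A1, A2 and x, with 7 stored in A2 and 5 in the stray column x.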
Coming back to the optimisation, the good news is, your time consuming for-loop is simply:
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
This is where data.table's sub-assign by reference feature comes in handy!
Putting it all together:
Your final function looks like this:
library(data.table)
timeMe2 <- function() {
set.seed(1L)
rows = 100L
cols = 1200L
# pre-allocate every column with its final type
outdf <- as.data.table(lapply(seq_len(cols), function(i) {
if (i < 401L) rep(NA, rows)
else if (i >= 401L && i < 801L) rep(NA_real_, rows)
else rep(NA_character_, rows)
}))
setnames(outdf, paste0("A", seq_len(cols)))
for(i in seq_len(rows)) {
sparse.cols <- sample(cols, 100L)
sparse.val <- lapply(sparse.cols, function(i) {
if(i < 401L) sample(c(TRUE, FALSE), 1)
else if (i >= 401L && i < 801L) sample(seq(10), 1)
else sample(LETTERS, 1)
})
# assign all values of row i in one call, by reference
set(outdf, i=i, j=paste0("A", sparse.cols), value = sparse.val)
}
outdf
}
On my system your solution takes 9.84 seconds, whereas the function above takes 0.34 seconds, which is a ~29x improvement. I think this is the result you're looking for. Please verify it.
HTH

Can a way in R be found out to do permutation for objects

I want to do
am<-0
an<-0
bm<-0
bn<-0
cm<-0
cn<-0
.....
.....
and so on till zn. Is there a way to do it without writing so much?
You can use assign to create variable by name:
for (first in letters[1:3]) {
for (second in letters[13:14]) {
assign(paste(first, second, sep=""), 0)
}
}
Probably a better way would be to use a data frame like this:
df <- data.frame(
name=paste(rep(letters[1:3], each=2), rep(letters[13:14], 3), sep=""),
value=0
)
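If you have already defined am, an, ... as separate variables, one way to collect them into a vector is:
vars <- unlist(mget(ls(pattern='^[a-z](m|n)$')))
Then it is easy to reset every element of the vector while keeping the names (note the empty brackets; a plain vars <- 0 would replace the whole vector with a single 0):
vars[] <- 0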
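If the variables do not exist yet, the whole set from am to zn can also be built directly as one named vector, without creating 52 separate objects first. A sketch, assuming a named numeric vector is an acceptable replacement for individual variables:

```r
# Names in the order am, an, bm, bn, ..., zm, zn
var_names <- paste0(rep(letters, each = 2), c("m", "n"))

# One named vector of zeros instead of 52 separate variables
vars <- setNames(numeric(length(var_names)), var_names)

# Individual entries are read and written by name
vars[["bn"]] <- 3
vars[["bn"]] + vars[["am"]]
```

This keeps the workspace clean and lets you loop over `names(vars)` instead of resorting to `assign` and `get`.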
