How to create columns in a loop? - r

I would like to create some columns with a loop. I am not sure why it is not working. To simplify, let's just assume that I want several columns with missing values.
Below is some of the code I've tried:
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
for(i in varlist){
df <- df %>% mutate(i = NA)
}
I have also tried:
letterseq <- c(LETTERS[1:8])
for(i in letterseq){
df <- df %>% mutate(paste("5000", i, sep = "_"), NA)
}
Or even:
letterseq <- c(LETTERS[1:8])
for(i in letterseq){
df <- df %>% assign(paste("5000", i, sep = "_"), NA)
}
Each attempt gives a different error. By the end of the code, I would like to have 8 new columns named 5000_A, 5000_B, 5000_C, 5000_D, 5000_E, 5000_F, 5000_G, 5000_H.

In base R you can assign each column by name with [[:
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
for(i in varlist){
df[[i]] <- NA
}
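The same assignment can also be written without a loop, because a data frame accepts a character vector of new column names on the left-hand side. As for the first attempt in the question, mutate(i = NA) creates a single column literally named i rather than using the value of i. A minimal sketch, assuming a small stand-in df (the question does not show its data, so the id column here is hypothetical):

```r
# Hypothetical example data; the question does not show df
df <- data.frame(id = 1:3)

# Build the eight names 5000_A ... 5000_H
varlist <- paste("5000", LETTERS[1:8], sep = "_")

# Assign NA to all new columns at once; the names are taken from varlist
df[varlist] <- NA

names(df)
```

Building varlist with paste also avoids typing each name by hand, which is where the duplicated name crept in.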

Here is a solution using the data.table package.
dt[, (varlist) := NA]
For example:
library(data.table)
varlist <- c("5000_A", "5000_B", "5000_C", "5000_D",
"5000_E", "5000_F", "5000_G", "5000_H")
dt <- data.table(A = c(1, 2, 3), B = c("a", "b", "c"))
dt[, (varlist) := NA]
> dt
A B 5000_A 5000_B 5000_C 5000_D 5000_E 5000_F 5000_G 5000_H
1: 1 a NA NA NA NA NA NA NA NA
2: 2 b NA NA NA NA NA NA NA NA
3: 3 c NA NA NA NA NA NA NA NA

Related

parsing quotes out of "NA" strings

My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to find all columns in a dataframe that contain these and convert them into real NAs that are caught by functions like is.na()?
I am using sqldf to query the database.
Reproducible example:
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df = data.frame(vect1,vect2,vect3)
To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. replace would look like:
df <- replace(df, df == "NA", NA)
Another alternative to consider is type.convert. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
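To make the difference between the two approaches concrete, here is a small sketch on the question's example data. It assumes character columns (stringsAsFactors = FALSE) and passes as.is = TRUE to type.convert so strings stay character; both choices are assumptions of this sketch, not part of the original answer.

```r
vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")
df <- data.frame(vect1, vect2, vect3, stringsAsFactors = FALSE)

# Approach 1: swap the "NA" strings for real NA, keeping column types
cleaned <- replace(df, df == "NA", NA)
colSums(is.na(cleaned))

# Approach 2: re-parse every column, so types may change as well
converted <- df
converted[] <- lapply(df, function(x) {
  type.convert(as.character(x), na.strings = "NA", as.is = TRUE)
})
sapply(converted, class)
```

After replace every column is still character, while type.convert turns vect2 into integer and vect3 into logical.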
Here's a performance comparison. The sample data is from @roland's answer.
Here are the functions to test:
funop <- function() {
df[df == "NA"] <- NA
df
}
funr <- function() {
ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind][]
}
funam1 <- function() replace(df, df == "NA", NA)
funam2 <- function() {
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df
}
Here's the benchmarking:
library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287 10
# funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837 10
# funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706 10
# funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258 10
replace would be the same as @roland's approach, which is the same as @jgozal's. However, the type.convert approach would result in different column types.
all.equal(funop(), setDF(funr()))
all.equal(funop(), funam1())
str(funop())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
# $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...
str(funam2())
# 'data.frame': 10000000 obs. of 3 variables:
# $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
# $ vect2: int NA 5 1 NA 1 NA NA 1 NA 5 ...
# $ vect3: logi NA NA NA NA NA NA ...
I found this nice way of doing it from this question:
So for this particular situation it would just be:
df[df=="NA"]<-NA
It took only about 30 seconds with 5 million rows and ~250 variables.
This is slightly faster:
set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df=="NA"]<-NA )
# user system elapsed
#3.601 0.378 3.984
library(data.table)
setDT(df2)
system.time({
#find character and factor columns
ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
#assign by reference
df2[, names(df2)[ind] := lapply(.SD, function(x) {
is.na(x) <- x == "NA"
x
}), .SDcols = ind]
})
# user system elapsed
#2.484 0.190 2.676
all.equal(df, setDF(df2))
#[1] TRUE

Insert an empty column between every column of a dataframe in R

Say you have a dataframe of four columns:
dat <- data.frame(A = rnorm(5), B = rnorm(5), C = rnorm(5), D = rnorm(5))
And you want to insert an empty column between each of the columns in the dataframe, so that the output is:
A A1 B B1 C C1 D D1
1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
How would you do this?
The dataframe I would like to do this operation to has hundreds of columns and so obviously I don't want to type out each column and add them naively like this:
dat$A1 <- NA
dat$B1 <- NA
dat$C1 <- NA
dat$D1 <- NA
dat <- dat[, c("A", "A1", "B", "B1", "C", "C1", "D", "D1")]
Thanks for your help in advance!
You can try
res <- data.frame(dat, dat*NA)[order(rep(names(dat),2))]
res
# A A.1 B B.1 C C.1 D D.1
#1 1.15660588 NA 0.78350197 NA -0.2098506 NA 2.07495662 NA
#2 0.60107853 NA 0.03517539 NA -0.4119263 NA -0.08155673 NA
#3 0.99680981 NA -0.83796981 NA 1.2742644 NA 0.67469277 NA
#4 0.09940946 NA -0.89804952 NA 0.3419173 NA -0.95347049 NA
#5 0.28270734 NA -0.57175554 NA -0.4889045 NA -0.11473839 NA
NOTE: I am leaving the . in the column names as it is a trivial task to remove it.
Or another option is
dat[paste0(names(dat),1)] <- NA
dat[order(names(dat))]
You can try this:
df <- cbind(dat, dat)
df <- df[, sort(names(df))]
df[, seq(2, 8,by=2)] <- NA
names(df) <- sub("\\.", "", names(df))
# create new data frame with twice the number of columns
bigdat <- data.frame(matrix(ncol = dim(dat)[2]*2, nrow = dim(dat)[1]))
# set sequence of target column indices
inds <- seq(1,dim(bigdat)[2],by=2)
# insert values
bigdat[,inds] <- dat
# set column names
colnames(bigdat)[inds] <- colnames(dat)
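The answers that call order() or sort() on the column names only produce the interleaved layout when the names are already in alphabetical order. A position-based sketch avoids that dependency; interleave_na and its suffix argument are hypothetical names introduced here for illustration:

```r
# Interleave an all-NA column after every existing column, by position
interleave_na <- function(dat, suffix = "1") {
  n <- ncol(dat)
  # One blank column per original column, named <original><suffix>
  blanks <- as.data.frame(matrix(NA, nrow = nrow(dat), ncol = n))
  names(blanks) <- paste0(names(dat), suffix)
  combined <- cbind(dat, blanks)
  # rbind() then as.vector() yields 1, n+1, 2, n+2, ...
  idx <- as.vector(rbind(seq_len(n), n + seq_len(n)))
  combined[, idx, drop = FALSE]
}

# Column names deliberately not in alphabetical order
dat <- data.frame(Z = 1:3, A = 4:6, M = 7:9)
res <- interleave_na(dat)
names(res)
```

Because the ordering comes from column positions, the original column order is preserved whatever the names are.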

Merging multiple data frames without getting duplicates

I am trying to merge 6+ datasets into one by ID. Right now, the duplication of IDs makes merge treat each as a new observation.
Example code:
combined <-Reduce(function(x,y) merge(x,y, all=TRUE), list(NRa,NRb,NRc,NRd,NRe,NRf,NRg,NRh))
Which gives me this:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA NA 1 NA
3 62734839 NA NA NA 1 NA NA
4 62737229 NA 1 NA NA NA NA
5 62737229 NA NA NA 1 1 NA
I would like each ID to have a single row:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA 1 1 NA
3 62737229 NA 1 NA 1 1 NA
Any help is appreciated. Thank you.
Using R's sqldf package will work leaving you with one id per row.
Data1 <- data.frame(
X = sample(1:10),
Housing = sample(c("yes", "no"), 10, replace = TRUE)
)
Data2 <- data.frame(
X = sample(1:10),
Credit = sample(c("yes", "no"), 10, replace = TRUE)
)
Data3 <- data.frame(
X = sample(1:10),
OwnsCar = sample(c("yes", "no"), 10, replace = TRUE)
)
Data4 <- data.frame(
X = sample(1:10),
CollegeGrad = sample(c("yes", "no"), 10, replace = TRUE)
)
library(sqldf)
sqldf("Select Data1.X,Data1.Housing,Data2.Credit,Data3.OwnsCar,Data4.CollegeGrad from Data1
inner join Data2 on Data1.X = Data2.X
inner join Data3 on Data1.X = Data3.X
inner join Data4 on Data1.X = Data4.X
")
Why don't you try by='ID' in your merge() function. If that's not enough, try aggregate().
Your description of the problem is not entirely clear, and you don't provide data.
Assuming that all of your dataframes have the same dimensions, column names, column orders, ID entries, that the ID row orders match, that ID is the first column, that all other entries are either NA or 1 and that any cell in one dataframe featuring a 1 has NA values in that cell for all other data frames or that sums of numeric values are acceptable, and that you want the result as a data frame ...
An Old-School solution using the abind package:
consolidate <- function(lst) {
stopifnot(require(abind))
## form 3D array, replace NA
x <- abind(lst, along=3)
x[is.na(x)] <- 0
z <- x[,,1] ## data store
## sum array along 3rd dimension
for (j in seq(2,ncol(x)))
for (i in seq(nrow(x)))
z[i,j] <- sum(x[i,j,])
z[z==0] <- NA ## restore NA
as.data.frame(z)
}
For dataframes (with the above caveats) a,b,c:
consolidate(list(a,b,c))
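The aggregate() suggestion above can be sketched as follows. This assumes the merged result looks like the table in the question and that each ID has at most one non-missing value per column; first_non_na is a hypothetical helper, and the data is retyped from the question's printed output.

```r
# Merged result with duplicated IDs, retyped from the question's output
combined <- data.frame(
  ID        = c(62729107, 62734839, 62734839, 62737229, 62737229),
  Segment.h = rep(NA_real_, 5),
  Segment.g = c(NA, 1, NA, 1, NA),
  Segment.f = rep(NA_real_, 5),
  Segment.e = c(NA, NA, 1, NA, 1),
  Segment.d = c(NA, 1, NA, NA, 1),
  Segment.c = c(1, NA, NA, NA, NA)
)

# Hypothetical helper: first non-missing value, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Collapse to one row per ID, column by column
collapsed <- aggregate(combined[-1], by = list(ID = combined$ID),
                       FUN = first_non_na)
collapsed
```

If an ID can carry several different non-missing values in one column, picking the first one silently discards the rest, so a different summary function would be needed in that case.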

How to create a new column with if condition

This seems simple, but I could not get it to work. It is different from the similar-sounding questions asked here.
I want to create new columns, say df$col1, df$col2 and df$col3, on dataframe df, using conditions on the columns that already exist, i.e. df$con and df$val.
I would like to write the value of df$val in df$col1 if df$con > 3,
the value of df$val in df$col2 if df$con < 2,
and 30% of df$val in df$col3 if df$con is between 1 and 3.
How should I do it? Below is my dataframe df with two columns: "con" for the condition and "val" for the value to use.
dput(df)
structure(list(con = c(-33.09524956, -36.120924, -28.7020053,
-26.06385399, -18.45731163, -14.51817928, -20.1005132, -23.62346403,
-24.90464018, -23.51471516), val = c(0.016808197, 1.821442227,
4.078385886, 3.763593573, 2.617612605, 2.691796601, 1.060565469,
0.416400183, 0.348732675, 1.185505136)), .Names = c("con", "val"
), row.names = c(NA, 10L), class = "data.frame")
This might do it. First we write a function to change FALSE values to NA
foo <- function(x) {
is.na(x) <- x == FALSE
return(x)
}
Then apply it over the list of logical vectors and take the matching val column values
df[paste0("col", 1:3)] <- with(df, {
x <- list(con > 3, con < 2, con < 3 & con > 1)
# col3 should hold 30% of val, so scale each result by its own weight
Map(function(y, w) w * val[foo(y)], x, c(1, 1, 0.3))
})
resulting in
df
con val col1 col2 col3
1 -33.09525 0.0168082 NA 0.0168082 NA
2 -36.12092 1.8214422 NA 1.8214422 NA
3 -28.70201 4.0783859 NA 4.0783859 NA
4 -26.06385 3.7635936 NA 3.7635936 NA
5 -18.45731 2.6176126 NA 2.6176126 NA
6 -14.51818 2.6917966 NA 2.6917966 NA
7 -20.10051 1.0605655 NA 1.0605655 NA
8 -23.62346 0.4164002 NA 0.4164002 NA
9 -24.90464 0.3487327 NA 0.3487327 NA
10 -23.51472 1.1855051 NA 1.1855051 NA
You could go the tidyverse way. The pipe %>% sends the output of each operation to the next function. mutate adds a new column to your data frame, but you have to remember to assign the result (here it is stored as output). ifelse conditionally assigns values to the new column, for example col1: its second argument is the value used where the condition is true, and its third argument is the value used where it is false. Hope this helps!
Go tidyverse!
library(tidyverse)
output <- df %>%
mutate(col1=ifelse(con>3, val, NA)) %>%
mutate(col2=ifelse(con<2, val, NA)) %>%
mutate(col3=ifelse(con<=3 & con>=1, 0.3*val, NA))
Here's a df that actually meets some of the conditions:
structure(list(con = c(-33.09524956, 2.5, -28.7020053, 2, -18.45731163,
2, -20.1005132, 6, -24.90464018, -23.51471516), val = c(0.016808197,
1.821442227, 4.078385886, 3.763593573, 2.617612605, 2.691796601,
1.060565469, 0.416400183, 0.348732675, 1.185505136)), .Names = c("con",
"val"), row.names = c(NA, 10L), class = "data.frame")
Here's the output after running the code:
con val col1 col2 col3
1 -33.09525 0.0168082 NA 0.0168082 NA
2 2.50000 1.8214422 NA NA 0.5464327
3 -28.70201 4.0783859 NA 4.0783859 NA
4 2.00000 3.7635936 NA NA 1.1290781
5 -18.45731 2.6176126 NA 2.6176126 NA
6 2.00000 2.6917966 NA NA 0.8075390
7 -20.10051 1.0605655 NA 1.0605655 NA
8 6.00000 0.4164002 0.4164002 NA NA
9 -24.90464 0.3487327 NA 0.3487327 NA
10 -23.51472 1.1855051 NA 1.1855051 NA
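The same conditions can be written in base R without loading tidyverse. A minimal sketch on a small made-up data frame (the values are hypothetical, chosen so that every condition is hit at least once):

```r
# Hypothetical data covering all three conditions
df <- data.frame(
  con = c(-33.1, 2.5, 2, 6, 1.5),
  val = c(0.5, 1, 2, 3, 4)
)

# Each column is filled where its condition holds, NA elsewhere
df$col1 <- ifelse(df$con > 3, df$val, NA)
df$col2 <- ifelse(df$con < 2, df$val, NA)
df$col3 <- ifelse(df$con >= 1 & df$con <= 3, 0.3 * df$val, NA)

df
```

Note that the conditions as stated in the question overlap: a row with con between 1 and 2 (the last row here) gets a value in both col2 and col3.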

eval(parse(text=x)) inside a function, how to evaluate in global environment?

I'm trying to write a function that generates a vector of strings, each of which is evaluated as an expression in the global environment. The problem is that eval(parse(text=x)) only evaluates inside the function's environment.
As a hypothetical example, say that I want to replace several variables' values with NA, but only if they're below a certain cutoff value.
set.seed(200)
df <- as.data.frame(matrix(runif(25), nrow=5, ncol=5))
df
V1 V2 V3 V4 V5
1 0.5337724 0.83929374 0.4543649 0.3072981 0.46036069
2 0.5837650 0.71160009 0.6492529 0.5667674 0.09874701
3 0.5895783 0.09650122 0.1537271 0.1317879 0.20659381
4 0.6910399 0.52382473 0.6492887 0.9221776 0.92233983
5 0.6673315 0.23535054 0.3832137 0.6463296 0.31942681
cutoff.V1 <- 0.9
cutoff.V2 <- 0.5
cutoff.V3 <- 0.1
cutoff.V4 <- 0.7
cutoff.V5 <- 0.4
Rather than copy-and-pasting the same line over and over, changing the same text in each line...
df$V1[df$V1 < cutoff.V1] <- NA
df$V2[df$V2 < cutoff.V2] <- NA
df$V3[df$V3 < cutoff.V3] <- NA
df$V4[df$V4 < cutoff.V4] <- NA
df$V5[df$V5 < cutoff.V5] <- NA
# ad infinitum...
...I'm trying to have R do it for me:
vars <- c("V1", "V2", "V3", "V4", "V5")
variable.queue <- function(vec, placeholder, command) {
x <- vector()
for(i in 1:length(vec)) {
x[i] <- gsub(placeholder, vec[i], command)
}
return(x)
}
commands <- variable.queue(vars, "foo", "df$foo[df$foo < cutoff.foo] <- NA")
for(i in 1:length(commands)) {eval(parse(text=commands[i]))}
df
V1 V2 V3 V4 V5
1 NA 0.8392937 0.4543649 NA 0.4603607
2 NA 0.7116001 0.6492529 NA NA
3 NA NA 0.1537271 NA NA
4 NA 0.5238247 0.6492887 0.9221776 0.9223398
5 NA NA 0.3832137 NA NA
# FYI the object "commands" is the vector of strings that I want evaluated
commands
[1] "df$V1[df$V1 < cutoff.V1] <- NA" "df$V2[df$V2 < cutoff.V2] <- NA" "df$V3[df$V3 < cutoff.V3] <- NA"
[4] "df$V4[df$V4 < cutoff.V4] <- NA" "df$V5[df$V5 < cutoff.V5] <- NA"
This solution works, but I want to put the last for-loop INSIDE the function. Any ideas?
Edit:
Thanks, Kevin. Here's the "functional" version (bwahaha, I just can't help myself sometimes):
variable.queue <- function(vec, placeholder, command) {
x <- vector()
for(i in 1:length(vec)) {
x[i] <- gsub(placeholder, vec[i], command)
}
for(i in 1:length(x)) {
eval(parse(text=x[i]), envir= .GlobalEnv)
}
}
variable.queue(vars, "foo", "df$foo[df$foo < cutoff.foo] <- NA")
eval has an argument, envir, that allows you to specify the environment in which you want to evaluate your expression. So,
eval(parse(text=commands[i]), envir=.GlobalEnv)
should hopefully work.
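To see what envir changes, here is a small sketch that evaluates a string inside a function but makes the assignment land in a chosen environment. A fresh environment stands in for .GlobalEnv so the example leaves the workspace untouched; run_in and target are hypothetical names.

```r
# A separate environment stands in for the global one in this sketch
target <- new.env()

run_in <- function(code, env) {
  # Without envir, the assignment would happen in this function's own frame
  eval(parse(text = code), envir = env)
  invisible(NULL)
}

run_in("answer <- 6 * 7", target)

# The variable now lives in target, not inside run_in
exists("answer", envir = target, inherits = FALSE)
get("answer", envir = target)
```

Passing .GlobalEnv as env gives the behaviour asked for in the question.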
There has to be a better solution. E.g., for your example, this works:
set.seed(200)
df <- as.data.frame(matrix(runif(25), nrow=5, ncol=5))
cutoff <- c(0.9,0.5,0.1,0.7,0.4)
df[mapply("<", df,cutoff)] <- NA
#or
df[sweep(df,2,cutoff,"<")] <- NA
#or even
df[df < rep(cutoff,each=nrow(df))] <- NA
Which all give:
> df
V1 V2 V3 V4 V5
1 NA 0.8392937 0.4543649 NA 0.4603607
2 NA 0.7116001 0.6492529 NA NA
3 NA NA 0.1537271 NA NA
4 NA 0.5238247 0.6492887 0.9221776 0.9223398
5 NA NA 0.3832137 NA NA
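If the cutoffs need to stay tied to column names rather than column positions, a named vector and a short loop do the same job without building code as strings. A sketch under that assumption; apply_cutoffs is a hypothetical helper, not something from the answers above:

```r
set.seed(200)
df <- as.data.frame(matrix(runif(25), nrow = 5, ncol = 5))

# One cutoff per column, matched by name instead of by position
cutoffs <- c(V1 = 0.9, V2 = 0.5, V3 = 0.1, V4 = 0.7, V5 = 0.4)

apply_cutoffs <- function(data, cutoffs) {
  for (v in names(cutoffs)) {
    # Blank out the values that fall below this column's cutoff
    data[[v]][data[[v]] < cutoffs[[v]]] <- NA
  }
  data
}

df <- apply_cutoffs(df, cutoffs)
df
```

The function returns a modified copy, so the caller decides whether to overwrite df, which sidesteps the question of which environment the change happens in.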
