R: bind columns after applying poly() with lapply()

I want to add columns containing polynomials to a dataframe (DF).
Background: I need to use polynomials in a glmnet setting. I cannot call poly() directly in the glmnet() estimation command; I get an error, likely because my “Xtrain” data contain factors.
My workaround is to slice my Xtrain DF into two pieces, one containing all factors (for which no transformation is needed) and one containing the rest, viz. the numeric columns.
Now I want to add columns with polynomials to my numeric DF.
Here is a minimal example of my problem.
# Some data
x <- 1:10
y <- 11:20
df = as.data.frame(cbind(x,y))
# Looks like this
x y
1 1 11
2 2 12
3 3 13
# Now I generate polys
lapply(df, function(i) poly(i, 2, raw = TRUE)[, 1:2])
However, I cannot figure out how to "cbind" the results. What I want in the end is a DF that contains x, x^2, y, and y^2. Order does not matter. However, ideally I would also have column labels (to identify the polys), for instance like this:
x x2 y y2
1 1 1 11 121
2 2 4 12 144
3 3 9 13 169
Thank you...
Cheers!

Another option is
as.data.frame(lapply(df, function(i) poly(i, 2, raw = TRUE)[, 1:2]))
# x.1 x.2 y.1 y.2
#1 1 1 11 121
#2 2 4 12 144
#3 3 9 13 169
# ...
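If you want names closer to the x, x2 style asked for, a small sketch that just strips the dots from the generated names afterwards:
res <- as.data.frame(lapply(df, function(i) poly(i, 2, raw = TRUE)))
names(res) <- sub("\\.", "", names(res))  # "x.1" -> "x1", "x.2" -> "x2", ...
res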
As mentioned by @gpier and @akrun already, you might use ^ instead of poly:
n <- 2
df[paste(names(df), n, sep = "_")] <- df^n
df
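The same idea extends to several powers at once. A hedged sketch, assuming degrees 2 and 3 are wanted; the original columns are captured first so that later powers don't pick up the newly added ones:
orig <- df[c("x", "y")]  # the original numeric columns
for (n in 2:3) df[paste(names(orig), n, sep = "_")] <- orig^n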

We can use do.call
do.call(cbind, lapply(df, function(i) poly(i, 2, raw = TRUE)[, 1:2]))
If we just need squares
cbind(df, as.matrix(df)^2)

poly is not the right function if you need squares. Try
cbind(df,lapply(df, function(x) x^2))
x y x y
1 1 11 1 121
2 2 12 4 144
3 3 13 9 169
4 4 14 16 196
5 5 15 25 225
6 6 16 36 256
7 7 17 49 289
8 8 18 64 324
9 9 19 81 361
10 10 20 100 400
EDIT: indeed, you don't even need lapply; you could just use cbind(df, df^2).
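If you want distinct names for the squared columns rather than the duplicated ones above, a small sketch along the same lines:
sq <- df^2                           # data.frame of squares
names(sq) <- paste0(names(df), "2")  # x -> x2, y -> y2
cbind(df, sq)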

Related

How to do iterations in R?

I'm working with a dataset that contains the values of the same variables at different points in time. In the example below I have the values of variables a and b at time points 1 and 2.
> set.seed(1)
> data <- data.frame(matrix(sample(16), ncol = 4))
> names(data) <- paste(rep(c("a", "b"), each = 2), 1:2, sep = "")
> data
a1 a2 b1 b2
1 5 3 14 13
2 6 10 1 8
3 9 11 2 4
4 12 15 7 16
Now, suppose I want to calculate a new variable for both time points so that it would contain the sum of a and b (instead of the NAs as in example below). Since my actual dataset contains about 15 different variables and 10 time points (so 150 columns), I want to automate this calculation of 10 new variables.
> data[, paste("ab", 1:2, sep = "")] <- NA
> data
a1 a2 b1 b2 ab1 ab2
1 5 3 14 13 NA NA
2 6 10 1 8 NA NA
3 9 11 2 4 NA NA
4 12 15 7 16 NA NA
I've previously used Stata where I could create a simple 'foreach' loop to do this. Something like below.
foreach t of numlist 1/2 {
generate ab`t' = a`t' + b`t'
}
But I've learned that using loops in R is not feasible, nor have I any idea how to loop over variable names like that in R.
So what would be the correct solution for my problem in R?
This will replicate the same foreach loop you used in Stata.
for(i in 1:2){
data[, paste("ab", i, sep="")] <-
data[,paste("a", i, sep="")] + data[, paste("b", i, sep="")]
}
The output looks like this:
> data
a1 a2 b1 b2 ab1 ab2
1 15 1 16 12 31 13
2 10 7 14 3 24 10
3 2 5 9 4 11 9
4 6 8 13 11 19 19
To do this the R way:
- make use of some native iteration via an *apply function
- use the built-in rowSums (as in @Sotos' answer)
- make use of assignment into the data.frame, that is, `[<-`
All together:
data[paste0('ab', 1:2)] <- sapply(1:2, function(i)
  rowSums(data[paste0(c('a', 'b'), i)]))
data
# a1 a2 b1 b2 ab1 ab2
# 1 5 3 14 13 19 16
# 2 6 10 1 8 7 18
# 3 9 11 2 4 11 15
# 4 12 15 7 16 19 31
P.S. In a program, use vapply instead; you'll need to provide an additional argument specifying the shape of the output, but it's safer and sometimes faster.
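A sketch of the vapply form, where FUN.VALUE declares the expected shape of each result (here one numeric value per row):
data[paste0('ab', 1:2)] <- vapply(1:2,
  function(i) rowSums(data[paste0(c('a', 'b'), i)]),
  FUN.VALUE = numeric(nrow(data)))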
You can do without iteration:
data$ab1 <- data$a1 + data$b1
data$ab2 <- data$a2 + data$b2
or
data <- transform(data, ab1=a1+b1, ab2=a2+b2)
BTW:
It is better not to name an object data because data= is often a parameter in functions.
Here is one way to do it. We iterate over the unique numeric suffixes of the column names and calculate rowSums over the columns whose suffix matches.
sapply(unique(sub('\\D', '', names(data))),
function(i) rowSums(data[,grepl(i, sub('\\D', '', names(data)))]))
# 1 2
#[1,] 17 23
#[2,] 24 22
#[3,] 14 10
#[4,] 15 11
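To attach these sums back to data with ab-style names, a hedged follow-up sketch:
res <- sapply(unique(sub('\\D', '', names(data))),
              function(i) rowSums(data[, grepl(i, sub('\\D', '', names(data)))]))
data[paste0("ab", colnames(res))] <- res  # adds ab1, ab2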

Concatenate data frames together based on similar column values

Specifically, say I had three data frames d1, d2, d3:
d1:
X Y Z value
1 0 20 135 43
2 0 4 105 50
3 5 18 20 10
...
d2:
X Y Z value
1 0 20 135 15
2 0 4 105 14
3 2 9 12 16
...
d3:
X Y Z value
1 0 20 135 29
2 2 9 14 16
...
I want to be able to combine these data frames such that each row of the combined data frame consists of three values, based on all unique X, Y, Z combinations. If such an X, Y, Z combination does not exist in one of the original data frames then I just want it to have a value of null (or some arbitrarily low number if that isn't possible). So I'd want an output of:
dfinal:
X Y Z value1 value2 value3
1 0 20 135 43 15 29
2 0 4 105 50 14 null
3 5 18 20 10 null null
4 2 9 12 null 16 null
5 2 9 14 null null 16
...
Is there any efficient way of doing this? I've also tried using data.table, which seemed more suited for this, but I have yet to figure out how.
?merge
Should do the trick?
By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y.
So:
merge(d1,d2, by=c("X","Y","Z"))
And you can include all = TRUE to keep rows that have no match in the other data frame; the missing data will be NA:
merge(d1,d2, by=c("X","Y","Z"), all=TRUE)
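Since the question involves three data frames, a hedged sketch chaining the pairwise merges with Reduce, renaming the value columns first so they stay distinguishable:
names(d1)[names(d1) == "value"] <- "value1"
names(d2)[names(d2) == "value"] <- "value2"
names(d3)[names(d3) == "value"] <- "value3"
dfinal <- Reduce(function(a, b) merge(a, b, by = c("X", "Y", "Z"), all = TRUE),
                 list(d1, d2, d3))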
Take a look at dplyr and its join methods. I wrote a small example:
library(dplyr)
library(data.table)
d1 <- data.table(X = c(1,2,3), Y = c(2,3,4), Z = c(8,3,9), value = c(22,3,44))
d2 <- data.table(X = c(1,4,3), Y = c(2,6,4), Z = c(8,9,9), value = c(44,22,11))
d2 <- rename(d2, value2 = value)
full_join(d1,d2)
output:
X Y Z value value2
1 1 2 8 22 44
2 2 3 3 3 NA
3 3 4 9 44 11
4 4 6 9 NA 22

cut scoping link between key and data.frame for data.table

When using data.table as a lookup it is very fast. There is one behavior that does not work with my current workflow, and I'm sure there's a better way that I'm missing: data.table modifies in place, so even if a subset was taken from a parent data.frame, data.table will act on the parent data.frame in ways that may not always be desirable.
Here's an example since I lack the language to express it properly:
library(data.table)
set.seed(123)
N <- 100
key <- data.frame(x = sample.int(N, N), y = 1:N, z = 1:N)
key$w <- key$x
head(key)
## x y z w
## 1 29 1 1 29
## 2 79 2 2 79
## 3 41 3 3 41
## 4 86 4 4 86
## 5 91 5 5 91
## 6 5 6 6 5
set.seed(1)
terms <- data.frame(z = sample.int(2 * N, 1e2, replace = TRUE))
subkey <- key[c("x", "y")]
setDT(subkey)
setDT(terms)
setkey(subkey, x)
subkey[terms][[2]]
head(key)
## x y z w
## 1 1 74 1 1
## 2 2 35 2 2
## 3 3 51 3 3
## 4 4 18 4 4
## 5 5 6 5 5
## 6 6 54 6 6
Notice that the order of key is affected by the use of data.table, even though key itself wasn't used in the lookup.
I know data.table is avoiding making copies but is there a way to cut this link to key and force data.table to act on subkey without modifying key?
Rather than
subkey <- key[c("x", "y")]
setDT(subkey)
just do
subkey <- as.data.table(key[c("x", "y")])
That will force a copy and sever the connection.
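An equivalent sketch using data.table's copy(), which also takes a deep copy:
library(data.table)
subkey <- copy(key[c("x", "y")])  # deep copy: column vectors no longer shared with key
setDT(subkey)
setkey(subkey, x)                 # reorders subkey only; key is left alone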

automating a normal transformation function in R over multiple columns

I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y <- qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))
For example, to run this function on column w and append the output column to data frame m:
m$w_n <- qnorm((rank(m$w, na.last = "keep") - 0.5) / sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame; the one I have is much larger, and I have more columns than w, y, and z to run this function on.
Thanks!
There's probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 -0.2104284 -1.3829941
2 2 20 26 1.3829941 1.3829941
3 3 43 11 0.2104284 0.6744898
4 4 4 37 -1.3829941 0.2104284
5 5 36 24 0.6744898 -0.6744898
6 6 27 14 -0.6744898 -0.2104284
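For the single step alluded to above, a hedged sketch assigning all transformed columns at once:
cols <- setdiff(names(df), "id")
df[paste0(cols, "_n")] <- lapply(df[cols], transCols)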

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something] <- adataframe[adataframe$col==something] + 1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they are variables; i.e., no need to use $. Then the second part, col2:=col3+1, assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they are variables. := is assignment by reference. No copies of any object are taken, so it is faster than <-, =, within, and transform.
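A minimal sketch with made-up column names, just to show the shape of the call:
library(data.table)
DT <- data.table(col1 = c("a", "b", "a"), col2 = 1:3, col3 = c(10L, 20L, 30L))
DT[col1 == "a", col2 := col3 + 1L]  # only rows where col1 == "a" are updated
DT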
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := in j like that is combining it with by; see the question: when should I use the := operator in data.table?
UPDATE: That was indeed released (:= by group) in July 2012.
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
  eval.parent(substitute(x <- x + inc))
}
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took up the challenge of adding a further function argument. Supplied with one argument giving a portion of a dataframe, it will still increment the range of values by one. Up to this point it has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function that accepts only two arguments:
transfn <- function(x, func="+", inc=1) {
  eval.parent(substitute(x <- do.call(func, list(x, inc))))
}
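For example, a lightly tested sketch on a throwaway data frame:
df <- data.frame(a1 = 1:10, a2 = 21:30, b = 1:2)
transfn(df$a2, "*", 2)  # doubles df$a2 in place
head(df$a2)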
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
  eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
Suppose you want to increment all the y values by 1. You can do this easily using transform:
df = transform(df, y = y + 1)
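If the increment should apply only to a subset, as in the original question, a hedged sketch with ifelse:
df <- transform(df, y = ifelse(x > 5, y + 1, y))  # increment y only where x > 5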
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy / fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. It also tends to break lines into nicely human-parseable units, and there is something appealing about using standard idioms: R's vocabulary and idiosyncrasies are already large enough for my taste.
