r "slot" two columns into one (like a zip) - r

Given two columns (perhaps from a data frame) of equal length N, how can I produce a column of length 2N with the entries of the first column in the odd positions and the entries of the second column in the even positions?
Suppose I have the following data frame:
df.1 <- data.frame(X = LETTERS[1:10], Y = 2*(1:10)-1, Z = 2*(1:10))
How can I produce this data frame df.2?
i <- 1
j <- 0
XX <- NA
while (i <= 10) {
  XX[i + j]     <- LETTERS[i]
  XX[i + j + 1] <- LETTERS[i]
  i <- i + 1
  j <- i - 1
}
df.2 <- data.frame(X.X = XX, Y.Z = c(1:20))

ggplot2 has an unexported function interleave which does this.
Although it is unexported, it does have a help page (?ggplot2:::interleave).
with(df.1, ggplot2:::interleave(Y,Z))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
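To go from that interleaved vector to the df.2 in the question, a minimal sketch (the rep() call for the repeated letters is my addition, not part of this answer):
# Repeat each letter twice for X.X; interleave Y and Z for Y.Z
df.2 <- data.frame(X.X = rep(df.1$X, each = 2),
                   Y.Z = with(df.1, ggplot2:::interleave(Y, Z)))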

If I understand you right, you want to create a new vector twice the length of the vectors X, Y and Z in your data frame and then want all the elements of X to occupy the odd indices of this new vector and all the elements of Y the even indices. If so, then the code below should do the trick:
foo <- vector(length = 2*nrow(df.1), mode = 'character')
foo[seq(from = 1, to = 2*length(df.1$X), by = 2)] <- as.character(df.1$X)
foo[seq(from = 2, to = 2*length(df.1$X), by = 2)] <- df.1$Y
Note, I first create an empty vector foo of length 20, then fill it in with elements of df.1$X and df.1$Y.
Cheers,
Danny
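As a follow-up sketch (my addition, not part of Danny's answer), the same odd/even indexing builds the full df.2 from the question, using a numeric vector for Y.Z so the values are not coerced to character:
# Fill odd positions from Y and even positions from Z, then add the repeated letters
YZ <- numeric(2 * nrow(df.1))
YZ[seq(1, 2 * nrow(df.1), by = 2)] <- df.1$Y
YZ[seq(2, 2 * nrow(df.1), by = 2)] <- df.1$Z
df.2 <- data.frame(X.X = rep(df.1$X, each = 2), Y.Z = YZ)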

You can use melt from reshape2:
library(reshape2)
foo <- melt(df.1, id.vars='X')
> foo
X variable value
1 A Y 1
2 B Y 3
3 C Y 5
4 D Y 7
5 E Y 9
6 F Y 11
7 G Y 13
8 H Y 15
9 I Y 17
10 J Y 19
11 A Z 2
12 B Z 4
13 C Z 6
14 D Z 8
15 E Z 10
16 F Z 12
17 G Z 14
18 H Z 16
19 I Z 18
20 J Z 20
Then you can sort and pick the columns you want:
foo[order(foo$X), c('X', 'value')]
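If you also want the column names from the question, a small addition to this answer (setNames is base R):
# Rename the kept columns to match df.2 from the question
df.2 <- setNames(foo[order(foo$X), c('X', 'value')], c('X.X', 'Y.Z'))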

Another solution using base R.
First, index the character vector of the data frame using the vector [1,1,2,2, ..., 10,10] and store it as X.X. Next, rbind the data frame columns Y and Z, effectively "zipping" them, and store the result in Y.Z.
> res <- data.frame(
+ X.X = df.1$X[c(rbind(1:10, 1:10))],
+ Y.Z = c(rbind(df.1$Y, df.1$Z))
+ )
> head(res)
X.X Y.Z
1 A 1
2 A 2
3 B 3
4 B 4
5 C 5
6 C 6

A two-liner in base R:
test <- data.frame(X.X=df.1$X,Y.Z=unlist(df.1[c("Y","Z")]))
test[order(test$X.X),]

Assuming that you want what you asked for in the first paragraph, and that the rest of what you posted is your attempt at solving it:
a=df.1[df.1$Y%%2>0,1:2]
b=df.1[df.1$Z%%2==0,c(1,3)]
names(a)=c("X.X","Y.Z")
names(b)=names(a)
df.2=rbind(a, b)
If you want to group them by X.X as shown in your example, you can do:
library(plyr)
arrange(df.2, X.X)
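If you'd rather avoid the plyr dependency, base R's order does the same thing (my addition, not part of this answer):
df.2[order(df.2$X.X), ]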

Related

Use apply on two data.frames

If I had a data.frame X and wanted to apply a function foo to each of its rows, I would just run apply(X, 1, foo). This is all well-known and simple.
Now imagine I have another data.frame Y and the following function:
mean_of_sum <- function(x, y) {
  return(mean(x + y))
}
Is there a way to write an "apply equivalent" to the following loop:
my_loop_fun <- function(X, Y) {
  results <- numeric(nrow(X))
  for (i in 1:length(results)) {
    results[i] <- mean_of_sum(X[i, ], Y[i, ])
  }
  return(results)
}
If such an "apply syntax" exists, would it be more efficient than my "good" old loop?
This should work:
sapply(seq_len(nrow(X)), function(i) mean_of_sum(X[i,], Y[i,]))
You apply the function over the sequence 1, 2, ..., n (where n is the number of rows), and in each "iteration" you evaluate mean_of_sum for the i-th row.
We can split every row of X and Y into lists and use mapply to apply the function, changing mean_of_sum a bit to convert each one-row data frame to numeric:
mean_of_sum <- function(x, y) {
  return(mean(as.numeric(x) + as.numeric(y)))
}
Consider an example,
X <- data.frame(a = 1:5, b = 6:10)
Y <- data.frame(c = 11:15, d = 16:20)
mapply(mean_of_sum, split(X, seq_len(nrow(X))), split(Y, seq_len(nrow(Y))))
# 1 2 3 4 5
#17 19 21 23 25
where X and Y are
X
# a b
#1 1 6
#2 2 7
#3 3 8
#4 4 9
#5 5 10
Y
# c d
#1 11 16
#2 12 17
#3 13 18
#4 14 19
#5 15 20
So the first value, 17, is computed as
mean(c(1 + 11, 6 + 16))
#[1] 17
and so on for the remaining values.
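One more observation (my addition, assuming X and Y are entirely numeric, as in this example, and that the function really is just mean(x + y)): the whole computation vectorises, which also answers the efficiency question, since no per-row apply or loop is needed.
# Element-wise sum of the two data frames, then the mean of each row
rowMeans(X + Y)
# 1  2  3  4  5
#17 19 21 23 25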

Aggregate all columns with data.table using 2 fixed columns

I have a custom function I would like to apply to a data table such as follows:
DT = data.table(x  = rep(c("a", "b", "c"), each = 2),
                x2 = rep(c("h", "j"), each = 3),
                y  = c(1, 3),
                v  = 1:6,
                z  = 7:12,
                w  = 13:18)
DT
x x2 y v z w
1: a h 1 1 7 13
2: a h 3 2 8 14
3: b h 1 3 9 15
4: b j 3 4 10 16
5: c j 1 5 11 17
6: c j 3 6 12 18
I have a function with which I would like to score the numeric columns of DT, grouped by column x. The function keeps two columns fixed (z and w) and performs its calculation on a third, varying column. It is as follows (col below stands for the column that is not fixed):
scoring <- function(col, z, w) {
  f <- abs(w - col) / abs(w - z)
  f[is.infinite(f)] <- 1
  f[is.nan(f)] <- 1
  return(median(f))
}
The result would (in this case) have two new columns, y and v, both of which would be aggregated using the scoring function by x (that is, for groups "a", "b" and "c"). E.g.:
y: a: 0.9166667
y: b: 1.25
y: c: 1.583333
v: a: 1
v: b: 1
v: c: 1
My question is:
I know I can use the by functionality in data.table, but I don't know how to tell it to keep two columns fixed for my custom function and perform the calculation on the remaining columns.
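No answer is given in this thread, but a minimal sketch of the usual data.table pattern (assuming scoring as defined above and that only y and v should be scored) keeps z and w fixed via the function arguments and restricts .SD to the remaining numeric columns; whether the numbers match the expected output depends on the exact scoring definition:
library(data.table)
# For each group of x, apply scoring() to every column listed in .SDcols,
# passing that group's z and w as the two fixed columns
DT[, lapply(.SD, function(col) scoring(col, z, w)), by = x, .SDcols = c("y", "v")]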

Split dataframe into bins based on another vector

Suppose I have the following data frame:
x <- c(12,30,45,100,150,305,2,46,10,221)
x2 <- letters[1:10]
df <- data.frame(x,x2)
df <- df[with(df, order(x)), ]
x x2
7 2 g
9 10 i
1 12 a
2 30 b
3 45 c
8 46 h
4 100 d
5 150 e
10 221 j
6 305 f
And I would like to split these into groups based on another vector,
v <- seq(0, 500, 50)
Basically, I would like to partition each row based on column x and how it compares to v (so, for example, x <= an element of v); the location/index of that element in v is then used to assign a group to that row. The resulting table should look something like the following:
x x2 group
7 2 g g1
9 10 i g1
1 12 a g1
2 30 b g1
3 45 c g1
8 46 h g2
4 100 d g3
5 150 e g4
10 221 j g4
6 305 f g6
I could try to loop through each row and match it to v, but I'm still confused as to how I could easily detect where the match x <= element of v occurs so that I can assign a group id to it. Thanks.
You can use cut to break up df$x by the values of v:
df$group <- as.numeric(cut(df$x, breaks = v))
df$group <- paste0('g', df$group)
cut returns a factor so you can use as.numeric to just pull out which numeric bucket the value of df$x falls into based on v.
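An alternative worth noting (my addition, not part of this answer): base R's findInterval returns the bucket index directly; with left.open = TRUE it treats the intervals as right-closed, matching cut's default behaviour.
# Same bucket indices as cut(), without creating a factor first
df$group <- paste0('g', findInterval(df$x, v, left.open = TRUE))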

Replacing header in data frame based on values in second data frame

Say I have a data frame which looks like this:
df.A
A B C
x 1 3 4
y 5 4 6
z 8 9 1
And I want to replace the column names in the first based on column values in a second:
df.B
Low High
A D
B F
C G
Such that I get:
df.A
D F G
x 1 3 4
y 5 4 6
z 8 9 1
How would I do it?
I have tried extracting the vector df.B$High from df.B and using this in names(df.A), but everything is in alphabetical order and shifted over one. Furthermore, this only works if the order of columns in df.A is conserved with respect to the elements in df.B$High, which is not always the case (and in my real example there is no numeric or alphabetical way to sort the two to the same order). So I think I need an rbind-type argument for matching elements, but I'm not sure.
Thanks!
You can use rename from plyr:
library(plyr)
dat <- read.table(text = " A B C
x 1 3 4
y 5 4 6
z 8 9 1",header = TRUE,sep = "")
> new <- read.table(text = "Low High
A D
B F
C G",header = TRUE,sep = "")
> rename(dat,replace = setNames(new$High,new$Low))
D F G
x 1 3 4
y 5 4 6
z 8 9 1
Using match:
df.A <- read.table(sep=" ", header=T, text="
A B C
x 1 3 4
y 5 4 6
z 8 9 1")
df.B <- read.table(sep=" ", header=T, text="
Low High
A D
B F
C G")
df.C <- df.A
names(df.C) <- df.B$High[match(names(df.A), df.B$Low)]
df.C
# D F G
# x 1 3 4
# y 5 4 6
# z 8 9 1
You can play games with the row names of df.B to make a lookup more convenient:
rownames(df.B) <- df.B$Low
names(df.A) <- df.B[names(df.A),"High"]
df.A
## D F G
## x 1 3 4
## y 5 4 6
## z 8 9 1
Here's an approach abusing factor:
f <- factor(names(df.A), levels=df.B$Low)
levels(f) <- df.B$High
f
## [1] D F G
## Levels: D F G
names(df.A) <- f
## Desired results
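If data.table is already in use, its setnames() takes vectors of old and new names and matches by name, which amounts to the same lookup (my addition, not from the answers above):
library(data.table)
# Renames in place (by reference); old names are matched wherever they occur
setnames(df.A, old = as.character(df.B$Low), new = as.character(df.B$High))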

Flatten matrix in R to four columns (indexes and upper/lower triangles)

I'm using the cor.prob() function that's been posted several times around the mailing list to get a matrix of correlations (lower triangle) and p-values (upper triangle):
cor.prob <- function(X, dfr = nrow(X) - 2) {
  R <- cor(X)
  above <- row(R) < col(R)
  r2 <- R[above]^2
  Fstat <- r2 * dfr / (1 - r2)
  R[above] <- 1 - pf(Fstat, 1, dfr)
  R[row(R) == col(R)] <- NA
  R
}
d <- data.frame(x=1:5, y=c(10,16,8,60,80), z=c(10,9,12,2,1))
cor.prob(d)
> cor.prob(d)
x y z
x NA 0.04856042 0.107654038
y 0.8807155 NA 0.003523594
z -0.7953560 -0.97945703 NA
How would I collapse the above correlation matrix (with the correlations in the lower half, p-values in the upper half) into a four-column matrix: two indexes, the correlation, and the p-value? E.g.:
i j cor pval
x y .88 .048
x z -.79 .107
y z -.97 0.0035
I've seen the answer to a previous question like this, but it will only give me a 3-column matrix, not a four-column matrix with separate columns for the p-value and the correlation.
Any help is appreciated!
Well, it's not a matrix, because you can't mix characters and numerics. But:
This is my first attempt (before your label swap):
m <- cor.prob(d)
ut <- upper.tri(m)
lt <- lower.tri(m)
d <- data.frame(i = rep(row.names(m), ncol(m))[as.vector(ut)],
                j = rep(colnames(m), each = nrow(m))[as.vector(ut)],
                cor = m[ut],
                p = m[lt])
Now apply the correction I suggested below and you get:
d <- data.frame(i = rep(row.names(m), ncol(m))[as.vector(ut)],
                j = rep(colnames(m), each = nrow(m))[as.vector(ut)],
                cor = m[ut],
                p = t(m)[ut])
Finally, apply your label swap, use row()/col(), and write it as a function:
f1 <- function(m) {
  ut <- upper.tri(m)
  data.frame(i = rownames(m)[row(m)[ut]],
             j = rownames(m)[col(m)[ut]],
             cor = t(m)[ut],
             p = m[ut])
}
Then:
m <- matrix(1:25, 5, dimnames = list(letters[1:5], letters[1:5]))
> m
a b c d e
a 1 6 11 16 21
b 2 7 12 17 22
c 3 8 13 18 23
d 4 9 14 19 24
e 5 10 15 20 25
> f1(m)
i j cor p
1 a b 6 2
2 a c 11 3
3 b c 12 8
4 a d 16 4
5 b d 17 9
6 c d 18 14
7 a e 21 5
8 b e 22 10
9 c e 23 15
10 d e 24 20
Can you explain what you expected if it wasn't this?
cd <- cor.prob(d)
dcd <- as.data.frame( which( row(cd) < col(cd), arr.ind=TRUE) )
dcd$pval <- cd[row(cd) < col(cd)]
dcd$cor <- cd[row(cd) > col(cd)]
dcd[[2]] <-dimnames(cd)[[2]][dcd$col]
dcd[[1]] <-dimnames(cd)[[2]][dcd$row]
dcd
#--------------------
row col pval cor
1 x y 0.048560420 0.8807155
2 x z 0.107654038 -0.7953560
3 y z 0.003523594 -0.9794570
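Putting the pieces together, a minimal sketch (using the same upper.tri/row/col indexing as the answers above, with d being the original data frame from the question) that goes straight from cor.prob(d) to the four-column layout asked for:
cd <- cor.prob(d)
ut <- upper.tri(cd)
data.frame(i    = rownames(cd)[row(cd)[ut]],
           j    = colnames(cd)[col(cd)[ut]],
           cor  = t(cd)[ut],  # correlations sit in the lower triangle
           pval = cd[ut])     # p-values sit in the upper triangle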
