Sort and number within levels of a factor in R

If I have the following data frame G:
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
I am trying to get:
z type x y
3 a 6 3
2 a 5 2
1 a 4 1
4 b 1 2
5 b 0.9 1
6 c 4 1
I.e. I want to sort the whole data frame within the levels of the factor type, based on vector x, get the length of each level (a = 3, b = 2, c = 1), and then number the rows in decreasing fashion in a new vector y.
My starting place is currently sort():
tapply(y, x, sort)
Would it be best to first try and use sapply to split everything first?

There are many ways to skin this cat. Here is one solution using base R and vectorized code in two steps (without any apply):
1. Sort the data using order and xtfrm.
2. Use rle and sequence to generate the sequence.
Replicate your data:
dat <- read.table(text="
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header=TRUE, stringsAsFactors=FALSE)
Two lines of code:
r <- dat[order(dat$type, -xtfrm(dat$x)), ]
r$y <- sequence(rle(r$type)$lengths)
Results in:
r
z type x y
3 3 a 6.0 1
2 2 a 5.0 2
1 1 a 4.0 3
4 4 b 1.0 1
5 5 b 0.9 2
6 6 c 4.0 1
The call to order is slightly complicated. Since you are sorting one column in ascending order and a second in descending order, use the helper function xtfrm. See ?xtfrm for details, but it is also described in ?order.
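The two lines above number each group from 1 upward, whereas the question asked for the largest x in each level to receive the highest number. A minimal sketch of one way to get that numbering, assuming ave() with a reversed seq_along() is acceptable (this is an alternative, not part of the answer above; res is an illustrative name):

```r
dat <- read.table(text = "
z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4
", header = TRUE, stringsAsFactors = FALSE)

# Sort by type, then by x in decreasing order within each type
res <- dat[order(dat$type, -dat$x), ]

# Count down from the group size to 1 within each type
res$y <- ave(res$x, res$type, FUN = function(v) rev(seq_along(v)))
res
```

On the example data this should reproduce the y column requested in the question (3, 2, 1, 2, 1, 1).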

I like Andrie's better:
dat <- read.table(text="z type x
1 a 4
2 a 5
3 a 6
4 b 1
5 b 0.9
6 c 4", header=TRUE)
Three lines of code:
dat <- dat[order(dat$type), ]
x <- by(dat, dat$type, nrow)
dat$y <- unlist(sapply(x, function(z) z:1))
I edited my response to address the comments Andrie mentioned. This works, but if you went this route instead of Andrie's you're crazy.

Related

How to filter variables by fold change difference in R

I'm trying to filter a very heterogeneous dataset.
I have numerous variables, each with several replicates. I have a factor with two levels (let's say X and Y), and I would like to subset the variables whose means differ by a fold change of at least 2 (X/Y >= 2 OR Y/X >= 2).
How can I achieve that in R? I can think of some ways, but they seem like too much of a hassle, and I'm sure there is a better way. I would later run multivariate tests on those filtered variables.
This would be an example dataset:
d <- read.table(text = "a b c d factor replicate
1 2 2 3 X 1
3 2 4 4 X 2
2 3 1 2 X 3
1 2 3 2 X 4
5 2 6 4 Y 1
7 4 5 5 Y 2
8 5 7 4 Y 3
6 4 3 3 Y 4", header = TRUE)
From this example, only variables a and c should be kept.
Using colMeans:
#subset
x <- d[ d$factor == "X", 1:4 ]
y <- d[ d$factor == "Y", 1:4 ]
# check colmeans, and get index
which(colMeans(x/y) >= 2 | colMeans(y/x) >= 2)
# a c
# 1 3
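Note that colMeans(x/y) averages the per-replicate ratios, which relies on the X and Y rows being paired in the same order. If the fold change is meant as the ratio of the group means, as the question words it, a sketch along these lines may be closer (vars, mx, my, fc and keep are illustrative names, not part of the answer above):

```r
d <- read.table(text = "a b c d factor replicate
1 2 2 3 X 1
3 2 4 4 X 2
2 3 1 2 X 3
1 2 3 2 X 4
5 2 6 4 Y 1
7 4 5 5 Y 2
8 5 7 4 Y 3
6 4 3 3 Y 4", header = TRUE)

vars <- c("a", "b", "c", "d")

# Mean of every variable within each level of the factor
mx <- colMeans(d[d$factor == "X", vars])
my <- colMeans(d[d$factor == "Y", vars])

# Fold change of the means, checked in both directions
fc   <- mx / my
keep <- names(fc)[fc >= 2 | 1 / fc >= 2]
keep

# Keep the selected variables plus the design columns
d_filtered <- d[, c(keep, "factor", "replicate")]
```

On the example data this also keeps a and c.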

Operation between two dataframe with different size in R

I'd like to sum two data frames of different sizes in R.
> x = data.frame(a=c(1,2,3),b=c(5,6,7))
> y = data.frame(x=c(1,1,1))
> x
a b
1 1 5
2 2 6
3 3 7
> y
x
1 1
2 1
3 1
The result I want is,
>
a b
1 2 6
2 3 7
3 4 8
How can I do this?
Maybe easiest to convert y to a vector with unlist and then perform the operation. Here, the vector in unlist(y) will be recycled over the columns of the data.frame x.
x + unlist(y)
a b
1 2 6
2 3 7
3 4 8
As a side note, data.frames are a special type of list object, and sometimes performing operations on lists can be a bit more involved. On the other hand, they tend to work fairly well with vectors as long as the dimensions line up (here, as long as the vector has the same length as the number of rows in the data.frame).
We can make the dimensions the same and then take the sum:
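Because every value of y is 1 in the question, the recycling is hard to see. A small sketch with a non-constant vector (v and res are illustrative names) shows that the vector is applied down each column in turn:

```r
x <- data.frame(a = c(1, 2, 3), b = c(5, 6, 7))
v <- c(10, 20, 30)  # one value per row of x

# v is recycled over the columns: each column gets v added element-wise
res <- x + v
res
#    a  b
# 1 11 15
# 2 22 26
# 3 33 37
```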
x + rep(y, ncol(x))
# a b
#1 2 6
#2 3 7
#3 4 8
Or another option is sweep (the arguments are the data, the margin, the statistic, and the function):
sweep(x, 1, y$x, `+`)
# a b
#1 2 6
#2 3 7
#3 4 8

Convert a full length column to one variable in a row in R

I was wondering if it is possible to convert 1 column into 1 variable with the values next to each other,
i.e.:
d <- data.frame(y = 1:10)
> d
y
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
Convert this column into:
> d
1 2 3 4 5 6 7 8 9 10
We don't know how you are going to use the numbers, but I think it is unnecessary to make any transformation. You can use d$y to get the numbers and apply them to any map of colors. See for example:
d <- data.frame(y = 1:7)
library(RColorBrewer)
mypalette<-brewer.pal(4,"Greens")
mycol <-palette()#rainbow(7)
heatmap(matrix(1:28,ncol=4),col=mypalette[d$y[1:4]],xlab="Greens (sequential)",
ylab="",xaxt="n",yaxt="n",bty="n",RowSideColors=mycol[d$y])
Not sure what the purpose is of:
"1 variable next to each other"
But there are a few ways to get the desired result (again, depending on the objective). You can do either:
d$y
unname(unlist(d)) #suggested by agstudy
or, better yet, to convert your dataframe's column into a vector, do this:
v <- as.vector(d[,1])
as a string:
args <- paste(d$y, collapse=" ")
args <- noquote(args)
now you'll have
[1] 1 2 3 4 5 6 7 8 9 10
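If the goal is literally a single row with ten columns, one further option is to transpose. A minimal sketch, assuming a one-row matrix (or a one-row data frame built from it) is acceptable; wide and wide_df are illustrative names:

```r
d <- data.frame(y = 1:10)

# t() turns the 10 x 1 data frame into a 1 x 10 matrix
wide <- t(d)
dim(wide)

# Convert back to a data frame with one row and ten columns, if needed
wide_df <- as.data.frame(wide)
wide_df
```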

Extract data from data.frame based on coordinates in another data.frame

So here is my problem. I have a really big data.frame with two columns: the first one represents x coordinates (rows) and the other one y coordinates (columns), for example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column to the first data.frame with data from the second data.frame, based on the coordinates in the first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big, for loops are too slow. I think there is a way to do this with the apply family, but I can't find how. Thanks in advance (and sorry for the ugly message layout; this is my first post here and I don't know how to produce the nice layout with code and proper data.frames seen in other questions).
This is a simple indexing question. No need for external packages or *apply loops, just do
df1$z <- df2[as.matrix(df1)]
df1
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
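The one-liner above relies on indexing with a two-column matrix, where each row of the index is read as a (row, column) pair. A self-contained sketch of the same idea, with the question's data typed in by hand (m, idx and z are illustrative names):

```r
df1 <- data.frame(x = c(1, 2, 3, 4, 3), y = c(1, 3, 1, 2, 4))
df2 <- data.frame(a = c(8, 1, 5, 7, 1),
                  b = c(7, 2, 4, 8, 5),
                  c = c(8, 3, 7, 9, 2),
                  d = c(1, 4, 8, 7, 3))

m   <- as.matrix(df2)         # values as a plain numeric matrix
idx <- cbind(df1$x, df1$y)    # one (row, column) pair per line

# Indexing a matrix by a two-column matrix picks one cell per pair
z <- m[idx]
z
# [1] 8 3 5 8 8
```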
A base R solution (df1 and df2 are the coordinates and the numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo, since you don't have 5 columns in the second data frame.
Here's how I would do this.
First, use data.table for fast merging; then convert your data frames (I'll call them dt1 with coordinates and vals with values) to data.tables.
library(data.table)
dt1 <- data.table(dt)     # dt: the data frame of coordinates
vals <- data.table(vals)  # vals: the data frame of values
Second, put vals into a new data.table with coordinates:
vals_dt <- data.table(x = rep(1:nrow(vals), ncol(vals)),
                      y = rep(1:ncol(vals), each = nrow(vals)),
                      z = unlist(vals, use.names = FALSE),  # values in column-major order
                      key = c("x", "y"))
Now merge:
setkey(dt1, x, y)[vals_dt, z := i.z]
You can also try the data.table package and update df1 by reference
library(data.table)
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

Generating random number by length of blocks of data in R data frame

I am trying to simulate the measuring order n times and see how measuring order affects my study subject. To do this I am trying to generate random integers in a new column of a data frame. I have a big data frame and I would like to add a column that contains a random number according to the number of observations in a block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R's sample() function to generate random numbers, but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
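A quick way to convince yourself that the ave() call does what is asked is to check that D is a permutation of 1..n within every block. A minimal sketch (the seed value and the name ok are arbitrary choices for illustration):

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 B = c("x", "b", "c", "g", "h", "g", "g", "u", "l"),
                 C = c(1, 2, 4, 1, 5, 7, 1, 2, 5))

# Random order within each block defined by A
df$D <- ave(df$A, df$A, FUN = function(x) sample(length(x)))

# Within each block, the sorted D values should be exactly 1, 2, ..., block size
ok <- tapply(df$D, df$A, function(d) all(sort(d) == seq_along(d)))
ok
```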
This is really easy with ddply from plyr.
library(plyr)
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
  df$D <- sample(nrow(df))
  df
})
Then recombine with
df <- do.call(rbind, split_df)
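do.call(rbind, ...) returns the rows grouped by block and with compound row names. If the rows should come back in their original positions, base R's unsplit() is one alternative for the recombination step. A sketch under that assumption (out and block are illustrative names, and the seed is arbitrary):

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

df <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
                 B = c("x", "b", "c", "g", "h", "g", "g", "u", "l"),
                 C = c(1, 2, 4, 1, 5, 7, 1, 2, 5))

# Split by block and add a random order inside each piece
split_df <- lapply(split(df, df$A), function(block) {
  block$D <- sample(nrow(block))
  block
})

# unsplit() reverses split(), putting every row back where it came from
out <- unsplit(split_df, df$A)
out
```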
One simple way:
df$D <- 0
counts <- table(df$A)
for (i in seq_along(counts)) {
  df$D[df$A == names(counts)[i]] <- sample(counts[i])
}

Resources