How do you convert information from rle into a data frame - r

I want to convert the information contained in a the "rle" function in R, into a data frame, but couldn't find how. For example, for the vector
x <- c(1,1,1,2,2,3,4,4,4)
I want a dataframe that has two columns of 1 2 3 4 and 3 2 1 3
Any help would be greatly appreciated!

Use unclass to remove the rle class. Then you can just use data.frame on the resulting list.
data.frame(unclass(rle(x)))
## lengths values
## 1 3 1
## 2 2 2
## 3 1 3
## 4 3 4

You can do it direclty with the data.frame function. rle actually returns a list of two components (lengths and values).
rleX
data.frame(values = rleX$values, lengths = rleX$lengths)

You can use this simple function to convert to dataframe
data <- with(rle(x), data.frame(values, lengths))

Try this:
data.frame(table(x))
x Freq
1 1 3
2 2 2
3 3 1
4 4 3

Related

Expand.grid with unknown number of columns

I have the following data frame:
map_value LDGroup ComboNum
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 3 1
1 3 2
I want to find all combinations, selecting one from each LD group. Expand.grid seems to work for this, doing
expand.grid(df[df$LDGroup==1,3],df[df$LDGroup==2,3],df[df$LDGroup==3,3])
My problem is that I have about 500 map_values I need to do this for and I do not know what number of LDGroups will exist for each map_value. Is there a way to dynamically provide the function arguments?
We can split the 3rd column by the 'LDGroup' and apply the expand.grid
out <- expand.grid(split(df$ComboNum, df$LDGroup))
names(out) <- paste0("Var", names(out))

How to remove columns of data from a data frame using a vector with a regular expression

I am trying to remove columns from a dataframe using a vector of numbers, with those numbers being just a part of the whole column header. What I'm looking to use is something like the wildcard "*" in unix, so that I can say that I want to remove columns with labels xxxx, xxkx, etc... To illustrate what I mean, if I have the following data:
data_test_read <- read.table("batch_1_8c9.structure-edit.tsv",sep="\t", header=TRUE)
data_test_read[1:5,1:5]
samp pop X12706_10 X14223_16 X14481_7
1 BayOfIslands_s088.fq 1 4 1 3
2 BayOfIslands_s088.fq 1 4 1 3
3 BayOfIslands_s089.fq 1 4 1 3
4 BayOfIslands_s089.fq 1 4 3 3
5 BayOfIslands_s090.fq 1 4 1 3
And I want to take out, for example, columns with headers (X12706_10, X14481_7), the following works
data_subs1=subset(data_test_read, select = -c(X12706_10, X14481_7))
data_subs1[1:4,1:4]
samp pop X14223_16 X15213_19
1 BayOfIslands_s088.fq 1 1 3
2 BayOfIslands_s088.fq 1 1 3
3 BayOfIslands_s089.fq 1 1 3
4 BayOfIslands_s089.fq 1 3 3
However, what I need is to be able to identify these columns by only the numbers, so, using (12706,14481). But, if I try this, I get the following
data_subs2=subset(data_test_read, select = -c(12706,14481))
data_subs2[1:4,1:4]
samp pop X12706_10 X14223_16
1 BayOfIslands_s088.fq 1 4 1
2 BayOfIslands_s088.fq 1 4 1
3 BayOfIslands_s089.fq 1 4 1
4 BayOfIslands_s089.fq 1 4 3
This is clearly because I haven't specified anything to do with the "x", or the "_" or what is after the underscore. I've read so many answers on using regular expressions, and I just can't seem to sort it out. Any thoughts, or pointers to what I might turn to would be appreciated.
First you can just extract the numbers from the headers
# for testing
col_names <- c("X12706_10","X14223_16","X14481_7")
# in practice, use
# col_names <- names(data_test_read)
samples <- gsub("X(\\d+)_.*","\\1",col_names)
The find the indexes of the samples you want to drop.
samples_to_drop <- c(12706, 14481)
cols_to_drop <- match(samples_to_drop, samples)
Then you can use
data_subs2 <- subset(data_test_read, select = -cols_to_drop)
to actually get rid of those columns.
Perhaps put this all in a function to make it easier to use
sample_subset <- function(x, drop) {
samples <- gsub("X(\\d+)_.*","\\1", names(x))
subset(x, select = -match(drop, samples))
}
sample_subset(data_test_read, c(12706, 14481))

Extract data from data.frame based on coordinates in another data.frame

So here is what my problem is. I have a really big data.frame woth two columns, first one represents x coordinates (rows) and another one y coordinates (columns), for example:
x y
1 1
2 3
3 1
4 2
3 4
In another frame I have some data (numbers actually):
a b c d
8 7 8 1
1 2 3 4
5 4 7 8
7 8 9 7
1 5 2 3
I would like to add a third column in first data.frame with data from second data.frame based on coordinates from first data.frame. So the result should look like this:
x y z
1 1 8
2 3 3
3 1 5
4 2 8
3 4 8
Since my data.frames are really big the for loops are too slow. I think there is a way to do this with apply loop family, but I can't find how. Thanks in advance (and sorry for ugly message layout, this is my first post here and I don't know how to produce this nice layout with code and proper data.frames like in another questions).
This is a simple indexing question. No need in external packages or *apply loops, just do
df1$z <- df2[as.matrix(df1)]
df1
# x y z
# 1 1 1 8
# 2 2 3 3
# 3 3 1 5
# 4 4 2 8
# 5 3 4 8
A base R solution: (df1 and df2 are coordinates and numbers as data frames):
df1$z <- mapply(function(x,y) df2[x,y], df1$x, df1$y )
It works if the last y in the first data frame is corrected from 5 to 4.
I guess it was a typo since you don't have 5 columns in the second data drame.
Here's how I would do this.
First, use data.table for fast merging; then convert your data frames (I'll call them dt1 with coordinates and vals with values) to data.tables.
dt1<-data.table(dt)
vals<-data.table(vals)
Second, put vals into a new data.table with coordinates:
vals_dt<-data.table(x=rep(1:dim(vals)[1],dim(vals)[2]),
y=rep(1:dim(vals)[2],each=dim(vals)[1]),
z=matrix(vals,ncol=1)[,1],key=c("x","y"))
Now merge:
setkey(dt1,x,y)[vals_dt,z:=z]
You can also try the data.table package and update df1 by reference
library(data.table)
setDT(df1)[, z := df2[cbind(x, y)]][]
# x y z
# 1: 1 1 8
# 2: 2 3 3
# 3: 3 1 5
# 4: 4 2 8
# 5: 3 4 8

Replace some component value in a vector with some other value

In R, in a vector, i.e. a 1-dim matrix, I would like to change components with value 3 to with value 1, and components with value 4 with value 2. How shall I do that? Thanks!
The idiomatic r way is to use [<-, in the form
x[index] <- result
If you are dealing with integers / factors or character variables, then == will work reliably for the indexing,
x <- rep(1:5,3)
x[x==3] <- 1
x[x==4] <- 2
x
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
The car has a useful function recode (which is a wrapper for [<-), that will let you combine all the recoding in a single call
eg
library(car)
x <- rep(1:5,3)
xr <- recode(x, '3=1; 4=2')
x
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
xr
## [1] 1 2 1 2 5 1 2 1 2 5 1 2 1 2 5
Thanks to #joran for mentioning mapvalues from the plyr package, another wrapper for [<-
x <- rep(1:5,3)
mapvalues(x, from = c(3,1), to = c(1,2))
plyr::revalue is a wrapper for mapvalues specifically factor or character variables.

Create a vector listing run length of original vector with same length as original vector

This problem seems trivial but I'm at my wits end after hours of reading.
I need to generate a vector of the same length as the input vector that lists for each value of the input vector the total count for that value. So, by way of example, I would want to generate the last column of this dataframe:
> df
customer.id transaction.count total.transactions
1 1 1 4
2 1 2 4
3 1 3 4
4 1 4 4
5 2 1 2
6 2 2 2
7 3 1 3
8 3 2 3
9 3 3 3
10 4 1 1
I realise this could be done two ways, either by using run lengths of the first column, or grouping the second column using the first and applying a maximum.
I've tried both tapply:
> tapply(df$transaction.count, df$customer.id, max)
And rle:
> rle(df$customer.id)
But both return a vector of shorter length than the original:
[1] 4 2 3 1
Any help gratefully accepted!
You can do it without creating transaction counter with:
df$total.transactions <- with( df,
ave( transaction.count , customer.id , FUN=length) )
You can use rle with rep to get what you want:
x <- rep(1:4, 4:1)
> x
[1] 1 1 1 1 2 2 2 3 3 4
rep(rle(x)$lengths, rle(x)$lengths)
> rep(rle(x)$lengths, rle(x)$lengths)
[1] 4 4 4 4 3 3 3 2 2 1
For performance purposes, you could store the rle object separately so it is only called once.
Or as Karsten suggested with ddply from plyr:
require(plyr)
#Expects data.frame
dat <- data.frame(x = rep(1:4, 4:1))
ddply(dat, "x", transform, total = length(x))
You are probably looking for split-apply-combine approach; have a look at ddply in the plyr package or the split function in base R.

Resources