This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 4 years ago.
I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any idea how to do this efficiently?
We can use expandRows to expand the rows based on the value in the 'samples' column, then convert the result to a data.table. Grouped by 'chr', we use sprintf to paste the columns together with the within-group row sequence, and assign the result to the 'samples' column.
library(splitstackshape)
library(data.table)  # loaded explicitly so that setDT() is available
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
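NOTE: depending on the splitstackshape version, data.table may be attached automatically when splitstackshape is loaded. If setDT is not found, call library(data.table) explicitly.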
You can achieve this using base R (i.e. without data.table) with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
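# Expand one input row into `samples` rows, labelling each copy chr-start-end-sN
duplicate_rows <- function(chr, start, end, samples) {
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", seq_len(samples))
  # Column names match the input so the result lines up with the expected output
  data.frame(chr = chr, start = start, end = end, samples = expanded_samples)
}
# Apply the function to each row, then stack the resulting data.frames
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)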
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function, which has since moved to the purrrlyr package), but this may be overkill for what you're after.
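For what it's worth, the same result can probably be reached in base R without a per-row helper, by repeating row indices with rep() and numbering the copies with sequence(). This is only a sketch on the question's example data; `idx` and `out` are names I made up for illustration.

```r
df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# Repeat each row index as many times as its 'samples' value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), df$samples)
out <- df[idx, ]

# sequence() restarts the counter for every original row: 1 2 1 2 3
out$samples <- paste(out$chr, out$start, out$end,
                     paste0("s", sequence(df$samples)), sep = "-")

# Drop the 1, 1.1, 2, 2.1, ... row names left over from the indexing
rownames(out) <- NULL
out
```

Because everything is vectorised, nothing has to be rbind-ed back together afterwards.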
Example using the DataFrame function from the S4Vectors package (Bioconductor):
library(S4Vectors)
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column gives the number of times to repeat its corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5
We have a dataframe called data with 2 columns: Time which is arranged in ascending order, and Place which describes where the individual was:
data.frame(Time = seq(1,20,1),
Place = rep(letters[c(1:3,1)], c(5,5,3,7)))
Since this data is in ascending order with respect to Time, we want to subset the rows where Place changes from the previous observation.
The resulting dataframe for this data would look like this:
Time Place
1 a
6 b
11 c
14 a
Notice that the same Place can show up later, like Place == a did in this example. How can we perform this kind of subset in R?
Apply duplicated to the rleid of 'Place' (rleid comes from data.table):
library(dplyr)
library(data.table)
df1 %>%
filter(!duplicated(rleid(Place)))
Or in base R with rle
subset(df1, !duplicated(with(rle(as.character(Place)), rep(seq_along(values), lengths))))
-output
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
Another base R option using subset + tail + head
subset(
df,
c(TRUE, tail(Place, -1) != head(Place, -1))
)
which gives
Time Place
1 1 a
6 6 b
11 11 c
14 14 a
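To make the idea behind these answers a bit more concrete, here is a small self-contained sketch that builds the question's data, flags the rows where Place differs from the previous row, and derives a run id from those flags (roughly what rleid computes). The names `df1`, `changed`, `run_id` and `result` are just for illustration.

```r
df1 <- data.frame(Time  = seq(1, 20, 1),
                  Place = rep(letters[c(1:3, 1)], c(5, 5, 3, 7)))

# TRUE for the first row and for every row whose Place differs from the row before
changed <- c(TRUE, df1$Place[-1] != df1$Place[-nrow(df1)])

# Cumulative count of changes: a run id that increases at every switch
run_id <- cumsum(changed)

# Keep only the first row of each run
result <- df1[changed, ]
result
```

Since the comparison only ever looks at the previous row, a Place that reappears later (like "a" here) starts a new run, which is what the question asks for.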
I'm currently working with a huge count matrix from single-cell sequencing ...
To analyze it in R with my 8 GB of RAM, I had to split it into several sub-matrices.
I simply used split to do that, so I lost the headers of the matrix.
I would like to add them back with R, or find a more efficient way to split the matrix.
My questions are:
1. If I have an object called heathers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind but it doesn't really solve the problem.
2. Is there a better way to cut those huge count matrices into multiple parts? (I can't do it through R because I don't have enough RAM, R crashes if I try to import the whole matrix)
If I have an object called heathers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind but it doesn't really solve the problem.
You can add headers to a dataframe like this:
dataframe <- data.frame(c("a", "b","c"),
c("d", "e", "f"))
headers <- c("header_1" , "header_2")
names(dataframe) <- headers
dataframe
header_1 header_2
1 a d
2 b e
3 c f
You could use bash for such tasks.
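Staying in R may also be an option: read.table can read a slice of a file via skip and nrows, and col.names attaches the header to each slice, so the whole matrix never has to be in memory at once. The sketch below is only an illustration under assumptions: the temporary file, its whitespace-separated layout and the gene/cell names are made up and stand in for the real matrix.

```r
# Hypothetical stand-in for the big matrix file: a header line plus three rows
path <- tempfile(fileext = ".txt")
writeLines(c("gene cell_1 cell_2",
             "GeneA 0 3",
             "GeneB 5 1",
             "GeneC 2 2"), path)

# Read only the first line to recover the column names
headers <- scan(path, what = character(), nlines = 1, quiet = TRUE)

# Read data rows 2 and 3: skip the header line plus one data row, then take two rows
chunk <- read.table(path, header = FALSE, skip = 2, nrows = 2,
                    col.names = headers)
chunk
```

Looping over increasing skip values would process the file piece by piece, with every piece carrying the right column names.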
You can access and modify a data.frame's column names with the names function:
df <- data.frame(foo = 1:5, bar = 6:10, opt = 11:15)
original_names <- names(df)
original_names
Returns:
[1] "foo" "bar" "opt"
And to assign new names:
names(df) <- c("new_col1", "new_col2", "new_col3")
Now:
df
Returns:
new_col1 new_col2 new_col3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
And to 'undo' the renaming:
names(df) <- original_names
And df has again its original names:
foo bar opt
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
I'm very new to R and working on tidying a data set. I have a large number of columns, some of which (in a .CSV file) have names containing several comma-separated names. For example, I need to split and duplicate such a column so that each comma-separated name gets its own column:
However, I may have a more complicated situation, where several columns (with different numerical values) share the same repeated names. These columns should be split (one column per name), and suffixes should be added to the repeated names ('.1', or even '.2' if they are repeated more times), see here:
I am actively exploring how to do it, but still no luck. Any help would be highly appreciated.
Here's one way:
First, let's create some dummy example data using data.table::fread
library(data.table)
dt = fread(
"a b c,d e f,g,h
1 2 3 4 5
1 2 3 4 5", sep=' ')
# a b c,d e f,g,h
#1: 1 2 3 4 5
#2: 1 2 3 4 5
cols = names(dt)
Now we use stringr to count occurrences of commas in the names, and add columns accordingly. We use recycling in the matrix statement to fill the new adjacent columns with the same values
library(stringr)
dt.new = dt[, lapply(cols, function(x) matrix(get(x), NROW(dt), str_count(x, ',')+1L))]
names(dt.new) <- unlist(strsplit(cols, ','))
dt.new
# a b c d e f g h
# 1: 1 2 3 3 4 5 5 5
# 2: 1 2 3 3 4 5 5 5
Similarly, in case you prefer to use a base data.frame rather than data.table we can instead do
dt.new = data.frame(lapply(cols, function(x) matrix(dt[[x]], NROW(dt), str_count(x,',')+1L)))
names(dt.new) <- unlist(strsplit(cols, ','))
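For the more complicated case in the question, where the same name shows up in several columns and needs a '.1' / '.2' suffix, base R's make.unique might be enough. Below is a minimal sketch on made-up data (the data frame and its column names are hypothetical, not the asker's file): columns are duplicated by repeating their index once per comma-separated name, and the split names are then de-duplicated.

```r
# Made-up example: the name "a" appears in three different columns
df <- data.frame(1:2, 3:4, 5:6)
names(df) <- c("a", "b,a", "a")

# Split every column name on commas
parts <- strsplit(names(df), ",", fixed = TRUE)

# Repeat each column once per name it contains (column 2 is taken twice)
out <- df[rep(seq_along(parts), lengths(parts))]

# First occurrence keeps its name, later ones get .1, .2, ...
names(out) <- make.unique(unlist(parts))
out
```

Here the resulting names are a, b, a.1 and a.2, with the two columns split from "b,a" holding the same values.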
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
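One caveat worth keeping in mind: sample(x, size) treats a single number x as 1:x, so a group made of exactly one column could return the wrong index. The sketch below wraps the draw in a small helper to avoid that; `safe_sample` is a name I made up, and the seed is only there to make the run repeatable.

```r
set.seed(1)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

# Column positions per group, as in the explanation above
idx <- split(seq_along(dframe), names(dframe))

# Hypothetical helper: sample positions of x, so a length-one x is returned as is
safe_sample <- function(x, size) x[sample.int(length(x), size)]

# At most 7 columns per group, all of them when the group is smaller
picked <- Map(safe_sample, idx, pmin(7, lengths(idx)))
keep <- unlist(picked)

sub <- dframe[, keep]
# Group membership of the selected columns, taken from the original names
table(names(dframe)[keep])
```

With the example data this selects 7 columns from A, 7 from B and all 6 from C.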
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
lapply(unique(colnames(dframe)),
function(x){
dframe[,if(sum(colnames(dframe) == x) <= nc) which(colnames(dframe) == x) else sample(which(colnames(dframe) == x),nc,replace = F)]
}
))
It might look complicated, but it really just takes all columns of a group if there are nc or fewer, and samples nc random columns if there are more than nc.
And to restore your original column-name scheme, gsub does the trick (the dot is escaped and the whole numeric suffix is matched, so suffixes such as .10 are removed cleanly):
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))
Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry what is the quickest way back to d? Unless I'm mistaken (very possible), {Reshape2} doesn't do this..
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
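If the decimal row names in the rep() output (1, 1.1, 1.2, ...) are not wanted, indexing by integer position and resetting the row names afterwards should give a frame that looks like the original d. A small sketch, with summry typed in by hand so it runs without plyr; `restored` is just an illustrative name.

```r
# The summarised table from the question, written out by hand
summry <- data.frame(value = c(1, 2, 3),
                     cat   = c("A", "A", "B"),
                     freq  = c(3, 1, 2))

# Repeat each row position freq times and drop the freq column
restored <- summry[rep(seq_len(nrow(summry)), summry$freq), c("value", "cat")]

# Replace the 1, 1.1, 1.2, ... row names with plain 1..n
rownames(restored) <- NULL
restored
```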
There is a good table to dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
# Take each row in the source data frame and replicate it
# using its freq value
DF <- sapply(seq_len(nrow(x)),
function(i) x[rep(i, each = x$freq[i]), ],
simplify = FALSE)
# Take the above list and rbind it to create a single DF
# Also subset the result to eliminate the freq column
DF <- subset(do.call("rbind", DF), select = -freq)
# Now apply type.convert to the character coerced factor columns
# to facilitate data type selection for each column
for (i in 1:ncol(DF)) {
DF[[i]] <- type.convert(as.character(DF[[i]]),
na.strings = na.strings,
as.is = as.is, dec = dec)
}
row.names(DF) <- seq(nrow(DF))
DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B