I'm very new to R and am working on tidying a data set. It has a large number of columns, and some column headers (in the .CSV file) contain several comma-separated names. For example, I need to split and duplicate such a column so that each of the comma-separated names gets its own column:
However, I may have a more complicated situation, where several columns (with different numerical values) share the same repeated names. These columns should be split (one column per name), and the repeated names should get suffixes ('.1', or even '.2' if they are repeated more times), see here:
I am actively exploring how to do it, but still no luck. Any help would be highly appreciated.
Here's one way:
First, let's create some dummy example data using data.table::fread
library(data.table)
dt = fread(
"a b c,d e f,g,h
1 2 3 4 5
1 2 3 4 5", sep=' ')
# a b c,d e f,g,h
#1: 1 2 3 4 5
#2: 1 2 3 4 5
cols = names(dt)
Now we use stringr to count the occurrences of commas in the names and add columns accordingly. We use recycling in the matrix call to fill the new adjacent columns with the same values
library(stringr)
dt.new = dt[, lapply(cols, function(x) matrix(get(x), NROW(dt), str_count(x, ',')+1L))]
names(dt.new) <- unlist(strsplit(cols, ','))
dt.new
# a b c d e f g h
# 1: 1 2 3 3 4 5 5 5
# 2: 1 2 3 3 4 5 5 5
Similarly, in case you prefer to use a base data.frame rather than data.table we can instead do
dt.new = data.frame(lapply(cols, function(x) matrix(dt[[x]], NROW(dt), str_count(x,',')+1L)))
names(dt.new) <- unlist(strsplit(cols, ','))
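The snippets above do not add the '.1' / '.2' suffixes the question asks for when a name is repeated. A minimal base R sketch of that step, assuming a made-up table called wide (not the asker's data), could lean on make.unique(), which appends exactly those suffixes to duplicated names:

```r
# Hypothetical wide table whose headers hold comma-separated names.
wide <- data.frame(1:2, 3:4, 5:6)
names(wide) <- c("a", "b,c", "a,c")

# Split every header on commas; parts[[i]] holds the names carried by column i.
parts <- strsplit(names(wide), ",", fixed = TRUE)

# Repeat each column index once per name it carries, then select those columns.
idx <- rep(seq_along(parts), lengths(parts))
expanded <- wide[, idx, drop = FALSE]

# make.unique() leaves the first occurrence alone and appends .1, .2, ... to repeats.
names(expanded) <- make.unique(unlist(parts))
expanded
```

Here the second "a" becomes "a.1" and the second "c" becomes "c.1", each keeping the values of the column it was split from.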
I'm currently working with a huge count matrix from single-cell sequencing ...
So, in order to analyze it with R and my 8 GB of RAM, I had to split it into several sub-matrices.
I simply used split to do that, so I lost the headers of the matrix.
So, I would like to add them back with R or find a better way to split them more efficiently.
My questions are:
1. If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind, but it doesn't really solve the problem.
2. Is there a better way to cut those huge count matrices into multiple parts? (I can't do it through R because I don't have enough RAM, R crashes if I try to import the whole matrix)
If I have an object called headers with all the column names stored inside, is there a way to efficiently add this object to a dataframe? I tried rbind, but it doesn't really solve the problem.
You can add headers to a dataframe like this:
dataframe <- data.frame(c("a", "b","c"),
c("d", "e", "f"))
headers <- c("header_1" , "header_2")
names(dataframe) <- headers
dataframe
header_1 header_2
1 a d
2 b e
3 c f
You could use bash for such tasks.
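Staying in R, one possible way to reattach the names while reading a header-less chunk is the col.names argument of read.table. This is only a sketch: chunk_text and the three column names are invented stand-ins for one of the split files and for the saved header line.

```r
# Hypothetical body of one split file, which no longer has its header line.
chunk_text <- "1 0 3
0 2 5"

# Hypothetical column names saved before splitting. With a real file they could
# be taken from its first line, e.g. via readLines(path, n = 1).
headers <- c("cell_1", "cell_2", "cell_3")

# header = FALSE treats every line as data; col.names supplies the names.
chunk <- read.table(text = chunk_text, header = FALSE, col.names = headers)
chunk
```

For a file on disk, the text argument would be replaced by the path to the chunk.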
You can access and modify a data.frame's column names with the names function:
df <- data.frame(foo = 1:5, bar = 6:10, opt = 11:15)
original_names <- names(df)
original_names
Returns:
[1] "foo" "bar" "opt"
And to assign new names:
names(df) <- c("new_col1", "new_col2", "new_col3")
Now:
df
Returns:
new_col1 new_col2 new_col3
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
And to 'undo' the renaming:
names(df) <- original_names
And df has again its original names:
foo bar opt
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any ideas on how to do this cleanly?
We can use expandRows to expand the rows based on the value in the 'samples' column. We then convert the result to a data.table and, grouped by 'chr', use sprintf to paste the columns together with the within-group row sequence to update the 'samples' column.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
You can achieve this using base R (i.e. avoiding data.table), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
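# Build the repeated rows for a single input row
duplicate_rows <- function(chr, start, end, samples) {
  # One label per repeat, e.g. "1-10-20-s1", "1-10-20-s2"
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", 1:samples)
  # Column names match the input so the result looks like the expected output
  data.frame(chr = chr, start = start, end = end, samples = expanded_samples)
}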
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
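A vectorised base R variant avoids the per-row function altogether. The sketch below assumes the same input as in the question and relies on rep() to repeat row indices and on sequence() to number the copies within each original row:

```r
df <- read.table(text = "chr start end samples
1 10 20 2
2 4 10 3", header = TRUE)

# Repeat each row index as many times as its 'samples' value.
idx <- rep(seq_len(nrow(df)), df$samples)
out <- df[idx, ]

# sequence(c(2, 3)) gives 1 2 1 2 3, i.e. the running number within each original row.
out$samples <- paste(out$chr, out$start, out$end,
                     paste0("s", sequence(df$samples)), sep = "-")

# Reset the row names, which otherwise come out as 1, 1.1, 2, 2.1, ...
rownames(out) <- NULL
out
```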
The above code can be made cleaner by using Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function, which has since moved to the purrrlyr package), but this may be overkill for what you're after.
Example using the DataFrame function from the S4Vectors package (Bioconductor):
library(S4Vectors)
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column gives the number of times to repeat the corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5
Suppose I have a dataframe with 3 columns. I would like to create separate sub-dataframes for each of the unique combinations of a few columns.
For example, suppose we have just 3 columns,
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c)
I would like to get a separate dataframe for each of the unique combinations of Column 'a' and 'b'
I started with using unique to get a list of the unique combinations as the following,
factors <- unique(df[,c('a','b')])
a b
1 1 a
2 5 a
3 2 f
4 3 d
5 4 f
6 5 c
7 3 a
8 2 r
10 3 c
But I am not sure what to do next.
The code below is for illustration purposes. Ideally this will be done through a loop that uses each of the rows in factors to create the dataframes.
library(dplyr)
df_1_a <- df %>% filter(a==1, b=='a')
a b c
1 1 a 0.2
2 1 a 0.9
df_3_a <- df %>% filter(a==3, b=='a')
a b c
1 3 a 0.112
.
.
.
This is kind of dirty, and I'm not sure it answers your question, but try this:
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
d <- paste0(a,b)
df <- data.frame(a,b,c,d)
df_splited <- split(df,df$d)
You obtain a list of dataframes, one for each unique combination of a and b.
You can use split after you get the unique combinations you are after.
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a,b,c,stringsAsFactors = FALSE)
fx <- unique(df[,c('a','b')])
fx_list <- split(fx,rownames(fx))
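As a further sketch, split() also accepts a list of grouping vectors, so the original dataframe can be split on both columns directly without building a helper column. The name pieces is made up for this example; drop = TRUE discards combinations of a and b that never occur in the data.

```r
a <- c(1,5,2,3,4,5,3,2,1,3)
b <- c("a","a","f","d","f","c","a","r","a","c")
c <- c(.2,.6,.4,.545,.98,.312,.112,.4,.9,.5)
df <- data.frame(a, b, c, stringsAsFactors = FALSE)

# One list element per observed (a, b) pair; elements are named like "1.a", "3.c".
pieces <- split(df, list(df$a, df$b), drop = TRUE)

# Pull out the sub-dataframe for a == 1 and b == "a".
pieces[["1.a"]]
```

Any common transformation can then be applied to every piece with lapply(pieces, ...).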
Embarrassingly basic question, but if you don't know.. I need to reshape a data.frame of count summarised data into what it would've looked like before being summarised. This is essentially the reverse of {plyr} count() e.g.
> (d = data.frame(value=c(1,1,1,2,3,3), cat=c('A','A','A','A','B','B')))
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
> (summry = plyr::count(d))
value cat freq
1 1 A 3
2 2 A 1
3 3 B 2
If you start with summry what is the quickest way back to d? Unless I'm mistaken (very possible), {Reshape2} doesn't do this..
Just use rep:
summry[rep(rownames(summry), summry$freq), c("value", "cat")]
# value cat
# 1 1 A
# 1.1 1 A
# 1.2 1 A
# 2 2 A
# 3 3 B
# 3.1 3 B
A variation of this approach can be found in expandRows from my "SOfun" package. If you had that loaded, you would be able to simply do:
expandRows(summry, "freq")
There is a good table-to-dataframe function on the R cookbook website that you can modify slightly. The only modifications were changing 'Freq' -> 'freq' (to be consistent with plyr::count) and making sure the rownames were reset as increasing integers.
expand.dft <- function(x, na.strings = "NA", as.is = FALSE, dec = ".") {
# Take each row in the source data frame table and replicate it
# using the freq value
DF <- sapply(1:nrow(x),
function(i) x[rep(i, each = x$freq[i]), ],
simplify = FALSE)
# Take the above list and rbind it to create a single DF
# Also subset the result to eliminate the freq column
DF <- subset(do.call("rbind", DF), select = -freq)
# Now apply type.convert to the character coerced factor columns
# to facilitate data type selection for each column
for (i in 1:ncol(DF)) {
DF[[i]] <- type.convert(as.character(DF[[i]]),
na.strings = na.strings,
as.is = as.is, dec = dec)
}
row.names(DF) <- seq(nrow(DF))
DF
}
expand.dft(summry)
value cat
1 1 A
2 1 A
3 1 A
4 2 A
5 3 B
6 3 B
I've got a problem I can't resolve.
Example dataset:
company <- c("compA","compB","compC")
compA <- c(1,2,3)
compB <- c(2,3,1)
compC <- c(3,1,2)
df <- data.frame(company,compA,compB,compC)
I want to create a new column holding, for each row, the value from the column whose name appears in that row's "company" column. The resulting values would be:
df$new <- c(1,3,2)
df
The way you have it set up, there's one row and one column for every company, and the rows and columns are in the same order. If that's your real dataset, then as others have said diag(...) is the solution (and you should select that answer).
If your real dataset has more than one instance of a company (e.g., more than one row per company), then this is more general:
# using your df
sapply(1:nrow(df),function(i)df[i,as.character(df$company[i])])
# [1] 1 3 2
# more complex case
set.seed(1) # for reproducible example
newdf <- data.frame(company=LETTERS[sample(1:3,10,replace=T)],
A = sample(1:3,10,replace=T),
B=sample(1:5,10,replace=T),
C=1:10)
head(newdf)
# company A B C
# 1 A 1 5 1
# 2 B 1 2 2
# 3 B 3 4 3
# 4 C 2 1 4
# 5 A 3 2 5
# 6 C 2 2 6
sapply(1:nrow(newdf),function(i)newdf[i,as.character(newdf$company[i])])
# [1] 1 2 4 4 3 6 7 2 5 3
EDIT: eddi's answer is probably better. It is more likely that you would have the dataframe to work with rather than the individual row vectors.
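For reference, a vectorised way to do the same lookup is matrix indexing with a two-column (row, column) index. This is a sketch on the question's example data, and vals and col_idx are names introduced here purely for illustration:

```r
company <- c("compA", "compB", "compC")
df <- data.frame(company,
                 compA = c(1, 2, 3),
                 compB = c(2, 3, 1),
                 compC = c(3, 1, 2))

# Keep only the numeric columns so the matrix stays numeric.
vals <- as.matrix(df[, c("compA", "compB", "compC")])

# For each row, find the position of the column named in 'company'.
col_idx <- match(as.character(df$company), colnames(vals))

# A two-column index matrix picks one (row, column) cell per row.
df$new <- vals[cbind(seq_len(nrow(df)), col_idx)]
df$new
```

Unlike diag, this does not require the rows and columns to be in the same order, and it works when a company appears in several rows.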
I am not sure I understand your question; it is unclear from your description. But it seems you are asking for the diagonal of the data values, since this is where the "name is in the column "company" of the same line". The following will do this:
df$new <- diag(matrix(c(compA,compB,compC), nrow = 3, ncol = 3))
The diag function will return the diagonal of the matrix for you. So I first concatenated the three original vectors into one vector and then specified it to be wrapped into a matrix of three rows and three columns. Then I took the diagonal. The whole thing is then added to the dataframe.
Did that answer your question?