Replicate each row of data.frame when occurrence - r

I am facing a tricky question and would be glad to have some help.
I have a data frame with an ID column whose values take different structures, something like the following:
ID
bbb-5p/mi-98/6134
abb-4p
bbb-5p/mi-98
Every time there is a "/", I would like to duplicate the row. Each row should be duplicated once for each "/" it contains.
The name of each duplicated row should then be the root plus the characters right after the "/".
For example, this:
ID
bbb-5p/mi-98/6134
should give :
ID
bbb-5p
bbb-5p-mi-98
bbb-5p-6134
Also, my initial data frame has 5 variables:
[ID, varA, varB, varC, varD]
Every time there is a "/", I would like to duplicate the entire row, so I expect a new data frame that looks something like this:
newID newvarA newvarB newvarC newvarD
bbb-5p varA(1) varB(1) varC(1) varD(1)
bbb-5p-mi-98 varA(1) varB(1) varC(1) varD(1)
bbb-5p-6134 varA(1) varB(1) varC(1) varD(1)
abb-4p varA(2) varB(2) varC(2) varD(2)
bbb-5p varA(3) varB(3) varC(3) varD(3)
bbb-5p-mi-98 varA(3) varB(3) varC(3) varD(3)
Any idea?
Thank you in advance
Peter

You can accomplish this in base R, using lapply() with a custom function. First, you split your character column on "/", resulting in a list of vectors:
l <- strsplit(df$ID,"/")
Then you apply a user defined function to each element of l using lapply():
l_stacked <- lapply(l, function(x)
if(length(x) > 1) {
c(x[1], paste0(x[1],"-",x[-1])) }
else { x })
The function first checks whether the vector has a length > 1. If so, it keeps the first element (the root) and prefixes each remaining element with that root, separated by "-". If length <= 1, the string didn't contain "/", so it is returned as is. Finally, we flatten the output with unlist() so that it can be converted to a data.frame.
data.frame(ID = unlist(l_stacked))
# ID
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98
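The answer above rebuilds only the ID column, while the question also asks for the other variables to be duplicated. A minimal base R sketch of one way to carry them along, assuming a single hypothetical extra column varA standing in for varA to varD: repeat each row index once per split piece and then overwrite ID.

```r
# Example data; varA is a hypothetical stand-in for the question's varA..varD
df <- data.frame(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"),
                 varA = c(10, 20, 30),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$ID, "/", fixed = TRUE)

# Root first, then root-suffix for every remaining piece
new_id <- lapply(parts, function(x) {
  if (length(x) > 1) c(x[1], paste0(x[1], "-", x[-1])) else x
})

# Repeat each row index once per piece so the other columns follow along
idx <- rep(seq_len(nrow(df)), lengths(new_id))
res <- df[idx, , drop = FALSE]
res$ID <- unlist(new_id)
rownames(res) <- NULL
res
```

The if/else guard is kept on purpose: with a single piece, x[-1] is empty, and paste0() would otherwise turn it into a trailing "-".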

One way to achieve this is the following:
library(dplyr)
library(tidyr)
res <- df %>% mutate(i=row_number(),
ID = strsplit(ID,split='/')) %>%
unnest(ID) %>%
group_by(i) %>%
mutate(ID=ifelse(ID==first(ID),first(ID),paste(first(ID),ID,sep='-'))) %>%
ungroup() %>% select(-i)
### A tibble: 6 x 1
## ID
## <chr>
##1 bbb-5p
##2 bbb-5p-mi-98
##3 bbb-5p-6134
##4 abb-4p
##5 bbb-5p
##6 bbb-5p-mi-98
Notes:
First, create an indexing column i to group by later so that we can group each "root".
Use strsplit to split each row by "/".
tidyr::unnest the result to separate rows.
group_by the created index i and then if the row is the first row, just return the root; otherwise, paste to prepend the root to the row with separator "-".
Finally, ungroup and remove the created index column i.
Data
df <- structure(list(ID = c("bbb-5p/mi-98/6134", "abb-4p", "bbb-5p/mi-98"
)), .Names = "ID", row.names = c(NA, -3L), class = "data.frame")
ID
1 bbb-5p/mi-98/6134
2 abb-4p
3 bbb-5p/mi-98

Here is one option using data.table, where df1 is the data frame from the question. Convert the 'data.frame' to a 'data.table' with setDT(df1, keep.rownames = TRUE), which also creates a row-name column 'rn'. Grouped by 'rn', split 'ID' on "/"; then, within each group, loop over the sequence of rows and paste the first split element together with the current one.
library(data.table)
setDT(df1, keep.rownames=TRUE)[, unlist(strsplit(ID, "/")),
by = rn][, .(ID=sapply(seq_len(.N), function(i)
paste(V1[unique(c(1,i))], collapse="-"))) , rn]
Or an option with dplyr/tidyr/tibble. Create the row-name column with tibble::rownames_to_column and expand to long format with separate_rows. Then, grouped by 'rn', mutate 'ID' by pasting the first element onto the remaining ones when the group has more than one row, and finally remove the 'rn' column.
library(dplyr)
library(tidyr)
library(tibble)
rownames_to_column(df1, var = "rn") %>%
separate_rows(ID, sep="/") %>%
group_by(rn) %>%
mutate(ID = if(n()>1) c(ID[1], paste(ID[1], ID[-1], sep="-")) else ID) %>%
ungroup() %>%
select(-rn)
# ID
# <chr>
#1 bbb-5p
#2 bbb-5p-mi-98
#3 bbb-5p-6134
#4 abb-4p
#5 bbb-5p
#6 bbb-5p-mi-98

Related

R Subsetting text from a comma separated column in a data-frame

I have a data.frame with a column that looks like that:
diagnosis
F.31.2,A.43.2,R.45.2,F.43.1
I want to split this column into two columns, one containing all the values with F and one containing all the other values, resulting in a df with two columns that looks like this:
F other
F.31.2,F.43.1 A.43.2,R.45.2
Thanks in advance
Try the following tidyverse approach. You can separate the rows on "," and then create a group according to the pattern, so that you can reshape to wide format and obtain the expected result:
library(dplyr)
library(tidyr)
#Data
df <- data.frame(diagnosis='F.31.2,A.43.2,R.45.2,F.43.1',stringsAsFactors = F)
#Code
new <- df %>% separate_rows(diagnosis,sep = ',') %>%
mutate(Group=ifelse(grepl('F',diagnosis),'F','Other')) %>%
pivot_wider(values_fn = toString,names_from=Group,values_from=diagnosis)
Output:
# A tibble: 1 x 2
F Other
<chr> <chr>
1 F.31.2, F.43.1 A.43.2, R.45.2
First, use strsplit to split at the commas. Then use grep to find the indexes of the elements containing F, and select or exclude them by multiplying the indexes by 1 or -1 before pasting the results back together.
tmp <- el(strsplit(d$diagnosis, ","))
res <- lapply(c(1, -1), function(x) paste(tmp[grep("F", tmp)*x], collapse=","))
res <- setNames(as.data.frame(res), c("F", "other"))
res
# F other
# 1 F.31.2,F.43.1 A.43.2,R.45.2
Data:
d <- setNames(read.table(text="F.31.2,A.43.2,R.45.2,F.43.1"), "diagnosis")
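One caveat with the negative-index trick: when no element contains F, grep() returns an empty index, so the "other" selection comes back empty as well. A small sketch that sidesteps this with logical indexing and also handles several rows at once. The second data row is hypothetical, and using startsWith() assumes the F codes always begin with the letter F.

```r
# Two rows of example data; the second row is made up for illustration
d <- data.frame(diagnosis = c("F.31.2,A.43.2,R.45.2,F.43.1", "A.10.1,F.20.0"),
                stringsAsFactors = FALSE)

codes <- strsplit(d$diagnosis, ",", fixed = TRUE)

# Logical indexing keeps working even when nothing (or everything) matches
d$F <- vapply(codes, function(x) paste(x[startsWith(x, "F")], collapse = ","),
              character(1))
d$other <- vapply(codes, function(x) paste(x[!startsWith(x, "F")], collapse = ","),
                  character(1))
d[c("F", "other")]
```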

Dictionary-like matching on string in R

I have a dataframe in which a string variable is an informal list of elements that can be split on a symbol. I would like to perform operations on these elements on the basis of another dataset.
e.g. task: Calculate the sum of the elements
df_1 <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"))
df_2 <- data.frame(groups=c("A","B","C","D"), values=c(1:4))
desired <- data.frame(element=c(1:2),groups=c("A,B,C","A,D"),sum=c(6,5))
One option is to split 'groups' on the delimiter "," and expand the rows with separate_rows, join with the key/value dataset ('df_2'), and then, grouped by 'element', get the sum of 'values':
library(tidyverse)
df_1 %>%
separate_rows(groups) %>%
left_join(df_2) %>%
group_by(element) %>%
summarise(groups = toString(groups), sum = sum(values))
# A tibble: 2 x 3
# element groups sum
# <int> <chr> <int>
#1 1 A, B, C 6
#2 2 A, D 5
Another option, in base R, is to use a named key/value vector ('nm1') to look up the values for the split list elements, sum them, and assign the result to a new column in 'df_1':
nm1 <- setNames(df_2$values, df_2$groups)
df_1$sum <- sapply(strsplit(as.character(df_1$groups), ","), function(x) sum(nm1[x]))
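To make the named-vector lookup concrete, here is the same idea spelled out step by step on the question's data. The names lookup and pieces are only illustrative.

```r
df_1 <- data.frame(element = 1:2, groups = c("A,B,C", "A,D"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(groups = c("A", "B", "C", "D"), values = 1:4,
                   stringsAsFactors = FALSE)

# Named vector acting as a dictionary: names are keys, elements are values
lookup <- setNames(df_2$values, df_2$groups)

# Indexing by name returns the values for those keys, in the order requested
lookup[c("A", "D")]

# Split each string into keys, look them up, and sum per row
pieces <- strsplit(df_1$groups, ",", fixed = TRUE)
df_1$sum <- vapply(pieces, function(x) sum(lookup[x]), numeric(1))
df_1
```

A key that is missing from df_2 yields NA in the lookup, so the row sum becomes NA as well; passing na.rm = TRUE to sum() would be one way to ignore unknown keys if that is what you want.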

Split variable on every other row to form two new columns in data.frame

After scraping a pdf, I have a data frame with a chr text var:
df = data.frame(text = c("abc","def","abc","def"))
My question is how to turn it into:
df = data.frame(text1 = c("abc","abc"),text2=c("def","def"))
I am able to index the rows and manually rebuild a new df, but was curious if it could be done within the dplyr pipe.
All solutions I have been able to find involve splitting each row, but not to split whole rows of a variable into new columns.
Using dplyr, you could create a new grouping column (ind) whose values alternate between rows, then group_by ind and create a sequence column (id) so that the data can be spread into two columns.
library(dplyr)
library(tidyr)
df %>%
mutate(ind = rep(c(1, 2),length.out = n())) %>%
group_by(ind) %>%
mutate(id = row_number()) %>%
spread(ind, text) %>%
select(-id)
# `1` `2`
# <fct> <fct>
#1 abc def
#2 abc def
A base R option is to split df into separate data frames, one for each set of alternating rows, using a sequence created with rep, and then cbind them together to form a 2-column data frame.
do.call("cbind", split(df, rep(c(1, 2), length.out = nrow(df))))
# text text
#1 abc def
#3 abc def
We could do this in base R. Use the matrix route to rearrange a vector/column into a matrix and then convert it to data.frame (as.data.frame). As the number of columns is constant i.e. 2, specify that value in ncol
as.data.frame(matrix(df$text, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c('text1', 'text2'))))
# text1 text2
#1 abc def
#2 abc def
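As a quick check of why byrow = TRUE matters in the matrix approach, this small sketch compares the two fill orders on the same vector (the names by_row and by_col are illustrative):

```r
text <- c("abc", "def", "abc", "def")

# Row-wise fill: consecutive elements land in the same row
by_row <- matrix(text, ncol = 2, byrow = TRUE)

# Default column-wise fill: consecutive elements run down a column instead
by_col <- matrix(text, ncol = 2)

by_row[1, ]  # first row holds "abc" and "def"
by_col[1, ]  # first row holds "abc" and "abc"
```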
Or another option is unstack from base R after creating a sequence of alternate ids (making use of the recycling)
unstack(transform(df, val = paste0('text', 1:2)), text ~ val)
# text1 text2
#1 abc def
#2 abc def
Or we can split into a list of vectors and then cbind it together
as.data.frame(do.call(cbind, split(as.character(df$text), 1:2)))
# 1 2
#1 abc def
#2 abc def
Or another option is dcast from data.table
library(data.table)
dcast(setDT(df), rowid(text)~ text)[, text := NULL][]
data
df <- data.frame(text = c("abc","def","abc","def"))

Select unique values

I need to change this function, which does not restrict itself to exact matches. For example, if I want MAPK4, the function also matches MAPK41, AMAPK4, etc. The function must select only the exact values.
Function:
library(dplyr)
df2 <- df %>%
rowwise() %>%
mutate(mutated = paste(mutated_genes[unlist(
lapply(mutated_genes, function(x) grepl(x,genes, ignore.case = T)))], collapse=","),
circuit_name = gsub("", "", circuit_name)) %>%
select(-genes) %>%
data.frame()
data:
df <-structure(list(circuit_name = c("hsa04010__117", "hsa04014__118" ), genes = c("MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP3*,DUSP3*,DUSP3*,DUSP3*,PPM1A,AKT3,AKT3,AKT3,ZAK,MAP3K12,MAP3K13,TRAF2,CASP3,IL1R1,IL1R1,TNFRSF1A,IL1A,IL1A,TNF,RAC1,RAC1,RAC1,RAC1,MAP2K7,MAPK8,MAPK8,MAPK8,MECOM,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,HSPA1A,MAP4K3,MAPK8IP2,MAP4K1", "MAP4K4,DUSP10*,DUSP10*,DUSP10*,DUSP10*,DUSP10*")), class = "data.frame", row.names = c(NA, -2L))
mutated_genes <- c("MAP4K4", "MAP3K12","TRAF2", "CACNG3")
output:
circuit_name mutated
1 hsa04010__117 MAP4K4,TRAF2
2 hsa04014__118 MAP4K4
A base R approach is to split the genes on "," and return those strings that match mutated_genes. The pattern is anchored with ^ and $ so that only whole entries match, not substrings of longer names.
df$mutated <- sapply(strsplit(df$genes, ","), function(x)
toString(grep(paste0("^(", paste0(mutated_genes, collapse = "|"), ")$"), x, value = TRUE)))
df[c(1, 3)]
# circuit_name mutated
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Please note that based on the mutated_genes vector, your expected output is missing MAP3K12 for hsa04010__117.
Here is a tidyverse possibility
df %>%
separate_rows(genes) %>%
filter(genes %in% mutated_genes) %>%
group_by(circuit_name) %>%
summarise(mutated = toString(genes))
## A tibble: 2 x 2
# circuit_name mutated
# <chr> <chr>
#1 hsa04010__117 MAP4K4, MAP3K12, TRAF2
#2 hsa04014__118 MAP4K4
Explanation: We separate comma-separated entries into different rows, then select only those rows where genes %in% mutated_genes and summarise results per circuit_name by concatenating genes entries.
PS. Personally I'd recommend keeping the data in a tidy long format (i.e. don't concatenate entries with toString); that way you have one row per gene, which will make any post-processing of the data much more straightforward.
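We can use str_extract_all, with word boundaries (\\b) around the alternation so that only whole gene names are extracted
library(stringr)
df$mutated <- sapply(str_extract_all(df$genes, paste0("\\b(", paste(mutated_genes,
collapse="|"), ")\\b")), toString)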
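Since the question is about exact matches, another way to avoid regular expressions altogether is to compare whole entries with %in%. A minimal base R sketch, using made-up gene strings that include the near-miss names the question worries about:

```r
# Hypothetical input containing near-miss names (longer names that contain MAP4K4)
genes <- c("MAP4K4,MAP4K41,TRAF2", "AMAP4K4,MAP4K4")
mutated_genes <- c("MAP4K4", "TRAF2")

# Split on commas, keep only entries that are exactly equal to a wanted gene
mutated <- vapply(strsplit(genes, ",", fixed = TRUE), function(x) {
  toString(unique(x[x %in% mutated_genes]))
}, character(1))
mutated
```

The unique() call is optional; it only collapses repeated entries, which the original data has for some genes.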

Sum by aggregating complex paired names in R

In R, I'm trying to aggregate a dataframe based on unique IDs, BUT I need to use some kind of wild card value for the IDs. Meaning I have paired names like this:
lion_tiger
elephant_lion
tiger_lion
And I need the lion_tiger and tiger_lion IDs to be summed together, because the order in the pair does not matter.
Using this dataframe as an example:
df <- data.frame(pair = c("1_3","2_4","2_2","1_2","2_1","4_2","3_1","4_3","3_2"),
value = c("12","10","19","2","34","29","13","3","14"))
So the values for pair IDs, "1_2" and "2_1" need to be summed in a new table. That new row would then read:
1_2 36
Any suggestions? While my example has numbers as the pair IDs, in reality I would need this to work with text (like the "lion_tiger" example above).
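We can split the 'pair' column by _, sort the pieces and paste them back together, and then use the result as the grouping variable to get the sum. The as.numeric call assumes numeric IDs as in the example; for text IDs, sort the pieces directly with sort(x).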
tapply(as.numeric(as.character(df$value)),
sapply(strsplit(as.character(df$pair), '_'), function(x)
paste(sort(as.numeric(x)), collapse="_")), FUN = sum)
Or another option is gsubfn
library(gsubfn)
df$pair <- gsubfn('([0-9]+)_([0-9]+)', ~paste(sort(as.numeric(c(x, y))), collapse='_'),
as.character(df$pair))
df$value <- as.numeric(as.character(df$value))
aggregate(value~pair, df, sum)
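Both snippets above rely on numeric IDs, whereas the question says the real IDs are text. A base R sketch of the same split, sort, paste idea for text pairs, assuming each ID contains exactly one "_" separator (the data and the name key are illustrative):

```r
df <- data.frame(pair = c("lion_tiger", "elephant_lion", "tiger_lion"),
                 value = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# Normalise each pair by sorting its two names, so order no longer matters
key <- vapply(strsplit(df$pair, "_", fixed = TRUE),
              function(x) paste(sort(x), collapse = "_"),
              character(1))

# Sum the values per normalised pair
res <- aggregate(value ~ key, data = data.frame(key = key, value = df$value),
                 FUN = sum)
res
```

If a name can itself contain "_", the split would need a different separator or a stricter pattern.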
Using tidyverse and purrrlyr
df <- data.frame(name=c("lion_tiger","elephant_lion",
"tiger_lion"),value=c(1,2,3),stringsAsFactors=FALSE)
require(tidyverse)
require(purrrlyr)
df %>% separate(col = name, sep = "_", c("A", "B")) %>%
by_row(.collate = "rows",
..f = function(this_row) {
paste0(sort(c(this_row$A, this_row$B)), collapse = "_")
}) %>%
rename(sorted = ".out") %>%
group_by(sorted) %>%
summarize(sum(value))%>%show
## A tibble: 2 x 2
# sorted `sum(value)`
# <chr> <dbl>
#1 elephant_lion 2
#2 lion_tiger 4
