Splitting a column into multiple columns based on 2 conditions - r

I have a large data frame and would like to split one column into several columns based on two conditions: the caret character ^ and the letter following IMM-. With the data below, Column1 would be split into columns named IMM-A, IMM-B, IMM-C, and IMM-W. I tried the separate function, but it only works if you specify the column names, and because my data is not uniform I don't always know what the column names should be.
SampleId Column1
1 IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
The expected output would be:
SampleId IMM-A IMM-B IMM-C IMM-W
1 IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333

The expected output is not entirely clear. Based on the description, we may need:
library(tidyverse)
map(strsplit(df$Column1, "[*+^]"), ~
stack(setNames(as.list(.x[c(FALSE, TRUE)]), .x[c(TRUE, FALSE)])) %>%
group_by(ind) %>%
mutate(rn = row_number()) %>%
spread(ind, values)) %>%
set_names(df$SampleId) %>%
bind_rows(.id = 'SampleId') %>%
select(-rn)
# A tibble: 6 x 5
# SampleId `IMM-A` `IMM-B` `IMM-C` `IMM-W`
# <chr> <chr> <chr> <chr> <chr>
#1 1 010306 6900 1212 <NA>
#2 1 0209 779999 3333 <NA>
#3 2 010306 <NA> 6900 1212
#4 2 0209 <NA> 779999 3333
#5 3 <NA> 010306 6900 1212
#6 3 <NA> 0209 779999 3333
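The alternating index used above can be illustrated in isolation. This is a minimal base R sketch on a single shortened string; the names tokens, keys, values and by_key are made up for illustration. Splitting at *, + or ^ yields tokens that alternate between a key and a value, and a recycled logical index picks out every other element.

```r
# One (shortened) value from Column1
x <- "IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999"

# Split at any of the three delimiters; inside [] they are literal characters
tokens <- strsplit(x, "[*+^]")[[1]]

# The logical index is recycled: odd positions are keys, even positions are values
keys   <- tokens[c(TRUE, FALSE)]
values <- tokens[c(FALSE, TRUE)]

# Group the values by their key
by_key <- split(values, keys)
```

Here by_key is a named list with one character vector per key, which is the information the stack() and spread() steps then reshape.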
Update
Based on the OP's expected output, we expand the data by splitting 'Column1' at the ^ delimiter, then separate 'Column1' into 'colA' and 'colB' at the * delimiter, remove 'colB', and spread to 'wide' format.
df %>%
separate_rows(Column1, sep = "\\^") %>%
separate(Column1, into = c("colA", "colB"), remove = FALSE, sep="[*]") %>%
select(-colB) %>%
spread(colA, Column1, fill = "")
#SampleId IMM-A IMM-B IMM-C IMM-W
#1 1 IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
#2 2 IMM-A*010306+IMM-A*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
#3 3 IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
data
df <- structure(list(SampleId = 1:3, Column1 =
c("IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
"IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
"IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
)), class = "data.frame", row.names = c(NA, -3L))
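The same reshaping can probably also be written with pivot_wider, which tidyr documents as the successor to spread. The following is a sketch, assuming tidyr >= 1.1.0 (for a scalar values_fill); the helper column name locus is invented here and is not part of the original answer.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  SampleId = 1:3,
  Column1 = c(
    "IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
    "IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
    "IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
  ),
  stringsAsFactors = FALSE
)

wide <- df %>%
  # one row per ^-separated block
  separate_rows(Column1, sep = "\\^") %>%
  # 'locus' (hypothetical name) is everything before the first *, e.g. "IMM-A"
  mutate(locus = sub("\\*.*$", "", Column1)) %>%
  # one column per locus; blocks a sample lacks become empty strings
  pivot_wider(names_from = locus, values_from = Column1, values_fill = "")
```

Because the column names are taken from the data, nothing has to be specified in advance, which was the obstacle with separate in the question.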

Related

Trying to match strings from multiple columns and create pair list where matches are found

I have two data frames with string values:
df1 <- data.frame(values = c("apples_x", "oranges_z", "bananas_y", "berries_u", "melons_r"))
df2 = data.frame(values = c('apples','oranges','z','pears','x','bananas','plums','y','h','grapes','q'))
I would like to perform a pairwise comparison between the two data frames by iterating over every row of df2 and assigning a pair number wherever both the fruit and the letter of a df1 value appear in df2.
I want to create a new data frame that stores the pair numbers for the matches found.
Ideally it would look something like this:
df3 %>% head()
values paired
<ch> <int>
1 apples 1
2 x 1
3 oranges 2
4 z 2
5 bananas 3
6 y 3
I tried to separate the values in df1 into two strings, but I am getting back strings with matches on any character.
lapply(df2, FUN=function(x){any(df1==x[[1]] & df1==x[[2]])})
Based on the update, we may filter after splitting the column in 'df1', then create a sequence index and reshape to 'long' format
library(dplyr)
library(tidyr)
df1 %>%
separate(values, into = c('values1', 'values2')) %>%
filter(if_all(everything(), ~ .x %in% df2$values)) %>%
mutate(paired = row_number()) %>%
pivot_longer(cols = -paired, values_to = 'value', names_to = NULL) %>%
select(value, paired)
-output
# A tibble: 6 × 2
value paired
<chr> <int>
1 apples 1
2 x 1
3 oranges 2
4 z 2
5 bananas 3
6 y 3
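For comparison, the same filter-then-number logic can be sketched in base R. This is only an illustration under the assumption that every df1 value has the form fruit_letter; the intermediate names parts, keep and matched are made up.

```r
df1 <- data.frame(values = c("apples_x", "oranges_z", "bananas_y", "berries_u", "melons_r"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(values = c("apples", "oranges", "z", "pears", "x", "bananas",
                             "plums", "y", "h", "grapes", "q"),
                  stringsAsFactors = FALSE)

# Split each df1 value into its fruit and letter
parts <- strsplit(df1$values, "_", fixed = TRUE)

# Keep only the values whose parts are all present in df2
keep <- vapply(parts, function(p) all(p %in% df2$values), logical(1))
matched <- parts[keep]

# One row per part, numbered by the pair it came from
df3 <- data.frame(
  values = unlist(matched),
  paired = rep(seq_along(matched), lengths(matched)),
  stringsAsFactors = FALSE
)
```

The %in% test compares whole strings, so it avoids the partial matches on single characters described in the question.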

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way of achieving this than using rep() on the first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can try the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # needed for str_c()
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
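Another possibility, not taken from the answers above, is tidyr's uncount, which repeats each row according to a count column. The sketch below assumes a single-row input as in the question and that the counts in first add up to the number of values in second; the names tokens and counts are invented for illustration.

```r
library(tidyr)

separate_on_condition <- data.frame(first = "a3,b1,c2", second = "1,2,3,4,5,6",
                                    stringsAsFactors = FALSE)

# Split "a3,b1,c2" into tokens, then into a label and a repeat count
tokens <- strsplit(separate_on_condition$first, ",", fixed = TRUE)[[1]]
counts <- data.frame(
  first = gsub("\\d", "", tokens),              # letters only
  n     = as.integer(gsub("\\D", "", tokens)),  # digits only
  stringsAsFactors = FALSE
)

# Repeat each label n times; uncount drops the n column by default
result <- uncount(counts, n)

# Attach the values from 'second' in order
result$second <- strsplit(separate_on_condition$second, ",", fixed = TRUE)[[1]]
```

Splitting second with strsplit also avoids evaluating the string as R code, which the eval(str2lang(...)) approach does.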

create list from characters in R tibble

I have a tibble with a character column. The character in each row is a set of words like this: "type:mytype,variable:myvariable,variable:myothervariable:asubvariableofthisothervariable". Things like that. I want to either convert this into columns in my tibble (a column "type", a column "variable", and so on; but then I don't really know what to do with my 3rd level words), or convert it to a column list x, so that x has a structure of sublists: x$type, x$variable, x$variable$myothervariable.
I'm not sure which approach is best, and I also don't know how to implement either of the two approaches I suggest here. I should add that I have at most 3 levels, and more 1st level words than "type" and "variable".
Small Reproducible Example:
library(tibble)
df <- tibble(
id = 1:3,
keywords = c(
"type:novel,genre:humor:black,year:2010",
"type:dictionary,language:english,type:bilingual,otherlang:french",
"type:essay,topic:philosophy:purposeoflife,year:2005"
)
)
# expected would be in idea 1:
colnames(df)
# id, keywords, type, genre, year,
# language, otherlang, topic
# on idea 2:
colnames(df)
# id, keywords, keywords.as.list
We can use separate_rows from tidyr to split the 'keywords' column by ,, then with cSplit, split the column 'keywords' into multiple columns at :, reshape to 'long' format with pivot_longer and then reshape back to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
library(splitstackshape)
df %>%
separate_rows(keywords, sep=",") %>%
cSplit("keywords", ":") %>%
pivot_longer(cols = keywords_2:keywords_3, values_drop_na = TRUE) %>%
select(-name) %>%
mutate(rn = rowid(id, keywords_1)) %>%
pivot_wider(names_from = keywords_1, values_from = value) %>%
select(-rn) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 6 x 7
# id type genre year language otherlang topic
# <int> <chr> <chr> <int> <chr> <chr> <chr>
#1 1 novel humor 2010 <NA> <NA> <NA>
#2 1 <NA> black NA <NA> <NA> <NA>
#3 2 dictionary <NA> NA english french <NA>
#4 2 bilingual <NA> NA <NA> <NA> <NA>
#5 3 essay <NA> 2005 <NA> <NA> philosophy
#6 3 <NA> <NA> NA <NA> <NA> purposeoflife
data
df <- structure(list(id = 1:3, keywords = c("type:novel,genre:humor:black,year:2010",
"type:dictionary,language:english,type:bilingual,otherlang:french",
"type:essay,topic:philosophy:purposeoflife,year:2005")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
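The answer above covers the first idea (one column per key). For the second idea from the question, a list per row, a rough base R sketch could look like the following. The function name parse_keywords and the object keyword_lists are hypothetical, and third-level words are simply kept as additional elements of the same vector rather than as nested sublists.

```r
keywords <- c(
  "type:novel,genre:humor:black,year:2010",
  "type:dictionary,language:english,type:bilingual,otherlang:french",
  "type:essay,topic:philosophy:purposeoflife,year:2005"
)

# Hypothetical helper: turn one keywords string into a named list
parse_keywords <- function(s) {
  # split into "key:value[:subvalue]" items, then each item at ":"
  items  <- strsplit(strsplit(s, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)
  keys   <- vapply(items, function(p) p[1], character(1))
  values <- lapply(items, function(p) p[-1])
  # group the values by key; repeated keys are combined into one vector
  lapply(split(values, keys), unlist)
}

keyword_lists <- lapply(keywords, parse_keywords)

keyword_lists[[1]]$genre  # second- and third-level words for 'genre'
keyword_lists[[2]]$type   # both 'type' entries of the second row
```

The resulting list has one element per row and could be stored as a list column if that structure turns out to be convenient.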

How to spread two column dataframe with creating a unique identifier?

Trying to spread two column data to a format where there will be some NA values.
dataframe:
df <- data.frame(Names = c("TXT","LSL","TXT","TXT","TXT","USL","LSL"), Values = c("apple",-2,"orange","banana","pear",10,-1),stringsAsFactors = F)
If a row is TXT, the following rows that are LSL or USL belong to that row.
For example:
In the first row, Name is TXT and Value is apple. The next row is LSL, so its value is apple's LSL, and since there is no USL before the next TXT, apple's USL will be NA.
If a TXT is followed by another TXT, then the LSL and USL values for that row will be NA.
trying to create this:
I tried using spread with row numbers as unique identifier but that's not what I want:
df %>% group_by(Names) %>% mutate(row = row_number()) %>% spread(key = Names,value = Values)
I guess I need to create following full table with NAs then spread but couldn't figure out how.
We can expand the dataset with complete after creating a grouping index based on the occurrence of 'TXT'
library(dplyr)
library(tidyr)
df %>%
group_by(grp = cumsum(Names == 'TXT')) %>%
complete(Names = unique(.$Names)) %>%
ungroup %>%
spread(Names, Values) %>%
select(TXT, LSL, USL)
# A tibble: 4 x 3
# TXT LSL USL
# <chr> <chr> <chr>
#1 apple -2 <NA>
#2 orange <NA> <NA>
#3 banana <NA> <NA>
#4 pear -1 10
In data.table, we can use dcast:
library(data.table)
dcast(setDT(df), cumsum(Names == 'TXT')~Names, value.var = 'Values')[, -1]
# LSL TXT USL
#1: -2 apple <NA>
#2: <NA> orange <NA>
#3: <NA> banana <NA>
#4: -1 pear 10
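The key step in both answers is the grouping index cumsum(Names == 'TXT'), which increases by one at every TXT row. As a sketch, assuming tidyr >= 1.0.0, the same index can be combined with pivot_wider, which fills missing combinations with NA on its own; the column name grp follows the dplyr answer above.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Names = c("TXT", "LSL", "TXT", "TXT", "TXT", "USL", "LSL"),
                 Values = c("apple", -2, "orange", "banana", "pear", 10, -1),
                 stringsAsFactors = FALSE)

# Each TXT row starts a new group: 1 1 2 3 4 4 4
grp <- cumsum(df$Names == "TXT")

wide <- df %>%
  mutate(grp = cumsum(Names == "TXT")) %>%
  # one column per name; groups without an LSL or USL get NA
  pivot_wider(names_from = Names, values_from = Values) %>%
  select(TXT, LSL, USL)
```

Note that Values is a character column here, because c() coerces the numbers to strings when they are mixed with the fruit names.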

Looping and concatenating based on a condition in R

I'm new to R and still struggling with loops.
I'm trying to create a loop where, based on a condition (variable_4 == 1), it will concatenate the content of variable_5, separated by comma.
data1 <- data.frame(
ID = c(123:127),
agent_1 = c('James', 'Lucas','Yousef', 'Kyle', 'Marisa'),
agent_2 = c('Sophie', 'Danielle', 'Noah', 'Alex', 'Marcus'),
agent_3 = c('Justine', 'Adrienne', 'Olivia', 'Janice', 'Josephine'),
Flag_1 = c(1,0,1,0,1),
Flag_2 = c(0,1,0,0,1),
Flag_3 = c(1,0,1,0,1)
)
data1$new_var<- ""
for(i in 2:10){
variable_4 <- paste0("flag_", i)
variable_5 <- paste0("agent_", i)
data1 <- data1 %>%
mutate(!! new_var = case_when(variable_4 == 1,paste(new_var, variable_5, sep=",")))
}
I created new_var in a previous step because the code was giving me an error that the variable was not found. Ideally, the loop will accumulate the contents of variable_5 only if variable_4 is equal to 1, and the result would be one big string, separated by commas.
The loop should paste into the new variable only the names of the agents whose flags are equal to 1. If Flag_1 = 1, then paste the name of the agent into new_var; if not, ignore it. If Flag_2 = 1, then concatenate the name of the agent to new_var, separated by a comma; if not, ignore it, and so on.
You shouldn't need to use a loop for this. The data is in wide format which makes it harder, but if we convert to long format, we can easily find a vectorized solution rather than using a loop.
The pivot_longer function, which requires tidyr version >= 1.0.0, is useful here.
library(tidyr)
library(dplyr)
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_") %>%
group_by(ID) %>%
mutate(new_var = paste0(agent[Flag==1], collapse = ',')) %>%
pivot_wider(names_from = c("group"),
values_from = c('agent', 'Flag'),
names_sep = '_') %>%
ungroup() %>%
select(ID, starts_with('agent'), starts_with('Flag'), new_var)
## A tibble: 5 x 8
# ID agent_1 agent_2 agent_3 Flag_1 Flag_2 Flag_3 new_var
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 123 James Sophie Justine 1 0 1 James,Justine
#2 124 Lucas Danielle Adrienne 0 1 0 Danielle
#3 125 Yousef Noah Olivia 1 0 1 Yousef,Olivia
#4 126 Kyle Alex Janice 0 0 0 ""
#5 127 Marisa Marcus Josephine 1 1 1 Marisa,Marcus,Josephine
Details:
pivot_longer puts our data into a more natural format where each row represents one observation of the variables agent and flag, rather than several:
pivot_longer(data1,
cols = -ID,
names_to = c(".value", "group"),
names_sep = "_")
## A tibble: 15 x 4
# ID group agent Flag
# <int> <chr> <chr> <dbl>
# 1 123 1 James 1
# 2 123 2 Sophie 0
# 3 123 3 Justine 1
# 4 124 1 Lucas 0
# 5 124 2 Danielle 1
# 6 124 3 Adrienne 0
# ...
For each ID, we can then paste together the agents which have flag values of 1. This is easy now that our variables are contained in single columns.
Lastly, we revert back to the wide format with pivot_wider. We also ungroup the data we previously grouped, and re-order the columns to the desired format.
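If reshaping feels heavy for this task, a base R sketch that stays in wide format is also possible. It assumes the agent and flag columns are named agent_1 to agent_3 and Flag_1 to Flag_3 as in the question's data; agents and flags are illustrative names for two matrices with matching column order.

```r
data1 <- data.frame(
  ID = 123:127,
  agent_1 = c("James", "Lucas", "Yousef", "Kyle", "Marisa"),
  agent_2 = c("Sophie", "Danielle", "Noah", "Alex", "Marcus"),
  agent_3 = c("Justine", "Adrienne", "Olivia", "Janice", "Josephine"),
  Flag_1 = c(1, 0, 1, 0, 1),
  Flag_2 = c(0, 1, 0, 0, 1),
  Flag_3 = c(1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Matrices of agent names and of TRUE/FALSE flags, columns in the same order
agents <- as.matrix(data1[paste0("agent_", 1:3)])
flags  <- as.matrix(data1[paste0("Flag_", 1:3)]) == 1

# For each row, keep the flagged agents and join them with commas
data1$new_var <- vapply(
  seq_len(nrow(data1)),
  function(i) paste(agents[i, flags[i, ]], collapse = ","),
  character(1)
)
```

Rows without any flag set end up with an empty string, matching the result of the pivot-based version.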
There are a few different ways to do this in base R, the tidyverse, or a combination of both. If you stick to the tidyverse, consider this.
I have used mtcars in place of your data frame.
#load dplyr or tidyverse
library(tidyverse)
# create data as mtcars
df <- mtcars
# create two new columns flag and agent as rownumbers
df <- df %>%
mutate(flag = paste0("flag", row_number())) %>%
mutate(agent = paste0("agent", row_number()))
# using ifelse in a mutate statement
df2 <- df %>%
mutate(new_column = ifelse(flag == "flag1", yes = paste0(agent, " this is a new variable"), no = flag))
print(df2)
An ifelse statement might be more appropriate if you have only one condition, but if you have many, use case_when instead.
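As a small illustration of the case_when variant with more than one condition, here is a sketch on a made-up three-row data frame. The second condition is invented purely to show the syntax and is not part of the original question.

```r
library(dplyr)

# Tiny illustrative data frame (not the asker's data)
df <- data.frame(flag = c("flag1", "flag2", "flag3"),
                 agent = c("agent1", "agent2", "agent3"),
                 stringsAsFactors = FALSE)

df2 <- df %>%
  mutate(new_column = case_when(
    flag == "flag1" ~ paste0(agent, " this is a new variable"),
    flag == "flag2" ~ agent,  # hypothetical second condition
    TRUE            ~ flag    # fallback for all remaining rows
  ))
```

Conditions are checked in order and the first match wins, so the TRUE line at the end acts as the default branch. All right-hand sides need to be of the same type, here character.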

Resources