R: combine rows with a single column entry into a new row

How can I combine consecutive rows that have an entry in only one column into a single combined value in a new column? For example, when column A has a value but B and C are empty, I would like to merge those row entries into a single value in column D.
The original file is a txt file that looks like this:
A|B|C
1|2|3
1
text
2
[end]
4|5|6
2
1
[end]
df <-read.delim("file.txt", header=TRUE, sep="|", blank.lines.skip = TRUE)
A B C
1 2 3
1
text
2
[end]
4 5 6
2
1
[end]
Desired output data table with the newly added column D:
A B C D
1 2 3 1 text 2 [end]
4 5 6 2 1 [end]
I imagine this would be a combination of the is.na and mutate functions, but I have been unable to find a solution. The code could also include ends_with("[end]"), since each group of rows that I want to combine ends with this text. Any thoughts on this?

Not sure if this is what you need given that the questions about your data structure are unanswered:
library(tidyverse)
df %>%
# change empty cells to NA:
mutate(across(everything(), ~na_if(., ""))) %>%
# filter rows with NA:
filter(if_any(everything(), is.na)) %>%
# contract rows in new column `D`:
summarise(D = str_c(A, collapse = " ")) %>%
# bind original `df` (after mutations) to result:
bind_cols(df %>%
mutate(across(everything(), ~na_if(., ""))) %>%
filter(!if_any(everything(), is.na)), .) %>%
# remove duplicated values in `D`:
mutate(D = ifelse(duplicated(D), NA, D))
A B C D
1 1 2 3 1 text 2 [end]
2 4 5 6 <NA>
Data:
df <- data.frame(
A = c(1,1, "text", 2, "[end]", 4),
B = c(2, "", "", "", "", 5),
C = c(3, "", "", "", "", 6)
)
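The desired output in the question has one `D` value per complete row, so a grouping step is needed when the file holds more than one block. Below is a minimal base R sketch of that idea. It assumes empty cells are empty strings (as in the data above, not `NA`), that a row with both `B` and `C` filled starts a new block, and the names `is_header`, `block` and `result` are illustrative only.

```r
# Example data mirroring the question's file: two blocks, each ending in "[end]".
df <- data.frame(
  A = c("1", "1", "text", "2", "[end]", "4", "2", "1", "[end]"),
  B = c("2", "", "", "", "", "5", "", "", ""),
  C = c("3", "", "", "", "", "6", "", "", ""),
  stringsAsFactors = FALSE
)

# Assumption: a row is a complete ("header") row when B and C are both filled.
is_header <- df$B != "" & df$C != ""

# Each header row starts a new block; the running count is the block id.
block <- cumsum(is_header)

# Collapse the single-entry rows of each block into one string.
D <- tapply(df$A[!is_header], block[!is_header], paste, collapse = " ")

# Attach the collapsed string to the header row of the same block.
result <- df[is_header, ]
result$D <- as.vector(D[as.character(block[is_header])])
rownames(result) <- NULL
result
```

A header row with no single-entry rows after it would get `NA` in `D` under this sketch.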

Related

R: conditionally mutate a variable when columns match in different dataframes

I am attempting to write some R code that assesses whether or not two dataframes have any matches in their columns. If there are matches, one of the columns in the second dataframe should assign a "link" (via the links variable) to the first dataframe using the id column of the first dataframe.
In the event that there are multiple matches, I am trying to get the "link" variable to randomly select one of the matching id's.
Some reproducible code:
library(dplyr)
df1 = data.frame(ids = c(1:5),
var = c("a","a","c","b","b"))
df2 = data.frame(var = c('c','a','b','b','d'),
links = 0)
Ideally, I would like a resulting dataframe that looks like:
var links
1 c 3
2 a 1 or 2
3 b 4 or 5
4 b 4 or 5
5 d 0
where observations in the links column randomly select ids from df1 when df1$var matches df2$var. In the dataframe above, this is denoted by "or".
Note 1: The links column should be a numeric, I only made it character to allow to write the word "or".
Note 2: If there is not a match between df1$var and df2$var, the links column should remain a 0.
So far, I've gone this route, but I'm unsure about what to put after the ~
linked_df = df2 %>%
mutate(links=case_when(links==0 & var %in% df1$var ~
sample(c(df1$ids),n(),replace=T), # unsure about this line
TRUE ~ links))
I think this is what you want. I've left the ids column in the result, but
it can be removed when the sampling is complete.
library(dplyr)
library(tidyr)
df1_nest = df1 %>%
group_by(var) %>%
summarize(ids = list(ids))
safe_sample = function(x, ...) {
if(length(x) == 1) return(x)
sample(x, ...)
}
set.seed(47)
df2 %>%
left_join(df1_nest) %>%
mutate(
links = sapply(ids, \(x) if(is.null(x)) 0L else safe_sample(x, size = 1))
)
# Joining, by = "var"
# var links ids
# 1 c 3 3
# 2 a 1 1, 2
# 3 b 4 4, 5
# 4 b 5 4, 5
# 5 d 0 NULL
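The `safe_sample` helper above exists because `sample(x)` treats a single number `x` as the range `1:x`, so sampling from a lone id such as `3` could return `1` or `2`. A base R sketch of the same matching that sidesteps this by sampling a position rather than a value is shown here. The names `ids_by_var` and `pick_one` are hypothetical helpers, not part of the answer.

```r
df1 <- data.frame(ids = 1:5, var = c("a", "a", "c", "b", "b"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(var = c("c", "a", "b", "b", "d"), links = 0,
                  stringsAsFactors = FALSE)

# Candidate ids for every value of var, e.g. ids_by_var$a is c(1L, 2L).
ids_by_var <- split(df1$ids, df1$var)

# Pick one candidate id at random, or 0 when there is no match.
pick_one <- function(v) {
  candidates <- ids_by_var[[v]]
  if (is.null(candidates)) return(0L)
  # Sample an index, so a single candidate is returned unchanged.
  candidates[sample.int(length(candidates), 1)]
}

set.seed(47)
df2$links <- vapply(df2$var, pick_one, integer(1), USE.NAMES = FALSE)
df2
```

Rows for `a` and `b` vary with the seed; `c` always maps to 3 and `d` stays 0.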
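Something like this could do the trick: map over var, filter the first dataframe for the matching ids, then keep one random match per row:
library(dplyr)
library(purrr)
library(tidyr)
df2 %>%
as_tibble() %>%
# collect all matching ids; slice_sample() below does the random pick,
# which avoids sample() treating a single id n as 1:n
mutate(links = map(var, ~filter(df1, var == .x)$ids),
index = row_number()) %>%
unnest(links, keep_empty = TRUE) %>%
group_by(index) %>%
slice_sample(n = 1) %>%
ungroup() %>%
select(-index)
# # A tibble: 5 × 2
# var links
# <chr> <int>
# 1 c 3
# 2 a 1
# 3 b 4
# 4 b 5
# 5 d NA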

Separate rows with conditions

I have this dataframe separate_on_condition with two columns:
separate_on_condition <- data.frame(first = 'a3,b1,c2', second = '1,2,3,4,5,6')
# first second
# 1 a3,b1,c2 1,2,3,4,5,6
How can I turn it to:
# A tibble: 6 x 2
first second
<chr> <chr>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
where:
a3 will be separated into 3 rows
b1 into 1 row
c2 into 2 rows
Is there a better way on achieving this instead of using rep() on first column and separate_rows() on the second column?
Any help would be much appreciated!
Create a row number column to account for multiple rows.
Split second column on , in separate rows.
For each row extract the data to be repeated along with number of times it needs to be repeated.
library(dplyr)
library(tidyr)
library(stringr)
separate_on_condition %>%
mutate(row = row_number()) %>%
separate_rows(second, sep = ',') %>%
group_by(row) %>%
mutate(first = rep(str_extract_all(first(first), '[a-zA-Z]+')[[1]],
str_extract_all(first(first), '\\d+')[[1]])) %>%
ungroup %>%
select(-row)
# first second
# <chr> <chr>
#1 a 1
#2 a 2
#3 a 3
#4 b 4
#5 c 5
#6 c 6
You can try the following base R option
with(
separate_on_condition,
data.frame(
first = unlist(sapply(
unlist(strsplit(first, ",")),
function(x) rep(gsub("\\d", "", x), as.numeric(gsub("\\D", "", x)))
), use.names = FALSE),
second = eval(str2lang(sprintf("c(%s)", second)))
)
)
which gives
first second
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6
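The `eval(str2lang(...))` step above runs the contents of `second` as R code, which may be undesirable if the strings come from an untrusted source. A sketch of the same base R idea using only `strsplit` and `rep` follows. It assumes a single-row data frame and that each token in `first` is letters followed by a count; the intermediate names are illustrative.

```r
separate_on_condition <- data.frame(first = "a3,b1,c2", second = "1,2,3,4,5,6",
                                    stringsAsFactors = FALSE)

# Split "a3,b1,c2" into tokens, then separate the label from the repeat count.
tokens <- strsplit(separate_on_condition$first, ",", fixed = TRUE)[[1]]
labels <- gsub("\\d", "", tokens)               # "a" "b" "c"
counts <- as.integer(gsub("\\D", "", tokens))   # 3 1 2

# Split the second column into its individual values.
values <- as.numeric(strsplit(separate_on_condition$second, ",", fixed = TRUE)[[1]])

# The counts must account for every value in `second`.
stopifnot(sum(counts) == length(values))

result <- data.frame(first = rep(labels, times = counts), second = values,
                     stringsAsFactors = FALSE)
result
```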
Here is an alternative approach:
add NA to first to get same length
use separate_rows to bring each element to a row
use extract by regex digit to split first into first and helper
group and slice by values in helper
do some tweaking
library(tidyr)
library(dplyr)
library(stringr) # for str_c()
separate_on_condition %>%
mutate(first = str_c(first, ",NA,NA,NA")) %>%
separate_rows(first, second, sep = "[^[:alnum:].]+", convert = TRUE) %>%
extract(first, into = c("first", "helper"), "(.{1})(.{1})", remove=FALSE) %>%
group_by(second) %>%
slice(rep(1:n(), each = helper)) %>%
ungroup() %>%
drop_na() %>%
mutate(second = row_number()) %>%
select(first, second)
first second
<chr> <int>
1 a 1
2 a 2
3 a 3
4 b 4
5 c 5
6 c 6

Remove unwanted letter in data column names in R environment

I have a dataset that contains a large number of columns. Every column is named after a date, in the form x2019.10.10.
I want to remove the letter x and change the date format to 2019-10-10.
How can this be done in R?
One solution would be:
Get rid of x
Replace . with -.
Here I create a dataframe that has similar columns to yours:
df = data.frame(x2019.10.10 = c(1, 2, 3),
x2020.10.10 = c(4, 5, 6))
df
x2019.10.10 x2020.10.10
1 1 4
2 2 5
3 3 6
And then, using dplyr (looks much tidier):
library(dplyr)
names(df) = names(df) %>%
gsub("^x", "", .) %>% # Get rid of the leading x and then (%>%):
gsub("\\.", "-", .) # replace "." with "-"
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
If you do not want to use dplyr, here is how you would do the same thing in base R:
names(df) = gsub("^x", "", names(df))
names(df) = gsub("\\.", "-", names(df))
df
2019-10-10 2020-10-10
1 1 4
2 2 5
3 3 6
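If the names should also be checked as real dates, one option is to parse them with `as.Date` and format them back. This is a sketch under the assumption that every column name follows the `xYYYY.MM.DD` pattern; a name that does not parse would come back as `NA`.

```r
df <- data.frame(x2019.10.10 = c(1, 2, 3),
                 x2020.10.10 = c(4, 5, 6))

# Drop the leading "x", then parse the remainder as year.month.day.
dates <- as.Date(sub("^x", "", names(df)), format = "%Y.%m.%d")

# Guard against names that are not valid dates.
stopifnot(!anyNA(dates))

# format() on a Date yields the ISO form, e.g. "2019-10-10".
names(df) <- format(dates)
names(df)
```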

Combining two variables to create new variable

I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual category.
I have tried using ifelse command but it only shows me the value of IPV_YES.
Dataset I have
My desired outcome
My answer:
library(dplyr)
df %>% mutate(across(everything(), ~ifelse(. == "", NA, as.numeric(.)))) %>%
group_by(ID) %>%
rowwise() %>%
transmute(IPV = sum(c_across(everything()), na.rm = TRUE))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce after converting the '' to NA
library(dplyr)
df <- df %>%
transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
type.convert(as.is = TRUE)
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, then give the value in df$IPV_YES, else give the value in df$IPV_NO from the same row.
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below
replace(df, df == "", NA) %>%
mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
select(ID, IPV) %>%
type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2

Summarising data from when groups are not the same

I have the following dataframe:
df <- data.frame(
ID = c(1,1,1,1,1,1,2,2,2,2,2,2),
group = c("S_1","G_1","G_2","G_3","M_1","M_2","G_1","G_2","S_1","S_2","M_1","M_2"),
CODE = c(0,1,0,0,1,1,0,1,0,0,1,1)
)
ID group CODE
1 1 S_1 0
2 1 G_1 1
3 1 G_2 0
4 1 G_3 0
5 1 M_1 1
6 1 M_2 1
7 2 G_1 0
8 2 G_2 1
9 2 S_1 0
10 2 S_2 0
11 2 M_1 1
12 2 M_2 1
I would like to summarize the CODE column such that for each ID, I end up with one row:
ID CODE
1 1 100,11,0
2 2 01,11,00
for ID==1, I would like to paste G_1,G_2,G_3 without a delimiter (in numeric order). Same goes for M_1 and M_2 and then S_1. Lastly, I would like to add the summarized G, M, and S into one row separating these by a comma (in alphabetic order).
I could potentially remove the numbers and do group_by(group) %>% summarise(CODE=paste(CODE, collapse="")) for the first step. Though I would like the final string to be in alphabetic order.
We can use tidyr::separate to get data in group in different columns based on delimiter (_) and then summarise first by ID and group1 and then by ID to get one string for each ID.
library(dplyr)
df %>%
arrange(ID,group) %>%
tidyr::separate(group, into = c('group1', 'group2'), sep = "_") %>%
group_by(ID, group1) %>%
summarise(CODE = paste(CODE, collapse = "")) %>%
summarise(CODE = toString(CODE))
# A tibble: 2 x 2
# ID CODE
# <dbl> <chr>
#1 1 100, 11, 0
#2 2 01, 11, 00
Without using separate, we can remove everything after "_" and use it as group.
df %>%
arrange(ID,group) %>%
mutate(group = sub('_.*', '', group)) %>%
group_by(ID, group) %>%
summarise(CODE = paste(CODE, collapse = "")) %>%
summarise(CODE = toString(CODE))
Base R solution:
# Order the dataframe and genericise the group vector:
ordered_df <- within(df[with(df, order(ID, group)), ], {
group <- gsub("_.*", "", group)
}
)
# Summarise the dataframe:
aggregate(CODE~ID, do.call("rbind", lapply(split(ordered_df, paste0(ordered_df$ID, ordered_df$group)),
function(x){
data.frame(ID = unique(x$ID), CODE = paste0(x$CODE, collapse = ""))
}
)
), paste, collapse = ",")
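The solutions above sort on the full `group` string, which is alphabetical, so a label such as `G_10` would sort before `G_2`. A base R sketch that sorts on the numeric suffix and joins with a plain comma, as in the desired output, is given below. The helper columns `prefix` and `index` are assumptions introduced for illustration.

```r
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  group = c("S_1", "G_1", "G_2", "G_3", "M_1", "M_2",
            "G_1", "G_2", "S_1", "S_2", "M_1", "M_2"),
  CODE = c(0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Split "G_2" into the letter prefix and the numeric suffix.
df$prefix <- sub("_.*", "", df$group)
df$index <- as.integer(sub(".*_", "", df$group))

# Sort by ID, then prefix (alphabetical), then suffix (numeric).
df <- df[order(df$ID, df$prefix, df$index), ]

# Step 1: paste the codes within each ID and prefix, with no delimiter.
step1 <- aggregate(CODE ~ ID + prefix, data = df, FUN = paste, collapse = "")
step1 <- step1[order(step1$ID, step1$prefix), ]

# Step 2: join the per-prefix strings within each ID with a comma.
result <- aggregate(CODE ~ ID, data = step1, FUN = paste, collapse = ",")
result
```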
