separate a column into multiple variables with unique column names in R - r

Here is how I want my dataframe to look:
record  color  size   height  weight
1       blue   large          heavy
1       red
2       green  small  tall    thin
However, the data (df) appears as follows:
record  vars
1       color = "blue", size = "large"
2       color = "green", size = "small"
2       height = "tall", weight = "thin"
1       color = "red", weight = "heavy"
The code for df:
structure(list(record = c(1L, 2L, 2L, 1L),
               vars = structure(c(1L, 2L, 4L, 3L),
                                .Label = c("color = \"blue\", size = \"large\"",
                                           "color = \"green\", size = \"small\"",
                                           "color = \"red\", weight = \"heavy\"",
                                           "height = \"tall\", weight = \"thin\""),
                                class = "factor")),
          class = "data.frame", row.names = c(NA, -4L))
For each record, I would like to separate the vars column by the "," delimiter and create a new column with the indicated variable name. The record should be repeated if there are multiple values for a particular variable.
I know that to do this with tidyverse I will need to use dplyr::group_by and tidyr::separate; however, I'm not clear how to incorporate the new variable names in the "into" parameter of separate. Do I need some type of regular expression to identify any text prior to an equal sign "=" as the new variable name in "into"? Any suggestions are very welcome!
df %>%
  group_by(record) %>%
  separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")

Since the columns are already almost written as R code defining a list, you could parse/eval them and then unnest_wider
library(tidyverse)
library(rlang) # for parse_expr()

df %>%
  mutate(vars = map(vars, ~ eval(parse_expr(paste('list(', .x, ')'))))) %>%
  unnest_wider(vars)
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
# 1 1 blue large NA NA
# 2 2 green small NA NA
# 3 2 NA NA tall thin

Here is one option with tidyverse. Create a sequence column 'rn', then use separate_rows on the 'vars' column with the "," delimiter, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  mutate(rn = row_number()) %>%
  separate_rows(vars, sep = ",\\s*\\n*") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, into = c("vars1", "vars2"), sep = "\\s*=\\s*") %>%
  pivot_wider(names_from = vars1, values_from = vars2,
              values_fill = list(vars2 = '')) %>%
  select(-rn)
# A tibble: 3 x 5
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
#1 1 blue large "" ""
#2 2 green small "" ""
#3 2 "" "" tall thin

Another way is to convert to 2-column-matrices and merge. We'll need a helper FUNction that converts a vector into a matrix with first row as the header.
FUN <- function(x) {
  m <- matrix(x, 2)
  as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))
}
Then just get rid of non-character stuff and merge.
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", df$record)      # add the record column
Reduce(function(...) merge(..., all = TRUE), l) # merge
# record color weight size height
# 1 1 blue <NA> large <NA>
# 2 1 red heavy <NA> <NA>
# 3 2 green thin small tall
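To see what the helper does, here is FUN applied to the cleaned-up, split vector for the first row (a quick sketch; the input shown is what the gsub/strsplit step produces):
x <- c("color", "blue", "size", "large")   # split vector for row 1 of df$vars
FUN(x)
#   color  size
# 1  blue large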

I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:
record  color  size   height  weight
1       blue   large          heavy
1       red
2       green  small  tall    thin
which shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, the variable color has two values which appear in rows 1 and 2, respectively, as the OP has requested:
The record should be repeated if there are multiple values for a
particular variable
All other variables of record 1 have only one value (or none) and are arranged in row 1.
So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).
I have tried to reproduce this in two different ways: First with data.table which I am more fluent with and then with dplyr/tidyr. Finally, I will propose an alternative presentation of duplicate values using toString().
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
This works in 5 steps:
1. Split multiple key-value pairs in each row and arrange them in separate rows.
2. Remove the double quotes.
3. Split the key-value pairs and arrange them in separate columns.
4. Reshape from long to wide format, where the rows are given by record and by a count of each individual key within record using rowid(), and the columns are given by the keys (variables). Using fct_inorder() ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
5. Drop the helper column from the final result.
To be even more consistent with OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.
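For readers unfamiliar with the two helpers used in step 4, here is a small standalone illustration (hypothetical key vector, mirroring the keys of record 1):
library(data.table)
library(forcats)
keys <- c("color", "size", "color", "weight")
rowid(keys)        # 1 1 2 1 -- running count per key; the answer applies it to (record, V1)
fct_inorder(keys)  # factor with levels in order of first appearance: color, size, weight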
dplyr / tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate_rows(vars, sep = ", ") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, c("key", "val")) %>%
  group_by(record, key) %>%
  mutate(keyid = row_number(key)) %>%
  pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
  arrange(record, keyid) %>%
  select(-keyid)
# A tibble: 3 x 5
# Groups: record [2]
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue large NA heavy
2 1 red NA NA NA
3 2 green small tall thin
The steps are essentially the same as for the data.table approach. The statements
group_by(record, key) %>%
mutate(keyid = row_number(key))
are a replacement for data.table::rowid().
Add the parameter values_fill = list(val = "") to replace the NAs by blanks.
Alternative representation
The following does not aim at reproducing the OP's expected result as closely as possible but proposes an alternative, more concise representation of the result with one row per record.
During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.
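For example, the two colour values of record 1 collapse into a single string:
toString(c("blue", "red"))
# [1] "blue, red"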
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
  separate_rows(vars, sep = ", ") %>%
  mutate(vars = str_remove_all(vars, '"')) %>%
  separate(vars, c("key", "val")) %>%
  pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue, red large NA heavy
2 2 green small tall thin

Related

How can I cross check a data frame if all possible combinations on it exist in another data frame of reference in R using dplyr?

I have two data frames.
The first one that contain all the possible combinations with their corresponding value and looks like this :
first
second
val
Alpha
Beta
10
Alpha
Corn
20
Alpha
Desk
30
Beta
Corn
40
Betea
Desk
50
Corn
Desk
60
Hat
Ian
70
The second one, which comes from the production line, has two columns: a date column and a var column in which all the variables for that date are concatenated:
date        var
2022-01-01  Alpha
2022-02-01  Beta,Corn,Fanta,Epsilon,George,Hat,Ian
I want to find all the combinations in the second data frame and see if they match any combination in the first data frame. If a variable stands alone in the second data frame, as Alpha does on 2022-01-01, it should get the value 0; otherwise it should get the value of the combination.
Ideally I want the resulting data frame to look like this:
date        comb       val
2022-01-01  Alpha        0
2022-02-01  Beta,Corn   40
2022-02-01  Hat,Ian     70
How can I do this in R using dplyr?
library(tidyverse)
first = c("Alpha","Alpha","Alpha","Beta","Beta","Corn","Hat")
second = c("Beta","Corn","Desk","Corn","Desk","Desk","Ian")
val = c(10,20,30,40,50,60,70)
df1 = tibble(first,second,val);df1
date = c(as.Date("2022-01-01"),as.Date("2022-02-01"))
var = c("Alpha","Beta,Corn,Fanta,Epsilon,George,Hat,Ian")
df2 = tibble(date,var);df2
An option is to split the rows of 'df2' on the 'var' column delimiter "," with separate_rows, then, grouped by 'date', build all pairwise combinations of the 'var' values with combn() as tibbles with 'first' and 'second' columns, unnest the list column, join with 'df1', join again with 'df2' (in case some dates are lost because of no matches), coalesce 'first' with 'var', and finally unite 'first' and 'second' into the 'combn' column.
library(dplyr)
library(tidyr)
df2 %>%
  separate_rows(var) %>%
  group_by(date) %>%
  summarise(var = if (length(var) > 1)
      list(combn(var, 2, \(x) tibble(first = x[1], second = x[2]),
                 simplify = FALSE) %>%
             bind_rows)
    else
      list(tibble(first = var, second = var))) %>%
  unnest(var) %>%
  inner_join(df1, by = c("first", "second")) %>%
  full_join(df2, by = "date") %>%
  mutate(first = coalesce(first, var)) %>%
  unite(combn, first, second, sep = ", ") %>%
  select(-var)
-output
# A tibble: 3 × 3
date combn val
<date> <chr> <dbl>
1 2022-02-01 Beta, Corn 40
2 2022-02-01 Hat, Ian 70
3 2022-01-01 Alpha, NA NA
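For intuition, combn() is what generates the candidate pairs from each date's variables; a quick standalone illustration on a hypothetical subset of the 2022-02-01 values:
combn(c("Beta", "Corn", "Hat"), 2, simplify = FALSE)
# a list of the three pairs: (Beta, Corn), (Beta, Hat), (Corn, Hat)
Note that the \(x) lambda shorthand used above requires R >= 4.1; on older versions, replace it with function(x).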

Conditional (row-wise) formating of currency, number, and percentage in R DT (datatable)

I have a column in my DT output (in Shiny) that has a numeric value whose units depend on another column. Some values are percentages, some are currency, and some are plain numbers.
For example, I would like to turn this input...
DefaultFormat  Value
PCT            12345.67
DOLLAR         12345.67
NUMBER         12345.67
...into this DT output:
DefaultFormat  Value
PCT            123.45%
DOLLAR         $12,345
NUMBER         12,345.67
The formatCurrency(), formatPercentage() and formatRound() functions do what I need for each of these respective formats, but they affect the entire column instead of specific cells. On the other hand, formatStyle() can target specific cells in a column based on another column, but I can't figure out a way to have it change the contents rather than the styles.
Furthermore, I tried setting the class using formatStyle() in the hope that in the .css file I could then target, e.g., .pctclass:after and .currencyclass:before, but it ignores the class attribute.
What is a good way to get the conditional behavior of formatStyle() but for numbers, percentages, and currencies?
EDIT: here's a solution borrowing from the approach here: https://stackoverflow.com/a/35657820/6851825
You are seeking to sort a formatted column based on the underlying data instead of its varied formatted appearance. You can do this by using an unformatted helper column to handle the sorting:
library(dplyr)
data.frame(
  stringsAsFactors = FALSE,
  DefaultFormat = c("PCT", "DOLLAR", "NUMBER"),
  Value = c(54.54, 12345.67, 12345.67)
) %>%
  mutate(Value_fmt = case_when(DefaultFormat == "PCT" ~ scales::percent(Value),
                               DefaultFormat == "DOLLAR" ~ scales::dollar(Value),
                               DefaultFormat == "NUMBER" ~ scales::comma(Value),
                               TRUE ~ as.character(Value)) %>%
           forcats::fct_reorder(Value),
         .after = 1) %>%
  DT::datatable(rownames = FALSE, options = list(columnDefs = list(
    list(orderData = 2, targets = 1),
    list(visible = FALSE, targets = 2))))
For example, note how 5 454% appears before the other entries even though it is alphabetically later.
(This is not DT-specific, it wasn't clear if that was a requirement.)
You can group or split and assign:
library(dplyr)
set.seed(2)
dat <- data.frame(fmt = sample(c("PCT","DOLLAR","NUMBER"), 10, replace = TRUE), value = round(runif(10, 10, 9999), 2))
dat %>%
  group_by(fmt) %>%
  mutate(value2 = switch(fmt[1],
                         PCT    = scales::percent(value),
                         DOLLAR = scales::dollar(value),
                         NUMBER = scales::comma(value),
                         as.character(value)))
# # A tibble: 10 x 3
# # Groups: fmt [3]
# fmt value value2
# <chr> <dbl> <chr>
# 1 PCT 1816. 181 621%
# 2 NUMBER 4058. 405 836%
# 3 DOLLAR 8536. $8,536.10
# 4 DOLLAR 9763. $9,763.24
# 5 PCT 2266. 226 577%
# 6 PCT 4453. 445 320%
# 7 PCT 759. 75 897%
# 8 PCT 6622. 662 171%
# 9 PCT 3881. 388 123%
# 10 DOLLAR 8370. $8,369.69
An alternative would be to use case_when, which would come up with very similar results but works one string at a time; this method calls the format function once per group, which is perhaps a bit more efficient. (Over to you whether that's necessary.)
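A minimal sketch of that case_when() variant, using the same dat as above (scales::comma() is assumed for the NUMBER branch, as in the earlier answer):
library(dplyr)

dat %>%
  mutate(value2 = case_when(
    fmt == "PCT"    ~ scales::percent(value),
    fmt == "DOLLAR" ~ scales::dollar(value),
    fmt == "NUMBER" ~ scales::comma(value),
    TRUE            ~ as.character(value)
  ))
Here each scales formatter is evaluated on the whole column and case_when() picks the matching elements row by row.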

Convert columns of R dataframe to JSON

I have this dataframe:
df <- data.frame(
  option_label_1 = c("thickness", "strength", "color"),
  option_value_1 = c("0.5 in", "2 lb", "red"),
  option_label_2 = c("size", "color", "thickness"),
  option_value_2 = c("0.5 Inches x 7200 Feet", "blue", "1 in"),
  option_label_3 = c("stretch", NA, NA),
  option_value_3 = c("wide", NA, NA)
)
option_label_1 option_value_1 option_label_2 option_value_2 option_label_3 option_value_3
1 thickness 0.5 in size 0.5 Inches x 7200 Feet stretch wide
2 strength 2 lb color blue <NA> <NA>
3 color red thickness 1 in <NA> <NA>
I want this data frame:
option_label_1 option_value_1 option_label_2 option_value_2 option_label_3 option_value_3
1 thickness 0.5 in size 0.5 Inches x 7200 Feet stretch wide
2 strength 2 lb color blue <NA> <NA>
3 color red thickness 1 in <NA> <NA>
json
1 {"thickness":"0.5 in","size":"0.5 Inches x 7200 Feet","stretch":"wide"}
2 {"strength":"2 lb","color":"blue"}
3 {"color":"red","thickness":"1 in"}
Essentially I want a JSON column added to the original df built off of the original columns using the option labels and option values. Please note I do not want a solution that converts the whole dataframe to JSON using toJSON. I have a much larger dataframe with other fields I do not want in JSON. I just want the option_labels and their respective option_values to be in JSON.
I have tried using list and paste functions nested in toJSON, but the "option_labels" are static and don't change accordingly in the resulting JSON column.
Thanks for your help!
Here's an option using dplyr and tidyr -
library(dplyr)
library(tidyr)

# Add a row number column to keep track of each row.
# Useful for joining afterwards.
df1 <- df %>% mutate(rownum = row_number())

df1 <- df1 %>%
  # Get the data in long format so that we have option_label and
  # option_value as 2 separate columns. Drop NA values.
  pivot_longer(cols = -rownum,
               names_to = '.value',
               names_pattern = '(option_\\w+)_',
               values_drop_na = TRUE) %>%
  # Create a string with the pattern "column_name" : "column_value"
  mutate(json = sprintf('"%s" : "%s"', option_label, option_value)) %>%
  # For each row...
  group_by(rownum) %>%
  # ...combine the json values into a comma-separated string,
  # surrounded by "{..}".
  summarise(json = sprintf('{%s}', toString(json))) %>%
  # Join to get the original dataframe back with a new column.
  inner_join(df1, by = 'rownum')
#View the output
cat(df1$json, sep = "\n")
#{"thickness" : "0.5 in", "size" : "0.5 Inches x 7200 Feet", "stretch" : "wide"}
#{"strength" : "2 lb", "color" : "blue"}
#{"color" : "red", "thickness" : "1 in"}
You could also do:
library(tibble) # for rownames_to_column()

df %>%
  rownames_to_column('rn') %>%
  pivot_longer(-rn, '.value', names_pattern = '(.*)_', values_drop_na = TRUE) %>%
  group_by(rn) %>%
  summarise(json = jsonlite::toJSON(data.table::transpose(cur_data(), make.names = TRUE)))
# A tibble: 3 x 2
rn json
<chr> <json>
1 1 [{"thickness":"0.5 in","size":"0.5 Inches x 7200 Feet","stretch":"wide"}]
2 2 [{"strength":"2 lb","color":"blue"}]
3 3 [{"color":"red","thickness":"1 in"}]
This solution ultimately worked for me. Thanks to everyone who contributed.
library(jsonlite)

df_1 <- df %>%
  rowwise() %>%
  dplyr::mutate(
    options_json = ifelse(!is.na(option_value_3),
      paste(toJSON(setNames(list(option_label_1 = paste(option_value_1),
                                 option_label_2 = paste(option_value_2),
                                 option_label_3 = paste(option_value_3)),
                            c(option_label_1, option_label_2, option_label_3)),
                   auto_unbox = T)),
      ifelse(!is.na(option_value_2),
        paste(toJSON(setNames(list(option_label_1 = paste(option_value_1),
                                   option_label_2 = paste(option_value_2)),
                              c(option_label_1, option_label_2)),
                     auto_unbox = T)),
        paste(toJSON(setNames(list(option_label_1 = paste(option_value_1)),
                              option_label_1),
                     auto_unbox = T)))))
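If more option pairs are added later, the nested ifelse() calls grow quickly; the same per-row JSON can also be built generically. A hedged sketch, assuming the option_label_*/option_value_* naming from the question:
library(purrr)
library(jsonlite)

# Build one JSON object per row from the label/value column pairs,
# skipping pairs that are NA.
df$options_json <- pmap_chr(
  df[grepl("^option_", names(df))],
  function(...) {
    row  <- list(...)                                # one row's option columns, named
    labs <- unlist(row[grepl("label", names(row))])  # label entries, e.g. "thickness"
    vals <- unlist(row[grepl("value", names(row))])  # matching values, e.g. "0.5 in"
    keep <- !is.na(labs) & !is.na(vals)
    as.character(toJSON(setNames(as.list(vals[keep]), labs[keep]),
                        auto_unbox = TRUE))
  }
)
# df$options_json[1] should then be
# {"thickness":"0.5 in","size":"0.5 Inches x 7200 Feet","stretch":"wide"}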

How can I create new columns with the values of multiple old columns?

First time posting, so if something is wrong please let me know. I have a dataframe in R that is divided in the following way:
location  type  amount  produt
a         x     10      p1
a         x     20      p2
b         x     50      p5
b         y     100     p10
In the end I need to group the locations into a single line and make the "type" column become new columns holding the values of the "produt" and "amount" columns, just like this:
location  A_P_X         A_P_Y
a         p1_10,p2_20
b         p5_50         p10_100
I tried to make the new columns using one-hot encoding, but I ran into problems when trying to fill the new columns based on their original "type" value.
library(tidyverse)
df %>%
  unite("val", produt:amount, sep = "_") %>%
  mutate(type = paste0("A_P_", toupper(type))) %>%
  pivot_wider(names_from = type, values_from = val,
              values_fn = list(val = ~ paste(., collapse = ", ")))
result
# A tibble: 2 x 3
location A_P_X A_P_Y
<chr> <chr> <chr>
1 a p1_10, p2_20 NA
2 b p5_50 p10_100

adding values using rowSums and tidyverse

I am having some issues trying to sum a bunch of columns in R. I am analyzing a huge dataset, so I am reproducing a sample of fake data.
Here's what the data looks like (I have 800 columns).
library(data.table)
dataset <- data.table(name = c("A", "B", "C", "D"), a1 = 1:4, a2 = c(1,2,NaN,5), a3 = 1:4, a4 = 1:4, a5 = c(1,2,NA,5), a6 = 1:4, a8 = 1:4)
dataset
What I want to do is sum the columns in buckets of 100 columns so, for example, all the values in the first row between the first column and the column 100, all the values in the first row between the column 1 and the column 200, all the values in the second row between the first column and the column 100, etc.
Using the sample data I've come up with this solution using rowSums.
dataset %>%
  mutate_if(~ !is.numeric(.x), as.numeric) %>%
  mutate_all(funs(replace_na(., 0))) %>%
  mutate(sum = rowSums(.[, paste("a", 1:3, sep = "")])) %>%
  mutate(sum1 = rowSums(.[, paste("a", 4:5, sep = "")])) %>%
  mutate(sum2 = rowSums(.[, paste("a", 6:8, sep = "")]))
but I am getting the following error:
Error in `[.data.frame`(., , paste("a", 6:8, sep = "")) : undefined columns selected
as the data does not include column a7.
The original data is missing a bunch of columns between a1 and a800 so solving this would be key to make it work.
What would it be the best way to approach and solve this error?
Also, I have a few more questions regarding the code I've written:
Is there a smarter way to select the columns a1 to a100 instead of using this approach, .[, paste("a", 1:3, sep="")]? I am interested in selecting the columns by name. I do not want to select them by position, because sometimes a100 does not mean it is the 100th column.
Also, I am converting the NAs and NaNs to 0 in order to be able to sum the rows. Doing it this way, mutate_all(funs(replace_na(., 0))), I lose my first column, which contains the names of the values. What would be the best way to replace NA and NaN without mutating the string values of the name column to 0?
The type of the columns I am adding is integer, as I converted them beforehand with mutate_if(~!is.numeric(.x), as.numeric). Should I follow the same approach in case I have dbl?
Thank you!
Here is one way to do this after transforming the data to long format: for each name, we create groups of n rows and take the sum.
library(dplyr)
library(tidyr)

n <- 2 # No of columns to bucket. Change this to 100 for your case.
dataset %>%
  pivot_longer(cols = -name, names_to = 'col') %>%
  group_by(name) %>%
  group_by(grp = rep(seq_len(n()), each = n, length.out = n()), add = TRUE) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  # If needed in wider format again
  pivot_wider(names_from = grp, values_from = value, names_prefix = 'col')
# name col1 col2 col3 col4
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 2 2 2 1
#2 B 4 4 4 2
#3 C 3 6 3 3
#4 D 9 8 9 4
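As a side note on the original rowSums() attempt: the undefined-columns error and the NA/NaN handling can both be addressed without reshaping, by selecting columns by name with any_of() (which silently skips names that don't exist, such as a7) and using na.rm = TRUE instead of replacing NAs first. A hedged sketch with the small example buckets:
library(dplyr)

dataset %>%
  mutate(
    sum  = rowSums(across(any_of(paste0("a", 1:3))), na.rm = TRUE),
    sum1 = rowSums(across(any_of(paste0("a", 4:5))), na.rm = TRUE),
    sum2 = rowSums(across(any_of(paste0("a", 6:8))), na.rm = TRUE)  # a7 is simply skipped
  )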
