I have this dataframe:
df <- data.frame(
option_label_1 = c("thickness", "strength", "color"),
option_value_1 = c("0.5 in", "2 lb" , "red"),
option_label_2 = c("size", "color", "thickness"),
option_value_2 = c("0.5 Inches x 7200 Feet", "blue" , "1 in"),
option_label_3 = c("stretch", NA, NA),
option_value_3 = c("wide", NA , NA)
)
option_label_1 option_value_1 option_label_2 option_value_2 option_label_3 option_value_3
1 thickness 0.5 in size 0.5 Inches x 7200 Feet stretch wide
2 strength 2 lb color blue <NA> <NA>
3 color red thickness 1 in <NA> <NA>
I want this data frame:
option_label_1 option_value_1 option_label_2 option_value_2 option_label_3 option_value_3
1 thickness 0.5 in size 0.5 Inches x 7200 Feet stretch wide
2 strength 2 lb color blue <NA> <NA>
3 color red thickness 1 in <NA> <NA>
json
1 {"thickness":"0.5 in","size":"0.5 Inches x 7200 Feet","stretch":"wide"}
2 {"strength":"2 lb","color":"blue"}
3 {"color":"red","thickness":"1 in"}
Essentially I want a JSON column added to the original df built off of the original columns using the option labels and option values. Please note I do not want a solution that converts the whole dataframe to JSON using toJSON. I have a much larger dataframe with other fields I do not want in JSON. I just want the option_labels and their respective option_values to be in JSON.
I have tried using list and paste functions nested in toJSON, but the "option_labels" are static and don't change accordingly in the resulting JSON column.
Thanks for your help!
Here's an option using dplyr and tidyr -
library(dplyr)
library(tidyr)
#Add a row number column to keep track of each row
#Useful for joining afterwards.
df1 <- df %>% mutate(rownum = row_number())
df1 <- df1 %>%
#Get the data in long format so that we have
#option_label and option_value as 2 separate columns
#Drop NA values.
pivot_longer(cols = -rownum,
names_to = '.value',
names_pattern = '(option_\\w+)_',
values_drop_na = TRUE) %>%
#Create a string with the pattern "column_name" : "column_value"
mutate(json = sprintf('"%s" : "%s"', option_label, option_value)) %>%
#for each row
group_by(rownum) %>%
#Combine the json value in a comma separated string.
#Also add "{..}" surrounding them.
summarise(json = sprintf('{%s}', toString(json))) %>%
#Join to get original dataframe back with a new column
inner_join(df1, by = 'rownum')
#View the output
cat(df1$json, sep = "\n")
#{"thickness" : "0.5 in", "size" : "0.5 Inches x 7200 Feet", "stretch" : "wide"}
#{"strength" : "2 lb", "color" : "blue"}
#{"color" : "red", "thickness" : "1 in"}
You could also do:
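library(dplyr)
library(tidyr)
library(tibble) #for rownames_to_column()
df %>%
rownames_to_column('rn') %>%
pivot_longer(-rn, names_to = '.value', names_pattern = '(.*)_', values_drop_na = TRUE) %>%
group_by(rn) %>%
#Note: cur_data() is deprecated as of dplyr 1.1.0; pick(everything()) is its replacement there
summarise(json = jsonlite::toJSON(data.table::transpose(cur_data(),make.names = TRUE)))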
# A tibble: 3 x 2
rn json
<chr> <json>
1 1 [{"thickness":"0.5 in","size":"0.5 Inches x 7200 Feet","stretch":"wide"}]
2 2 [{"strength":"2 lb","color":"blue"}]
3 3 [{"color":"red","thickness":"1 in"}]
This solution ultimately worked for me. Thanks to everyone who contributed.
library(dplyr) #for %>% and rowwise()
library(jsonlite) #for toJSON()
df_1 <- df %>%
rowwise() %>%
dplyr::mutate(
options_json = ifelse(!is.na(option_value_3),
paste(toJSON(setNames(list(option_label_1 = paste(option_value_1), option_label_2 = paste(option_value_2), option_label_3 = paste(option_value_3)), c(option_label_1, option_label_2, option_label_3)), auto_unbox = T)),
ifelse(!is.na(option_value_2),
paste(toJSON(setNames(list(option_label_1 = paste(option_value_1), option_label_2 = paste(option_value_2)), c(option_label_1, option_label_2)), auto_unbox = T)),
paste(toJSON(setNames(list(option_label_1 = paste(option_value_1)), option_label_1), auto_unbox = T))
)
)
)
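The nested ifelse() above is tied to exactly three option pairs. As a rough sketch of a variant that does not depend on the count, assuming the columns always follow the option_label_N / option_value_N naming from the question, the pairs can be built row by row. The names label_cols, value_cols, labels and values are illustrative, not from the original post.

```r
library(jsonlite)

df <- data.frame(
  option_label_1 = c("thickness", "strength", "color"),
  option_value_1 = c("0.5 in", "2 lb", "red"),
  option_label_2 = c("size", "color", "thickness"),
  option_value_2 = c("0.5 Inches x 7200 Feet", "blue", "1 in"),
  option_label_3 = c("stretch", NA, NA),
  option_value_3 = c("wide", NA, NA)
)

# Assumption: every label column has a value column with the same suffix
label_cols <- grep("^option_label_", names(df), value = TRUE)
value_cols <- sub("label", "value", label_cols)

# Character matrices, so factors (R < 4.0 defaults) behave like strings
labels <- as.matrix(df[label_cols])
values <- as.matrix(df[value_cols])

df$json <- vapply(seq_len(nrow(df)), function(i) {
  keep <- !is.na(labels[i, ])                      # skip missing pairs
  pairs <- setNames(as.list(unname(values[i, keep])),
                    unname(labels[i, keep]))       # label -> value
  as.character(toJSON(pairs, auto_unbox = TRUE))
}, character(1))

df$json[2]
# {"strength":"2 lb","color":"blue"}
```

Only the option columns are touched, so any other fields in a larger data frame stay out of the JSON.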
I have a problem that sounds easy, but I could not find a solution in R. I would like to shift the values in each row to the left according to the release year. The first column holds the release year, and the remaining columns are the years in which devices broke (the values are the numbers of broken devices).
This is a solution in Python:
def f(x):
shifted = np.argmin((x.index.astype(int)< x.name[0]))
return x.shift(-shifted)
df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1976` = c(10,NA,NA,NA),
`Year.1977` = c(5,3,NA,NA),
`Year.1978` = c(10,NA,8,NA),
`Year.1979` = c(13,10,5,14)
)
data
# DESIRED
data <- data.frame(
`Delivery Year` = c('1976','1977','1978','1979'),
`Freq` = c(120,100,80,60),
`Year.1` = c(10,3,8,14),
`Year.2` = c(5,NA,5,NA),
`Year.3` = c(10,10,NA,NA),
`Year.4` = c(13,NA,NA,NA)
)
data
In addition, would it also be possible to express the number of broken devices as a percentage of the Freq column?
Thank you
Using tidyverse
library(tidyverse)
data %>%
pivot_longer(!c(Delivery.Year, Freq)) %>%
separate(name, c("Lab", "Year")) %>%
select(-Lab) %>%
mutate_all(as.numeric) %>%
filter(Year >= Delivery.Year) %>%
group_by(Delivery.Year, Freq) %>%
mutate(ind = paste0("Year.", row_number()),
per = value/Freq) %>%
ungroup() %>%
pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = c(value, per))
I first pivoted the data into long form and split the original column names (Year.1976, Year.1977, etc.) to keep just the year, dropping the "Year" prefix. Then I converted all columns to numeric so that comparisons such as Year >= Delivery.Year work. Next I created a column holding the requested titles (Year.1, Year.2, etc.) and calculated the share of Freq (per). Finally, pivot_wider puts the result in the requested format. I was unsure whether you wanted both the original values and the percentages or just the percentages; if you only want the percentages, use values_from = per.
library(dplyr)
f <- function(df) {
years <- paste0("Year.",sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
df1 <- df %>% select(years)
df2 <- df %>% select(-years)
val <- c()
firstyear <- years[1]
for (k in 1:nrow(df1) ) {
vec <- as.numeric(as.vector(df1[k,]))
val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
}
df1[firstyear] <- val
colnames(df1) <- c(paste0("Year.",seq(1:ncol(df1))))
df <- cbind(df2,df1)
print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14
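Note that the output above only fills Year.1 and leaves the other columns in place, so row 2 reads 3, 3, NA, 10 rather than the desired 3, NA, 10, NA. A minimal base R sketch of the left shift itself, assuming every year column is named Year.<yyyy> and the columns are already in ascending year order (year_cols, col_years, shifted and share are illustrative names):

```r
data <- data.frame(
  Delivery.Year = c("1976", "1977", "1978", "1979"),
  Freq = c(120, 100, 80, 60),
  Year.1976 = c(10, NA, NA, NA),
  Year.1977 = c(5, 3, NA, NA),
  Year.1978 = c(10, NA, 8, NA),
  Year.1979 = c(13, 10, 5, 14)
)

year_cols <- grep("^Year\\.\\d+$", names(data), value = TRUE)
col_years <- as.integer(sub("^Year\\.", "", year_cols))
vals <- as.matrix(data[year_cols])

# For each row keep the columns from the delivery year onwards,
# then pad on the right with NA so every row has the same length
shifted <- t(vapply(seq_len(nrow(vals)), function(i) {
  keep <- unname(vals[i, col_years >= as.integer(data$Delivery.Year[i])])
  c(keep, rep(NA_real_, ncol(vals) - length(keep)))
}, numeric(ncol(vals))))
colnames(shifted) <- paste0("Year.", seq_len(ncol(shifted)))

result <- cbind(data[c("Delivery.Year", "Freq")], shifted)

# Broken devices as a share of Freq (each row divided by its own Freq)
share <- shifted / data$Freq
```

Multiplying share by 100 gives percentages.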
I have a dataframe with a large amount of annual data. For example consider the following toy example like so:
dat <- data.frame(id = 1:2, quantity = 3:4, agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)
What I would like to do is the following:
1. Look for columns named "agg_" in the set of column names, names(df)
2. Substitute the "agg_" in names(df) for "change_"
3. Calculate the relative change from year to year, so for example,
df$change_2002 <- df$agg_2002/df$agg_2002 (since 2002 is first year)
df$change_2003 <- df$agg_2003/df$agg_2002
df$change_2004 <- df$agg_2004/df$agg_2003...all the way up to 2020 or the latest value with "agg_" in the column name.
What I have so far is the following function:
func <- function(dat, overwrite = FALSE) {
nms <- grep("agg_[0-9]+$", names(dat), value = TRUE)
revnms <- gsub("agg_", "chg_", nms)
for i = 1:ncol(df) %in% revnms{
dat[, rvnms][i] <- lapply(dat[, rvnms][i], `/`, dat[, rvnms][i-1])
}
dat
}
What I am struggling with is the indexing. How do I get R to make the above calculations recursively without having to do it manually? The desired result is the "chg_" columns appended to the original dataframe:
id quantity agg_2002 agg_2003 agg_2020 chg_2002 chg_2003 chg_2020
1 1 3 5 7 9 1 1.40 1.28
2 2 4 6 8 10 1 1.33 1.25
I would like to modify the specified function above to produce the desired result via lapply if possible. All ideas are welcome. Thank you.
UPDATE: I would much prefer something using lapply or something that can accommodate differing data types
You can reshape the table to long form, change the names (gsub also works), then spread it back to wide form:
library(tidyverse)
library(stringr)
df <- dat %>% pivot_longer(-c(id,quantity), names_to = "agg", values_to = "year") %>%
mutate(agg = str_replace(agg, "agg", "change")) %>%
group_by(id) %>%
mutate(year = ifelse(is.na(lag(year)), year/year, year/lag(year))) %>% # Divide itself if there is no lag(year)
pivot_wider(names_from = "agg", values_from = "year")
inner_join(dat, df, by = c("id","quantity"))
id quantity agg_2002 agg_2003 agg_2020 change_2002 change_2003 change_2020
1 1 3 5 7 9 1 1.400000 1.285714
2 2 4 6 8 10 1 1.333333 1.250000
Here is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dat %>%
pivot_longer(cols = starts_with("agg"),
names_to = "year",
names_prefix = "agg_",
values_to = "agg") %>%
group_by(id) %>%
arrange(year) %>%
mutate(change = agg / lag(agg, 1)) %>%
pivot_wider(names_from = year, values_from = c("agg", "change"))
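Since the question asks for lapply, here is a base R sketch of the same calculation. It assumes the columns match agg_<year> and that the first year should be divided by itself, as in the desired output; add_changes, nms and prev are hypothetical names.

```r
dat <- data.frame(id = 1:2, quantity = 3:4,
                  agg_2002 = 5:6, agg_2003 = 7:8, agg_2020 = 9:10)

add_changes <- function(dat) {
  # agg_ columns, sorted by the year in their name
  nms <- grep("^agg_[0-9]+$", names(dat), value = TRUE)
  nms <- nms[order(as.integer(sub("^agg_", "", nms)))]
  # Previous column for each year; the first year is paired with itself
  prev <- c(nms[1], head(nms, -1))
  # One new chg_ column per agg_ column
  dat[sub("^agg_", "chg_", nms)] <- lapply(seq_along(nms), function(i) {
    dat[[nms[i]]] / dat[[prev[i]]]
  })
  dat
}

out <- add_changes(dat)
```

Indexing by position within nms, rather than by column position in dat, is what avoids the i-1 bookkeeping from the question.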
This question is slightly modified from this one.
I have a dataframe in long table format like this:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
ID name value
1 a broad
1 c 50
1 a mangrove
1 c 50
1 a mangrove
1 c 50
2 a coniferous
2 c 50
About the data: The value 50 in the second row corresponds to the value broad in the first row. Similarly, the value 50 in the fourth row corresponds to the value mangrove in the third row, and so on. In other words, each value for name c belongs to the value for name a in the row before it.
I want to combine the value in such a way that I could get the corresponding values for each name, which would also aggregate the values with similar names:
df2 <- data.frame(ID=c(1,1,2),
name=c("c_broad","c_mangrove","c_coniferous"),
value=c(50,100,50))
which should look like this:
ID name value
1 c_broad 50
1 c_mangrove 100
2 c_coniferous 50
Using reshape2:
library(reshape2)
df1$grp = cumsum(df1$name == "a")
df2 = dcast(df1, ID + grp ~ name)
df2$c = as.numeric(df2$c)
aggregate(c ~ ID + a, df2, sum)
ID a c
1 1 broad 50
2 2 coniferous 50
3 1 mangrove 100
Column names can be changed if desired, also "c_" can be added to the names with paste.
Using tidyverse:
value_a <- df1 %>% dplyr::filter(name=="a") %>% dplyr::pull(value)
df1 %>%
dplyr::filter(name=="c") %>% #Modify into a sensible data frame from here
dplyr::mutate(a = value_a,
name = stringr::str_c(name, "_" ,a)) %>%
dplyr::select(-a) %>% # to here
dplyr::group_by(ID, name) %>%
dplyr::summarise(value=sum(as.numeric(value)))
# A tibble: 3 x 3
# Groups: ID [2]
ID name value
<dbl> <chr> <dbl>
1 1 c_broad 50
2 1 c_mangrove 100
3 2 c_coniferous 50
The main problem with your dataframe is that a single column contains both names and values, and that is the first thing you should fix. My advice is to always convert the original dataframe into a tidy format (https://tidyr.tidyverse.org/articles/tidy-data.html) and from there leverage the full power of tidyverse, data.table, or your framework of choice.
Note that the temporary variable value_a could be computed directly inside the pipeline; I have kept it separate for clarity. The main idea is to separate values and species into different columns (the first three calls in the pipeline) and then apply the usual tidyverse operations.
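The same idea of splitting species and values into separate columns can be sketched in base R. This assumes the rows strictly alternate a, c, a, c as in the example; is_a, tidy and result are illustrative names.

```r
df1 <- data.frame(ID = c(1, 1, 1, 1, 1, 1, 2, 2),
                  name = c("a", "c", "a", "c", "a", "c", "a", "c"),
                  value = c("broad", 50, "mangrove", 50, "mangrove", 50, "coniferous", 50))

# Assumption: every "a" row is immediately followed by its "c" row
is_a <- df1$name == "a"

# One row per pair: the species goes into the name, the number into value
tidy <- data.frame(
  ID = df1$ID[is_a],
  name = paste0("c_", df1$value[is_a]),
  value = as.numeric(as.character(df1$value[!is_a]))
)

# Sum the values for identical ID/name combinations
result <- aggregate(value ~ ID + name, data = tidy, FUN = sum)
result <- result[order(result$ID), ]
```

If the alternation is not guaranteed, a grouping key such as cumsum(df1$name == "a") is the safer way to pair rows.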
Might not be the most elegant, but it works:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50)
)
df1 %>% group_by( 1+floor((1:n()-1)/2) ) %>%
summarize(
ID = ID[1],
name = paste0( name[2], "_", value[1] ),
value = as.numeric(value[2])
) %>% ungroup %>% select( -1 ) %>% group_by(name) %>%
mutate( value = sum(value) ) %>%
unique
Here is an improved version that is actually human-readable:
i <- seq( 1, nrow(df1), 2 )
df1 %>% summarise(
ID = ID[i],
name = paste0( name[i+1], "_", value[i] ),
value = as.numeric(value[i+1])
) %>% group_by(name) %>%
summarize(
ID=ID[1], value = sum( value )
) %>% arrange(ID)
Base R solution:
# Nullify numeric values belonging to a grouping category: grps => character vector
grps <- gsub("\\d+", NA, df1$value)
# Interpolate NA values using prior string value: a => character vector
df1$a <- na.omit(grps)[cumsum(!(is.na(grps)))]
# Split-Apply-Combine aggregation: data.frame => stdout(console)
data.frame(do.call(rbind, lapply(with(df1, split(df1, a)), function(x){
y <- transform(subset(x, !grepl("\\D+", value)), value = as.numeric(value))
setNames(
aggregate(value ~ ID + a, y, FUN = function(z){sum(z, na.rm = TRUE)}),
c("ID", "a", "c")
)
}
)
),
row.names = NULL
)
An additional option:
df1 <- data.frame(ID=c(1,1,1,1,1,1,2,2),
name=c("a","c","a","c","a","c","a","c"),
value=c("broad",50,"mangrove",50,"mangrove",50,"coniferous",50))
library(tidyverse)
df1 %>%
pivot_wider(ID, names_from = name, values_from = value) %>%
unnest(c("a", "c")) %>%
group_by(ID, name = a) %>%
summarise(value = sum(as.numeric(c), na.rm = T), .groups = "drop")
#> # A tibble: 3 x 3
#> ID name value
#> <dbl> <chr> <dbl>
#> 1 1 broad 50
#> 2 1 mangrove 100
#> 3 2 coniferous 50
Created on 2021-04-12 by the reprex package (v2.0.0)
I regularly receive data that is formatted with multiple headers and merged cells (yes..excel). Typically these data come in the form of 2+ merged cells representing sample sites, over the top of a number of observations in columns representing parameters of interest for that site. I am using the "openxlsx" package to read in the data with the read.xlsx function shown below (won't run just for reference):
read.xlsx('Mussels.xlsx',
detectDates = T,
sheet = 2,
fillMergedCells = T,
startRow = 2)
An example: I am currently working with invasive mussel survey data where I have 25 lengths for two species for each of 14 sites, which I've abbreviated here for ease:
lendat <- data.frame(site.a = c("species.1",1,1,1,1),
site.a = c("species.2",2,2,2,2),
site.b = c("species.1",3,3,3,3),
site.b = c("species.2",4,4,4,4),
check.names = F)
I would like to be able to write some code that will re-format these data into long form where the column names become values under a new column named "site", and the first row of data becomes the other column names representing the lengths for each species like this:
data_form <- data.frame(site = c(rep("site.a", 4), rep("site.b",4)),
species.1 = c(1,1,1,1,3,3,3,3),
species.2 = c(2,2,2,2,4,4,4,4))
Update based on @Ronak Shah's answer
Using code from the accepted answer below with the actual data results in a tibble with no data. I discovered that the issue arises with the filter step when decimal values are introduced in the data (actual data contains decimal values). I thought this was a data format issue (example data are all factors) but even when this is true the decimal data are changed into NA's. See example:
lendat <- data.frame(site.a = c("species.1", 1.1,2.2,3,4),
site.a = c("species.2",5,6,7,8),
site.b = c("species.1", 9,10,11,12),
site.b = c("species.2",13,14,15,16),
check.names = F)
str(lendat)
'data.frame': 5 obs. of 4 variables:
$ site.a: Factor w/ 5 levels "1.1","2.2","3",..: 5 1 2 3 4
$ site.a: Factor w/ 5 levels "5","6","7","8",..: 5 1 2 3 4
$ site.b: Factor w/ 5 levels "10","11","12",..: 5 4 1 2 3
$ site.b: Factor w/ 5 levels "13","14","15",..: 5 1 2 3 4
I split the piped code out to go line by line
#Get data in long format
pivot_longer(junk, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('\\D', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
x <- pivot_longer(junk, cols = everything(), names_to = 'site')
x
x <- mutate(x, col = paste0('species', .copy))
x
x <- filter(x, !grepl('\\D', value))
x
x <- select(.data = x, -.copy)
x
x <- group_by(x, col)
x
x <- mutate(x, row = row_number())
x
x <- pivot_wider(x, names_from = col, values_from = value)
x
x <- select(x, -row)
x
x <- arrange(x, site)
x
The code executes but leaves NA's in the final tibble.
Using dplyr and tidyr :
library(dplyr)
library(tidyr)
#Get data in long format
pivot_longer(lendat, cols = everything(), names_to = 'site') %>%
#Create a new column with column names
mutate(col = paste0('species', .copy)) %>%
#Remove the values from the first row
filter(!grepl('[A-Za-z]', value)) %>%
#Remove .copy column which was created
select(-.copy) %>%
#Group by the new column
group_by(col) %>%
#Add a row index
mutate(row = row_number()) %>%
#Get data in wide format
pivot_wider(names_from = col, values_from = value) %>%
#Remove row index
select(-row) %>%
#Arrange data according to site information
arrange(site)
# site species1 species2
# <chr> <chr> <chr>
#1 site.a 1.1 5
#2 site.a 2.2 6
#3 site.a 3 7
#4 site.a 4 8
#5 site.b 9 13
#6 site.b 10 14
#7 site.b 11 15
#8 site.b 12 16
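The reason the original filter dropped the decimal values is that '\\D' matches any non-digit character, including the decimal point, whereas '[A-Za-z]' only matches letters. A small illustration on a made-up vector (the name values is just for this example):

```r
values <- c("species.1", "1.1", "2.2", "3", "4")

# '\\D' flags anything containing a non-digit, so "1.1" and "2.2" are dropped
keep_digits_only <- !grepl("\\D", values)
keep_digits_only
# [1] FALSE FALSE FALSE  TRUE  TRUE

# '[A-Za-z]' only flags entries containing letters, so decimals survive
keep_no_letters <- !grepl("[A-Za-z]", values)
keep_no_letters
# [1] FALSE  TRUE  TRUE  TRUE  TRUE
```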
Here is how I want my dataframe to look:
record  color  size   height  weight
1       blue   large          heavy
1       red
2       green  small  tall    thin
However, the data (df) appears as follows:
record vars
1 color = "blue", size = "large"
2 color = "green", size = "small"
2 height = "tall", weight = "thin"
1 color = "red", weight = "heavy"
The code for df
structure(list(record = c(1L, 2L, 2L, 1L), vars = structure(c(1L,
2L, 4L,
3L), .Label = c("color = \"blue\", size = \"large\"",
"color = \"green\", size = \"small\"", "color = \"red\", weight =
\"heavy\"",
"height = \"tall\", weight = \"thin\""), class = "factor")), class =
"data.frame", row.names = c(NA,
-4L))
For each record, I would like to separate the vars column by the "," delimiter, and create a new column with the indicated variable name...The record should be repeated if there are multiple values for a particular variable
I know that to do this with tidyverse I will need to use dplyr::group_by and dplyr::separate, however I'm not clear how to incorporate the new variable names in the "into" parameter for separate. Do I need some type of regular expression to identify any text prior to an equal sign "=" as the new variable name in "into"?? Any suggestions much welcome!
df %>%
group_by(record) %>%
separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")
Since the columns are already almost written as R code defining a list, you could parse/eval them and then unnest_wider
library(tidyverse)
df %>%
mutate(vars = map(vars, ~ eval(rlang::parse_expr(paste('list(', .x, ')'))))) %>% # parse_expr() is from rlang, which library(tidyverse) does not attach
unnest_wider(vars)
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
# 1 1 blue large NA NA
# 2 2 green small NA NA
# 3 2 NA NA tall thin
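If evaluating strings as code is a concern, the key = "value" pairs can also be pulled out with a regular expression. This is only a sketch, assuming keys are plain words and values are always double-quoted; parse_pairs is a hypothetical helper, not part of any package.

```r
vars <- c('color = "blue", size = "large"',
          'height = "tall", weight = "thin"')

parse_pairs <- function(x) {
  # Every occurrence of: word, equals sign, double-quoted value
  m <- regmatches(x, gregexpr('(\\w+)\\s*=\\s*"([^"]*)"', x))[[1]]
  keys <- sub('\\s*=.*$', '', m)              # text before the "="
  vals <- sub('^[^"]*"([^"]*)"$', '\\1', m)   # text between the quotes
  setNames(vals, keys)
}

parse_pairs(vars[1])
#   color    size
#  "blue" "large"
```

For the factor column in the question, pass as.character(df$vars) element by element, for example with lapply().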
Here is one option with tidyverse. Create a sequence column 'rn', then separate_rows of the 'vars' column based on the ,, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(vars, sep=",\\s*\\n*") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars, into = c("vars1", "vars2"), sep="\\s*=\\s*") %>%
pivot_wider(names_from = vars1, values_from = vars2,
values_fill = list(vars2 = '')) %>%
select(-rn)
# A tibble: 3 x 5
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
#1 1 blue large "" ""
#2 2 green small "" ""
#3 2 "" "" tall thin
Another way is to convert to 2-column-matrices and merge. We'll need a helper FUNction that converts a vector into a matrix with first row as the header.
FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}
Then just get rid of non-character stuff and merge.
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(dat$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", dat$record) # cbind record column
Reduce(function(...) merge(..., all=TRUE), l) # merge
# record color weight size height
# 1 1 blue <NA> large <NA>
# 2 1 red heavy <NA> <NA>
# 3 2 green thin small tall
I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:
record  color  size   height  weight
1       blue   large          heavy
1       red
2       green  small  tall    thin
which shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value-pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, variable color has two values which appear in rows 1 and 2, resp., as the OP has requested:
The record should be repeated if there are multiple values for a particular variable
All other variables of record 1 have only one value (or none) and are arranged in row 1.
So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).
I have tried to reproduce this in two different ways: First with data.table which I am more fluent with and then with dplyr/tidyr. Finally, I will propose an alternative presentation of duplicate values using toString().
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
This works in 5 steps:
1. Split multiple key-value-pairs in each row and arrange in separate rows.
2. Remove double quotes.
3. Split key-value-pairs and arrange in separate columns.
4. Reshape from long to wide format where the rows are given by record and by a count of each individual key within record using rowid() and the columns are given by the keys (variables). Using fct_inorder() ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
5. Drop the helper column from the final result.
To be even more consistent with OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.
dplyr / tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
group_by(record, key) %>%
mutate(keyid = row_number(key)) %>%
pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
arrange(record, keyid) %>%
select(-keyid)
# A tibble: 3 x 5
# Groups: record [2]
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue large NA heavy
2 1 red NA NA NA
3 2 green small tall thin
The steps are essentially the same as for the data.table approach. The statements
group_by(record, key) %>%
mutate(keyid = row_number(key))
are a replacement for data.table::rowid().
Add the parameter values_fill = list(val = "") to replace the NAs with blanks.
Alternative representation
The following does not aim to reproduce OP's expected result as closely as possible, but proposes an alternative, more concise representation of the result with one row per record.
During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue, red large NA heavy
2 2 green small tall thin