Convert a properly formatted string to data frame - r

I have
x<-"1, A | 2, B | 10, C "
x is always formatted this way: | denotes a new row, the first value in each row is variable1, and the second value is variable2.
I would like to convert it to a data.frame
variable1 variable2
1 1 A
2 2 B
3 10 C
I haven't found any package that can treat | as a row separator.
How can I convert it to data.frame?

We may use read.table from base R to read the string into two columns after replacing the | with \n
read.table(text = gsub("|", "\n", x, fixed = TRUE), sep=",",
header = FALSE, col.names = c("variable1", "variable2"), strip.white = TRUE )
-output
variable1 variable2
1 1 A
2 2 B
3 10 C
Or use fread from data.table
library(data.table)
fread(gsub("|", "\n", x, fixed = TRUE), col.names = c("variable1", "variable2"))
variable1 variable2
1: 1 A
2: 2 B
3: 10 C
Or, with the tidyverse, use separate_rows to split the string into rows and then separate to create the two columns
library(tidyr)
library(dplyr)
tibble(x = trimws(x)) %>%
separate_rows(x, sep = "\\s*\\|\\s*") %>%
separate(x, into = c("variable1", "variable2"), sep=",\\s+", convert = TRUE)
# A tibble: 3 × 2
variable1 variable2
<int> <chr>
1 1 A
2 2 B
3 10 C

Here's a way using scan().
x <- "1, A | 2, B | 10, C "
do.call(rbind.data.frame,
strsplit(scan(text=x, what="A", sep='|', quiet=T, strip.white=T), ', ')) |>
setNames(c('variable1', 'variable2'))
# variable1 variable2
# 1 1 A
# 2 2 B
# 3 10 C
Note: R version 4.1.2 (2021-11-01).
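If the same conversion is needed in several places, the read.table approach above could be wrapped in a small helper. This is only a sketch in base R: the name parse_rows and its row_sep and col_names arguments are hypothetical, and it assumes fields never contain commas or the row separator themselves.

```r
# Hypothetical helper: turn "a, b | c, d" style strings into a data frame.
# Assumes "," separates fields and row_sep separates rows.
parse_rows <- function(x, row_sep = "|",
                       col_names = c("variable1", "variable2")) {
  # Replace the row separator with a newline so read.table sees one row per line
  lines <- gsub(row_sep, "\n", x, fixed = TRUE)
  read.table(text = lines, sep = ",", header = FALSE,
             col.names = col_names, strip.white = TRUE,
             stringsAsFactors = FALSE)
}

res <- parse_rows("1, A | 2, B | 10, C ")
str(res)
```

Because read.table guesses column types, variable1 comes back as integer and variable2 as character for this input.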

Related

Turning a text column into a vector in r

I want to see whether the text column has elements outside the specified values of "a" and "b"
specified_value=c("a","b")
df=data.frame(key=c(1,2,3,4),text=c("a,b,c","a,d","1,2","a,b"))
df_out=data.frame(key=c(1,2,3,4),text=c("c","d","1,2",NA))
This is what I have tried:
df=df%>%mutate(text_vector=strsplit(text, split=","),
extra=text_vector[which(!text_vector %in% specified_value)])
But this doesn't work, any suggestions?
We can split 'text' on the , delimiter with separate_rows, group by 'key', keep the elements that are not in 'specified_value' with setdiff, paste them together with toString, and then join back to the original dataset to recover the other columns
library(dplyr) # >= 1.0.0
library(tidyr)
df %>%
separate_rows(text) %>%
group_by(key) %>%
summarise(extra = toString(setdiff(text, specified_value))) %>%
left_join(df) %>%
mutate(extra = na_if(extra, ""))
# A tibble: 4 x 3
# key extra text
# <dbl> <chr> <chr>
#1 1 c a,b,c
#2 2 d a,d
#3 3 1, 2 1,2
#4 4 <NA> a,b
Using setdiff.
df$outside <- sapply({
x <- lapply(strsplit(df$text, ","), setdiff, specified_value)
replace(x, lengths(x) == 0, NA)},
paste, collapse=",")
df
# key text outside
# 1 1 a,b,c c
# 2 2 a,d d
# 3 3 1,2 1,2
# 4 4 a,b NA
Data:
df <- structure(list(key = c(1, 2, 3, 4), text = c("a,b,c", "a,d",
"1,2", "a,b")), class = "data.frame", row.names = c(NA, -4L))
specified_value <- c("a", "b")
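One detail worth checking in the sapply/paste version above: paste() converts an NA placeholder into the string "NA", so the last row holds text rather than a real missing value. A minimal base R sketch that returns a true NA instead, using the same data and the hypothetical column name outside, might look like this:

```r
specified_value <- c("a", "b")
df <- data.frame(key = c(1, 2, 3, 4),
                 text = c("a,b,c", "a,d", "1,2", "a,b"),
                 stringsAsFactors = FALSE)

# For each row, keep the tokens that are not in specified_value;
# return a real NA (not the string "NA") when nothing is left over.
df$outside <- vapply(strsplit(df$text, ",", fixed = TRUE), function(tokens) {
  rest <- setdiff(tokens, specified_value)
  if (length(rest) == 0) NA_character_ else paste(rest, collapse = ",")
}, character(1))

is.na(df$outside)
# [1] FALSE FALSE FALSE  TRUE
```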
Use stringi::stri_split_fixed to test whether a string contains elements outside specified_value:
library(stringi)
!all(stri_split_fixed("a,b", ",", simplify=T) %in% specified_value) #FALSE
!all(stri_split_fixed("a,b,c", ",", simplify=T) %in% specified_value) #TRUE
An option using regex without splitting the data on comma :
#Collapse the specified_value in one string and remove from text
df$text1 <- gsub(paste0(specified_value, collapse = "|"), '', df$text)
#Remove extra commas
df$text1 <- gsub('(?<![a-z0-9]),', '', df$text1, perl = TRUE)
df
# key text text1
#1 1 a,b,c c
#2 2 a,d d
#3 3 1,2 1,2
#4 4 a,b

How to make each row a new set of variables and rename them dynamically in r

First, I want to convert this data:
datinput = read.table(header = TRUE, text = "
var1 var2 var3
A 3 10
B 2 6
")
datinput
var1 var2 var3
1 A 3 10
2 B 2 6
into this format:
datoutput = read.table(header = TRUE, text = "
var2.A var3.A var2.B var3.B
3 10 2 6
")
var2.A var3.A var2.B var3.B
1 3 10 2 6
I tried reshape2::dcast, but it does not deliver the desired output.
Instead, dcast gives this:
datinput%>%reshape2::dcast(var1~var2, value.var="var3")
var1 2 3
1 A NA 10
2 B 6 NA
datinput%>%reshape2::dcast(var1, value.var=c("var2", "var3"))
Error in is.formula(formula) : object 'var1' not found
datinput%>%reshape2::dcast(var1~var1, value.var=c("var2", "var3"))
Error in .subset2(x, i, exact = exact) : subscript out of bounds
In addition: Warning message:
In if (!(value.var %in% names(data))) { :
the condition has length > 1 and only the first element will be used
Then, I want to make the names_from come first in the new names.
I want to have these new columns named A.var2 B.var2 A.var3 B.var3. This is because I want to arrange the resulting variables using the variable names by alphabetical order into A.var2 A.var3 B.var2 B.var3
Thanks for any help.
We can use pivot_wider
library(dplyr)
library(tidyr)
library(stringr)
datinput %>%
pivot_wider( names_from = var1, values_from = c('var2', 'var3'), names_sep=".") %>%
rename_all(~ str_replace(., '^(.*)\\.(.*)', '\\2.\\1'))
reshape2::dcast does not accept multiple value columns. Instead, this can be done with data.table::dcast
library(data.table)
dcast(setDT(datinput), rowid(var1) ~ var1, value.var = c("var2", "var3"), sep=".")
# var1 var2.A var2.B var3.A var3.B
#1: 1 3 2 10 6
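For the second part of the question (group name first, columns in alphabetical order), a dependency-free sketch in base R is also possible. It assumes var1 uniquely identifies each row and that the value columns share a type; value_cols, vals and nms are illustrative names only.

```r
datinput <- data.frame(var1 = c("A", "B"),
                       var2 = c(3L, 2L),
                       var3 = c(10L, 6L),
                       stringsAsFactors = FALSE)

value_cols <- setdiff(names(datinput), "var1")

# Flatten the values row by row: A's var2, A's var3, B's var2, B's var3
vals <- as.vector(t(as.matrix(datinput[value_cols])))

# Build matching names with the group first, e.g. "A.var2"
nms <- as.vector(outer(value_cols, datinput$var1,
                       function(v, g) paste(g, v, sep = ".")))

datoutput <- as.data.frame(setNames(as.list(vals), nms))
# Sort the columns alphabetically by name
datoutput <- datoutput[, sort(names(datoutput)), drop = FALSE]
datoutput
#   A.var2 A.var3 B.var2 B.var3
# 1      3     10      2      6
```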

Extract character list values from data.frame rows and reshape data

I have a variable x with character lists in each row:
dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
stringsAsFactors = F)
I would like to reshape the data so that each row is a unique (id, x) pair such as:
dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)),
x = c('f','o','a','b','r','a','b'))
> dat2
id x
1 a f
2 a o
3 b a
4 b b
5 b r
6 c a
7 c b
I've attempted to do this by splitting the character lists and keeping only the unique list values in each row:
dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)
> dat
id x
1 a f, o
3 b a, b, r
5 c a, b
However, I'm not sure how to proceed with converting the row lists into individual row entries.
How would I accomplish this? Or is there a more efficient way of converting a list of strings to reshape the data as described?
You can use tidytext::unnest_tokens:
library(tidytext)
library(dplyr)
dat %>%
unnest_tokens(x1, x) %>%
distinct()
id x1
1 a f
2 a o
3 b b
4 b a
5 b r
6 c b
7 c a
A base R method with two lines is
#get list of X potential vars
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))
This returns
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
If you don't want to write out the variable names yourself, you can use setNames.
setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
We could use separate_rows
library(tidyverse)
dat %>%
separate_rows(x) %>%
distinct()
# id x
#1 a f
#2 a o
#3 b b
#4 b a
#5 b r
#6 c b
#7 c a
Another option is splitstackshape::cSplit, which splits the x column into multiple columns. gather and filter then reshape the result into the desired output.
library(tidyverse)
library(splitstackshape)
dat %>% cSplit("x", sep=",") %>%
mutate_if(is.factor, as.character) %>%
gather(key, value, -id) %>%
filter(!is.na(value)) %>%
select(-key) %>% unique()
# id value
# 1 a f
# 3 b b
# 5 c b
# 6 a o
# 8 b a
# 10 c a
# 13 b r
Base solution:
temp <- do.call(rbind, apply( dat, 1,
function(z){ data.frame(
id=z[1],
x = scan(text=z['x'], what="",sep=","),
stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
temp[!duplicated(temp),]
#------
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
To get rid of all the messages and warnings:
temp <- do.call(rbind, apply( dat, 1,
function(z){ suppressWarnings(data.frame(id=z[1],
x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
)} ) )
temp[!duplicated(temp),]

Function of strings to tibble

I need to define a function f(x,y) such that:
x = "col1,1,2,3,4"
y = "col2,a,b,c,d"
becomes:
# A tibble: 4 x 2
col1 col2
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
Any thoughts? Thanks.
The most obvious idea that comes to mind is to split the input by comma, use paste to combine the output into a single string, and read that using read_csv.
Example:
paste(do.call(paste, c(strsplit(c(x, y), ","), sep = ", ")), collapse = "\n")
# [1] "col1, col2\n1, a\n2, b\n3, c\n4, d"
library(tidyverse)
read_csv(paste(do.call(paste, c(strsplit(c(x, y), ","), sep = ", ")), collapse = "\n"))
# # A tibble: 4 x 2
# col1 col2
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
From there, I hope you're able to convert the approach to a function.
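As one possible way to do that, here is a sketch of such a function in base R. It returns a plain data.frame rather than a tibble (wrapping the result in tibble::as_tibble() should give the tibble, if that package is available), and it assumes every input string has the column name first and the same number of values.

```r
# Sketch: each argument is a string "name,value1,value2,..."
f <- function(...) {
  parts <- strsplit(c(...), ",", fixed = TRUE)
  # Everything after the first element is data; let R guess the type
  cols <- lapply(parts, function(p) type.convert(p[-1], as.is = TRUE))
  # The first element of each string is the column name
  names(cols) <- vapply(parts, `[`, character(1), 1)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

x <- "col1,1,2,3,4"
y <- "col2,a,b,c,d"
res <- f(x, y)
str(res)
```

Taking `...` instead of exactly two arguments is a design choice here so the same function also works for three or more strings.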

extract key value pairs from R dataframe column

I have a dataframe with two columns. An ID column and a character column containing key value pairs delimited by semicolon.
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"
. | ....
I want to turn this into a dataframe with three columns
ID | Key | Val
1 | zx | 1
1 | ds | 4
1 | xx | 6
2 | qw | 5
2 | df | 2
There is no fixed number of key value pairs in the KeyValPairs column, and no closed set of possible keys. I have been goofing around with solutions that involve looping and reinserting into an empty dataframe, but it is not working properly and I am told I should avoid loops in R.
A tidyr and dplyr approach:
tidyr
library(tidyr)
library(reshape2)
s <- separate(df, KeyValPairs, as.character(1:3), sep=";")
m <- melt(s, id.vars="ID")
out <- separate(m, value, c("Key", "Val"), sep="=")
na.omit(out[order(out$ID),][-2])
# ID Key Val
# 1 1 zx 1
# 3 1 ds 4
# 5 1 xx 6
# 2 2 qw 5
# 4 2 df 2
dplyrish
library(tidyr)
library(dplyr)
df %>%
mutate(KeyValPairs = strsplit(as.character(KeyValPairs), "; ")) %>%
unnest(KeyValPairs) %>%
separate(KeyValPairs, into = c("key", "val"), "=")
# courtesy of @jeremycg
Data
df <- structure(list(ID = c(1, 2), KeyValPairs = structure(c(2L, 1L
), .Label = c(" qw=5; df=2", " zx=1; ds=4; xx=6"), class = "factor")), .Names = c("ID",
"KeyValPairs"), class = "data.frame", row.names = c(NA, -2L))
A data.table solution, just to use tstrsplit:
library(data.table) # V 1.9.6+
setDT(df)[, .(key = unlist(strsplit(as.character(KeyValPairs), ";"))), by = ID
][, c("Key", "Val") := tstrsplit(trimws(key), "=")
][, key := NULL][]
# ID Key Val
#1: 1 zx 1
#2: 1 ds 4
#3: 1 xx 6
#4: 2 qw 5
#5: 2 df 2
Maybe also a case for {splitstackshape} from @AnandaMahto:
df <- read.table(sep = "|", header = TRUE, text = '
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"')
library(splitstackshape)
setNames(
cSplit(cSplit(df, 2, ";", "long"), 2, "="),
c("id", "key", "val")
)
# id key val
# 1: 1 zx 1
# 2: 1 ds 4
# 3: 1 xx 6
# 4: 2 qw 5
# 5: 2 df 2
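For comparison, the same split-then-split idea can be sketched without any packages. This assumes every pair contains exactly one = and that surrounding whitespace is not significant; the intermediate names pairs, flat and kv are illustrative.

```r
df <- data.frame(ID = c(1, 2),
                 KeyValPairs = c("zx=1; ds=4; xx=6", "qw=5; df=2"),
                 stringsAsFactors = FALSE)

# Step 1: split each row into its "key=value" pairs
pairs <- strsplit(as.character(df$KeyValPairs), ";", fixed = TRUE)
flat <- trimws(unlist(pairs))

# Step 2: split every pair into key and value (a two-column matrix)
kv <- do.call(rbind, strsplit(flat, "=", fixed = TRUE))

# Repeat each ID once per pair found in its row
out <- data.frame(ID = rep(df$ID, lengths(pairs)),
                  Key = kv[, 1],
                  Val = type.convert(kv[, 2], as.is = TRUE),
                  stringsAsFactors = FALSE)
out
```

No explicit loop is needed because strsplit and rep are vectorised over the rows.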
