I have a dataframe with two columns: an ID column and a character column containing key-value pairs delimited by semicolons.
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"
. | ....
I want to turn this into a dataframe with three columns
ID | Key | Val
1 | zx | 1
1 | ds | 4
1 | xx | 6
2 | qw | 5
2 | df | 2
There is no fixed number of key value pairs in the KeyValPairs column, and no closed set of possible keys. I have been goofing around with solutions that involve looping and reinserting into an empty dataframe, but it is not working properly and I am told I should avoid loops in R.
A tidyr and dplyr approach:
tidyr
library(tidyr)
library(reshape2)
s <- separate(df, KeyValPairs, paste0("V", 1:3), sep=";") # into must be a character vector in current tidyr
m <- melt(s, id.vars="ID")
out <- separate(m, value, c("Key", "Val"), sep="=")
na.omit(out[order(out$ID),][-2])
# ID Key Val
# 1 1 zx 1
# 3 1 ds 4
# 5 1 xx 6
# 2 2 qw 5
# 4 2 df 2
dplyrish
library(tidyr)
library(dplyr)
df %>%
mutate(KeyValPairs = strsplit(as.character(KeyValPairs), "; ")) %>%
unnest(KeyValPairs) %>%
separate(KeyValPairs, into = c("key", "val"), "=")
# courtesy of @jeremycg
Data
df <- structure(list(ID = c(1, 2), KeyValPairs = structure(c(2L, 1L
), .Label = c(" qw=5; df=2", " zx=1; ds=4; xx=6"), class = "factor")), .Names = c("ID",
"KeyValPairs"), class = "data.frame", row.names = c(NA, -2L))
A data.table solution, just to use tstrsplit:
library(data.table) # V 1.9.6+
setDT(df)[, .(key = unlist(strsplit(as.character(KeyValPairs), ";"))), by = ID
][, c("Key", "Val") := tstrsplit(key, "=")
][, key := NULL][]
# ID Key Val
#1: 1 zx 1
#2: 1 ds 4
#3: 1 xx 6
#4: 2 qw 5
#5: 2 df 2
Maybe also a case for {splitstackshape} from @AnandaMahto:
df <- read.table(sep = "|", header = TRUE, text = '
ID | KeyValPairs
1 | "zx=1; ds=4; xx=6"
2 | "qw=5; df=2"')
library(splitstackshape)
setNames(
cSplit(cSplit(df, 2, ";", "long"), 2, "="),
c("id", "key", "val")
)
# id key val
# 1: 1 zx 1
# 2: 1 ds 4
# 3: 1 xx 6
# 4: 2 qw 5
# 5: 2 df 2
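For comparison, the same reshaping can be sketched in base R without any packages. This is only an illustration, assuming KeyValPairs is stored as plain character strings and every pair contains exactly one "="; the names pairs, kv and out are made up for the example.

```r
# Example input mirroring the question (assumed to be character, not factor)
df <- data.frame(ID = c(1, 2),
                 KeyValPairs = c("zx=1; ds=4; xx=6", "qw=5; df=2"),
                 stringsAsFactors = FALSE)

# Split each string into its "key=value" pieces, tolerating spaces around ";"
pairs <- strsplit(trimws(df$KeyValPairs), "\\s*;\\s*")

# Split every piece on "=" and stack the results into a two-column matrix
kv <- do.call(rbind, strsplit(unlist(pairs), "=", fixed = TRUE))

out <- data.frame(ID  = rep(df$ID, lengths(pairs)),  # repeat each ID once per pair
                  Key = kv[, 1],
                  Val = as.numeric(kv[, 2]),
                  stringsAsFactors = FALSE)
out
```

The lengths(pairs) call is what copes with a varying number of pairs per row: each ID is repeated as many times as its string has pairs.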
I have a data.table of multiple columns that need to be summarized by a particular function depending on the column name.
A | B | C | D
1 | 1 | 1 | x
2 | 2 | 2 | y
3 | 3 | 3 | z
Should become
A | B | C | D
1 | 3 | 6 | z
As per the lookup table:
A | B | C | D
"min" | "max" | "sum" | "max"
We may use match.fun provided the function names in lookup table are valid R functions.
data.frame(Map(function(x, y) match.fun(y)(x), df1, lookup))
# A B C D
#1 1 3 6 z
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3, D = c("x", "y", "z"
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
lookup <- structure(list(A = "min", B = "max", C = "sum", D = "max"),
row.names = c(NA, -1L), class = "data.frame")
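To make the Map/match.fun step above easier to follow, here is a small sketch that builds the same data as plain data frames and aligns the lookup table to the data columns by name before applying the functions. The by-name alignment (lookup[names(df1)]) is an addition for safety, not part of the answer above.

```r
df1 <- data.frame(A = 1:3, B = 1:3, C = 1:3, D = c("x", "y", "z"),
                  stringsAsFactors = FALSE)
lookup <- data.frame(A = "min", B = "max", C = "sum", D = "max",
                     stringsAsFactors = FALSE)

# match.fun() turns a function name (a string) into the function itself
f <- match.fun("max")
f(df1$B)

# Map() walks over the columns of df1 and the matching lookup entries in
# parallel; lookup[names(df1)] puts the lookup columns in the same order
res <- data.frame(
  Map(function(col, fname) match.fun(fname)(col), df1, lookup[names(df1)]),
  stringsAsFactors = FALSE
)
res
```

Note that max() works on character vectors too (it compares alphabetically), which is why column D can be summarised to "z".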
You can't really summarize characters like letters (such as Z), so I assume you mean grouped summary data like this:
# Create data frame:
A <- c(1,2,3)
B <- c(1,2,3)
C <- c(1,2,3)
D <- c("x", "y", "z")
letters <- data.frame(A,B,C,D)
# Load library for summarizing values:
library(dplyr)
# Summarize and group by specific vector:
letters %>%
group_by(D) %>%
summarize(Min_A = min(A),
Max_B = max(B),
Sum_C = sum(C))
Which gives you this:
D Min_A Max_B Sum_C
<chr> <dbl> <dbl> <dbl>
1 x 1 1 1
2 y 2 2 2
3 z 3 3 3
Otherwise if you just mean all the descriptives (min, max, etc.):
# Ungrouped:
letters %>%
summarize(Min_A = min(A),
Max_B = max(B),
Sum_C = sum(C))
Which gives you:
Min_A Max_B Sum_C
1 1 3 6
Alternatively you can name it like this:
# Named Ungrouped:
zdata <- letters %>%
summarize(Min = min(A),
Max = max(B),
Sum = sum(C))
rownames(zdata) <- "Max"
zdata
Which gives you this:
Min Max Sum
Max 1 3 6
Not entirely sure why you want the max label for the rows, but this would achieve both your aims I think. There are many functions like this within dplyr. You can get a background on these functions in a book called R for Data Science!
I have
x<-"1, A | 2, B | 10, C "
x is always formatted this way: | denotes a new row, the first value in each row is variable1 and the second value is variable2.
I would like to convert it to a data.frame
variable1 variable2
1 1 A
2 2 B
3 10 C
I haven't found any package that can treat | as a row separator.
How can I convert it to data.frame?
We may use read.table from base R to read the string into two columns after replacing the | with \n
read.table(text = gsub("|", "\n", x, fixed = TRUE), sep=",",
header = FALSE, col.names = c("variable1", "variable2"), strip.white = TRUE )
-output
variable1 variable2
1 1 A
2 2 B
3 10 C
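A side note on why fixed = TRUE matters in the gsub() call: in a regular expression | is the alternation operator, not a literal pipe. The sketch below compares both forms; y, y_regex and res are illustrative names only.

```r
x <- "1, A | 2, B | 10, C "

# With fixed = TRUE the pattern "|" is treated as a literal pipe character
y <- gsub("|", "\n", x, fixed = TRUE)
cat(y, "\n")

# Without fixed = TRUE, "|" is read as a regular expression, so the literal
# pipes are not replaced and remain in the string
y_regex <- gsub("|", "\n", x)
grepl("|", y_regex, fixed = TRUE)

# The newline-separated text can then be parsed like a small CSV file
res <- read.table(text = y, sep = ",", header = FALSE,
                  col.names = c("variable1", "variable2"),
                  strip.white = TRUE)
res
```

An equivalent way to write the literal match would be to escape the pipe, as in gsub("\\|", "\n", x).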
Or use fread from data.table
library(data.table)
fread(gsub("|", "\n", x, fixed = TRUE), col.names = c("variable1", "variable2"))
variable1 variable2
1: 1 A
2: 2 B
3: 10 C
Or using tidyverse - separate_rows to split the column and then create two columns with separate
library(tidyr)
library(dplyr)
tibble(x = trimws(x)) %>%
separate_rows(x, sep = "\\s*\\|\\s*") %>%
separate(x, into = c("variable1", "variable2"), sep=",\\s+", convert = TRUE)
# A tibble: 3 × 2
variable1 variable2
<int> <chr>
1 1 A
2 2 B
3 10 C
Here's a way using scan().
x <- "1, A | 2, B | 10, C "
do.call(rbind.data.frame,
strsplit(scan(text=x, what="A", sep='|', quiet=TRUE, strip.white=TRUE), ', ')) |>
setNames(c('variable1', 'variable2'))
# variable1 variable2
# 1 1 A
# 2 2 B
# 3 10 C
Note: R version 4.1.2 (2021-11-01).
library(data.table)
DATA=data.table(STUDENT= c(1,2,3,4),
DOG_1= c("a","e","a","c"),
DOG_2= c("a","e","d","b"),
DOG_3= c("a","d","b","c"),
CAT_1= c("c","a","d","c"),
CAT_2= c("c","d","a","b"),
MOUSE_1= c("d","b","e","b"),
MOUSE_2= c("c","a","b","e"),
MOUSE_3= c("a","b","b","e"),
MOUSE_4= c("b","c","a","d"))
The above is what my data looks like. I want to end up with a new table that has one column per animal (DOG, CAT, MOUSE) for each STUDENT,
where 'a' equals 1, 'b' equals 2, 'c' equals 3, 'd' equals 4 and 'e' equals 5. Each value is obtained by converting the letters to numbers and summing them; for example, DOG for STUDENT 1 equals 3 because a + a + a = 1 + 1 + 1.
For a data.table solution, melt 'DATA' into 'long' format by specifying the patterns taken from the column names. Then, grouped by 'STUDENT', loop over the columns specified in .SDcols, use a named vector ('keyval') to replace the letters with their integer values, and sum.
library(data.table)
nm1 <- unique(sub("_\\d+$", "", names(DATA)[-1]))
dt1 <- melt(DATA, id.var = 'STUDENT',
measure = patterns(nm1), value.name = nm1)
keyval <- setNames(1:5, letters[1:5])
dt1[, lapply(.SD, function(x) sum(keyval[x],
na.rm = TRUE)), by = STUDENT, .SDcols = nm1]
-output
# STUDENT DOG CAT MOUSE
#1: 1 3 6 10
#2: 2 14 5 8
#3: 3 7 5 10
#4: 4 8 5 16
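The key step above is the lookup through a named vector. A minimal sketch of just that step, using the DOG letters of students 1 and 2 from the question (dog_1 and dog_2 are illustrative names):

```r
# Named integer vector: the names are the letters, the values are the scores
keyval <- setNames(1:5, letters[1:5])

dog_1 <- c("a", "a", "a")   # DOG_1..DOG_3 for STUDENT 1
dog_2 <- c("e", "e", "d")   # DOG_1..DOG_3 for STUDENT 2

# Indexing by name returns the matching value for every letter
keyval[dog_1]
sum(keyval[dog_1])   # 1 + 1 + 1
sum(keyval[dog_2])   # 5 + 5 + 4

# A letter that is not in the lookup gives NA, hence na.rm = TRUE in the answer
unknown <- unname(keyval["z"])
unknown
```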
A similar option in tidyverse would be
library(dplyr)
library(tidyr)
DATA %>%
pivot_longer(cols = -STUDENT, names_to = c('.value', 'grp'),
names_sep='_') %>%
group_by(STUDENT) %>%
summarise(across(all_of(nm1), ~ sum(keyval[.], na.rm = TRUE)))
# A tibble: 4 x 4
# STUDENT DOG CAT MOUSE
# <dbl> <int> <int> <int>
#1 1 3 6 10
#2 2 14 5 8
#3 3 7 5 10
#4 4 8 5 16
For the sake of completeness, here are two data.table approaches which use the new measure() function (available with data.table version 1.14.1) in the call to melt()
1. Melting, joining with a lookup table on-the-fly, casting
melt(DATA, measure.vars = measure(animal, rn, pattern = "(\\w+)_(\\d)"), value.name = "code")[
.(code = letters[1:5], value = 1:5), on = "code", value := i.value][
, dcast(.SD, STUDENT ~ animal, sum, value.var = "value")]
STUDENT CAT DOG MOUSE
1: 1 6 3 10
2: 2 5 14 8
3: 3 5 7 10
4: 4 5 8 16
2. Melting and summing factor levels
When the letters a to e are turned into factors, the corresponding factor levels get the integer codes 1 to 5.
library(magrittr) # piping used to improve readability
melt(DATA, measure.vars = measure(value.name, rn, pattern = "(\\w+)_(\\d)"))[, rn := NULL][
, lapply(.SD, \(x) factor(x, levels = letters[1:5]) %>% as.integer() %>% sum(na.rm = TRUE)),
by = STUDENT]
STUDENT DOG CAT MOUSE
1: 1 3 6 10
2: 2 14 5 8
3: 3 7 5 10
4: 4 8 5 16
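A small sketch of why the explicit levels argument matters in the factor approach. Without it the integer codes depend on which letters happen to be present; the vector names here are illustrative.

```r
codes <- c("a", "c", "e")

# With the full set of levels, each letter maps to its position in a..e
with_levels <- as.integer(factor(codes, levels = letters[1:5]))
with_levels        # 1 3 5
sum(with_levels)   # 9

# Without explicit levels, only the observed letters become levels,
# so "c" and "e" are coded 1 and 2 instead of 3 and 5
without_levels <- as.integer(factor(c("c", "e")))
without_levels     # 1 2
```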
Another data.table option using melt + dcast
dcast(
melt(DATA, id.var = "STUDENT")[
,
c("variable", "value") := .(gsub("_.*", "", variable),
value = setNames(1:5, c("a", "b", "c", "d", "e"))[value]
)
], STUDENT ~ variable, sum
)
gives
STUDENT CAT DOG MOUSE
1: 1 6 3 10
2: 2 5 14 8
3: 3 5 7 10
4: 4 5 8 16
I have a dataframe:
ID value1 value2
1 wad 11
2 NA NA
3 elf 1
When I do this:
dt[,new:=paste0(value1, value2)]
I get:
ID value1 value2 new
1 wad 11 wad11
2 NA NA NANA
3 elf 1 elf1
How can I ignore NA and also remove value1 and value2? I want to get:
ID new
1 wad11
2 NA
3 elf1
A data.table option
setDT(df)[,
.(ID, new = do.call(paste, c(replace(.SD, is.na(.SD), ""), sep = ""))),
.SDcols = patterns("^value")
]
gives
ID new
1: 1 wad11
2: 2
3: 3 elf1
Data
> dput(df)
structure(list(ID = 1:3, value1 = c("wad", NA, "elf"), value2 = c(11L,
NA, 1L)), class = "data.frame", row.names = c(NA, -3L))
We can do
library(dplyr)
library(stringr) # provides str_c()
df %>%
transmute(ID, new = case_when(is.na(value1) & is.na(value2) ~ "",
TRUE ~str_c(value1, value2, sep='_')))
# ID new
#1 1 wad_11
#2 2
#3 3 elf_1
With data.table :
library(data.table)
setDT(df)
df[is.na(df)] <- ''
df[, new := paste0(value1, value2)][, .(ID, new)]
tidyr::unite has a one-liner solution to this :
tidyr::unite(df, new, value1, value2, na.rm = TRUE)
# ID new
#1 1 wad_11
#2 2
#3 3 elf_1
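The answers above return an empty string for the all-NA row, whereas the question shows NA there. A base R sketch that keeps NA, assuming a row should be NA only when every value column is NA; parts, new and out are illustrative names.

```r
df <- data.frame(ID = 1:3,
                 value1 = c("wad", NA, "elf"),
                 value2 = c(11L, NA, 1L),
                 stringsAsFactors = FALSE)

# Turn each value column into character, with "" in place of NA
parts <- lapply(df[c("value1", "value2")],
                function(col) ifelse(is.na(col), "", as.character(col)))

# Paste the columns together row by row
new <- do.call(paste0, unname(parts))

# Rows where everything was NA end up as "", so turn those back into NA
new[new == ""] <- NA

out <- data.frame(ID = df$ID, new = new, stringsAsFactors = FALSE)
out
```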
I have many data frames that come in such a format:
df1 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2, FORMULA_UNITE = 1:2,
FORMULA_CALCULATE = 1:2, FORMULA_JOIN = 1:2), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2, FORMULA_JOIN = 1:2,
FORMULA_TRANSFORM = 1:2, Group = 1:2), class = "data.frame", row.names = c(NA,
-2L))
View:
df1
ID Name Gender Group FORMULA_RULE FORMULA_TRANSFORM FORMULA_UNITE FORMULA_CALCULATE FORMULA_JOIN
1 1 1 1 1 1 1 1 1 1
2 2 2 2 2 2 2 2 2 2
df2
ID Name Gender FORMULA_RULE FORMULA_META FORMULA_DATA FORMULA_JOIN FORMULA_TRANSFORM Group
1 1 1 1 1 NA 1 1 1 1
2 2 2 2 2 NA 2 2 2 2
I want to write code that works on all such dataframes so that all columns are kept, except that among the columns that start with FORMULA_, only FORMULA_TRANSFORM is selected. Please note that the columns that do NOT start with FORMULA_ are not always the same. That is to say, I cannot simply write code that always selects ID, Name, Gender, Group, and FORMULA_TRANSFORM, because some data frames contain many other columns that do not start with FORMULA_ and that I want to keep.
My attempt to solve this problem is this ugly code which works as expected:
library(tidyverse)
for(i in 1:length(ls(pattern = "df"))){
get(paste0("df", i)) %>%
select(-starts_with("FORMULA"),
(names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T))[!names(get(paste0("df", i))) %>% grep(pattern = "FORMULA", value = T) %in% "FORMULA_TRANSFORM"]) %>%
print
}
Is there a more straightforward way to do this?
With dplyr we can use select, and it's pretty straightforward using starts_with and contains.
library(dplyr)
df1 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
Let's try with a dataframe without "FORMULA_TRANSFORM" column
df3 <- df1
df3$FORMULA_TRANSFORM <- NULL
df3 %>%
select(-starts_with("FORMULA_"), contains("FORMULA_TRANSFORM"))
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
With the minus sign we remove the columns that start with "FORMULA_", and then we select the one that contains "FORMULA_TRANSFORM". Instead of contains we can also use one_of() or matches() and it would still work.
Using base R we can use grep with invert and value set as TRUE
df1[c(grep("^FORMULA_", names(df1), invert = TRUE, value = TRUE),
"FORMULA_TRANSFORM")]
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
This creates a vector of column names where column name doesn't start with "FORMULA_" and we add "FORMULA_TRANSFORM" manually later.
The above method assumes that the dataframe always has a "FORMULA_TRANSFORM" column, and it fails if there isn't one. A safer option would be
get_selected_cols <- function(df1) {
cbind(df1[grep("^FORMULA_", names(df1), invert = TRUE)],
df1[names(df1) == "FORMULA_TRANSFORM"])
}
get_selected_cols(df1)
# ID Name Gender Group FORMULA_TRANSFORM
#1 1 1 1 1 1
#2 2 2 2 2 2
get_selected_cols(df3)
# ID Name Gender Group
#1 1 1 1 1
#2 2 2 2 2
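Since the question mentions many data frames, one possible way to apply the same rule to all of them is to keep them in a list and use lapply(). This is only a sketch: keep_transform is a hypothetical helper name, and unlike the answers above it leaves FORMULA_TRANSFORM in its original position instead of moving it to the end.

```r
df1 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, Group = 1:2,
                  FORMULA_RULE = 1:2, FORMULA_TRANSFORM = 1:2,
                  FORMULA_UNITE = 1:2, FORMULA_CALCULATE = 1:2,
                  FORMULA_JOIN = 1:2)
df2 <- data.frame(ID = 1:2, Name = 1:2, Gender = 1:2, FORMULA_RULE = 1:2,
                  FORMULA_META = c(NA, NA), FORMULA_DATA = 1:2,
                  FORMULA_JOIN = 1:2, FORMULA_TRANSFORM = 1:2, Group = 1:2)

# Hypothetical helper: keep every column that does not start with "FORMULA_",
# plus FORMULA_TRANSFORM if it is present
keep_transform <- function(d) {
  keep <- !startsWith(names(d), "FORMULA_") | names(d) == "FORMULA_TRANSFORM"
  d[, keep, drop = FALSE]
}

# Apply the helper to every data frame in a named list
dfs <- list(df1 = df1, df2 = df2)
res <- lapply(dfs, keep_transform)
lapply(res, names)
```

Because the selection is a logical vector over the column names, a data frame without FORMULA_TRANSFORM simply returns its non-FORMULA_ columns and does not raise an error.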