Split columns with delimiters and multiple categories in R

I have a messy table with a single column that contains multiple category labels, separated by several different delimiters. I want to use R to split that column at each delimiter and create a new column for each category label. The methods I have seen can only split at one delimiter at a time.
My current table looks like this:
my_table = read.csv("./my_table.csv")
# > my_table
# ID TYPE TEXT
# 1 1 a blue water
# 2 2 a,b,c fresh water
# 3 3 a;b,f cold stream
# 4 4 f, b and c lovely sunset
# 5 5 b;c up there
I want a table that looks like this:
# ID A B C D TEXT
# 1 1 a blue water
# 2 2 a b c fresh water
# 3 3 a b d cold stream
# 4 4 b c d lovely sunset
# 5 5 b c up there
Here is what I have tried:
library(dplyr)
library(tidyr)
my_table1 <- my_table %>%
  separate(TYPE, c('A', 'B'), ",")
my_table1
# > my_table1
# ID A B TEXT
# 1 1 a <NA> blue water
# 2 2 a b fresh water
# 3 3 a;b f cold stream
# 4 4 f b and c lovely sunset
# 5 5 b;c <NA> up there
my_table2 <- my_table1 %>%
  separate(A, c('A', 'C'), ";")
# > my_table2
# ID A C B TEXT
# 1 1 a <NA> <NA> blue water
# 2 2 a <NA> b fresh water
# 3 3 a b f cold stream
# 4 4 f <NA> b and c lovely sunset
# 5 5 b c <NA> up there
my_table3 <- my_table2 %>%
  separate(A, c('A', 'D'), "and")
# > my_table3
# ID A D C B TEXT
# 1 1 a <NA> <NA> <NA> blue water
# 2 2 a <NA> <NA> b fresh water
# 3 3 a <NA> b f cold stream
# 4 4 f <NA> <NA> b and c lovely sunset
# 5 5 b <NA> c <NA> up there
This gets me close, but the column names are off. Plus, I don't want to have to guess about where the string "b and c" ends up after a couple iterations. I have thousands of rows and maybe five or six categories. My guess is that there is a simpler way to do this.

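As an alternative, and to extend your tidyverse attempt, here is a solution using strsplit and unnest:
library(dplyr)
library(tidyr)
df %>%
  mutate(
    # split on ";", on "," (plus trailing spaces), or on the word "and"
    val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
  unnest(val) %>%   # current tidyr requires the column to be named
  select(-TYPE) %>%
  group_by(ID, TEXT) %>%
  mutate(n = 1:n()) %>%
  spread(n, val)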
## A tibble: 5 x 5
## Groups: ID, TEXT [5]
# ID TEXT `1` `2` `3`
# <int> <fct> <chr> <chr> <chr>
#1 1 blue water a NA NA
#2 2 fresh water a b c
#3 3 cold stream a b f
#4 4 lovely sunset f b c
#5 5 up there b c NA
Note that this is not exactly the same as your expected output. It does, however, match @MKR's output.
Sample data
df <- read.table(text =
"ID TYPE TEXT
1 1 'a' 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'")
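To see what the regular expression in the strsplit call does on its own, here is a minimal base R sketch. The vector `type` simply copies the TYPE values from the sample data, and the NA padding into a matrix is only an illustration of mine, not part of the answer above.

```r
# TYPE values copied from the sample data
type <- c("a", "a,b,c", "a;b,f", "f, b and c", "b;c")

# One pattern covers all three delimiters: ";", "," (plus trailing spaces),
# and the word "and" with optional surrounding spaces
parts <- strsplit(type, "(;|,\\s*|\\s*and\\s*)")

# Pad every element with NA so the list can become a rectangular matrix
n <- max(lengths(parts))
mat <- t(sapply(parts, function(p) c(p, rep(NA_character_, n - length(p)))))
mat
```

One caveat: the pattern matches "and" anywhere, so a label that itself contains "and" would be split too. With single-letter labels as in the sample this does not matter.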

The cSplit function from the splitstackshape package can make this problem easier to solve. One approach:
library(splitstackshape)
# First use `gsub` to replace other delimiter and have only ',' delimiter.
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)
Mod_df <- cSplit(my_table, "TYPE", sep = ",")
Mod_df
# ID TEXT TYPE_1 TYPE_2 TYPE_3
# 1: 1 blue water a NA NA
# 2: 2 fresh water a b c
# 3: 3 cold stream a b f
# 4: 4 lovely sunset f b c
# 5: 5 up there b c NA
tidyr::gather and tidyr::spread can then be used to get the format the OP asked for:
library(tidyr)
library(dplyr)
gather(Mod_df, key, value, -ID,-TEXT) %>% mutate_if(is.factor, as.character) %>%
mutate(K = toupper(value)) %>%
select(-key) %>%
filter(!is.na(K)) %>%
spread(K, value)
# ID TEXT A B C F
# 1 1 blue water a <NA> <NA> <NA>
# 2 2 fresh water a b c <NA>
# 3 3 cold stream a b <NA> f
# 4 4 lovely sunset <NA> b c f
# 5 5 up there <NA> b c <NA>
Data
my_table <- read.table(text =
" ID TYPE TEXT
1 1 a 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'",
header = TRUE, stringsAsFactors = FALSE)
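If extra packages are not an option, the same one-column-per-label layout can probably be reached in base R. The following is a rough sketch under my own assumptions: the data frame is rebuilt inline, the regex uses `\\band\\b` so that only the whole word "and" counts as a delimiter, and the names `parts`, `labels`, `wide` and `result` are just illustrative.

```r
my_table <- data.frame(
  ID = 1:5,
  TYPE = c("a", "a,b,c", "a;b,f", "f, b and c", "b;c"),
  TEXT = c("blue water", "fresh water", "cold stream", "lovely sunset", "up there"),
  stringsAsFactors = FALSE
)

# Split on every delimiter at once, trimming spaces around them
parts <- strsplit(my_table$TYPE, "\\s*(,|;|\\band\\b)\\s*", perl = TRUE)

# Every distinct label becomes one column
labels <- sort(unique(unlist(parts)))

# For each label, keep it where the row contains it, otherwise NA
wide <- sapply(labels, function(lab) {
  present <- vapply(parts, function(p) lab %in% p, logical(1))
  ifelse(present, lab, NA_character_)
})
colnames(wide) <- toupper(labels)

result <- cbind(my_table["ID"],
                as.data.frame(wide, stringsAsFactors = FALSE),
                my_table["TEXT"])
result
```

Because the columns are derived from the labels that actually occur, a new category in the data shows up as a new column without any change to the code.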

Related

Sort/arrange within group for only chosen groups

I would like to sort/arrange data by group. That's easy enough. However, I only want to sort values within specific groups, not all groups.
I found one possible instance of a similar question at the link. But I found it to be confusing due to the framing of the question by the OP.
Arrange values within a specific group
Sample data:
df <- data.frame(var = c("apple", "banana", "eggplant", "carrot", "dill", "fava", "garlic"),
grp = c("A", "A", "B", "B", "B", "C", "C"),
val = c(4, 2, 1, 3, 7, 6, 2))
df
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 fava C 6
# 7 garlic C 2
Desired output:
# var grp val
# 1 apple A 4
# 2 banana A 2
# 3 eggplant B 1
# 4 carrot B 3
# 5 dill B 7
# 6 garlic C 2
# 7 fava C 6
Partial solution:
library(dplyr)
df %>%
group_by(grp) %>%
arrange(val, .by_group = TRUE)
This of course sorts for all groups. How do I get it to sort for only the groups I would like sorted, which are "B" and "C"? I would like a tidyverse solution but feel free to post a base solution as well.
We can flip the sign of the 'val' elements that belong to group "A", so that this group is ordered in the opposite direction from the 'val' elements in the other groups
library(dplyr)
df %>%
arrange(grp, val * c(1, -1)[(grp == 'A') + 1])
-output
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or, if the values for 'A' should be kept in their original order, multiply by 0 so that every 'A' value is tied and the existing row order is preserved
df %>%
arrange(grp, val * c(1, 0)[(grp == 'A') + 1])
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
NOTE: This is done without any group_by attribute
If we want to use the OP's way, i.e. using group_by
df %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ -1 * val, TRUE ~ val),
.by_group = TRUE) %>%
ungroup
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
If the 'val' values for grp 'A' only happen to be in descending order by coincidence, create a row-number column before grouping and use it as the sort key for 'A'
df %>%
mutate(rn = row_number()) %>%
group_by(grp) %>%
arrange(case_when(grp == 'A' ~ as.numeric(rn), TRUE ~ val),
.by_group = TRUE) %>%
ungroup %>%
dplyr::select(-rn)
-output
# A tibble: 7 x 3
var grp val
<chr> <chr> <dbl>
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
6 garlic C 2
7 fava C 6
Or using base R
df[with(df, order(grp, c(1, 0)[(grp == 'A') + 1] * val)),]
var grp val
1 apple A 4
2 banana A 2
3 eggplant B 1
4 carrot B 3
5 dill B 7
7 garlic C 2
6 fava C 6
You can filter the groups you want to arrange, sort them and bind to the remaining data.
library(dplyr)
order_groups <- c('B', 'C')
df %>%
filter(grp %in% order_groups) %>%
arrange(grp, val) %>%
bind_rows(df %>%
filter(!grp %in% order_groups)) %>%
arrange(grp)
#  var grp val
#1 apple A 4
#2 banana A 2
#3 eggplant B 1
#4 carrot B 3
#5 dill B 7
#6 garlic C 2
#7 fava C 6
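The row-number idea can also be written as a small base R helper. This is only a sketch: `sort_within` is a hypothetical name of mine, and it assumes the sort column is numeric. Groups listed in `groups` are ordered by `val`; all other groups fall back to their original row position.

```r
df <- data.frame(var = c("apple", "banana", "eggplant", "carrot", "dill", "fava", "garlic"),
                 grp = c("A", "A", "B", "B", "B", "C", "C"),
                 val = c(4, 2, 1, 3, 7, 6, 2),
                 stringsAsFactors = FALSE)

# Hypothetical helper: sort by val only inside the chosen groups
sort_within <- function(data, groups) {
  # Sort key: val for chosen groups, original row index for the rest.
  # grp is the primary key, so the two kinds of key never meet within one group.
  key <- ifelse(data$grp %in% groups, data$val, seq_len(nrow(data)))
  data[order(data$grp, key), ]
}

sorted <- sort_within(df, c("B", "C"))
sorted
```

Passing a different vector, for example `sort_within(df, "C")`, would leave both "A" and "B" untouched.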

Transforming an R Dataframe with 2 columns and delimiter in rows

I have a dataframe that has two columns "id" and "detail" (df_current below). I need to group the dataframe by id, and spread the file so that the columns become "Interface1", "Interface2", etc. and the contents under the interface columns are the immediate values under each time the interface value appears. Essentially the "!" is working as a separator, but it is not needed in the output.
The desired output is shown below as: "df_needed_from_current".
I have tried multiple approaches (group_by, spread, reshape, dcast etc.), but can't get it to work. Any help would be greatly appreciated!
Sample Current Dataframe (code to create under):
id  detail
1   !
1   Interface1
1   a
1   b
1   !
1   Interface2
1   a
1   b
2   !
2   Interface1
2   a
2   b
2   c
2   !
2   Interface2
2   a
3   !
3   Interface1
3   a
3   b
3   c
3   d
df_current <- data.frame(
id = c("1","1","1","1","1","1","1","1","2",
"2","2","2","2","2","2","2","3","3",
"3","3","3","3","4","4","4","4","4",
"4","4","4","4","4","4","4","4","4",
"5","5","5","5","5","5","5","5","5",
"5","5","5","5"),
detail = c("!", "Interface1","a","b","!",
"Interface2","a","b","!","Interface1",
"a","b","c","!","Interface2","a",
"!", "Interface1","a","b","c","d",
"!", "Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b","c","!","Interface1","a","b","!",
"Interface2","a","b","c","!","Interface3",
"a","b"))
Dataframe Needed (code to create under):
ID  Interface1  Interface2  Interface3
1   a           a           NA
1   b           b           NA
2   a           a           NA
2   b           NA          NA
2   c           NA          NA
3   a           NA          NA
3   b           NA          NA
3   c           NA          NA
3   d           NA          NA
df_needed_from_current <- data.frame(
id = c("1","1","2","2","2","3","3","3","3","4","4","4","5","5","5"),
Interface1 = c("a","b","a","b","c","a","b","c","d","a","b","NA","a","b","NA"),
Interface2 = c("a","b","a","NA","NA","NA","NA","NA","NA","a","b","c","a","b","c"),
Interface3 = c("NA","NA","NA","NA","NA","NA","NA","NA","NA","a","b","c","a","b","NA")
)
We first remove the rows where 'detail' is "!". Then we create a new column 'interface' that holds only the 'detail' values containing 'Interface', and use fill from tidyr to replace each NA with the previous non-NA value within each id. Next we drop the rows where 'detail' equals 'interface' (the header rows themselves), create a row sequence id with rowid (from data.table), and reshape to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
df_current %>%
filter(detail != "!") %>%
mutate(interface = case_when(str_detect(detail, 'Interface') ~ detail)) %>%
group_by(id) %>%
fill(interface) %>%
ungroup %>%
filter(detail != interface) %>%
mutate(rn = rowid(id, interface)) %>%
pivot_wider(names_from = interface, values_from = detail) %>%
select(-rn)
# A tibble: 15 x 4
# id Interface1 Interface2 Interface3
# <chr> <chr> <chr> <chr>
# 1 1 a a <NA>
# 2 1 b b <NA>
# 3 2 a a <NA>
# 4 2 b <NA> <NA>
# 5 2 c <NA> <NA>
# 6 3 a <NA> <NA>
# 7 3 b <NA> <NA>
# 8 3 c <NA> <NA>
# 9 3 d <NA> <NA>
#10 4 a a a
#11 4 b b b
#12 4 <NA> c c
#13 5 a a a
#14 5 b b b
#15 5 <NA> c <NA>
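For comparison, the same steps can be sketched in base R only. This is an illustration under assumptions, not a drop-in replacement: `df_small` is a cut-down copy of the first two ids, and the carry-forward relies on every block starting with its "Interface" row, as it does in the sample.

```r
# Cut-down copy of df_current (ids 1 and 2 only)
df_small <- data.frame(
  id = c(rep("1", 8), rep("2", 8)),
  detail = c("!", "Interface1", "a", "b", "!", "Interface2", "a", "b",
             "!", "Interface1", "a", "b", "c", "!", "Interface2", "a"),
  stringsAsFactors = FALSE
)

# 1. Drop the separator rows
d <- df_small[df_small$detail != "!", ]

# 2. Carry each "Interface" header forward onto the rows below it
is_header <- grepl("^Interface", d$detail)
d$interface <- d$detail[is_header][cumsum(is_header)]

# 3. Drop the header rows and number the values within id and interface
d <- d[!is_header, ]
d$rn <- ave(seq_along(d$detail), d$id, d$interface, FUN = seq_along)

# 4. Long to wide: one column per interface
wide <- reshape(d, idvar = c("id", "rn"), timevar = "interface",
                v.names = "detail", direction = "wide")
rownames(wide) <- NULL
wide
```

reshape prefixes the new columns with the value column name, so they come out as detail.Interface1 and detail.Interface2 and may need renaming afterwards.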

R slide window through tibble

I have a simple question that I cannot find a solution for.
I also didn't find an existing answer that I understand.
Imagine I got this data frame
library(tibble)
(ts <- tibble(
  a = LETTERS[1:10],
  b = c(rep(1, 5), rep(2, 5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple: a data frame in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't care if the column a contains an array (nest values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI-PostgreSQL.
Any help is appreciated.
Thanks in advance
We can group by 'b', create the new column from the lead of 'a', and then remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
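Both versions above cover a window of size 2. For a general window size n, a base R sketch could look like the following. `slide_windows` is a hypothetical helper of mine, the data is rebuilt as a plain data frame called `ts_df`, and everything runs in R memory, so it says nothing about what would translate to PostgreSQL.

```r
# Hypothetical helper: all windows of length n over x, pasted into one string each
slide_windows <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

ts_df <- data.frame(a = LETTERS[1:10],
                    b = rep(c(1, 2), each = 5),
                    stringsAsFactors = FALSE)

# Windows are built per value of b, so no window crosses a group boundary
windows <- lapply(split(ts_df$a, ts_df$b), slide_windows, n = 2)

result <- data.frame(b = rep(as.numeric(names(windows)), lengths(windows)),
                     a = unlist(windows, use.names = FALSE),
                     stringsAsFactors = FALSE)
result
```

Changing `n = 2` to `n = 3` gives three windows per group ("A B C", "B C D", "C D E", and so on).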

Matching values from multiple columns in 1 data frame to key in second data frame and creating columns

I have 2 data frames. One (df1) looks like this:
var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA
And the other (df2) looks like this:
var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z
...
where every value that appears in var.1 to var.4 of df1 is listed in var.a of df2.
I want to match var.a from df2 across all of the columns listed in df1 and then add these columns to df1 with new/combined column names. So for instance it'll look like this:
var.1 var1.b var1.c var1.d ... var.4 var4.b var4.c var4.d
1 7 j k z 2 f g h
2 4 j k l 7 j k z
3 2 f g h NA NA NA NA
Thanks in advance!
Here's a tidyverse solution. First, I define the data frames.
df1 <- read.table(text = " var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA", header = TRUE)
df2 <- read.table(text = " var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z", header=TRUE)
Then, I load the libraries.
# Load libraries
library(tidyr)
library(dplyr)
library(tibble)
Finally, I restructure the data.
# Manipulate data
df1 %>%
rownames_to_column() %>%
gather(variable, value, -rowname) %>%
left_join(df2, by = c("value" = "var.a")) %>%
gather(foo, bar, -variable, -rowname) %>%
unite(goop, variable, foo) %>%
spread(goop, bar) %>%
select(-rowname)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
which gives,
#> var.1_value var.1_var.b var.1_var.c var.1_var.d var.2_value var.2_var.b
#> 1 7 j k z 9 <NA>
#> 2 4 j k l 6 <NA>
#> 3 2 f g h <NA> <NA>
#> var.2_var.c var.2_var.d var.3_value var.3_var.b var.3_var.c var.3_var.d
#> 1 <NA> <NA> 1 b c d
#> 2 <NA> <NA> 9 <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> var.4_value var.4_var.b var.4_var.c var.4_var.d
#> 1 2 f g h
#> 2 7 j k z
#> 3 <NA> <NA> <NA> <NA>
Created on 2019-05-30 by the reprex package (v0.3.0)
This is a little bit convoluted, but I'll try to explain.
I turn row numbers into a column at first, as this will help me put the data back together at the very end.
I go from wide to long format for df1.
I join df2 to df1 based on var.a and var.1 (now called value), respectively.
I go from wide to long again.
I combine the variable names from each data frame into one variable.
Finally, I go from long to wide format (this is where the row numbers come in handy) and drop the row numbers.
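If the double reshape feels heavy, the lookup can probably also be done column by column with match in base R. This is a sketch under assumptions: the data frames are rebuilt inline, `lookup` is a hypothetical helper, and the new column names use the same "_" separator as the output above rather than the names asked for in the question.

```r
df1 <- data.frame(var.1 = c(7, 4, 2), var.2 = c(9, 6, NA),
                  var.3 = c(1, 9, NA), var.4 = c(2, 7, NA))
df2 <- data.frame(var.a = c(1, 2, 4, 7),
                  var.b = c("b", "f", "j", "j"),
                  var.c = c("c", "g", "k", "k"),
                  var.d = c("d", "h", "l", "z"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: look up one column of df1 in df2$var.a
lookup <- function(col) {
  # match() returns NA for values missing from var.a, which yields all-NA rows
  hit <- df2[match(df1[[col]], df2$var.a), -1, drop = FALSE]
  names(hit) <- paste(col, names(hit), sep = "_")
  rownames(hit) <- NULL
  cbind(df1[col], hit)
}

# Repeat for every column of df1 and glue the pieces side by side
out <- do.call(cbind, lapply(names(df1), lookup))
out
```

The result keeps the rows of df1 in place, so no row-number column is needed to put the data back together.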

Fill NA with character in list

I have some data as follows:
library(tidyr)
library(data.table)
thisdata <- data.frame(numbers = c(1,3,4,5,6,1,2,4,5,6)
,letters = c('A','A','A','A','A','B','B','B','B','B'))
otherdata <- data.frame(numbers = c(1,2,3,4,5,6))
I am looking to split 'thisdata' by the letters column, merge the two lists to 'otherdata' by the numbers column, then fill letters NA with the corresponding letter in that list. So:
out <- split(thisdata , f = thisdata$letters )
out2 <- lapply(out, function(x) merge(x,otherdata,by="numbers",all = TRUE))
However, I can't get the 'fill' function in tidyr to work within the lapply
out3 <- lapply(out2,function(x) fill(x$channel))
Error in UseMethod("fill_") :
no applicable method for 'fill_' applied to an object of class "NULL"
This is the output I'm after, but would rather perform the calculation within the list format:
out4 <- rbindlist(out2)
out5 <- out4 %>%
fill(letters) %>% #default direction down
fill(letters,.direction = "up")
numbers letters
1: 1 A
2: 2 A
3: 3 A
4: 4 A
5: 5 A
6: 6 A
7: 1 B
8: 2 B
9: 3 B
10: 4 B
11: 5 B
12: 6 B
fill expects a data frame as first parameter, try fill(x, letters) or x %>% fill(letters) with magrittr pipe:
out3 <- lapply(out2,function(x) fill(x, letters))
out3
#$A
# numbers letters
#1 1 A
#2 2 A
#3 3 A
#4 4 A
#5 5 A
#6 6 A
#$B
# numbers letters
#1 1 B
#2 2 B
#3 3 B
#4 4 B
#5 5 B
#6 6 B
A simpler method is to use tidyr::complete (with dplyr loaded for %>% and arrange):
thisdata %>%
complete(numbers = otherdata$numbers, letters) %>%
arrange(letters)
# A tibble: 12 x 2
# numbers letters
# <dbl> <fctr>
# 1 1 A
# 2 2 A
# 3 3 A
# 4 4 A
# 5 5 A
# 6 6 A
# 7 1 B
# 8 2 B
# 9 3 B
#10 4 B
#11 5 B
#12 6 B
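One thing to keep in mind with fill(x, letters) is that it fills downward by default, so a group whose first row is missing would keep a leading NA. Since split already names each list element after its letter, a base R sketch that stays in list format and does not depend on fill direction could be:

```r
thisdata <- data.frame(numbers = c(1, 3, 4, 5, 6, 1, 2, 4, 5, 6),
                       letters = rep(c("A", "B"), each = 5),
                       stringsAsFactors = FALSE)
otherdata <- data.frame(numbers = c(1, 2, 3, 4, 5, 6))

out <- split(thisdata, f = thisdata$letters)
out2 <- lapply(out, function(x) merge(x, otherdata, by = "numbers", all = TRUE))

# Each list element is named after its letter, so write that name
# into the whole letters column instead of filling NAs
out3 <- Map(function(x, nm) {
  x$letters <- nm
  x
}, out2, names(out2))
out3
```

This assumes the list names are exactly the values you want in the column, which holds here because the list came from split on that same column.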

Resources