R: separate a comma-delimited column into rows

I'm having trouble splitting a column into multiple rows, even after using the separate_rows function.
It gives me the following error:
Error: Can't subset columns that don't exist.
INPUT:
ID Colours Shapes
1 Red Triangle
1 Red Square
2 Green, Black Circle
2 Green, Black Triangle
3 Blue Square
3 Blue Oval
OUTPUT:
ID Colours Shapes
1 Red Triangle
1 Red Square
2 Green Circle
2 Green Triangle
2 Black Circle
2 Black Triangle
3 Blue Square
3 Blue Oval

I tried to use separate_rows with your data and I had no problems:
df <- data.frame(ID = c(1,1,2,2,3,3),
Colours = c("Red", "Red", "Green, Black", "Green, Black", "Blue", "Blue"),
Shapes = c("Triangle", "Square", "Circle", "Triangle", "Square", "Oval"))
library(tidyr)
df %>% separate_rows(Colours, sep = ", ")
#> # A tibble: 8 x 3
#> ID Colours Shapes
#> <dbl> <chr> <chr>
#> 1 1 Red Triangle
#> 2 1 Red Square
#> 3 2 Green Circle
#> 4 2 Black Circle
#> 5 2 Green Triangle
#> 6 2 Black Triangle
#> 7 3 Blue Square
#> 8 3 Blue Oval
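The error in the question may indicate that the column name passed to separate_rows does not match a name in the data frame, though the failing call is not shown, so this is only a guess. As a rough sketch of what the split does, here is a base R version that first checks the column exists. The object names `parts` and `out` are illustrative.

```r
# Same toy data as in the answer above
df <- data.frame(
  ID = c(1, 1, 2, 2, 3, 3),
  Colours = c("Red", "Red", "Green, Black", "Green, Black", "Blue", "Blue"),
  Shapes = c("Triangle", "Square", "Circle", "Triangle", "Square", "Oval"),
  stringsAsFactors = FALSE
)

# Guard against a misspelled or differently cased column name
stopifnot("Colours" %in% names(df))

# Split each cell on a comma followed by optional whitespace
parts <- strsplit(df$Colours, ",\\s*")

# Repeat each original row once per piece, then overwrite Colours
out <- df[rep(seq_len(nrow(df)), lengths(parts)), ]
out$Colours <- unlist(parts)
rownames(out) <- NULL
out
```

Printing `names(df)` before calling separate_rows is a quick way to spot stray spaces or a different spelling in the real data.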

Try this. You can use tidyverse functions to separate rows by comma, and the solution works for any number of comma-separated elements. First, reshape the data to long format with pivot_longer(), then split the values with separate_rows(). Because row ids are needed to keep track of the original rows, the code adds them before reshaping and then uses pivot_wider() to return to the wide layout. Finally, use fill() to complete the missing values and arrange() to give the desired order. Here is the code:
library(tidyverse)
#Code
newdf <- df %>% mutate(id=row_number()) %>%
pivot_longer(-c(ID,id)) %>%
separate_rows(value,sep=',') %>%
mutate(value=trimws(value)) %>%
group_by(id,name) %>% mutate(id2=row_number()) %>%
pivot_wider(names_from = name,values_from=value) %>%
fill(Shapes) %>% ungroup() %>% select(-c(id,id2)) %>%
arrange(ID,Colours)
Output:
# A tibble: 8 x 3
ID Colours Shapes
<int> <chr> <chr>
1 1 Red Triangle
2 1 Red Square
3 2 Black Circle
4 2 Black Triangle
5 2 Green Circle
6 2 Green Triangle
7 3 Blue Square
8 3 Blue Oval

Related

tidyr join an ID table with main table across multiple columns

This seems like a very basic operation, but my searches are not finding a simple solution.
As an example of what I am trying to do, consider the following two data frames from a database.
First an ID table that assigns an index to a color name:
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
ColorID
# A tibble: 4 x 2
ID Name
<int> <chr>
1 1 Red
2 2 Green
3 3 Blue
4 4 Black
Next some table that points to these color indexes (instead of storing text strings):
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
Widgets
# A tibble: 6 x 4
Front Back Top Bottom
<dbl> <dbl> <dbl> <dbl>
1 1 4 4 1
2 3 4 3 2
3 4 3 2 3
4 2 3 1 4
5 1 1 2 3
6 1 2 3 2
Now I just want to join the two tables to substitute the index values with the actual color names, so what I want is:
Joined <- tibble(Front = c("Red", "Blue", "Black", "Green", "Red","Red"),
Back = c("Black", "Black", "Blue","Blue", "Red", "Green"),
Top = c("Black","Blue", "Green", "Red", "Green", "Blue"),
Bottom = c("Red", "Green", "Blue", "Black", "Blue","Green"))
Joined
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
I've tried many iterations with no success. What I thought would work is something like:
J <- Widgets %>% inner_join(ColorID, by = c(. = "ID"))
I can tackle this column by column by using one variable at a time, e.g.
J <- Widgets %>% inner_join(ColorID, by = c("Front" = "ID"))
This doesn't replace "Front", but instead creates a new "Name" column. It seems like there has to be a simple solution to this, though. Thanks.
There is no need for join functions:
library(dplyr)
ColorID <- tibble(ID = c(1:4), Name = c("Red", "Green", "Blue", "Black"))
# reorder so that row number and ID are different
ColorID <- ColorID[c(2, 1, 4, 3), ]
Widgets <- tibble(Front = c(1,3,4,2,1,1), Back = c(4,4,3,3,1,2),
Top = c(4,3,2,1,2,3), Bottom = c(1,2,3,4,3,2))
check_id <- function(col){
ColorID$Name[match(col, ColorID$ID)]
}
Widgets %>%
mutate(across(everything(), check_id))
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green
(Edited) What I'm doing with dplyr and mutate is matching the numbers in each Widgets column against the ColorID$ID column. This gives me the row of the ColorID data frame from which to extract the name.
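To make the matching step concrete, here is a small base R sketch that prints the intermediate positions returned by match() for the Front column. It uses plain data frames instead of tibbles, and the variable names `front`, `pos` and `front_names` are illustrative.

```r
# Lookup table, reordered so that row number and ID differ
ColorID <- data.frame(ID = 1:4,
                      Name = c("Red", "Green", "Blue", "Black"),
                      stringsAsFactors = FALSE)
ColorID <- ColorID[c(2, 1, 4, 3), ]

front <- c(1, 3, 4, 2, 1, 1)

# match() returns, for each value, the row of ColorID whose ID equals it
pos <- match(front, ColorID$ID)
pos

# Indexing Name by those positions yields the color names
front_names <- ColorID$Name[pos]
front_names
```

Because the lookup goes through ID values, not row numbers, the result does not depend on the order of the rows in ColorID.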
Does this work:
library(dplyr)
library(tidyr)
Widgets %>% pivot_longer(everything()) %>%
inner_join(ColorID, by = c('value' = 'ID')) %>% select(-value) %>%
pivot_wider(names_from = name, values_from = Name) %>% unnest(everything())
# A tibble: 6 x 4
Front Back Top Bottom
<chr> <chr> <chr> <chr>
1 Red Black Black Red
2 Blue Black Blue Green
3 Black Blue Green Blue
4 Green Blue Red Black
5 Red Red Green Blue
6 Red Green Blue Green

How to conditionally subset a list in R based on range in another row

This question relates somehow to another question I asked 14 days ago.
How to conditional subset a list in R based on range in another column
The difference here, is that I need to subset rows, instead of columns, and I cannot make that work.
I have imported more than 100 equal .xls files with 10 sheets each into a list in R. I am now trying to get the information out that I need. The data in the files are highly unstructured.
I have created some toy data to show what I want.
list3 <- list(data.frame(depth = c(NA,NA,NA,1,2,3,4,5),
col1 = c(NA,NA,"black",NA,"x",NA,NA,NA),
col2 = c(NA,NA,"blue",NA,NA,"x",NA,NA),
col3 = c(NA,NA,"white","x",NA,NA,NA,NA),
col4 = c(NA,NA,"grey",NA,NA,NA,"x",NA),
col5 = c(NA,NA,"yellow",NA,NA,NA,NA,"x")))
list4 <- list(data.frame(depth = c(NA,NA,NA,1,2,3,4,5),
col1 = c(NA,NA,"black",NA,NA,"x",NA,NA),
col2 = c(NA,NA,"blue",NA,NA,NA,"x",NA),
col3 = c(NA,NA,"white","x",NA,NA,NA,NA),
col4 = c(NA,NA,"grey",NA,"x",NA,NA,NA),
col5 = c(NA,NA,"yellow",NA,NA,NA,NA,"x")))
list5 <- list(data.frame(depth = c(NA,NA,NA,1,2,3,4,5),
col1 = c(NA,NA,"black",NA,"x","x",NA,NA),
col2 = c(NA,NA,"blue",NA,NA,NA,"x",NA),
col3 = c(NA,NA,"white","x",NA,NA,NA,NA),
col4 = c(NA,NA,"grey",NA,NA,NA,NA,NA),
col5 = c(NA,NA,"yellow",NA,NA,NA,NA,"x")))
my_list <- list(list3,list4,list5)
desired_result <- data.frame(depth = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
color = c("white","black","blue","grey","yellow",
"white","grey","black","blue","yellow",
"white","black","black","blue","yellow"))
As I mentioned in my previous question, the data are highly unstructured and I therefore need a solution based on subsetting a range.
I need to iterate over my list. I have done that with purrr::map with success so far, but this one I can't seem to figure out.
I need to link the color found at each depth in all my files. The result doesn't need to be in a data frame; a vector for each depth is fine.
I hope for a purrr solution, but anything is gratefully accepted.
Additional requirement given in comments
Your my_list actually has no names, so try this syntax:
library(purrr)
library(dplyr)
library(tidyr)
library(janitor)
imap_dfr(my_list, ~(.x[[1]] %>% mutate(across(starts_with("col"), ~ifelse(. == "x", depth, .))) %>%
select(-depth) %>% row_to_names(3) %>% ungroup() %>%
pivot_longer(everything(), names_to = "color", values_to = "depth", values_drop_na = TRUE) %>%
mutate(list_name = .y)))
# A tibble: 15 x 3
color depth list_name
<chr> <chr> <int>
1 white 1 1
2 black 2 1
3 blue 3 1
4 grey 4 1
5 yellow 5 1
6 white 1 2
7 grey 2 2
8 black 3 2
9 blue 4 2
10 yellow 5 2
11 white 1 3
12 black 2 3
13 black 3 3
14 blue 4 3
15 yellow 5 3
If the list has names, the output will contain them; otherwise it will contain the index numbers of the list. Using imap_dfr is recommended. The assumption here is that the third row contains the color names.
Try this:
library(purrr)
library(dplyr)
my_fun <-function(x){
depth <- x %>% summarise(across(.cols = starts_with("col"),.fns=~depth[which(.=="x")])) %>%
as.numeric()
color <- select(x,starts_with("col"))[3,] %>% as.character(.)
data.frame(depth,color) %>% arrange(depth)
}
map(my_list,function(l)do.call("rbind",map(l,my_fun))) %>% do.call("rbind",.)
Output:
# depth color
# 1 1 white
# 2 2 black
# 3 3 blue
# 4 4 grey
# 5 5 yellow
# 6 1 white
# 7 2 grey
# 8 3 black
# 9 4 blue
# 10 5 yellow
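The output shown above has ten rows, whereas the question's desired result has fifteen. Assuming a column can hold more than one "x" (as col1 does in list5) or none at all (as col4 does), a base R sketch that tolerates both cases might look like this. The helper name `extract_colors` and the single toy data frame are illustrative, and the sketch assumes, as the answers do, that row 3 holds the color names.

```r
# One toy sheet shaped like list5 in the question
sheet <- data.frame(
  depth = c(NA, NA, NA, 1, 2, 3, 4, 5),
  col1 = c(NA, NA, "black", NA, "x", "x", NA, NA),
  col2 = c(NA, NA, "blue", NA, NA, NA, "x", NA),
  col3 = c(NA, NA, "white", "x", NA, NA, NA, NA),
  col4 = c(NA, NA, "grey", NA, NA, NA, NA, NA),
  col5 = c(NA, NA, "yellow", NA, NA, NA, NA, "x"),
  stringsAsFactors = FALSE
)

extract_colors <- function(df, name_row = 3) {
  cols <- grep("^col", names(df), value = TRUE)
  # Color names live in a fixed row (assumption: row 3)
  colors <- unlist(df[name_row, cols], use.names = FALSE)
  # Row/column positions of every "x"; NA cells are ignored by which()
  hits <- which(as.matrix(df[cols]) == "x", arr.ind = TRUE)
  out <- data.frame(depth = df$depth[hits[, "row"]],
                    color = colors[hits[, "col"]],
                    stringsAsFactors = FALSE)
  out <- out[order(out$depth), ]
  rownames(out) <- NULL
  out
}

result <- extract_colors(sheet)
result
```

For the nested list in the question, the same helper could be applied with something like `do.call(rbind, lapply(my_list, function(l) extract_colors(l[[1]])))`.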

Assign unique ID based on values in EITHER of two columns

This is not a duplicate of this question. Please read questions entirely before labeling duplicates.
I have a data.frame like so:
library(tidyverse)
tibble(
color = c("blue", "blue", "red", "green", "purple"),
shape = c("triangle", "square", "circle", "hexagon", "hexagon")
)
color shape
<chr> <chr>
1 blue triangle
2 blue square
3 red circle
4 green hexagon
5 purple hexagon
I'd like to add a group_id column like this:
color shape group_id
<chr> <chr> <dbl>
1 blue triangle 1
2 blue square 1
3 red circle 2
4 green hexagon 3
5 purple hexagon 3
The difficulty is that I want to group by unique values of color or shape. I suspect the solution might be to use list-columns, but I can't figure out how.
We can use duplicated in base R
df1$group_id <- cumsum(!Reduce(`|`, lapply(df1, duplicated)))
Output:
df1
# A tibble: 5 x 3
# color shape group_id
# <chr> <chr> <int>
#1 blue triangle 1
#2 blue square 1
#3 red circle 2
#4 green hexagon 3
#5 purple hexagon 3
Or using tidyverse
library(dplyr)
library(purrr)
df1 %>%
mutate(group_id = map(., duplicated) %>%
reduce(`|`) %>%
`!` %>%
cumsum)
Data:
df1 <- structure(list(color = c("blue", "blue", "red", "green", "purple"
), shape = c("triangle", "square", "circle", "hexagon", "hexagon"
)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
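To see how the duplicated-based answers above arrive at the ids, the intermediate vectors can be inspected one step at a time. This is only a sketch using a plain data frame, and the intermediate names (`dup_color`, `dup_shape`, `starts_new_group`) are illustrative.

```r
df1 <- data.frame(
  color = c("blue", "blue", "red", "green", "purple"),
  shape = c("triangle", "square", "circle", "hexagon", "hexagon"),
  stringsAsFactors = FALSE
)

# TRUE where a value has already appeared earlier in the same column
dup_color <- duplicated(df1$color)
dup_shape <- duplicated(df1$shape)

# A row starts a new group only if neither of its values was seen before
starts_new_group <- !(dup_color | dup_shape)

# Counting the group starts cumulatively gives the group id
group_id <- cumsum(starts_new_group)
group_id
```

One caveat, which is my reading and not something the answers state: the running count assumes that rows of the same group sit next to each other and that no later row links two groups that were already numbered separately.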

R: create a data frame containing unique different elements across different columns of another data frame

I think it's more intuitive to show what I want to obtain with the following example.
Basically, for each level I want to get a list of all the elements contained in the other 3 columns without repetition.
group1 group2 group3 Level
cat cat dog 1
dog parrot cat 1
mouse dolphin dolphin 1
red blue blue 2
green yellow green 2
black purple cat 2
result I want to obtain:
var1 level
cat 1
dog 1
mouse 1
dolphin 1
parrot 1
red 2
blue 2
green 2
purple 2
cat 2
black 2
One option is to pivot into 'long' format and then get the distinct rows:
library(tidyr)
library(dplyr)
df1 %>%
pivot_longer(cols = -Level, values_to = 'var1') %>%
distinct(Level, var1) %>%
select(var1, level = Level)
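For comparison, the same idea can be sketched in base R on the question's data: stack the value columns into one long column, carry the level along, and drop repeated rows. The names `value_cols`, `long` and `result` are illustrative, and the row order differs from the pivot_longer version because the columns are stacked one after another.

```r
df1 <- data.frame(
  group1 = c("cat", "dog", "mouse", "red", "green", "black"),
  group2 = c("cat", "parrot", "dolphin", "blue", "yellow", "purple"),
  group3 = c("dog", "cat", "dolphin", "blue", "green", "cat"),
  Level = c(1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

value_cols <- setdiff(names(df1), "Level")

# Stack the value columns and repeat Level once per stacked column
long <- data.frame(
  var1 = unlist(df1[value_cols], use.names = FALSE),
  level = rep(df1$Level, times = length(value_cols)),
  stringsAsFactors = FALSE
)

# Keep each (var1, level) pair once
result <- unique(long)
rownames(result) <- NULL
result
```

Note that "yellow" is present in the input at level 2, so it appears in this result even though the desired output in the question does not list it.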

How to add a logical AND to a combination of logical ORs using filter and str_detect?

I have the following example dataframe "df" with the variable "Text" containing text:
df:
Text
1 I like blue shoes.
2 Black is great!
3 Pink and grey books.
4 I don't like grey trousers.
5 Yellow is my favorite colour
6 No more green!
7 Cars are red.
8 I have a pink bike
I use the following code to filter every case which contains at least one of the listed words, which works perfectly fine:
library(tidyverse)
library(igraph)
library(stringi)
library(stringr)
filter <- c("blue","green","yellow","red")
df2 <-
df %>%
filter(str_detect(tolower(df$Text), paste(filter, collapse = "|")))
df2:
Text
1 I like blue shoes.
5 Yellow is my favorite colour
6 No more green!
7 Cars are red.
As an additional condition, I now want to add the combination of "pink" and "grey", filtering for at least one of the listed words above OR the combination. The dataframe I want to have looks like that:
df2:
Text
1 I like blue shoes.
3 Pink and grey books.
5 Yellow is my favorite colour
6 No more green!
7 Cars are red.
Do you have any idea how I can get there?
Thanks in advance!
You can use the & operator to combine filter conditions (there is also the | operator for OR).
> f1
[1] "blue" "green" "yellow" "red"
> f2
[1] "pink" "grey"
> df
# A tibble: 4 x 2
Text1 Text2
<chr> <chr>
1 Yellow This
2 red That
3 Purple grey The
4 green pink other
> filter(df, str_detect(Text1, paste0(f1, collapse = "|")))
# A tibble: 2 x 2
Text1 Text2
<chr> <chr>
1 red That
2 green pink other
> filter(df,
str_detect(Text1, paste0(f1, collapse = "|")) &
str_detect(Text1, paste0(f2, collapse = "|")))
# A tibble: 1 x 2
Text1 Text2
<chr> <chr>
1 green pink other
Note that the second call requires both conditions to hold.
EDIT ADDRESSING THE COMMENT
> filter(df,
str_detect(Text1, paste0(f1, collapse = "|")) |
(str_detect(Text1, "pink") & str_detect(Text1, "grey")))
You can still use & or | operators together with brackets to get the logical combinations you want.
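Applied to the data frame from the question, the combined condition could be sketched as follows. This version uses base R's grepl() in place of str_detect() so that it runs without extra packages. The names `any_words`, `both_words`, `has_any` and `has_all` are illustrative.

```r
df <- data.frame(
  Text = c("I like blue shoes.", "Black is great!", "Pink and grey books.",
           "I don't like grey trousers.", "Yellow is my favorite colour",
           "No more green!", "Cars are red.", "I have a pink bike"),
  stringsAsFactors = FALSE
)

any_words <- c("blue", "green", "yellow", "red")  # at least one must occur
both_words <- c("pink", "grey")                   # all of these must occur

text <- tolower(df$Text)

# OR: a single regex alternation over the listed words
has_any <- grepl(paste(any_words, collapse = "|"), text)

# AND: one test per word, combined element-wise with &
has_all <- Reduce(`&`, lapply(both_words, function(w) grepl(w, text, fixed = TRUE)))

df2 <- df[has_any | has_all, , drop = FALSE]
df2
```

These are plain substring matches, so "red" would also match inside a longer word. Wrapping each word in `\\b` is one way to require whole words if that matters for the real data.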
