R data frame rearrangement - r

I have an R data frame (actually an Excel sheet which I have read into R) in the format below:
ID Text
1 This is a red
car. Its electric
and has 4 wheels.
2 This is a van with
six wheels.
I want to reshape it into the following format
ID Text
1 This is a red car. Its electric and has 4 wheels.
2 This is a van with six wheels
Essentially between the two ID numbers my text has been broken into multiple lines. I want to combine it to look like the output above.
Using group_by on the numeric ID did not work, as it drops the lines without an ID.
Any thoughts on how I can achieve this type of output?
Thanks!

Here is one option with tidyverse. Convert the blanks ("") in 'ID' to NA (na_if), use fill from tidyr to replace each NA with the previous non-NA value, group by 'ID', then paste the 'Text' by collapsing the elements into a single string
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Text = str_c(Text, collapse=' '))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
Or create a logical index, turn it into a running count with cumsum to fill the 'ID', and use that as the grouping variable to summarise the 'Text' column
df1 %>%
group_by(ID = ID[ID != ""][cumsum(ID != "")]) %>%
summarise(Text = str_c(Text, collapse=" "))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
data
df1 <- structure(list(ID = c("1", "", "", "2", ""), Text = c("This is a red",
"car. Its electric", "and has 4 wheels.", "This is a van with",
"six wheels.")), row.names = c(NA, -5L), class = "data.frame")
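If you would rather avoid extra packages, the same fill-then-collapse idea can be sketched in base R. This is only a sketch: has_id and filled_id are names made up here, and it assumes the first row always carries an ID (otherwise the index would start at 0 and the lengths would not line up).

```r
df1 <- data.frame(
  ID = c("1", "", "", "2", ""),
  Text = c("This is a red", "car. Its electric", "and has 4 wheels.",
           "This is a van with", "six wheels."),
  stringsAsFactors = FALSE
)

# TRUE on the rows that start a new record (assumption: row 1 has an ID)
has_id <- df1$ID != ""

# cumsum() turns the flags into 1, 1, 1, 2, 2; indexing the non-blank IDs
# with it carries each ID forward over the blank rows
filled_id <- df1$ID[has_id][cumsum(has_id)]

# collapse the text fragments within each filled ID
res <- aggregate(Text ~ ID,
                 data = data.frame(ID = filled_id, Text = df1$Text,
                                   stringsAsFactors = FALSE),
                 FUN = paste, collapse = " ")
print(res)
```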

Related

R reorder column values alphabetic

I have a dataframe like this in R:
and I want to reorder the second column "Car" alphabetically like this:
Car
Audi/BMW/VW
Audi/BMW
Audi/BMW/VW
Audi/BMW/Porsche/VW
There could be 0 to 15 cars, separated by "/".
My solution is a little bit complicated (build a new data frame with this column, split it into multiple columns, reorder the rows alphabetically, paste them together, insert the result into the original data frame).
Do you know a better and smarter solution?
Thanks a lot
This is basically what you did but without creating new dataframe and new columns.
df$Car <- sapply(strsplit(as.character(df$Car), "/"), function(x)
paste(sort(x), collapse = "/"))
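A quick way to convince yourself that this handles the "0 cars" case is to run it on a small sample. The sketch below reuses the names and cars from the data further down and adds one made-up row ("Nobody") with an empty string; sort_cars is just an illustrative helper name, and vapply is used instead of sapply so the result is guaranteed to be a character vector.

```r
df <- data.frame(
  Name = c("Peter", "Marie", "Frank", "Lilly", "Nobody"),  # "Nobody" is a made-up edge case
  Car = c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche", ""),
  stringsAsFactors = FALSE
)

# sort the pieces of one split entry and glue them back together
sort_cars <- function(x) paste(sort(x), collapse = "/")

# strsplit("") yields character(0), which pastes back to ""
df$Car <- vapply(strsplit(as.character(df$Car), "/"), sort_cars, character(1))
print(df)
```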
We can use separate_rows to split the second column, then arrange by 'Name', and 'Car' and paste the elements grouped by 'Name'
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
separate_rows(Car) %>%
arrange(Name, Car) %>%
group_by(Name, zipcode) %>%
summarise(Car = str_c(Car, collapse="/"))
# A tibble: 4 x 3
# Groups: Name [4]
# Name zipcode Car
# <chr> <dbl> <chr>
#1 Frank 3456 Audi/BMW/VW
#2 Lilly 1333 Audi/BMW/Porsche/VW
#3 Marie 1416 Audi/BMW
#4 Peter 1213 Audi/BMW/VW
data
df1 <- structure(list(Name = c("Peter", "Marie", "Frank", "Lilly"),
Car = c("BMW/VW/Audi", "Audi/BMW", "VW/BMW/Audi", "Audi/BMW/VW/Porsche"
), zipcode = c(1213, 1416, 3456, 1333)),
class = "data.frame", row.names = c(NA,
-4L))

splitting strings into columns in R

I have a vector with text in an R data frame, such as the one below:
string<-c("Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15")
and I want to split the text into an 8-column data frame with the associated values, e.g.
How can I do that?
Thanks!
An option would be to create NA for missing cases, then use separate_rows/separate to split the string
library(dplyr)
library(tidyr)
library(stringr)
library(tibble)
tibble(col = string) %>%
mutate(col = str_replace_all(col, ": (?![0-9])", ": NA ")) %>%
separate_rows(col, sep="(?<=:\\s\\w{1,5}) ") %>%
separate(col, into = c('col1', 'col2'), sep=":\\s+") %>%
deframe %>%
as.data.frame.list(check.names = FALSE) %>%
type.convert(as.is = TRUE)
#Real estate surface Number of rooms Number of bedrooms Number of bathrooms Number of toilets Year of construction
#1 60m2 3 2 1 0 1980
# Last renovation Floor
#1 NA 1/15
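If the lookaround regex is hard to follow, here is a rough base R sketch of the same split. It rests on an assumption that holds for this string but may not hold for other data: every label starts with an uppercase letter, no value does, and the text contains no "|" character. The names marked, parts, key and val are only illustrative.

```r
string <- c("Real estate surface: 60m2 Number of rooms: 3 Number of bedrooms: 2 Number of bathrooms: 1 Number of toilets: 0 Year of construction: 1980 Last renovation: Floor: 1/15")

# put a marker in front of every label (a space followed by an uppercase letter)
marked <- gsub(" (?=[A-Z])", "|", string, perl = TRUE)
parts <- strsplit(marked, "|", fixed = TRUE)[[1]]

# everything before the first colon is the label, the rest is the value
key <- sub(":.*$", "", parts)
val <- trimws(sub("^[^:]*:", "", parts))
val[val == ""] <- NA  # labels without a value, e.g. "Last renovation"

out <- as.data.frame(as.list(setNames(val, key)),
                     check.names = FALSE, stringsAsFactors = FALSE)
print(out)
```

All values stay character here; type.convert(out, as.is = TRUE) can be applied afterwards, as in the answer above.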

separate a column into multiple variables with unique column names in R

Here is how I want my dataframe to look:
record color size height weight
1 blue large heavy
1 red
2 green small tall thin
However, the data (df) appears as follows:
record vars
1 color = "blue", size = "large"
2 color = "green", size = "small"
2 height = "tall", weight = "thin"
1 color = "red", weight = "heavy"
The code for df
structure(list(record = c(1L, 2L, 2L, 1L), vars = structure(c(1L,
2L, 4L,
3L), .Label = c("color = \"blue\", size = \"large\"",
"color = \"green\", size = \"small\"", "color = \"red\", weight =
\"heavy\"",
"height = \"tall\", weight = \"thin\""), class = "factor")), class =
"data.frame", row.names = c(NA,
-4L))
For each record, I would like to separate the vars column by the "," delimiter, and create a new column with the indicated variable name...The record should be repeated if there are multiple values for a particular variable
I know that to do this with tidyverse I will need to use dplyr::group_by and dplyr::separate, however I'm not clear how to incorporate the new variable names in the "into" parameter for separate. Do I need some type of regular expression to identify any text prior to an equal sign "=" as the new variable name in "into"?? Any suggestions much welcome!
df %>%
group_by(record) %>%
separate(col = vars, into = c(regex expression?? / character vector?), sep = ",")
Since the columns are already almost written as R code defining a list, you could parse/eval them and then unnest_wider
library(tidyverse)
library(rlang) # parse_expr() comes from rlang, which library(tidyverse) does not attach
df %>%
mutate(vars = map(vars, ~ eval(parse_expr(paste('list(', .x, ')'))))) %>%
unnest_wider(vars)
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
# 1 1 blue large NA NA
# 2 2 green small NA NA
# 3 2 NA NA tall thin
Here is one option with tidyverse. Create a sequence column 'rn', then use separate_rows on the 'vars' column to split at the commas, remove the quotes with str_remove_all, separate the column into two, and reshape from 'long' to 'wide' with pivot_wider
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
separate_rows(vars, sep=",\\s*\\n*") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars, into = c("vars1", "vars2"), sep="\\s*=\\s*") %>%
pivot_wider(names_from = vars1, values_from = vars2,
values_fill = list(vars2 = '')) %>%
select(-rn)
# A tibble: 3 x 5
# record color size height weight
# <int> <chr> <chr> <chr> <chr>
#1 1 blue large "" ""
#2 2 green small "" ""
#3 2 "" "" tall thin
Another way is to convert to 2-column-matrices and merge. We'll need a helper FUNction that converts a vector into a matrix with first row as the header.
FUN <- function(x) {m <- matrix(x, 2);as.data.frame(rbind(`colnames<-`(m, m[1, ])[-1, ]))}
Then just get rid of non-character stuff and merge.
l <- lapply(strsplit(trimws(gsub("\\W+", " ", as.character(df$vars))), " "), FUN)
l <- Map(`[<-`, l, 1, "record", df$record) # cbind record column
Reduce(function(...) merge(..., all=TRUE), l) # merge
# record color weight size height
# 1 1 blue <NA> large <NA>
# 2 1 red heavy <NA> <NA>
# 3 2 green thin small tall
I just noticed that all answers posted so far (including the accepted answer) do not exactly reproduce OP's expected result:
record color size height weight
1 blue large heavy
1 red
2 green small tall thin
which shows 3 rows although the input data has 4 rows.
If I understand correctly, the key-value-pairs for record 2 can be arranged in one row because there are no duplicate values for the same variable. For record 1, variable color has two values which appear in rows 1 and 2, resp., as the OP has requested
The record should be repeated if there are multiple values for a particular variable
All other variables of record 1 have only one value (or none) and are arranged in row 1.
So, for each record a sub-table with a ragged bottom is created where the columns are filled from top to bottom (separately for each column).
I have tried to reproduce this in two different ways: First with data.table which I am more fluent with and then with dplyr/tidyr. Finally, I will propose an alternative presentation of duplicate values using toString().
data.table
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record + rowid(record, V1) ~ fct_inorder(V1), value.var = "V2")][
, record_1 := NULL][]
record color size height weight
1: 1 blue large <NA> heavy
2: 1 red <NA> <NA> <NA>
3: 2 green small tall thin
This works in 5 steps:
Split multiple key-value-pairs in each row and arrange in separate rows.
Remove double quotes.
Split key-value-pairs and arrange in separate columns.
Reshape from long to wide format where the rows are given by record and by a count of each individual key within record using rowid() and the columns are given by the keys (variables). Using fct_inorder() ensures the columns are arranged in order of appearance of the variables (just to reproduce exactly OP's expected result).
Drop the helper column from the final result.
To be even more consistent with OP's expected result, the NAs can be turned into blanks by adding the parameter fill = "" to the dcast() call.
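For readers without data.table, the five steps can also be sketched in plain base R. This is an illustration of the steps listed above, not the code from the answer: long, keyid, rows and wide are names chosen here, and the data frame is rebuilt with vars as character instead of factor.

```r
df <- data.frame(
  record = c(1L, 2L, 2L, 1L),
  vars = c('color = "blue", size = "large"',
           'color = "green", size = "small"',
           'height = "tall", weight = "thin"',
           'color = "red", weight = "heavy"'),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: drop the quotes, then put each key-value pair on its own row
pairs <- strsplit(gsub('"', "", df$vars), ",\\s*")
long <- data.frame(record = rep(df$record, lengths(pairs)),
                   pair = unlist(pairs), stringsAsFactors = FALSE)

# Step 3: split each pair into key and value
long$key <- sub("\\s*=.*$", "", long$pair)
long$val <- sub("^[^=]*=\\s*", "", long$pair)

# Step 4a: count how often each key has occurred within its record
long$keyid <- ave(seq_along(long$key), long$record, long$key, FUN = seq_along)

# Step 4b: one output row per (record, keyid), one column per key,
# with columns in order of first appearance
rows <- unique(long[order(long$record, long$keyid), c("record", "keyid")])
wide <- rows
for (k in unique(long$key)) {
  sub_k <- long[long$key == k, ]
  idx <- match(paste(rows$record, rows$keyid), paste(sub_k$record, sub_k$keyid))
  wide[[k]] <- sub_k$val[idx]
}

# Step 5: drop the helper column
wide$keyid <- NULL
rownames(wide) <- NULL
print(wide)
```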
dplyr / tidyr
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
group_by(record, key) %>%
mutate(keyid = row_number(key)) %>%
pivot_wider(id_cols = c(record, keyid), names_from = key, values_from = val) %>%
arrange(record, keyid) %>%
select(-keyid)
# A tibble: 3 x 5
# Groups: record [2]
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue large NA heavy
2 1 red NA NA NA
3 2 green small tall thin
The steps are essentially the same as for the data.table approach. The statements
group_by(record, key) %>%
mutate(keyid = row_number(key))
are a replacement for data.table::rowid().
Add the parameter values_fill = list(val = "") to replace the NAs with blanks.
Alternative representation
The following does not aim to reproduce OP's expected result as closely as possible, but proposes an alternative, more concise representation of the result with one row per record.
During reshaping, a function can be used to aggregate the data in each cell. The toString() function concatenates character strings.
library(data.table)
library(stringr)
library(forcats)
setDT(df)[, str_split(vars, ", "), by = .(rn = seq_along(vars), record)][
, V1 := str_remove_all(V1, '"')][
, tstrsplit(V1, " = "), by = .(rn, record)][
, dcast(.SD, record ~ fct_inorder(V1), toString, value.var = "V2")]
record color size height weight
1: 1 blue, red large heavy
2: 2 green small tall thin
or
library(dplyr)
library(tidyr)
library(stringr)
df %>%
separate_rows(vars, sep = ", ") %>%
mutate(vars = str_remove_all(vars, '"')) %>%
separate(vars,c("key", "val")) %>%
pivot_wider(names_from = key, values_from = val, values_fn = list(val = toString))
# A tibble: 2 x 5
record color size height weight
<int> <chr> <chr> <chr> <chr>
1 1 blue, red large NA heavy
2 2 green small tall thin
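The aggregation idea itself does not depend on data.table or tidyr. As a minimal sketch, assuming the key-value pairs have already been split into a long table (typed in by hand below as long, with key_f as an illustrative helper), base R's tapply() can apply toString() to each record/key cell:

```r
# long format as it looks after splitting the pairs (entered by hand here)
long <- data.frame(
  record = c(1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L),
  key = c("color", "size", "color", "size", "height", "weight", "color", "weight"),
  val = c("blue", "large", "green", "small", "tall", "thin", "red", "heavy"),
  stringsAsFactors = FALSE
)

# keep the columns in order of first appearance, like fct_inorder()
key_f <- factor(long$key, levels = unique(long$key))

# one cell per record/key combination; duplicates are joined by toString()
wide <- tapply(long$val, list(record = long$record, key = key_f), toString)
print(wide)
```

The result is a character matrix rather than a data frame, with NA in the cells that have no value.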

Extract words from text using dplyr and stringr

I'm trying to find an effective way to extract words from a text column in a dataset. The approach I'm using is
library(dplyr)
library(stringr)
Text = c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data = as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig","wolf","cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
It's just an example, but I have more than 2000 possible words to extract from each row. I don't know of another approach yet. Will such a big regex make things slow, or does the size of the regex not matter? I don't think more than one of these words will appear in any row, but is there a way to create multiple columns automatically if more than one word does appear?
We can use str_extract_all to return a list, convert the list elements to a named list or tibble and use unnest_wider
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
mutate(Words = str_extract_all(Text, keywords),
Words = map(Words, ~ as.list(unique(.x)) %>%
set_names(str_c('col', seq_along(.))))) %>%
unnest_wider(Words)
# A tibble: 3 x 3
# Text col1 col2
# <fct> <chr> <chr>
#1 A little bird told me about the dog bird dog
#2 A pig in a poke pig <NA>
#3 As busy as a bee bee <NA>
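One caveat with a pattern built by pasting the words together with "|": it also matches inside longer words (for example "cat" inside "category"). A hedged base R sketch that adds word boundaries and pads the matches into columns is below; kw, pattern, hits and res are names chosen for this illustration only.

```r
Text <- c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
kw <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")

# \\b on both sides restricts matches to whole words
pattern <- paste0("\\b(", paste(kw, collapse = "|"), ")\\b")

# all matches per row, duplicates removed
hits <- lapply(regmatches(Text, gregexpr(pattern, Text, perl = TRUE)), unique)

# pad every row to the same length with NA and bind into a matrix
n <- max(lengths(hits))
mat <- do.call(rbind, lapply(hits, `length<-`, n))
colnames(mat) <- paste0("col", seq_len(n))

res <- data.frame(Text = Text, mat, stringsAsFactors = FALSE)
print(res)
```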
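Try intersect() with the keywords kept as a character vector (called kw here) rather than the collapsed regex string. Since a row can contain more than one keyword, the result is stored as a list column:
kw <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
data$Word <- lapply(strsplit(Text, " "), intersect, kw)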

replace symbols, in factors, in a data frame, with dplyr mutate

I have a data frame, and for various reasons I need to keep one of the elements as a factor and, maintaining the order of the levels, replace periods in the levels with spaces. Here's an example
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
mutate(sandwich_factor = factor(sandwich)) %>%
mutate(sandwich2 = factor(sandwich_factor,
levels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
# A tibble: 5 x 4
sandwich_str sandwich_factor sandwich2 sandwich3
<chr> <fctr> <fctr> <chr>
1 bread bread bread bread
2 mustard.sauce mustard.sauce <NA> mustard sauce
3 tuna.fish tuna.fish <NA> tuna fish
4 lettuce lettuce lettuce lettuce
5 bread bread bread bread
So in this data frame:
sandwich_str is a character column
sandwich_factor is a factor column
in sandwich2 I tried replacing all of the periods in the levels of sandwich_factor. For whatever reason, this returns NA wherever there are periods.
in sandwich3 I take the simpler approach of just replacing all of the periods in the strings with spaces. This works substantially better.
So I'm wondering what isn't working in my attempt at sandwich2. I'd like it to look more like sandwich3. Any advice?
Does this suit?
library(tidyverse)
library(stringr)
# Data --------------------------------------------------------------------
sandwich <-
c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
df <-
data_frame(sandwich_str = sandwich)
# Convert periods to spaces -----------------------------------------------
df$sandwich_str <-
df$sandwich_str %>%
as.character() %>%
str_replace("\\."," ") %>%
as.factor()
# Print output ------------------------------------------------------------
df %>%
print()
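Credit to @aosmith for posting this answer as a comment. I'll post it here as an answer so I can accept and close this.
The problem was that the new level names have to be passed via the labels argument rather than levels. The levels argument tells factor() which existing values to look for, so the values that still contained periods matched none of the new levels and became NA. So the correct way for me to have written this previously would be: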
library(tidyverse)
library(stringr)
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
sandwich_df <- data_frame(sandwich_str = sandwich) %>%
mutate(sandwich_factor = factor(sandwich)) %>%
mutate(sandwich2 = factor(sandwich_factor,
labels = str_replace_all(levels(sandwich_factor), "\\.", " "))) %>%
mutate(sandwich3 = str_replace_all(sandwich_str, "\\.", " "))
print(sandwich_df)
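To see the difference between levels and labels in isolation, here is a small base R sketch without any tidyverse code. The object names (f, new_names, wrong, right, also_right) are made up for the illustration.

```r
sandwich <- c("bread", "mustard.sauce", "tuna.fish", "lettuce", "bread")
f <- factor(sandwich)

# the level names we want, in the same order as levels(f)
new_names <- gsub(".", " ", levels(f), fixed = TRUE)

# levels = ... says which existing values to look for;
# "mustard.sauce" and "tuna.fish" are not among them, so they become NA
wrong <- factor(f, levels = new_names)

# labels = ... renames the existing levels and keeps their order
right <- factor(f, labels = new_names)

# assigning to levels() is another way to rename in place
also_right <- f
levels(also_right) <- new_names

print(as.character(wrong))
print(as.character(right))
```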
