I have a couple of unstructured sentences like the ones below. Description is the column name.
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split each sentence into two-word pairs across columns Col1 to Col5 and count their occurrences, like below:
Col1 Col2 Col3 Col4
Automatic_lever lever_for for_a a_machine
Vaccum_chamber chamber_with with_additional additional_spare
Glove_box box_for for_R&D R&D
The_Mini Mini_Guage Guage_5 5_sets
Vacuum_chamber chamber_only only
Automatic_lever lever_only only
Also, from the columns above, can I get the number of occurrences of these word pairs? For example, Vaccum_chamber and Automatic_lever are repeated twice here. Similarly, how often do the other pairs occur?
Here is a tidyverse option
library(tidyverse)

# number of two-word windows per sentence: word count of the longest sentence minus one
n_windows <- max(lengths(str_split(df$Description, " "))) - 1

df %>%
    rowid_to_column("row") %>%
    mutate(words = map(str_split(Description, " "), function(x) {
        # windows that run past the end of a sentence pick up NA,
        # which is stripped or filtered out below
        map_chr(seq_len(n_windows), function(i) paste0(x[i:(i + 1)], collapse = "_"))
    })) %>%
    unnest(words) %>%
    group_by(row) %>%
    mutate(
        words = str_replace(words, "_NA", ""),
        col = paste0("Col", 1:n())) %>%
    filter(words != "NA") %>%
    spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups: row [6]
# row Description Col1 Col2 Col3 Col4
# <int> <fct> <chr> <chr> <chr> <chr>
#1 1 Automatic lever for a mac… Automatic_… lever_for for_a a_machine
#2 2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3 3 Glove box for R&D Glove_box box_for for_R&D R&D
#4 4 The Mini Guage 5 sets The_Mini Mini_Guage Guage_5 5_sets
#5 5 Vacuum chamber only Vacuum_cha… chamber_o… only ""
#6 6 Automatic lever only Automatic_… lever_only only ""
Explanation: We split the sentences in Description on a single whitespace " ", then paste every two consecutive words together with a sliding window. Every sentence gets the same number of windows (one fewer than the word count of the longest sentence); a window that runs past the end of a shorter sentence picks up NA, which is either stripped from a trailing single word or filtered out entirely. The rest is just a long-to-wide transformation.
Not pretty, but it reproduces your expected output; instead of the manual sliding window approach you could also use zoo::rollapply.
Sample data
df <- read.table(text =
"Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = TRUE)
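The second part of the question (how often each word pair occurs) is not covered by the wide table above. A minimal base R sketch, assuming only true two-word pairs should be counted (trailing single words such as "only" are left out) and using the hypothetical names desc, bigrams and counts:

```r
# Sample sentences copied from the question
desc <- c("Automatic lever for a machine",
          "Vaccum chamber with additional spare",
          "Glove box for R&D",
          "The Mini Guage 5 sets",
          "Vacuum chamber only",
          "Automatic lever only")

# Build the two-word windows for one sentence
bigrams <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}

# Tabulate every pair across all sentences, most frequent first
counts <- sort(table(unlist(lapply(desc, bigrams))), decreasing = TRUE)

counts["Automatic_lever"]  # 2
```

Note that "Vaccum" and "Vacuum" are spelled differently in the sample data, so Vaccum_chamber and Vacuum_chamber are counted separately (once each) unless the spelling is normalised first.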
I have been given some data in a text format that I would like to convert into a dataframe:
text <- "
VALUE Ethnic
1 = 'White - British'
2 = 'White - Irish'
9 = 'White - Other'
;
"
I'm looking to convert this into a dataframe with a column for the leading number and a column for the text in the string. So, in this case, it would be two columns and three rows.
library(tidyr)
library(dplyr)
tibble(text = trimws(text)) %>%
separate_rows(text, sep = "\n") %>%
filter(text != ";") %>%
slice(-1) %>%
separate(text, into = c("VALUE", "Ethnic"), sep = "\\s+=\\s+")
-output
# A tibble: 3 × 2
VALUE Ethnic
<chr> <chr>
1 1 'White - British'
2 2 'White - Irish'
3 9 'White - Other'
Or in base R
read.table(text = gsub("=", " ", trimws(text,
whitespace = "\n(;\n)*"), fixed = TRUE), header = TRUE)
VALUE Ethnic
1 1 White - British
2 2 White - Irish
3 9 White - Other
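In the tidyverse output above, the Ethnic values keep their single quotes and VALUE is a character column. If a numeric code and unquoted labels are wanted, one possible base R sketch follows. It assumes every data line has the form number = 'label'; the names rows, parts and lookup are made up for illustration.

```r
text <- "
VALUE Ethnic
1 = 'White - British'
2 = 'White - Irish'
9 = 'White - Other'
;
"

# One element per line, with surrounding whitespace removed
lines <- trimws(strsplit(text, "\n", fixed = TRUE)[[1]])

# Keep only the lines that contain an assignment
rows <- grep("=", lines, value = TRUE, fixed = TRUE)

# Capture the number and the label between the single quotes
parts <- regmatches(rows, regexec("^([0-9]+)[[:space:]]*=[[:space:]]*'(.*)'$", rows))

lookup <- data.frame(
  VALUE  = as.integer(vapply(parts, `[`, character(1), 2)),
  Ethnic = vapply(parts, `[`, character(1), 3),
  stringsAsFactors = FALSE
)
lookup
```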
import pandas as pd

# create the list of years (1986 through 2019)
years_list = list(range(1986,2020))
# define the column boundaries specified in the layout
columns_width = [(0,2),(2,10),(10,12),(12,24),(24,27),(27,39),(39,49),(49,52),(52,56),(56,69),(69,82),
(82,95),(95,108),(108,121),(121,134),(134,147),(147,152),(152,170),(170,188),(188,201),
(201,202),(202,210),(210,217),(217,230),(230,242),(242,245)]
# define the English-translated column names according to the layout
columns_header = ['Register Type','Trading Date','BDI Code','Negociation Code','Market Type','Trade Name',
'Specification','Forward Market Term In Days','Currency','Opening Price','Max. Price',
'Min. Price','Mean Price','Last Trade Price','Best Purshase Order Price',
'Best Purshase Sale Price','Numbor Of Trades','Number Of Traded Stocks',
'Volume Of Traded Stocks','Price For Options Market Or Secondary Term Market',
'Price Corrections For Options Market Or Secondary Term Market',
'Due Date For Options Market Or Secondary Term Market','Factor Of Paper Quotatuion',
'Points In Price For Options Market Referenced In Dollar Or Secondary Term',
'ISIN Or Intern Code ','Distribution Number']
# create an empty df that will be filled during the iteration below
years_concat = pd.DataFrame()
# iterate over all years
for year in years_list:
    time_serie = pd.read_fwf('/kaggle/input/bmfbovespas-time-series-19862019/COTAHIST_A'+str(year)+'.txt',
                             header=None, colspecs=columns_width)
    # delete the first and the last lines, which contain identifiers
    # use the two commented lines below to see them (requires numpy imported as np)
    # output = pd.DataFrame(np.array([time_serie.iloc[0],time_serie.iloc[-1]]))
    # output
    time_serie = time_serie.drop(time_serie.index[0])
    time_serie = time_serie.drop(time_serie.index[-1])
    years_concat = pd.concat([years_concat,time_serie],ignore_index=True)
years_concat.columns = columns_header
Hi, I'm looking at a dataset that has a column named "genres", a list of string vectors containing all the genre tags of each film. I want to create a plot that shows the popularity of all genres.
structure(list(anime_id = c("10152", "11061", "11266", "11757",
"11771"), Name.x = c("Kimi ni Todoke 2nd Season: Kataomoi", "Hunter x Hunter (2011)",
"Ao no Exorcist: Kuro no Iede", "Sword Art Online", "Kuroko no Basket"
), genres = list("Romance", c("Action", " Adventure", " Fantasy"
), "Fantasy", c("Action", " Adventure", " Fantasy", " Romance"
), "Sports")), row.names = c(NA, 5L), class = "data.frame")
Initially the genres column is a string with the genres separated by commas, for example: ['action', 'drama', 'fantasy']. To work with it, I ran this code to edit the column:
AnimeList2022new$genres <- gsub("\\[|\\]|'" , "",
as.character(AnimeList2022new$genres))
AnimeList2022new$genres <- strsplit( AnimeList2022new$genres,
",")
I don't know how to compare all the vectors in order to find out how many times each tag appears.
I'm trying with group_by and summarise
genresdata <- MyAnimeList %>%
group_by(genres) %>%
summarise(count = n()) %>%
arrange(-count)
but obviously this code groups identical vectors, not the individual strings contained in the vectors.
(The output was posted as an image.)
Your genres column is of class list, so it sounds like you want the length() of each row in it. Generally, we could do that like this:
MyAnimeList %>%
mutate(n_genres = sapply(genres, length))
But this is a special case where there is a nice convenience function lengths() (notice the s at the end) built-in to R that gives us the same result, so we can simply do
MyAnimeList %>%
mutate(n_genres = lengths(genres))
The above will give the number of genres for each row.
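As a quick check of that equivalence, here is a small sketch on the genres list from the question's sample data (the standalone genres and n_genres objects are just for illustration):

```r
# The genres list column from the question's dput() output
genres <- list("Romance",
               c("Action", " Adventure", " Fantasy"),
               "Fantasy",
               c("Action", " Adventure", " Fantasy", " Romance"),
               "Sports")

n_genres <- lengths(genres)
n_genres
# [1] 1 3 1 4 1

# Same result as the sapply() version
identical(n_genres, sapply(genres, length))
# [1] TRUE
```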
In the comments I see you say you want "for example how many times "Action" appears in the whole column". For that, we can unnest() the genre list column and then count:
library(tidyr)
MyAnimeList %>%
unnest(genres) %>%
count(genres)
# # A tibble: 7 × 2
# genres n
# <chr> <int>
# 1 " Adventure" 2
# 2 " Fantasy" 2
# 3 " Romance" 1
# 4 "Action" 2
# 5 "Fantasy" 1
# 6 "Romance" 1
# 7 "Sports" 1
Do notice that some of your genres have leading white space. It's probably best to solve this problem "upstream", wherever the genre column was created, but we could do it now using trimws to trim the whitespace:
MyAnimeList %>%
unnest(genres) %>%
count(trimws(genres))
# # A tibble: 5 × 2
# `trimws(genres)` n
# <chr> <int>
# 1 Action 2
# 2 Adventure 2
# 3 Fantasy 3
# 4 Romance 2
# 5 Sports 1
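Since the question asks for a plot of genre popularity, the trimmed counts can be passed straight to a bar chart. A rough base R sketch, swapping in table() and barplot() for the tidyverse calls above (the axis label and the genre_counts name are assumptions):

```r
# The genres list column from the question's dput() output
genres <- list("Romance",
               c("Action", " Adventure", " Fantasy"),
               "Fantasy",
               c("Action", " Adventure", " Fantasy", " Romance"),
               "Sports")

# Flatten the list, trim the stray whitespace, count, and sort by frequency
genre_counts <- sort(table(trimws(unlist(genres))), decreasing = TRUE)

# One bar per genre; las = 2 turns the genre labels vertical
barplot(genre_counts, las = 2, ylab = "Number of titles")
```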
How can I separate a column into several, splitting on the symbol "|" together with any spaces around it?
input <- tibble(A = c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3"))
output <- tibble(B = c("Ae1 tt1", "Be1") , C = c("Ae2 tt2", "Be2"), D = c(NA, "Be3"))
I tried :
input %>%
separate(A, c("B","C","D"))
#separate(A, c("B","C","D"), sep = "|.")
#mutate(B = str_split(A, "*|")) %>% unnest
What is the syntax with regex ?
The solution from "R - separate with specific symbol, vertical bare, |" (and "tidyr::separate() producing unexpected results") does not provide the expected output and produces a warning:
input %>% separate(col=A, into=c("B","C","D"), sep = '\\|')
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 "Ae1 tt1 " " Ae2 tt2" <NA>
2 "Be1 " " Be2 " " Be3"
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
"Using separate from tidyr with different length vectors" does not seem related, unfortunately.
You can use
output <- input %>%
separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
R test:
> input %>% separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 Ae1 tt1 Ae2 tt2 <NA>
2 Be1 Be2 Be3
The \s*\|\s* pattern matches a pipe char with any zero or more whitespace chars on both ends of the pipe.
The fill="right" argument fills with missing values on the right.
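To see what the pattern and the right-hand fill do without tidyr, here is a base R sketch of the same idea. strsplit() with perl = TRUE and the manual NA padding are stand-ins chosen for illustration, not what separate() does internally.

```r
A <- c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3")

# Split on a pipe plus any whitespace on either side of it
pieces <- strsplit(A, "\\s*\\|\\s*", perl = TRUE)

# Pad shorter rows with NA on the right so every row has the same width
n_cols <- max(lengths(pieces))
padded <- t(vapply(pieces, function(p) {
  length(p) <- n_cols  # extending a vector fills the new slots with NA
  p
}, character(n_cols)))

output <- setNames(as.data.frame(padded, stringsAsFactors = FALSE),
                   c("B", "C", "D"))
output
```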
I need to split a string column in a data frame into two columns: the first contains the value before the round brackets and the second contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = tibble(study_name, results)  # data_frame() is deprecated in favour of tibble()
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expected_df = tibble(study, reference)
You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
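The separator above matches a single whitespace character only when it is immediately followed by an opening parenthesis; the lookahead (?=\\() checks for the parenthesis without consuming it, which is why the brackets stay in the reference column. A small base R sketch of the same split, using strsplit() with perl = TRUE as a stand-in for separate():

```r
study_name <- c("apple bannan (tcga, raw 2018)",
                "frame shift (mskk2 nature, 2000)")

# Split at the space that directly precedes "(" and keep the parenthesis
parts <- strsplit(study_name, "\\s(?=\\()", perl = TRUE)

parts[[1]]
# [1] "apple bannan"     "(tcga, raw 2018)"

# Stack the pieces into a two-column character matrix
split_matrix <- do.call(rbind, parts)
colnames(split_matrix) <- c("study", "reference")
split_matrix
```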
If you want to extract the text in the parentheses, using extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested
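The two capture groups in that regex can also be pulled out with base R's sub() and backreferences. This is only a sketch of the same idea: the ^ and $ anchors are added here so that sub() replaces the whole string, and they are not part of the extract() call above.

```r
study_name <- c("apple bannan (tcga, raw 2018)",
                "frame shift (mskk2 nature, 2000)")

# Group 1: text before the bracket; group 2: text inside the bracket
pattern <- "^(.+)\\s\\((.+)\\)$"

study     <- sub(pattern, "\\1", study_name, perl = TRUE)
reference <- sub(pattern, "\\2", study_name, perl = TRUE)

study
# [1] "apple bannan" "frame shift"
reference
# [1] "tcga, raw 2018"     "mskk2 nature, 2000"
```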
We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
I have a tibble.
library(tidyverse)
df <- tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df
I want to separate the genres by "|" and have the empty columns filled with NA.
This is what I did:
df %>%
separate(genres, into = c("genre1", "genre2", "genre3", "genre4", "genre5"), sep = "|")
However, it's being separated after each letter.
I think you haven't included into:
df <- tibble::tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df %>% tidyr::separate(genres, into = c("genre1", "genre2", "genre3",
"genre4", "genre5"))
Result:
# A tibble: 4 x 6
id genre1 genre2 genre3 genre4 genre5
* <int> <chr> <chr> <chr> <chr> <chr>
1 1 Action Adventure Science Fiction Thriller
2 2 Adventure Science Fiction Thriller <NA>
3 3 Action Crime Thriller <NA> <NA>
4 4 Family Animation Adventure Comedy Action
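Edit: Or, as RichScriven wrote in the comments, df %>% tidyr::separate(genres, into = paste0("genre", 1:5)). Note that the default separator splits on any run of non-alphanumeric characters, so "Science Fiction" is broken into two pieces in the result above. For separating on | exactly, use sep = "\\|".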
Well, this is what helped, writing regex properly.
df %>%
separate(genres, into = paste0("genre", 1:5), sep = "\\|")
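The reason sep = "|" went wrong is that an unescaped | is the regex alternation operator, so the pattern can match the empty string between any two characters. A base R sketch with strsplit() (used here in place of separate(), purely for illustration) shows two ways to make the pipe literal:

```r
genres <- c("Action|Crime|Thriller",
            "Family|Animation|Adventure|Comedy|Action")

# Escaped pipe: treated as a literal character inside a regex
by_regex <- strsplit(genres, "\\|")

# fixed = TRUE: no regex interpretation at all
by_fixed <- strsplit(genres, "|", fixed = TRUE)

identical(by_regex, by_fixed)
# [1] TRUE

by_regex[[1]]
# [1] "Action"   "Crime"    "Thriller"

# Unescaped pipe, for comparison: this does not split on the literal "|"
per_char <- strsplit(genres[1], "|")
```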