How many elements in common on multiple lists? - r

Hi, I'm working with a dataset that has a column named "genres" containing, for each film, a character vector of all its genre tags. I want to create a plot that shows the popularity of all genres.
structure(list(anime_id = c("10152", "11061", "11266", "11757", "11771"),
    Name.x = c("Kimi ni Todoke 2nd Season: Kataomoi", "Hunter x Hunter (2011)",
        "Ao no Exorcist: Kuro no Iede", "Sword Art Online", "Kuroko no Basket"),
    genres = list("Romance", c("Action", " Adventure", " Fantasy"), "Fantasy",
        c("Action", " Adventure", " Fantasy", " Romance"), "Sports")),
    row.names = c(NA, 5L), class = "data.frame")
Initially the genres column is a string with genres separated by commas, for example: ['action', 'drama', 'fantasy']. To work with it, I ran this code to clean up the column:
AnimeList2022new$genres <- gsub("\\[|\\]|'", "", as.character(AnimeList2022new$genres))
AnimeList2022new$genres <- strsplit(AnimeList2022new$genres, ",")
I don't know how to compare all the vectors in order to count how many times each tag appears.
I'm trying with group_by and summarise
genresdata <- MyAnimeList %>%
  group_by(genres) %>%
  summarise(count = n()) %>%
  arrange(-count)
but obviously this code groups identical vectors rather than the individual strings contained in the vectors.

Your genres column is of class list, so it sounds like you want the length() of each row in it. Generally, we could do that like this:
MyAnimeList %>%
  mutate(n_genres = sapply(genres, length))
But this is a special case where there is a nice convenience function, lengths() (notice the s at the end), built into R that gives us the same result, so we can simply do
MyAnimeList %>%
  mutate(n_genres = lengths(genres))
The above will give the number of genres for each row.
In the comments I see you say you want "for example how many times "Action" appears in the whole column". For that, we can unnest() the genre list column and then count:
library(tidyr)
MyAnimeList %>%
  unnest(genres) %>%
  count(genres)
# # A tibble: 7 × 2
# genres n
# <chr> <int>
# 1 " Adventure" 2
# 2 " Fantasy" 2
# 3 " Romance" 1
# 4 "Action" 2
# 5 "Fantasy" 1
# 6 "Romance" 1
# 7 "Sports" 1
Do notice that some of your genres have leading whitespace. It's probably best to solve this problem "upstream", wherever the genre column was created, but we can do it now using trimws() to trim the whitespace:
MyAnimeList %>%
  unnest(genres) %>%
  count(trimws(genres))
# # A tibble: 5 × 2
# `trimws(genres)` n
# <chr> <int>
# 1 Action 2
# 2 Adventure 2
# 3 Fantasy 3
# 4 Romance 2
# 5 Sports 1
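For completeness, the same tally can be done in base R by flattening the list column with unlist(), trimming, and tabulating. A minimal sketch using the dput data shared above (Name.x omitted for brevity):

```r
# Sample data from the question
MyAnimeList <- structure(list(
  anime_id = c("10152", "11061", "11266", "11757", "11771"),
  genres = list("Romance", c("Action", " Adventure", " Fantasy"), "Fantasy",
                c("Action", " Adventure", " Fantasy", " Romance"), "Sports")
), row.names = c(NA, 5L), class = "data.frame")

# Flatten the list column, trim the stray whitespace, then count each tag
genre_counts <- table(trimws(unlist(MyAnimeList$genres)))
sort(genre_counts, decreasing = TRUE)
```

A barplot(sort(genre_counts, decreasing = TRUE)) then gives the popularity plot the question asks for.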


Convert text into dataframe

I have been given some data in a text format that I would like to convert into a dataframe:
text <- "
VALUE Ethnic
1 = 'White - British'
2 = 'White - Irish'
9 = 'White - Other'
;
"
I'm looking to convert it into a dataframe with a column for the first number and a column for the text in the string. So, in this case, it would be two columns and three rows.
library(tidyr)
library(dplyr)
tibble(text = trimws(text)) %>%
  separate_rows(text, sep = "\n") %>%
  filter(text != ";") %>%
  slice(-1) %>%
  separate(text, into = c("VALUE", "Ethnic"), sep = "\\s+=\\s+")
-output
# A tibble: 3 × 2
VALUE Ethnic
<chr> <chr>
1 1 'White - British'
2 2 'White - Irish'
3 9 'White - Other'
Or in base R
read.table(text = gsub("=", " ", trimws(text, whitespace = "\n(;\n)*"),
  fixed = TRUE), header = TRUE)
VALUE Ethnic
1 1 White - British
2 2 White - Irish
3 9 White - Other
import pandas as pd

# create the years list
years_list = list(range(1986, 2020))

# define the column separations specified in the layout
columns_width = [(0,2),(2,10),(10,12),(12,24),(24,27),(27,39),(39,49),(49,52),(52,56),(56,69),(69,82),
                 (82,95),(95,108),(108,121),(121,134),(134,147),(147,152),(152,170),(170,188),(188,201),
                 (201,202),(202,210),(210,217),(217,230),(230,242),(242,245)]

# define the English-translated column names according to the layout
columns_header = ['Register Type','Trading Date','BDI Code','Negociation Code','Market Type','Trade Name',
                  'Specification','Forward Market Term In Days','Currency','Opening Price','Max. Price',
                  'Min. Price','Mean Price','Last Trade Price','Best Purshase Order Price',
                  'Best Purshase Sale Price','Numbor Of Trades','Number Of Traded Stocks',
                  'Volume Of Traded Stocks','Price For Options Market Or Secondary Term Market',
                  'Price Corrections For Options Market Or Secondary Term Market',
                  'Due Date For Options Market Or Secondary Term Market','Factor Of Paper Quotatuion',
                  'Points In Price For Options Market Referenced In Dollar Or Secondary Term',
                  'ISIN Or Intern Code ','Distribution Number']

# create an empty df that will be filled during the iteration below
years_concat = pd.DataFrame()

# iterate over all years
for year in years_list:
    time_serie = pd.read_fwf('/kaggle/input/bmfbovespas-time-series-19862019/COTAHIST_A' + str(year) + '.txt',
                             header=None, colspecs=columns_width)
    # delete the first and the last lines containing identifiers
    # uncomment the two lines below to see them
    # output = pd.DataFrame(np.array([time_serie.iloc[0], time_serie.iloc[-1]]))
    # output
    time_serie = time_serie.drop(time_serie.index[0])
    time_serie = time_serie.drop(time_serie.index[-1])
    years_concat = pd.concat([years_concat, time_serie], ignore_index=True)

years_concat.columns = columns_header

R Concatenate Across Rows Within Groups but Preserve Sequence

My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
"We're wasting time",
"Let's make a record!",
"Let's work it out first",
"Why?",
"It goes like this",
"Hold on",
"Have to tighten my snare",
"Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
dyad speaker text
1 1 John Let's play. We're wasting time. Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on. Have to tighten my snare
6 2 George Ready?
I've tried grouping by sender and pasting/collapsing from dplyr but the concatenation combines all of a sender's text without preserving speaking turn order. For example, John's last statement ("Why") winds up with his other text in the output rather than coming after Paul's comment. I also tried to check if the next speaker (using lead(sender)) is the same as current and then combining, but it only does adjacent rows, in which case it misses John's third comment in the example. Seems it should be simple but I can't make it happen. And it should be flexible to combine any series of continuous rows by a given speaker.
Thanks in advance
Create another group with rleid (from data.table) and paste the rows in summarise
library(dplyr)
library(data.table)
library(stringr)
dat %>%
  group_by(dyad, grp = rleid(speaker), speaker) %>%
  summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
  select(-grp)
-output
# A tibble: 6 × 3
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
Not as elegant as dear akrun's solution. The helper column does the same as the rleid function here, without the need for an additional package:
library(dplyr)
dat %>%
  mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
         helper = cumsum(helper)) %>%
  group_by(dyad, speaker, helper) %>%
  summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
  select(-helper)
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 John Why?
3 1 Paul Let's work it out first
4 2 George It goes like this
5 2 George Ready?
6 2 Ringo Hold on Have to tighten my snare
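For reference, the same run-length idea can also be sketched in base R (no data.table or dplyr), using rle() to number consecutive same-speaker rows and aggregate() to paste each turn together, with the dat data frame from the question:

```r
dat <- data.frame(
  dyad = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John", "George", "Ringo", "Ringo", "George"),
  text = c("Let's play", "We're wasting time", "Let's make a record!",
           "Let's work it out first", "Why?", "It goes like this",
           "Hold on", "Have to tighten my snare", "Ready?")
)

# Number each run of consecutive rows with the same dyad/speaker (base-R rleid)
dat$grp <- with(rle(paste(dat$dyad, dat$speaker)), rep(seq_along(lengths), lengths))

# Paste the text within each run, then restore speaking-turn order
out <- aggregate(text ~ grp + dyad + speaker, data = dat, FUN = paste, collapse = " ")
out <- out[order(out$grp), c("dyad", "speaker", "text")]
out
```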

Removing all characters before and after text in R, then creating columns from the new text

So I have a string that I'm attempting to parse through and then create 3 columns with the data I extract. From what I've seen, stringr doesn't really cover this case, and the gsub approach I've used so far is excessive: it involves making multiple columns, parsing from those new columns, and then removing them, which seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place  Number  Name
PHI    80      J.Matthews
NE     5       J.Mills
KC     10      T.Hill
Edit for further explanation:
Most strings include other people in the same format so "downed by" needs to be incorporated in someway to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place  Number  Name
KC     10      T.Hill
This solution simply extracts the components based on the logic the OP mentioned, i.e. capture the characters that are needed as three groups: 1) one or more uppercase letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+), and finally 3) the non-whitespace characters (\\S+) that follow the dash.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
  ".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
-output
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1", df1$col1),
  header = FALSE, col.names = c("Place", "Number", "Name"), sep = '-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(tidyr)
library(stringr)
df1 %>%
  mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
  separate(col1,
           into = c("Place", "Number", "Name"),
           sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
Here is an alternative way using sub with the regex "([A-Za-z]+\\.[A-Za-z]+).*", "\\1", which removes everything after the second period; then separate to split the string at " by ", and finally separate again to get the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
  mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
  separate(test1, c('remove', 'keep'), sep = " by ") %>%
  separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
  select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
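If you'd rather stay in base R, the same three-group capture also maps onto strcapture(), which applies a regex with capture groups and returns them as a data frame in one call. A sketch using the df1 data from the first answer (the name class [A-Za-z.]+ here is my assumption about what a name can contain):

```r
df1 <- data.frame(col1 = c(
  "blah, grabbed by PHI-80-J.Matthews.",
  "blah, grabbed by NE-5-J.Mills.",
  "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
))

# Anchor on "grabbed by" so other PLACE-NUM-NAME patterns in the string are
# ignored, then capture place, number, and name as three groups
res <- strcapture("grabbed by ([A-Z]+)-(\\d+)-([A-Za-z.]+)\\.", df1$col1,
                  proto = data.frame(Place = character(),
                                     Number = integer(),
                                     Name = character()))
res
```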

R: find words from tweets in Lexicon, count them and save number in dataframe with tweets

I have a data set of 50,176 tweets (tweets_data: 50,176 obs. of 1 variable). I have also created a self-made lexicon (formal_lexicon) consisting of around 1 million words, all in a formal language style. Now, I want to write a small piece of code that counts, per tweet, how many words (if any) are also in that lexicon.
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
The output should thus look like:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
Should be a simple for loop like:
for(sentence in tweets_data$Content){
  for(word in sentence){
    if(word %in% formal_lexicon){
      ...
    }
  }
}
I don't think "word" works and I'm not sure how to count in the specific column if a word is in the lexicon. Can anyone help?
structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
You can try something like this:
library(tidytext)
library(dplyr)
# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
tweets_data <- c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
# put in a data.frame your tweets
tweets_data_df <- data.frame(Content = tweets_data, id = 1:length(tweets_data))
tweets_data_df %>%
  # split into one word per row
  unnest_tokens(txt, Content) %>%
  # add a field that flags whether the word is in the lexicon - keep the 0s -
  mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
  # group by tweet
  group_by(id) %>%
  # count the matches
  summarise(cnt = sum(pres)) %>%
  # put back the texts
  left_join(tweets_data_df) %>%
  # reorder the columns
  select(id, Content, cnt)
With result:
Joining, by = "id"
# A tibble: 6 x 3
id Content cnt
<int> <chr> <dbl>
1 1 "#barackobama Thank you for your incredible grace in leadership a~ 0
2 2 "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles Co~ 0
3 3 "2017 resolution: to embody authenticity!" 0
4 4 "Happy Holidays! Sending love and light to every corner of the ea~ 0
5 5 "Damn, it's hard to wrap presents when you're drunk. cc #santa" 0
6 6 "When my whole fam tryna have a peaceful holiday " 0
Hope this is useful for you:
library(magrittr)
library(dplyr)
library(tidytext)
# Data frame with tweets, including an ID
tweets <- data.frame(
  id = 1:3,
  text = c(
    'Hello, this is the first tweet example to your answer',
    'I hope that my response help you to do your task',
    'If it is tha case, please upvote and mark as the correct answer'
  )
)
lexicon <- data.frame(
  word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)
# Counting words in tweets present in your lexicon
in_lexicon <- tweets %>%
  # separate every word in your tweets into its own row
  tidytext::unnest_tokens(output = 'words', input = text) %>%
  # determine whether a word is in your lexicon
  dplyr::mutate(in_lexicon = words %in% lexicon$word) %>%
  dplyr::group_by(id) %>%
  dplyr::summarise(words_in_lexicon = sum(in_lexicon))

# Binding count and the original data
dplyr::left_join(tweets, in_lexicon)
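A base-R alternative for the same per-tweet count, using strsplit() to tokenize; the tweets and lexicon here are shortened, hypothetical stand-ins. Note the tolower(): %in% is case-sensitive, and the lexicon shown in the question mixes cases like "Consequently".

```r
formal_lexicon <- c("admittedly", "consequently", "furthermore", "essentially")
tweets <- c("Furthermore, I essentially agree", "Hi my name is", "Yes I need")

# Lowercase, strip punctuation, split into words, then count lexicon hits per tweet
tokens <- strsplit(tolower(gsub("[[:punct:]]+", " ", tweets)), "\\s+")
result <- data.frame(Content = tweets,
                     Lexicon = sapply(tokens, function(w) sum(w %in% formal_lexicon)))
result
```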

Is there an R function to split the sentence

I have a couple of unstructured sentences like below. Description below is the column name.
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split these sentences into Col1 to Col5 and count their occurrences, like below:
Col1 Col2 Col3 Col4
Automatic_lever lever_for for_a a_machine
Vaccum_chamber chamber_with with_additional additional_spare
Glove_box box_for for_R&D R&D
The_Mini Mini_Guage Guage_5 5_sets
Vacuum_chamber chamber_only only
Automatic_lever lever_only only
Also, from the above columns, can I get the occurrence of these words? For example, Vaccum_chamber and Automatic_lever are repeated twice here. Similarly, the occurrence of the other words?
Here is a tidyverse option
library(tidyverse)

df %>%
  rowid_to_column("row") %>%
  mutate(words = map(str_split(Description, " "), function(x) {
    if (length(x) %% 2 == 0) x <- c(x, NA)
    idx <- 1:(length(x) - 1)
    map_chr(idx, function(i) paste0(x[i:(i + 1)], collapse = "_"))
  })) %>%
  unnest(words) %>%
  group_by(row) %>%
  mutate(
    words = str_replace(words, "_NA", ""),
    col = paste0("Col", 1:n())) %>%
  filter(words != "NA") %>%
  spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups: row [6]
# row Description Col1 Col2 Col3 Col4
# <int> <fct> <chr> <chr> <chr> <chr>
#1 1 Automatic lever for a mac… Automatic_… lever_for for_a a_machine
#2 2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3 3 Glove box for R&D Glove_box box_for for_R&D R&D
#4 4 The Mini Guage 5 sets The_Mini Mini_Guage Guage_5 5_sets
#5 5 Vacuum chamber only Vacuum_cha… chamber_o… only ""
#6 6 Automatic lever only Automatic_… lever_only only ""
Explanation: We split the sentences in Description on a single whitespace " ", then concatenate every two words with a sliding-window approach, making sure that there is always an odd number of words per sentence; the rest is just a long-to-wide transformation.
Not pretty, but it reproduces your expected output; instead of the manual sliding-window approach you could also use zoo::rollapply.
Sample data
df <- read.table(text = "Description
  'Automatic lever for a machine'
  'Vaccum chamber with additional spare'
  'Glove box for R&D'
  'The Mini Guage 5 sets'
  'Vacuum chamber only'
  'Automatic lever only'", header = TRUE)
