tidyverse: Combining a column with a different length into an existing tibble

I have a tibble that looks like this:
Review_Text
<chr>
Because it is a nice game
Best trump soumd board out there
Boring hated because it does not work when I get done
but you can make better game if game has unlimeted chemicals bottles
cant get pass loading screen
Can't play video
Casting from Note 3 to Roku 3 screen appears to start loading then back to Roku home screen. Roku software version 6.1 build 5604. It is up to date but still not able to cast Showbox. ..
Crashes all the time in the middle of the show. Whining ensues. Ugh.
Crashing
Does not work on tab 3
Doesn't work
Doesn't work with S7 which is unacceptable in this day and age.
Doesn't work... I absolutely hate it
Dont use this app battery consumers
Dose this work for snmsung I tried some many times 😡
😄I loved it so much I would recommend this to other families 😄
Every time i pressed apply it just took me to the home screen
Everytime it says collect on T.V. it won't obtain the magisword
Excellent!!! My grandchildren watch it all the time...
Feel like Lizzie McGuire 😂❤
I want to remove the stopwords from Review_Text and append the resulting column (without stopwords) to the existing tibble. I am using the following code to remove the stopwords:
no_stpwrd <- tibble(line = 1:nrow(tb), text = tb$Review_Text) %>%
unnest_tokens(word, text)%>%
anti_join(stop_words, by = c("word" = "word")) %>%
group_by(line) %>% summarise(title = paste(word,collapse =' '))
Then I use the following command to merge no_stpwrd with the existing tibble:
add_column(tb, no_stpwrd)
However, when I run the above command, it throws an error because the tibble and no_stpwrd have different numbers of rows. A few rows in the tibble contain only stopwords (for example, line 11 of the tibble), so removing the stopwords leaves nothing for those rows and they are dropped from no_stpwrd. Is there any way to fix this?

Instead of trying to use add_column() here, what you want to do is use a join.
library(tidyverse)
library(tidytext)
review_df <- tibble(review_text = c("Because it is a nice game",
"cant get pass loading screen",
"Because I don't",
"Dont use this app battery consumers")) %>%
mutate(line = row_number())
review_df
#> # A tibble: 4 x 2
#> review_text line
#> <chr> <int>
#> 1 Because it is a nice game 1
#> 2 cant get pass loading screen 2
#> 3 Because I don't 3
#> 4 Dont use this app battery consumers 4
no_stpwrd <- review_df %>%
unnest_tokens(word, review_text) %>%
anti_join(get_stopwords()) %>%
group_by(line) %>%
summarise(title = paste(word,collapse =' '))
#> Joining, by = "word"
no_stpwrd
#> # A tibble: 3 x 2
#> line title
#> <int> <chr>
#> 1 1 nice game
#> 2 2 cant get pass loading screen
#> 3 4 dont use app battery consumers
Notice that the third document is no longer there because it was made up of all stop words. It's time for a left_join().
review_df %>%
left_join(no_stpwrd)
#> Joining, by = "line"
#> # A tibble: 4 x 3
#> review_text line title
#> <chr> <int> <chr>
#> 1 Because it is a nice game 1 nice game
#> 2 cant get pass loading screen 2 cant get pass loading screen
#> 3 Because I don't 3 <NA>
#> 4 Dont use this app battery consumers 4 dont use app battery consumers
Created on 2020-03-20 by the reprex package (v0.3.0)
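As a minimal sketch of the same idea with the join key written out explicitly: the example below uses the stop_words data set from the question instead of get_stopwords(), and, as an optional extra step not in the answer above, turns the resulting NA into an empty string with tidyr::replace_na(). The two-row review_df is a shortened stand-in for the real data.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Shortened stand-in for the real tibble; row 2 contains only stop words
review_df <- tibble(
  review_text = c("Because it is a nice game", "Because I don't")
) %>%
  mutate(line = row_number())

# Remove stop words; rows made up entirely of stop words disappear here
no_stpwrd <- review_df %>%
  unnest_tokens(word, review_text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(line) %>%
  summarise(title = paste(word, collapse = " "), .groups = "drop")

# left_join() keeps every original row; replace_na() is an optional extra
result <- review_df %>%
  left_join(no_stpwrd, by = "line") %>%
  mutate(title = replace_na(title, ""))

result
```

Passing by = "line" also silences the "Joining, by" message. Whether an empty string or NA is the better marker for an all-stopword review depends on what happens to the column afterwards.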

Related

counting word frequency in a string across columns in R

I am trying to count how many times each word appears in total across every row of a column in my data set. The data can be found here: https://www.kaggle.com/tovarischsukhov/southparklines
My code is as follows:
SP = read.csv("All-seasons.csv")
SP$Season = as.numeric(SP$Season)
SP$Episode = as.numeric(SP$Episode)
Cartman = SP %>% group_by(Character) %>%
arrange(Season, Episode) %>%
filter(Character =="Cartman")
Cartman_text_tbl <- as_tibble(data.frame(uniqueID = 1:length(Cartman$Season),Cartman[1:length(Cartman$Season),]))
Cartman_text_tbl_words <- Cartman_text_tbl %>% select(uniqueID,Cartman$Line) %>%
unnest_tokens(word, Cartman$Line) %>% filter(str_detect(word,"^[a-z']+$")) %>%
group_by(uniqueID) %>% count(word)
When I run the last line of code I get this error:
Error in `select()`:
! Can't subset columns that don't exist.
x Columns `Yeah, go home you little dildo.\n`, `I know what it means!\n`, `I'm not telling you.\n`, `He-yeah, that's what Kyle's little brother is all right! Ow! \n`, `That's 'cause I was having these... bogus nightmares.\n`, etc. don't exist.
I did a project for a class a couple of years ago where the professor provided some similar code, and I am trying to model this code on it. If there is a better way to get the counts, that would be great to know; otherwise, a way to fix the error would help. Additionally, each line ends with "\n". Is it possible to remove those? Thanks!

If I understand you correctly, this should help. The output gives the count of each word said by Cartman for each season and episode. For other characters, use the same code and change the filter and the name of the object the result is assigned to. To remove stop words, add anti_join(stop_words, by = "word") %>% after the unnest_tokens() call. Because count() is called with sort = TRUE, the rows are sorted in descending order of frequency; change this if you need a different order.
Code:
library(tidyverse)
library(tidytext)
df <- read_csv("All-seasons.csv")
cartman <- df %>%
filter(Character == "Cartman") %>%
group_by(Season, Episode) %>%
unnest_tokens(output = word, input = Line) %>%
count(word, sort = TRUE)
Output Example:
> head(cartman)
# A tibble: 6 x 4
# Groups: Season, Episode [6]
Season Episode word n
<dbl> <dbl> <chr> <int>
1 7 11 you 73
2 11 8 i 73
3 5 4 you 66
4 16 7 you 63
5 14 8 i 61
6 11 2 i 60
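The question also asks about the trailing "\n" on each line. A small sketch of one way to handle it, assuming stringr::str_trim() is acceptable: the three-row df below is a made-up stand-in for All-seasons.csv with the same column names. The tokenizer behind unnest_tokens() should discard whitespace anyway, so the trim mainly matters if the Line column is used for something else.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Made-up stand-in for All-seasons.csv (same column names, hypothetical rows)
df <- tibble(
  Season    = c(1, 1, 1),
  Episode   = c(1, 1, 2),
  Character = c("Cartman", "Kyle", "Cartman"),
  Line      = c("I know what it means!\n", "Dude!\n", "I know, I know.\n")
)

cartman_counts <- df %>%
  mutate(Line = str_trim(Line)) %>%          # strip the trailing "\n"
  filter(Character == "Cartman") %>%
  unnest_tokens(word, Line) %>%              # one row per word
  count(Season, Episode, word, sort = TRUE)  # counts per episode

cartman_counts
```

Naming the grouping columns inside count() gives the same per-episode counts as group_by() followed by count(word), but returns an ungrouped tibble.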

Having difficulty using rle command within a mutate step in r to count the max number of consecutive characters in a word

I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.

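The expression rle(unlist(strsplit(word, split = "")))$lengths is not vectorized over words. Inside mutate(), word is the whole column, so strsplit() splits every word, unlist() flattens all the characters into a single vector, and max() returns one value for the entire column. That single value is then recycled to every row.
You will need some kind of loop (e.g. for, sapply, purrr::map) to apply it to each word separately.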
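To make the difference concrete, here is a small base R illustration on two words taken from the example text. The variable names are only for illustration.

```r
words <- c("the", "pressing")

# What the mutate() step effectively computes: all characters of all words
# are flattened into one vector, so there is a single answer for the column
flattened <- max(rle(unlist(strsplit(words, split = "")))$lengths)
flattened
#> [1] 2

# What was intended: one answer per word
per_word <- vapply(
  strsplit(words, split = ""),
  function(chars) max(rle(chars)$lengths),
  integer(1)
)
per_word
#> [1] 1 2
```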
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output<- text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
# num_vowels = get_count(word),
)
output$num_consec_char<- sapply(output$word, function(word){
max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1
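If you would rather keep everything inside the mutate() step, one option is to wrap the per-word logic in a small helper and apply it with vapply(). This is a sketch; max_consec is a name made up here, and the text is shortened to the first six words of the example.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: longest run of identical characters in ONE word
max_consec <- function(w) max(rle(strsplit(w, split = "")[[1]])$lengths)

text_df <- tibble(line = 1L, text = "The most pressing of those issues")

out <- text_df %>%
  unnest_tokens(word, text) %>%
  mutate(
    num_letters     = nchar(word),
    # vapply() calls the helper once per word and checks the result type
    num_consec_char = vapply(word, max_consec, integer(1), USE.NAMES = FALSE)
  )

out$num_consec_char
#> [1] 1 1 2 1 1 2
```

purrr::map_int(word, max_consec) would do the same job. Note that the helper assumes non-empty words, which holds for tokens produced by unnest_tokens().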

Function for writing an automated report in R

So I am trying to write an automated report in R with functions. One of the questions I am trying to answer is this: "During the first week of the month, what were the 10 most viewed products? Show the results in a table with the product's identifier, category, and count of the number of views." To do this I wrote the following function:
most_viewed_products_per_week <- function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
}
print(most_viewed_products_per_week)
However the output I get is this:
function (month,first_seven_days, views){
month <- views....February.2020.2
first_seven_days <- function( month, date_1, date_2){
date_1 <-2020-02-01
date_2 <- 2020-02-07
return (first_seven_days)}
views <-function(views, desc){
return (views.head(10))}
How do I fix that?
This report has more questions like this, so I am trying to get my function writing as correct as possible from the start.
Thanks in advance,
Edo

It is a good practice to code in functions. Still I recommend you get your code doing what you want and then think about what parts you want to wrap in a function (for future re-use). This is to get you going.
In general: to support your analysis, make sure that your data is in the right class. I.e. dates are formatted as dates, numbers as double or integers, etc. This will give you access to many helper functions and packages.
For the case at hand, read up on {tidyverse}, in particular {dplyr} which can help you with coding pipes.
simulate data
You will find many friends on Stack Overflow if you provide a reproducible example.
Your question suggests your data look a bit like the following simulated data.
Adapt as appropriate (or provide an example).
library(tibble) # tibble are modern data frames
library(dplyr) # for crunching tibbles/data frames
library(lubridate) # tidyverse package for date (and time) handling
df <- tribble( # create row-tibble
~date, ~identifier, ~category, ~views
,"2020-02-01", 1, "TV", 27
,"2020-02-02", 2, "PC", 40
,"2020-02-03", 1, "TV", 12
,"2020-02-03", 2, "PC", 2
,"2020-02-08", 3, "UV", 200
) %>%
mutate(date = ymd(date)) # date is read in as character - lubridate::ymd() converts it to a date
This yields
> df
# A tibble: 5 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
5 2020-02-08 3 UV 200
Notice: date-column is in date-format.
work your algorithm
From your attempt it follows you want to extract the first 7 days.
Since we have a "date"-column, we can use a date-function to help us here.
{lubridate}'s day() extracts the "day-number".
> df %>% filter(day(date) <= 7)
# A tibble: 4 x 4
date identifier category views
<date> <dbl> <chr> <dbl>
1 2020-02-01 1 TV 27
2 2020-02-02 2 PC 40
3 2020-02-03 1 TV 12
4 2020-02-03 2 PC 2
Anything outside the first 7 days is gone.
Next you want to summarise to get your product views total.
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
## ---------- summarise in bins that you need := groups -------
group_by(identifier, category) %>%
summarise(total_views = sum(views)
, .groups = "drop" ) # if grouping is not needed "drop" it
This gives you:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 1 TV 39
2 2 PC 42
Now pick the top-10 and sort the order:
df %>%
## ---------- c.f. above ------------
filter(day(date) <= 7) %>%
group_by(identifier, category) %>%
summarise(total_views = sum(views), .groups = "drop" ) %>%
## ---------- make use of another helper function of dplyr
top_n(n = 10, total_views) %>% # top-10 is not meaningful with only 2 products; try top_n(1, total_views)
arrange(desc(total_views)) # arrange in descending order on total_views
This yields:
# A tibble: 2 x 3
identifier category total_views
<dbl> <chr> <dbl>
1 2 PC 42
2 1 TV 39
wrap in function
Now that the workflow is in place, think about breaking your code into the blocks you think are useful.
I leave this to you. You can assign interim results to new data frames and wrap the preparation of the data into a function and then the top_n() %>% arrange() in another function, ...
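As a starting point for the wrapping step, here is one possible sketch. The function name top_viewed_first_week and its arguments are made up for illustration, and slice_max() (dplyr 1.0.0 or later) is swapped in for top_n(); the data are the simulated rows from above.

```r
library(dplyr)
library(lubridate)

# Hypothetical wrapper around the pipeline developed above
top_viewed_first_week <- function(data, n = 10) {
  data %>%
    filter(day(date) <= 7) %>%                   # first 7 days of the month
    group_by(identifier, category) %>%
    summarise(total_views = sum(views), .groups = "drop") %>%
    slice_max(total_views, n = n, with_ties = FALSE) %>%
    arrange(desc(total_views))                   # explicit descending order
}

# Same simulated data as above
df <- tibble(
  date       = ymd(c("2020-02-01", "2020-02-02", "2020-02-03",
                     "2020-02-03", "2020-02-08")),
  identifier = c(1, 2, 1, 2, 3),
  category   = c("TV", "PC", "TV", "PC", "UV"),
  views      = c(27, 40, 12, 2, 200)
)

result <- top_viewed_first_week(df)
result
```

Calling the function by name with an argument, as in top_viewed_first_week(df), runs it; print(most_viewed_products_per_week) in the question only prints the function's source. If the data cover more than one month, day(date) <= 7 keeps the first week of every month, so an extra filter on the month would be needed.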

R - Finding identical rows or rows that only differ by x columns

I'm trying to use R on a large CSV file that for this example can be said to represent a list of people and forms of transportation. If a person owns that mode of transportation, this is represented by a X in the corresponding cell. Example data of this is as per below:
Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,
The below image makes it easier to see what it represents:
What I'm after is to learn which persons have identical modes of transportation, or, ideally, where the modes of transportation differs by no more than one.
The format is a bit weird but, assuming the csv file is named example.csv, I can read it into a data frame and transpose it as per below (it should be fairly obvious that I'm a complete R noob)
ex <- read.csv('example.csv')
ext <- as.data.frame(t(ex))
This post explained how to find duplicates and it seems to work
duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1]
which(duplicated(ext) | duplicated(ext[nrow(ext):1, ])[nrow(ext):1])
This returns the following indexes:
1 2 4 5 6 7
That does indeed correspond with what I consider to be duplicate rows. That is, Peter has the same modes of transportation as Mary and Stan (indexes 2, 4 and 6); Don and Mike likewise share the same modes of transportation, indexes 5 and 7.
Again, that seems to work ok but if the modes of transportation and number of people are significant, it becomes really difficult finding/knowing not just which rows are duplicates, but which indexes actually matched. In this case that indexes 2, 4 and 6 are identical and that 5 and 7 are identical.
Is there an easy way of getting that information so that one doesn't have to try and find the matches manually?
Also, given all of the above, is it possible to alter the code in any way so that it would consider rows to match if there was only a difference in X positions (for example a difference of one is acceptable so as long as the persons in the above example have no more than one mode of transportation that is different, it's still considered a match)?
Happy to elaborate further and very grateful for any and all help.

library(dplyr)
library(tidyr)
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,")
ext <- tidyr::pivot_longer(ex, -Type, names_to = "person")
# head(ext)
ext <- ext %>%
group_by(person) %>%
filter(value == "X") %>%
summarise(Modalities = n(), Which = paste(Type, collapse=", ")) %>%
arrange(desc(Modalities), Which) %>%
mutate(IdenticalGrp = rle(Which)$lengths %>% {rep(seq(length(.)), .)})
ext
#> # A tibble: 6 x 4
#> person Modalities Which IdenticalGrp
#> <chr> <int> <chr> <int>
#> 1 Paul 3 Scooter, Skateboard, Boat 1
#> 2 Don 2 Car, Skateboard 2
#> 3 Mike 2 Car, Skateboard 2
#> 4 Mary 2 Scooter, Skateboard 3
#> 5 Peter 2 Scooter, Skateboard 3
#> 6 Stan 2 Scooter, Skateboard 3
To get the membership list of any particular IdenticalGrp, you can pull it like this.
ext %>% filter(IdenticalGrp == 3) %>% pull(person)
#> [1] "Mary" "Peter" "Stan"
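The question also asks about people whose modes of transportation differ by no more than one, which the grouping above does not cover. One possible sketch in base R, under the assumption that "differ by one" means one mode owned by one person and not the other: code ownership as 0/1 and compute pairwise Manhattan distances with dist(). The names m, d and pairs are made up here.

```r
ex <- read.csv(text = "Type,Peter,Paul,Mary,Don,Stan,Mike
Scooter,X,X,X,,X,
Car,,,,X,,X
Bike,,,,,,
Skateboard,X,X,X,X,X,X
Boat,,X,,,,", colClasses = "character")

# Persons as rows, modes as columns, 1 = owns it, 0 = does not
m <- t((ex[, -1] == "X") * 1)
colnames(m) <- ex$Type

# Manhattan distance on 0/1 data = number of modes where two people differ
d <- as.matrix(dist(m, method = "manhattan"))

# Keep each pair once (upper triangle) when they differ in at most one mode
idx <- which(d <= 1 & upper.tri(d), arr.ind = TRUE)
pairs <- data.frame(
  a    = rownames(d)[idx[, "row"]],
  b    = colnames(d)[idx[, "col"]],
  diff = d[idx]
)
pairs
```

Pairs with diff equal to 0 are the identical ones; raising the threshold in d <= 1 loosens the match. For a very large number of people the full distance matrix may become expensive.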

Using the Nested List Column Approach and Purrr Together with Tidytext::Unnest_Tokens

I have a dataframe that contains survey responses with each row representing a different person. One column - "Text" - is an open-ended text question. I would like to use Tidytext::unnest_tokens so that I do text analysis by each row, including sentiment scores, word counts, etc.
Here is the simple dataframe for this example:
Satisfaction<-c ("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)
I then turned the Text column into character...
df$Text<-as.character(df$Text)
Next I grouped by the id column and nested the dataframe.
df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)
Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?
Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?

I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.
You can set up your dataframe like this:
library(dplyr)
library(tidytext)
Satisfaction <- c("Satisfied",
"Satisfied",
"Dissatisfied",
"Satisfied",
"Dissatisfied")
Text <- c("I'm very satisfied with the services",
"Your service providers are always late which causes me a lot of frustration",
"You should improve your staff training, service providers have bad customer service",
"Everything is great!",
"Service is bad")
Gender <- c("M","M","F","M","F")
df <- tibble(Satisfaction, Text, Gender)
tidy_df <- df %>%
mutate(id = row_number()) %>%
unnest_tokens(word, Text)
Then to find, for example, the number of words per line, you can use group_by and mutate.
tidy_df %>%
group_by(id) %>%
mutate(num_words = n()) %>%
ungroup()
#> # A tibble: 37 × 5
#> Satisfaction Gender id word num_words
#> <chr> <chr> <int> <chr> <int>
#> 1 Satisfied M 1 i'm 6
#> 2 Satisfied M 1 very 6
#> 3 Satisfied M 1 satisfied 6
#> 4 Satisfied M 1 with 6
#> 5 Satisfied M 1 the 6
#> 6 Satisfied M 1 services 6
#> 7 Satisfied M 2 your 13
#> 8 Satisfied M 2 service 13
#> 9 Satisfied M 2 providers 13
#> 10 Satisfied M 2 are 13
#> # ... with 27 more rows
You can do sentiment analysis by implementing an inner join; check out some examples here.
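If a nested list column is still wanted, as the question asks, here is a minimal sketch assuming tidyr 1.0.0 or later, where nest() takes new_col = columns. The two-row df is a shortened stand-in for the survey data.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Shortened stand-in for the survey data
df <- tibble(
  id   = 1:2,
  Text = c("Everything is great!", "Service is bad")
)

# Tokenize, then nest only the word column: one row per respondent
nested <- df %>%
  unnest_tokens(word, Text) %>%
  nest(words = word)

# map_int() runs nrow() on each nested tibble to count the words
result <- nested %>%
  mutate(num_words = map_int(words, nrow))

result$num_words
#> [1] 3 3
```

Any other per-respondent summary can be computed the same way by replacing nrow with a function that takes the nested tibble of words.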
