I have a data frame in R with a single column of food ratings from Amazon.
head(food_ratings)
product.productId..B001E4KFG0
1 review/userId: A3SGXH7AUHU8GW
2 review/profileName: delmartian
3 review/helpfulness: 1/1
4 review/score: 5.0
5 review/time: 1303862400
6 review/summary: Good Quality Dog Food
The rows repeat in blocks of six, so rows 7 through 12 hold the same fields for another user (starting at row 7). This pattern is repeated many times.
Therefore, I need to have every group of 6 rows distributed in one row with 6 columns, so that later I can subset, for instance, the review/summary according to their review/score.
I'm using RStudio 1.0.143
EDIT: I was asked to show the output of dput(head(food_ratings, 24)) but it was too big regardless of the number used.
Thanks a lot
I have taken your data and added 2 more fake users to it. Using tidyr and dplyr you can create new columns and collapse the data into a nice data.frame. You can use select from dplyr to drop the id column if you don't need it or to rearrange the order of the columns.
library(tidyr)
library(dplyr)
df %>%
separate(product.productId..B001E4KFG0, into = c("details", "data"), sep = ": ") %>%
mutate(details = sub("review/", "", details)) %>%
group_by(details) %>%
mutate(id = row_number()) %>%
spread(details, data)
# A tibble: 3 x 7
id helpfulness profileName score summary time userId
<int> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1/1 delmartian 5.0 Good Quality Dog Food 1303862400 A3SGXH7AUHU8GW
2 2 1/1 martian2 1.0 Good Quality Snake Food 1303862400 123456
3 3 2/5 martian3 5.0 Good Quality Cat Food 1303862400 123654
data:
df <- structure(list(product.productId..B001E4KFG0 = c("review/userId: A3SGXH7AUHU8GW",
"review/profileName: delmartian", "review/helpfulness: 1/1",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Dog Food",
"review/userId: 123456", "review/profileName: martian2", "review/helpfulness: 1/1",
"review/score: 1.0", "review/time: 1303862400", "review/summary: Good Quality Snake Food",
"review/userId: 123654", "review/profileName: martian3", "review/helpfulness: 2/5",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Cat Food"
)), class = "data.frame", row.names = c(NA, -18L))
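The same reshaping can also be sketched with pivot_wider(), which tidyr now recommends over spread(). This is only a sketch under two assumptions that are not in the original answer: every record starts with a review/userId line (used to number the records), and the column is called raw here for brevity. extra = "merge" keeps any further ": " inside a summary intact.

```r
library(dplyr)
library(tidyr)

# Small stand-in data set; the column name `raw` is hypothetical
reviews <- data.frame(
  raw = c("review/userId: A1", "review/score: 5.0", "review/summary: Good: really",
          "review/userId: B2", "review/score: 1.0", "review/summary: Bad"),
  stringsAsFactors = FALSE
)

wide <- reviews %>%
  # split only at the first ": " so text after a later ": " is kept
  separate(raw, into = c("details", "data"), sep = ": ", extra = "merge") %>%
  mutate(details = sub("review/", "", details, fixed = TRUE),
         # assumption: each record begins with a userId line
         id = cumsum(details == "userId")) %>%
  pivot_wider(names_from = details, values_from = data)

# Subset summaries by score once the data is wide
good <- wide %>%
  mutate(score = as.numeric(score)) %>%
  filter(score >= 4) %>%
  pull(summary)
```

Numbering records with cumsum() rather than row_number() within each field means a record with a missing field does not shift the ones after it.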
My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
"We're wasting time",
"Let's make a record!",
"Let's work it out first",
"Why?",
"It goes like this",
"Hold on",
"Have to tighten my snare",
"Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
dyad speaker text
1 1 John Let's play. We're wasting time. Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on. Have to tighten my snare
6 2 George Ready?
I've tried grouping by sender and pasting/collapsing from dplyr but the concatenation combines all of a sender's text without preserving speaking turn order. For example, John's last statement ("Why") winds up with his other text in the output rather than coming after Paul's comment. I also tried to check if the next speaker (using lead(sender)) is the same as current and then combining, but it only does adjacent rows, in which case it misses John's third comment in the example. Seems it should be simple but I can't make it happen. And it should be flexible to combine any series of continuous rows by a given speaker.
Thanks in advance
Create another group with rleid (from data.table) and paste the rows in summarise
library(dplyr)
library(data.table)
library(stringr)
dat %>%
group_by(dyad, grp = rleid(speaker), speaker) %>%
summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
select(-grp)
-output
# A tibble: 6 × 3
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
Not as elegant as akrun's solution, but here helper does the same job as the rleid function without needing an additional package:
library(dplyr)
dat %>%
mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
helper = cumsum(helper)) %>%
# group by helper before speaker so the rows stay in speaking-turn order
group_by(dyad, helper, speaker) %>%
summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
select(-helper)
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
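If dplyr 1.1.0 or later is available (an assumption about your setup), its consecutive_id() helper plays the same role as rleid() or the hand-built helper column. A minimal sketch on the question's data, where turn is a name chosen here for illustration:

```r
library(dplyr)

dat <- data.frame(
  dyad = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  speaker = c("John", "John", "John", "Paul", "John",
              "George", "Ringo", "Ringo", "George"),
  text = c("Let's play", "We're wasting time", "Let's make a record!",
           "Let's work it out first", "Why?", "It goes like this",
           "Hold on", "Have to tighten my snare", "Ready?"),
  stringsAsFactors = FALSE
)

turns <- dat %>%
  # turn increases whenever dyad or speaker changes from the previous row
  group_by(dyad, turn = consecutive_id(dyad, speaker), speaker) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop") %>%
  select(-turn)
```

Putting turn ahead of speaker in group_by() is what keeps the summarised rows in speaking order.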
I would like to merge rows in my data frame by unique email address without losing any data. To do this, the function should combine rows that share the same email address. If the rows being combined hold conflicting data for that email address, I want the value from the row with fewer cells filled in to be added in a new column. Please ask questions, because I know that I am not explaining this very clearly.
Below is an example of what I am looking for the function to do (data made up).
First Name  Last Name  Email               Phone         Address    Shoe Size
John        Schmitt    jschmitt#gmail.com  914-392-1840  address 1  4
Paul        Johnson    pjohnson#gmail.com  274-184-3653  address 2  2
Brad        Arnold     barnold#gmail.com   157-135-3175  address 3  5
John        Schmitt    jschmitt#gmail.com  914-392-1840             6
This sheet should become:
First Name  Last Name  Email               Phone         Address    Shoe Size  Shoe Size 2
John        Schmitt    jschmitt#gmail.com  914-392-1840  address 1  4          6
Paul        Johnson    pjohnson#gmail.com  274-184-3653  address 2  2
Brad        Arnold     barnold#gmail.com   157-135-3175  address 3  5
Basically, the phone number connected to jschmitt#gmail.com stays in the "Phone" column because it is the same for both rows. The address also stays the same: the two rows differ, but only because the bottom row's address is blank. Finally, a new column is created for Shoe Size, because the rows being merged have two different values. The function should decide which shoe size goes in Shoe Size 2 by counting the filled cells in each row. The shoe size from the row with more cells filled goes in the original Shoe Size column. The shoe size from the row with fewer cells filled goes in the new Shoe Size 2 column.
Feel free to ask any questions or make any suggestions about how I could do something of this nature in an easier way. I also haven't figured out what to do if the two rows with conflicting data have the same number of cells filled...
Update: a tidyverse-only solution using chop, following Martin Gal's suggestion
library(dplyr)
library(tidyr)
library(stringr) # for str_remove()
df %>%
select(-Address) %>%
chop(`Shoe Size`) %>%
unnest_wider(`Shoe Size`) %>%
rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>%
left_join(df, by= "Shoe Size") %>%
select(-contains(".y")) %>%
rename_with(~str_remove(., '\\.x$')) %>%
relocate(Address, .after = Phone) %>%
arrange(Address)
First answer:
Here is one way to achieve the result. The logic:
1. Remove Address and assign the result to a new data frame, df1
2. Use aggregate to combine the duplicated parts of the rows and collect the non-duplicated part (here: Shoe Size)
3. Use unnest_wider to unnest the list column
4. rename
5. left_join with df and clean up with select and rename_with
6. relocate and arrange
library(dplyr)
library(tidyr)
library(stringr) # for str_remove()
# base R: remove the Address column and assign the result to df1
df1 <- df[,-5]
# aggregate Shoe Size (I don't know how to do this in dplyr, therefore base R)
df1 <- aggregate(df1[5], df1[-5], unique)
# now with the tidyverse (dplyr, tidyr)
df1 %>%
unnest_wider(`Shoe Size`) %>%
rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>%
left_join(df, by= "Shoe Size") %>%
select(-contains(".y")) %>%
rename_with(~str_remove(., '\\.x$')) %>%
relocate(Address, .after = Phone) %>%
arrange(Address)
# A tibble: 3 x 7
`First Name` `Last Name` Email Phone Address `Shoe Size` `Shoe Size 2`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 John Schmitt jschmitt#gmail.com 914-392-1840 address 1 4 6
2 Paul Johnson pjohnson#gmail.com 274-184-3653 address 2 2 NA
3 Brad Arnold barnold#gmail.com 157-135-3175 address 3 5 NA
data:
df <- structure(list(`First Name` = c("John", "Paul", "Brad", "John"
), `Last Name` = c("Schmitt", "Johnson", "Arnold", "Schmitt"),
Email = c("jschmitt#gmail.com", "pjohnson#gmail.com", "barnold#gmail.com",
"jschmitt#gmail.com"), Phone = c("914-392-1840", "274-184-3653",
"157-135-3175", "914-392-1840"), Address = c("address 1",
"address 2", "address 3", NA), `Shoe Size` = c(4, 2, 5, 6
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
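Recent tidyr releases may refuse to unnest unnamed vectors with unnest_wider() unless names_sep is supplied, so the ...1 / ...2 column names above might not appear on your installation. The following is a hedged alternative sketch, not the answer's method: it orders rows by how many cells are filled (the rule described in the question), takes the first non-missing value per email for the shared columns, and spreads the distinct shoe sizes with names_sep. The column name filled is made up for this example.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  `First Name` = c("John", "Paul", "Brad", "John"),
  `Last Name`  = c("Schmitt", "Johnson", "Arnold", "Schmitt"),
  Email        = c("jschmitt#gmail.com", "pjohnson#gmail.com",
                   "barnold#gmail.com", "jschmitt#gmail.com"),
  Phone        = c("914-392-1840", "274-184-3653", "157-135-3175", "914-392-1840"),
  Address      = c("address 1", "address 2", "address 3", NA),
  `Shoe Size`  = c(4, 2, 5, 6)
)

merged <- df %>%
  # rows with more filled cells come first, so their values win
  mutate(filled = rowSums(!is.na(across(everything())))) %>%
  arrange(desc(filled)) %>%
  group_by(Email) %>%
  summarise(
    # first non-missing value per column within each email
    across(c(`First Name`, `Last Name`, Phone, Address), ~ .x[!is.na(.x)][1]),
    # keep every distinct shoe size as a list element
    `Shoe Size` = list(unique(`Shoe Size`))
  ) %>%
  # yields `Shoe Size 1`, `Shoe Size 2`, ...
  unnest_wider(`Shoe Size`, names_sep = " ") %>%
  rename(`Shoe Size` = `Shoe Size 1`)
```

Rows with the same number of filled cells keep their original order here, which is one possible answer to the tie question raised above.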
I have a data frame that contains the variables 'AgentID', 'Type', 'Date', and 'Text'. A subset is as follows:
df <- structure(list(AgentID = c("AA0101", "AA0101", "AA0101", "AA0101",
"AA0101"), Type = c("PS", "PS", "PS", "PS", "PS"), Date = c("4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019", "4/1/2019"), Text = c("I am on social security XXXX and I understand it can not be garnished by Paypal credit because it's federally protected.I owe paypal {$3600.00} I would like them to cancel this please.",
"My XXXX account is being reported late 6 times for XXXX per each loan I was under the impression that I was paying one loan but it's split into three so one payment = 3 or one missed payment would be three missed on my credit,. \n\nMy account is being reported wrong by all credit bureaus because I was in forbearance at the time that these late payments have been reported Section 623 ( a ) ( 2 ) States : If at any time a person who regularly and in the ordinary course of business furnishes information to one or more CRAs determines that the information provided is not complete or accurate, the furnisher must promptly provide complete and accurate information to the CRA. In addition, the furnisher must notify all CRAs that received the information of any corrections, and must thereafter report only the complete and accurate information. \n\nIn this case, I was in forbearance during that tie and document attached proves this. By law, credit need to be reported as of this time with all information and documentation",
"A few weeks ago I started to care for my credit and trying to build it up since I have never used my credit in the past, while checking my I discover some derogatory remarks in my XXXX credit report stating the amount owed of {$1900.00} to XXXX from XX/XX/2015 and another one owed to XXXX for {$1700.00} I would like to address this immediately and either pay off this debt or get this negative remark remove from my report.",
"I disputed this XXXX account with all three credit bureaus, the reported that it was closed in XXXX, now its reflecting closed XXXX once I paid the {$120.00} which I dont believe I owed this amount since it was an fee for a company trying to take money out of my account without my permission, I was charged the fee and my account was closed. I have notified all 3 bureaus to have this removed but they keep saying its correct. One bureau is showing XXXX closed and the other on shows XXXX according to XXXX XXXX, XXXX shows a XXXX, this account has been on my report for seven years",
"On XX/XX/XXXX I went on XXXX XXXX and noticed my score had gone down, went to check out why and seen something from XXXX XXXX and enhanced recovery company ... I also seen that it had come from XXXX and XXXX dated XX/XX/XXXX, XX/XX/XXXX, and XX/XX/XXXX ... I didnt have neither one before, I called and it the rep said it had come from an address Im XXXX XXXX, Florida I have never lived in Florida ever ... .I have also never had XXXX XXXX nor XXXX XXXX ... I need this taken off because it if affecting my credit score ... This is obviously identify theft and fraud..I have never received bills from here which proves that is was not done by me, I havent received any notifications ... if it was not for me checking my score I wouldnt have known nothing of this" )), row.names = c(NA, 5L), class = "data.frame")
First, I found out the top 10 anger words using the following:
library(tm)
library(tidytext)
library(tidyverse)
library(sentimentr)
library(wordcloud)
library(ggplot2)
CS <- function(txt){
MC <- Corpus(VectorSource(txt))
SW <- stopwords('english')
MC <- tm_map(MC, tolower)
MC<- tm_map(MC,removePunctuation)
MC <- tm_map(MC, removeNumbers)
MC <- tm_map(MC, removeWords, SW)
MC <- tm_map(MC, stripWhitespace)
myTDM <- as.matrix(TermDocumentMatrix(MC))
v <- sort(rowSums(myTDM), decreasing=TRUE)
FM <- data.frame(word = names(v), freq=v)
row.names(FM) <- NULL
FM <- FM %>%
mutate(word = tolower(word)) %>%
filter(str_count(word, "x") <= 1)
return(FM)
}
DF <- CS(df$Text)
# using nrc
nrc <- get_sentiments("nrc")
# create final dataset
DF_nrc = DF %>% inner_join(nrc)
Then I created a vector of the top 10 anger words as follows:
TAW <- DF_nrc %>%
filter(sentiment=="anger") %>%
group_by(word) %>%
summarize(freq = mean(freq)) %>%
arrange(desc(freq)) %>%
top_n(10) %>%
select(word)
Next, I want to find which agents used these words most frequently and rank them, but I am not sure how to do that. Should I search for the words one by one and group the results by agent, or is there a better way? The result I am looking for is something like the following:
AgentID Words_Spoken Rank
A0001 theft, dispute, money 1
A0001 theft, fraud, 2
.......
If you are more of a dplyr/tidyverse person, you can take an approach using some dplyr verbs, after converting your text data to a tidy format.
First, let's set up some example data with several speakers, one of whom speaks no anger words. You can use unnest_tokens() to take care of most of your text cleaning steps with its defaults, such as splitting tokens, removing punctuation, etc. Then remove stopwords using anti_join(). I show using inner_join() to find the anger words as a separate step, but you could join these up into one big pipe if you like.
library(tidyverse)
library(tidytext)
my_df <- tibble(AgentID = c("AA0101", "AA0101", "AA0102", "AA0103"),
Text = c("I want to report a theft and there has been fraud.",
"I have taken great offense when there was theft and also poison. It is distressing.",
"I only experience soft, fluffy, happy feelings.",
"I have a dispute with the hateful scorpion, and also, I would like to report a fraud."))
my_df
#> # A tibble: 4 x 2
#> AgentID Text
#> <chr> <chr>
#> 1 AA0101 I want to report a theft and there has been fraud.
#> 2 AA0101 I have taken great offense when there was theft and also poison.…
#> 3 AA0102 I only experience soft, fluffy, happy feelings.
#> 4 AA0103 I have a dispute with the hateful scorpion, and also, I would li…
tidy_words <- my_df %>%
unnest_tokens(word, Text) %>%
anti_join(get_stopwords())
#> Joining, by = "word"
anger_words <- tidy_words %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment == "anger"))
#> Joining, by = "word"
anger_words
#> # A tibble: 10 x 3
#> AgentID word sentiment
#> <chr> <chr> <chr>
#> 1 AA0101 theft anger
#> 2 AA0101 fraud anger
#> 3 AA0101 offense anger
#> 4 AA0101 theft anger
#> 5 AA0101 poison anger
#> 6 AA0101 distressing anger
#> 7 AA0103 dispute anger
#> 8 AA0103 hateful anger
#> 9 AA0103 scorpion anger
#> 10 AA0103 fraud anger
Now you know which anger words each person used, and the next step is to count them up and rank people. The dplyr package has fantastic support for exactly this kind of work. First you want to group_by() the person identifier, then calculate a couple of summarized quantities:
the total number of words (so you can arrange by this)
a pasted-together string of the words used
Afterwards, arrange by the number of words and make a new column that gives you the rank.
anger_words %>%
group_by(AgentID) %>%
summarise(TotalWords = n(),
WordsSpoken = paste0(word, collapse = ", ")) %>%
arrange(-TotalWords) %>%
mutate(Rank = row_number())
#> # A tibble: 2 x 4
#> AgentID TotalWords WordsSpoken Rank
#> <chr> <int> <chr> <int>
#> 1 AA0101 6 theft, fraud, offense, theft, poison, distressi… 1
#> 2 AA0103 4 dispute, hateful, scorpion, fraud 2
Do notice that with this approach, you don't have a zero entry for the person who spoke no anger words; they get dropped at the inner_join(). If you want them in the final data set, you will likely need to join back up with an earlier dataset and use replace_na().
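One way to keep the zero-count agents, sketched on small stand-in tables so it runs without downloading the NRC lexicon. The agents table and the shortened anger_words table are hypothetical substitutes for the objects built above:

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the anger_words table created earlier
anger_words <- tibble(
  AgentID = c("AA0101", "AA0101", "AA0103"),
  word    = c("theft", "fraud", "dispute")
)
# Every agent in the original data, including one with no anger words
agents <- tibble(AgentID = c("AA0101", "AA0102", "AA0103"))

ranked <- anger_words %>%
  group_by(AgentID) %>%
  summarise(TotalWords = n(),
            WordsSpoken = paste0(word, collapse = ", ")) %>%
  # bring back agents that were dropped by the inner join
  right_join(agents, by = "AgentID") %>%
  replace_na(list(TotalWords = 0L, WordsSpoken = "")) %>%
  arrange(desc(TotalWords), AgentID) %>%
  mutate(Rank = row_number())
```

In the real pipeline, agents could come from distinct(my_df, AgentID).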
Created on 2019-09-11 by the reprex package (v0.3.0)
Not the most elegant solution, but here's how you could count the words based on the line number:
library(stringr)
# write a new data.frame retaining the AgentID and Date from the original table
new.data <- data.frame(Agent = df$AgentID, Date = df$Date)
# using a for-loop to go through every row of text in the df provided.
for(i in seq(nrow(new.data))){ # i represent row number of the original df
# build a temporary vector (e101) of the TAW words found in row i:
## str_detect() checks whether df[i, "Text"] contains a given word
## sapply() repeats that check for every word in TAW$word
## TAW$word[...] keeps only the words that were detected
e101 <- TAW$word[sapply(TAW$word, function(x) str_detect(df[i, "Text"], x))]
# write the number of returned words in e101 as a corresponding value in new data.frame
new.data[i, "number_of_TAW"] <- length(e101)
# concatenate the returned words in e101 as a corresponding value in new data.frame
new.data[i, "Words_Spoken"] <- ifelse(length(e101)==0, "", paste(e101, collapse=","))
}
new.data
# Agent Date number_of_TAW Words_Spoken
# 1 AA0101 4/1/2019 0
# 2 AA0101 4/1/2019 0
# 3 AA0101 4/1/2019 2 derogatory,remove
# 4 AA0101 4/1/2019 3 fee,money,remove
# 5 AA0101 4/1/2019 1 theft
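Note that str_detect() with a bare word also matches inside longer words (for example, "fee" is found in "feel"). If whole-word matches are wanted, a sketch along these lines may help; the texts and words vectors are made-up examples, not the data above:

```r
library(stringr)

texts <- c("I feel fine", "The fee and the theft")  # made-up example text
words <- c("fee", "theft")                          # made-up word list

# wrap each word in \b word boundaries so only whole words match
patterns <- regex(paste0("\\b", words, "\\b"), ignore_case = TRUE)

# for each text, keep the words that occur as whole words
hits <- lapply(texts, function(txt) words[str_detect(txt, patterns)])
n_hits <- lengths(hits)
```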
I have a data frame in R with multiple columns with multi-word text responses, that looks something like this:
1a       1b             1c      2a          2b             2c
student  job prospects  money   professors  students       campus
future   career         unsure  my grades   opportunities  university
success  reputation     my job  earnings    courses        unsure
I want to be able to count the frequency of words in columns 1a, 1b, and 1c combined, as well as 2a, 2b, and 2c combined.
Currently, I'm using this code to count word frequency in each column individually.
data.frame(table(unlist(strsplit(tolower(dat$`1a`), " "))))
Ideally, I want to be able to combine the two sets of columns into just two columns and then use this same code to count word frequency, but I'm open to other options.
The combined columns would look something like this:
1              2
student        professors
future         my grades
success        earnings
job prospects  students
career         opportunities
reputation     courses
money          campus
unsure         university
my job         unsure
Here's a way using dplyr and tidyr packages. FYI, one should avoid having column names starting with a number. Naming them a1, a2... would make things easier in the long run.
library(dplyr)
library(tidyr)
df %>%
gather(variable, value) %>%
mutate(variable = substr(variable, 1, 1)) %>%
mutate(id = ave(variable, variable, FUN = seq_along)) %>%
spread(variable, value)
id 1 2
1 1 student professors
2 2 future my grades
3 3 success earnings
4 4 job prospects students
5 5 career opportunities
6 6 reputation courses
7 7 money campus
8 8 unsure university
9 9 my job unsure
Data -
df <- structure(list(`1a` = c("student", "future", "success"), `1b` = c("job prospects",
"career", "reputation"), `1c` = c("money", "unsure", "my job"
), `2a` = c("professors", "my grades", "earnings"), `2b` = c("students",
"opportunities", "courses"), `2c` = c("campus", "university",
"unsure")), .Names = c("1a", "1b", "1c", "2a", "2b", "2c"), class = "data.frame", row.names = c(NA,
-3L))
In general, you should avoid column names that start with numbers. That aside, I created a reproducible example of your problem and provided a solution using dplyr and tidyr. The substr() call inside mutate_at assumes your column names follow the [num][char] pattern in your example.
library(dplyr)
library(tidyr)
data <- tibble::tribble(
~`1a`, ~`1b`, ~`1c`, ~`2a`, ~`2b`, ~`2c`,
'student','job prospects', 'mone', 'professor', 'students', 'campus',
'future', 'career', 'unsure', 'my grades', 'opportunities', 'university',
'success', 'reputation', 'my job', 'earnings', 'courses', 'unsure'
)
data %>%
gather(key, value) %>%
mutate_at('key', substr, 0, 1) %>%
group_by(key) %>%
mutate(id = row_number()) %>%
spread(key, value) %>%
select(-id)
# A tibble: 9 x 2
`1` `2`
<chr> <chr>
1 student professor
2 future my grades
3 success earnings
4 job prospects students
5 career opportunities
6 reputation courses
7 mone campus
8 unsure university
9 my job unsure
If your end purpose is to count frequency (as opposed to switching from wide to long format), you could do
ave(unlist(df[,paste0("a",1:3)]), unlist(df[,paste0("a",1:3)]), FUN = length)
which will count the frequency of the elements of columns a1,a2,a3, where df denotes the data frame (and the columns are labeled a1,a2,a3,b1,b2,b3).
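If the goal is a word-frequency table per column group, keeping the question's original column names, a base R sketch could look like the following. count_words is a hypothetical helper written for this example; it reuses the strsplit/table approach from the question.

```r
df <- data.frame(
  `1a` = c("student", "future", "success"),
  `1b` = c("job prospects", "career", "reputation"),
  `1c` = c("money", "unsure", "my job"),
  `2a` = c("professors", "my grades", "earnings"),
  `2b` = c("students", "opportunities", "courses"),
  `2c` = c("campus", "university", "unsure"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Hypothetical helper: word counts over all columns whose name starts with `prefix`
count_words <- function(data, prefix) {
  cols  <- startsWith(names(data), prefix)
  words <- unlist(strsplit(tolower(unlist(data[cols], use.names = FALSE)), " "))
  as.data.frame(table(word = words), responseName = "freq",
                stringsAsFactors = FALSE)
}

counts1 <- count_words(df, "1")  # columns 1a, 1b, 1c
counts2 <- count_words(df, "2")  # columns 2a, 2b, 2c
```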
I have a tibble.
library(tidyverse)
df <- tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df
I want to separate the genres by "|" and empty columns filled with NA.
This is what I did:
df %>%
separate(genres, into = c("genre1", "genre2", "genre3", "genre4", "genre5"), sep = "|")
However, it's being separated after each letter.
I think you haven't included into:
df <- tibble::tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df %>% tidyr::separate(genres, into = c("genre1", "genre2", "genre3",
"genre4", "genre5"))
Result:
# A tibble: 4 x 6
id genre1 genre2 genre3 genre4 genre5
* <int> <chr> <chr> <chr> <chr> <chr>
1 1 Action Adventure Science Fiction Thriller
2 2 Adventure Science Fiction Thriller <NA>
3 3 Action Crime Thriller <NA> <NA>
4 4 Family Animation Adventure Comedy Action
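Edit: Or, as RichScriven wrote in the comments, df %>% tidyr::separate(genres, into = paste0("genre", 1:5)). Note that without sep, separate() splits on any run of non-alphanumeric characters, so the space in "Science Fiction" is also treated as a separator (which is why rows 1 and 2 above fill one more column than expected). To split on | only, use sep = "\\|".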
Well, this is what helped, writing regex properly.
df %>%
separate(genres, into = paste0("genre", 1:5), sep = "\\|")
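The reason sep = "|" splits after every letter is that sep is treated as a regular expression, where a bare | means "empty string or empty string". If tidyr 1.3.0 or later is installed (an assumption about your setup), separate_wider_delim() takes a literal delimiter instead, so no escaping is needed. A minimal sketch:

```r
library(tidyr)

df <- tibble::tibble(
  id = 1:4,
  genres = c("Action|Adventure|Science Fiction|Thriller",
             "Adventure|Science Fiction|Thriller",
             "Action|Crime|Thriller",
             "Family|Animation|Adventure|Comedy|Action")
)

result <- df %>%
  separate_wider_delim(
    genres,
    delim = "|",                    # literal string, not a regex
    names = paste0("genre", 1:5),
    too_few = "align_start"         # pad short rows with NA on the right
  )
```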