I have created a word cloud with the following frequency of terms:
interesting interesting 21
economics economics 12
learning learning 9
learn learn 6
taxes taxes 6
debating debating 6
everything everything 6
know know 6
tax tax 3
meaning meaning 3
I want to add the 6 counts for "learn" into the overall count for "learning" so that the frequency becomes 15, and I only have "learning" in my word cloud. I also want to do the same for "taxes" and "tax".
This is the code I used to generate the wordcloud.
dataset <- read.csv("~/filepath.csv")
corpus <- Corpus(VectorSource(dataset$comment))
clean_corpus <- tm_map(corpus, removeWords, stopwords('english'))
wordcloud(clean_corpus, scale=c(5,0.5), max.words=100, random.order = FALSE, rot.per=0.35, colors=my_palette)
I have tried using the SnowballC package, but this was the outcome:
> library(SnowballC)
> clean_set <- tm_map(clean_corpus, stemDocument)
> dtm <- TermDocumentMatrix(clean_set)
> m <- as.matrix(dtm)
> v <- sort(rowSums(m), decreasing = TRUE)
> d <- data.frame(word = names(v), freq=v)
> head(d, 10)
This gives me the output below (economics has become econom, debating has become debat, everything has become everyth), which is obviously not ideal. I only have an issue with learn/learning and tax/taxes, so would it be possible to manually merge just those two sets of words?
interest interest 21
learn learn 18
econom econom 12
tax tax 9
debat debat 6
everyth everyth 6
know know 6
mean mean 3
understand understand 3
group group 3
I have also tried clean_corpus_2 <- tm_map(clean_corpus, content_transformer(gsub), pattern = "taxes", replacement = "tax", fixed = TRUE) which changed nothing in the output.
I'm using the tidyverse packages, particularly dplyr, as that's what I'm comfortable with, but I'm sure this is doable with base R or any number of other approaches.
library(tidyverse)
First I mock up some data as I don't have yours to test on:
testdata <- tribble(
~ID, ~comment,
1, "learn",
2, "learning",
3, "learned",
4, "tax",
5, "taxes",
6, "panoply"
)
Next is the approach of explicitly listing the options:
testdata1 <- testdata %>% mutate(
newcol = case_when(
comment %in% c("learn", "learning", "learned") ~ "learn",
comment %in% c("tax", "taxes") ~ "tax",
TRUE ~ as.character(comment)
)
)
In this code, %>% is a pipe, and mutate() adds a new column based on what follows. newcol is the name of the new column, and its contents are decided by the case_when() construct, which tests each option in turn until it finds one returning TRUE - that's why the last option (the default "don't change" case) is listed as TRUE ~ as.character(comment).
After that, the pattern-matching (grepl) approach. Note that grepl() matches substrings anywhere in the word, so the "tax" pattern would also catch unrelated words such as "taxi"; anchor the pattern (e.g. "^tax") if that's a risk in your data:
testdata2 <- testdata %>% mutate(
newcol = case_when(
grepl(comment, pattern = "learn") ~ "learn",
grepl(comment, pattern = "tax") ~ "tax",
TRUE ~ as.character(comment)
)
)
Yielding:
> testdata1
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
> testdata2
# A tibble: 6 × 3
ID comment newcol
<dbl> <chr> <chr>
1 1 learn learn
2 2 learning learn
3 3 learned learn
4 4 tax tax
5 5 taxes tax
6 6 panoply panoply
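To feed this back into the original tm workflow, the same merging can be done on the corpus itself before the wordcloud is drawn. A minimal sketch (untested on your data, and assuming you want "learning" and "taxes" as the surviving forms - swap the gsub arguments to keep the shorter ones instead):
library(tm)
# \\b marks a word boundary, so "learn" will not also rewrite "learning"
merge_terms <- content_transformer(function(x) {
  x <- gsub("\\blearn\\b", "learning", x)
  x <- gsub("\\btax\\b", "taxes", x)
  x
})
clean_corpus_2 <- tm_map(clean_corpus, merge_terms)
wordcloud(clean_corpus_2, scale = c(5, 0.5), max.words = 100,
          random.order = FALSE, rot.per = 0.35, colors = my_palette)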
I've been working on something for a while now and still haven't figured out how to get it to work in my preferred way. Hoping someone can help me:
I have a dataframe containing lots of data (5000+ obs) about city budgets, so one of the variable names is, obviously, 'city'. I have a separate list of 40 cities that I want to check against this dataframe: for each city name in the df, test whether it's also on the separate list (and if so, code it 1, else 0). I made an example below with a smaller dataset:
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7,1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
Note: you may ask why I do the second part like that. My dataset is much larger and I wanted to make the process more automatic, so that, e.g., I wouldn't have to manually delete all the commas and separate the city names from one another. Now that I have df and seperate_list, I want to combine them in df by adding a third column that specifies whether (1) or not (0) each city is in the separate list. I've tried using a for loop and also lapply, but with no luck, since I'm not very skilled with either of those yet.
I would appreciate a hint, so I can sort the rest out myself!
library(tidyverse)
city <- c(rep("city_a", 8), rep("city_b", 5), rep("city_c", 4), rep("city_d", 7),
rep("city_e", 3), rep("city_f", 9), rep("city_g", 4))
school <- c(1:8, 1:5, 1:4, 1:7,1:3, 1:9, 1:4)
df <- data.frame(city, school)
seperate_list <- tolower("City_A, City_B, City_E, City_G")
seperate_list <- gsub('[,]', '', seperate_list)
seperate_list <- strsplit(seperate_list, " ")[[1]]
df %>%
mutate(
in_list = city %in% seperate_list
) %>%
as_tibble()
#> # A tibble: 40 x 3
#> city school in_list
#> <chr> <int> <lgl>
#> 1 city_a 1 TRUE
#> 2 city_a 2 TRUE
#> 3 city_a 3 TRUE
#> 4 city_a 4 TRUE
#> 5 city_a 5 TRUE
#> 6 city_a 6 TRUE
#> 7 city_a 7 TRUE
#> 8 city_a 8 TRUE
#> 9 city_b 1 TRUE
#> 10 city_b 2 TRUE
#> # … with 30 more rows
Created on 2021-09-09 by the reprex package (v2.0.1)
Note that %in% returns TRUE/FALSE rather than 1/0; wrap it in as.integer() if you need the numeric coding. You might also look into joining tables, making the list of interest a column of another table. This is what databases and relational algebra are made for.
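A sketch of that join idea, reusing df and seperate_list from above (the helper table and its column names are just for illustration): put the list in its own table with the flag already set, then left-join and fill the gaps with 0.
library(dplyr)
list_tbl <- tibble(city = seperate_list, in_list = 1L)
df %>%
  left_join(list_tbl, by = "city") %>%
  mutate(in_list = coalesce(in_list, 0L))   # cities absent from the list get 0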
I have a corpus of news articles with date and time of publication as 'docvars'.
readtext object consisting of 6 documents and 8 docvars.
# Description: df[,10] [6 × 10]
doc_id text year month day hour minute second title source
* <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2014_01_01_10_51_00… "\"新华网伦敦1… 2014 1 1 10 51 0 docid报告称若不减… RMWenv
2 2014_01_01_11_06_00… "\"新华网北京1… 2014 1 1 11 6 0 docid盘点2013… RMWenv
3 2014_01_02_08_08_00… "\"原标题:报告… 2014 1 2 8 8 0 docid报告称若不减… RMWenv
4 2014_01_03_08_42_00… "\"地球可能毁灭… 2014 1 3 8 42 0 docid地球可能毁灭… RMWenv
5 2014_01_03_08_44_00… "\"北美鼠兔看起… 2014 1 3 8 44 0 docid北美鼠兔为应… RMWenv
6 2014_01_06_10_30_00… "\"欣克力C点核… 2014 1 6 10 30 0 docid英国欲建50… RMWenv
I would like to measure the changing relative frequency that a particular term - e.g 'development' - occurs in these articles (either as a proportion of the total terms in the article / or as a proportion of the total terms in all the articles published in a particular day / month). I know that I can count the number of times the term occurs in all the articles in a month, using:
dfm(corp, select = "term", groups = "month")
and that I can get the relative frequency of the word to the total words in the document using:
dfm_weight(dfm, scheme = "prop")
But how do I combine these together to get the frequency of a specific term relative to the total number of words on a particular day or in a particular month?
What I would like to be able to do is measure the change in the amount of times a term is used over time, but accounting for the fact that the total number of words used is also changing. Thanks for any help!
@DaveArmstrong gives a good answer here and I upvoted it, but I can add a bit of efficiency using some of the newest quanteda syntax, which is a bit simpler.
The key here is preserving the date format created by zoo::as.yearmon(), since the dfm grouping coerces that to a character. So we pack it into a docvar, which is preserved by the grouping, and then retrieve it in the ggplot() call.
load(file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1"))
library("quanteda")
## Package version: 2.1.1
## create corpus and dfm
corp <- corpus(m, text_field = "body_text")
corp$date <- m$first_publication_date %>%
zoo::as.yearmon()
D <- dfm(corp, remove = stopwords("english")) %>%
dfm_group(groups = "date") %>%
dfm_weight(scheme = "prop")
library("ggplot2")
convert(D[, "wonderfully"], to = "data.frame") %>%
ggplot(aes(x = D$date, y = wonderfully, group = 1)) +
geom_line() +
labs(x = "Date", y = "Wonderfully/Total # Words")
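If you are on quanteda >= 3.0 (newer than the 2.1.1 used above - an assumption about your setup), the same pipeline is spelled with explicit tokens steps and an unquoted grouping variable:
D <- tokens(corp, remove_punct = TRUE) %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_group(groups = date) %>%
  dfm_weight(scheme = "prop")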
I suspect someone will come up with a better solution within quanteda, but in the event they don't, you could always extract the word from the dfm and put it in a dataset along with the date and then make the graph. In the code below, I'm using some music reviews I scraped from the Guardian's website. I've commented out the functions that read in the data from an .rda file on Dropbox. You're welcome to use it if you like - it's clean, but I don't want to inadvertently have someone download a file from the web they're not aware of.
# f <- file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1")
# load(f)
## create corpus and dfm
corp <- corpus(as.character(m$body_text))
docvars(corp, "date") <- m$first_publication_date
D <- dfm(corp, remove=stopwords("english"))
## take the word frequencies for "wonderfully" from the dfm
## along with the date
library("dplyr")  # needed below for tibble(), %>%, group_by(), summarise()
tmp <- tibble(
  word = as.matrix(D)[, "wonderfully"],
  date = docvars(corp)$date,
  ## calculate the total number of words in each document
  total = rowSums(D)
)
tmp <- tmp %>%
## turn date into year-month
mutate(yearmon = zoo::as.yearmon(date)) %>%
## group by year-month
group_by(yearmon) %>%
## calculate the sum of the instances of "wonderfully"
## divided by the sum of the total words across all
## documents in the month
summarise(prop = sum(word)/sum(total))
## make a plot.
ggplot(tmp, aes(x=yearmon, y=prop)) +
geom_line() +
labs(x= "Date", y="Wonderfully/Total # Words")
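The question also asked about a daily series; the same aggregation works if you group by the calendar date instead of the year-month (a sketch, assuming first_publication_date is a Date or date-time):
tmp_daily <- tibble(
  word  = as.matrix(D)[, "wonderfully"],
  date  = as.Date(docvars(corp)$date),
  total = rowSums(D)
) %>%
  group_by(date) %>%
  ## per day: instances of "wonderfully" over total words that day
  summarise(prop = sum(word) / sum(total))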
I have data giving me the percentage of people in some groups who have various levels of educational attainment:
df <- data_frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10))
df
# A tibble: 2 x 5
group no.highschool high.school college graduate
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 20. 70. 10. 0.
2 B 10. 40. 40. 10.
E.g., in group A 70% of people have a high school education.
I want to generate 4 variables that give me the proportion of people in each group with less than each of the 4 levels of education (e.g., lessthan_no.highschool, lessthan_high.school, etc.).
desired df would be:
desired.df <- data.frame(group = c("A", "B"),
no.highschool = c(20, 10),
high.school = c(70,40),
college = c(10, 40),
graduate = c(0,10),
lessthan_no.highschool = c(0,0),
lessthan_high.school = c(20, 10),
lessthan_college = c(90, 50),
lessthan_graduate = c(100, 90))
In my actual data I have many groups and a lot more levels of education. Of course I could do this one variable at a time, but how could I do this programmatically (and elegantly) using tidyverse tools?
I would start by doing something like a mutate_at() inside of a map(), but where I get tripped up is that the list of variables being summed is different for each of the new variables. You could pass in the list of new variables and their corresponding variables to be summed as two lists to a pmap(), but it's not obvious how to generate that second list concisely. Wondering if there's some kind of nesting solution...
Here is a base R solution. Though the question asks for a tidyverse one, considering the dialog in the comments to the question I have decided to post it.
It uses apply and cumsum to do the hard work. Then there are some cosmetic concerns before cbinding into the final result.
tmp <- apply(df[-1], 1, function(x){
s <- cumsum(x)
100*c(0, s[-length(s)])/sum(x)
})
rownames(tmp) <- paste("lessthan", names(df)[-1], sep = "_")
desired.df <- cbind(df, t(tmp))
desired.df
# group no.highschool high.school college graduate lessthan_no.highschool
#1 A 20 70 10 0 0
#2 B 10 40 40 10 0
# lessthan_high.school lessthan_college lessthan_graduate
#1 20 90 100
#2 10 50 90
how could I do this programmatically (and elegantly) using tidyverse tools?
Definitely the first step is to tidy your data. Encoding information (like edu level) in column names is not tidy. When you convert education to a factor, make sure the levels are in the correct order - I used the order in which they appeared in the original data column names.
library(tidyr)
tidy_result = df %>% gather(key = "education", value = "n", -group) %>%
mutate(education = factor(education, levels = names(df)[-1])) %>%
group_by(group) %>%
mutate(lessthan_x = lag(cumsum(n), default = 0) / sum(n) * 100) %>%
arrange(group, education)
tidy_result
# # A tibble: 8 x 4
# # Groups: group [2]
# group education n lessthan_x
# <chr> <fct> <dbl> <dbl>
# 1 A no.highschool 20 0
# 2 A high.school 70 20
# 3 A college 10 90
# 4 A graduate 0 100
# 5 B no.highschool 10 0
# 6 B high.school 40 10
# 7 B college 40 50
# 8 B graduate 10 90
This gives us a nice, tidy result. If you want to spread/cast this data into your un-tidy desired.df format, I would recommend using data.table::dcast, as (to my knowledge) the tidyverse does not offer a nice way to spread multiple columns. See Spreading multiple columns with tidyr or How can I spread repeated measures of multiple variables into wide format? for the data.table solution or an inelegant tidyr/dplyr version. Before spreading, you could create a key less_than_x_key = paste("lessthan", education, sep = "_").
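If your tidyr is 1.0.0 or later (newer than this answer - worth checking your version), pivot_wider() can now spread several value columns at once, which removes the need for data.table::dcast here. A sketch using tidy_result from above:
library(tidyr)
tidy_result %>%
  pivot_wider(names_from = education,
              values_from = c(n, lessthan_x))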
I'm still learning R and have been given the task of grouping a long list of students into groups of four based on another variable. I have loaded the data into R as a data frame. How do I sample entire rows without replacement, one from each of 4 levels of a variable and have R output the data into a spreadsheet?
So far I have been tinkering with a for loop and the sample function, but I'm quickly getting in over my head. Any suggestions? Here is a sample of what I'm attempting to do. Given:
Last.Name <- c("Picard","Troi","Riker","La Forge", "Yar", "Crusher", "Crusher", "Data")
First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data")
Email <- c("a#a.com","b#b.com", "c#c.com", "d#d.com", "e#e.com", "f#f.com", "g#g.com", "h#h.com")
Section <- c(1,1,2,2,3,3,4,4)
df <- data.frame(Last.Name,First.Name,Email,Section)
I want to randomly select a Star Trek character from each section and end up with 2 groups of 4. I would want the entire row's worth of information to make it over to a new data frame containing all groups with their corresponding group number.
I'd use the wonderful package 'dplyr'
require(dplyr)
random_4 <- df %>% group_by(Section) %>% slice(sample(c(1,2),1))
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Troi Deanna b#b.com 1
2 La Forge Geordi d#d.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
random_4
Source: local data frame [4 x 4]
Groups: Section
Last.Name First.Name Email Section
1 Picard Jean-Luc a#a.com 1
2 Riker William c#c.com 2
3 Crusher Beverly f#f.com 3
4 Data Data h#h.com 4
%>% means 'and then'
The code is read as:
Take DF AND THEN for all 'Section', select by position (slice) 1 or 2. Voila.
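Note that slice(sample(c(1,2),1)) assumes exactly two rows per section. If your real sections vary in size, and you have dplyr >= 1.0.0 (an assumption), slice_sample() generalises this to one random row per group of any size:
random_4 <- df %>% group_by(Section) %>% slice_sample(n = 1)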
I suppose you have 8 students: First.Name <- c("Jean-Luc", "Deanna", "William", "Geordi", "Tasha", "Beverly", "Wesley", "Data").
If you wish to randomly assign a section number to the 8 students, and assuming you would like each section to have 2 students, then you can either permute Section <- c(1, 1, 2, 2, 3, 3, 4, 4) or permute the list of the students.
First approach, permute the sections:
> assigned_section <- print(sample(Section))
[1] 1 4 3 2 2 3 4 1
Then the following data frame gives the assignments:
assigned_students <- data.frame(First.Name, assigned_section)
Second approach, permute the students:
> assigned_students <- print(sample(First.Name))
[1] "Data" "Geordi" "Tasha" "William" "Deanna" "Beverly" "Jean-Luc" "Wesley"
Then, the following data frame gives the assignments:
assigned_students <- data.frame(assigned_students, Section)
Alex, Thank You. Your answer wasn't exactly what I was looking for, but it inspired the correct one for me. I had been thinking about the process from a far too complicated point of view. Instead of having R select rows and put them into a new data frame, I decided to have R assign a random number to each of the students and then sort the data frame by the number:
First, I broke up the data frame into sections:
df1<- subset(df, Section ==1)
df2<- subset(df, Section ==2)
df3<- subset(df, Section ==3)
df4<- subset(df, Section ==4)
Then I randomly generated a group number 1 through 4.
Groupnumber <-sample(1:4,4, replace=F)
Next, I told R to bind the columns:
Assigned1 <- cbind(df1,Groupnumber)
(I ran the group number generator and cbind in alternating order until I got through the whole set, to make sure the order of the numbers was unique within each section.)
Finally row binding the data set back together:
Final_List<-rbind(Assigned1,Assigned2,Assigned3,Assigned4)
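The same split/number/bind steps can be automated in one pass (a sketch, assuming as above that group numbers 1..n are dealt out in random order within each section):
Final_List <- do.call(rbind, lapply(split(df, df$Section), function(d) {
  d$Groupnumber <- sample(seq_len(nrow(d)))  # random permutation of 1..n
  d
}))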
Thank you everyone who looked this over. I am new to data science, R, and stackoverflow, but as I learn more I hope to return the favor.
I'd suggest the randomizr package to "block assign" according to section. The block_ra function lets you do this in an easy-to-read one-liner.
install.packages("randomizr")
library(randomizr)
df$group <- block_ra(block_var = df$Section,
condition_names = c("group_1", "group_2"))
You can inspect the resulting sets in a variety of ways. Here it is with base R subsetting:
df[df$group == "group_1",]
Last.Name First.Name Email Section group
2 Troi Deanna b#b.com 1 group_1
3 Riker William c#c.com 2 group_1
6 Crusher Beverly f#f.com 3 group_1
7 Crusher Wesley g#g.com 4 group_1
df[df$group == "group_2",]
Last.Name First.Name Email Section group
1 Picard Jean-Luc a#a.com 1 group_2
4 La Forge Geordi d#d.com 2 group_2
5 Yar Tasha e#e.com 3 group_2
8 Data Data h#h.com 4 group_2
If you want to roll your own:
set <- tapply(1:nrow(df), df$Section, FUN = sample, size = 1)
df[set,] # show the sampled set
df[-set,] # show the complementary set
Let's say I have:
Person  Movie    Rating
Sally   Titanic  4
Bill    Titanic  4
Rob     Titanic  4
Sue     Cars     8
Alex    Cars     9   <- data entry error
Bob     Cars     8
As you can see, there is a contradiction for Alex. All the same movies should have the same ranking, but there was a data error entry for Alex. How can I use R to solve this? I've been thinking about it for a while, but I can't figure it out. Do I have to just do it manually in excel or something? Is there a command on R that will return all the cases where there are data contradictions between two columns?
Perhaps I could have R do a boolean check if all the Movie cases match the first rating of its first iteration? For all that returns "no," I can go look at it manually? How would I write this function?
Thanks
Here's a data.table solution
Define the function
Myfunc <- function(x) {
  temp <- table(x)                # tabulate the ratings
  names(temp)[which.max(temp)]    # most frequent rating (returned as character)
}
library(data.table)
Create a column with the correct rating (by reference)
setDT(df)[, CorrectRating := Myfunc(Rating), Movie][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Alex Cars 9 8
# 6: Bob Cars 8 8
Or, if you want to remove the "bad" ratings:
df[Rating == CorrectRating][]
# Person Movie Rating CorrectRating
# 1: Sally Titanic 4 4
# 2: Bill Titanic 4 4
# 3: Rob Titanic 4 4
# 4: Sue Cars 8 8
# 5: Bob Cars 8 8
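Or, if you'd rather repair the bad entries in place (Myfunc returns a character, hence the as.numeric()):
df[, Rating := as.numeric(CorrectRating)][, CorrectRating := NULL][]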
It looks like, within each group defined by "Movie", you're looking for any instances of Rating that are not the same as the most common value.
You can solve this using dplyr (which is good at "group by one column, then perform an operation within each group"), along with the "Mode" function defined in this answer that finds the most common item in a vector:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
library(dplyr)
dat %>% group_by(Movie) %>% filter(Rating != Mode(Rating))
This finds all the cases where a row does not agree with the rest of the group. If you instead want to remove them, you can do:
newdat <- dat %>% group_by(Movie) %>% filter(Rating == Mode(Rating))
If you want to fix them, do
newdat <- dat %>% group_by(Movie) %>% mutate(Rating = Mode(Rating))
You can test the above with a reproducible version of your data:
dat <- data.frame(Person = c("Sally", "Bill", "Rob", "Sue", "Alex", "Bob"),
Movie = rep(c("Titanic", "Cars"), each = 3),
Rating = c(4, 4, 4, 8, 9, 8))
If the goal is to see whether all the values within a group are the same (or whether there are differences), this can be a simple application of tapply (or aggregate, etc.) with a function like var (or computing the range). If all the values are the same, the variance and range will be 0; any other value (outside of rounding error) means some value differs. The which function can help identify the group/individual.
tapply(dat$Rating, dat$Movie, FUN = var)
which(.Last.value > 0.00001)    # .Last.value only works immediately, at the console
tapply(dat$Rating, dat$Movie, FUN = function(x) diff(range(x)))
which(.Last.value != 0)
which(abs(dat$Rating - ave(dat$Rating, dat$Movie)) > 0)   # rows differing from their group mean
which.max(abs(dat$Rating - ave(dat$Rating, dat$Movie)))   # the single worst offender
dat[.Last.value, ]
I would add a variable for the mode so I can see if there is anything weird going on with the data - missing data, text, many different answers instead of the rare anomaly, etc. I used x as your dataset.
# one of many functions to find the mode; note it returns a character,
# and more than one value if there is a tie
modefunc <- function(x){
  names(table(x))[table(x) == max(table(x))]
}
# add a variable for the mode, split by Movie
x$mode <- ave(x = x$Rating, x$Movie, FUN = modefunc)
# do whatever you want with the records that are different
x[x$Rating != x$mode, ]
If this definition of the mode doesn't suit, any other mode function will slot into ave() the same way.