R create_matrix() - "ngramLength" option is not working

I want to create a document-term matrix using the create_matrix() function.
This works so far with my example:
library(RTextTools)
library(tidyverse)
pos_tweets = rbind(
c("I love this car", "positive"),
c("This view is amazing", "positive"),
c("I feel great this morning", "positive"),
c("I am so excited about the concert", "positive"),
c("He is my best friend", "positive"))
neg_tweets = rbind(
c("I do not like this car", "negative"),
c("This view is horrible", "negative"),
c("I feel tired this morning", "negative"),
c("I am not looking forward to the concert", "negative"),
c("He is my enemy", "negative"))
tweets = rbind(pos_tweets, neg_tweets)
matrix = create_matrix(tweets[, 1], language = "english", removeStopwords = FALSE,
removeNumbers = TRUE, stemWords = FALSE, ngramLength = 1)
mat = as.matrix(matrix)
mat[, 1:5]
##Result:
Docs                                    about amazing best car concert
I love this car                             0       0    0   1       0
This view is amazing                        0       1    0   0       0
I feel great this morning                   0       0    0   0       0
I am so excited about the concert           1       0    0   0       1
He is my best friend                        0       0    1   0       0
I do not like this car                      0       0    0   1       0
This view is horrible                       0       0    0   0       0
I feel tired this morning                   0       0    0   0       0
I am not looking forward to the concert     0       0    0   0       1
He is my enemy                              0       0    0   0       0
The function create_matrix has the option ngramLength=, with which one can determine the length of the n-grams. For example, 1 returns unigrams (single words, e.g.: "computer") and 2 returns bigrams (two adjacent words, e.g.: "my computer").
However, this option does not seem to work. No matter what number I enter, the function only gives me unigrams (ngramLength=1).
I would also like to get bigrams (ngramLength=2) as a result.
The result would look like this (heavily shortened):
Docs                       this car  this view  feel great
I love this car                   1          0           0
This view is amazing              0          1           0
I feel great this morning         0          0           1
Can anyone help me solve this problem?
I am also very open to functions from other packages.
Many thanks in advance!
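One possible workaround, sketched in base R rather than with RTextTools: build the bigrams by hand and tabulate them. This assumes that lower-casing and splitting on non-letters is an acceptable tokenizer (it would split "don't" into two pieces), and the helper name bigrams() is made up for this illustration.

```r
# Hypothetical helper: split a lower-cased text into words and
# paste each word together with its right-hand neighbour.
bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

docs <- c("I love this car",
          "This view is amazing",
          "I feel great this morning")

tokens <- lapply(docs, bigrams)
terms  <- sort(unique(unlist(tokens)))

# One row per document, one column per bigram; cells hold counts.
dtm <- t(vapply(tokens,
                function(tk) as.integer(table(factor(tk, levels = terms))),
                integer(length(terms))))
dimnames(dtm) <- list(Docs = docs, Terms = terms)

dtm[, c("this car", "this view", "feel great")]
```

The last line selects the three columns from the shortened example in the question; the full matrix has one column for every bigram that occurs in the documents.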

Related

Recoding multiple race variables into a single race variable

I'm trying to recode multiple race variables into a single race variable. The different variables are RaceVar1: Asian, RaceVar2: AIAN, RaceVar3: Black, RaceVar4: Native Hawaiian, RaceVar5: White. Each variable is 1 if the participant chose that race and 0 if they didn't. I would like to create a new race variable that condenses these into one and also flags anyone who ticked multiple races.
I am able to do this in SAS; however, I need to do it in R and am unsure how to perform the same task. SAS code below:
data want;
set have;
length race $40;
if sum(of r_s_q61___1 - r_s_q61___5) > 1 then race='More than one race';
else if r_s_q61___2 then race='American Indian or Alaska Native';
else if r_s_q61___1 then race='Asian';
else if r_s_q61___5 then race='White';
else if r_s_q61___3 then race='Black or African American';
else if r_s_q61___4 then race='Native Hawaiian or Other Pacific Islander';
else race='Unknown';
run;
I'm not sure where to start, other than that I believe it might involve rowSums() and ifelse() within a mutate() statement.
Yep! mutate and ifelse are your friends here. With dplyr, though, we've also got a neat function called case_when that lets us write a whole chain of conditions in one call instead of nesting a bunch of ifelse statements.
library(dplyr)
data.frame(`RaceVar1`=c(1,0,0,0,0,1),
`RaceVar2`=c(0,1,0,0,0,0),
`RaceVar3`=c(0,0,1,0,0,0),
`RaceVar4`=c(0,0,0,1,0,0),
`RaceVar5`=c(0,0,0,0,1,0),
`RaceVar6`=c(0,0,0,0,0,1)) %>%
mutate(more_than_one=rowSums(.)) %>%
mutate(Race=ifelse(
more_than_one>1,
'More than one race',
case_when(
RaceVar1 == 1 ~ "RaceVar1: Asian",
RaceVar2 == 1 ~ "RaceVar2: AIAN",
RaceVar3 == 1 ~ "RaceVar3: Black",
RaceVar4 == 1 ~ "RaceVar4: Native Hawaiian",
RaceVar5 == 1 ~ "RaceVar5: White"
)
))
  RaceVar1 RaceVar2 RaceVar3 RaceVar4 RaceVar5 RaceVar6 more_than_one                      Race
1        1        0        0        0        0        0             1           RaceVar1: Asian
2        0        1        0        0        0        0             1            RaceVar2: AIAN
3        0        0        1        0        0        0             1           RaceVar3: Black
4        0        0        0        1        0        0             1 RaceVar4: Native Hawaiian
5        0        0        0        0        1        0             1           RaceVar5: White
6        1        0        0        0        0        1             2        More than one race
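For comparison, here is a base R sketch that follows the SAS logic a bit more closely, including the 'Unknown' fallback for rows where nothing is ticked. The data frame have below is invented for illustration, the column names RaceVar1 to RaceVar5 are taken from the question text, and the sketch assumes there are no missing values.

```r
# Made-up example data; one row per participant, 1 = ticked, 0 = not ticked.
have <- data.frame(
  RaceVar1 = c(1, 0, 0, 1, 0),  # Asian
  RaceVar2 = c(0, 1, 0, 0, 0),  # AIAN
  RaceVar3 = c(0, 0, 0, 0, 0),  # Black
  RaceVar4 = c(0, 0, 0, 0, 0),  # Native Hawaiian
  RaceVar5 = c(0, 0, 1, 1, 0)   # White
)

# Labels in the same order as the columns, wording taken from the SAS code.
labels <- c("Asian",
            "American Indian or Alaska Native",
            "Black or African American",
            "Native Hawaiian or Other Pacific Islander",
            "White")

flags    <- as.matrix(have[paste0("RaceVar", 1:5)]) == 1
n_ticked <- rowSums(flags)

# Index of the first ticked column per row (NA when nothing is ticked).
first_ticked <- apply(flags, 1, function(r) which(r)[1])

have$race <- ifelse(n_ticked > 1, "More than one race",
             ifelse(n_ticked == 1, labels[first_ticked], "Unknown"))
have
```

Because a label is only looked up when exactly one box is ticked, the order of the else-if branches in the SAS code does not matter here.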

Learning R: foreach value in col-A, list unique values in Col-B and then associated values in Col C

I'm learning R and have not coded in a long time (and no, this is not a school assignment). My intent is to run NRC_Sentiment on the notes, which so far I can do only for the first instance of Cat. I've read the docs on for loops and foreach loops (which is what I would have used in Perl).
I have created a data frame from a csv file
Cat | Sub-Cat | Notes
Cat1 | Apple | This is a fruit called apple and I love it
Cat1 | Orange | This is a fruit called orange and I don't like it
Cat2 | Tomato | This is a Veg called tomato and I like it
Cat2 | Pepper | This a Veg called pepper and I don't like it
Cat1 | Banana | This a fruit banana and I have no opinion about it
dataIn = read.csv(...)[,1:3] # read columns 1 to 3
df = data.frame(dataIn)
uCat = data.frame(uCatR=c(df$Cat))
uCat = unique(uCat)
I've tried a for loop, but it stops at the first instance of uCat, so any help is appreciated.
In Perl I could create a hash of hashes and loop through each one with foreach.
It's hard to tell exactly what you're trying to achieve, but it seems like you might be trying to run a sentiment analysis on the unique instances of Notes. You can use map from the purrr package (part of tidyverse). map will return a list, map_df will return a dataframe, and there are other map functions to return other object types.
library(tidyverse)
library(syuzhet)
map_df(unique(df$Notes), get_nrc_sentiment)
Which returns a dataframe:
  anger anticipation disgust fear joy sadness surprise trust negative positive
1     0            0       0    0   1       0        0     0        0        1
2     0            0       0    0   0       0        0     0        0        0
3     0            0       0    0   0       0        0     0        0        0
4     0            0       0    0   0       0        0     0        0        0
5     0            0       0    0   0       0        0     0        0        0
dput:
structure(list(Cat = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"
), `Sub-Cat` = c("Apple", "Orange", "Tomato", "Pepper", "Banana"
), Notes = c("This is a fruit called apple and I love it", "This is a fruit called orange and I don't like it",
"This is a Veg called tomato and I like it", "This a Veg called pepper and I don't like it",
"This a fruit banana and I have no opinion about it")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
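If the goal is closer to the Perl hash of hashes mentioned in the question (Cat, then Sub-Cat, then the note), a nested named list is one way to sketch it in base R. In this sketch the Sub-Cat column is renamed SubCat for easier access, and score_note() is a made-up placeholder that only counts words; in a real run it would be replaced by a call such as get_nrc_sentiment(note) from syuzhet.

```r
# Data copied from the dput above, as a plain data frame.
df <- data.frame(
  Cat    = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"),
  SubCat = c("Apple", "Orange", "Tomato", "Pepper", "Banana"),
  Notes  = c("This is a fruit called apple and I love it",
             "This is a fruit called orange and I don't like it",
             "This is a Veg called tomato and I like it",
             "This a Veg called pepper and I don't like it",
             "This a fruit banana and I have no opinion about it"),
  stringsAsFactors = FALSE
)

# Nested named list: by_cat[["Cat1"]][["Apple"]] holds the note
# for that Cat / Sub-Cat pair.
by_cat <- lapply(split(df, df$Cat),
                 function(d) setNames(as.list(d$Notes), d$SubCat))

# Placeholder scorer (hypothetical): number of space-separated words.
score_note <- function(note) length(strsplit(note, " ", fixed = TRUE)[[1]])

# Visit every category, then every sub-category inside it.
for (cat_name in names(by_cat)) {
  for (sub_name in names(by_cat[[cat_name]])) {
    note <- by_cat[[cat_name]][[sub_name]]
    cat(cat_name, sub_name, score_note(note), "\n")
  }
}
```

split() does the grouping that unique() plus a manual loop was trying to achieve, so the outer loop visits every category rather than stopping at the first one.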

Finding a way to tally soccer/football passes in R

I am trying to find a way to take a sequence of passes and show how many times each player passes to another player.
So for example, if the pass sequence was: Jordan to Emma to Molly to Emily bad, that means Jordan's and Emma's passes were successful but Molly's was not.
I have an example of a few lines of data I put in R (in a 2x2 dataset):
Passes
1 jordan to karlie karlie turnover unforced
2 jlin to gray bad
3 alia to kiersten to lilly to kiersten bad
4 mandy to karlie bad
5 kelsey to mccarter to jordan to emma emma fouled
6 mandy to karlie bad
7 mandy to kiersten cleared
I am trying to come up with a way that can convert those lines into a table like this:
Players Mandy-G Jlin-G Gray-G Kiersten-G Kelsey-G Karlie-G Jordan-G Lilly-G Mccarter-G Emma-G Alia-G Mandy-B Jlin-B Gray-B Kiersten-B Kelsey-B Karlie-B Jordan-B Lilly-B Mccarter-B Emma-B Alia-B
Mandy 1 2
Jlin 1
Gray
Kiersten 1
Kelsey 1
Karlie
Jordan 1 1
Lilly 1
McCarter 1
Emma
Alia 1
*I don't know how to insert a screenshot, so the copy and paste messed up the formatting but you can still get the idea of what I want it to look like.
If Mandy passed to Gray and it was good there should be a 1 in the Mandy and Gray-G intersection. If Mandy passed to Gray and it was bad there should be a 1 in the Mandy and Gray-B intersection.
There are only numbers in that table because I did it by hand and it was only for about 10 minutes of a game. Ultimately, doing it for the full 90 minutes and for about 25 games, I'm going to need to create a way to go through the first table and have R sort and add a mark for each successful and unsuccessful pass.
dat2 <- strsplit(dat[,1], " ") # split each sequence into words; the loops below index dat2 and compare words to "to"
numPass <- rep(0, length(dat3))
for (i in 1:length(dat2)) {
temp <- sum(dat2[[i]] == "to")
if ("bad" %in% dat2[[i]]) {
temp <- temp-1
}
numPass[i] <- temp
}
maxPass <- max(numPass)+1
#for (i in 1:length(dat2)){
for (i in 5){
keep<-dat2[[i]]%in%roster[,1]
pls<-dat2[[i]][keep]
#add statements to remove last name if there is a "bad"
for (j in 1:length(pls)) {
cols<-which(substr(names(seqPass),1,nchar(pls[j]))==pls[j])
seqPass[i,cols[j]]<-j
}
}
seqPass[c(1,5),]
I have tried the above code to go through the first five lines and count the number of passes in each sequence. It adds a mark under each player's name if they were involved in the pass, but if the pass was bad they need to be removed, which it does not do.
Is there a way for R to automatically check whether the pass from the first name to the second name in the sequence was good and add a mark in their intersection, and to do the same for a bad pass, which is indicated by the word "bad" following the second name?
Any help would be much appreciated!
Thanks
Sample data
structure(list(VT = c("jordan to karlie karlie turnover unforced",
"jlin to gray bad", "alia to kiersten to lilly to kiersten bad",
"mandy to karlie bad", "kelsey to mccarter to jordan to emma emma fouled",
"mandy to karlie bad", "mandy to kiersten cleared bad")), row.names = c(NA,
7L), class = "data.frame", na.action = structure(8:19, .Names = c("8",
"9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19"
), class = "omit"))
You could use regular expressions. It will also be faster if you only keep the data for players who actually touched the ball. Something like:
pass = sub('_$','_good',sub("(.*\\w+ to (?:\\w+(?=.*(bad))|\\w+)).*",'\\1_\\2',dat$VT,perl = T))
pass1 = gsub('(to(\\s[^_ ]+(?=\\s)))','\\1_good\n\\2',pass,perl=T)
results = xtabs(V3~.,cbind(read.csv(text=gsub('to',',',pass1),h=F,strip.white = T),V3=1))
results
          V2
V1         emma_good gray_bad jordan_good karlie_bad karlie_good kiersten_bad kiersten_good lilly_good mccarter_good
  alia             0        0           0          0           0            0             1          0             0
  jlin             0        1           0          0           0            0             0          0             0
  jordan           1        0           0          0           1            0             0          0             0
  kelsey           0        0           0          0           0            0             0          0             1
  kiersten         0        0           0          0           0            0             0          1             0
  lilly            0        0           0          0           0            1             0          0             0
  mandy            0        0           0          2           0            1             0          0             0
  mccarter         0        0           1          0           0            0             0          0             0
It seems that you have done a lot of the work already, so I will just add in my two cents. It would make your table generally smaller if you didn't separate out good and bad as two tables. You could generally have one table with combinations of players like you have created, but add a column with a 1 or 0 stating if the pass was good or bad, in which case you could just have your code above but with
dat$pass <- as.numeric(grepl(".*(bad)", dat$VT))
This adds a column with 1 if the row has 'bad' in it. Imagine the complexity of a good and bad table over multiple decades and different players!
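As a more step-by-step alternative to the regular expressions, the same tally can be sketched by splitting each sequence on " to ", taking the first word of every piece as the player, and marking only the final pass as bad when the word "bad" appears after the last name. The function name tally_one() is made up for this sketch, and it assumes every player is identified by a single word, as in the sample data.

```r
passes <- c("jordan to karlie karlie turnover unforced",
            "jlin to gray bad",
            "alia to kiersten to lilly to kiersten bad",
            "mandy to karlie bad",
            "kelsey to mccarter to jordan to emma emma fouled",
            "mandy to karlie bad",
            "mandy to kiersten cleared bad")

# Hypothetical helper: turn one sequence into from/to/outcome rows.
tally_one <- function(line) {
  parts   <- strsplit(line, " to ", fixed = TRUE)[[1]]
  players <- sub(" .*", "", parts)       # first word of each piece
  n <- length(players) - 1               # number of passes in the sequence
  if (n < 1) return(NULL)
  outcome <- rep("good", n)
  # Only the last pass can be bad: look for "bad" after the final name.
  if (grepl("\\bbad\\b", parts[length(parts)])) outcome[n] <- "bad"
  data.frame(from = players[-length(players)],
             to = players[-1],
             outcome = outcome,
             stringsAsFactors = FALSE)
}

edges  <- do.call(rbind, lapply(passes, tally_one))
counts <- table(edges$from, paste(edges$to, edges$outcome, sep = "_"))
counts
```

Splitting on " to " with the surrounding spaces means a name that merely contains the letters "to" is left alone. The edges data frame is also the long one-row-per-pass layout suggested above, so it can be kept as is for a full season and only cross-tabulated when a table is needed.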

Using DocumentTermMatrix on a Vector of First and Last Names

I have a column in my data frame (df) as follows:
> people = df$people
> people[1:3]
[1] "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
[2] "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
[3] "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
The column has 4k+ unique first/last/nick names as a list of full names on each row as shown above. I would like to create a DocumentTermMatrix for this column where full name matches are found and only the names that occur the most are used as columns. I have tried the following code:
> people_list = strsplit(people, ", ")
> corp = Corpus(VectorSource(people_list))
> dtm = DocumentTermMatrix(corp, people_dict)
where people_dict is a list of the most commonly occurring people (~150 full names of people) from people_list as follows:
> people_dict[1:3]
[[1]]
[1] "Christian Slater"
[[2]]
[1] "Tara Reid"
[[3]]
[1] "Stephen Dorff"
However, the DocumentTermMatrix function seems to not be using the people_dict at all because I have way more columns than in my people_dict. Also, I think that the DocumentTermMatrix function is splitting each name string into multiple strings. For example, "Danny Devito" becomes a column for "Danny" and "Devito".
> inspect(actors_dtm[1:5,1:10])
<<DocumentTermMatrix (documents: 5, terms: 10)>>
Non-/sparse entries: 0/50
Sparsity : 100%
Maximal term length: 9
Weighting : term frequency (tf)
Terms
Docs 'g. 'jojo' 'ole' 'piolin' 'rampage' 'spank' 'stevvi' a.d. a.j. aaliyah
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
I have read through all the tm documentation that I can find, and I have spent hours searching on Stack Overflow for a solution. Please help!
The default tokenizer splits text into individual words. You need to provide a custom tokenizer function:
commasplit_tokenizer <- function(x)
unlist(strsplit(as.character(x), ", "))
Note that you do not separate the actors before creating the corpus.
people <- character(3)
people[1] <- "Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner"
people[2] <- "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden"
people[3] <- "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer"
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")
The control options didn't work with just Corpus, so I used VCorpus:
corp = VCorpus(VectorSource(people))
dtm = DocumentTermMatrix(corp, control = list(tokenize =
commasplit_tokenizer, dictionary = people_dict, tolower = FALSE))
All of the options are passed within control, including:
tokenize - function
dictionary
tolower = FALSE
Results:
as.matrix(dtm)
    Terms
Docs Nia Long Stephen Dorff Uma Thurman
   1        0             1           0
   2        1             0           0
   3        0             0           1
I hope this helps
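To see what the custom tokenizer contributes, it can be tried on its own, outside of tm. The sketch below is a base R cross-check rather than the tm code path: it applies commasplit_tokenizer to each row and counts only the names in people_dict, which is roughly what the tokenize and dictionary options are meant to achieve together.

```r
commasplit_tokenizer <- function(x)
  unlist(strsplit(as.character(x), ", "))

people <- c("Christian Slater, Tara Reid, Stephen Dorff, Frank C. Turner",
            "Ice Cube, Nia Long, Aleisha Allen, Philip Bolden",
            "John Travolta, Uma Thurman, Vince Vaughn, Cedric the Entertainer")
people_dict <- c("Stephen Dorff", "Nia Long", "Uma Thurman")

# Each row becomes a vector of full names instead of single words.
tokens <- lapply(people, commasplit_tokenizer)
tokens[[2]]

# Count only the dictionary names, one row per document.
counts <- t(vapply(tokens,
                   function(tk) as.integer(table(factor(tk, levels = people_dict))),
                   integer(length(people_dict))))
dimnames(counts) <- list(Docs = as.character(seq_along(people)),
                         Terms = people_dict)
counts
```

Names are matched exactly and case-sensitively here, so a dictionary entry with different capitalisation would get a column of zeros.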

r string parsing challenge

I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is separate out the substrings that start with Department, Division, or Center and run up to the next comma (,). The final output should look like this:
Dept_Mechanical_Eng Dept_Computer_Science Div_Adv_Machining Cntr_Mining_Metallurgy Dept_Aerospace Cntr_Science_Delivery
                  1                     1                 0                      0              0                     0
                  0                     0                 1                      1              0                     0
                  0                     0                 0                      0              1                     1
I have butchered the actual names just for aesthetic purpose in the expected output. Any help on parsing this string is much appreciated.
This is very similar to a question I just answered, tabulating another text example. Are you in the same class as the questioner here? "Count the number of times (frequency) a string occurs"
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text=inp,what="",sep=","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame( setNames( lapply(levels(inp2), function(ll) as.numeric(grepl(ll, inp3) ) ), trimws(levels(inp2) )) )
  Department.of.Aerospace Division.of.Advanced.Machining
1                       0                              0
2                       0                              1
3                       1                              0
  Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery
1                                0                               0
2                                1                               0
3                                0                               1
  Department.of.Computer.Science Department.of.Mechanical.Engineering
1                              1                                    1
2                              0                                    0
3                              0                                    0
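A variation on the same idea, sketched with strsplit() and exact matching instead of scan() and grepl(). It keeps the columns in order of first appearance, which matches the column order in the question's expected output, and it avoids partial matches if one unit name happens to be contained in another. The variable names here are made up for the sketch.

```r
col1 <- c("Department of Mechanical Engineering, Department of Computer Science",
          "Division of Advanced Machining, Center for Mining and Metallurgy",
          "Department of Aerospace, Center for Science and Delivery")

# Split every row on commas and trim the surrounding whitespace.
parts <- lapply(strsplit(col1, ",", fixed = TRUE), trimws)

# All distinct units, in order of first appearance.
units <- unique(unlist(parts))

# 1 if the unit is listed in that row, 0 otherwise.
ind <- t(vapply(parts,
                function(p) as.integer(units %in% p),
                integer(length(units))))
colnames(ind) <- units
ind
```

Wrapping the result in as.data.frame() gives the same kind of data frame as above, with the spaces in the column names replaced by dots.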

Resources