I want to create a document-term matrix by using the create_matrix-function.
This works so far with my example:
library(RTextTools)
library(tidyverse)
pos_tweets = rbind(
c("I love this car", "positive"),
c("This view is amazing", "positive"),
c("I feel great this morning", "positive"),
c("I am so excited about the concert", "positive"),
c("He is my best friend", "positive"))
neg_tweets = rbind(
c("I do not like this car", "negative"),
c("This view is horrible", "negative"),
c("I feel tired this morning", "negative"),
c("I am not looking forward to the concert", "negative"),
c("He is my enemy", "negative"))
tweets = rbind(pos_tweets, neg_tweets)
matrix = create_matrix(tweets[, 1], language = "english", removeStopwords = FALSE,
removeNumbers = TRUE, stemWords = FALSE, ngramLength = 1)
mat = as.matrix(matrix)
mat[, 1:5]
##Result:
Docs about amazing best car concert
I love this car 0 0 0 1 0
This view is amazing 0 1 0 0 0
I feel great this morning 0 0 0 0 0
I am so excited about the concert 1 0 0 0 1
He is my best friend 0 0 1 0 0
I do not like this car 0 0 0 1 0
This view is horrible 0 0 0 0 0
I feel tired this morning 0 0 0 0 0
I am not looking forward to the concert 0 0 0 0 1
He is my enemy 0 0 0 0 0
The function create_matrix has the option ngramLength=, with which one can determine the length of the n-grams. For example, 1 returns unigrams (single words, e.g.: "computer") and 2 returns bigrams (two adjacent words, e.g.: "my computer").
However, this option does not seem to work. No matter what number I enter, the function only gives me unigrams (ngramLength=1).
I would also like to have bigrams (ngramLength=2) as a result.
The result would look like this (heavily shortened):
Docs this car this view feel great
I love this car 1 0 0
This view is amazing 0 1 0
I feel great this morning 0 0 1
Can anyone help me and solve my problem?
I am also very open to other functions from other packages.
Many thanks in advance!
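For comparison, here is a minimal base-R sketch of the bigram matrix described above. It assumes a simple lowercase, letters-only tokenization, which is not necessarily what RTextTools does internally. The helper make_bigrams is a hypothetical name, not part of any package, and only the three tweets from the shortened table are used.

```r
# Hypothetical helper: split one sentence into lowercase word bigrams
make_bigrams <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]            # drop empty tokens
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))  # pair each word with its successor
}

docs <- c("I love this car", "This view is amazing", "I feel great this morning")
bigram_list <- lapply(docs, make_bigrams)

# Shared vocabulary of all bigrams, then one row of counts per document
terms <- sort(unique(unlist(bigram_list)))
dtm <- t(sapply(bigram_list, function(b) table(factor(b, levels = terms))))
rownames(dtm) <- docs

dtm[, c("this car", "this view", "feel great")]
```

Each row counts the bigrams of one tweet against the shared vocabulary, so the three selected columns follow the same pattern as the shortened table in the question.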
I'm learning R and have not coded in a long time (and no, this is not a school assignment). My intent is to run NRC_Sentiment on notes, which I can currently do only for the first instance of Cat. I've read the documentation on for loops and on foreach loops (which I would have used in Perl).
I have created a data frame from a csv file
Cat | Sub-Cat | Notes
Cat1 | Apple | This is a fruit called apple and I love it
Cat1 | Orange | This is a fruit called orange and I don't like it
Cat2 | Tomato | This is a Veg called tomato and I like it
Cat2 | Pepper | This a Veg called pepper and I don't like it
Cat1 | Banana | This a fruit banana and I have no opinion about it
dataIn = read.csv(...)[, 1:3]  # read columns 1 to 3
df = data.frame(dataIn)
uCat = data.frame(uCatR = c(df$Cat))
uCat = unique(uCat)  # one row per distinct category
I've tried a for loop, but it stops at the first instance of uCat, so any help is appreciated.
In Perl I could create a hash of hashes and loop through each one with foreach.
It's hard to tell exactly what you're trying to achieve, but it seems like you might be trying to run a sentiment analysis on the unique instances of Notes. You can use map from the purrr package (part of tidyverse). map will return a list, map_df will return a dataframe, and there are other map functions to return other object types.
library(tidyverse)
library(syuzhet)
map_df(unique(df$Notes), get_nrc_sentiment)
Which returns a dataframe:
anger anticipation disgust fear joy sadness surprise trust negative positive
1 0 0 0 0 1 0 0 0 0 1
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
dput:
structure(list(Cat = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"
), `Sub-Cat` = c("Apple", "Orange", "Tomato", "Pepper", "Banana"
), Notes = c("This is a fruit called apple and I love it", "This is a fruit called orange and I don't like it",
"This is a Veg called tomato and I like it", "This a Veg called pepper and I don't like it",
"This a fruit banana and I have no opinion about it")), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
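If the goal is instead to process the notes one category at a time, a rough base-R sketch could look like the following. It is an assumption about what the question is after, not a confirmed reading. To keep it runnable without extra packages, score_notes is a hypothetical stand-in that only counts words; in practice get_nrc_sentiment would take its place.

```r
df <- data.frame(
  Cat = c("Cat1", "Cat1", "Cat2", "Cat2", "Cat1"),
  Notes = c("This is a fruit called apple and I love it",
            "This is a fruit called orange and I don't like it",
            "This is a Veg called tomato and I like it",
            "This a Veg called pepper and I don't like it",
            "This a fruit banana and I have no opinion about it"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for get_nrc_sentiment(): counts words per note
score_notes <- function(notes) lengths(strsplit(notes, " +"))

# split() groups the notes by category, the closest analogue to a Perl hash
by_cat <- lapply(split(df$Notes, df$Cat), score_notes)

# Loop over every category, not just the first one
for (cat_name in names(by_cat)) {
  cat(cat_name, ":", by_cat[[cat_name]], "\n")
}
```

split() returns a named list keyed by category, so the loop visits each category exactly once.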
I have a problem, and I would like to ask whether there is a function or an easy way to do the operation below.
I have a data.frame like this
customer item
-------------------
smith a
smith b
smith c
johnson a
bush NA
regan d
How can I create a matrix like this?
customer a b c d
--------------------------------------
smith 1 1 1 0
johnson 1 0 0 0
bush 0 0 0 0
regan 0 0 0 1
Is a loop obligatory? Is there an easier way to create this?
Thank you in advance!
You should use the table function. The call would look something like the one below. The arguments go x, y (rows, then columns), but depending on what the full data.frame looks like you may want to add some more parameters to handle NA values and such.
table(df$customer, df$item)
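As a small runnable sketch on the sample data from the question (the data frame is retyped by hand here, so treat it as an assumption about the real data): by default table leaves the NA item out of the counts but keeps bush as a row of zeros. Turning customer into a factor with explicit levels is only there to preserve the original row order.

```r
df <- data.frame(
  customer = c("smith", "smith", "smith", "johnson", "bush", "regan"),
  item     = c("a", "b", "c", "a", NA, "d"),
  stringsAsFactors = FALSE
)

# Keep customers in order of first appearance instead of alphabetical order
df$customer <- factor(df$customer, levels = unique(df$customer))

tab <- table(df$customer, df$item)  # NA items are not counted by default
tab

# Optional: convert the table to a plain data.frame of counts
counts <- as.data.frame.matrix(tab)
```

If NA should get its own column instead, table accepts useNA = "ifany".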
I am dealing with a column that contains strings as follows
Col1
------------------------------------------------------------------
Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery
What I am trying to do is split out the substrings that start with Department, Division, or Center and run up to the next comma (,). The final output should look like this:
Dept_Mechanical_Eng Dept_Computer_Science Div_Adv_Machining Cntr_Mining_Metallurgy Dept_Aerospace Cntr_Science_Delivery
1 1 0 0 0 0
0 0 1 1 0 0
0 0 0 0 1 1
I have butchered the actual names in the expected output just for aesthetic purposes. Any help on parsing this string is much appreciated.
This is very similar to a question I just answered, tabulating another text example. Are you in the same class as the questioner in "Count the number of times (frequency) a string occurs"?
inp <- "Department of Mechanical Engineering, Department of Computer Science
Division of Advanced Machining, Center for Mining and Metallurgy
Department of Aerospace, Center for Science and Delivery"
inp2 <- factor(scan(text=inp,what="",sep=","))
#Read 6 items
inp3 <- readLines(textConnection(inp))
as.data.frame( setNames( lapply(levels(inp2), function(ll) as.numeric(grepl(ll, inp3) ) ), trimws(levels(inp2) )) )
Department.of.Aerospace Division.of.Advanced.Machining
1 0 0
2 0 1
3 1 0
Center.for.Mining.and.Metallurgy Center.for.Science.and.Delivery
1 0 0
2 1 0
3 0 1
Department.of.Computer.Science Department.of.Mechanical.Engineering
1 1 1
2 0 0
3 0 0
I've got survey data with some multiple-response questions like this:
HS18 Why is it difficult to get medical care in South Africa? (Select all that apply)
1 Too expensive
2 No transportation to the hospital/clinic
3 Hospital/clinic is too far away
4 Hospital/clinic staff do not speak my language
5 Hospital/clinic staff do not like foreigners
6 Wait time too long
7 Cannot take time off of work
8 None of these. I have no problem accessing medical care
where multiple responses were entered with commas and are recorded as different levels i.e.:
unique(HS18)
[1] 888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3
[13] 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4
[25] 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5
30 Levels: 0 1 1,2,3 1,4 1,4,5 1,4,6,7 1,6 2 2,3 2,6 3 3,4 3,5 3,6 4 4,5 4,5,6 4,5,6,7 4,6 4,8 ... 999
This is as much a data-cleaning protocol question as an R question...I'm doing the cleaning, but not the analysis, so everything needs to be transparent and user-friendly when I pass it back...and the PI doesn't use R. Basically I'd like to split the multiples into levels and re-name them while keeping them together as a single observation...not sure how to do this, or even if it's the right approach.
How do you generally deal with this issue? Is there an elegant way to process this for analysis in STATA (simple descriptives, regressions, odds ratios)?
Thanks everyone!!!
My best thought for analyzing multi-select questions like this is to convert the possible answers into indicator variables: take all of your possible answers (1 to 8 in this example) and create data columns named HS18.1, HS18.2, etc. (You can optionally include something more in the column name, but that's completely between you and the PI.)
Your sample data here looks like it includes data that is not legal: 0, 888, and 999 are not listed in the options. It's possible/likely that these include DK/NR responses, but I can't be certain. As such:
1. Your data cleaning should be taking care of these anomalies before this step of converting 0+ length lists into indicator variables.
2. My code below arbitrarily ignores this fact and you will lose data. This is obviously not "A Good Thing™" in the long run. More robust checks are warranted (and not difficult). (I've added an other column to indicate something was lost.)
The code:
ss <- '888 1 6 4 5 8 2 3,5 4,6 3,6 3,4 3 4,5,6 7 999 4,5 2,6 4,8 7,8 1,6 1,2,3 5,7,8 4,5,6,7 1,4 0 5,6,7 5,6 2,3 1,4,6,7 1,4,5'
dat <- lapply(strsplit(ss, ' '), strsplit, ',')[[1]]
lvls <- as.character(1:8)
## lvls <- sort(unique(unlist(dat))) # alternative method
ret <- structure(lapply(lvls, function(lvl) sapply(dat, function(xx) lvl %in% xx)),
.Names = paste0('HS18.', lvls),
row.names = c(NA, -length(dat)), class = 'data.frame')
ret$HS18.other <- sapply(dat, function(xx) !all(xx %in% lvls))
ret <- 1 * ret ## convert from TRUE/FALSE to 1/0
head(ret)
## HS18.1 HS18.2 HS18.3 HS18.4 HS18.5 HS18.6 HS18.7 HS18.8 HS18.other
## 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 1 0 0 0 0 0
## 5 0 0 0 0 1 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
The resulting data.frame can be cbinded (or even matrixized) to whatever other data you have.
(I use 1 and 0 instead of TRUE and FALSE because you said the PI will not be using R; this can easily be changed to a character string or something that makes more sense to them.)
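To illustrate the cbind step, here is a minimal sketch starting from a factor column rather than a single string. The survey data frame, its id column, and the four example responses are made up for illustration and are not from the original data.

```r
# Hypothetical survey data: HS18 is a factor with comma-separated responses
survey <- data.frame(id = 1:4,
                     HS18 = factor(c("1", "3,5", "888", "4,5,6")))

answers <- strsplit(as.character(survey$HS18), ",")
lvls <- as.character(1:8)

# One logical column per legal answer: TRUE if the respondent selected it
ind <- sapply(lvls, function(lvl) {
  vapply(answers, function(a) lvl %in% a, logical(1))
})
colnames(ind) <- paste0("HS18.", lvls)

# Attach the 1/0 indicators to the original observations
out <- cbind(survey, 1 * ind)
out
```

The original HS18 column is kept next to the indicators, so each row can still be traced back to the raw response.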