I am still working on a dataset of customer satisfaction and customers' online paths/journeys (represented here as mini-sequences/bi-grams). The customers were classified into four usability-satisfaction classes.
Some example rows of my current data frame:
| bi-gram | satisfaction_class |
|:---- |:----: |
| "openPage1", "writeText" | 1 |
| "writeText", "writeText" | 2 |
| "openPage3", "writeText" | 4 |
| "writeText", "openPage1" | 3 |
...
Now I would like to know which bi-grams are significant/robust predictors of certain classes. Is this possible with bi-grams alone, or do I need the whole customer path?
I have read that one can use TF-IDF or a chi-square test, but I could not find suitable code. :(
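Roughly what I was imagining from the chi-square suggestion (a sketch only; df, bigram, and satisfaction_class are placeholder names for my data, and I am not sure this is the right approach):

results <- do.call(rbind, lapply(unique(df$bigram), function(bg) {
  # contingency table: bi-gram present/absent vs. satisfaction class
  tab  <- table(df$bigram == bg, df$satisfaction_class)
  test <- chisq.test(tab)
  data.frame(bigram = bg, statistic = unname(test$statistic), p.value = test$p.value)
}))
results[order(results$p.value), ]  # bi-grams most strongly associated with class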
Thank you so much!
Marius
It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured, for example the number of permanent or full-time jobs. The number of metrics is larger than shown in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years; the illustration of the structure is based on this test data). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate. I've labelled them policy 1, policy 2, etc. for sensitivity reasons, and limited how many I use whilst testing the code so that the outputs are easier to check.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles with a row for each metric and 4 columns (the metric names and the values for 2024, 2030, and 2035). I converted this to a data frame, created a vector to be a column header, and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose so that the metrics become columns and the grouped data becomes rows, and finally expand the data so that the metrics are repeated for each year. A friend of mine who codes (but has never used R) suggested that loops might be a better way forward; I am not sure of the best approach, so I welcome advice. On Reddit someone suggested pivot_wider/pivot_longer, but those appear to me to be summarising tools, and I am not trying to summarise the data, only transform its structure.
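For reference, here is my rough understanding of that pivot suggestion, on a tiny made-up example shaped like my "What structure I had" table (all policy/metric names and values here are placeholders, not my real ones); I'm not sure it gets me what I need:

library(tidyr)

# Made-up example in the shape described above: one row per policy/metric,
# one column per year.
survey_df <- data.frame(
  policy = c("Policy 1", "Policy 1", "Policy 2", "Policy 2"),
  metric = c("Metric A", "Metric B", "Metric A", "Metric B"),
  `2024` = c(10, 20, 30, 40),
  `2030` = c(11, 21, 31, 41),
  `2035` = c(12, 22, 32, 42),
  check.names = FALSE
)

# One row per policy/metric/year...
long <- pivot_longer(survey_df, cols = c(`2024`, `2030`, `2035`),
                     names_to = "year", values_to = "value")

# ...then one row per policy/year, with a column per metric.
wide <- pivot_wider(long, names_from = metric, values_from = value)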
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
Thank you to everyone who commented already! I've edited my post with better code and hopefully some clarity on what I'm trying to do. (I appreciate all of the feedback - this is my first time asking a question here!)
I have a very similar question to this one here (Random Pairings that don't Repeat), but am trying to come up with a function or piece of code that I can run in R to create the pairings. Essentially, I have a pool of employees and want a way to randomly generate pairs of employees to meet every month, with no pairs repeating in future months/runs of the function. (I will need to maintain the history of previous pairings.) The catch is that each employee is assigned to a working location, and I only want matches between different locations.
I've gone through a number of previous queries on randomly sampling data sets in R and am comfortable with generating a random pair from my data, or pulling out an existing working group, but it's the "generating a pair that ALWAYS comes from a different group" that's tripping me up, especially since the different groups/locations have different numbers of employees so it's hard to sort the groups evenly.
Here's my dummy data, which currently has 10 "employees". The actual data set currently has over 100 employees with more being added to the pool each month:
ID <- 1:10
Name <- c("Sansa", "Arya", "Hodor", "Jamie", "Cersei", "Tyrion", "Jon", "Sam", "Dany", "Drogo")
Email <- c("a@a.com", "b@b.com", "c@c.com", "d@d.com", "e@e.com", "f@f.com", "g@g.com", "h@h.com", "i@i.com", "j@j.com")
Location <- c("Winterfell", "Winterfell", "Winterfell", "Kings Landing", "Kings Landing", "Kings Landing",
              "The Wall", "The Wall", "Essos", "Essos")
df <- data.frame(ID, Name, Email, Location)
Basically, I want to write something that would say that Sansa could be randomly paired with anyone who is not Arya or Hodor, because they're in the same location, and that Jon could be paired with anyone but Sam (i.e., anyone whose location is NOT The Wall). If Jon and Arya were paired up once, I would like them not to be paired up again going forward, so at that point Jon could be paired with anyone but Sam or Arya. I hope I'm making sense.
I was able to run the combn function on the ID column to generate groups of two, but it doesn't account for the historical pairings that we're trying to avoid.
Is there a way to do this in R? I've tried this one (Using R, Randomly Assigning Students Into Groups Of 4) but it wasn't quite right for my needs, again because of the historical pairings.
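Roughly the kind of thing I have in mind, as a sketch only: history stands in for the table of past pairings I would maintain between runs (empty here), and everything else uses the dummy df above.

set.seed(42)

history <- data.frame(ID1 = integer(0), ID2 = integer(0))

# All possible pairs of IDs (smaller ID always first, since combn keeps order).
all_pairs <- as.data.frame(t(combn(df$ID, 2)))
names(all_pairs) <- c("ID1", "ID2")

# Drop pairs from the same location.
all_pairs$loc1 <- df$Location[match(all_pairs$ID1, df$ID)]
all_pairs$loc2 <- df$Location[match(all_pairs$ID2, df$ID)]
all_pairs <- all_pairs[all_pairs$loc1 != all_pairs$loc2, ]

# Drop pairs that have already met (assumes history also stores the smaller ID first).
all_pairs <- all_pairs[!paste(all_pairs$ID1, all_pairs$ID2) %in%
                         paste(history$ID1, history$ID2), ]

# Greedily draw pairs so that no employee appears twice in the same month.
month_pairs <- list()
remaining <- all_pairs
while (nrow(remaining) > 0) {
  pick <- remaining[sample(nrow(remaining), 1), ]
  month_pairs[[length(month_pairs) + 1]] <- pick[, c("ID1", "ID2")]
  remaining <- remaining[!(remaining$ID1 %in% c(pick$ID1, pick$ID2) |
                             remaining$ID2 %in% c(pick$ID1, pick$ID2)), ]
}
month_pairs <- do.call(rbind, month_pairs)
month_pairs   # this month's pairings; these rows would then be appended to history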
Thank you for any help you can provide!
I have a dataset which contains all the quotes made by a company over the past 3 years. I want to create a predictive model using the library caret in R to predict whether a quote will be accepted or rejected.
The structure of the dataset is causing me some problems. It contains 45 variables; however, I have only included the two below, as they are the only variables that matter for this problem. An extract of the dataset is shown below.
contract.number item.id
0030586792 32X10AVC
0030586792 ZFBBDINING
0030587065 ZSTAIRCL
0030587065 EMS164
0030591125 YCLEANOFF
0030591125 ZSTEPSWC
contract.number <- c("0030586792","0030586792","0030587065","0030587065","0030591125","0030591125")
item.id <- c("32X10AVC","ZFBBDINING","ZSTAIRCL","EMS164","YCLEANOFF","ZSTEPSWC")
dataframe <- data.frame(contract.number,item.id)
Each unique contract.number corresponds to a single quote made. The item.id corresponds to the item that is being quoted for. Therefore, quote 0030586792 includes both items 32X10AVC and ZFBBDINING.
If I randomise the order of the dataset and model it in its current form, I am worried that the model would just learn which contract.numbers won and lost during training, which would invalidate my testing, as in the real world this is not known before the prediction is made. I also have the additional issue of what to do if the model predicts that the same contract.number will win with some item.ids and lose with others.
My ideal solution would be to condense each contract.number into a single line with multiple item.ids per line, forming a 3-dimensional data frame. But I am not sure whether caret would then be able to model this. It is not realistic to split the item.ids into multiple columns, as some quotes have hundreds of item.ids. Any help would be much appreciated!
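To illustrate what I mean by condensing, here is a sketch of the reshape step only, on the small extract above, with each distinct item.id becoming a 0/1 indicator column; I don't know whether this scales to hundreds of item.ids or whether caret can work with it:

library(dplyr)
library(tidyr)

quote_level <- dataframe %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from = item.id, values_from = present, values_fill = 0L)

quote_level
# One row per contract.number, one 0/1 column per item.id.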
(Sorry if I haven't explained well!)
I'm very new to R. I'm using it mostly for marketing purposes, so the twitteR package is very useful.
What I'm trying to do is find the frequency of @mentions and #hashtags within my data after I've pulled all the data with the searchTwitter command.
I'm not sure what type of object searchTwitter returns, or whether I need to convert it into a data.frame, another type of vector, a corpus, or something else.
How do I break the data down into the total number of mentions/hashtags, and the frequency of each @mention or #hashtag?
This will give me a good idea of who the key influencers and key hashtags in a specific market are, and how valuable those influencers/hashtags are.
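To clarify, roughly the output I'm hoping for looks like this (a sketch only; the search term is a placeholder and I assume the API authentication step has already been done):

library(twitteR)

tweets   <- searchTwitter("#examplesearch", n = 200)
tweet_df <- twListToDF(tweets)   # turn the list of status objects into a data frame

# Pull @mentions and #hashtags out of the tweet text.
mentions <- unlist(regmatches(tweet_df$text, gregexpr("@\\w+", tweet_df$text)))
hashtags <- unlist(regmatches(tweet_df$text, gregexpr("#\\w+", tweet_df$text)))

sort(table(mentions), decreasing = TRUE)   # frequency of each @mention
sort(table(hashtags), decreasing = TRUE)   # frequency of each #hashtag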
Please help.
Thanks
I'm struggling with how to best structure categorical data that's messy, and comes from a dataset I'll need to clean.
The Coding Scheme
I'm analyzing data from a university science course exam. We're looking at patterns in
student responses, and we developed a coding scheme to represent the kinds of things
students are doing in their answers. A subset of the coding scheme is shown below.
Note that within each major code (1, 2, 3) are nested non-unique sub-codes (a, b, ...).
What the Raw Data Looks Like
I've created an anonymized, raw subset of my actual data which you can view here.
Part of my problem is that those who coded the data noticed that some students displayed
multiple patterns. The coders' solution was to create enough columns (reason1, reason2,
...) to hold students with multiple patterns. That becomes important because the order
(reason1, reason2) is arbitrary--two students (like student 41 and student 42 in my
dataset) who correctly applied "dependency" should both register in an analysis, regardless of
whether 3a appears in the reason1 column or the reason2 column.
How Can I Best Structure Student Data?
Part of my problem is that in the raw data, not all students display the same
patterns, or the same number of them, in the same order. Some students may do just one
thing, others may do several. So, an abstracted representation of example students might
look like this:
Note in the example above that student002 and student003 both are coded as "1b", although I've deliberately shown the order as different to reflect the reality of my data.
My (Practical) Questions
Should I concatenate reason1, reason2, ... into one column?
How can I (re)code the reasons in R to reflect the multiplicity for some students?
Thanks
I realize this question is as much about good data conceptualization as it is about specific features of R, but I thought it would be appropriate to ask it here. If you feel it's inappropriate for me to ask the question, please let me know in the comments, and stackoverflow will automatically flood my inbox with sadface emoticons. If I haven't been specific enough, please let me know and I'll do my best to be clearer.
Make it "long":
library(reshape)
dnow <- read.csv("~/Downloads/catsample20100504.csv")
dnow <- melt(dnow, id.vars=c("Student", "instructor")) ## one row per student per reason entry
dnow$variable <- NULL ## since ordering does not matter
subset(dnow, Student%in%c(41,42)) ## see the results
What to do next will depend on the kind of analysis you would like to do, but the long format is useful for irregular data such as yours.
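For example, once the data are long you can count students for a given code regardless of which reason column it originally sat in:

length(unique(subset(dnow, value == "3a")$Student)) ## distinct students coded "3a" at least once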
You should use ddply from plyr and split on all of the columns if you want to take the different reasons into account; if you want to ignore them, don't use those columns in the split. You'll need to clean up some of the question marks and extra stuff first, though.
library(plyr)
x <- ddply(data, .(split_column1, split_column3),  # every column you want to split on
           summarize,
           count = length(split_column1))          # replace with whatever stats you want per group
What's the (bigger picture) question you're attempting to answer? Why is this information interesting to you?
Are you just trying to find patterns such as 'if the student does this, then they also likely do this'?
Something I'd consider if that's the case: split the data set into smaller random samples for your analysis to reduce the risk of false positives.
Interesting problem though!