Group by Class across multiple categorical variables in R

I am extremely new to R and thus not familiar with the various packages.
I am using the Soybean data from library(mlbench), loaded with data(Soybean), and I want to show in a table the Class factor (19 levels) broken out by the various categorical variables (date, plant.stand, precip, etc.; there are 35 such variables). For each one I want to show the frequencies, the number of NAs and the mode. In essence, each Class would be broken out by each categorical variable (date, plant.stand, precip, etc.) together with its frequency data.
I'm sure there must be a simple way, but I'm very new to R.
Thanks for the help.
Update
As per the table below:
[Image: table data]
Basically, I want to count the levels of every categorical variable (date, plant.stand, precip, etc.) and break the counts out by the Class variable. The only way I can think of is to create a key for each factor level of each categorical variable, count the occurrences of each key, and then sort. Is there perhaps an easier way?
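One possible shape for this, as a minimal sketch in base R rather than a definitive answer: build one Class-by-level frequency table per variable with table(), and compute the mode per Class with tapply(). The small data frame `soy` and the helper `mode_level` are made up for illustration; with the real data, `Soybean` from mlbench would take the place of `soy`.

```r
# Hypothetical stand-in for mlbench's Soybean data: a Class factor plus
# categorical columns that may contain NAs (all values are made up).
soy <- data.frame(
  Class       = factor(c("a", "a", "b", "b", "b")),
  date        = factor(c("1", "2", "2", NA, "2")),
  plant.stand = factor(c("0", "0", "1", "1", NA))
)

# All categorical columns except the grouping variable
vars <- setdiff(names(soy), "Class")

# One Class-by-level frequency table per variable; useNA = "ifany" adds an
# <NA> column whenever the variable has missing values.
freq <- lapply(soy[vars], function(x) {
  table(Class = soy$Class, value = x, useNA = "ifany")
})

# Most frequent non-missing level of a vector (the first one wins on ties)
mode_level <- function(x) {
  counts <- table(x)
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

# Mode of every variable within every Class: rows = Class, columns = variable
modes <- sapply(soy[vars], function(x) tapply(x, soy$Class, mode_level))

print(freq$date)
print(modes)
```

Each element of `freq` is an ordinary contingency table, so the NA counts appear as their own column next to the level counts, and no per-level key has to be built by hand.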

Related

r: How to manipulate the GGparcoord input column inside the function

I want to compare the week() of the year of two parallel date columns from two different years. I'm using the ggparcoord function and looking for a way to convert the dates in the two columns to the week number of each date. I would prefer not to modify the table itself.
My code is:
ggparcoord(data, columns = 38:39)
and I'm looking for something like ggparcoord(data, columns = week(38):week(39)) that actually works.
In addition, if anyone knows how, I would be happy to learn how to use ggparcoord with column names instead of column numbers.
Thanks!
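A minimal sketch of one way this might be approached, under stated assumptions: the week numbers are computed into a separate data frame that is passed to the plot, so the original table is left untouched, and the columns are selected by name (the GGally documentation describes `columns` as accepting either names or indices). The data frame, its column names and the helper `week_of_year` are hypothetical; the helper follows the definition that lubridate::week() documents (complete seven-day periods since 1 January, plus one).

```r
# Hypothetical data: two parallel date columns from different years
data <- data.frame(
  date_2019 = as.Date(c("2019-01-07", "2019-03-15", "2019-11-02")),
  date_2020 = as.Date(c("2020-01-20", "2020-02-29", "2020-12-25"))
)

# Week of the year: complete 7-day periods since 1 January, plus one
week_of_year <- function(d) as.POSIXlt(d)$yday %/% 7 + 1

# Derived week columns live in their own data frame; `data` is not modified
weeks <- data.frame(
  week_2019 = week_of_year(data$date_2019),
  week_2020 = week_of_year(data$date_2020)
)

# Build the plot only if GGally is installed; columns are selected by name
if (requireNamespace("GGally", quietly = TRUE)) {
  p <- GGally::ggparcoord(weeks,
                          columns = c("week_2019", "week_2020"),
                          scale = "globalminmax")
}
```

Using scale = "globalminmax" keeps the axes in raw week numbers instead of standardizing each column, which seems closer to what a week-to-week comparison needs.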

Get total (count) per month in Power BI

I have an 'issue' data set in CSV format that looks like this.
Date,IssueId,Type,Location
2019/11/02,I001,A,Canada
2019/11/02,I002,A,USA
2019/11/11,I003,A,Mexico
2019/11/11,I004,A,Japan
2019/11/17,I005,B,USA
2019/11/20,I006,C,USA
2019/11/26,I007,B,Japan
2019/11/26,I008,A,Japan
2019/12/01,I009,C,USA
2019/12/05,I010,C,USA
2019/12/05,I011,C,Mexico
2019/12/13,I012,B,Mexico
2019/12/13,I013,B,USA
2019/12/21,I014,C,USA
2019/12/25,I015,B,Japan
2019/12/25,I016,A,USA
2019/12/26,I017,A,Mexico
2019/12/28,I018,A,Canada
2019/12/29,I019,B,USA
2019/12/29,I020,A,USA
2020/01/03,I021,C,Japan
2020/01/03,I022,C,Mexico
2020/01/14,I023,A,Japan
2020/01/15,I024,B,USA
2020/01/16,I025,B,Mexico
2020/01/16,I026,C,Japan
2020/01/16,I027,B,Japan
2020/01/21,I028,C,Canada
2020/01/23,I029,A,USA
2020/01/31,I030,B,Mexico
2020/02/02,I031,B,USA
2020/02/02,I032,C,Japan
2020/02/06,I033,C,USA
2020/02/08,I034,C,Japan
2020/02/15,I035,C,USA
2020/02/19,I036,A,USA
2020/02/20,I037,A,Mexico
2020/02/22,I038,A,Mexico
2020/02/22,I039,A,Canada
2020/02/28,I040,B,USA
2020/02/29,I041,B,USA
2020/03/02,I042,A,Mexico
2020/03/03,I043,B,Mexico
2020/03/08,I044,C,USA
2020/03/08,I045,C,Canada
2020/03/11,I046,A,USA
2020/03/12,I047,B,USA
2020/03/12,I048,B,Japan
2020/03/12,I049,C,Japan
2020/03/13,I050,A,USA
2020/03/13,I051,B,Japan
2020/03/13,I052,A,USA
I'm interested in analyzing the count of issues, particularly across months and years. If I simply wanted to plot a chart of issues by date, that would be easy. But what if I want to calculate the total issues per month, plot it, and perhaps do some trend analysis? How would I go about calculating these sums per (say) month?
The best approach I could take so far is the following.
I create a new column, called YearMonth which looks like this:
YearMonth = FORMAT(Issues[Date],"YYYY/MM")
Then if I plot Axis = YearMonth vs Values = Count of IssueId, I get what I want.
But the biggest drawback here is that my X-axis is the newly created column, not the original Date column. Since my project has other data that I would like to analyze using the date as well, I would like for this to be using the actual Date instead of my custom column.
Is there a way for me to get this same result but without having to create a new column?
What you usually do is create a calendar table, which will contain all the time-related columns (year, month, year-month, etc) and then link it to your data by date.
In your visuals, you will then use the "Calendar" table columns, without having to alter your original table. The calendar table will also be used by any other table that needs date-related data.

How to convert the levels of an integer variable of a dataset to string characters

Hi I have a problem with one of my assignments. I am using the following dataset http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv
One of the questions asks to "reduce the levels of rating for quality to three levels as high, medium, and low".
I would like to output the summary of the quality variable to these strings.
They are originally as integers
If this is homework, you should really try to work out part of the solution yourself. Nevertheless, here are some ideas to help:
You want to cut or bin the variable. E.g. if you have a scale of 1-6, you could cut it into three groups: 1-2, 3-4 and 5-6.
Once you have cut or binned the variable, you can transform the binned variable (which is now a factor) to the desired levels with a mapping like "5-6" -> "high".
Can you provide at least some code you have already worked on and show where your problems are? Then I could give better feedback instead of just providing a solution.
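To illustrate the two steps on the generic 1-6 example above (not on the wine data itself), here is a minimal sketch using base R's cut(). The `rating` vector is made up, and the break points are just the ones from the example, not a recommendation for the assignment.

```r
# Hypothetical ratings on the 1-6 scale used as the example above
rating <- c(1, 2, 3, 4, 5, 6, 4, 5)

# Step 1: bin into (0,2], (2,4], (4,6], i.e. 1-2, 3-4 and 5-6
binned <- cut(rating, breaks = c(0, 2, 4, 6), labels = c("1-2", "3-4", "5-6"))

# Step 2: rename the factor levels to the desired strings
levels(binned) <- c("low", "medium", "high")

# Frequency of each new level
table(binned)
```

The two steps could also be merged by passing the final labels directly to cut(); they are kept separate here only to mirror the description.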

R: Share factor levels between different list items?

I am working with a data frame where one of the columns is a multivalued variable, which I've implemented as a list column.
Here's a reproducible example:
df <- data.frame(title=c('one','two','three'), subjects=I(list(c('A'), c('A','B','C','D'), c('B','D','E'))))
The general idea being that I can attach as many subjects as I'd like without using too much space.
Now, the set of possible subjects isn't that large, so if it were a simple column, I'd turn it into a factor. But if I do that here, R stores the levels attribute separately for each list item (i.e. each row), which once again needs a huge amount of storage.
Does anyone know of a way to store a list of factors, with the levels of these factors as a shared attribute?
The only thing I could think of was to do it myself, store the values as integers and create a separate lookup-table, but that doesn't look very efficient.
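For reference, the do-it-yourself idea described above could look roughly like the following sketch. It is only an illustration of that approach, not a claim that it is the most efficient one; `subject_levels`, `subject_codes` and `decode` are hypothetical names introduced here.

```r
df <- data.frame(
  title = c("one", "two", "three"),
  subjects = I(list(c("A"), c("A", "B", "C", "D"), c("B", "D", "E")))
)

# One shared vector of levels for the whole column (the lookup table)
subject_levels <- sort(unique(unlist(df$subjects)))

# Replace each list item with integer codes into the shared levels
df$subject_codes <- I(lapply(df$subjects, match, table = subject_levels))

# Decode a row on demand; the levels are stored in one place only
decode <- function(codes) factor(subject_levels[codes], levels = subject_levels)

decode(df$subject_codes[[3]])
```

Each row then carries only a plain integer vector, and a factor is built from the shared levels only when a row is actually inspected.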

How to deal with categorical features having large number of levels in it

I am working on a data set in R having dimensions
dim(adData)
[1] 15844717 11
Out of the 11 features:
one feature has 273596 unique values (random integers used as ids) across the 15844717 rows;
a second feature has 884353 unique values (random integers used as ids) across the 15844717 rows.
I am unsure whether to convert them into factors, because categorical variables with a large number of levels will cause problems at modelling time. Please suggest how to treat them.
I am new to data science and have never worked on large data sets before.
~300k categories for one variable is sure to cause computational issues. I would first take a step back and examine the nature of this variable and its relevance to the prediction at hand. Without knowing the source of the data, it is hard to give specific advice.
If it is truly a categorical variable, it would be silly to leave the ids as numeric variables since the scale and order of the ids are likely meaningless.
Is it possible to group the levels into fewer but still meaningful categories?
Example 1: If the ids were zipcodes in the United States, there are potentially 40,000 unique values. These can be grouped into states or regions, reducing the number of levels to 50 or fewer.
Example 2: If the ids were product ids from an e-commerce site, they could be grouped by product category or sub-category. There would be far fewer distinct values to work with.
Another option is to examine the relative frequency of each category. If there are a few very common categories and thousands of rare ones, you can leave the common levels intact and group the rare levels into an 'other' category.
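The last option might be sketched as follows in base R. The `ids` vector is made up, and the 5% cutoff is an arbitrary assumption chosen for the illustration; a sensible threshold depends on the data and the model.

```r
# Hypothetical id column: two common ids and several rare ones
ids <- c(rep(101, 50), rep(202, 40), 303, 404, 505, 606)

# Relative frequency of each id
freq <- prop.table(table(ids))

# Assumed cutoff: ids covering at least 5% of the rows keep their own level
common <- names(freq)[freq >= 0.05]

# Keep common ids as they are and collapse everything else into "other"
lumped <- ifelse(as.character(ids) %in% common, as.character(ids), "other")
lumped <- factor(lumped, levels = c(common, "other"))

table(lumped)
```

The result is a factor with one level per common id plus a single "other" level, which is far cheaper to model than one level per id.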
