I have an 'issue' data set in CSV format that looks like this.
Date,IssueId,Type,Location
2019/11/02,I001,A,Canada
2019/11/02,I002,A,USA
2019/11/11,I003,A,Mexico
2019/11/11,I004,A,Japan
2019/11/17,I005,B,USA
2019/11/20,I006,C,USA
2019/11/26,I007,B,Japan
2019/11/26,I008,A,Japan
2019/12/01,I009,C,USA
2019/12/05,I010,C,USA
2019/12/05,I011,C,Mexico
2019/12/13,I012,B,Mexico
2019/12/13,I013,B,USA
2019/12/21,I014,C,USA
2019/12/25,I015,B,Japan
2019/12/25,I016,A,USA
2019/12/26,I017,A,Mexico
2019/12/28,I018,A,Canada
2019/12/29,I019,B,USA
2019/12/29,I020,A,USA
2020/01/03,I021,C,Japan
2020/01/03,I022,C,Mexico
2020/01/14,I023,A,Japan
2020/01/15,I024,B,USA
2020/01/16,I025,B,Mexico
2020/01/16,I026,C,Japan
2020/01/16,I027,B,Japan
2020/01/21,I028,C,Canada
2020/01/23,I029,A,USA
2020/01/31,I030,B,Mexico
2020/02/02,I031,B,USA
2020/02/02,I032,C,Japan
2020/02/06,I033,C,USA
2020/02/08,I034,C,Japan
2020/02/15,I035,C,USA
2020/02/19,I036,A,USA
2020/02/20,I037,A,Mexico
2020/02/22,I038,A,Mexico
2020/02/22,I039,A,Canada
2020/02/28,I040,B,USA
2020/02/29,I041,B,USA
2020/03/02,I042,A,Mexico
2020/03/03,I043,B,Mexico
2020/03/08,I044,C,USA
2020/03/08,I045,C,Canada
2020/03/11,I046,A,USA
2020/03/12,I047,B,USA
2020/03/12,I048,B,Japan
2020/03/12,I049,C,Japan
2020/03/13,I050,A,USA
2020/03/13,I051,B,Japan
2020/03/13,I052,A,USA
I'm interested in analyzing the count of issues, particularly across months and years. Plotting a chart of issues by date is easy. But what if I want to calculate total issues per month, plot that, and perhaps analyze trends? How would I go about calculating these per-month totals?
The best approach I could take so far is the following.
I create a new column, called YearMonth which looks like this:
YearMonth = FORMAT(Issues[Date],"YYYY/MM")
Then if I plot Axis = YearMonth vs Values = Count of IssueId, I get what I want.
But the biggest drawback here is that my X-axis is the newly created column, not the original Date column. My project has other data that I would also like to analyze by date, so I would like the chart to use the actual Date column instead of my custom column.
Is there a way for me to get this same result but without having to create a new column?
What you usually do is create a calendar table, which contains all the time-related columns (year, month, year-month, etc.), and then link it to your data by date.
In your visuals, you then use the "Calendar" table columns, without having to alter your original table. The calendar table can also be used by any other table that needs date-related data.
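To make the idea concrete outside of Power BI, here is a minimal Python sketch of what the calendar table and its relationship do. It is only an illustration of the concept, not Power BI's own mechanism: `build_calendar` and the column names `Year`, `Month` and `YearMonth` are assumptions, and the rows are a small subset of the sample data from the question.

```python
from collections import Counter
from datetime import datetime, timedelta

# A few (Date, IssueId) rows copied from the sample data in the question.
issues = [
    ("2019/11/02", "I001"),
    ("2019/11/02", "I002"),
    ("2019/11/26", "I008"),
    ("2019/12/01", "I009"),
    ("2019/12/29", "I020"),
    ("2020/01/03", "I021"),
]


def parse(text):
    """Turn a 'YYYY/MM/DD' string into a date object."""
    return datetime.strptime(text, "%Y/%m/%d").date()


def build_calendar(start, end):
    """Hypothetical helper: one row per day from start to end, keyed by date."""
    calendar = {}
    day = start
    while day <= end:
        calendar[day] = {
            "Year": day.year,
            "Month": day.month,
            "YearMonth": day.strftime("%Y/%m"),
        }
        day += timedelta(days=1)
    return calendar


dates = [parse(d) for d, _ in issues]
calendar = build_calendar(min(dates), max(dates))

# The "relationship": look up each issue's Date in the calendar, then
# group on the calendar's YearMonth column. The issue rows are untouched.
issues_per_month = Counter(calendar[d]["YearMonth"] for d in dates)

for year_month, count in sorted(issues_per_month.items()):
    print(year_month, count)
```

The issue rows keep only their original Date. Every grouping attribute lives in the calendar, so a second table with a date column could reuse the same lookup.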
I am working with a dataframe where one of the columns is a multivalued variable, which I've implemented as a list column.
Here's a reproducible example:
df <- data.frame(title=c('one','two','three'), subjects=I(list(c('A'), c('A','B','C','D'), c('B','D','E'))))
The general idea being that I can attach as many subjects as I'd like without using too much space.
Now the set of possible subjects isn't that large, so if it were a simple column, I'd turn it into a factor. But if I do that here, R stores the levels attribute separately for each list item (i.e. each row), once again needing a huge amount of storage.
Does anyone know of a way to store a list of factors, with the levels of these factors as a shared attribute?
The only thing I could think of was to do it myself, store the values as integers and create a separate lookup-table, but that doesn't look very efficient.
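For reference, the do-it-yourself idea described above (integer codes plus one shared lookup table) can be sketched as follows. This is a language-agnostic illustration written in Python rather than R, using the three rows from the reproducible example. The names `levels`, `code_of`, `encoded` and `decode` are made up for the sketch, and it says nothing about how R stores the data internally.

```python
# The three rows from the reproducible example: title -> list of subjects.
rows = {
    "one": ["A"],
    "two": ["A", "B", "C", "D"],
    "three": ["B", "D", "E"],
}

# One shared table of levels, stored once for the whole column.
levels = sorted({s for subjects in rows.values() for s in subjects})
code_of = {level: i for i, level in enumerate(levels)}

# Each row now holds only small integer codes into the shared table.
encoded = {
    title: [code_of[s] for s in subjects]
    for title, subjects in rows.items()
}


def decode(codes):
    """Map integer codes back to their labels via the shared table."""
    return [levels[c] for c in codes]


print(levels)
print(encoded)
print(decode(encoded["three"]))
```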
I am working on a data set in R with the following dimensions:
dim(adData)
[1] 15844717 11
Of the 11 features:
One feature has 273596 unique values (random integers used as ids) out of 15844717 rows.
A second feature has 884353 unique values (random integers used as ids) out of 15844717 rows.
I am unsure whether to convert them to factors, because categorical variables with a large number of levels will cause problems at modelling time. Please suggest how to treat them.
I am new to Data Science and have never worked on large data sets before.
~300k categories for one variable is sure to cause computational issues. I would first take a step back and examine the nature of this variable and its relevance to the prediction at hand. Without knowing the source of the data, it is hard to give specific advice.
If it is truly a categorical variable, it would be silly to leave the ids as numeric variables since the scale and order of the ids are likely meaningless.
Is it possible to group the levels into fewer but still meaningful categories?
Example 1: If the ids were ZIP codes in the United States, there are potentially 40,000 unique values. These can be grouped into states or regions, reducing the number of levels to 50 or fewer.
Example 2: If the ids were product ids from an e-commerce site, they could be grouped by product category or sub-category. There would be far fewer distinct values to work with.
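As a rough illustration of this kind of grouping, the sketch below maps ids to coarser categories through a lookup table. It is written in Python for brevity, and both the ids and the `category_of` mapping are invented for the example. In practice the mapping would come from whatever reference data describes the ids.

```python
# Hypothetical lookup from product id to product category.
category_of = {
    1001: "books",
    1002: "books",
    2001: "toys",
    2002: "toys",
    3001: "garden",
}

# Hypothetical id column; 9999 has no entry in the lookup.
product_ids = [1001, 2001, 1002, 1001, 3001, 2002, 9999]

# Replace each id by its category; ids without a mapping get a fallback level.
grouped = [category_of.get(pid, "unknown") for pid in product_ids]

print(grouped)
print(len(set(product_ids)), "ids ->", len(set(grouped)), "categories")
```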
Another option is to examine the relative frequency of each category. If there are a few very common categories and thousands of rare ones, you can leave the common levels intact and group the rare levels into an 'other' category.
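A minimal sketch of that frequency-based lumping, again in Python with made-up data, might look like this. The helper name `lump_rare` and the `min_count` threshold are assumptions for illustration, and a sensible threshold depends on the data and the model.

```python
from collections import Counter


def lump_rare(values, min_count, other="other"):
    """Keep levels seen at least min_count times; replace the rest with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Hypothetical id column: two common ids and three ids that occur once each.
ids = [101] * 5 + [202] * 4 + [303, 404, 505]

lumped = lump_rare(ids, min_count=3)

print(Counter(ids))
print(Counter(lumped))
```

The number of distinct levels drops from five to three here, while the common ids stay unchanged.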