Plot top X categories in R/ggplot2

This is very similar to the question here:
How to use ggplot to group and show top X categories?
Except in my case I don't have a discrete value to go on. I've got data about users posting messages to a user forum. Similar to:
Year, Month, Day, User, Message
I've got an entry for every single message a person posted and I want to plot the top 5 users per year in terms of total Messages posted. In the previous question there was a distinct list of values that could be keyed off of.
In my case, I'm curious if I can do it easily in ggplot2, or if I need to do something like:
1. Load the data into a dataframe
2. Construct a new dataframe which is the same data collapsed & summarized by year
3. Plot from the new frame using the same approach as the previous question
If this is the best way to do it, what's the "correct" way to do #2? That new dataframe should probably be of the form:
Year, User, Total number of Messages
Any help is appreciated.

Based on Joran's comment, I found this plyr approach:
library(plyr)
ddply(posts, .(year, poster), summarise, freq = length(year))
Which gives me the posts per year per user. From there I can trim it down as suggested in other posts to get the top X posters per year.
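The same collapse-and-trim can be done with dplyr. A minimal sketch, assuming a `posts` data frame with the `year` and `poster` columns used above (the sample values here are made up):

```r
library(dplyr)

# Hypothetical posts data frame: one row per message
posts <- data.frame(
  year   = c(2020, 2020, 2020, 2021, 2021),
  poster = c("alice", "alice", "bob", "bob", "carol")
)

# Step 2: collapse to messages per user per year, then keep the
# top 5 posters within each year (ties are kept by default)
top_posters <- posts %>%
  count(year, poster, name = "freq") %>%
  group_by(year) %>%
  slice_max(freq, n = 5) %>%
  ungroup()
```

The resulting `top_posters` frame has the Year/User/Total shape described above and can be handed straight to ggplot2.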

Related

R: Creating Tidy df based on URL

I want to analyse data from a website regarding visitors. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it the best I can.
I basically have three columns:
date: containing the date (YYYY-MM-DD),
url: Containing the full url of the page
views: The number of visits for that url for that day
What I want is to categorize the data based on the url by making new columns. To take Stack Overflow as an example, I have urls like:
stackoverflow.com/questions
stackoverflow.com/job
stackoverflow.com/users
For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using this, which I found in another topic here.
df <- df %>%
  mutate(Main_cat = case_when(
    grepl(".*flow.com/questions.*", url) ~ "Questions",
    grepl(".*flow.com/jobs.*", url) ~ "Jobs",
    grepl(".*flow.com/users.*", url) ~ "Users")) %>%
  mutate(Main_cat = as.factor(Main_cat))
This works decently, though not great. The number of main categories I'm working with is about twelve, and my full dataset is about 220,000 observations, so processing in a set-up like this takes a while. It feels like I'm working very inefficiently.
In addition I'm working with sub-categories based on countries:
stackoverflow.com/job/belgium
stackoverflow.com/job/brazil
stackoverflow.com/job/china
stackoverflow.com/job/germany
stackoverflow.com/job/france
These I want to split into new variables such as Continent and Country, since the countries themselves have subdivisions (...job/belgium/retail, ...job/belgium/it). In the end I would like to sort my data by country, or by sector, or both using filter(), and then perform an analysis.
I can use the mutate/case_when/grepl for all of the above, but judging from how long it takes R to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
Hope this is clear enough, thanks in advance!
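One way to avoid running twelve regex scans over every url is to extract the relevant path segment once and then map it through a lookup table. A sketch under the question's column names (the sample data and the `lookup` vector are illustrative assumptions):

```r
library(dplyr)

# Hypothetical example rows matching the question's columns
df <- data.frame(
  date  = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  url   = c("stackoverflow.com/questions/123",
            "stackoverflow.com/job/belgium/retail",
            "stackoverflow.com/users/456"),
  views = c(10, 5, 7)
)

# Extract the first path segment once, then map it via a named
# vector; each url is scanned a single time instead of once per
# category
lookup <- c(questions = "Questions", job = "Jobs", users = "Users")
df <- df %>%
  mutate(segment  = sub("^[^/]*/([^/]*).*$", "\\1", url),
         Main_cat = factor(lookup[segment]))
```

The same idea extends to the country and sector sub-categories: pull out the second and third path segments with further capture groups and join them against small lookup tables (e.g. country-to-continent), rather than writing one grepl per value.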

Get total (count) per month in Power BI

I have an 'issue' data set in CSV format that looks like this.
Date,IssueId,Type,Location
2019/11/02,I001,A,Canada
2019/11/02,I002,A,USA
2019/11/11,I003,A,Mexico
2019/11/11,I004,A,Japan
2019/11/17,I005,B,USA
2019/11/20,I006,C,USA
2019/11/26,I007,B,Japan
2019/11/26,I008,A,Japan
2019/12/01,I009,C,USA
2019/12/05,I010,C,USA
2019/12/05,I011,C,Mexico
2019/12/13,I012,B,Mexico
2019/12/13,I013,B,USA
2019/12/21,I014,C,USA
2019/12/25,I015,B,Japan
2019/12/25,I016,A,USA
2019/12/26,I017,A,Mexico
2019/12/28,I018,A,Canada
2019/12/29,I019,B,USA
2019/12/29,I020,A,USA
2020/01/03,I021,C,Japan
2020/01/03,I022,C,Mexico
2020/01/14,I023,A,Japan
2020/01/15,I024,B,USA
2020/01/16,I025,B,Mexico
2020/01/16,I026,C,Japan
2020/01/16,I027,B,Japan
2020/01/21,I028,C,Canada
2020/01/23,I029,A,USA
2020/01/31,I030,B,Mexico
2020/02/02,I031,B,USA
2020/02/02,I032,C,Japan
2020/02/06,I033,C,USA
2020/02/08,I034,C,Japan
2020/02/15,I035,C,USA
2020/02/19,I036,A,USA
2020/02/20,I037,A,Mexico
2020/02/22,I038,A,Mexico
2020/02/22,I039,A,Canada
2020/02/28,I040,B,USA
2020/02/29,I041,B,USA
2020/03/02,I042,A,Mexico
2020/03/03,I043,B,Mexico
2020/03/08,I044,C,USA
2020/03/08,I045,C,Canada
2020/03/11,I046,A,USA
2020/03/12,I047,B,USA
2020/03/12,I048,B,Japan
2020/03/12,I049,C,Japan
2020/03/13,I050,A,USA
2020/03/13,I051,B,Japan
2020/03/13,I052,A,USA
I'm interested in analyzing the count of issues, particularly across months and years. If I simply wanted to plot a chart of issues by date, that's pretty easy. But what if I want to calculate total issues per month and plot that, and perhaps do some analysis of trends? How would I go about calculating these sums per (say) month?
The best approach I could take so far is the following.
I create a new column, called YearMonth which looks like this:
YearMonth = FORMAT(Issues[Date],"YYYY/MM")
Then if I plot Axis = YearMonth vs Values = Count of IssueId, I get what I want.
But the biggest drawback here is that my X-axis is the newly created column, not the original Date column. Since my project has other data that I would like to analyze using the date as well, I would like for this to be using the actual Date instead of my custom column.
Is there a way for me to get this same result but without having to create a new column?
What you usually do is create a calendar table, which contains all the time-related columns (year, month, year-month, etc.) and is linked to your data table by date.
In your visuals, you then use the Calendar table's columns, without having to alter your original table. The calendar table will also be used by any other table that needs date-related data.
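A minimal sketch of such a calendar table in DAX (the table name Calendar and the exact column set are illustrative, not from the question):

```
Calendar =
ADDCOLUMNS (
    CALENDARAUTO (),
    "Year", YEAR ( [Date] ),
    "MonthNumber", MONTH ( [Date] ),
    "YearMonth", FORMAT ( [Date], "YYYY/MM" )
)
```

After creating it, mark it as a date table and relate Calendar[Date] to Issues[Date]; visuals can then slice the count of IssueId by any Calendar column while the Issues table keeps its original Date.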

In R, creating table and plot when some column data is zero

I'm learning to use R, and I want to do an analysis of my customers and the distribution of them into categories. This is an example of my data:
I want to know how many customers (cards) have tickets in A, B or C, and the amount spent in each of these categories, and then do the same thing for each type of segment.
I want to do a bar chart with the percentage of tickets in each category of all my customers and then 3 charts of each segment.
Here's an example of what I want to get: [example image]
The thing is that there are cards that have 0 tickets in that category.
I've been thinking it might work to add a column beside each category, for example: if the row has tickets in category A, put an A in that row. Would creating a column like that make this easier?
If you think that might work, how could I do that?
Sorry, I'm very new to R and I have a huge database :(
Thanks for your help!!
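Since the question's data isn't shown, here is a hedged sketch of one common pattern: count tickets and spend per card and category, and use tidyr::complete() to keep the zero-ticket combinations. The `tickets` frame and its column names (card, category, amount) are assumptions standing in for the real data:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row per ticket (column names are assumed)
tickets <- data.frame(
  card     = c("C1", "C1", "C2", "C3"),
  category = c("A", "B", "A", "A"),
  amount   = c(10, 20, 5, 8)
)

# Summarise per card and category, then fill in the missing
# card/category combinations with explicit zeros
summary_tbl <- tickets %>%
  group_by(card, category) %>%
  summarise(n_tickets = n(), total = sum(amount), .groups = "drop") %>%
  complete(card, category, fill = list(n_tickets = 0, total = 0))
```

With the zeros made explicit, percentages per category (overall or within each segment) follow directly, and the result can be fed to a ggplot2 bar chart without an extra indicator column.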

ISSP data: calculating percentage of respondent answers on a particular item

Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child<-data[data$"V33"=="Government agencies",]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix<-as.matrix(table(gov.child$C_ALPHAN))
So I now have a two-column matrix with just the two-letter country code (the C_ALPHAN code) and the number of people who answered “Government agencies.” But I want to know what percentage of respondents in those countries answered that way, so I need to divide this number by the total number of respondents for that country.
Is there some way (a function, maybe?) to, after adding a new column, tell R that for each row it should divide the number in column two by the total number of rows in the original data set that correspond to the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of making a data entry error, but maybe that's the best way.
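Since the per-country totals are already in the original data frame, there is no need to enter them by hand: build a second table of totals and divide the two, matching by country name. A sketch using the question's `data`/`gov.child` names (the stand-in rows here are invented):

```r
# Hypothetical stand-in for the ISSP data frame
data <- data.frame(
  C_ALPHAN = c("US", "US", "US", "DE", "DE"),
  V33 = c("Government agencies", "Parents", "Government agencies",
          "Government agencies", "Parents")
)
gov.child <- data[data$V33 == "Government agencies", ]

# Per-country totals and per-country "Government agencies" counts;
# index totals by the names of gov so the countries line up
totals <- table(data$C_ALPHAN)
gov <- table(gov.child$C_ALPHAN)
pct <- 100 * gov / totals[names(gov)]
```

Because both tables are built from the same C_ALPHAN column, dividing element-by-element after aligning on names gives the percentage of respondents per country, with no manual data entry.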

Line Graph for dates upon which variable a exists with variable b?

I'm new to stats, R, and programming in general, having only had a short course before being thrown in at the deep end. I am keen to work things out for myself, however.
My first task is to check the data I have been given for anomalies. I have been given a spreadsheet with columns Date, PersonID and PlaceID. I assumed that if I plotted each factor of PersonID against Date, a straight line would show that there were no anomalies, as PersonID should only be able to exist in one place at one time. However, I am concerned that if there are 2 of the same PersonID on one Date, my plot has no way of showing this.
I used the simple code:
require(ggplot2)
qplot(Date,PersonID)
My issue is that I am unsure of how to factor the Date into this problem. Essentially, I am trying to check that no PersonID appears in more than one PlaceID on the same Date, and having been trying for 2 days, cannot figure out how to put all 3 of these variables on the same plot.
I am not asking for someone to write the code for me. I just want to know if I am on the right train of thought, and if so, how I should think about asking R to plot this. Can anybody help me? Apologies if this question is rather long-winded, or posted in the wrong place.
If all you want to know is whether this occurs in the dataset, try duplicated(). For example, assuming your dataframe is called df:
sum(duplicated(df[, c("Date", "PersonID")]))
will return the number of duplicated rows based on the Date and PersonID columns. If it's greater than zero, you have duplicates in the data.
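To go beyond a yes/no answer and actually inspect the offending rows, the same call can be combined with fromLast = TRUE so that both the first occurrence and the repeats are flagged. A sketch with invented rows matching the question's columns:

```r
# Hypothetical data frame matching the question's columns
df <- data.frame(
  Date     = as.Date(c("2021-01-01", "2021-01-01", "2021-01-02")),
  PersonID = c(1, 1, 2),
  PlaceID  = c("A", "B", "A")
)

# Flag every row involved in a Date/PersonID duplicate (both the
# first occurrence and the later repeats), then look at them
dup <- duplicated(df[, c("Date", "PersonID")]) |
  duplicated(df[, c("Date", "PersonID")], fromLast = TRUE)
df[dup, ]  # the rows where one PersonID appears twice on one Date
```

The PlaceID column in the flagged subset then shows directly whether the same person was recorded in two different places on the same date, which is the anomaly being checked for.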