I want to analyse visitor data from a website. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it as best I can.
I basically have three columns:
date: the date (YYYY-MM-DD)
url: the full URL of the page
views: the number of visits for that URL on that day
What I want to do is categorize the data based on the URL by making new columns. To take Stack Overflow as an example, I have URLs like:
stackoverflow.com/questions
stackoverflow.com/jobs
stackoverflow.com/users
For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using this, which I found in another topic here.
df <- df %>%
  mutate(Main_cat = case_when(
    grepl("flow.com/questions", url) ~ "Questions",
    grepl("flow.com/jobs", url) ~ "Jobs",
    grepl("flow.com/users", url) ~ "Users"
  )) %>%
  mutate(Main_cat = as.factor(Main_cat))
This works decently, though not great. The number of main categories I'm working with is about twelve, and my full dataset is about 220,000 observations, so processing in a set-up like this takes a while. It feels like I'm working very inefficiently.
In addition, I'm working with sub-categories based on countries:
stackoverflow.com/jobs/belgium
stackoverflow.com/jobs/brazil
stackoverflow.com/jobs/china
stackoverflow.com/jobs/germany
stackoverflow.com/jobs/france
These I want to split into new variables like Continent and Country, since the countries themselves have further subdivisions (...jobs/belgium/retail, ...jobs/belgium/it). In the end I would like to filter my data by country, by sector, or both using filter() and then perform an analysis.
I can use the mutate/case_when/grepl approach for all of the above, but judging from how long R takes to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
Hope this is clear enough, thanks in advance!
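One way to cut the processing time is to split the URL path once and derive every category from that single split, instead of running a dozen grepl() passes over all 220,000 rows. A minimal sketch, assuming the domain/category/country/sector layout described in the question (the Country and Sector column names are illustrative):

library(dplyr)
library(tidyr)
library(stringr)

df <- df %>%
  # drop everything up to and including the domain
  mutate(path = str_remove(url, "^.*?stackoverflow\\.com/")) %>%
  # one split instead of twelve grepl() passes
  separate(path, into = c("Main_cat", "Country", "Sector"),
           sep = "/", fill = "right", extra = "drop") %>%
  mutate(Main_cat = factor(str_to_title(Main_cat)))

Each URL is scanned once rather than once per category, so the cost grows with the number of rows instead of rows times categories, and filter(Country == "belgium", Sector == "retail") then works directly on the new columns.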
I have an 'issue' data set in CSV format that looks like this.
Date,IssueId,Type,Location
2019/11/02,I001,A,Canada
2019/11/02,I002,A,USA
2019/11/11,I003,A,Mexico
2019/11/11,I004,A,Japan
2019/11/17,I005,B,USA
2019/11/20,I006,C,USA
2019/11/26,I007,B,Japan
2019/11/26,I008,A,Japan
2019/12/01,I009,C,USA
2019/12/05,I010,C,USA
2019/12/05,I011,C,Mexico
2019/12/13,I012,B,Mexico
2019/12/13,I013,B,USA
2019/12/21,I014,C,USA
2019/12/25,I015,B,Japan
2019/12/25,I016,A,USA
2019/12/26,I017,A,Mexico
2019/12/28,I018,A,Canada
2019/12/29,I019,B,USA
2019/12/29,I020,A,USA
2020/01/03,I021,C,Japan
2020/01/03,I022,C,Mexico
2020/01/14,I023,A,Japan
2020/01/15,I024,B,USA
2020/01/16,I025,B,Mexico
2020/01/16,I026,C,Japan
2020/01/16,I027,B,Japan
2020/01/21,I028,C,Canada
2020/01/23,I029,A,USA
2020/01/31,I030,B,Mexico
2020/02/02,I031,B,USA
2020/02/02,I032,C,Japan
2020/02/06,I033,C,USA
2020/02/08,I034,C,Japan
2020/02/15,I035,C,USA
2020/02/19,I036,A,USA
2020/02/20,I037,A,Mexico
2020/02/22,I038,A,Mexico
2020/02/22,I039,A,Canada
2020/02/28,I040,B,USA
2020/02/29,I041,B,USA
2020/03/02,I042,A,Mexico
2020/03/03,I043,B,Mexico
2020/03/08,I044,C,USA
2020/03/08,I045,C,Canada
2020/03/11,I046,A,USA
2020/03/12,I047,B,USA
2020/03/12,I048,B,Japan
2020/03/12,I049,C,Japan
2020/03/13,I050,A,USA
2020/03/13,I051,B,Japan
2020/03/13,I052,A,USA
I'm interested in analyzing the count of issues, particularly across months and years. If I simply want to plot a chart of issues by date, that's pretty easy. But what if I want to calculate total issues per month and plot that, and perhaps do some trend analysis? How would I go about calculating these counts per (say) month for analysis?
The best approach I've come up with so far is the following. I create a new column, called YearMonth, which looks like this:
YearMonth = FORMAT(Issues[Date],"YYYY/MM")
Then if I plot Axis = YearMonth vs Values = Count of IssueId, I get what I want.
But the biggest drawback here is that my X-axis is the newly created column, not the original Date column. Since my project has other data that I would like to analyze by date as well, I would like this to use the actual Date instead of my custom column.
Is there a way for me to get this same result but without having to create a new column?
What you usually do is create a calendar table, which will contain all the time-related columns (year, month, year-month, etc) and then link it to your data by date.
In your visuals, you will then use the Calendar table's columns, without having to alter your original table. The calendar table will also be used by any other table that needs date-related data.
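As a sketch, a calculated calendar table in DAX could look like the following; CALENDARAUTO() derives the date range from the dates already in the model, and the extra column names here are just examples:

Calendar =
ADDCOLUMNS (
    CALENDARAUTO (),
    "Year", YEAR ( [Date] ),
    "Month", MONTH ( [Date] ),
    "YearMonth", FORMAT ( [Date], "YYYY/MM" )
)

Once Calendar[Date] is related to Issues[Date], you can put Calendar[YearMonth] on the axis with Count of IssueId as the value, and every other date-keyed table can reuse the same Calendar table.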
Probably a pretty basic question, and hopefully one not repeated elsewhere. I’m looking at some ISSP survey data in R, and I made a separate data frame for respondents who answered “Government agencies” on one of the questions:
gov.child <- data[data$V33 == "Government agencies", ]
Then I used the table function to see how many total respondents answered that way in each country (C_ALPHAN is the variable name for country):
table(gov.child$C_ALPHAN)
Then I made a matrix of this table:
gov.child.matrix<-as.matrix(table(gov.child$C_ALPHAN))
So I now have a matrix with the two-letter country codes (the C_ALPHAN codes) as row names and the number of people who answered "Government agencies" as values. But I want to know what percentage of respondents in those countries answered that way, so I need to divide this number by the total number of respondents for each country.
Is there some way (a function, maybe?) to tell R that, for each row, it should divide the number in column two by the total number of rows in the original data set that correspond to the country code in column one (i.e., the n for that country)? Or should I just manually make a vector with the n for each country, which is available on the ISSP website, and add it to the matrix? I'm loath to do that because of the possibility of a data-entry error, but maybe that's the best way.
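You don't need to type the country totals in by hand; R can compute them from the original data frame. A minimal sketch, assuming data still holds the full sample and the variables are named as above:

# two-way table: rows = country, columns = answer to V33
tab <- table(data$C_ALPHAN, data$V33)

# turn each row into proportions, then keep the answer of interest
pct.gov <- prop.table(tab, margin = 1)[, "Government agencies"]

prop.table(tab, margin = 1) divides every cell by its row total, so pct.gov holds the share of each country's respondents who chose "Government agencies", with no manual entry and therefore no transcription errors.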
I'm new to stats, R, and programming in general, having only had a short course before being thrown in at the deep end. I am keen to work things out for myself, however.
My first task is to check the data I have been given for anomalies. I have been given a spreadsheet with columns Date, PersonID and PlaceID. I assumed that if I plotted each factor of PersonID against Date, a straight line would show that there were no anomalies, as PersonID should only be able to exist in one place at one time. However, I am concerned that if there are 2 of the same PersonID on one Date, my plot has no way of showing this.
I used this simple code:
library(ggplot2)
qplot(Date, PersonID, data = df)  # df is the data frame read from the spreadsheet
My issue is that I am unsure how to factor the Date into this problem. Essentially, I am trying to check that no PersonID appears in more than one PlaceID on the same Date, and having tried for two days, I cannot figure out how to put all three of these variables on the same plot.
I am not asking for someone to write the code for me. I just want to know if I am on the right train of thought and, if so, how I should think about asking R to plot this. Can anybody help? Apologies if this question is rather long-winded, or posted in the wrong place.
If all you want to know is whether this occurs in the dataset, try duplicated(). For example, assuming your data frame is called df:
sum(duplicated(df[, c("Date", "PersonID")]))
will return the number of duplicated Date/PersonID pairs in the data frame. If it is greater than zero, you have duplicates in the data.
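If you also want to see which records are involved, rather than just count them, a grouped filter does the precise check (a sketch with dplyr, again assuming the data frame is called df):

library(dplyr)

# keep only the rows where a PersonID is recorded in more than
# one PlaceID on the same Date
conflicts <- df %>%
  group_by(Date, PersonID) %>%
  filter(n_distinct(PlaceID) > 1) %>%
  ungroup()

If conflicts comes back empty, no person appears in two places on the same date, which is exactly the anomaly you set out to rule out.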