R: Creating Tidy df based on URL

R: Creating Tidy df based on URL - r

I want to analyse data from a website regarding visitors. Unfortunately, I'm not sure if I can post the df publicly, so I'll describe it the best I can.
I basically have three columns:
date: containing the date (YYYY-MM-DD),
url: Containing the full url of the page
views: The number of visits for that url for that day
What I want to is to categorize the data based on the url, by making new columns. To take stackoverflow as an example, I have urls like:
stackoverflow.com/questions
stackoverflow.com/job
stackoverflow.com/users
For these I want to create a new categorical variable 'Main_cat' with 'Questions', 'Jobs' and 'Users' as its levels. For that I'm currently using this, which I found in another topic here.
df <- df %>%
mutate(Main_cat= case_when(
grepl(".*flow.com/questions.*", url) ~ "Questions",
grepl(".*flow.com/jobs.*", url) ~ "Jobs",
grepl(".*flow.com/users.*", url) ~ "Users")) %>% mutate(Main_cat = as.factor(Main_cat))
This works decently, though not great. The number of main categories I'm working with is about twelve. My full dataset is about 220.000 observations. So processing in a set-up like this takes a while. Feels like I'm working very inefficient.
In addition I'm working with sub-categories based on countries:
stackoverflow.com/job/belgium
stackoverflow.com/job/brazil
stackoverflow.com/job/china
stackoverflow.com/job/germany
stackoverflow.com/job/france
These I want to divide by new variables like Continent, and Country, since also the countries have subdivisions (...job/belgium/retail, ...job/belgium/it). In the end I would like to sort my data by country, or by sector or both using filter() and then perform an analysis.
I can use the mutate/case_when/grepl for all of the above, but judging from how long it takes R to finish, something doesn't seem right. I'm hoping there's a better way that takes less time to process.
Hope this is clear enough, thanks in advance!

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst trying to do the code I have restricted myself to just 3 years. The illustration of the structure is based on this test). The number of years captured will change overtime- generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks

How to get the top most number of distinct values from a dataset

I am playing around with the Los Angeles Police Data that I got via the Office of the Mayor's website. From 2017-2018, I am attempting to see what charges and the amount of each specific charge were given out in Council District 5. CHARGE and CITY_COUNCIL_DIST are the two variables/columns I am looking at.
I used table(ArrestData$CHARGE) to count the number of distinct values.
I realized that there are over 2400 unique entries, therefore most of the entries are being omitted. I am wondering if there is code to see which 5 "CHARGES" are being mostly given out by the LAPD.
Additionally, I am attempting to find the top 5 charges in one specific Council District (again, another variable/column), is there code for this?
Aside:
How can I add sample data to my post? What are the steps to do so on RStudio?
Someone asked me to do this in a previous post, but I am not sure how to do so. They told me to use dput(head(df,n)) but my data is too large, even with using 10 rows. They told me to do it through RScript, but I am not sure what they mean

I think that using an aggregate function may help here. If your data is just CHARGE and CITY_COUNCIL_DIST, then the code might look something like this:
aggregate(.~CITY_COUNCIL_DIST + CHARGE, ArrestData, count)
I'm not terribly advance at R yet, so that code might need some tweaks with your actual data. Once you have the aggregate, you can order your data:
agg.data[order(agg.data, descending=TRUE),]
I'm really no help with dput, sorry!

Posting a reference to the actual dataset/sample data will be helpful to creating a solution. This will help the post adhere to the reproducibility standards that others have mentioned. For the sake of this example we will explicitly create a dataset.
ArrestData <- data.frame(
CHARGE=c("CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA",
"CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA","CHARGEA",
"CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB",
"CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB","CHARGEB",
"CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC",
"CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC","CHARGEC",
"CHARGED","CHARGED","CHARGED","CHARGED","CHARGED","CHARGED",
"CHARGED","CHARGED","CHARGED","CHARGED","CHARGED","CHARGED",
"CHARGEE","CHARGEE","CHARGEE","CHARGEE","CHARGEE",
"CHARGEE","CHARGEE","CHARGEE","CHARGEE","CHARGEE",
"CHARGEF","CHARGEF","CHARGEF","CHARGEF",
"CHARGEF","CHARGEF","CHARGEF","CHARGEF",
"CHARGEG","CHARGEG","CHARGEG",
"CHARGEG","CHARGEG","CHARGEG",
"CHARGEH","CHARGEH",
"CHARGEH","CHARGEH",
"CHARGEI",
"CHARGEI"
),
CITY_COUNCIL_DIST=c(0,5)
)
This code should work, assuming that your dataset is named ArrestData and your CHARGE/CITY_COUNCIL_DIST are also named as stated. The below code will include the top 5 CHARGE's by CITY_COUNCIL_DIST for all CITY_COUNCIL_DIST.
#install these packages if you do not have them
install.packages("magrittr")
install.packages("dplyr")
#make sure these libraries are present
library(magrittr)
library(dplyr)
ArrestData %>%
group_by(CHARGE, CITY_COUNCIL_DIST) %>%
summarize(count=n()) %>%
arrange(CITY_COUNCIL_DIST, desc(count)) %>%
group_by(CITY_COUNCIL_DIST) %>%
mutate(rank = rank(desc(count), ties.method="min")) %>%
filter(rank<=5)
In order to filter out only the results for CITY_COUNCIL_DIST 5, you will need to change the filter statement to something like the following:(depending on what your CITY_COUNCIL_DIST values actually are)
filter(rank<=5, CITY_COUNCIL_DIST==5)

Assigning observation name to a value when retrieving a variable

I want to create a dataframe that contains > 100 observations on ~20 variables. Now, this will be based on a list of html files which are saved to my local folder. I would like to make sure that are matches the correct values per variable to each observation. Assuming that R would use the same order of going through the files for constructing each variable AND not skipping variables in case of errors or there like, this should happen automatically.
But, is there a "save way" to this, meaning assigning observation names to each variable value when retrieving the info?
Take my sample code for extracting a variable to make it more clear:
#Specifying the url for desired website to be scrapped
url <- 'http://www.imdb.com/search/title?
count=100&release_date=2016,2016&title_type=feature'
#Reading the HTML code from the website
webpage <- read_html(url)
title_data_html <- html_text(html_nodes(webpage,'.lister-item-header a'))
rank_data_html <- html_text(html_nodes(webpage,'.text-primary'))
description_data_html <- html_text(html_nodes(webpage,'.ratings-bar+ .text-
muted'))
df <- data.frame(title_data_html, rank_data_html,description_data_html)
This would come up with a list of rank and description data, but no reference to the observation name for rank or description (before binding it in the df). Now, in my actual code one variable suddenly comes up with 1 value too much, so 201 descriptions but there are only 200 movies. Without having a reference to which movie the description belongs, it is very though to see why that happens.
A colleague suggested to extract all variables for 1 observation at a time and extend the dataframe row-wise (1 observation at a time), instead of extending column-wise (1 variable at a time), but spotting errors and clean up needs per variable seems way more time consuming this way.
Does anyone have a suggestion of what is the "best practice" in such a case?
Thank you!

I know it's not a satisfying answer, but there is not a single strategy for solving this type of problem. This is the work of web scraping. There is no guarantee that the html is going to be structured in the way you'd expect it to be structured.
You haven't shown us a reproducible example (something we can run on our own machine that reproduces the problem you're having), so we can't help you troubleshoot why you ended up extracting 201 nodes during one call to html_nodes when you expected 200. Best practice here is the boring old advice to LOOK at the website you're scraping, LOOK at your data, and see where the extra or duplicate description is (or where the missing movie is). Perhaps there's an odd element that has an attribute that is also matching your xpath selector text. Look at both the website as it appears in a browser, as well as the source. Right click, CTL + U (PC), or OPT + CTL + U (Mac) are some ways to pull up the source code. Use the search function to see what matches the selector text.
If the html document you're working with is like the example you used, you won't be able to use the strategy you're looking for help with (extract the name of the movie together with the description). You're already extracting the names. The names are not in the same elements as the descriptions.

Having trouble figuring out how to approach this exercise #R scraping #extracting web data

So, sometimes I need to get some data from the web organizing it into a dataframe and waste a lot of time doing it manually. I've been trying to figure out how to optimize this proccess, and I've tried with some R scraping approaches, but couldn't get to do it right and I thought there could be an easier way to do this, can anyone help me out with this?
Fictional exercise:
Here's a webpage with countries listed by continents: https://simple.wikipedia.org/wiki/List_of_countries_by_continents
Each country name is also a link that leads to another webpage (specific of each country, e.g. https://simple.wikipedia.org/wiki/Angola).
I would like as a final result to get a data frame with number of observations (rows) = number of countries listed and 4 variables (colums) as ID=Country Name, Continent=Continent it belongs to, Language=Official language (from the specific webpage of the Countries) and Population = most recent population count (from the specific webpage of the Countries).
Which steps should I follow in R in order to be able to reach to the final data frame?

This will probably get you most of the way. You'll want to play around with the different nodes and probably do some string manipulation (clean up) after you download what you need.

Plot top X categories in R/ggplot2

This is very similar to the question here:
How to use ggplot to group and show top X categories?
Except in my case I don't have a discrete value to go on. I've got data about users posting messages to a user forum. Similar to:
Year, Month, Day, User, Message
I've got an entry for every single message a person posted and I want to plot the top 5 users per year in terms of total Messages posted. In the previous question there was a distinct list of values that could be keyed off of.
In my case, I'm curious if I can do it easily in ggplot2, or if I need to do something like:
Load the data into a dataframe
Construct a new dataframe which is the same data collapsed & summarized by year
Plot from the new frame using the same approach as the previous question
If this is the best way to do it, what's the "correct" way to do #2? That new dataframe should probably be of the form:
Year, User, Total number of Messages
any help is appreciated.

Based on Joran's comment, I found this plyr approach:
ddply(posts, .(year, poster), summarise, freq=length(year))
Which gives me the posts per year per user. From there I can trim it down as suggested in other posts to get the top X posters per year.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex