I am trying to learn some big data technologies, and all the examples/tutorials I come across calculate a word count. Is anyone familiar with other examples or datasets I can run counts on? Something a little more exciting than counting the words in a book.
Here are some examples:
Visit https://www.kaggle.com/datasets, find the City of LA Citations dataset, and figure out when tickets are issued most often, how large the fines are, etc.
Get a bunch of storm events from NCDC and rank the states by number of tornadoes in the last 10 years, then see where those same states ranked in the prior 10 years (see the sketch after this list).
Visit the City of Chicago website and download the crime database. Compare the stats with those published in newspapers.
Find the Mueller report and determine what percentage of it was redacted.
Get a UFO sightings database and find the city with the most sightings lasting over 3 minutes.
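For instance, with the NCDC storm-events files the word-count pattern becomes a count of tornado events per state. A rough sketch in R, where the file name and the STATE / EVENT_TYPE column names are assumptions, so check the file you actually download:
# Count tornado events per state from an NCDC storm-events CSV.
# The file name and the STATE / EVENT_TYPE column names are assumptions;
# check the headers of the file you actually download.
library(dplyr)

storms <- read.csv("StormEvents_details_2012.csv", stringsAsFactors = FALSE)

tornado_counts <- storms %>%
  filter(EVENT_TYPE == "Tornado") %>%
  count(STATE, sort = TRUE)        # one row per state, ordered by count

head(tornado_counts, 10)           # top 10 states by tornado count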
Problem: You are planning an around-the-world trip with your two best friends for the summer. There is a total of n cities that the three of you want to visit. Because you are worried about time zones and airport access, some cities can only be visited after another city has been visited first (one in a nearby time zone, or one with an airport). These dependencies are expressed as a list of pairs (cityX, cityY), meaning cityX can only be visited after visiting cityY.
Given the total number of cities and a list of dependency pairs, is it possible for you all to visit all the cities?
Your task is to write the function can_visit_all_cities, which determines whether visiting all n cities is possible given the dependencies.
Requirements
• Must run in O(m + n) and cannot use Python's built-in set/dictionary
This sounds like a dependency graph. I don't know if Python has a built-in data structure for this. If you were to implement one on your own, you'd have to build it out of lists, though, since the built-in set/dictionary types are ruled out.
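In practice this boils down to checking the dependency graph for a cycle, e.g. with Kahn's algorithm: repeatedly "visit" cities that have no unmet prerequisites and see whether every city eventually gets visited. A rough sketch of that idea using only plain vectors and lists (written in R rather than the Python the assignment asks for, so treat it purely as an illustration of the algorithm, not as the required solution):
# Kahn's algorithm: repeatedly remove cities with no unmet dependencies.
# If every city can be removed, there is no cycle and the trip is possible.
# Cities are assumed to be numbered 1..n; deps is a list of c(cityX, cityY)
# pairs, meaning cityX can only be visited after cityY.
can_visit_all_cities <- function(n, deps) {
  indegree  <- integer(n)          # unmet prerequisites per city
  adjacency <- vector("list", n)   # adjacency[[y]] = cities unlocked by visiting y
  for (pair in deps) {
    x <- pair[1]; y <- pair[2]
    indegree[x] <- indegree[x] + 1
    adjacency[[y]] <- c(adjacency[[y]], x)
  }
  queue   <- which(indegree == 0)  # cities visitable right away
  visited <- 0
  while (length(queue) > 0) {
    city  <- queue[1]
    queue <- queue[-1]
    visited <- visited + 1
    for (nxt in adjacency[[city]]) {
      indegree[nxt] <- indegree[nxt] - 1
      if (indegree[nxt] == 0) queue <- c(queue, nxt)
    }
  }
  visited == n                     # TRUE only if no cycle blocks a city
}

can_visit_all_cities(3, list(c(2, 1), c(3, 2)))  # TRUE: visit 1, then 2, then 3
can_visit_all_cities(2, list(c(1, 2), c(2, 1)))  # FALSE: circular dependency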
I am new to R programming and I have to do some kind of POC (Proof of Concept) work at my workplace, in which I have to predict the gross sales for the next months for customers based on Product and Territory (Sales channel). Here's how my data looks:
The data runs until March 2017. Both Product and Sales Channel are categorical with 2 levels each, and Sold_Customer is categorical with 66 levels.
I have searched the internet a lot and came across "Forecasting in R" and other related material, but all of it uses data in time-series form. My data is not in that format, and I don't know how to use those methods because most of my columns are character type. If anyone can help me out and guide me on how to approach this particular problem, I would be really thankful.
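One way to get started (a sketch only; the object and column names sales_data, Month, Product, Sales_Channel and Gross_Sales are assumptions about the data described above) is to aggregate each Product/Sales Channel slice into a monthly ts object and hand it to the forecast package:
# Sketch: turn one Product / Sales Channel slice into a monthly time series
# and forecast it. All object names, column names and filter values here are
# assumptions about the data described in the question.
library(dplyr)
library(forecast)

monthly <- sales_data %>%
  filter(Product == "ProductA", Sales_Channel == "Direct") %>%
  group_by(Month) %>%                 # Month assumed sortable, e.g. "2017-03"
  summarise(Gross_Sales = sum(Gross_Sales)) %>%
  arrange(Month)

y   <- ts(monthly$Gross_Sales, end = c(2017, 3), frequency = 12)  # data runs to March 2017
fit <- auto.arima(y)
forecast(fit, h = 3)                  # forecast the next 3 months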
I have recently taken over a web development project for a local car rental company and need help finding out how to calculate the Daily, Weekly, and Monthly cost of a vehicle.
The previous developer used a plugin that allowed you to create "Pricing Schemes" where you define a day range and its price:
19.99/day, 99.99/week, 299.99/month:
Day 1-5 = $19.99
Day 5-6 = $16.665
Day 6-7 = $14.284
Day 7-8 = $14.9975
and so on...
Sadly, the developer left no notes on how he got these numbers, and each pricing scheme he made only extends to the 31st day, which causes an issue when a user wants to rent a car for longer than a month (which is common).
What I need help finding out is the equation he used to get these numbers so I can add on to the pricing schemes and create others if the need arises. I will add a screenshot of a full pricing scheme for reference below.
Any help with this would be greatly appreciated and I will be available to answer any questions if my question is not clear enough. Thank you!
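For what it's worth, one rule that reproduces all four sample values (a guess, not something confirmed by the plugin's documentation): charge the cheapest combination of daily, weekly and monthly blocks that covers the rental, then divide by the number of days. That gives 99.99 / 6 = 16.665, 99.99 / 7 ≈ 14.284, and (99.99 + 19.99) / 8 = 14.9975, matching the scheme above. A quick R sketch to check it:
# Guess at the scheme: cheapest mix of daily/weekly/monthly blocks that
# covers n days, divided by n. The rates come from the question; the rule
# itself is only inferred from the four sample values.
per_day_rate <- function(n, daily = 19.99, weekly = 99.99, monthly = 299.99) {
  best <- c(0, rep(Inf, n))          # best[k + 1] = cheapest total for k days
  for (k in 1:n) {
    best[k + 1] <- min(best[k] + daily,                     # add one day
                       best[max(k - 7, 0) + 1] + weekly,    # add one week
                       best[max(k - 30, 0) + 1] + monthly)  # add one month
  }
  best[n + 1] / n
}

sapply(c(5, 6, 7, 8), per_day_rate)  # 19.99, 16.665, 14.284..., 14.9975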
I am looking to incorporate a loop in R which goes through every game's boxscore data on the NFL statistics website here: http://www.pro-football-reference.com/years/2012/games.htm
At the moment I am having to manually click on the "boxscore" link for every game every week; is there any way to automate this in R? My code works with the full play-by-play dataset within each link; it's taking me ages at the moment!
Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear. While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable.
require(RCurl)
require(XML)
bdata<-getURL('http://www.pro-football-reference.com/years/2012/games.htm')
bdata<-htmlParse(bdata)
boxdata <- xpathSApply(bdata, '//a[contains(@href,"boxscore")]', xmlAttrs)[-1]
The above will get the boxscore URL stems for the various games.
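From there the loop is just an lapply over those stems; a minimal sketch, assuming you plug your existing full play-by-play extraction in where the comment indicates:
# Fetch and parse each boxscore page in turn, then apply your existing
# play-by-play code to each parsed page.
base_url <- 'http://www.pro-football-reference.com'

boxscores <- lapply(boxdata, function(stem) {
  page <- htmlParse(getURL(paste0(base_url, stem)))
  Sys.sleep(1)   # be gentle with the server between requests
  page           # replace this line with your play-by-play extraction
})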
Suppose I want to regress Gross Profit on Total Revenue in R. I need data for this, and the more, the better.
There is a library on CRAN that I find very useful, quantmod, which does what I need:
library(quantmod)
getFinancials("AMD", src = "google")
# to get the row names of the matrix: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit <- AMD.f$IS$A["Gross Profit", ]
# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue that I have is that this library gets me data only for 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other libraries) that would get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector: "sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions. Pretty much every year some accounting regulation changes and breaks your data series: last year minority interests were reported in one line, this year the item moves to another position in the P&L, and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals, for example), but this just adds another layer of abstraction, and typically it is hard to see how the data has been manipulated.
In short, getting the data is onerous, and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogeneous group of companies, this is a can of worms even if you do have the data.