I have a huge dataset containing information about 1774 counties in the US. The variables include things like income quartile, voter preference, median household income, etc.
I would like to know whether there is a package that would let me quickly see, for example, the number of counties with income over a certain threshold that voted Republican, or the number of counties where more than 50% of workers are in services while the average educational attainment is HS or lower.
I know that I can do this with dplyr functions; however, that becomes extremely time-consuming when I want to do it across large numbers of variables.
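For instance, the kind of dplyr counting I mean looks roughly like this (a sketch; the data frame counties and all column names are hypothetical):

library(dplyr)

# hypothetical column names, for illustration only
counties %>%
  summarise(
    high_income_rep = sum(median_income > 60000 & party == "Republican", na.rm = TRUE),
    services_low_ed = sum(pct_services > 50 & education %in% c("HS", "Less than HS"), na.rm = TRUE)
  )

Writing one such line per combination of conditions gets tedious quickly.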
Thank you for any recommendations!
I recommend you try the explore package.
While you can use it manually to explore specific parts of your dataset, it has additional features to explore data interactively via shiny (explore_shiny) and to generate a report of your entire dataset via rmarkdown (report).
Exploring pairs of variables (e.g. income by party voted for) is possible by specifying one variable as the target and selecting the second variable, but it won't always give you the comparison you need. Hence I would recommend the explore package as a starting point for understanding your data; for specific analyses you will probably need to write your own dplyr, ggplot2, and/or plotly code (or whichever other packages you favour).
Further worked examples are found in its vignette.
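To give a flavour, here is a minimal sketch of those entry points (the data frame counties and its columns are hypothetical; check each function's help page for the exact arguments):

library(explore)

explore_shiny(counties)                           # interactive exploration in the browser
report(counties, output_dir = tempdir())          # rmarkdown report of the whole dataset
explore(counties, median_income, target = party)  # one variable split by a target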
I have a dataset where I have a simple male/female breakdown, a category (say A, B, or C), some kind of location to give me more data points, and then a count for each one, e.g.:
[basic sample table omitted]
Obviously, performing any kind of analysis on this is a bit meaningless at the moment, as the number of males is far higher than the number of females: as it currently stands, 7 males is a far smaller share of all males than 7 females is of all females. The examples I can find online for standardising counts are a bit too simple and transform the whole dataset wholesale rather than breaking it down by category. I am looking to do this in R to give me more options when it comes to analysing larger things, and I am frustratingly still waiting for my R training!
I have tried this manually and using tutorials online, but they are too basic for my data.
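One simple approach, sketched below under the assumption of a data frame df with columns sex, category, location, and count, is to convert the raw counts into within-sex proportions so the two groups become comparable:

library(dplyr)

df %>%
  group_by(sex) %>%                      # or group_by(sex, location) for a finer split
  mutate(prop = count / sum(count)) %>%  # each count as a share of that sex's total
  ungroup()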
I wish to share a dataset (largely time-series data) with a group of data scientists so they can explore the statistical relationships within the data (e.g. between variables). However, for confidentiality reasons I am unable to share the original dataset, so I was wondering whether I could transform the data with some random transformation that I know but that the recipients won't. Is this a common practice? Is there an associated R package?
I have been exploring the use of synthetic datasets and have looked at 'synthpop', but my challenge seems slightly different. For example, I don't necessarily want the data to include fictional individuals that resemble the original file. Rather, I'd prefer the value associated with a specific variable to be unclear (e.g. still numerical but nonsensical) to the human viewer while still enabling statistical analysis (e.g. despite the actual values being obscured, the relationships between variables 'x' and 'y' remain the same).
I have a feeling that this is probably quite a simple process (e.g. change the names of variables, apply the same transformation across all variables), but I'm not a mathematician/statistician, and I don't want to violate the underlying relationships through an inappropriate transformation.
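For what it's worth, one property you can lean on: a secret positive linear rescaling of each numeric column leaves Pearson correlations between columns exactly unchanged. A minimal sketch, assuming an all-numeric data frame numeric_df:

# secret scale and shift per column; with a positive scale,
# cor(a*x + b, c*y + d) == cor(x, y), so relationships survive
mask_column <- function(x) {
  a <- runif(1, min = 0.5, max = 2)    # secret positive scale
  b <- runif(1, min = -10, max = 10)   # secret shift
  a * x + b
}

masked <- as.data.frame(lapply(numeric_df, mask_column))
# cor(masked) matches cor(numeric_df)

Note this hides units and scale but not distributional shape; any monotone transformation would likewise preserve rank-based (Spearman) relationships, though not necessarily Pearson ones.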
Thanks!
I am relatively new to using predictive modeling and would like some brainstorming help/assessment of feasibility.
I currently have the following variables in the data set for 2018-present, with one row per order:
date
day of week
item category
order id
lat / long for shipping address
I would like to predict weekly sales for the remaining weeks of this year BY item category. I am most comfortable using R at the moment.
What algorithm/package would you recommend I look into given that I would like to predict weekly sales volume by category?
The shortest answer is to start with the set of tidyverse packages. group_by() from dplyr is very powerful for computing values by some factor. It sounds like you already have your data in a tidy form, which works best with the tidyverse framework, as it lets you easily vectorize operations over a data.frame. Check out the main packages they have to offer and their overviews here. Start with simpler models like lm() and then, if the need arises, continue with more advanced ones. Which of the variables are you going to use as predictors?
No matter which model you choose, once you have built an appropriate one you can use the built-in predict() together with group_by(). More details on basic prediction here.
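A sketch of that workflow under stated assumptions (orders is a hypothetical one-row-per-order data frame with columns date and item_category; the linear trend and 12-week horizon are purely illustrative, and an interaction term stands in for fitting one model per group):

library(dplyr)
library(lubridate)

# aggregate one-row-per-order data to weekly sales per category
weekly <- orders %>%
  mutate(week = floor_date(date, unit = "week")) %>%
  count(item_category, week, name = "sales")

# a simple linear trend as a baseline;
# the interaction lets each category have its own slope
fit <- lm(sales ~ week * item_category, data = weekly)

# score future weeks for every category
future <- expand.grid(
  week = seq(max(weekly$week) + 7, by = "week", length.out = 12),
  item_category = unique(weekly$item_category)
)
future$predicted_sales <- predict(fit, newdata = future)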
By the way, I can't see the data set you are talking about, only its description. Could you provide a link to a representative sample? That would allow me to offer deeper insight.
At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R feature where you can plot every column of a data frame against every other column, i.e. the pairs plot you get by calling plot() on a data frame.
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (since there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact, you can't see anything in them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like a chart where you could see immediately that the average age of newly issued policies increased from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset I put in, because then I can do it faster manually in Excel.
So my question is: is there an easy way to plot each column of a data frame against every other column, with more flexible chart types than the standard plot(data.frame)?
Try the ggpairs() function from the GGally package. It can visualize columns of many different types and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)

data(tips, package = "reshape")
ggpairs(tips)
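And to go beyond the defaults, for example, you can colour panels by a factor that exists in tips and swap the lower continuous panels for smoothers:

library(ggplot2)

ggpairs(tips, mapping = aes(colour = sex),
        lower = list(continuous = "smooth"))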
We are building an R package to assess income inequality in survey data: https://github.com/DjalmaPessoa/convey
I have about ten different public use microdata sets here - https://github.com/ajdamico/asdfree - that I would like to provide usage examples for, with dataset-specific vignettes. For the most part the vignettes will be exactly the same; I just want a distinct vignette available for each of the different surveys. I only want a handful of things to vary across the vignettes, like:
title/vignette name, survey load script, survey design object name, income variable within dataset
The package we are writing has about a dozen different functions that I would like to demonstrate in a (seemingly, to the user) dataset-specific manner across ten different public use files, but I don't want to write ten nearly identical vignettes.
Is there a reasonable way to write a single document that will then auto-generate the specific vignettes? I am thinking of something like a Microsoft mail merge but am not sure what the equivalent would be here. Thanks!
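One possibility worth checking is knitr's templating helper knit_expand(), which fills {{placeholders}} in a template file; a sketch, where the metadata values, template path, and placeholder names are all hypothetical:

library(knitr)

# one row per survey; these values are placeholders for illustration
surveys <- data.frame(
  name   = c("survey_a", "survey_b"),
  design = c("survey_a_design", "survey_b_design"),
  income = c("income_var_a", "income_var_b"),
  stringsAsFactors = FALSE
)

# vignettes/template.Rmd would use {{name}}, {{design}}, {{income}}
# wherever the vignettes differ
for (i in seq_len(nrow(surveys))) {
  rmd <- knit_expand(file = "vignettes/template.Rmd",
                     name   = surveys$name[i],
                     design = surveys$design[i],
                     income = surveys$income[i])
  writeLines(rmd, file.path("vignettes", paste0(surveys$name[i], ".Rmd")))
}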