Merging a larger dataset with a smaller dataset [closed] - r

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed yesterday.
I have a larger dataset in which each US county (the cross-sectional unit) has multiple observations per year, and a smaller (core) dataset with one observation per county per year.
The larger dataset lists the specific events (by name) that occurred in each county in each year, and the smaller dataset contains county-year crime data. Ideally, I would like to create a function that runs through each county-year and counts the total number of events for that particular county-year observation.
Once this is done, I will have a dataset with one row per county-year observation and the total number of events that occurred in that county in that year. I can then merge this dataset with my crime data using the FIPS code.
merge(data_1, data_2, by = "fips")
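A sketch of the count-then-merge step in base R, using toy stand-ins with hypothetical data frame and column names (`events`, `crime`, `fips`, `year`):

```r
# One row per event in `events`, one row per county-year in `crime`
events <- data.frame(
  fips  = c("01001", "01001", "01003", "01001"),
  year  = c(2000, 2000, 2000, 2001),
  event = c("flood", "fire", "storm", "flood")
)
crime <- data.frame(
  fips   = c("01001", "01001", "01003", "01003"),
  year   = c(2000, 2001, 2000, 2001),
  crimes = c(10, 12, 5, 7)
)

# Count events per county-year, then merge the counts onto the crime data;
# all.x = TRUE keeps county-years with no events, which then get a 0
event_counts <- aggregate(event ~ fips + year, data = events, FUN = length)
names(event_counts)[3] <- "n_events"
merged <- merge(crime, event_counts, by = c("fips", "year"), all.x = TRUE)
merged$n_events[is.na(merged$n_events)] <- 0
```

Merging on both fips and year (rather than fips alone) keeps one row per county-year in the result.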

Related

How to calculate column sums in R and then plot it using data.table library [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 1 year ago.
So I have my first job as a data analyst however my boss wants me to use the data.table package and I'm having some issues with it.
My data set is about e-commerce shops with total purchases and returns (client returns). I want to visualize in a barplot how many items were returned per product, denoted as Product name (I know having spaces in column names is a bit odd; I will change it later), so my code is as follows:
library(shiny)
library(ggplot2)
library(data.table)
library(tidyverse)
mainTable <- fread('returnStats.csv')
essentialReturnData <- mainTable[,c(7,9)]
returnsByProductName <- essentialReturnData[,
.(totalReturns = sum(essentialReturnData$`Return quantity`)),
by = 'Product name']
barplot(table(returnsByProductName$`Product name`))
However, I'm only getting a data.table with the same sum value for all the Product names, and of course the resulting barplot looks like complete garbage.
There are two things wrong here:
Since you're asking for sum(essentialReturnData$`Return quantity`), which refers back to the whole table rather than to the current group, the sum ignores the by grouping. Use sum(`Return quantity`) instead, since a bare column name inside [] refers to the column within each group.
table(returnsByProductName$`Product name`) is a frequency table for the product names, but returnsByProductName only has one row per name. You're not using the totalReturns at all! Use barplot(returnsByProductName$totalReturns, names.arg = returnsByProductName$`Product name`) instead.
Given how many products you have, you'll have problems fitting all the names on the axis in a nice way. You can do things like adding a las = 2 argument, which is passed to par() and makes the x-axis labels vertical. It's still going to look messy with that many products, however, and if the names are long then it doesn't leave much space for the plot itself, unless you make the plot size enormous.
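Putting both fixes together, a minimal sketch with a toy stand-in for the CSV (the real file's columns 7 and 9 are assumed to be `Product name` and `Return quantity`):

```r
library(data.table)

# Toy stand-in for fread('returnStats.csv')
mainTable <- data.table(
  `Product name`    = c("mug", "mug", "shirt", "hat"),
  `Return quantity` = c(1, 2, 3, 1)
)

# Fix 1: refer to the column directly so the sum respects the by grouping
returnsByProductName <- mainTable[,
  .(totalReturns = sum(`Return quantity`)),
  by = "Product name"]

# Fix 2: plot the totals, not a frequency table of the names
barplot(returnsByProductName$totalReturns,
        names.arg = returnsByProductName$`Product name`,
        las = 2)
```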

join dataframes of unequal length, repeating values where appropriate [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
I am trying to join two tables of unequal length in R. Both share a column (LogsheetID) on which to join. The longer table has more than one row per value of the shared column, while the shorter table has exactly one value in each column (e.g. Date, VesselID) per LogsheetID. In the joined table I want the values from the short table repeated to match the way LogsheetID is repeated in the long table. I tried left_join, but the joined columns from the short table come out as NA.
This should work:
merge(tableX, tableY, by = "colName", all = TRUE)
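If left_join is returning NA, the usual culprit is that the key columns differ in type (e.g. factor vs. character) or one side has stray whitespace; str() on both tables is worth checking. A sketch of the repeating behaviour with hypothetical tables and values:

```r
# `long` has several rows per LogsheetID; `short` has exactly one
long  <- data.frame(LogsheetID = c(1, 1, 2, 2, 2),
                    Catch      = c(10, 20, 5, 15, 25))
short <- data.frame(LogsheetID = c(1, 2),
                    Date       = c("2016-01-01", "2016-02-01"),
                    VesselID   = c("V1", "V2"))

# merge() repeats each one-row-per-ID value from `short` across the matching
# rows of `long`; all = TRUE also keeps rows with no match on either side
joined <- merge(long, short, by = "LogsheetID", all = TRUE)
```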

Interpolation and forecasting out of 2 values in R [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I have a vector of yearly population data from 1980 to 2020 with only two values (years 2000 and 2010) and I need to predict the missing data.
My first thought was to use na.approx to fill in the missing data between 2000 and 2010 and then use the ARIMA model. However, as the population is declining, in the remote future its values would become negative, which is illogical.
My second thought was to use differences of logarithms between the sample data, dividing by 10 (since there is a 10-year gap between the actual values), and using the result as a percentage change to predict the missing data.
However, I am new to R and statistics so I am not sure if this is the best way to get the predictions. Any ideas would be really appreciated.
Since the line that the two data points provides does not make intuitive sense, I would recommend just using the average of the two unless you can get additional data. If you are able to get either more yearly data, or even expected variation values, then you can do some additional analysis. But for now, you're kinda stuck.
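A sketch of both options on hypothetical population values, contrasting the flat mean fill suggested above with the straight line through the two points (which declines forever and eventually goes negative):

```r
# Hypothetical population values for the two observed years
years <- 1980:2020
known <- data.frame(year = c(2000, 2010), pop = c(52000, 48000))

# Flat fill with the mean of the two observations, as suggested above
flat <- rep(mean(known$pop), length(years))

# Straight line through the two points, extended over the whole range
fit <- lm(pop ~ year, data = known)
linear <- predict(fit, newdata = data.frame(year = years))
```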

How do I analyze movement between points in R? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
So I have a lot of points, kind of like this:
animalid1;A;time
animalid1;B;time
animalid1;C;time
animalid2;A;time
animalid2;B;time
animalid2;A;time
animalid2;B;time
animalid2;C;time
animalid3;A;time
animalid3;B;time
animalid3;C;time
animalid3;B;time
animalid3;A;time
What I want to do is to first of all make R understand that the points A,B,C are connected. Then I want to get comparisons of movement from A to C and how long time it takes, how many steps were used, etc. So maybe I have a movement sequence like ABC on 20 animals and then ABABC on 10 animals and then ABCBA on 5 animals. I want to get some sort of statistical test done to see if the total time is different between these groups, and so on.
I bet this has been done before. But my Google skills are not good enough to find it.
Look at the msm package (msm stands for Multi-State Model). Given observations of states at different times, it will estimate transition probabilities and the average time spent in each state.
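Short of fitting a full multi-state model, the sequences themselves can be summarised in base R first; a sketch with hypothetical animal IDs, states, and times in the same layout as the question:

```r
# Toy tracking data in the animalid;state;time layout (values hypothetical)
obs <- data.frame(
  animal = c("a1", "a1", "a1", "a2", "a2", "a2", "a2", "a2"),
  state  = c("A",  "B",  "C",  "A",  "B",  "A",  "B",  "C"),
  time   = c(0,    5,    9,    0,    3,    6,    10,   15)
)

# Collapse each animal's track into a sequence string, total time, and step count
by_animal <- do.call(rbind, lapply(split(obs, obs$animal), function(d) {
  d <- d[order(d$time), ]
  data.frame(animal   = d$animal[1],
             sequence = paste(d$state, collapse = ""),
             duration = max(d$time) - min(d$time),
             steps    = nrow(d) - 1)
}))

# With many animals per sequence group, total times could then be compared
# across groups, e.g. kruskal.test(duration ~ sequence, data = by_animal)
```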

Test-driven Data Analysis? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Several R packages have been developed that allow you to make assertions about your data, results of analysis, etc. However, I have never seen anyone compile a list of useful checks.
Are there any resources that have checklists or other lists of common checks?
For example, if you were analyzing survey data, you might want to sanity check the data as follows:
Impossible values: Someone who lists a profession of doctor is 6 years old
Unlikely correlations: Education level has a negative correlation with Earnings
After doing a lot of joins, you want to verify the final data structure:
Lost observations: A data set begins with N = 100,000... after appending variables, does N still equal 100,000?
Unreasonable values within columns: Summaries of nulls, detection of outliers, distribution of most common values
Unreasonable cross-column relationships: A table with sales references salesperson, but the salesperson ID doesn't exist in the salesperson table
After developing predictions, you want to check if they make sense:
Unlikely predictions across groups: You average predicted probabilities of making a purchase by group and find that non-pet owners are more likely than pet owners to buy pet food
etc. etc.
Below are some R packages that would help incorporate such tests into R... if only we had a checklist of what those tests should be!
testthat
http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf
https://github.com/hadley/testthat
RUnit
http://cran.r-project.org/web/packages/RUnit/vignettes/RUnit.pdf
svUnit
http://cran.r-project.org/web/packages/svUnit/vignettes/svUnit.pdf
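Even without a full checklist, a few of the checks listed above can be written as plain stopifnot() assertions; a sketch on hypothetical survey data:

```r
# Hypothetical survey data to illustrate the checks
survey <- data.frame(
  age        = c(34, 51, 28),
  profession = c("doctor", "teacher", "nurse"),
  earnings   = c(90000, 48000, 55000)
)
n_before <- nrow(survey)

# Impossible values: no doctor younger than, say, 24
stopifnot(all(survey$age[survey$profession == "doctor"] >= 24))

# Lost observations: row count unchanged after appending a variable
survey$log_earnings <- log(survey$earnings)
stopifnot(nrow(survey) == n_before)

# Unreasonable values within columns: no missing earnings
stopifnot(!anyNA(survey$earnings))
```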
