How to find outliers in R comparing text to a numerical value?

I am trying to learn R and finding it difficult to locate precisely what I am looking for; there are tons of libraries.
I have a sample data set of 150k first and last names and their salaries.
For fun, I would like to see if any first or last names are associated with significantly higher or lower pay.
,"FirstName","LastName","BasePay"
1,"NATHANIEL","FORD","167411.18"
2,"GARY","JIMENEZ","155966.02"
3,"ALBERT","PARDINI","212739.13"
I have tried using library("arulesViz") and rules <- apriori(data),
but it seems to look for associations with exact salary values rather than with whether a salary is relatively high or low.
Any help on this problem to get me started would be really appreciated!
Regards, Steven

I think it's a perfectly legitimate question.
I would use the package dplyr. You can then use the group_by and summarise functions: in your case, group_by(FirstName), and then choose any kind of statistic, e.g. the mean or median salary, as a measure of bias.
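For example, here is a minimal sketch along those lines, assuming the CSV shown in the question has been saved as salaries.csv; the file name and the cut-off of at least 30 occurrences per name are assumptions, not part of the question.

library(dplyr)

# read the CSV from the question; the unnamed first column is a row index
salaries <- read.csv("salaries.csv")

name_stats <- salaries %>%
  mutate(BasePay = as.numeric(BasePay)) %>%        # BasePay is quoted in the file
  group_by(FirstName) %>%
  summarise(n = n(),
            median_pay = median(BasePay, na.rm = TRUE)) %>%
  filter(n >= 30) %>%                              # ignore very rare names
  arrange(desc(median_pay))

head(name_stats)   # first names with the highest median pay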

Related

Using permanova in r to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run a PERMANOVA using adonis2 in R to analyse some data that I have collected. I have been looking online, but as often happens, explanations are a bit convoluted, so I am asking for your help. I have some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material); a snapshot of my dataset structure was included as an image.

I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong, or am I missing something here? Also, I think I understood that I should check for homogeneity of variance, but I have not understood whether I should check for it on each variable independently, or whether I should include them all in the same bit of code (which does not seem to work). Here is the bit of code that I am using to run the PERMANOVA (1), and the one that I am trying to use to assess homogeneity of variance, which does not work (2).
(1) adonis2(species ~ Age + Material + Depth, data = data.var, by = "margin")
'species' is the subset of the dataset including all the species counts, while 'data.var' is the subset including the 3 independent variables. Also, what is the difference between using '+' and '*' in the formula? When I use '*' it gives me 'Error in qr.X(object$CCA$QR) :
need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check<-betadisper(species.distance,data.var, type=c("centroid"), bias.adjust= FALSE)
'species.distance' is a matrix calculated through 'vegdist' using the Bray-Curtis method. I used 'data.var' to check variance on all 3 independent variables, but it does not work, while it works if I check them independently (3). Why is that?
(3) variance.check<-betadisper(species.distance, data$Depth, type=c("centroid"), bias.adjust= FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).
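For reference, a minimal runnable sketch of the workflow described in this question, assuming a community matrix called species and a data frame data.var with columns Age, Material and Depth (the object names follow the question; everything else is an assumption):

library(vegan)

# Bray-Curtis dissimilarities between sites
species.distance <- vegdist(species, method = "bray")

# (1) PERMANOVA with marginal tests for each term
adonis2(species.distance ~ Age + Material + Depth, data = data.var, by = "margin")

# (2)-(3) betadisper() expects a single grouping factor (see ?betadisper),
# so dispersion is typically checked one factor at a time, e.g. for Depth:
variance.check <- betadisper(species.distance, data.var$Depth, type = "centroid")
permutest(variance.check)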

custom code for compact letter display from pairwise table output

I would like to create a custom code that creates a compact letter display from a pairwise test I have performed.
I have done this with pairwise t-tests with success (packages for this exist), and I am also familiar with the multcomp package and its cld() function for getting compact letter displays from linear models, but they will not work for my specific case here.
I work with Kaplan-Meier survival data often, and after I run the pairwise_survdiff() function to see if any statistical differences exist between groups (found in the packages library(survival) and library(survminer)), I can easily extract a table displaying all pairwise comparisons and their corresponding p-values. I have included an example for you here today (see df below).
When there are many comparisons, it becomes a mess to work out by hand which groups are different or similar, and it is prone to human error when many levels exist; up to now, I have always done it by hand. I would like to change this.
Could someone help me with a code that helps do this automatically?
Here is a mock dataframe df with 10 treatments (named treatment-1....treatment-10), and the rows are filled with p-values. Let's treat anything below p < 0.05 as significant. However, it would be very useful to have code that also allows a more conservative approach, i.e. lets you set the desired cut-off for statistical significance (say, anything below p < 0.01).
Thanks for your help, and again, here is a play dataframe:
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = T)
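For context, a hedged sketch of how such a pairwise p-value table is typically produced with pairwise_survdiff(); the data frame surv_df and its columns time, status and treatment are illustrative assumptions, not part of the question:

library(survival)
library(survminer)

# log-rank tests for every pair of treatment groups, with p-value adjustment
pw <- pairwise_survdiff(Surv(time, status) ~ treatment,
                        data = surv_df,
                        p.adjust.method = "BH")
pw$p.value   # matrix of pairwise p-values, analogous to the mock df above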
While reflecting on this, I believe I found an answer on my own, with library(multcompView) and library(rcompanion).
Nonetheless, I think it is worth posting, since I have seen/heard this question multiple times. Here is how I solved my problem:
library(rcompanion)
library(multcompView)

df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = TRUE)

# expand the pairwise table into a full, symmetric matrix of p-values
PT1 <- fullPTable(df)

# assign compact display letters; groups that share a letter are not
# significantly different at the chosen threshold
multcompLetters(PT1,
                compare = "<",
                threshold = 0.05,   # e.g. 0.01 for a more conservative cut-off
                Letters = letters,
                reversed = FALSE)
This gives me the desired output with the compact letter displays between groups. Additionally, one can make the statistical threshold more or less conservative by changing the threshold= argument.
Very happy with the result. This has bothered me for a while; I hope it is useful to other members.

How to take a Probability Proportional to Size (PPS) Unequal Probability sample using R?

I have very little programming experience, but I'm working on a statistics project and would like to generate an unequal probability sample where the inclusion probability of a unit is based on its size (PPS).
Basically, I have two datasets:
ds1 lists US states and the parameter I'm trying to estimate
ds2 has the population size of each state.
My questions:
I want to use R to select a random sample from the first dataset using inclusion probabilities based on the population of each state (second dataset).
Also is there any way to use R to calculate these Generalized Unequal Probability Estimator formulas?
Also just a note on the formulas: pi_i is inclusion probability and pi_ij is joint inclusion probability.
There is a package for this in R, pps, and its documentation is on CRAN.
There is also another package called survey, which comes with a bit of documentation of its own.
I'm not sure of the difference between the two and haven't used them myself, but I hope this is what you're looking for.
Yes, that's called weighted sampling. Simply set the weight to the size of the state; strictly speaking you don't even need to normalize the weights by 1/sum(sizes), although it's always good practice to. There are plenty of duplicate posts on SO showing how to do weighted sampling.
The only tiny complication is that you need to do a join of the datasets ds1 and ds2. Show us what code you've tried if that's causing problems; I recommend using either dplyr or data.table.
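A minimal sketch of that advice, assuming ds1 has a state column plus the variable of interest and ds2 has state and population; the column names and the sample size of 10 are assumptions:

library(dplyr)

# attach each state's population to the estimation dataset
ds <- left_join(ds1, ds2, by = "state")

# draw 10 states with selection probability proportional to population;
# note that sample() without replacement draws sequentially, so inclusion
# probabilities are only approximately proportional to size, and the pps
# package mentioned above implements exact PPS designs
set.seed(1)
pps_sample <- ds[sample(nrow(ds), size = 10, prob = ds$population), ]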
Your second question should be asked separately, and is off-topic on SO, or at least won't get a great response there; statistical questions are best asked on the sister site Cross Validated.

How to aggregate / roll up percentile measures

There is a dataset which contains data aggregated to various dimensions, down to the hourly level. The main measure is speed, which is simply the file size divided by the duration.
The requirement is to see Percentile, Median and Average/Mean summaries.
Mean is easy because we simply create a calculated measure in the MDX, and it then works at all aggregation levels, i.e. daily/monthly etc.
However, percentile and median are hard. Is there any way to define a calculation for these measures that will roll up correctly? We could add the percentile speed as a column in the ETL when we're reading the raw data, but we'd still need to find a way to roll it up further.
What is the proper way to roll up these types of measures? It's not uncommon to ask for percentile numbers, so I'm surprised not to see much information on this when I look around.
Maybe the only approach is to have various aggregated tables at the right level, with the right calculation, and then make Mondrian use them as agg tables? Or, worst case, have multiple cubes(!)
OK, so it turns out you cannot roll up percentiles (and therefore medians, since a median is just the 50th percentile). I understand others have had this problem; see this tweet from Kasper: https://twitter.com/kaspersor/status/308189242788560896
So our solution was a couple of different aggregate tables to store the relevant stats, plus pre-computed percentile and median columns on the main (already aggregated) fact table.
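As a small illustration of why this is the case (in R rather than MDX, purely for demonstration): the median of pre-aggregated medians generally differs from the median of the underlying data, so these measures have to be computed at the level they are reported.

hourly <- list(h1 = c(1, 2, 100), h2 = c(3, 4, 5))
median(unlist(hourly))            # 3.5: median over all the raw data
median(sapply(hourly, median))    # 3:   median of the hourly medians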

Finding patterns through better visualization in R

I have the following time series data, with 60 data points shown below, along with a simple plot of the data. I am using R for the plotting. I think that if I draw a moving average curve through the points in the graph, we can better understand the patterns in the data, but I don't know how to do that in R. Could someone help me do that? Additionally, I am not sure whether this is a good way to identify patterns; please also suggest a better way if there is one. Thank you.
x <- c(18,21,18,14,8,14,10,14,14,12,12,14,10,10,12,6,10,8,
14,10,10,6,6,4,6,2,8,6,2,6,4,4,2,8,6,6,8,12,8,8,6,6,2,2,4,
4,4,8,14,8,6,6,2,6,6,4,4,8,6,6)
To answer your question about moving averages: you could accomplish it with the help of rollmean, which is in the package zoo.
From Joshua's comment: you could also look into the TTR package, which depends on xts, which in turn depends on zoo. There are other moving averages in TTR as well: check ?MA.
library(zoo)   # rollmean() lives in zoo
require(TTR)   # optional: TTR provides further moving averages (SMA, EMA, ...)

# assuming your vector is loaded in dat (e.g. dat <- x from above)
# sliding window / moving average of size 5
dat.k5 <- rollmean(dat, k = 5)
One reasonable possibility:
library(ggplot2)
d <- data.frame(x = scan("tmp.dat"))   # or reuse the vector above: d <- data.frame(x = x)
qplot(x = seq(nrow(d)), x, data = d) + geom_smooth(method = "loess")
edit: moved from comment to answer, based on https://meta.stackexchange.com/questions/164783/why-was-a-seemingly-relevant-non-offensive-comment-removed
With regard to "is this a good way to identify patterns" (which is a little off-topic for StackOverflow, but whatever); I think rolling means are perfectly respectable, although more sophisticated methods (such as the locally-weighted regression [loess/lowess] shown here) do exist. However, it doesn't look to me as though there is much of a complicated pattern to detect here: the data seem to initially decline with time, then level off. Rolling means and more sophisticated approaches may look prettier, but I don't think they will identify any deeper patterns in this data set ...
If you want to do this sort of thing for multiple data sets at once (as indicated in your comment), you may like ggplot's capabilities for automatically producing multi-line or faceted versions of the same plot.
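For example, a hedged sketch of the faceting idea, assuming several series are stacked in one data frame with a series identifier; the second series here is just a placeholder for illustration:

library(ggplot2)

# x is the vector from the question; series "B" is fabricated purely to show faceting
d_all <- data.frame(
  series = rep(c("A", "B"), each = length(x)),
  t      = rep(seq_along(x), 2),
  value  = c(x, rev(x))
)

ggplot(d_all, aes(t, value)) +
  geom_point() +
  geom_smooth(method = "loess") +
  facet_wrap(~ series)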
