How might you determine how well distributed a set of data is? - r

I have two datasets that each distribute 90 data points into groups (rows): one into 2 groups and one into 4 groups. I would like to determine which of the two has distributed the data better, and plot the result so I can see this visually. "Better distributed" means each group ends up with a similar/same amount of data. For example, in the result of "Grouped into 2" the second group contains larger values in every column than the first group, so 1 of the 2 groups holds the larger values, which means the data is not well distributed between the 2 groups.
I'm quite new to R, so I am unsure how I could go about doing this. I would appreciate any insight into what approach could be used.
# Grouped into 4
Values4 <- matrix(c(1, 6, 3, 6, 6, 8,
                    3, 3, 5, 3, 3, 3,
                    6, 7, 6, 7, 5, 4,
                    9, 4, 4, 5, 5, 3), nrow = 4, ncol = 6, byrow = TRUE)

# Grouped into 2
Values2 <- matrix(c( 3, 6,  4,  3,  4, 6,
                    12, 9, 12, 12, 11, 9), nrow = 2, ncol = 6, byrow = TRUE)

You can do this with some basic statistics, using hypothesis testing, i.e. testing whether the two groups are statistically different or not. The stats package in R has a lot of tests you can use, each with its own assumptions. Here is one:
# Making the matrix
values <- matrix(c( 3, 6,  4,  3,  4, 6,
                   12, 9, 12, 12, 11, 9), nrow = 2, ncol = 6, byrow = TRUE)

# Conducting a t-test on the two rows
t.test(values[1, ], values[2, ], paired = FALSE)
This will give you:
Welch Two Sample t-test
data: values[1, ] and values[2, ]
t = -7.9279, df = 9.945, p-value = 1.318e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.328203 -4.671797
sample estimates:
mean of x mean of y
4.333333 10.833333
The mean of values[1, ] is smaller than that of values[2, ], with a p-value of 1.3e-05, so the two groups differ significantly: the data is not evenly distributed between them.
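Since the question also asks to compare the two groupings and to plot the result, here is a minimal sketch (not from the original answer) of one possible approach: treat a grouping as better distributed when its group totals are more similar, measured by the coefficient of variation of the row sums. Values4 and Values2 are the matrices from the question.
# Lower CV of the group totals = data spread more evenly across groups
cv <- function(m) {
  totals <- rowSums(m)       # one total per group (row)
  sd(totals) / mean(totals)  # coefficient of variation
}
cv(Values4)  # spread of the totals across 4 groups
cv(Values2)  # spread of the totals across 2 groups

# Visual comparison: one box per group, side by side
par(mfrow = c(1, 2))
boxplot(t(Values4), main = "Grouped into 4", xlab = "Group", ylab = "Value")
boxplot(t(Values2), main = "Grouped into 2", xlab = "Group", ylab = "Value")
For a formal test across more than two groups, a one-way ANOVA (e.g. oneway.test() with the values and a group factor) generalizes the t-test above.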

Related

How to make seaborn plot display and take multiplicity of points into account

I have these two lists of data extracted from a dataframe. 
[5, 5, 5, 5, 5, 4, 4, 5, 4, 4, 2, 4, 5, 5, 5] (Col 1)
[5, 5, 5, 4, 4, 3, 2, 2, 3, 2, 2, 4, 2, 2, 5] (Col 2)
Calling scipy.stats.pearsonr gives (-0.5062175977346661, 0.20052806464412476), indicating a negative correlation. However, the line of best fit from calling
graph = sns.jointplot(x = 'col1name', y = 'col2name', data = df_name, kind = 'reg')
is positive. I suspect this is because the calculation of the line of best fit is not taking the multiplicity of the points into account. In particular, (5, 2) is only considered once even though it occurs 3 times. So what can I do so that (a) someone can look at this plot and tell how many students are represented by a single data point, and (b) the line of best fit takes the multiplicity of points into account?
Here is a picture of the plot: 
The coinciding points aren't ignored. Here is a visualization that adds some random noise (jitter) to show all the points and marks the mean of 'col2' for each value of 'col1'. Note that the r-value is calculated before applying the random jitter.
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
df = pd.DataFrame({'col1': [5, 5, 5, 5, 5, 4, 4, 5, 4, 4, 2, 4, 5, 5, 5],
'col2': [5, 5, 5, 4, 4, 3, 2, 2, 3, 2, 2, 4, 2, 2, 5]})
r, p = pearsonr(df['col1'], df['col2'])
xs = np.unique(df['col1'])
ys = [df[df['col1'] == x]['col2'].mean() for x in xs]
df['col1'] += np.random.uniform(-0.1, 0.1, len(df))
df['col2'] += np.random.uniform(-0.1, 0.1, len(df))
g = sns.jointplot(x='col1', y='col2', data=df, kind='reg')
g.ax_joint.scatter(x=xs, y=ys, marker='X', color='crimson') # show the means
g.ax_joint.text(2.5, 4.5, f'$r={r:.2f}$', color='navy') # display the r-value
plt.show()
The regression line seems to go quite close to the means, as expected. For col1==5 there are 4 values at 5, 2 at 4 and 3 at 2, their mean is 3.78.

Kolmogorov-Smirnov Test in R package (dgof) - Discrete Case

I have the following data,
Data <- c(8, 15, 8, 10, 7, 5, 2, 11, 8, 7, 6, 6, 4, 6, 10,
3, 9, 7, 15, 6, 5, 9, 8, 3, 3, 8, 5, 14, 8, 11,
8, 10, 7, 4, 6, 4, 6, 7, 11, 7, 8, 7, 8, 6, 5,
12, 7, 8, 13, 10, 6, 9, 7)
and I want to perform a KS test in R using the dgof package, but I have no idea how to use it. I have also fitted the above data with binomial and Poisson distributions.
Now, I want to use the KS test to identify which model (binomial or Poisson) better represents the data.
Thank you.
Well, first you have two problems:
The Kolmogorov-Smirnov test applies to continuous distributions, and your data is discrete (see Kolmogorov-Smirnov Goodness-of-Fit Test).
Your data contains ties.
If you still wanted to apply the test, one way is to eliminate the duplicate values and use the maximum-likelihood estimators for the Poisson and binomial distributions:
x <- unique(Data)
ks.test(x,"ppois",lambda <- mean(x))
One-sample Kolmogorov-Smirnov test
data: x
D = 0.2058, p-value = 0.5273
alternative hypothesis: two-sided
ks.test(x,"pbinom",n <- length(x),p <- mean(x)/n)
One-sample Kolmogorov-Smirnov test
data: x
D = 0.3126, p-value = 0.103
alternative hypothesis: two-sided
Comparing the two fits, the Poisson model has the smaller test statistic D, so we could conclude that it represents the data better, without relying only on the p-values.
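Since the question asks about dgof specifically: that package provides a version of ks.test() that accepts a discrete null distribution given as a step function, which avoids the continuity and ties problems above. A minimal sketch, assuming the Data vector and the Poisson fit from above (the choice of where to truncate the support is arbitrary here):
library(dgof)
lambda <- mean(Data)                # ML estimate of the Poisson rate
support <- 0:max(Data)              # support covering the observed values
# CDF of the fitted Poisson, as a step function for the discrete test
pois_cdf <- stepfun(support, c(0, ppois(support, lambda)))
dgof::ks.test(Data, pois_cdf)       # discrete one-sample KS test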

How to calculate the Euclidean distance between two points stored in rows of two separate matrices?

I have two matrices, m1 and m2.
I would like to compute the distance between point X and point Y without using a loop, in such a way that the expression/function still works when the matrices are expanded by additional columns.
For validation one could use:
sqrt((m1[,1] - m2[,1])^2 + (m1[,2] - m2[,2])^2 + (m1[,3] - m2[,3])^2 + (m1[,4] - m2[,4])^2 + (m1[,5] - m2[,5])^2)
The expression above gives the correct result for the distance between X and Y; however, once the matrices are expanded by additional columns the expression also has to be expanded, and that is an unacceptable solution...
Would you be so kind as to tell me how to achieve this? Any help would be more than welcome. I've been stuck on this one for a while...
Subtraction between matrices is element-wise in R, and rowSums() is useful for calculating the sum along each row:
m1 <- matrix(
c(4, 3, 1, 6,
2, 4, 5, 7,
9, 0, 1, 2,
6, 7, 8, 9,
1, 6, 4, 3),
nrow = 4
)
m2 <- matrix(
c(2, 6, 3, 2,
9, 4, 1, 4,
1, 3, 0, 1,
4, 5, 0, 2,
7, 2, 1, 3),
nrow = 4
)
sqrt((m1[,1] - m2[,1])^2 + (m1[,2] - m2[,2])^2 + (m1[,3] - m2[,3])^2 + (m1[,4] - m2[,4])^2 + (m1[,5] - m2[,5])^2)
# [1] 12.529964 6.164414 9.695360 8.660254
sqrt(rowSums((m1 - m2) ^ 2))
# [1] 12.529964 6.164414 9.695360 8.660254
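To make the intent explicit, the vectorized expression can be wrapped in a small helper. A minimal sketch (the name row_dist is just an illustration), assuming both matrices always have the same dimensions:
# Row-wise Euclidean distance between two matrices of equal shape
row_dist <- function(a, b) {
  stopifnot(all(dim(a) == dim(b)))  # guard against mismatched shapes
  sqrt(rowSums((a - b)^2))
}
row_dist(m1, m2)
# [1] 12.529964  6.164414  9.695360  8.660254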

Subsample a matrix by selecting locations with specific values within a matrix in R

I have to use R instead of Matlab and I'm new to it.
I have a large array of data repeating like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
I need to find the locations where the values 1, 4, 7, 10 occur, in order to create a sample using those locations.
In this case it will be position(=corresponding value) 1(=1) 4(=4) 7(=7) 10(=10) 11(=1) 14(=4) 17(=7) 20(=10) and so on.
In Matlab it would be y = find(ismember(x, [1, 4, 7, 10])).
Please, help! Thanks, Pavel
something like this?
foo <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bar <- c(1, 4, 7, 10)
which(foo %in% bar)
#> [1] 1 4 7 10 11 14 17 20
The %in% operator is what you want. For example,
# data in x
targets <- c(1, 4, 7, 10)
locations <- x %in% targets
# locations is a logical vector you can then use:
y <- x[locations]
There'll be an extra step or two if you want the row and column indices of the locations, but it's not clear whether you do. (Note: the logicals will be in column-major order.)
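If row and column indices are wanted, here is a minimal sketch of that extra step (the matrix x below is a hypothetical example, since the question only describes a repeating vector):
x <- matrix(rep(1:10, 2), nrow = 4)          # hypothetical 4x5 matrix
targets <- c(1, 4, 7, 10)
hits <- array(x %in% targets, dim = dim(x))  # %in% drops dims, so restore them
which(hits)                                  # linear (column-major) positions
which(hits, arr.ind = TRUE)                  # row/column index pairs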

R hist right/left clump binning

I have a data set of length 15,000 with real values from 0 to 100. My data set is HEAVILY skewed to the left. I'm trying to produce the following bins: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, >10. What I have done so far is create the following:
breakvector = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100)
and have run:
hist(datavector, breaks=breakvector, xlim=c(0, 13))
However, it seems like this results in a histogram where data greater than 13 aren't included. Does anyone have any idea how to get R to bin all the rest of the data into the last bin? Thanks in advance.
How about this:
datavector <- c(sample(1:9, 40, replace = TRUE), sample(10:100, 20, replace = TRUE))
breakvector <- 0:11
hist(ifelse(datavector > 10, 11, datavector), breaks = breakvector, xlim = c(0, 13), xaxt = "n")
axis(1, at = 1:11 - 0.5, labels = c(1:10, ">10"))
Rather than adjusting the breaks, I just throw all the values greater than 10 into a bin at 11. Then I update the axis accordingly.
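An equivalent way to clamp the values, as a small variation (not from the original answer), uses pmin() instead of ifelse():
# pmin() caps every value above 11 at 11, so everything > 10 lands in the (10, 11] bin
hist(pmin(datavector, 11), breaks = 0:11, xlim = c(0, 13), xaxt = "n")
axis(1, at = 1:11 - 0.5, labels = c(1:10, ">10"))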
