Generation of binary data using the Beta distribution in R

I am a new R user. In my work, I have to generate binary data (0 or 1) using the beta distribution (the rbeta command), and I have to create a matrix of such data. In some of the columns I want more zeros than ones, or more ones than zeros.
This should be done with shape parameter 1 = shape parameter 2 = 0.5. I have tried all combinations, but I am not able to do this. Please let me know how to do it. The hint I was given was: take probability(0) = some number and probability(1) = 1 - probability(0), then pass these parameters to the rbeta command. But I did not find such an option in the rbeta command. Please let me know if there is any way.
Thanking you,
Kalyani
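A minimal sketch of one way to read that hint (the row and column counts and all names here are made up for illustration): draw one Beta(0.5, 0.5) value per column and use it as the probability of a one in rbinom(), so different columns end up with different 0/1 balances.
set.seed(1)                       # just for a reproducible example
n_rows <- 100                     # hypothetical number of rows
n_cols <- 5                       # hypothetical number of columns
# One Beta(0.5, 0.5) draw per column; values near 0 give mostly zeros,
# values near 1 give mostly ones.
p_one <- rbeta(n_cols, shape1 = 0.5, shape2 = 0.5)
# Fill each column with Bernoulli(p_one[j]) draws (0s and 1s).
mat <- sapply(p_one, function(p) rbinom(n_rows, size = 1, prob = p))
colMeans(mat)   # proportion of ones per column, roughly equal to p_one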

Related

Mann Whitney U test in R

I want to perform a Mann-Whitney U test on a small set of non-parametric data in R; can anyone help me find the right code?
I have been trying to use the following code to do the test straight from my data set, but I keep getting this error message:
Error: Column.Heading.1 not found.
wilcox.test(Column.Heading.1, Column.Heading.2, data=DATA)
I'm new to stats, so I'm not sure if I'm missing something here.
Thanks in advance!
You might be assuming that wilcox.test() will be able to find Column.Heading.1 and Column.Heading.2 inside DATA.
Unfortunately, that is not what happens. The data argument only plays a role if you pass a formula as the first argument, i.e. Column.Heading.1 ~ Column.Heading.2.
If you want to use the x, y configuration that you are using, you have to write out the columns in full, as if you were trying to see their values on the console. For example:
wilcox.test(DATA$Column.Heading.1, DATA$Column.Heading.2)
Note that the formula and the x, y configurations have different meanings. The formula assumes that Column.Heading.2 is a factor grouping the numbers (like "group1" or "group2"); the x, y configuration expects both columns to be filled with numbers.
If you want to specify the data argument to wilcox.test, you need to use the formula interface, which means you will need a data set in long format, with one column for the response and another for the group.
Otherwise, to compare two columns of the same data frame you will need to refer to them explicitly. That is, if we make a data frame:
dat <- data.frame(x=c(1.1,2.2,3.3),y=c(0.9,2.1,2.8))
Then we can run a Wilcoxon test either using:
wilcox.test(dat$x, dat$y)
or
with(dat, wilcox.test(x, y))
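If you do want to pass the data argument, a minimal sketch of the long-format formula version (reusing the dat defined above, with made-up group labels) would be:
# One column of values, one grouping factor
long <- data.frame(value = c(dat$x, dat$y),
                   group = factor(rep(c("x", "y"), each = nrow(dat))))
wilcox.test(value ~ group, data = long)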

Random data generation from the uniform distribution for a vector of parameters

I am interested in generating data from a uniform distribution using a vector of parameters (say a parameter vector of size 10). I tried this in R but got an error. Please see the code below; it gives only one observation, but I want to get all 10 values.
parameter=c(1,2,4,5,3,45,10,14,7,12)
runif(1,0,parameter)
runif(10,0,parameter)
Or, if you want it to automatically detect how many values to generate based on the length of the parameter vector:
runif(length(parameter), 0, parameter)
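For context, runif() recycles its min and max arguments across the n draws, so draw i is taken from [0, parameter[i]]. A quick sketch to check this, using the parameter vector defined above:
set.seed(42)    # only to make the check reproducible
draws <- runif(length(parameter), 0, parameter)
all(draws >= 0 & draws <= parameter)   # should be TRUE
cbind(parameter, draws)                # each draw stays below its own maximum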

R cluster analysis and dendrogram with correlation matrix

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix.
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I am not sure how to proceed. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are good for my data?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it is very messy and I don't know how to read it or how to go on.
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.
I also performed a cluster analysis with 2-20 clusters, but the results are so long that I have no idea how to handle them or what to look at.
To determine the "optimal number of clusters" several methods are available, although it is a contested topic.
The kgs function (the Kelley-Gardner-Sutcliffe penalty, from the maptree package) is helpful for getting the optimal number of clusters.
Following your code, one would do:
library(maptree)   # provides kgs()
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function corresponds to the minimum value of op_k, as you can see in the plot.
You can get it with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
Check this page for more methods.
Hope it helps you.
Edit
To find which is the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Plus
Also see this post for the excellent graphical answer from @Ben.
Edit
op_k[which(op_k == min(op_k))]
still gives the penalty value, not the number of clusters. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package.
Also note the dendextend::color_branches function, for coloring your dendrogram by the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
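A minimal sketch of that last step, assuming you keep the clus tree from above and have settled on some number of clusters (k <- 4 here is only a placeholder):
library(dendextend)
k <- 4                                  # hypothetical choice of cluster count
groups <- cutree(clus, k = k)           # cluster membership for each variable
table(groups)                           # size of each cluster
dend <- as.dendrogram(clus)
dend <- color_branches(dend, k = k)     # color branches by cluster
plot(dend, main = paste("Dendrogram cut into", k, "clusters"))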

Comparison between two Data Sets using R Scripting / TERR in Spotfire

I want to compare the ID columns of two data tables using an R script / TERR in Spotfire. Due to some limitations I am not able to install the functions "compare" and "sqldf". I can use the function "duplicated". Can someone help me create a sample script without using the above functions?
Please see the attached images ("Two Data Tables" and "Result Table") for the detailed requirements.
Thanks,
-Vidya
Let's say you have two vectors setA and setB. You can get the result by
# in A but not in B
setdiff(setA,setB)
# in B but not in A
setdiff(setB,setA)
# both in A and B
intersect(setA,setB)
If you just want to know the count, use the length function. This may not be the exact answer you were looking for, but using the above functions you can create any set you want. If you need help with a specific piece of logic, please update your question.
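A small self-contained sketch with made-up ID vectors (substitute the ID columns of your two Spotfire tables):
setA <- c(101, 102, 103, 104)      # hypothetical IDs from table 1
setB <- c(103, 104, 105)           # hypothetical IDs from table 2
only_in_A <- setdiff(setA, setB)   # 101 102
only_in_B <- setdiff(setB, setA)   # 105
in_both   <- intersect(setA, setB) # 103 104
length(in_both)                    # count of IDs present in both tables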

Determining distribution so I can generate test data

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?
While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis about the underlying population distribution) as follows.
You need a file structure that can be searched quickly for "highest entry with key <= X" -- Sleepycat's Berkeley DB has a btree structure for that, for example; SQLite is even easier, though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K, look it up in that file structure with the "highest key <= X" search just mentioned, and use the corresponding value.
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, doing the logic and control in Python and only the statistics in R itself, but that's a personal choice!
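If the pairs happen to fit in memory, here is a hedged sketch of the same idea directly in R, with cumsum() building the cumulative-count keys and findInterval() standing in for the btree lookup (the data frame and its column names are assumptions):
# One row per distinct value: 'value' and 'count' columns (assumed names)
pairs <- data.frame(value = c(10, 20, 30), count = c(5, 1, 4))
cum <- cumsum(pairs$count)   # cumulative counts, i.e. the "keys"
K   <- cum[length(cum)]      # total count
# Draw m random integers in 1..K and map each to the first cumulative
# count that reaches it -- the in-memory version of the key lookup.
m <- 10
X <- sample.int(K, m, replace = TRUE)
idx <- findInterval(X - 1, cum) + 1
new_values <- pairs$value[idx]
table(new_values)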
To see whether you have a real power-law distribution, make a log-log plot of the frequencies and see whether they line up roughly on a straight line. If they do, you might want to read this article on the Pareto distribution for more on how to describe your data.
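A quick sketch of that check in R, using synthetic value/count pairs that roughly follow a power law (replace them with your own data):
# Hypothetical counts falling off like value^-2
value <- 1:1000
count <- round(1e6 * value^(-2))
plot(log10(value), log10(count),
     xlab = "log10(value)", ylab = "log10(count)",
     main = "Log-log plot: a power law appears as a straight line")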
I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably bring in your 100M rows of values and counts using R's read.csv() function. Assuming you've got a header line labeled "values\t counts", the code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
One caveat: as you may know, R keeps all of its objects in memory, so be sure you have enough free memory for 100M rows of data (storing the character strings as factors will help reduce the footprint).
