Give a different color distribution to each column in a data.frame - r

I tried to build a heat map of the clustering result for my data.frame. The data.frame has 5 columns with corresponding row names. I want to know whether I can assign a separate color distribution to each column, because the ranges of my 5 variables differ so much that, without scaling, the "pheatmap" function in R produces a heat map with only one or two colors. I really don't want to scale the data, since I need each data point to keep its positive or negative sign. Here is the head of my data.frame (row names omitted):
r.Square_gamma_logLink  cof_glm.gamma_logLink  int_glm.gamma_logLink  estimated_shape_logLink  estimated_dispersion_logLink
             0.2524970            0.002357581               8.685446                 3.558583                     0.2810107
             0.5932941            0.002651972               9.486916                 8.085618                     0.1236764
             0.3615135           -0.001646538              10.071672                 6.195176                     0.1614159
             0.4131553           -0.002218262              10.563557                 8.671028                     0.1153266
             0.3529775           -0.002336544              10.984005                 4.569396                     0.2188473
             0.4169932            0.002213259               9.602592                 5.216084                     0.1917147
I did try the pheatmap and heatmap functions, but neither was very useful; the result looks pretty much like this.
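One possible workaround (a sketch of my own, not from the original post: it assumes that dividing each column by its own maximum absolute value is an acceptable transformation) is to rescale per column, so that every entry keeps its sign while all five columns share a comparable color range:

library(pheatmap)

df <- data.frame(
  r.Square_gamma_logLink       = c(0.2524970, 0.5932941, 0.3615135,
                                   0.4131553, 0.3529775, 0.4169932),
  cof_glm.gamma_logLink        = c(0.002357581, 0.002651972, -0.001646538,
                                   -0.002218262, -0.002336544, 0.002213259),
  int_glm.gamma_logLink        = c(8.685446, 9.486916, 10.071672,
                                   10.563557, 10.984005, 9.602592),
  estimated_shape_logLink      = c(3.558583, 8.085618, 6.195176,
                                   8.671028, 4.569396, 5.216084),
  estimated_dispersion_logLink = c(0.2810107, 0.1236764, 0.1614159,
                                   0.1153266, 0.2188473, 0.1917147)
)

# divide each column by its own maximum absolute value: signs are preserved,
# but every column now spans a comparable range for the color mapping
mat_scaled <- apply(as.matrix(df), 2, function(col) col / max(abs(col)))

pheatmap(mat_scaled)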

Related

How do you randomly generate a dataset with categorical variables in R?

A similar question has been asked on this forum before; however, it doesn't go into enough detail for my issue.
I'm trying to generate a random dataset with 8 columns and 50,000 rows. Each column must be a categorical variable (in my example, a breed of dog) with 3 levels (in my example, a colour), resulting in approximately equal proportions.
The line of code below does essentially what I'm looking for, but how do you alter it to use the following values?
breeds of dog instead of the "8" columns: Poodle, Labrador, Pug, Chihuahua, Collie, Shitzu, Bulldog, Lurcher
colours instead of "LETTERS": brown, black, white
df <- data.frame(replicate(8,sample(LETTERS[1:3], 50000, replace = TRUE)))
Any help is much appreciated, thank you.
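One way to adapt that line (a sketch of my own, assuming the eight breeds become the column names and each cell holds one of the three colours):

breeds  <- c("Poodle", "Labrador", "Pug", "Chihuahua",
             "Collie", "Shitzu", "Bulldog", "Lurcher")
colours <- c("brown", "black", "white")

df <- data.frame(replicate(length(breeds),
                           sample(colours, 50000, replace = TRUE)))
names(df) <- breeds

# sanity check: each colour should appear in roughly a third of the rows
prop.table(table(df$Poodle))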

Averaging different length vectors with same domain range in R

I have a dataset that looks like the one shown in the code.
What I am guaranteed is that the "(var)x" (domain) values are always between 0 and 1. The "(var)y" (co-domain) values are also bounded, but within a larger range.
I am trying to average the y-values across the different variables over a common set of x-values.
I would like some kind of selective averaging, not sure how to do this in R.
ax=c(0.11,0.22,0.33,0.44,0.55,0.68,0.89)
ay=c(0.2,0.4,0.5,0.42,0.5,0.43,0.6)
bx=c(0.14,0.23,0.46,0.51,0.78,0.91)
by=c(0.1,0.2,0.52,0.46,0.4,0.41)
qx=c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
qy=c(0.03,0.2,0.52,0.4,0.45,0.48,0.61,0.9)
a<-list(ax,ay)
b<-list(bx,by)
q<-list(qx,qy)
What I would like to have is something like
avgd_x = c(0.12,0.27,0.36,0.48,0.51,0.76,0.79,0.97)
and avgd_y would be computed by finding the values of ay and by at each x in avgd_x (for example 0.12) and taking the mean of those interpolated values together with qy at that point, and so forth for all the values in the vector with the largest number of elements.
How can I do this in R?
P.S.: This is a toy dataset; my real dataset is spread over files that I read with a custom function, but the raw data is available as shown in the code above.
Edit:
Some clarification:
avgd_y would have the length of the largest vector. In the example above, avgd_y would be (ay' + by' + qy)/3, where ay' and by' are versions of ay and by interpolated at the data points of qx, i.e. ay'[i] = ay(qx[i]) and by'[i] = by(qx[i]) for i from 1 to length(qx).
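A sketch of that interpolation approach using base R's approx() (an assumption on my part that linear interpolation is acceptable; it reuses the vectors defined in the code above):

# interpolate ay and by at the x-positions of the longest vector, qx;
# rule = 2 carries the boundary values outward so no NA is produced
ay_interp <- approx(ax, ay, xout = qx, rule = 2)$y
by_interp <- approx(bx, by, xout = qx, rule = 2)$y

avgd_x <- qx
avgd_y <- (ay_interp + by_interp + qy) / 3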

Filtering grouped data in R

I was wondering if anyone can help with grouping the data below. I'm trying to use the subset function to filter out volumes below a certain threshold, but since the data represent groups of objects, this removes certain items that should be kept.
In column F (and I) you can see Blue, Red, and Yellow objects. Each represents one of three separate colored probes on one DNA strand. Odd-numbered or non-numbered Blue, Red, and Yellow objects are paired with a homologous strand represented by an even-numbered Blue, Red, and Yellow. I.e., the data in rows 2, 3, and 4 are one "group" and pair with the "group" in rows 5, 6, and 7. This repeats, so rows 8, 9, and 10 form a new group that pairs with the one in rows 11, 12, and 13.
What I would like to do is subset the groups so that only those below a certain distance to midpoint (column M) are kept. The midpoint here is the midpoint of the line connecting the blue of one group with the blue of its partner, so the subset should apply only to the Blue distance to midpoint, and that is where I'm having a problem. For instance, if I ask to keep blue distances to midpoint that are less than 3, then the objects in rows 3 and 4 should also be kept, because they are part of the group whose blue distance is below 3. Right now, when I filter with the subset function, I lose the Red and Yellow selections. I'm confident there is a straightforward solution to this in R, but I'd also be open to some type of filtering in Excel if anyone has suggestions via that route instead.
EDIT
I managed to work something out in Excel last night after posting the question. The solution isn't pretty, but it works well enough. I added a new column next to "distance to midpoint" that gives all the objects in one group the same distance, so that filtering the data won't drop any objects it shouldn't. If it helps anyone in the future, the formula I used in Excel was:
=SQRT((INDEX($B$2:$B$945,1+QUOTIENT(ROWS(B$2:B2)-1,3)*3)-INDEX($O$2:$O$945,1+QUOTIENT(ROWS(O$2:O2)-1,3)*3))^2 + (INDEX($C$2:$C$945,1+QUOTIENT(ROWS(C$2:C2)-1,3)*3)-INDEX($P$2:$P$945,1+QUOTIENT(ROWS(P$2:P2)-1,3)*3))^2 + (INDEX($D$2:$D$945,1+QUOTIENT(ROWS(D$2:D2)-1,3)*3)-INDEX($Q$2:$Q$945,1+QUOTIENT(ROWS(Q$2:Q2)-1,3)*3))^2)
This would be easier with a reproducible example, but here's a (hacky) dplyr solution:
library(dplyr)

filterframe <- function(df, threshold) {
  # one group id per 6-row block (an object triplet plus its partner triplet)
  df$grouper <- rep(seq_len(nrow(df) / 6), each = 6)
  df %>%
    group_by(grouper) %>%
    # keep a block only if its first (blue) row is within the threshold
    filter(first(distance_to_midpoint) < threshold) %>%
    ungroup()
}
filterframe(mydata, threshold = 3)
A base R solution is provided below. The idea is that once your data are in R, you keep rows iff they meet 2 criteria. First, the Surpass column has to contain the word "blue", which is checked with the grepl function. Second, the distance must be below a certain threshold (set arbitrarily by thresh).
fakeData <- data.frame(Surpass  = c('blue', 'red', 'green', 'blue'),
                       distance = c(1, 2, 5, 3),
                       num      = c(90, 10, 9, 4))

# thresh is your distance threshold
thresh <- 2

# keep rows whose Surpass value contains "blue" and whose distance is
# below the threshold
fakeDataBlue <- fakeData[which(grepl('blue', fakeData$Surpass)
                               & fakeData$distance < thresh), ]
There's probably also a quick dplyr solution using filter, but I haven't fully explored the functionality there. Also, I may be a bit confused about whether you also want to keep the other colors. If so, that's the same as saying you want to remove the blue rows exceeding a certain distance threshold, which you would do with a -which command, turning the < operator into a > operator.
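For what it's worth, that dplyr version might look like this (a sketch, reusing the fakeData and thresh objects from the base R example above):

library(dplyr)

fakeDataBlue <- fakeData %>%
  filter(grepl('blue', Surpass), distance < thresh)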

R: how to divide a vector of values into fixed number of groups, based on smallest distance?

I think I have a rather simple problem, but I can't figure out the best approach. I have a vector with 30 different values. I need to divide the vector into 10 groups in such a way that the mean within-group variance is as small as possible. The size of the groups is not important; it can be anything between one and 21.
Example. Let's say I have vector of six values, that I have to split into three groups:
Myvector <- c(0.88,0.79,0.78,0.62,0.60,0.58)
Obviously the solution would be:
Group1 <-c(0.88)
Group2 <-c(0.79,0.78)
Group3 <-c(0.62,0.60,0.58)
Is there a function that gives the same outcome as the example and that I can use for my vector with 30 values?
Many thanks in advance.
It sounds like you want to do k-means clustering. Something like this would work:
kmeans(Myvector, 3, algorithm = "Lloyd")
Note that I changed the default algorithm to match your desired output. If you read the ?kmeans help page, you will see that there are several algorithms for computing the clusters, because it is not a trivial computational problem. They do not necessarily guarantee optimality.
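A worked run on the toy vector from the question (note that cluster labels are arbitrary, and with random starts the result is not guaranteed to be optimal in general):

Myvector <- c(0.88, 0.79, 0.78, 0.62, 0.60, 0.58)

fit <- kmeans(Myvector, centers = 3, algorithm = "Lloyd")

# list the values belonging to each cluster
split(Myvector, fit$cluster)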

Dealing with a data table with redundant rows

The title is not precisely stated, but I could not come up with other words that summarize what exactly I am going to ask.
I have a table of the following form:
value (0 < v < 1)    # of events
0.5677               100000
0.5688               5000
0.1111               6000
...                  ...
0.5688               200000
0.1111               35000
Here are some of the things I would like to do with this table: draw the histogram, compute the mean value, fit the distribution, etc. So far, I have only figured out how to do these with vectors like
v = c(0.5677, ..., 0.5688, ..., 0.1111, ...)
but not with tables.
Since the number of possible values is huge (the variable is almost continuous), I guess making a new table would not be very effective, so doing this without modifying the original table or creating another one would be highly desirable. But if it has to be done that way, it's okay. Thanks in advance.
Appendix: what I want to figure out is how to treat this table as an ordinary data vector.
If I had the following vector representing the exact same data as above:
v = c(0.5677, ..., 0.5677,       # 100000 times
      0.5688, ..., 0.5688,       # 5000 + 200000 times
      0.1111, ..., 0.1111, ...)  # 6000 + 35000 times
then I would just need to apply basic functions like plot and mean to get what I want. I hope this makes my question clearer.
Your data consist of a value and a count for that value, so you are looking for functions that use the count to weight the value. Type ?weighted.mean for information on a function that computes the mean of weighted (grouped) data. For density plots, use the weights= argument of the density() function. For the histogram, use cut() to combine values into a small number of groups and then aggregate() to sum the counts for all the values in each group. You will find a variety of weighted statistical measures in the Hmisc package (wtd.mean, wtd.var, wtd.quantile, etc.).
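A minimal sketch of that approach (assuming the table lives in a data frame named tab with columns value and events; both names are mine, not from the question):

tab <- data.frame(value  = c(0.5677, 0.5688, 0.1111, 0.5688, 0.1111),
                  events = c(100000, 5000, 6000, 200000, 35000))

weighted.mean(tab$value, tab$events)                 # weighted mean

# weighted density plot: density() expects weights that sum to 1
plot(density(tab$value, weights = tab$events / sum(tab$events)))

# histogram-style summary: bin the values, then sum the counts per bin
tab$bin <- cut(tab$value, breaks = seq(0, 1, by = 0.1))
aggregate(events ~ bin, data = tab, FUN = sum)

# Hmisc offers more weighted measures, e.g.
# library(Hmisc)
# wtd.var(tab$value, weights = tab$events)
# wtd.quantile(tab$value, weights = tab$events)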
