Extracting unique combination rows from a data frame in R

I have a data frame that gives pairwise correlations of scores people in the same state had provided.
I am giving a small example of what I wish to do with this data, but right now my actual data set has 15 million rows for pairwise correlations and many more additional columns.
Below is the example data:
> sample_data
  Pair_1ID Pair_2ID CORR
1        1        2 0.12
2        1        3 0.23
3        2        1 0.12
4        2        3 0.75
5        3        1 0.23
6        3        2 0.75
I want to generate a new data frame without duplicates. For example, row 1 gives the correlation between persons 1 and 2 (0.12), and row 3 repeats exactly the same information as the correlation between persons 2 and 1. Since each pair appears twice, I would like a final file with one row per pair, like the one below:
> output
  Pair_1ID Pair_2ID CORR
1        1        2 0.12
2        1        3 0.23
3        2        3 0.75
Can someone help? unique() won't work here, because the swapped IDs make the two rows of a pair look distinct, and I don't know how else to do it.

Assuming every combination shows up twice:
subset(sample_data , Pair_1ID <= Pair_2ID)
If not:
unique(transform(sample_data,
                 Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                 Pair_2ID = pmax(Pair_1ID, Pair_2ID)))
Edit: regarding that last one, including CORR in the unique() comparison is not a great idea because of possible floating-point issues, and you mention you have a lot more columns. So it would be better to limit the comparison to the two IDs:
relabeled <- transform(sample_data,
                       Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                       Pair_2ID = pmax(Pair_1ID, Pair_2ID))
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
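To make the recipe concrete, here is a minimal, self-contained sketch that rebuilds the example data and runs the relabel-then-deduplicate approach end to end. (Note that transform() evaluates both the pmin and pmax expressions against the original columns, so the in-place swap is safe.)
# rebuild the example data
sample_data <- data.frame(
  Pair_1ID = c(1, 1, 2, 2, 3, 3),
  Pair_2ID = c(2, 3, 1, 3, 1, 2),
  CORR     = c(0.12, 0.23, 0.12, 0.75, 0.23, 0.75)
)
# relabel each pair so the smaller ID always comes first
relabeled <- transform(sample_data,
                       Pair_1ID = pmin(Pair_1ID, Pair_2ID),
                       Pair_2ID = pmax(Pair_1ID, Pair_2ID))
# keep the first occurrence of each (smaller, larger) pair
subset(relabeled, !duplicated(cbind(Pair_1ID, Pair_2ID)))
#   Pair_1ID Pair_2ID CORR
# 1        1        2 0.12
# 2        1        3 0.23
# 4        2        3 0.75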

flodel's answer is excellent. I just want to add another solution based on indexing, without looking at the actual values. It only works if all combinations are present and the data frame is sorted by column 1 first and column 2 second (as in the example).
maxVal <- max(sample_data$Pair_1ID)   # number of distinct IDs (here 3)
shrtIdx <- logical(maxVal)            # a chunk of maxVal FALSEs to overwrite
# one column per chunk of maxVal rows: column k keeps its first maxVal - k entries
idx <- sapply(seq(maxVal - 1, 1), function(x) replace(shrtIdx, seq(x), TRUE))
sample_data[idx, ]                    # flattened column-wise, this picks the unique pairs
# Pair_1ID Pair_2ID CORR
# 1 1 2 0.12
# 2 1 3 0.23
# 4 2 3 0.75
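At 15 million rows, these base R approaches will work but may be slow. A hedged data.table sketch of the same relabel-then-deduplicate idea (assuming the data fits in memory) could look like this:
library(data.table)
dt <- as.data.table(sample_data)
# add canonicalized ID columns, then keep one row per unordered pair
dt[, `:=`(lo = pmin(Pair_1ID, Pair_2ID), hi = pmax(Pair_1ID, Pair_2ID))]
result <- unique(dt, by = c("lo", "hi"))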

Related

How do I describe() a column while meeting criteria for a unique value in another column?

I'm very new to R, so excuse any incorrect language. I'm not sure if I even asked this question correctly, but here is the problem I'm dealing with.
Suppose I have a data frame that contains lengths and weights for 10 different species of fish, with 100 samples for each species (1,000 rows of data). Is it possible to get the output of describe() on a column for each unique species of fish without having to create an object for each species?
For example if I write:
Catfish <- filter(dataframe, species == "Catfish")   # assuming a species column
describe(Catfish$lengths)
Do I have to manually create an object (Catfish, for example) for each species and then describe it? Or is there a simpler way to return describe() for the lengths of each unique species directly from my original dataframe? Hopefully I asked this clearly enough. Thanks for any help!
I think what you might want to look into is a split-apply-combine technique (example below)
df
value ID
1 1 ID
2 2 ID
3 3 PD
4 4 PD
5 5 ID
# split by the grouping variable (in your case, the fish species)
df_split <- split(df, df$ID)
# remove ID so the describe() outputs merge cleanly
df_split <- lapply(df_split, function(x) { x["ID"] <- NULL; x })
# apply a function (in your case, describe)
df_split <- lapply(df_split, describe)
#combine
Result <- Reduce(rbind, df_split)
Result
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3 2.67 2.08 2.0 2.67 1.48 1 5 4 0.29 -2.33 1.2
X11 1 2 3.50 0.71 3.5 3.50 0.74 3 4 1 0.00 -2.75 0.5
What would improve this script is to add the specific grouping variable to each row of the result (so the ID values in this example); a sketch of that follows. Either way, I think this provides a starting point for you.
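For example, here is a hedged sketch of that improvement, assuming describe() comes from the psych package: the names of the split list carry the group labels, so they can be copied back onto the combined result.
library(psych)
df_split <- split(df["value"], df$ID)   # drop ID before describing
described <- lapply(df_split, describe)
Result <- do.call(rbind, described)
# copy the group labels back from the names of the split list
Result$ID <- rep(names(described), sapply(described, nrow))
Result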

Removing/collapsing duplicate rows in R

I am using the following R code, which I copied from elsewhere (https://support.bioconductor.org/p/70133/). It seems to work well for what I hope to do (remove/collapse duplicates from a dataset), but I do not understand the last line. On what basis are the duplicates removed/collapsed? A comment there said it was based on the median absolute deviation (MAD), but I am not following that. Could anyone help me understand this, please?
Probesets <- paste("a", 1:200, sep = "")
Genes <- sample(letters, 200, replace = TRUE)
Value <- rnorm(200)
X <- data.frame(Probesets, Genes, Value)
X <- X[order(X$Value, decreasing = TRUE), ]
Y <- X[which(!duplicated(X$Genes)), ]
Are you sure you want to remove those rows where the Genes values are duplicated? That's at least what this code does:
Y <- X[which(!duplicated(X$Genes)), ]
Thus, Y contains only unique Genes values. If you compare nrow(Y) and length(unique(X$Genes)) you will see that the result is the same:
nrow(Y); length(unique(X$Genes))
[1] 26
[1] 26
If you want to remove rows that contain duplicate values across all columns, which is arguably the definition of a duplicate row, then you can do this:
Y <- X[!duplicated(X), ]
To see how it works consider this example:
df <- data.frame(
  a = c(1, 1, 2, 3),
  b = c(1, 1, 3, 4)
)
df
a b
1 1 1
2 1 1
3 2 3
4 3 4
df[!duplicated(df),]
a b
1 1 1
3 2 3
4 3 4
Your code keeps, for each gene, the record with the maximum Value: because X is sorted by Value in decreasing order, !duplicated(X$Genes) retains the first (that is, highest-Value) row per gene. If Value held each probeset's MAD, as the comment in the original post suggests, this would keep the probeset with the largest MAD per gene.
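As a cross-check, here is a hedged dplyr sketch of the same keep-the-maximum collapse (assuming dplyr >= 1.0 for slice_max()), which does not rely on sorting the data first:
library(dplyr)
Y2 <- X %>%
  group_by(Genes) %>%
  slice_max(Value, n = 1, with_ties = FALSE) %>%   # highest Value per gene
  ungroup()
nrow(Y2) == length(unique(X$Genes))   # TRUE: one row per gene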

In R, how to take the mean of a varying number of elements for each row in a data frame?

So I have a data frame of p-values, PVALUES, like this:
PVALUES <- read.csv(textConnection("PVAL1 PVAL2 PVAL3
0.1 0.04 0.02
0.9 0.001 0.98
0.03 0.02 0.01"), sep = " ")
That corresponds to another dataframe, DATA, like this:
DATA <- read.csv(textConnection("COL1 COL2 COL3
10 2 9
11 20 200
2 3 5"), sep = " ")
For every row in DATA, I'd like to take the mean of the numbers whose indices correspond to entries in PVALUES that are <= 0.05.
So, for example, the first row of PVALUES has two entries <= 0.05, at [1,2] and [1,3]. Therefore, for the first row of DATA, I want the mean of 2 and 9.
In the second row of PVALUES, only the entry at [2,2] is <= 0.05, so instead of taking a mean for the second row of DATA, I would just use DATA[2,2] (the value 20).
So, my output would look like:
MEANS
6.5
20
3.33
I thought I might be able to generate indices for every entry in PVALUES <= 0.05 and then use those to select the entries in DATA for the means. I tried this command to generate the indices:
exp <- which(PVALUES[,] <= 0.05, arr.ind = TRUE)
...but it only seems to pick up indices for entries in the first column that are <= 0.05. In my example above, it would only output [3,1].
Can anyone see what I'm doing wrong, or does anyone have ideas on how to tackle this problem?
Thank you!
It's a bit funny looking, but this should work:
rowMeans(`is.na<-`(DATA, PVALUES >= 0.05), na.rm = TRUE)
The "ugly" part is calling the replacement function is.na<- directly, without doing an actual replacement assignment; here we just set all values whose p-values are at or above 0.05 to missing and then take the row means.
It's unclear to me exactly what you were doing with exp, but that type of method could work as well. Maybe with
expx <- which(PVALUES <= 0.05, arr.ind = TRUE)
aggregate(val ~ row, data.frame(expx, val = DATA[expx]), mean)
(renamed so as not to interfere with the built-in exp() function)
Tested with
PVALUES<-read.table(text="PVAL1 PVAL2 PVAL3
0.1 0.04 0.02
0.9 0.001 0.98
0.03 0.02 0.01", header=T)
DATA<-read.table(text="COL1 COL2 CO3
10 2 9
11 20 200
2 3 5", header=T)
I usually enjoy MrFlick's responses, but the use of is.na<- in that manner seems to violate my expectations of R code, because it destructively modifies the data. I admit that I probably should have been expecting that possibility because of the assignment arrow, but it surprised me nonetheless. (I don't object to data.table code, because it is honest and forthright about modifying its contents with the := function.) I also admit that my efforts to improve on it led me down a rabbit hole where I found this equally "baroque" effort. (Note: you have incorrectly averaged 2 and 9 in your expected output; the mean is 5.5, not 6.5.)
sapply(split(DATA[which(PVALUES <= 0.05, arr.ind = TRUE)],
             which(PVALUES <= 0.05, arr.ind = TRUE)[, "row"]),
       mean)
1 2 3
5.500000 20.000000 3.333333

compare row values over multiple rows (R)

I don't think this question has been asked yet (most similar questions are about extracting data or returning a count). I am new to R, so any help would be appreciated!
I have a dataset with multiple runs of an experiment in one file, and the data look like this, with all the time steps for each run in rows:
time [info] id (unique per run)
I am attempting to calculate when the system reaches equilibrium, which I am defining as stable values in 3 interdependent parameters. I would like the contents of rows to be compared and, if they are within 5% of each other over 20 time steps, to have the time step at which the stability begins returned, along with the id.
So far, I'm thinking it will be something like the following, or maybe a while loop (sorry for the bad formatting; this is pseudocode, not valid R):
y <- 1   # comparison row
z <- 0   # counts consecutive matching periods
x <- 0   # anchor row
for (ID) {
  if (CC at time x is within +/- 0.05 of CC at time y) {
    if (z <= 20) {   # catalog the number of periods that match
      y <- y + 1
      z <- z + 1
    } else {
      # save the value in a column
    }
  } else {   # no match for a sustained period, so start over
    x <- x + 1
    y <- x + 1
    z <- 0
  }
}
Edited to add: CC is one of my parameters of interest; it ranges between 0 and 1, although the endpoints are unlikely.
Here's a simple example that might help; this is something like how my data looks:
zz <- textConnection("time CC ID
1 0.99 1
2 0.80 1
3 0.90 1
4 0.91 1
5 0.92 1
6 0.91 1
1 0.99 2
2 0.90 2
3 0.90 2
4 0.91 2
5 0.92 2
6 0.91 2")
Data <- read.table(zz, header = TRUE)
close(zz)
My question is: how can I run through the rows to find out when the value of CC becomes 'stable' (meaning it doesn't change by more than 0.05 over X time steps; here, X = 3), so that it would create the following result:
ID timeToEQ
1 1 3
2 2 2
Does this help? The only way I can think to do this is with a for loop, and I think there must be an easier way!
Here is my code; the explanation follows below.
require(plyr)
ddply(Data, .(ID), summarize, timeToEQ = Position(isTRUE, abs(diff(CC)) < 0.05 ))
ID timeToEQ
1 1 3
2 2 2
EDIT. Here is how it works.
ddply breaks Data into subsets based on ID.
diff(CC) computes the difference between CC of successive rows.
abs(diff(CC)) < 0.05 returns TRUE where the difference has stabilized, i.e. successive values differ by less than 0.05.
Position locates the first instance of an element which satisfies isTRUE.
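For what it's worth, here is a hedged base-R equivalent of the same split-apply-combine, with the caveat that (like the ddply version) it flags the first small successive difference rather than demanding X consecutive ones:
# split by ID, find the first stable step per run, and bind the results
do.call(rbind, lapply(split(Data, Data$ID), function(d) {
  data.frame(ID = d$ID[1],
             timeToEQ = Position(isTRUE, abs(diff(d$CC)) < 0.05))
}))
#   ID timeToEQ
# 1  1        3
# 2  2        2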

How do I perform a function on each row of a data frame and have just one element of the output inserted as a new column in that row?

It is easy to do an exact binomial test on two values, but what happens if one wants to do the test on a whole collection of numbers of successes and numbers of trials? I created a data frame of test sensitivities and potential numbers of enrollees in a study, and then for each row I calculate how many successes that would be. Here is the code.
sens <- seq(from = 0.1, to = 0.5, by = 0.05)
enroll <- seq(from = 20, to = 200, by = 20)
df <- expand.grid(sens = sens, enroll = enroll)
df <- transform(df, succes = sens * enroll)
But now how do I use each row's combination of successes and number of trials to do the binomial test?
I am only interested in the upper limit of the 95% confidence interval of the binomial test. I want that single number to be added to the data frame as a column called "upper.limit".
I thought of something along the lines of
binom.test(succes,enroll)$conf.int
alas, conf.int gives something such as
[1] 0.1266556 0.2918427
attr(,"conf.level")
[1] 0.95
All I want is just 0.2918427
Furthermore, I have a feeling that there has to be a do.call in there somewhere, and maybe even an lapply, but I do not know how that would go through the whole data frame. Or should I perhaps be using plyr?
Clearly my head is spinning. Please make it stop.
If this gives you (almost) what you want, then try this:
binom.test(succes,enroll)$conf.int[2]
And apply across the board or across the rows as it were:
> df$UCL <- apply(df, 1, function(x) binom.test(x[3],x[2])$conf.int[2] )
> head(df)
sens enroll succes UCL
1 0.10 20 2 0.3169827
2 0.15 20 3 0.3789268
3 0.20 20 4 0.4366140
4 0.25 20 5 0.4910459
5 0.30 20 6 0.5427892
6 0.35 20 7 0.5921885
Here you go:
R> newres <- do.call(rbind, apply(df, 1, function(x) {
+ bt <- binom.test(x[3], x[2])$conf.int;
+ newdf <- data.frame(t(x), UCL=bt[2]) }))
R>
R> head(newres)
sens enroll succes UCL
1 0.10 20 2 0.31698
2 0.15 20 3 0.37893
3 0.20 20 4 0.43661
4 0.25 20 5 0.49105
5 0.30 20 6 0.54279
6 0.35 20 7 0.59219
R>
This uses apply() to loop over the rows of your existing data, compute the test, and return the value you want tucked into a new (one-row) data.frame. We then glue all ninety of those one-row data.frame objects into a single new one with do.call(rbind, ...) over the list we got from apply().
Ah yes, if you just want to insert a single column directly, the other answer rocks in its simplicity. My longer answer shows how to grow or construct a data.frame during the sweep of apply.
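One more hedged variant: since apply() coerces the data frame to a matrix row by row, an mapply() over the two relevant columns avoids that coercion and reads a little more directly:
# upper 95% confidence limit for each (successes, trials) pair
df$UCL <- mapply(function(s, n) binom.test(s, n)$conf.int[2],
                 df$succes, df$enroll)
head(df)   # same UCL values as above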
