Computing normalized Euclidean distance in R - r

The data frame I have is as follows:
Binning_data[1:4,]
person_id V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1 312 74 80 NA 87 90.0 85 88 98 96.5 99 94 95 90 90 93 106
2 316 NA NA 116 106 105.0 110 102 105 105.0 102 98 101 98 92 89 91
3 318 71 61 61 61 60.5 68 62 67 64.0 60 59 60 62 59 63 63
4 319 64 NA 80 80 83.0 84 87 83 85.0 88 87 95 74 70 63 83
I would like to compute the Euclidean distance of a given 'index_person_id' (say 312) with all the other person_id while omitting all NAs.
For example: Normalized Euclidean distance between "312" and "316" should omit the first 3 bins (V1,V2,V3) because atleast one of the two rows has NAs. It should just compute the Euclidean distance from 4th bin to 16th bin and divide by 13 (number of non empty bins)
Dimension of Binning_Data is 10000*17.
The output file should be of size 10000*2 with the first column being the person_id and the second column being the 'normalized Euclidean distance'.
I am currently using sapply for this purpose:
index_person<-binning_data[which(binning_data$person_id==index_person_id),]
non_empty_index_person=which(is.na(index_person[2:ncol(index_person)])==FALSE)
distance[,2]<-sapply(seq_along(binning_data$person_id),function(j) {
compare_person<-binning_data[j,]
non_empty_compare_person=which(is.na(compare_person[2:ncol(compare_person)])==FALSE)
non_empty=intersect(non_empty_index_person,non_empty_compare_person)
distance_temp=(index_person[non_empty+1]-compare_person[non_empty+1])^2
as.numeric(mean(distance_temp))
})
This seems to take a considerable amount of time. Is there a better way to do this?

If I run your code I get:
0.0000 146.0192 890.9000 200.8750
If you convert your data frame into a matrix, transpose, then you can subtract columns and then use na.rm=TRUE on mean to get the distances you want. This can be done over columns using colMeans. Here for row II of your sample data:
> II = 1
> m = t(as.matrix(binning_data[,-1]))
> colMeans((m - m[,II])^2, na.rm=TRUE)
1 2 3 4
0.0000 146.0192 890.9000 200.8750
Your 10000x2 matrix is then (where here 10000==4):
> cbind(II,colMeans((m - m[,II])^2, na.rm=TRUE))
II
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
If you want to compute this for a given list of indexes, loop it, perhaps like this with an lapply and an rbind putting it all back together again as a data frame for a change:
II = c(1,2,1,4,4)
do.call(rbind,lapply(II, function(i){data.frame(i,d=colMeans((m-m[,i])^2,na.rm=TRUE))}))
i d
1 1 0.0000
2 1 146.0192
3 1 890.9000
4 1 200.8750
11 2 146.0192
21 2 0.0000
31 2 1595.0179
41 2 456.7143
12 1 0.0000
22 1 146.0192
32 1 890.9000
42 1 200.8750
13 4 200.8750
23 4 456.7143
33 4 420.8833
43 4 0.0000
14 4 200.8750
24 4 456.7143
34 4 420.8833
44 4 0.0000
That's a 4 x length(II)-row matrix

Related

Select rows based on a specified value (Difference between max score of 19+ for second highest value)

I have a list of 50 meditation techniques that I am classifying into one of 3 categories based on ratings by 92 people. I have calculated the difference between the 'max' value in each row from the other 2 categories with ratings.
I now want to select the specific rows where the difference between the 2nd highest rating and the max value is greater than 19 (so 20+).
Looking at the table below for MATKO_NEWBERG_01 the highest rating is for the CDM category with 64 and the second highest rating is the NDM cateory with 12. This gives a difference of 52 (Value2_NDM) which is clearly above my 20 threshold I desire. I would like to therefore keep MATKO_NEWBERG_01 row in the dataframe as it satisfies this criteria. For MATKO_NEWBERG_07 you can see that the second highest rating (NDM = 20) only exhibits a difference value from the max (CDM = 23) of 3 well below by desired threshold of 20, therefore I would like to remove this. And the same is true for MATKO_NEWBERG_03 and _05.
Med_Technique
NDM
CDM
ADM
Value2_NDM
Value2_CDM
Value2_ADM
MATKO_NEWBERG_01
12
64
8
52
NA
56
MATKO_NEWBERG_02
5
76
9
71
NA
67
MATKO_NEWBERG_03
20
45
27
25
NA
18
MATKO_NEWBERG_04
6
73
12
67
NA
61
MATKO_NEWBERG_05
6
37
47
41
10
NA
MATKO_NEWBERG_06
6
6
78
72
72
NA
MATKO_NEWBERG_07
20
23
18
3
NA
5
Desired output:
Med_Technique
NDM
CDM
ADM
Value2_NDM
Value2_CDM
Value2_ADM
MATKO_NEWBERG_01
12
64
8
52
NA
56
MATKO_NEWBERG_02
5
76
9
71
NA
67
MATKO_NEWBERG_04
6
73
12
67
NA
61
MATKO_NEWBERG_06
6
6
78
72
72
NA
Thanks for reading
Using your Value2 columns, you could do:
dat[apply(dat[5:7], 1, min, na.rm = T) >= 20,]
#or
dat[do.call(pmin, c(dat[5:7], list(na.rm = TRUE))) >= 20,]
Med_Technique NDM CDM ADM Value2_NDM Value2_CDM Value2_ADM
1 MATKO_NEWBERG_01 12 64 8 52 NA 56
2 MATKO_NEWBERG_02 5 76 9 71 NA 67
4 MATKO_NEWBERG_04 6 73 12 67 NA 61
6 MATKO_NEWBERG_06 6 6 78 72 72 NA
Here's one other way that does not use the last columns. For each row (apply with MARGIN = 1) compute the absolute difference (dist) between the first and second highest value (sort(x, decreasing = T)[1:2]). Look whether it is higher >= 20.
idx = apply(dat[2:4], 1, \(x) dist(sort(x, decreasing = T)[1:2])) >= 20
# [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE
dat[idx, ]
Med_Technique NDM CDM ADM Value2_NDM Value2_CDM Value2_ADM
1 MATKO_NEWBERG_01 12 64 8 52 NA 56
2 MATKO_NEWBERG_02 5 76 9 71 NA 67
4 MATKO_NEWBERG_04 6 73 12 67 NA 61
6 MATKO_NEWBERG_06 6 6 78 72 72 NA
#Also works (maybe less intuitive, but shorter)
idx = apply(dat[2:4], 1, \(x) diff(sort(x))[2]) >= 20

Script out of bounds in R

I am using a code based on Deseq2. One of my goals is to plot a heatmap of data.
heatmap.data <- counts(dds)[topGenes,]
The error I am getting is
Error in counts(dds)[topGenes, ]: subscript out of bounds
the first few line sof my counts(dds) function looks like this.
99h1 99h2 99h3 99h4 wth1 wth2
ENSDARG00000000002 243 196 187 117 91 96
ENSDARG00000000018 42 55 53 32 48 48
ENSDARG00000000019 91 91 108 64 95 94
ENSDARG00000000068 3 10 10 10 30 21
ENSDARG00000000069 55 47 43 53 51 30
ENSDARG00000000086 46 26 36 18 37 29
ENSDARG00000000103 301 289 289 199 347 386
ENSDARG00000000151 18 19 17 14 22 19
ENSDARG00000000161 16 17 9 19 10 20
ENSDARG00000000175 10 9 10 6 16 12
ENSDARG00000000183 12 8 15 11 8 9
ENSDARG00000000189 16 17 13 10 13 21
ENSDARG00000000212 227 208 259 234 78 69
ENSDARG00000000229 68 72 95 44 71 64
ENSDARG00000000241 71 92 67 76 88 74
ENSDARG00000000324 11 9 6 2 8 9
ENSDARG00000000370 12 5 7 8 0 5
ENSDARG00000000394 390 356 339 283 313 286
ENSDARG00000000423 0 0 2 2 7 1
ENSDARG00000000442 1 1 0 0 1 1
ENSDARG00000000472 16 8 3 5 7 8
ENSDARG00000000476 2 1 2 4 6 3
ENSDARG00000000489 221 203 169 144 84 114
ENSDARG00000000503 133 118 139 89 91 112
ENSDARG00000000529 31 25 17 26 15 24
ENSDARG00000000540 25 17 17 10 28 19
ENSDARG00000000542 15 9 9 6 15 12
How do I ensure all the elements of the top genes are present in it?
When I try to see 20 top genes in the dataset. it looks like a list of genes
6339" "12416" "1241" "3025" "12791" "846" "15090"
[8] "6529" "14564" "4863" "12777" "1122" "7454" "13716"
[15] "5790" "3328" "1231" "13734" "2797" "9072" with the column head V1.
I have used both
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = TRUE)
and
topGenes <- read.table("E://mir99h50 Cheng data//topGenesresordered.txt",header = FALSE)
to see if the out of bounds error is removed. However it was of no use. I guess the V1 head is causing the issue.
The top genes function has been generated using the above code snippet.
resordered <- res[order(res$padj),]
#Reorder gene list by increasing pAdj
resordered <- as.data.frame(res[order(res$padj),])
#Filter for genes that are differentially expressed with an FDR < 0.01
ii <- which(res$padj < 0.01)
length(ii)
# Use the rownames() function to get the top 20 differentially expressed genes from our results table
topGenes <- rownames(resordered[1:20,])
topGenes
# Get the counts from the DESeqDataSet using the counts() function
heatmap.data <- counts(dds)[topGenes,]
Perhaps this will do what you want?
counts_dds <- counts(dds)
topgenes <- c("ENSDARG00000000002", "ENSDARG00000000489", "ENSDARG00000000503",
"ENSDARG00000000540", "ENSDARG00000000529", "ENSDARG00000000542")
heatmap.data <- counts_dds[rownames(counts_dds) %in% topgenes,]
If you provide more information it will be easier to advise you on how to fix your problem.

How can I use a loop in R to create new, labeled variables and write them into a .csv?

I have a table with eighty columns and I want to create columns by multiplying var1*var41 var1*var42....var1*var80. var2*var41 var2*var42...var2*var80. How could I write a loop to multiply the columns and write the labeled product into a .csv? The result should have 1600 additional columns.
I took a stab at this with some fake data:
# Fake data (arbitraty 5 rows)
mtx <- sample(1:100, 5 * 80, replace = T)
dim(mtx) <- c(5,80)
colnames(mtx) <- paste0("V", 1:ncol(mtx)) # Name the original columns
mtx[1:5,1:5]
# V1 V2 V3 V4 V5
#[1,] 8 10 69 84 92
#[2,] 59 34 36 96 86
#[3,] 51 26 78 63 8
#[4,] 74 93 73 70 49
#[5,] 62 30 20 43 9
Using a for loop, one might try something like this:
v <- expand.grid(1:40,41:80) # all combos
v[c(1:3,1598:1600),]
# Var1 Var2
#1 1 41
#2 2 41
#3 3 41
#1598 38 80
#1599 39 80
#1600 40 80
# Initialize matrix for multiplication results
newcols <- matrix(NA, nrow = nrow(mtx), ncol = nrow(v))
# Run the for loop
for(i in 1:nrow(v)) newcols[,i] <- mtx[,v[i,1]] * mtx[,v[i,2]]
# save the names as "V1xV41" format with apply over rows (Margin = 1)
# meaning, for each row in v, paste "V" in front and "x" between
colnames(newcols) <- apply(v, MARGIN = 1, function(eachv) paste0("V", eachv, collapse="x"))
# combine the additional 1600 columns
tocsv <- cbind(mtx, newcols)
tocsv[,78:83] # just to view old and new columns
# V78 V79 V80 V1xV41 V2xV41 V3xV41
#[1,] 17 92 13 429 741 1079
#[2,] 70 94 1 4836 4464 5115
#[3,] 6 77 93 3740 1020 3468
#[4,] 88 34 26 486 258 66
#[5,] 48 77 61 873 4365 970
# Write it
write.csv(tocsv, "C:/Users/Evan Friedland/Documents/NEWFILENAME.csv")

How to process multi columns data in data.frame with plyr

I am trying to solve the DSC(Differential scanning calorimetry) data with R but it seems that I ran into some troubles. All this used to be done in Origin or Qtiplot tediously in my lab.But I wonder if there is another way to do it in batch.But the result did not goes well. For example, maybe I have used the wrong colnames of my data.frame,the code
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
can not reach my data.
So below is the full description of my purpose, thank you in advance!
the DSC data is like this(I store the CSV file in my GoogleDrive Link ) :
T1 0.5min T2 1min
40.59 -0.2904 40.59 -0.2545
40.81 -0.281 40.81 -0.2455
41.04 -0.2747 41.04 -0.2389
41.29 -0.2728 41.29 -0.2361
41.54 -0.2553 41.54 -0.2239
41.8 -0.07 41.8 -0.0732
42.06 0.1687 42.06 0.1414
42.32 0.3194 42.32 0.2817
42.58 0.3814 42.58 0.3421
42.84 0.3863 42.84 0.3493
43.1 0.3665 43.11 0.3322
43.37 0.3438 43.37 0.3109
43.64 0.3265 43.64 0.2937
43.9 0.3151 43.9 0.2819
44.17 0.3072 44.17 0.2735
44.43 0.2995 44.43 0.2656
44.7 0.2899 44.7 0.2563
44.96 0.2779 44.96 0.245
in fact I have merge the data into a data.frame and hope I can adjust it and do something further.
the command is:
dat<-read.csv("Book1.csv",header=F)
colnames(dat)<-c('T1','0.5min','T2','1min','T3','2min','T4','4min','T5','8min','T6','10min',
'T7','20min','T8','ascast1','T9','ascast2','T10','ascast3','T11','ascast4',
'T12','ascast5'
)
so actually dat is a data.frame with 1163 obs. of 24 variables.
T1,T2,T3.....T12 means temperature that the samples were tested of DSC although in the same interval they do differ a little due to the unstability of the machine.
And the colname along T1~T12 is Heat Flow of different heat treatment durations that records by the machine and ascast1~ascast5 means nothing done to the sample to check the accuracy of the machine.
Now I need to do something like the following:
for T1~T2 is in Celsius Degrees,I need to change them into Kelvin Degrees whichi means every data plus 273.16.
Two temperature is chosen to compare the result that is Ts=180.25,Te=240.45(all is discussed in Celsius Degrees and I have seen it Qtiplot to make sure). To be clear I list the two temperature and the first 6 columns data.
T1 0.5min T2 1min T3 2min T4 4min
180.25 -0.01710000 180.25 -0.01780000 180.25 -0.02120000 180.25 -0.02020000
. . . .
. . . .
240.45 0.05700000 240.45 0.04500000 240.45 0.05780000 240.45 0.05580000
That all Heat Flow in Ts should be the same that can be made 0 for convenience. So based on the different values Heat Flow of different times like 0.5min,1min,2min,4min,8min,10min,20min and ascas1~ascast5 all Heat Flow value should be minus the Heat Flow value in Ts.
And for Heat Flow in Te, the value should be adjust to make sure that all the Heat Flow data are the same in Te. The purpose is like the following, (1) calculate mean of the 12 heat flow data in Te. Let's use Hmean for the mean heat flow.So Hmean is the value that all Heat Flow should be. (2) for data in column 0.5min,I use col("0.5min") to denote, and the lineal transform formula is like the following:
col("0.5min")-[([0.05700000-(-0.01710000)]-Hmean)/(Te-Ts)]*(col(T1)-Ts)
Actually, [0.05700000-(-0.01710000)] is done in step 2,but I write it for your reference. And this formula is used for different pair of T1~T12 and columns,like (T1,0.5min),(T2, 1min),(T3,1min).....all is 12 pairs.
Now we can plot the 12 pairs of data on the same plot with intervals from 180~240(also in Celsius Degrees) to magnify the details of differences between the different scans of DSC.
I have been stuck on this problems for 2 days , so I return to stackoverflow for help.
Thanks!
I am assuming that your question was right in the beginning where you got the following error,
dat$0.5min
Error: unexpected numeric constant in "dat$0.5"
As I could not find a question in the rest of the steps. They just seemed like a step by step procedure of an experiment.
To fix that error, the problem is the column name has a number in it so to use the column name in the way you want (to reference a column), you should use "`", accent mark, symbol.
>dataF <- data.frame("0.5min"=1:10,"T2"=11:20,check.names = F)
> dataF$`0.5min`
[1] 1 2 3 4 5 6 7 8 9 10
Based on comments adding more information,
You can add a constant to add to alternate columns in the following manner,
dataF <- data.frame(matrix(1:100,10,10))
const <- 237
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 11 21 31 41 51 61 71 81 91
2 2 12 22 32 42 52 62 72 82 92
3 3 13 23 33 43 53 63 73 83 93
4 4 14 24 34 44 54 64 74 84 94
5 5 15 25 35 45 55 65 75 85 95
6 6 16 26 36 46 56 66 76 86 96
7 7 17 27 37 47 57 67 77 87 97
8 8 18 28 38 48 58 68 78 88 98
9 9 19 29 39 49 59 69 79 89 99
10 10 20 30 40 50 60 70 80 90 100
dataF[,seq(1,ncol(dataF),by = 2)] <- dataF[,seq(1,ncol(dataF),by = 2)] + const
> print(dataF)
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 238 11 258 31 278 51 298 71 318 91
2 239 12 259 32 279 52 299 72 319 92
3 240 13 260 33 280 53 300 73 320 93
4 241 14 261 34 281 54 301 74 321 94
5 242 15 262 35 282 55 302 75 322 95
6 243 16 263 36 283 56 303 76 323 96
7 244 17 264 37 284 57 304 77 324 97
8 245 18 265 38 285 58 305 78 325 98
9 246 19 266 39 286 59 306 79 326 99
10 247 20 267 40 287 60 307 80 327 100
To generalize, we know that the columns of a dataframe can be referenced with a vector of numbers/column names. Most operations in R are vectorized. You can use column names or numbers based on the pattern you are looking for.
For example, I change the name of my first two columns and want to access just those I do this,
colnames(dataF)[c(1,2)] <- c("Y1","Y2")
#Reference all column names with "Y" in it. You can do any operation you want on this.
dataF[,grep("Y",colnames(dataF))]
Y1 Y2
1 238 11
2 239 12
3 240 13
4 241 14
5 242 15
6 243 16
7 244 17
8 245 18
9 246 19
10 247 20

findInterval() with varying intervals in data.table R

I have asked this question a long time ago, but haven't found the answer yet. I do not know if this is legit in stackoverflow, but I repost it.
I have a data.table in R and I want to create a new column that finds the interval for every price of the respective year/month.
Reproducible example:
set.seed(100)
DT <- data.table(year=2000:2009, month=1:10, price=runif(5*26^2)*100)
intervals <- list(year=2000:2009, month=1:10, interval = sort(round(runif(9)*100)))
intervals <- replicate(10, (sample(10:100,100, replace=T)))
intervals <- t(apply(intervals, 1, sort))
intervals.dt <- data.table(intervals)
intervals.dt[, c("year", "month") := list(rep(2000:2009, each=10), 1:10)]
setkey(intervals.dt, year, month)
setkey(DT, year, month)
I have just tried:
merging the DT and intervals.dt data.tables by month/year,
creating a new intervalsstring column consisting of all the V* columns to
one column string, (not very elegant, I admit), and finally
substringing it to a vector, so as I can use it in findInterval() but the solution does not work for every row (!)
So, after:
DT <- merge(DT, intervals.dt)
DT <- DT[, intervalsstring := paste(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10)]
DT <- DT[, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10") := NULL]
DT[, interval := findInterval(price, strsplit(intervalsstring, " ")[[1]])]
I get
> DT
year month price intervalsstring interval
1: 2000 1 30.776611 12 21 36 46 48 51 63 72 91 95 2
2: 2000 1 62.499648 12 21 36 46 48 51 63 72 91 95 6
3: 2000 1 53.581115 12 21 36 46 48 51 63 72 91 95 6
4: 2000 1 48.830599 12 21 36 46 48 51 63 72 91 95 5
5: 2000 1 33.066053 12 21 36 46 48 51 63 72 91 95 2
---
3376: 2009 10 33.635924 12 40 45 48 50 65 75 90 96 97 2
3377: 2009 10 38.993769 12 40 45 48 50 65 75 90 96 97 3
3378: 2009 10 75.065820 12 40 45 48 50 65 75 90 96 97 8
3379: 2009 10 6.277403 12 40 45 48 50 65 75 90 96 97 0
3380: 2009 10 64.189162 12 40 45 48 50 65 75 90 96 97 7
which is correct for the first rows, but not for the last (or other) rows.
For example, for the row 3380, the price ~64.19 should be in the 5th interval and not the 7th. I guess my mistake is that by my last command, finding Intervals relies only on the first row of intervalsstring.
Thank you!
Your main problem is that you just didn't do findInterval for each group. But I also don't see the point of making that large merged data.table, or the paste/strsplit business. This is what I would do:
DT[, interval := findInterval(price,
intervals.dt[.BY][, V1:V10]),
by = .(year, month)][]
# year month price interval
# 1: 2000 1 30.776611 2
# 2: 2000 1 62.499648 6
# 3: 2000 1 53.581115 6
# 4: 2000 1 48.830599 5
# 5: 2000 1 33.066053 2
# ---
#3376: 2009 10 33.635924 1
#3377: 2009 10 38.993769 1
#3378: 2009 10 75.065820 7
#3379: 2009 10 6.277403 0
#3380: 2009 10 64.189162 5
Note that intervals.dt[.BY] is a keyed subset.

Resources