How to get the distances of one observation to all the others from a dist object? - R

Suppose I have a data.frame a (each observation is in a row) and have calculated the distance matrix. The question is: is there any function that will give the distances of observation 5 to all the other observations?
> a=data.frame(A=rnorm(10), B=rnorm(10), C=rnorm(10))
> b=dist(a)
> b
           1         2         3         4         5         6         7         8         9
2  1.6118634
3  0.4891468 1.3382692
4  1.2002947 1.7516061 0.9160975
5  1.8128570 0.3197837 1.5192406 1.7709168
6  0.7280433 1.2628696 0.4063128 1.2411639 1.4971098
7  1.7616767 0.7400666 1.4512844 1.4355922 0.5168996 1.5524980
8  3.1033274 3.3739578 2.7297046 2.2281075 3.3693333 2.7738859 3.1216145
9  2.0916857 1.6749526 1.6717408 2.0293415 1.8196557 1.3704288 1.9824870 2.4013682
10 1.5949320 1.7309838 1.1680365 0.6331770 1.7255615 1.3234977 1.4333926 1.7798153 1.6126823

Just convert it to a matrix:
as.matrix(b)[5,]
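For example, to drop the zero self-distance and find observation 5's nearest neighbour (a small usage sketch building on b from above):
d5 <- as.matrix(b)[5, ] # distances from observation 5 to all 10 observations
d5 <- d5[-5]            # drop observation 5's zero distance to itself
which.min(d5)           # nearest neighbour of observation 5
With the distances printed above, that would be observation 2, at distance 0.3197837.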

Check out the pdist package. It returns a single row of what as.matrix(dist(x)) would return, so you don't have to compute the full distance matrix.
http://cran.r-project.org/web/packages/pdist/index.html
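A sketch of how that might look (written from memory of the package, so treat the exact interface as an assumption and check its docs; pdist computes Euclidean distances between the rows of two matrices):
library(pdist)
# distances from observation 5 to every row of a, without the full n-by-n matrix
d <- pdist(as.matrix(a[5, ]), as.matrix(a))
as.matrix(d) # a 1 x 10 matrix of distances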

Related

creating a dataframe of means of 5 randomly sampled observations

I'm currently reading "Practical Statistics for Data Scientists" and following along in R as the authors demonstrate some code. There is one chunk of code whose logic I'm particularly struggling to follow, and I was hoping someone could help. The code in question creates a dataframe with 1000 rows, where each observation is the mean of 5 randomly drawn income values from the dataframe loans_income. However, I'm getting confused by the logic of the code, as it is fairly complicated, with a tapply() function and nested rep() statements.
The code to create the dataframe in question is as follows:
samp_mean_5 <- data.frame(income = tapply(sample(loans_income$income, 1000*5),
                                          rep(1:1000, rep(5, 1000)),
                                          FUN = mean),
                          type = 'mean_of_5')
In particular, I'm confused about the nested rep() statements and the 1000*5 portion of the sample() function. Any help understanding the logic of the code would be greatly appreciated!
For reference, the original dataset loans_income simply has a single column of 50,000 income values.
You have 50,000 incomes in a single vector. Let's break your code down:
tapply(sample(loans_income$income, 1000*5),
       rep(1:1000, rep(5, 1000)),
       FUN = mean)
I will replace 1000 with 10 and the incomes with random numbers, so it's easier to explain. I also call set.seed(1) so the result can be reproduced.
sample(loans_income$income,1000*5)
We draw 50 random incomes from your vector without replacement. They are (temporarily) put into a vector of length 50, so the output looks like this:
> sample(runif(50000),10*5)
[1] 0.73283101 0.60329970 0.29871173 0.12637654 0.48434952 0.01058067 0.32337850
[8] 0.46873561 0.72334215 0.88515494 0.44036341 0.81386225 0.38118213 0.80978822
[15] 0.38291273 0.79795343 0.23622492 0.21318431 0.59325586 0.78340477 0.25623138
[22] 0.64621658 0.80041393 0.68511759 0.21880083 0.77455662 0.05307712 0.60320912
[29] 0.13191926 0.20816298 0.71600799 0.70328349 0.44408218 0.32696205 0.67845445
[36] 0.64438336 0.13241312 0.86589561 0.01109727 0.52627095 0.39207860 0.54643661
[43] 0.57137320 0.52743012 0.96631114 0.47151170 0.84099503 0.16511902 0.07546454
[50] 0.85970500
rep(1:1000,rep(5,1000))
Now we are creating an indexing vector of length 50:
> rep(1:10,rep(5,10))
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4 5 5 5 5 5 6 6 6
[29] 6 6 7 7 7 7 7 8 8 8 8 8 9 9 9 9 9 10 10 10 10 10
Those indices "group" the samples from step 1. So basically this vector tells R that the first 5 entries of your "sample vector" belong together (index 1), the next 5 entries belong together (index 2) and so on.
FUN = mean
Just apply the mean function to the data.
tapply
So tapply takes the sampled data (the sample() part), groups it by the second argument (the rep() part), and applies the mean function to each group.
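As a minimal illustration of that pattern (values grouped by a second vector of the same length, then averaged per group):
tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), mean)
#   a   b
# 1.5 3.5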
If you are familiar with data.frames and the dplyr package, take a look at this (only the first 10 rows are displayed):
set.seed(1)
df <- data.frame(income = sample(runif(5000), 10*5),
                 index = rep(1:10, rep(5, 10)))
head(df, 10)
income index
1 0.42585569 1
2 0.16931091 1
3 0.48127444 1
4 0.68357403 1
5 0.99374923 1
6 0.53227877 2
7 0.07109499 2
8 0.20754511 2
9 0.35839481 2
10 0.95615917 2
I attached an index to the random numbers (your income). Now we calculate the mean per group:
library(dplyr)
df %>%
  group_by(index) %>%
  summarise(mean = mean(income))
which gives us
# A tibble: 10 x 2
index mean
<int> <dbl>
1 1 0.551
2 2 0.425
3 3 0.827
4 4 0.391
5 5 0.590
6 6 0.373
7 7 0.514
8 8 0.451
9 9 0.566
10 10 0.435
Compare it to
set.seed(1)
tapply(sample(runif(5000), 10*5),
       rep(1:10, rep(5, 10)),
       mean)
which yields basically the same result:
1 2 3 4 5 6 7 8 9
0.5507529 0.4250946 0.8273149 0.3905850 0.5902823 0.3730092 0.5143829 0.4512932 0.5658460
10
0.4352546
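For what it's worth, the same resampling logic can be written more directly with replicate(). This is a sketch rather than the book's code, with made-up incomes standing in for loans_income, and it draws a fresh 5 values per iteration instead of 5000 values up front, so the sampling scheme differs slightly:
# stand-in for the book's data: 50,000 hypothetical income values
loans_income <- data.frame(income = runif(50000, 20000, 120000))
set.seed(1)
samp_mean_5 <- data.frame(income = replicate(1000, mean(sample(loans_income$income, 5))),
                          type = 'mean_of_5')
head(samp_mean_5)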

Frequency distribution using binCounts

I have a dataset of customer ages and I want to make a frequency distribution with age bins 9 years wide.
Ages=c(83,51,66,61,82,65,54,56,92,60,65,87,68,64,51,
70,75,66,74,68,44,55,78,69,98,67,82,77,79,62,38,88,76,99,
84,47,60,42,66,74,91,71,83,80,68,65,51,56,73,55)
My desired outcome would be similar to the table shared below; the variable names can differ (as you wish).
Could I use binCounts for this? If yes, could you help me out with the code? I'm not sure about the bx and idxs arguments:
binCounts(x, idxs = NULL, bx, right = FALSE) ??
Age Count
38-46 3
47-55 7
56-64 7
65-73 14
74-82 10
83-91 6
92-100 3
Much Appreciated!
I don't know binCounts or even the package it is in, but here is a plain base R approach:
data.frame(table(cut(Ages,0:7*9+37)))
Var1 Freq
1 (37,46] 3
2 (46,55] 7
3 (55,64] 7
4 (64,73] 14
5 (73,82] 10
6 (82,91] 6
7 (91,100] 3
To exactly duplicate your results:
lowerlimit <- c(37,46,55,64,73,82,91,100)
Labels <- paste(head(lowerlimit,-1)+1, lowerlimit[-1], sep="-") # add one to get 38-46, 47-55, etc.
group <- cut(Ages, lowerlimit, labels=Labels) # determine which group each age belongs to
tab <- table(group)                           # form a frequency table
as.data.frame(tab)                            # transform the table into a dataframe
group Freq
1 38-46 3
2 47-55 7
3 56-64 7
4 65-73 14
5 74-82 10
6 83-91 6
7 92-100 3
All this can be combined as:
data.frame(table(cut(Ages,s<-0:7*9+37,paste(head(s+1,-1),s[-1],sep="-"))))
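Since the question asks about binCounts (from the matrixStats package) specifically, an equivalent call might look like this. This is a sketch based on the signature quoted in the question: bx takes the bin boundaries, and right = TRUE closes bins on the right so they match cut()'s (37,46], (46,55], ... above:
library(matrixStats)
bx <- 0:7*9 + 37 # 37 46 55 64 73 82 91 100
data.frame(Age = paste(head(bx, -1) + 1, bx[-1], sep = "-"),
           Count = binCounts(Ages, bx = bx, right = TRUE))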

An easier way to get average of a table with some conditions in R

I am trying to get the average of all 6 quizzes for each male student.
Here is part of the code that I've tried:
a<-subset(mydf,Sex=="M")
b<-a[4:9]
b
sum(b[1:6])
My logic is to get a table that contains only the male students with each of their 6 quizzes, then sum the table and divide by the number of male students. But I think there should be an easier way to do this.
Sample data:
df <- data.frame(Section=rep('A',9),
                 Degree=c(rep('MBA',4),'MS','MBA','MBA','MS','MBA'),
                 Sex=c(rep('M',5),'F','M','M','F'),
                 Quiz1=c(0,10,2,2,8,6,6,2,3),
                 Quiz2=c(0,1,4,4,1,5,0,3,9),
                 Quiz3=c(6,5,6,6,4,2,7,9,3),
                 Quiz4=c(5,4,5,5,10,5,7,7,3),
                 Quiz5=c(7,3,6,3,10,7,6,10,5),
                 Quiz6=c(3,8,6,6,5,8,10,10,5))
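For reference, the sum-then-divide logic from the question can be completed directly (a sketch using the sample data above; the question's mydf corresponds to df here):
a <- df[df$Sex == "M", ]     # keep only the male students
b <- a[4:9]                  # their six quiz columns
sum(b) / (nrow(b) * ncol(b)) # grand mean over all quiz scores: 5.285714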
How about this:
data.frame(df[which(df$Sex=='M'),],
           QuizMeans=rowMeans(df[which(df$Sex=='M'), 4:9]))
Note: 4:9 in the code above selects the quiz columns Quiz1 through Quiz6, so rowMeans() gives the average quiz score for each individual.
Output:
Section Degree Sex Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz6 QuizMeans
1 A MBA M 0 0 6 5 7 3 3.500000
2 A MBA M 10 1 5 4 3 8 5.166667
3 A MBA M 2 4 6 5 6 6 4.833333
4 A MBA M 2 4 6 5 3 6 4.333333
5 A MS M 8 1 4 10 10 5 6.333333
7 A MBA M 6 0 7 7 6 10 6.000000
8 A MS M 2 3 9 7 10 10 6.833333
Then if you wanted to take the mean of their means (i.e. the grand mean), you could store the above as something like "df", then use mean() to calculate the mean of the column QuizMeans, like this:
df <- data.frame(df[which(df$Sex=='M'),],QuizMeans=rowMeans(df[which(df$Sex=='M'),c(4:9)]))
mean(df$QuizMeans)
[1] 5.285714
If there are missing values in your data, you'll need to add na.rm=TRUE to either the mean() or rowMeans() function, like this:
mean(df$QuizMeans, na.rm=TRUE)
[1] 5.285714
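A tiny demonstration of why na.rm matters:
x <- c(1, 2, NA)
mean(x)               # NA, because NA propagates
mean(x, na.rm = TRUE) # 1.5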
You could use the following without specifying column positions:
ans <- sum(df[df$Sex=="M", grepl("Quiz", names(df))]) / sum(df$Sex=="M")
# 31.71429
If you know the column positions:
ans <- sum(df[df$Sex=="M", 4:9]) / sum(df$Sex=="M")
# 31.71429
Note this is the average total score over all six quizzes per male student; divide by 6 for the per-quiz mean (31.71429/6 = 5.285714, matching the rowMeans answer above).
A dplyr alternative (note that the sample data codes sex as "M", not "Male", and that all six quiz columns need to be averaged, not just Quiz6):
library(dplyr)
df %>%
  filter(Sex == "M") %>%
  summarise(avg_quizzes = mean(c(Quiz1, Quiz2, Quiz3, Quiz4, Quiz5, Quiz6)))
# 5.285714

Subset columns based on row value

This may be a simple question, but I haven't been able to find an answer. Suppose you have a dataframe with n columns of molecular features, where the last row of each column holds a coefficient of variation (CV).
Example data set:
a <- data.frame(matrix(runif(30),ncol=3))
b <- c(50.23,45.23,21)
a<-rbind(a,b)
X1 X2 X3
1 0.1097075 0.78584027 0.20925033
2 0.6081752 0.39669748 0.65559913
3 0.9912855 0.68462073 0.54741795
4 0.8543848 0.53776889 0.43789447
5 0.2579654 0.92188090 0.61292895
6 0.6203840 0.73152279 0.82866311
7 0.6643195 0.84953926 0.62192976
8 0.5760624 0.30949900 0.11032929
9 0.8888167 0.04530598 0.08089825
10 0.8926815 0.61736284 0.19834310
11 50.2300000 45.23000000 21.00000000
How do I subset so I only get the columns with CV>50 in the last row? So my new data.frame would be:
X1
1 0.1097075
2 0.6081752
3 0.9912855
4 0.8543848
5 0.2579654
6 0.6203840
7 0.6643195
8 0.5760624
9 0.8888167
10 0.8926815
11 50.230000
We can do:
a[, a[nrow(a), ] > 50, drop = FALSE]
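To unpack that one-liner: a[nrow(a), ] is the CV row, and comparing it to 50 gives one logical per column. Spelled out:
keep <- a[nrow(a), ] > 50 # logical: which columns have CV > 50 in the last row
a[, keep, drop = FALSE]   # drop = FALSE keeps a single matching column as a data.frame
Without drop = FALSE, a result with exactly one matching column would be silently simplified to a plain vector.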

Efficient method of obtaining successive high values of data.frame column

Let's say I have the following data.frame in R:
df <- data.frame(order=1:10, value=c(1,7,3,5,9,2,9,10,2,3))
Other than looping through the data and testing whether each value exceeds the previous high value, how can I get the successive high values, so that I end up with a table like this:
order value
1 1
2 7
5 9
8 10
TIA
Here's one option, if I understood the question correctly:
df[df$value > cummax(c(-Inf, head(df$value, -1))),]
# order value
#1 1 1
#2 2 7
#5 5 9
#8 8 10
I use cummax to keep a running maximum of column "value" and compare each "value" entry to the previous row's cummax. To make sure the first entry is also selected, I start with -Inf.
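To see what that comparison looks like step by step with the df above:
cummax(c(-Inf, head(df$value, -1)))
# -Inf    1    7    7    7    9    9    9   10   10
df$value > cummax(c(-Inf, head(df$value, -1)))
# TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE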
"get successive high values (of value?)" is unclear.
It seems you want to keep only the rows whose value is higher than the previous maximum.
First, we reorder your df in increasing order of value... (not entirely clear, but I think that's what you wanted).
Then we use logical indexing with diff() > 0 to keep only strictly increasing rows:
rdf <- df[order(df$value), ]
rdf[c(diff(rdf$value) > 0, TRUE), ] # keep a row when the next value is strictly larger; TRUE keeps the last row
order value
1 1 1
9 9 2
10 10 3
4 4 5
2 2 7
7 7 9
8 8 10
