Is there a way I can calculate the area of an ECDF? - R

I have this kind of data set
B_dong A_dong count mean count2
14 1118053 1118053 15 129.7333 15
16 1118053 1118055 18 171.8333 33
19 1118053 1118060 4 270.7500 37
13 1118053 1118052 13 374.8462 50
17 1118053 1118057 21 389.9524 71
20 1118053 1118061 9 418.0000 80
15 1118053 1118054 10 435.1000 90
1 1118053 1102054 2 3066.0000 237
44 1118053 1122060 2 3080.0000 239
2 1118053 1102059 1 4867.0000 240
test2 <- structure(list(B_dong = c(1118053L, 1118053L, 1118053L, 1118053L,1118053L, 1118053L, 1118053L, 1118053L, 1118053L, 1118053L),A_dong = c(1118053L, 1118055L, 1118060L, 1118052L, 1118057L,1118061L, 1118054L, 1102054L, 1122060L, 1102059L), count = c(15L,18L, 4L, 13L, 21L, 9L, 10L, 2L, 2L, 1L), mean = c(129.733333333333,171.833333333333, 270.75, 374.846153846154, 389.952380952381,418, 435.1, 3066, 3080, 4867), count2 = c(15L, 33L, 37L,50L, 71L, 80L, 90L, 237L, 239L, 240L)), row.names = c(14L,16L, 19L, 13L, 17L, 20L, 15L, 1L, 44L, 2L), class = "data.frame")
And I plotted an ECDF with this data set using
plot(ecdf(test2$mean), ylab = "Fn(x)", verticals = FALSE,
     col.01line = "gray70", pch = 19)
Is there a way I can calculate the area (integral) of the ECDF graph in R?

Sure. If you look at the plot, it's effectively a series of rectangles (red lines added):
segments(sort(unique(test2$mean)), 0,
         sort(unique(test2$mean)), ecdf(test2$mean)(sort(unique(test2$mean))),
         col = "red")
The area under each "step" is just the width times the height of each of those rectangles.
The widths are just the differences between each of the unique values (sorted):
diff(sort(unique(test2$mean)))
# [1] 42.10000 98.91667 104.09615 15.10623 28.04762 17.10000 2630.90000 14.00000 1787.00000
The heights are found empirically from the ecdf call's return-value function on these values:
ecdf(test2$mean)(sort(unique(test2$mean)))
# [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
Since there is one fewer width (the right-most is always a single point ... and we all know that the area under a single point is zero), we can discard the last height.
So we just multiply the heights by the widths and sum them up:
sum(
  head(ecdf(test2$mean)(sort(unique(test2$mean))), n = -1) *
    diff(sort(unique(test2$mean)))
)
# [1] 3546.678
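If you need this more than once, the same steps can be wrapped in a small helper. A minimal sketch (the name ecdf_area is just illustrative, not part of the original answer):
ecdf_area <- function(x) {
  xs <- sort(unique(x))                   # jump locations of the ECDF
  heights <- ecdf(x)(xs)                  # ECDF value at each jump
  sum(head(heights, n = -1) * diff(xs))   # rectangle heights times widths
}
ecdf_area(test2$mean)
# [1] 3546.678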

Related

How to do a barplot from a data frame

I have this data frame named "dt" (which contains 100 individuals, but I cut it down to 8) and have to do a barplot of the salary counts in 5 intervals (so let's say by = 20,000)
ID Salary Bonus
--------------------------
1: 1 100000 57
2: 2 86000 66
3: 3 17000 12
4: 6 50500 37
5: 9 35000 26
6: 10 45000 21
7: 11 95000 54
8: 12 100000 58
How do I do it with barplot()? This is what I tried to do:
barplot(cut(dt$Salary,c(seq(1,max(dt$Salary),by=20000))))
but it says that "height" needs to be a vector or a matrix.
We can cut the 'Salary' column with breaks specified as 5, get the frequency counts with table, and plot with barplot:
barplot(table(cut(df1$Salary, breaks = 5)))
Or specify the min and max in seq to get break points every 20000:
barplot(table(cut(df1$Salary,
                  breaks = seq(min(df1$Salary) - 100, max(df1$Salary) + 100, by = 20000))))
data
df1 <- structure(list(ID = c(1L, 2L, 3L, 6L, 9L, 10L, 11L, 12L), Salary = c(100000L,
86000L, 17000L, 50500L, 35000L, 45000L, 95000L, 100000L), Bonus = c(57L,
66L, 12L, 37L, 26L, 21L, 54L, 58L)), class = "data.frame", row.names = c("1:",
"2:", "3:", "4:", "5:", "6:", "7:", "8:"))

Is there a function to turn the values of this table into the percentage of each column?

I have a table like this:
JUV SUB AD
A 137 19 46
B 0 0 46
C 16 10 1
F 23 2 9
IN 27 9 29
IS 3 3 25
M 8 0 27
R 1 0 66
S 145 47 189
T 17 6 9
I would like to take a chi-square test in order to find a correlation among the groups of the three columns, but in order to do that I need the values converted to percentages within each of the three columns.
prop.table doesn't seem to work because it gives percentages of the grand total, not percentages within the three groups.
There are several issues here. Chi-square measures association, not correlation, and can be computed only on frequency (count) data, not percentages or proportions. The chisq.test() function will handle all of the computations for you. Also, the prop.table function does exactly what you want (see the manual page ?prop.table, especially the description of the margin= argument). When posting, you should use dput() to provide your data. That preserves important details of your data - in this case, whether you have a data frame or a table. Assuming a data frame:
tbl <- structure(list(JUV = c(137L, 0L, 16L, 23L, 27L, 3L, 8L, 1L, 145L,
17L), SUB = c(19L, 0L, 10L, 2L, 9L, 3L, 0L, 0L, 47L, 6L), AD = c(46L,
46L, 1L, 9L, 29L, 25L, 27L, 66L, 189L, 9L)), class = "data.frame", row.names = c("A",
"B", "C", "F", "IN", "IS", "M", "R", "S", "T"))
To get percentages by column, use
pct <- prop.table(as.matrix(tbl), 2) * 100
print(addmargins(pct), digits=3)
# JUV SUB AD Sum
# A 36.340 19.79 10.291 66.42
# B 0.000 0.00 10.291 10.29
# C 4.244 10.42 0.224 14.88
# F 6.101 2.08 2.013 10.20
# IN 7.162 9.38 6.488 23.02
# IS 0.796 3.12 5.593 9.51
# M 2.122 0.00 6.040 8.16
# R 0.265 0.00 14.765 15.03
# S 38.462 48.96 42.282 129.70
# T 4.509 6.25 2.013 12.77
# Sum 100.000 100.00 100.000 300.00
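For comparison, margin = 1 would give percentages within each row instead of each column; a quick sketch (not part of the original answer):
prop.table(as.matrix(tbl), 1) * 100   # each row now sums to 100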
For a chi square test:
chisq.test(tbl)
#
# Pearson's Chi-squared test
#
# data: tbl
# X-squared = 256.14, df = 18, p-value < 2.2e-16
#
# Warning message:
# In chisq.test(tbl) : Chi-squared approximation may be incorrect
The warning indicates that some of the expected values are too small.
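To see which cells trigger that warning, you can inspect the expected counts stored in the test object; a short sketch (not in the original answer):
res <- chisq.test(tbl)
res$expected   # expected counts under independence; the warning fires when some are small (commonly < 5)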

Loops with random sampling from a matrix and distance calculation

I have a list of nodes, and I need to randomly assign 'n' clients to 'p' hubs.
I have the following data, where the first row shows:
The total number of nodes.
The requested number of hubs.
The total supply capacity for each hub.
The following lines show:
The first column is the node number.
The second column is the "x" coordinate.
The third is the "y" coordinate.
Below I will show the raw data; with colnames() added it would look something like this:
total_nodes hubs_required total_capacity
50 5 120
node number x_coordinate y_coordinate node_demand
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
. . . .
. . . .
. . . .
50 1 58 2
The x and y values are provided so we can calculate the Euclidean distance.
nodes:
50 5 120
1 2 62 3
2 80 25 14
3 36 88 1
4 57 23 14
5 33 17 19
6 76 43 2
7 77 85 14
8 94 6 6
9 89 11 7
10 59 72 6
11 39 82 10
12 87 24 18
13 44 76 3
14 2 83 6
15 19 43 20
16 5 27 4
17 58 72 14
18 14 50 11
19 43 18 19
20 87 7 15
21 11 56 15
22 31 16 4
23 51 94 13
24 55 13 13
25 84 57 5
26 12 2 16
27 53 33 3
28 53 10 7
29 33 32 14
30 69 67 17
31 43 5 3
32 10 75 3
33 8 26 12
34 3 1 14
35 96 22 20
36 6 48 13
37 59 22 10
38 66 69 9
39 22 50 6
40 75 21 18
41 4 81 7
42 41 97 20
43 92 34 9
44 12 64 1
45 60 84 8
46 35 100 5
47 38 2 1
48 9 9 7
49 54 59 9
50 1 58 2
I extracted the information from the first line.
nodes <- as.matrix(read.table(data))
header <- colnames(nodes)
clean_header <- gsub('X', '', header)
requested_hubs <- as.numeric(clean_header[2])
max_supply_capacity <- as.numeric(clean_header[3])
I need to randomly select 5 nodes that will act as hubs:
set.seed(37)
node_to_hub <- nodes[sample(nrow(nodes), requested_hubs, replace = FALSE),]
Then I need to randomly assign nodes to each hub, calculate the distances between the hub and each of its nodes, and, when max_supply_capacity (120) is exceeded, move on to the next hub and repeat the process.
After the final iteration I need to return the cumulative sum of distances for all the hubs.
I need to repeat this process 100 times and return the min() value of the cumulative sum of distances.
This is where I'm completely stuck, since I'm not sure how to loop through a matrix, let alone select elements from it randomly.
I have the following elements:
capacity <- c(numeric()) # needs to stay <= 120
distance_sum <- c(numeric())
global_hub_distance <- c(numeric())
The formula for the Euclidean distance (rounded) would be as below, but I'm not sure how I can reflect the random selection when assigning nodes.
distance <-round(sqrt(((node_to_hub[i,2]-nodes[i,2]))^2+(node_to_hub[random,3]-nodes[random,3])^2))
The idea for the loop I think I need is below, but as I mentioned before, I don't know how to deal with the random client selection and the distance calculation for those clients.
for(i in 1:100){
  node_to_hub
  for(i in 1:nrow(node_to_hub)){
    # Should I randomly sample the clients here???
    while(capacity < 120){
      node_demand <- nodes[**random**, 3]
      distance <- round(sqrt((node_to_hub[i, 2] - nodes[i, 2])^2 + (node_to_hub[**random**, 3] - nodes[**random**, 3])^2))
      capacity <- c(capacity, node_demand)
      distance_sum <- c(distance_sum, distance)
    }
    global_hub_distance <- c(global_hub_distance, distance_sum)
    capacity <- 0
    distance_sum <- 0
  }
  min(global_hub_distance)
}
Not EXACTLY sure what you are looking for, but this code may be able to help you. It's not extremely fast: instead of using a while loop to stop after hitting your total_capacity, it just does a cumsum on the full node list and finds the place where you exceed 120.
nodes <- structure(list(node_number = 1:50,
x = c(2L, 80L, 36L, 57L, 33L, 76L, 77L, 94L,
89L, 59L, 39L, 87L, 44L, 2L, 19L, 5L,
58L, 14L, 43L, 87L, 11L, 31L, 51L, 55L,
84L, 12L, 53L, 53L, 33L, 69L, 43L, 10L,
8L, 3L, 96L, 6L, 59L, 66L, 22L, 75L, 4L,
41L, 92L, 12L, 60L, 35L, 38L, 9L, 54L, 1L),
y = c(62L, 25L, 88L, 23L, 17L, 43L, 85L, 6L, 11L,
72L, 82L, 24L, 76L, 83L, 43L, 27L, 72L, 50L,
18L, 7L, 56L, 16L, 94L, 13L, 57L, 2L, 33L, 10L,
32L, 67L, 5L, 75L, 26L, 1L, 22L, 48L, 22L, 69L,
50L, 21L, 81L, 97L, 34L, 64L, 84L, 100L, 2L, 9L, 59L, 58L),
node_demand = c(3L, 14L, 1L, 14L, 19L, 2L, 14L, 6L,
7L, 6L, 10L, 18L, 3L, 6L, 20L, 4L,
14L, 11L, 19L, 15L, 15L, 4L, 13L,
13L, 5L, 16L, 3L, 7L, 14L, 17L,
3L, 3L, 12L, 14L, 20L, 13L, 10L,
9L, 6L, 18L, 7L, 20L, 9L, 1L, 8L,
5L, 1L, 7L, 9L, 2L)),
.Names = c("node_number", "x", "y", "node_demand"),
class = "data.frame", row.names = c(NA, -50L))
total_nodes = nrow(nodes)
hubs_required = 5
total_capacity = 120
iterations <- 100
track_sums <- matrix(NA, nrow = iterations, ncol = hubs_required)
colnames(track_sums) <- paste0("demand_at_hub",1:hubs_required)
And then I prefer using a function for distance; in this case A and B are two separate vectors, each of the form c(x, y).
euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2))) # distances
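For instance, a quick sanity check of the helper (not part of the original answer):
euc.dist(c(0, 0), c(3, 4))
# [1] 5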
The Loop:
for(i in 1:iterations){
  # random hub selection
  hubs <- nodes[sample(1:total_nodes, hubs_required, replace = FALSE),]
  for(h in 1:hubs_required){
    # sample the nodes into a random order
    random_nodes <- nodes[sample(1:nrow(nodes), size = nrow(nodes), replace = FALSE),]
    # cumulative sum their demand, find the first position that passes 120,
    # and subtract 1 to get the node before that
    last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
    # get the sum of all distances to those nodes (1 through the last)
    all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
      euc.dist(A = hubs[h, c("x","y")],
               B = rn[c("x","y")])
    })
    track_sums[i,h] <- sum(all_distances)
  }
}
min(rowSums(track_sums))
EDIT
As a function:
hubnode <- function(nodes, hubs_required = 5, total_capacity = 120, iterations = 10){
  # initialize results matrices
  track_sums <- node_count <- matrix(NA, nrow = iterations, ncol = hubs_required)
  colnames(track_sums) <- paste0("demand_at_hub", 1:hubs_required)
  colnames(node_count) <- paste0("nodes_at_hub", 1:hubs_required)
  # user-defined distance function (only exists within the hubnode() function)
  euc.dist <- function(A, B) round(sqrt(sum((A - B) ^ 2)))
  for(i in 1:iterations){
    # random hub selection
    assigned_hubs <- sample(1:nrow(nodes), hubs_required, replace = FALSE)
    hubs <- nodes[assigned_hubs,]
    assigned_nodes <- NULL
    for(h in 1:hubs_required){
      # sample the non-hub nodes into a random order
      assigned_nodes <- sample((1:nrow(nodes))[-assigned_hubs], replace = FALSE)
      random_nodes <- nodes[assigned_nodes,]
      # cumulative sum their demand, find the first position that passes 120,
      # and subtract 1 to get the node before that
      last <- which(cumsum(random_nodes$node_demand) > total_capacity)[1] - 1
      # if capacity is never exceeded, use all of the nodes
      if(is.na(last)) last <- nrow(random_nodes)
      node_count[i,h] <- last
      # get the sum of all distances to those nodes (1 through the last)
      all_distances <- apply(random_nodes[1:last,], 1, function(rn) {
        euc.dist(A = hubs[h, c("x","y")],
                 B = rn[c("x","y")])
      })
      track_sums[i,h] <- sum(all_distances)
    }
  }
  return(list(track_sums = track_sums, node_count = node_count))
}
output <- hubnode(nodes, iterations = 100)
node_count <- output$node_count
track_sums <- output$track_sums
plot(rowSums(node_count), rowSums(track_sums),
     xlab = "Node Count", ylab = "Total Demand",
     main = paste("Result of", 100, "iterations"))
min(rowSums(track_sums))
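If you also want to know which random assignment achieved that minimum (not part of the original answer), which.min() gives the row index of the best iteration:
best <- which.min(rowSums(track_sums))  # row index of the best iteration
track_sums[best, ]                      # per-hub distance sums for that iteration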

Converting a list of vectors and numbers (from replicate) into a data frame

After running the replicate() function [a close relative of lapply()] on some data, I ended up with an output that looks like this:
myList <- structure(list(c(55L, 13L, 61L, 38L, 24L), 6.6435972422341, c(37L, 1L, 57L, 8L, 40L), 5.68336098665417, c(19L, 10L, 23L, 52L, 60L ),
5.80430476680636, c(39L, 47L, 60L, 14L, 3L), 6.67554407822367,
c(57L, 8L, 53L, 6L, 2L), 5.67149520387856, c(40L, 8L, 21L,
17L, 13L), 5.88446015238962, c(52L, 21L, 22L, 55L, 54L),
6.01685181395007, c(12L, 7L, 1L, 2L, 14L), 6.66299948053721,
c(41L, 46L, 21L, 30L, 6L), 6.67239635545512, c(46L, 31L,
11L, 44L, 32L), 6.44174324641076), .Dim = c(2L, 10L), .Dimnames = list(
c("reps", "score"), NULL))
In this case the vectors of integers are indexes that went into a function that I won't get into, and the scalar floats are scores.
I'd like a data frame that looks like:
Index 1 Index 2 Index 3 Index 4 Index 5 Score
55 13 61 38 24 6.64
37 1 57 8 40 5.68
19 10 23 52 60 5.80
and so on.
Alternatively, a matrix of the indexes and an array of the values would be fine too.
Things that haven't worked for me.
data.frame(t(random.out)) # just gives a data frame with a column of vectors and another of scalars
cbind(t(random.out)) # same as above
do.call(rbind, random.out) # intersperses vectors and scalars
I realize other people have similar problems,
eg. Convert list of vectors to data frame
but I can't quite find an example with this particular kind of vectors and scalars together.
myList[1,] is a list of vectors, so you can combine them into a matrix with do.call and rbind. myList[2,] is a list of single scores, so you can combine them into a vector with unlist:
cbind(as.data.frame(do.call(rbind, myList[1,])), Score=unlist(myList[2,]))
# V1 V2 V3 V4 V5 Score
# 1 55 13 61 38 24 6.643597
# 2 37 1 57 8 40 5.683361
# 3 19 10 23 52 60 5.804305
# 4 39 47 60 14 3 6.675544
# 5 57 8 53 6 2 5.671495
# 6 40 8 21 17 13 5.884460
# 7 52 21 22 55 54 6.016852
# 8 12 7 1 2 14 6.662999
# 9 41 46 21 30 6 6.672396
# 10 46 31 11 44 32 6.441743
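If you also want the column names from your example ("Index 1" ... "Index 5"), you can rename them afterwards; a small sketch, assuming the result above is stored in res (the name res is just illustrative):
res <- cbind(as.data.frame(do.call(rbind, myList[1,])), Score = unlist(myList[2,]))
names(res)[1:5] <- paste("Index", 1:5)   # replace the default V1..V5 names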

Ordering the x-axis in an R graph

I have a data.frame that looks like:
gvs order labels
1 -2.3321916 1 Adygei
2 -1.4996229 5 Basque
3 1.7958170 15 French
4 2.5543214 19 Italian
5 -2.7758460 33 Orcadian
6 -1.9659984 39 Russian
7 2.1239768 41 Sardinian
8 -1.8515908 47 Tuscan
9 -1.5597359 6 Bedouin
10 -1.2534511 14 Druze
11 -0.1625003 31 Mozabite
12 -1.0265275 35 Palestinian
13 -0.8519079 2 Balochi
14 -2.4279528 8 Brahui
15 -3.1717421 9 Burusho
16 -0.9258497 17 Hazara
17 -1.2207974 21 Kalash
18 -1.0325107 24 Makrani
19 -3.2102686 37 Pathan
20 -0.9377928 43 Sindhi
21 -1.7657017 48 Uygurf
22 -0.5058627 10 Cambodian
23 -0.7819299 12 Dai
24 -1.4095947 13 Daur
25 2.2810477 16 Han
26 -0.9007551 18 Hezhen
27 2.6614486 20 Japanese
28 -0.9441980 23 Lahu
29 -0.7237586 29 Miao
30 -0.9452944 30 Mongola
31 -1.2035258 32 Naxi
32 -0.7703779 34 Oroqen
33 -3.0895998 42 She
34 -0.7037952 45 Tu
35 -1.9311354 46 Tujia
36 -0.5423822 49 Xibo
37 -1.6244801 50 Yakut
38 -0.9049735 51 Yi
39 -2.6491331 11 Colombian
40 2.3706977 22 Karitiana
41 -2.7590587 26 Maya
42 -0.9614190 38 Pima
43 -1.6961014 44 Surui
44 -0.8449225 28 Melanesian
45 -1.1163019 36 Papuan
46 -0.9298674 3 BantuKenya
47 -2.8859587 4 BantuSouthAfrica
48 -1.4494841 7 BiakaPygmy
49 -0.7381369 25 Mandenka
50 -0.5644325 27 MbutiPygmy
51 -0.9195156 40 San
52 2.0949378 52 Yoruba
I would like to graph the column gvs along the x-axis in the order given by the column order, and have the label for each point along the x-axis come from the column labels. Does anyone know how this is done? I want the graph to look like a less colorful version of the graphs in Figure 5 of this paper: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004412
Based on your comments, it looks like (1) labels doesn't correspond to gvs and order, and (2) if I sort the first two columns by order, the data frame will be ordered properly. Please let me know if this is not correct.
Sort first two columns by order, leaving third column alone:
df[,c("gvs","order")] = df[order(df$order), c("gvs","order")]
Set the ordering of labels based on the current ordering of labels in the sample data frame:
df$labels = factor(df$labels, levels=df$labels)
Add a grouping variable for region. I did this by creating a new group each time the alphabetic ordering of labels went "backwards". The regions are just numbers here, but you can give them descriptive names if you want to use them:
df$group = c(0, cumsum(diff(match(substr(df$labels,1,1), LETTERS)) < 0))
Add fake p-values (since point size was based on p-value in the graph you linked to):
set.seed(595)
df$p.value = runif(nrow(df), 0, 0.5)
Plot the data, including a different color for each regional group, point size based on p-value, and black borders around points with p < 0.05. geom_line adds the regional means:
library(dplyr)
library(ggplot2)
ggplot(df, aes(labels, gvs, size=p.value, fill=factor(group))) +
  geom_line(data=df %>% group_by(group) %>% mutate(gvs=mean(gvs)),
            aes(group=group, colour=factor(group)), size=0.8, alpha=0.5) +
  geom_point(pch=21, stroke=1, aes(color=p.value<0.05)) +
  theme_bw() +
  theme(axis.text.x=element_text(angle=-90, hjust=0, vjust=0.5),
        panel.grid.major=element_blank(),
        panel.grid.minor=element_blank()) +
  scale_size_continuous(name="p values", limits=c(0, 0.5), breaks=seq(0, 1, 0.1), range=c(4, 1)) +
  scale_color_manual(values=c(hcl(seq(15, 375, length.out=8), 100, 65)[1:7], NA, "black")) +
  labs(x="Language", fill="Region") +
  guides(colour=FALSE,
         size=guide_legend(reverse=TRUE, override.aes=list(color=NA, fill="grey50")),
         fill=guide_legend(reverse=TRUE, override.aes=list(color=NA, size=3)))
Read data frame:
df <- data.frame(gvs = c(-2.3321916, -1.4996229, 1.795817, 2.5543214, -2.775846, -1.9659984,
2.1239768, -1.8515908, -1.5597359, -1.2534511, -0.1625003, -1.0265275,
-0.8519079, -2.4279528, -3.1717421, -0.9258497, -1.2207974, -1.0325107,
-3.2102686, -0.9377928, -1.7657017, -0.5058627, -0.7819299, -1.4095947,
2.2810477, -0.9007551, 2.6614486, -0.944198, -0.7237586, -0.9452944,
-1.2035258, -0.7703779, -3.0895998, -0.7037952, -1.9311354, -0.5423822,
-1.6244801, -0.9049735, -2.6491331, 2.3706977, -2.7590587, -0.961419,
-1.6961014, -0.8449225, -1.1163019, -0.9298674, -2.8859587, -1.4494841,
-0.7381369, -0.5644325, -0.9195156, 2.0949378),
order = c(1L, 5L, 15L, 19L, 33L, 39L, 41L, 47L, 6L, 14L, 31L, 35L, 2L,
8L, 9L, 17L, 21L, 24L, 37L, 43L, 48L, 10L, 12L, 13L, 16L, 18L,
20L, 23L, 29L, 30L, 32L, 34L, 42L, 45L, 46L, 49L, 50L, 51L, 11L,
22L, 26L, 38L, 44L, 28L, 36L, 3L, 4L, 7L, 25L, 27L, 40L, 52L),
labels = c("Adygei", "Basque", "French", "Italian", "Orcadian", "Russian",
"Sardinian", "Tuscan", "Bedouin", "Druze", "Mozabite", "Palestinian",
"Balochi", "Brahui", "Burusho", "Hazara", "Kalash", "Makrani",
"Pathan", "Sindhi", "Uygurf", "Cambodian", "Dai", "Daur", "Han",
"Hezhen", "Japanese", "Lahu", "Miao", "Mongola", "Naxi", "Oroqen",
"She", "Tu", "Tujia", "Xibo", "Yakut", "Yi", "Colombian", "Karitiana",
"Maya", "Pima", "Surui", "Melanesian", "Papuan", "BantuKenya",
"BantuSouthAfrica", "BiakaPygmy", "Mandenka", "MbutiPygmy", "San",
"Yoruba"))
Order data
df.ordered <- df[ order(df$order) , ]
And some simple (ugly) sample plotting which you can surely improve upon (maybe with ggplot)
plot(df.ordered$gvs, pch = 19)
axis(1, at=1:52, labels=df.ordered$labels, las=2)
Another option that doesn't rely on the sorting of the data frame is to use the limits parameter of a discrete scale (which, as a side benefit, allows you to do more arbitrary ordering when plotting).
df <- read.csv('/path/to/file/df.csv')
xorder <- df[order(df$order), 'labels']
ggplot(df, aes(x=labels, y=gvs, size=gvs)) +
  geom_point() +
  scale_x_discrete(limits=xorder) +
  theme(axis.text.x=element_text(angle=90))
