How to create a for loop in R for a small dataset? - r

My data set is like this:
serum \ blood pressure    127    127    127-146    127-146
200-219                     2     11          3         12
220-259                     3     88         23        100
>259                        7     56         11         11
How can I create a for loop to run these tests?
prop.test(2,11,p=p0,correct = FALSE)
prop.test(3,88,p=p0,correct = FALSE)
prop.test(7,56,p=p0,correct = FALSE)
prop.test(3,12,p=p0,correct = FALSE)
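A minimal sketch of one way to do it (not from the original post): collect the (x, n) pairs from the table into vectors and loop over them, storing each result in a list. p0 is assumed to be defined elsewhere in the script.
# Loop over the (successes, trials) pairs from the table above.
# p0 is assumed to be defined earlier.
x <- c(2, 3, 7, 3)
n <- c(11, 88, 56, 12)
results <- vector("list", length(x))
for (i in seq_along(x)) {
  results[[i]] <- prop.test(x[i], n[i], p = p0, correct = FALSE)
}
results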

Related

Display result of predict in a 3 x 4 table

I have the following code in R:
v <- c("featureA", "featureB")
newdata <- unique(data[v])
print(unique(data[v]))
print(predict(model, newdata, type = 'response', allow.new.levels = TRUE))
And I got the following result
featureA featureB
1 bucket_in_10_to_30 bucket_in_90_to_100
2 bucket_in_10_to_30 bucket_in_50_to_90
3 bucket_in_0_to_10 bucket_in_50_to_90
4 bucket_in_0_to_10 bucket_in_90_to_100
7 bucket_in_10_to_30 bucket_in_10_to_50
10 bucket_in_30_to_100 bucket_in_90_to_100
19 bucket_in_0_to_10 bucket_in_0_to_10
33 bucket_in_0_to_10 bucket_in_10_to_50
36 bucket_in_30_to_100 bucket_in_10_to_50
38 bucket_in_10_to_30 bucket_in_0_to_10
52 bucket_in_30_to_100 bucket_in_0_to_10
150 bucket_in_30_to_100 bucket_in_50_to_90
1 2 3 4 7 10 19 33 36 38 52 150
0.001920662 0.005480186 0.000961198 0.000335883 0.006311521 0.004005570 0.000620979 0.001107773 0.013100210 0.003546136 0.007382468 0.011384935
And I'm wondering whether it's possible in R to reshape this and directly get a 3 x 4 table similar to this:
featureA \ featureB    bucket_in_90_to_100    bucket_in_50_to_90    ...
bucket_in_0_to_30      ...                    ...
bucket_in_0_to_30      ...                    ...
Thanks for the help!
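One possible approach (a sketch, not from the original thread): attach the predictions to newdata as a column, then spread them into the featureA-by-featureB grid with xtabs().
# Attach predictions, then cast into a featureA x featureB table.
newdata$pred <- predict(model, newdata, type = "response",
                        allow.new.levels = TRUE)
xtabs(pred ~ featureA + featureB, data = newdata)
Since every featureA/featureB combination occurs at most once, each cell simply holds that combination's predicted value.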

Writing out a .dat file in R

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student IDs, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student IDs vary in the number of digits, but scores always have one digit. I am trying to write this out as a .dat file by generating some objects and using them in the write.fwf() function from the gdata library.
library(gdata)
item.count <- dim(data)[2] - 1  # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")
I would like to separate the student IDs and the question responses with some spaces, so I reserved 5 characters for the IDs by specifying width = c(5, rep(1, item.count)) in write.fwf(). However, the output file ends up with the padding on the left side of the student IDs:
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than on the right side of the IDs:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to combine the 'scores' columns into a single one and then use write.csv:
library(dplyr)
library(tidyr)
data %>%
  unite(scores, starts_with('scores'), sep = '')
With @akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
data2 <- data %>%
  unite(scores, starts_with('scores'), sep = '')
write.fwf(data2, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
In the .dat file, the dataset now looks like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
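For reference, a base-R alternative sketch (not from the original thread) that left-justifies the IDs in a 5-character field and appends the concatenated scores:
# Left-justify each ID in a 5-character field, then append the scores.
lines <- paste0(formatC(data$ids, width = 5, flag = "-"),
                do.call(paste0, data[-1]))
writeLines(lines, "data.dat")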

Cut dendrogram / cluster: Error in function 'cutree': tree incorrect (composante 'merge')

I have a dendrogram which I want to cut into fewer clusters, because right now there are too many for interpretation.
My dataframe looks like this:
> head(alpha)
locs 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
1 A1 12 14 15 15 14 21 10 18 18 20
2 A2 11 11 12 13 9 16 20 18 18 11
3 B1 12 13 20 17 21 20 27 14 22 25
4 B2 15 18 18 25 21 17 27 23 28 23
5 B3 22 22 26 24 28 23 31 25 32 25
6 B4 18 21 25 20 20 14 23 22 20 26
library("ggplot2") #for the heatmap
library("ggdendro") #for the dendrogram
library("reshape2") #for data wrangling
library("grid") #to combine the two plots heatmap and dendrogram
# Read in data
setwd("C:/Users/data")
alpha <- read.csv2("alpha.csv", header=TRUE, check.names=FALSE)
str(alpha) #structure of the dataset: locations (locs) = factor, values = integer
#scale the data variables (columns 2-11)
alpha.scaled <- alpha
alpha.scaled[, c(2:11)] <- scale(alpha.scaled[, 2:11])
# run clustering
alpha.matrix <- as.matrix(alpha.scaled[, -c(1)])
rownames(alpha.matrix) <- alpha.scaled$locs
alpha.dendro <- as.dendrogram(hclust(d = dist(x = alpha.matrix), method="complete" ))
# Create dendrogram (=cluster)
dendro.plot <- ggdendrogram(data = alpha.dendro, rotate = TRUE)
alphacut <- cutree(alpha.dendro, h=3)
Error in cutree(alpha.dendro, h = 3) :
  'tree' incorrect (composante 'merge')
alphacut <- cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method="complete")), k=5)
Error in cutree(as.dendrogram(hclust(d = dist(x = alpha.matrix), method = "complete")), :
  'tree' incorrect (composante 'merge')
I haven't found a solution to this. When I look at 'alpha.dendro' there is a list of 2 but no merge component, so this seems to be the problem. Does somebody know what to do?
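For what it's worth, a minimal sketch of a likely fix (not part of the original post): stats::cutree() expects an hclust object, which carries the merge component, while the object returned by as.dendrogram() does not.
# Cut the hclust result directly instead of the dendrogram:
alpha.hclust <- hclust(d = dist(x = alpha.matrix), method = "complete")
alphacut <- cutree(alpha.hclust, k = 5)      # works: hclust objects keep 'merge'
alpha.dendro <- as.dendrogram(alpha.hclust)  # convert afterwards for plotting
# Alternatively, the dendextend package provides a cutree() method that
# accepts dendrogram objects directly.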

Clustering biological sequences based on numeric values

I am trying to cluster several amino acid sequences of a fixed length (13) into K clusters based on the Atchley factors (5 numbers which represent each amino acid).
For example, I have an input vector of strings like the following:
key <- HDMD::AAMetric.Atchley
sequences <- sapply(1:10000, function(x) paste(sapply(1:13, function (X) sample(rownames(key), 1)), collapse = ""))
However, my actual list of sequences numbers over 10^5 (hence the need for computational efficiency).
I then convert these sequences into numeric vectors as follows:
key <- HDMD::AAMetric.Atchley
# look up the 5 Atchley factors for every amino acid character
m1 <- key[strsplit(paste(sequences, collapse = ""), "")[[1]], ]
p <- 13  # sequence length
# regroup by position: one row per sequence, 13 positions x 5 factors = 65 columns
output <- do.call(cbind, lapply(1:p, function(i)
  m1[seq(i, nrow(m1), by = p), ]))
I want to cluster this output (each sequence is now a 65-dimensional vector) in an efficient way.
I was originally using mini-batch k-means, but I noticed the results were very inconsistent when I repeated it. I need a consistent clustering approach.
I was also concerned about the curse of dimensionality, since at 65 dimensions Euclidean distance stops working well.
Many high-dimensional clustering algorithms I saw assume that outliers and noise exist in the data, but as these are biological sequences converted to numeric values, there are no outliers or noise.
In addition, feature selection will not work, as every amino acid and each of its properties is relevant in the biological context.
How would you recommend clustering these vectors?
I think self-organizing maps can be of help here; at least the implementation is quite fast, so you will know soon enough whether it is helpful or not.
Using the data from the OP along with:
rownames(output) <- 1:nrow(output)
colnames(output) <- make.names(colnames(output), unique = TRUE)
library(SOMbrero)
You define the number of clusters (the grid dimensions) in advance:
fit <- trainSOM(x.data = output, dimension = c(5, 5), nb.save = 10, maxit = 2000,
                scaling = "none", radius.type = "gaussian")
nb.save stores intermediate states so you can explore how the training developed over the iterations:
plot(fit, what ="energy")
It seems like more iterations are in order.
Check the frequency of clusters:
table(fit$clustering)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
428 417 439 393 505 458 382 406 271 299 390 303 336 358 365 372 332 268 437 464 541 381 569 419 467
Predict clusters based on new data:
predict(fit, output[1:20, ])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
19 12 11 8 9 1 11 13 14 5 18 2 22 21 23 22 4 14 24 12
check which variables were important for clustering:
summary(fit)
#part of output
Summary
Class : somRes
Self-Organizing Map object...
online learning, type: numeric
5 x 5 grid with square topology
neighbourhood type: gaussian
distance type: euclidean
Final energy : 44.93509
Topographic error: 0.0053
ANOVA :
Degrees of freedom : 24
F pvalue significativity
pah 1.343 0.12156074
pss 1.300 0.14868987
ms 16.401 0.00000000 ***
cc 1.695 0.01827619 *
ec 17.853 0.00000000 ***
find optimal number of clusters:
plot(superClass(fit))
fit1 <- superClass(fit, k = 4)
summary(fit1)
#part of output
SOM Super Classes
Initial number of clusters : 25
Number of super clusters : 4
Frequency table
1 2 3 4
6 9 4 6
Clustering
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1 1 2 2 2 1 1 2 2 2 1 1 2 2 2 3 3 4 4 4 3 3 4 4 4
ANOVA
Degrees of freedom : 3
F pvalue significativity
pah 1.393 0.24277933
pss 3.071 0.02664661 *
ms 19.007 0.00000000 ***
cc 2.906 0.03332672 *
ec 23.103 0.00000000 ***
Much more in this vignette
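If you then need the final super-class label for each individual sequence, a hedged sketch (the $cluster accessor on the superClass() result is an assumption here, not confirmed by the original answer; verify with str(fit1)):
# Map each observation's SOM unit to its super class.
# NOTE: fit1$cluster as the per-unit super-class vector is an assumption
# about SOMbrero's somSC object.
seq.super <- fit1$cluster[fit$clustering]
table(seq.super)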

How to change the size of plots according to a scale on ggmap?

I have been trying to plot some data on a map, with the circle for each observation sized according to a scale column, but the plot produced doesn't reflect the scale.
This is the code I have tried:
library(ggmap)
newmap <- get_map(location = c(lon = 82.5, lat = 24), zoom = 4, color = "bw")
ggmap(newmap, extent = "normal") +
  geom_point(aes(x = lon, y = lat, colour = myscale, size = myscale), data = final_data)
I got the code from the following post.
My data looks like this.
> final_data
lon lat disab scale
1 74.79728 34.083671 27832 1
2 74.87226 31.633979 28119 1
3 75.85728 30.900965 34830 1
4 77.31779 28.408912 33579 1
5 77.10249 28.704059 228427 6
6 75.78727 26.912434 74541 2
7 73.02431 26.238947 24898 1
8 75.86475 25.213816 20843 1
9 77.70641 28.984462 27864 1
10 77.45376 28.669156 84458 2
11 78.00807 27.176670 54382 2
12 80.94617 26.846694 77684 2
13 80.33187 26.449923 81988 2
14 81.84631 25.435801 37750 1
15 82.97391 25.317645 39408 2
16 85.13756 25.594095 68869 2
17 86.95240 23.673945 24627 1
18 88.36390 22.572646 342319 8
19 86.43039 23.795653 28865 1
20 86.20288 22.804566 20766 1
21 85.30956 23.344100 22957 1
22 81.28492 21.190449 22061 1
23 81.62964 21.251384 25868 1
24 78.18283 26.218287 18434 1
25 75.85773 22.719569 56279 2
26 77.41262 23.259933 73219 2
27 79.98641 23.181467 32597 1
28 72.57136 23.022505 188917 5
29 70.80216 22.303894 20219 1
30 73.18122 22.307159 47587 2
31 72.83106 21.170240 55055 2
32 75.34331 19.876165 36205 1
33 79.08815 21.145800 63969 2
34 73.78980 19.997453 26572 1
35 72.83973 19.391928 37382 1
36 72.81771 19.003050 484688 11
37 73.85674 18.520430 127858 3
38 78.48667 17.385044 294072 7
39 80.64802 16.506174 40592 2
40 NA NA 53865 2
41 77.61586 12.941483 251561 6
42 75.37037 11.874477 33907 1
43 75.78041 11.258753 51981 2
44 76.07400 11.073182 31863 1
45 76.21443 10.527642 38573 2
46 76.26730 9.931233 41432 2
47 76.61414 8.893212 23403 1
48 76.93664 8.524139 39024 2
49 80.27072 13.082680 163428 4
50 78.70467 10.790483 14489 1
51 78.11978 9.925201 19890 1
52 76.95583 11.016844 32794 1
It would be a ton of help if someone could help me figure out the problem. :)
Thanks in advance.
It does not work because there is no myscale column in your data, final_data.
Change myscale to scale:
newmap <- get_map(location = c(lon = 82.5, lat = 24), zoom = 4, color = "bw")
ggmap(newmap, extent = "normal") +
  geom_point(aes(x = lon, y = lat, colour = scale, size = scale), data = final_data)
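If the default point sizes make the differences hard to see, an optional tweak (an assumption, not part of the original answer) is to widen the mapped size range with ggplot2's scale_size_continuous():
# Optional: widen the point-size range so the scale differences stand out.
ggmap(newmap, extent = "normal") +
  geom_point(aes(x = lon, y = lat, colour = scale, size = scale),
             data = final_data) +
  scale_size_continuous(range = c(2, 10))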
