I'm trying to write a program that determines whether a number belongs to a cluster based on a threshold. I need the while loop to run enough times to check all the numbers and to exit only when the numbers that remain are all outside the threshold. I've been using Boolean flags to track this. However, it seems that as soon as the flag is FALSE the while loop stops.
Here's the while loop:
while(allInCluster){
centroid <- mean(cluster)
distance <- max(data)
data <- c(newData)
newData <- c()
smallest <- data[1]
allInCluster2 <- FALSE
for(i in 2:length(data)){
if((abs(centroid - data[i]) > distance) & (abs(centroid - data[i]) <= centroidThreshold)){
print("smallest")
print(smallest)
newData <- c(newData, smallest)
smallest <- data[i]
print("smallest changed")
print(smallest)
allInCluster2 <- TRUE
}else{
newData <- c(newData, data[i])
allInCluster2 <- FALSE
}
}
if(allInCluster2 == FALSE){
allInCluster <- FALSE
}
cluster <- c(cluster, smallest)
print("centroid")
print(centroid)
print("cluster")
print(cluster)
print("The other numbers")
print(newData)
}
Here's the sample data set I've been using:
data <- c(103,103,103,104,102, 102,102, 102,105,103, 110, 111)
with the threshold set to 5.
This is what it outputs:
[1] "cluster"
[1] 102 102 102 102
[1] "The other data"
[1] 103 103 103 104 105 103 110 111
[1] "centroid"
[1] 102
[1] "cluster"
[1] 102 102 102 102 103
[1] "The other numbers"
[1] 103 103 104 105 103 110 111
>
I am working with the R programming language.
I have this loop:
for (i in 1:100)
{
num_i = as.integer(rnorm(1,100,100))
print(num_i)
}
[1] 44
[1] -3
[1] -55
[1] 127
[1] 149
[1] 83
[1] 151
[1] 52
[1] 120
[1] 102
[1] 132
[1] 352
[1] 96
[1] 208
[1] 268
[1] 156
[1] 51
[1] 23
[1] 27
I only want to print every 5th output of this loop (i.e. 5th output, 10th output, 15th output, etc.):
[1] 83
[1] 132
[1] 156
I had an idea - I could use the concept of "modulo" in such a way, such that only every 5th output is printed. For example:
for (i in 1:100)
{
num_i = as.integer(rnorm(1,100,100))
ifelse(i %% 5 == 0, print(num_i), "" )
}
Have I done this correctly?
Thanks!
There are non-loop ways to do this to get the same output since rnorm can generate more than 1 number.
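A minimal sketch of the non-loop route, assuming the goal is only the 5th, 10th, 15th, ... draws (the seed and the names nums and every_fifth are illustrative assumptions, not part of the question):

```r
set.seed(1)  # assumed seed, only so the run is reproducible

# Draw all 100 numbers in one call instead of one per iteration
nums <- as.integer(rnorm(100, 100, 100))

# Keep positions 5, 10, 15, ..., 100
every_fifth <- nums[seq(5, length(nums), by = 5)]
print(every_fifth)
```

This prints the 20 selected values as one vector rather than one per line.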
However, this seems to be a simplified example of what you are doing so in this case, you can continue the for loop using if/else -
for (i in 1:100) {
num_i = as.integer(rnorm(1,100,100))
if(i %% 5 == 0) {
print(num_i)
}
}
This prints nothing when the condition i %% 5 == 0 is FALSE. If you want it to print "" in that case, add an else branch.
Alternatively, since num_i is not used when the condition is not satisfied, we can generate the number only when i %% 5 == 0. Note that this draws only 20 random numbers instead of 100, so for a given seed the printed values differ from the loop above.
for (i in 1:100) {
if(i %% 5 == 0) {
num_i = as.integer(rnorm(1,100,100))
print(num_i)
}
}
I was wondering if there is a way to convert decimal numbers to ternary, given that there is a function intToBits for converting to binary.
I actually need to convert a character string like
> S0 <- c("Hello Stac")
to base 3. I thought to first convert it to decimal with
> S01 <- utf8ToInt(S0)
> S01
## [1] 72 101 108 108 111 32 83 116 97 99
then convert the result to base 3. I want to obtain something like this:
> S1
## [1] 2200 10202 11000 11010 11022 1012 10002 11022 10121 10200
For practice, I guess you can try to write your own converter function like below
f <- function(x, base = 3) {
q <- c()
while (x) {
q <- c(x %% base, q)
x <- x %/% base
}
# as.numeric(paste0(q, collapse = ""))
sum(q * 10^(rev(seq_along(q) - 1)))
}
or with recursion
f <- function(x, base = 3) {
# pass base along, otherwise the recursive call silently falls back to base = 3
ifelse(x < base, x, f(x %/% base, base) * 10 + x %% base)
}
then you can run
> sapply(utf8ToInt(S0),f)
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
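One way to check a converter like this is to round-trip its output through base R's strtoi, which parses digit strings in a given base. The sketch below uses a hypothetical variant named to_base that returns strings rather than numbers, so that strtoi can read them back; it assumes positive inputs.

```r
# Hypothetical string-returning variant of the converter above
to_base <- function(x, base = 3) {
  digits <- c()
  while (x > 0) {
    digits <- c(x %% base, digits)  # prepend the next least-significant digit
    x <- x %/% base
  }
  paste0(digits, collapse = "")
}

codes   <- utf8ToInt("Hello Stac")
ternary <- sapply(codes, to_base)     # character vector of base-3 digits
back    <- strtoi(ternary, base = 3L) # parse the digits back to integers

print(ternary)
print(identical(back, codes))
```

Keeping the digits as a string also avoids treating a base-3 representation as if it were a decimal number.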
Nice programming exercise. I have vectorized @ThomasIsCoding's answer to avoid expensive loops over strings and characters within strings. The idea is to loop over digits instead, since Unicode code points do not exceed 21 digits in any base, whereas the total number of characters in a character vector can be orders of magnitude greater.
The function below takes as arguments a character vector x, a base b (from 2 to 10), and a logical flag double. It returns a list res such that res[[i]] is an nchar(x[i])-length vector giving the base-b representation of x[i]. The list elements are double vectors or character vectors depending on double.
utf8ToBase <- function(x, b = 10, double = TRUE) {
## Do some basic checks
stopifnot(is.character(x), !anyNA(x),
is.numeric(b), length(b) == 1L,
b %% 1 == 0, b >= 2, b <= 10)
## Require UTF-8 encoding
x <- enc2utf8(x)
## Operate on concatenation to avoid loop over strings
xx <- paste(x, collapse = "")
ixx <- utf8ToInt(xx)
## Handle trivial case early
if (length(ixx) == 0L) {
el <- if (double) base::double(0L) else character(0L)
res <- rep.int(list(el), length(x))
names(res) <- names(x)
return(res)
}
## Use common field width determined from greatest integer
width <- as.integer(floor(1 + log(max(ixx, 1), base = b)))
res <- rep.int(strrep("0", width), length(ixx))
## Loop over digits
pos <- 1L
pow <- b^(width - 1L)
while (pos <= width) {
quo <- ixx %/% pow
substr(res, pos, pos) <- as.character(quo)
ixx <- ixx - pow * quo
pos <- pos + 1L
pow <- pow %/% b
}
## Discard leading zeros
if (double) {
res <- as.double(res)
if (b == 2 && any(res > 0x1p+53)) {
warning("binary result not guaranteed due to loss of precision")
}
} else {
res <- sub("^0+", "", res)
}
## Return list
res <- split(res, rep.int(gl(length(x), 1L), nchar(x)))
names(res) <- names(x)
res
}
x <- c(foo = "Hello Stack Overflow!", bar = "Hello world!")
utf8ToBase(x, 2)
$foo
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1010011 1110100 1100001 1100011 1101011 100000
[13] 1001111 1110110 1100101 1110010 1100110 1101100
[19] 1101111 1110111 100001
$bar
[1] 1001000 1100101 1101100 1101100 1101111 100000
[7] 1110111 1101111 1110010 1101100 1100100 100001
utf8ToBase(x, 3)
$foo
[1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
[11] 10222 1012 2221 11101 10202 11020 10210 11000 11010 11102
[21] 1020
$bar
[1] 2200 10202 11000 11000 11010 1012 11102 11010 11020 11000
[11] 10201 1020
utf8ToBase(x, 10)
$foo
[1] 72 101 108 108 111 32 83 116 97 99 107 32 79 118 101
[16] 114 102 108 111 119 33
$bar
[1] 72 101 108 108 111 32 119 111 114 108 100 33
Some caveats:
For efficiency, the function concatenates the strings in x rather than looping over them. It throws an error if the concatenation would exceed 2^31-1 bytes, which is the maximum string size allowed by R.
x <- strrep(letters[1:2], 0x1p+30)
log2(sum(nchar(x))) # 31
utf8ToBase(x, 3)
Error in paste(x, collapse = "") : result would exceed 2^31-1 bytes
The largest Unicode code point is 0x10FFFF. The binary representation of this number exceeds 2^53 when interpreted as decimal, so it cannot be stored in a double vector without loss of precision:
x <- sub("^0+", "", paste(rev(as.integer(intToBits(0x10FFFF))), collapse = ""))
x
## [1] "100001111111111111111"
sprintf("%.0f", as.double(x))
## [1] "100001111111111114752"
As a defensive measure, the function warns if 2^53 is exceeded when b = 2 and double = TRUE.
utf8ToBase("\U10FFFF", b = 2, double = TRUE)
[[1]]
[1] 1.000011e+20
Warning message:
In utf8ToBase("\U{10ffff}", b = 2, double = TRUE) :
binary result not guaranteed due to loss of precision
utf8ToBase("\U10FFFF", b = 2, double = FALSE)
[[1]]
[1] "100001111111111111111"
You can use cwhmisc::int2B:
library(cwhmisc)
int2B(utf8ToInt(S0), 3)[[1]] |> as.numeric()
# [1] 2200 10202 11000 11000 11010 1012 10002 11022 10121 10200
I'm reading a .nc data in R with ncdf4 and RNetCDF. The NetCDF metadata says that there are 144 lons and 73 lats, which leads to 144 columns and 73 rows, right?
However, the data I get in R seems to be transposed with 144 rows and 73 columns.
Please could you tell me what is wrong?
thanks
library(ncdf4)
a <- tempfile()
download.file(url = "ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis2.derived/pressure/uwnd.mon.mean.nc", destfile = a)
nc <- nc_open(a)
uwnd <- ncvar_get(nc = nc, varid = "uwnd")
dim(uwnd)
## [1] 144 73 17 494
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
nrow(umed)
## [1] 144
ncol(umed)
## [1] 73
It looks like you are having two problems.
The first is expecting R to show the same structure as the netCDF file, which is a problem in itself because you are translating the multi-dimensional array of the netCDF file into a two-dimensional data frame. NetCDF data needs some reshaping in R before it can be manipulated the way it is in Python (see: http://geog.uoregon.edu/bartlein/courses/geog490/week04-netCDF.html).
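As a small illustration of the reshaping involved, here is a sketch on a toy array. It assumes the dimension order lon, lat, level, time that dim(uwnd) reports in the question; uwnd_toy and its reduced sizes are made up for the example.

```r
# Toy stand-in for uwnd: 4 lons x 3 lats x 2 levels x 5 times
uwnd_toy <- array(seq_len(4 * 3 * 2 * 5), dim = c(4, 3, 2, 5))

# One level/time slice comes back as lon x lat (rows = lon, cols = lat)
slice <- uwnd_toy[, , 1, 1]
print(dim(slice))

# Transpose a single slice to lat x lon ...
slice_latlon <- t(slice)
print(dim(slice_latlon))

# ... or permute the whole array to lat x lon x level x time
uwnd_latlon <- aperm(uwnd_toy, c(2, 1, 3, 4))
print(dim(uwnd_latlon))
```

Applied to the real data, the same t() or aperm() call would give 73 rows and 144 columns per slice.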
The second one is that you are using values instead of indices when subsetting the data.
umed <- (uwnd[ , , 10, 421] + uwnd[ , , 10, 422] + uwnd[ , , 10, 423])/3
The solution I see is to start by creating the indices for the dimensions you want to subset. In this example I subset the 10 millibar pressure level, longitudes between 230 and 300, and latitudes between 25 and 40.
nc <- nc_open("uwnd.mon.mean.nc")
LonIdx <- which( nc$dim$lon$vals > 230 & nc$dim$lon$vals <300 )
## [1] 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113
## 114 115 116 117 118 119 120
LatIdx <- which( nc$dim$lat$vals >25 & nc$dim$lat$vals < 40)
## [1] 22 23 24 25 26
LevIdx <- which( nc$dim$level$vals==10)
## [1] 17
Then apply the indices to each dimension except time, which I assume you don't want to subset. Subsetting longitude and latitude is important because R keeps everything in memory, so keeping their whole range would consume a significant amount of RAM.
lat <- ncvar_get(nc,"lat")[LatIdx]
lon <- ncvar_get(nc,"lon")[LonIdx]
lev <- ncvar_get(nc,"level")[LevIdx]
time <- ncvar_get(nc,"time")
After that you can get the variable you were looking for, uwnd (Monthly U-wind on Pressure Levels), and finish reading the netCDF file with nc_close(nc).
uwnd <- ncvar_get(nc,"uwnd")[LonIdx,LatIdx,LevIdx,]
nc_close(nc)
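The call above reads the full variable and subsets it afterwards. ncvar_get also accepts start and count arguments, so only the wanted block is read from disk. The sketch below only builds those two vectors; start_count is a hypothetical helper, the index ranges are the ones printed earlier, and the actual read is left as a comment because it needs the open file.

```r
# Hypothetical helper: turn a contiguous index vector into a start/count pair
start_count <- function(idx) {
  stopifnot(length(idx) > 0, all(diff(idx) == 1))  # indices must be contiguous
  c(start = min(idx), count = length(idx))
}

lon_sc <- start_count(94:120)  # LonIdx from above
lat_sc <- start_count(22:26)   # LatIdx from above
lev_sc <- start_count(17)      # LevIdx from above

# Dimension order is lon, lat, level, time; count = -1 means "all entries"
start <- unname(c(lon_sc["start"], lat_sc["start"], lev_sc["start"], 1))
count <- unname(c(lon_sc["count"], lat_sc["count"], lev_sc["count"], -1))
print(start)
print(count)

# With an open ncdf4 handle `nc`, the subset could then be read directly:
# uwnd <- ncdf4::ncvar_get(nc, "uwnd", start = start, count = count)
```

This only works when each index vector is a contiguous run, which is the case for the ranges used here.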
Finally, you can expand the grid over all four dimensions: longitude, latitude, pressure level and time.
uwndf <- data.frame(as.matrix(cbind(expand.grid(lon,lat,lev,time))),c(uwnd))
names(uwndf) <- c("lon","lat","level","time","U-wind")
Bind it to a dataframe with the U-wind variable and convert the netcdf time variable into an R time object.
uwndf$time_final<-convertDateNcdf2R(uwndf$time, units = "hours", origin =
as.POSIXct("1800-01-01", tz = "UTC"),time.format="%Y-%m-%d %Z %H:%M:%S")
At the end you will have the dataframe you are looking for between Jan 1979 and March 2020.
max(uwndf$time_final)
## [1] "2020-03-01 UTC"
min(uwndf$time_final)
## [1] "1979-01-01 UTC"
head(uwndf)
## lon lat level time U-wind time_final
## 1 232.5 37.5 10 1569072 3.289998 1979-01-01
## 2 235.0 37.5 10 1569072 5.209998 1979-01-01
## 3 237.5 37.5 10 1569072 7.409998 1979-01-01
## 4 240.0 37.5 10 1569072 9.749998 1979-01-01
## 5 242.5 37.5 10 1569072 12.009998 1979-01-01
## 6 245.0 37.5 10 1569072 14.089998 1979-01-01
I hope this is useful! Cheers!
Note: For converting the netcdf time variable into an R time object make sure you have the ncdf.tools library installed.
What I would like to do is for the red points find the nearest equivalent blue dot on the other side of the abline (i.e. 1,5 find 5,1).
Data:
https://1drv.ms/f/s!Asb7WztvacfOuesIq4evh0jjvejZ4Q
Edit: to open data do readRDS("path/to/data")
What I have tried is to take the differences between the x and y coordinates, rank them, and then find the minimum value going down the ranks for both x and y. The results are pretty bad. What I'm struggling with is finding the nearest match between tuples.
My attempt:
find_nearest <- function(query, subject){
weight_df <- data.frame(ID=query$ID)
#find difference of first, then second, rank and find match in both going from top to bottom
tmp_df <- query
for(i in 1:nrow(subject)){
first_order <- order(abs(query$mean_score_n-subject$mean_score_n[i]))
second_order <- order(abs(query$mean_score_p-subject$mean_score_p[i]))
tmp_df$order_1[first_order] <- seq(1, nrow(tmp_df))
tmp_df$order_2[second_order] <- seq(1, nrow(tmp_df))
weight_df[,i+1] <- tmp_df$order_1 + tmp_df$order_2
}
rownames(weight_df) <- weight_df$ID
weight_df$ID <- NULL
print(dim(weight_df))
nearest_match <- list()
count <- 1
subject_ids <- NA
query_ids <- NA
while(ncol(weight_df) > 0 & count <= ncol(weight_df)){
pos <- which(weight_df == min(weight_df, na.rm = TRUE), arr.ind = TRUE)
if(length(unique(rownames(pos))) > 1){
for(i in nrow(pos)){
#if subject/query already used then mask and find another
if(subject$ID[pos[i,2]] %in% subject_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else if(query$ID[pos[i,1]] %in% query_ids){
weight_df[pos[i,1],pos[i,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[i,2]])
query_ids <- c(query_ids, query$ID[pos[i,1]])
nearest_match[[count]] <- data.frame(query=query[pos[i,1],]$ID, subject=subject[pos[i,2],]$ID)
#mask
weight_df[pos[i,1],pos[i,2]] <- NA
count <- count + 1
}
}
}else if(nrow(pos) > 1){
#if subject/query already used then mask and find another
if(subject$ID[pos[1,2]] %in% subject_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else if(query$ID[pos[1,1]] %in% query_ids){
weight_df[pos[1,1],pos[1,2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[1,1]])
query_ids <- c(query_ids, query$ID[pos[1,1]])
nearest_match[[count]] <- data.frame(query=query[pos[1,1],]$ID, subject=subject[pos[1,2],]$ID)
#mask
weight_df[pos[1,1],pos[1,2]] <- NA
count <- count + 1
}
}else{
#if subject/query already used then mask and find another
if(subject$ID[pos[2]] %in% subject_ids){
weight_df[pos[1],pos[2]] <- NA
}else if(query$ID[pos[1]] %in% query_ids){
weight_df[pos[1],pos[2]] <- NA
}else{
subject_ids <- c(subject_ids, subject$ID[pos[2]])
query_ids <- c(query_ids, query$ID[pos[1]])
nearest_match[[count]] <- data.frame(query=query[pos[1],]$ID, subject=subject[pos[2],]$ID)
#mask
weight_df[pos[1],pos[2]] <- NA
count <- count + 1
}
}
}
out <- plyr::ldply(nearest_match, rbind)
out <- merge(out, data.frame(subject=subject$ID,
mean_score_p_n=subject$mean_score_p,
mean_score_n_n= subject$mean_score_n), by="subject", all.x=TRUE)
out <- merge(out, data.frame(query=query$ID,
mean_score_p_p=query$mean_score_p,
mean_score_n_p= query$mean_score_n), by="query", all.x=TRUE)
return(out)
}
Edit: is this what the solution looks like for you?
ggplot() +
geom_point(data=B[out,], aes(x=mean_score_p, y= mean_score_n, color="red")) +
geom_point(data=A, aes(x=mean_score_p, y=mean_score_n, color="blue")) +
geom_abline(intercept = 0, slope = 1)
Let
query <- readRDS("query.dms")
subject <- readRDS("subject.dms")
kA <- nrow(subject)
kB <- nrow(query)
A <- as.matrix(subject[, 2:3])
B <- as.matrix(query[, 2:3])
where we want to find the closest "reverse" point (row) in B to each point in A.
Solution permitting non-unique results
Then, assuming that you are using the Euclidean distance,
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]
unname(apply(D, 2, which.min))
# [1] 268 183 350 284 21 360 132 287 100 298 58 56 170 70 47 305 353
# [18] 43 266 198 58 215 198 389 412 321 255 181 79 340 292 268 198 54
# [35] 390 38 376 47 19 94 244 18 168 201 160 194 114 247 287 273 182
# [52] 87 94 87 192 63 160 244 101 298 62
are the corresponding row numbers in B. The trick was to switch the coordinates of the points in B by using B[, 2:1].
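To see the coordinate swap at work, here is a sketch on a few made-up points (the toy matrices A and B below are invented for illustration and are not the question's data):

```r
# Toy points: each row of A has an exact mirror image somewhere in B
A <- rbind(c(1, 5), c(2, 8))
B <- rbind(c(8, 2), c(5, 1), c(3, 3))
kA <- nrow(A)
kB <- nrow(B)

# Distances between every point of A and every coordinate-swapped point of B
D <- as.matrix(dist(rbind(A, B[, 2:1])))[(1 + kA):(kA + kB), 1:kA]

# For each column (point in A), the row of B with the smallest distance
nearest <- unname(apply(D, 2, which.min))
print(nearest)
```

Here (1, 5) is matched with row 2 of B, which is (5, 1), and (2, 8) with row 1, which is (8, 2).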
Solution with unique results
out <- vector("numeric", length = kA)
colnames(D) <- 1:ncol(D)
rownames(D) <- 1:nrow(D)
while(any(out == 0))
for(i in 1:nrow(D)) {
aux <- apply(D, 2, which.min)
if(i %in% aux) {
win <- which(aux == i)[which.min(D[i, aux == i])]
out[as.numeric(names(win))] <- as.numeric(rownames(D)[i])
D <- D[-i, -win, drop = FALSE]
}
}
out
# [1] 268 183 350 284 21 360 132 213 100 298 22 56 170 70 128 305 353
# [18] 43 266 198 58 215 294 389 412 321 255 181 79 340 292 20 347 54
# [35] 390 38 376 47 19 94 73 18 168 201 160 194 114 247 287 273 182
# [52] 87 365 158 192 63 211 244 101 68 62
whereas
all(table(out) == 1)
# [1] TRUE
confirms uniqueness. The solution is not the most efficient, but on your dataset it takes only a couple of seconds. It takes some time because it keeps going over all the available points in B checking if it is the closest one to any of the points in A. If so, the corresponding point in B is assigned to the closest one in A. Then both the point in A and the point in B are eliminated from the distance matrix. The loop goes until every point in A has some match in B.
I have a question regarding loop in R.
For example, at t=0 there are 100 people alive. Each person stays alive with probability exp(-mu), where I set mu=0.1.
I want to generate 10 samples of the number of people alive at t=1. This is what I did and what I got:
command:
set.seed(123)
alive <- 100
mu <- 0.1
sample <- 10
alive1 <- rbinom(sample,alive,exp(-mu))
alive1
# [1] 92 88 91 87 86 95 90 87 90 91
and now, i want to keep continuing doing it until time t=20.
command :
alive2 <- rbinom(10,alive1,exp(-mu))
alive2
alive3 <- rbinom(10,alive2,exp(-mu))
alive3
....
alive20 <-rbinom (10,alive19,exp(-mu))
alive20
output :
alive2 <- rbinom(10,alive1,exp(-mu))
alive2
# [1] 78 80 81 78 81 82 83 83 83 77
alive3 <- rbinom(10,alive2,exp(-mu))
alive3
# [1] 67 71 72 63 72 73 75 75 77 72
...
however, i do not want to keep on repeating the command especially if i want to extend my time to a longer period. how do i do the looping in r for my problem?
thanks!
set.seed(123)
alive <- vector("list", 20)
mu <- 0.1
n <- 10
alive[[1]] <- rbinom(n, 100, exp(-mu))
for(i in 2:20)
alive[[i]] <- rbinom(n, alive[[i-1]], exp(-mu))
I renamed the variable sample to n to avoid confusion with the commonly used function sample().
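If a rectangular result is more convenient than a list, the same recursion can fill a matrix with one row per time step and one column per sample. This is a sketch of that variation; the name steps and the matrix layout are assumptions, not part of the answer above.

```r
set.seed(123)
mu    <- 0.1
n     <- 10   # number of samples
steps <- 20   # number of time steps (assumed name)

# Row t holds the n simulated counts alive at time t
alive <- matrix(NA_integer_, nrow = steps, ncol = n)
alive[1, ] <- rbinom(n, 100, exp(-mu))
for (t in 2:steps) {
  # Survivors at time t are drawn from those alive at time t - 1
  alive[t, ] <- rbinom(n, alive[t - 1, ], exp(-mu))
}

print(dim(alive))
print(alive[steps, ])
```

Extending the horizon then only requires changing steps.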
set.seed(123)
alive <- 100
mu <- 0.1
sample <- 10
alive1 <- rbinom(sample,alive,exp(-mu))
for ( i in 2:20)
{
assign(
paste0("alive",i),
rbinom(10,get(paste0("alive",(i-1))),exp(-mu))
)
}
Or @Backlin's suggestion of putting it in a list -
set.seed(123)
alive <- 100
mu <- 0.1
sample <- 10
Aliveset <- vector(mode = "list", length = 20)
Aliveset[[1]] <- rbinom(sample,alive,exp(-mu))
for ( i in 2:20)
{
Aliveset[[i]] <- rbinom(10,Aliveset[[i-1]],exp(-mu))
}