R: dividing a dataset into ranged bins?

I am having some problems sorting my dataset into bins based on the numeric value of each data point. I tried doing it with the shingle function from the lattice package, which seems to split the data accurately.
However, I can't seem to extract the desired output, namely how the data is divided into the predefined bins. I am only able to print it.
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
How do I extract the intervals output by the shingle function, rather than only printing them?
The intervals being this output:
Intervals:
min max count
1 0.38 0.40 0
2 0.42 0.44 6
3 0.46 0.48 46
4 0.50 0.52 251
5 0.54 0.56 697
6 0.58 0.60 1062
7 0.62 0.64 1215
8 0.66 0.68 1227
9 0.70 0.72 1231
10 0.74 0.76 1293
11 0.78 0.80 1330
12 0.82 0.84 1739
13 0.86 0.88 2454
14 0.90 0.92 3048
15 0.94 0.96 8936
16 0.98 1.00 71446
I want them as a variable that can be fed to another function.

The shingle() function stores these values as attributes, which you can inspect with attributes().
The levels are specifically given by attr(bin_1,"levels").
So:
set.seed(1337)
data_1 = runif(100)
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
attr(bin_1,"levels")
This gives:
[,1] [,2]
[1,] 0.38 0.40
[2,] 0.42 0.44
[3,] 0.46 0.48
[4,] 0.50 0.52
[5,] 0.54 0.56
[6,] 0.58 0.60
[7,] 0.62 0.64
[8,] 0.66 0.68
[9,] 0.70 0.72
[10,] 0.74 0.76
[11,] 0.78 0.80
[12,] 0.82 0.84
[13,] 0.86 0.88
[14,] 0.90 0.92
[15,] 0.94 0.96
[16,] 0.98 1.00
Edit
The count information for each interval is only computed within the print.shingle method. Thus, you would need to run the following code:
count.shingle <- function(x) {
  l <- levels(x)
  n <- nlevels(x)
  int <- data.frame(min = numeric(n), max = numeric(n), count = numeric(n))
  for (i in 1:n) {
    int$min[i]   <- l[[i]][1]
    int$max[i]   <- l[[i]][2]
    int$count[i] <- length(x[x >= l[[i]][1] & x <= l[[i]][2]])
  }
  int
}
a = count.shingle(bin_1)
This gives:
> a
min max count
1 0.38 0.40 0
2 0.42 0.44 1
3 0.46 0.48 3
4 0.50 0.52 1
5 0.54 0.56 2
6 0.58 0.60 2
7 0.62 0.64 2
8 0.66 0.68 4
9 0.70 0.72 1
10 0.74 0.76 3
11 0.78 0.80 2
12 0.82 0.84 2
13 0.86 0.88 5
14 0.90 0.92 1
15 0.94 0.96 1
16 0.98 1.00 2
where a$min is the lower bound of each bin, a$max is the upper bound, and a$count is the number of observations within the bin.
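As an aside, if you don't need the shingle object itself, the same min/max/count table can be built directly from the interval matrix in base R. A minimal sketch, assuming the data_1 and bin_interval from above:
# Count the values falling inside each [min, max] row of the interval matrix.
counts <- apply(bin_interval, 1, function(b) sum(data_1 >= b[1] & data_1 <= b[2]))
data.frame(min = bin_interval[, 1], max = bin_interval[, 2], count = counts)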

Extracting dataframe values using indices in R

I have 100+ files, and starting and ending column coordinates for each file. Based on those coordinates, I want to extract the corresponding region from each data set and store it in a file. I have used the following approach, but it's not giving me the expected output.
startco holds the starting indices of the 1st, 2nd and 3rd file respectively, and endco holds the ending indices of the 1st, 2nd and 3rd file respectively. If an index goes beyond a file's indices, I want to put NA.
Example:
startco <- c(3, 4, 1)
endco <- c(5, 6, 2)
ctc <- c(1, 2, 3)
for (hm0 in 1:length(ctc)) {
  for (hm1 in 1:length(startco)) {
    for (hm2 in 1:length(endco)) {
      methd1 <- read.table(paste0("path/to folder/", "file_", ctc[hm0], ".txt"))
      methd2 <- methd1[, startco[hm1]:endco[hm2]]
    }
  }
}
File_1.txt
V1 V2 V3 V4 V5
41 42 43 45 46
0.31 0.21 0.87 0.65 0.54
0.32 0.28 0.74 0.87 0.65
0.19 0.12 0.99 0.99 0.89
File_2.txt
V1 V2 V3 V4 V5
12 24 13 14 16
0.89 0.78 0.50 0.22 0.34
0.54 0.78 0.50 0.34 0.41
0.78 0.54 0.66 0.26 0.14
File_3.txt
V1 V2 V3 V4 V5
1 2 3 5 6
0.20 0.40 0.50 0.49 0.52
Expected output :
43 45 46
0.87 0.65 0.54
0.74 0.87 0.65
0.99 0.99 0.89
0.22 0.34 NA
0.34 0.41 NA
0.99 0.89 NA
1 2
0.20 0.40
In base R you could do:
fun <- function(path, start, end) {
  id <- basename(path)
  dat <- read.table(path, header = TRUE)
  p <- ncol(dat)
  n <- nrow(dat)
  neg <- if (start < 0) -start else 0
  add <- matrix(nrow = n, ncol = neg)
  if (start < 1) start <- 1
  if (end > p) end <- p
  d <- cbind(add, dat[, start:end])
  names(d) <- paste0('X', seq(ncol(d)))
  cbind(id, r = seq(nrow(d)), d)
}
startco <- c(3, 4, -2) # TAKES NEGATIVE INDICES
endco <- c(5, 6, 2)
ctc <- c(1, 2, 3)
files <- file.path('path/to/folder', paste0('file_', ctc, '.txt'))
A <- Map(fun, files, startco, endco)
Reduce(function(x, y) merge(x, y, all = TRUE), A)[, -(1:2)]
X1 X2 X3 X4
1 43.00 45.00 46.00 NA
2 0.87 0.65 0.54 NA
3 0.74 0.87 0.65 NA
4 0.99 0.99 0.89 NA
5 14.00 16.00 NA NA
6 0.22 0.34 NA NA
7 0.34 0.41 NA NA
8 0.26 0.14 NA NA
9 NA NA 1.00 2.0
10 NA NA 0.20 0.4
And the one with no negatives:
startco <- c(3, 4, 1)
B <- Map(fun, files, startco, endco)
Reduce(function(x, y) merge(x, y, all = TRUE), B)[, -(1:2)]
X1 X2 X3
1 43.00 45.00 46.00
2 0.87 0.65 0.54
3 0.74 0.87 0.65
4 0.99 0.99 0.89
5 14.00 16.00 NA
6 0.22 0.34 NA
7 0.34 0.41 NA
8 0.26 0.14 NA
9 1.00 2.00 NA
10 0.20 0.40 NA
I would use a readfun,
readfun <- \(i, s, e) {
  stopifnot(s != 0)
  r <- read.table(paste0("foo1/", "file_", i, ".txt"), header = TRUE)
  if (e > ncol(r)) {  ## inserts cols to the right if e > ncol
    e1 <- e - ncol(r)
    nm <- paste0('V', as.numeric(substring(colnames(r), 2)[ncol(r)]) + seq_len(e1))
    m <- matrix(NA_real_, nrow(r), e1, dimnames = list(NULL, nm))
    r <- cbind(r, m)
  }
  if (s < 0) {  ## inserts cols to the left if s < 0
    m <- matrix(NA_real_, nrow(r), -s)
    r <- cbind(m, r)
    e <- e + -s
    s <- 1
  }
  out <- r[, s:e]
  unname(as.matrix(out))
}
in Map.
ctc <- c(1, 2, 3); startco <- c(3, 4, -2); endco <- c(5, 6, 2)
Map(readfun, ctc, startco, endco)
# [[1]]
# [,1] [,2] [,3]
# [1,] 43.00 45.00 46.00
# [2,] 0.87 0.65 0.54
# [3,] 0.74 0.87 0.65
# [4,] 0.99 0.99 0.89
#
# [[2]]
# [,1] [,2] [,3]
# [1,] 14.00 16.00 NA
# [2,] 0.22 0.34 NA
# [3,] 0.34 0.41 NA
# [4,] 0.26 0.14 NA
#
# [[3]]
# [,1] [,2] [,3] [,4]
# [1,] NA NA 1.0 2.0
# [2,] NA NA 0.2 0.4
Data:
dir.create('foo1')
write.table(read.table(header=TRUE, text='
V1 V2 V3 V4 V5
41 42 43 45 46
0.31 0.21 0.87 0.65 0.54
0.32 0.28 0.74 0.87 0.65
0.19 0.12 0.99 0.99 0.89'), './foo1/file_1.txt', row.names=F)
write.table(read.table(header=TRUE, text='
V1 V2 V3 V4 V5
12 24 13 14 16
0.89 0.78 0.50 0.22 0.34
0.54 0.78 0.50 0.34 0.41
0.78 0.54 0.66 0.26 0.14'), './foo1/file_2.txt', row.names=F)
write.table(read.table(header=TRUE, text='
V1 V2 V3 V4 V5
1 2 3 5 6
0.20 0.40 0.50 0.49 0.52 '), './foo1/file_3.txt', row.names=F)

How do I create a 3D surface plot in R if I have a dataframe of 3 columns?

Here are the first 20 rows of my dataframe:
x y z
1 0.50 0.50 48530.98
2 0.50 0.51 49029.34
3 0.50 0.52 49576.12
4 0.50 0.53 50161.22
5 0.50 0.54 50752.05
6 0.50 0.55 51354.43
7 0.50 0.56 51965.09
8 0.50 0.57 38756.51
9 0.50 0.58 39262.34
10 0.50 0.59 39783.68
11 0.51 0.60 41052.09
12 0.51 0.61 41447.51
13 0.51 0.62 26972.85
14 0.51 0.63 27134.74
15 0.51 0.64 27297.85
16 0.51 0.65 27462.82
17 0.51 0.66 27632.45
18 0.51 0.67 27806.77
19 0.51 0.68 27988.12
20 0.51 0.69 25514.42
I need to create a 3d surface plot to view it.
The best would be one where I can rotate it around angles to view it from all perspectives.
Thanks.
You can use plotly to create a 3D surface plot. Use xtabs to turn your data into a suitable matrix:
library(plotly)
plot_ly(z = ~xtabs(z ~ x + y, data = df)) %>% add_surface()
Sample data
df <- read.table(text =
" x y z
1 0.50 0.50 48530.98
2 0.50 0.51 49029.34
3 0.50 0.52 49576.12
4 0.50 0.53 50161.22
5 0.50 0.54 50752.05
6 0.50 0.55 51354.43
7 0.50 0.56 51965.09
8 0.50 0.57 38756.51
9 0.50 0.58 39262.34
10 0.50 0.59 39783.68
11 0.51 0.60 41052.09
12 0.51 0.61 41447.51
13 0.51 0.62 26972.85
14 0.51 0.63 27134.74
15 0.51 0.64 27297.85
16 0.51 0.65 27462.82
17 0.51 0.66 27632.45
18 0.51 0.67 27806.77
19 0.51 0.68 27988.12
20 0.51 0.69 25514.42", header = T)
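If you only need a quick static view rather than an interactive one, the same xtabs matrix can be handed to base R's persp(). A sketch under the same assumptions (note that x/y combinations absent from df come out as 0 in the crosstab):
m <- xtabs(z ~ x + y, data = df)  # rows = unique x values, columns = unique y values
persp(x = as.numeric(rownames(m)), y = as.numeric(colnames(m)), z = unclass(m),
      theta = 30, phi = 25, xlab = "x", ylab = "y", zlab = "z")
For free rotation, though, the plotly surface above (or rgl::persp3d) is the better fit.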

Applying a custom function repeatedly to the same dataframe using purrr

Suppose I have a dataframe as follows:
df <- data.frame(
  alpha = 0:20,
  beta = 30:50,
  gamma = 100:120
)
I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:
newfun <- function(var = NULL) {
  newname <- paste0(var, "NEW")
  df[[newname]] <- df[[var]] / 100
  return(df)
}
I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:
df <- newfun("alpha")
df <- newfun("beta")
df <- newfun("gamma")
Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:
df <- data.frame(
  alpha = 0:20,
  beta = 30:50,
  gamma = 100:120
)
out <- c("alpha", "beta", "gamma") %>%
map(function(x) newfun(x))
How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?
Writing the function so that it reaches outside its own scope to find some df is risky and will eventually bite you, especially when you see something like:
df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable
You will get this error when it doesn't find your variable named df, and instead finds the base function named df. Two morals from this discovery:
While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially from base); and
Scope-breach is sloppy: it renders a workflow unreproducible and makes problems or changes difficult to troubleshoot.
To remedy this, and since your function relies on knowing what the old/new variable names are or should be, I think pmap or base R Map may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".
myfunc <- function(x) x / 100
cols <- c("alpha", "beta", "gamma")
dat <- df  # a working copy, so we don't rely on (or mask) base::df
setNames(lapply(dat[, cols], myfunc), paste0("new", cols))
# $newalpha
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
# [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
# [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20
From here, we just need to column-bind (cbind) it:
cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
# alpha beta gamma newalpha newbeta newgamma
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# ...
Special note: if you plan on doing this iteratively (repeatedly), it is generally bad to grow a frame by repeatedly binding to it. While I know this holds for adding rows, I suspect (without proof at the moment) that the same is true for columns. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory-copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and scale poorly.)
additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
# $ :List of 3
# ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
do.call(cbind, c(list(dat), additions))
# alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1 0 30 100 0.00 0.30 1.00 0.00 0.30 1.00 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01 0.01 0.31 1.01 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02 0.02 0.32 1.02 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03 0.03 0.33 1.03 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04 0.04 0.34 1.04 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05 0.05 0.35 1.05 0.05 0.35 1.05
# ...
An alternative approach is to change your function to only return a vector:
newfun2 <- function(var = NULL) {
  df[[var]] / 100
}
newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
#[15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20
Then, using base R, you can use lapply() to loop over the column names:
cols <- c("alpha", "beta", "gamma")
df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
#or
#df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df
alpha beta gamma alphaNEW betaNEW gammaNEW
1 0 30 100 0.00 0.30 1.00
2 1 31 101 0.01 0.31 1.01
3 2 32 102 0.02 0.32 1.02
4 3 33 103 0.03 0.33 1.03
5 4 34 104 0.04 0.34 1.04
6 5 35 105 0.05 0.35 1.05
7 6 36 106 0.06 0.36 1.06
8 7 37 107 0.07 0.37 1.07
9 8 38 108 0.08 0.38 1.08
10 9 39 109 0.09 0.39 1.09
11 10 40 110 0.10 0.40 1.10
12 11 41 111 0.11 0.41 1.11
13 12 42 112 0.12 0.42 1.12
14 13 43 113 0.13 0.43 1.13
15 14 44 114 0.14 0.44 1.14
16 15 45 115 0.15 0.45 1.15
17 16 46 116 0.16 0.46 1.16
18 17 47 117 0.17 0.47 1.17
19 18 48 118 0.18 0.48 1.18
20 19 49 119 0.19 0.49 1.19
21 20 50 120 0.20 0.50 1.20
Based on the way you wrote your function, a for loop that assigns the result of newfun back to df repeatedly works pretty well.
vars <- names(df)
for (i in vars) {
  df <- newfun(i)
}
df
# alpha beta gamma alphaNEW betaNEW gammaNEW
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05
# 7 6 36 106 0.06 0.36 1.06
# 8 7 37 107 0.07 0.37 1.07
# 9 8 38 108 0.08 0.38 1.08
# 10 9 39 109 0.09 0.39 1.09
# 11 10 40 110 0.10 0.40 1.10
# 12 11 41 111 0.11 0.41 1.11
# 13 12 42 112 0.12 0.42 1.12
# 14 13 43 113 0.13 0.43 1.13
# 15 14 44 114 0.14 0.44 1.14
# 16 15 45 115 0.15 0.45 1.15
# 17 16 46 116 0.16 0.46 1.16
# 18 17 47 117 0.17 0.47 1.17
# 19 18 48 118 0.18 0.48 1.18
# 20 19 49 119 0.19 0.49 1.19
# 21 20 50 120 0.20 0.50 1.20
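Since each call feeds the previous result into the next, the "build up" can also be expressed with purrr::reduce(), which threads the accumulating data frame through the vector of column names. A sketch, assuming a hypothetical newfun3() that takes the frame as an explicit argument instead of reaching into the global environment:
library(purrr)
# Hypothetical variant of newfun that receives the data frame explicitly.
newfun3 <- function(d, var) {
  d[[paste0(var, "NEW")]] <- d[[var]] / 100
  d
}
df <- reduce(c("alpha", "beta", "gamma"), newfun3, .init = df)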

Remove values based on specific relation to previous value in same column

I have daily stock return data for several companies and need to remove those values which have a specific relation to the previous day's return value.
As a formula it looks something like this: (1 + r) * (1 + e) - 1 <= 50%, where r is the return on the current day and e is the return on the previous day, with at least one of r or e being greater than 100%. For example, in column A the return 1.01 (> 100%) on 02.01.2015 is followed by -0.50, and (1 - 0.50) * (1 + 1.01) - 1 = 0.005 <= 50%, so the -0.50 should be removed.
The data frame DF looks like this:
Date A B C D
01.01.2015 0.15 0.17 0.70 0.65
02.01.2015 1.01 0.75 0.01 -0.18
01.02.2015 -0.50 0.64 1.20 0.1
06.02.2015 0.12 0.54 0.13 1.50
01.03.2016 0.45 0.54 1.89 0.56
After applying this filter, DF should look like this:
Date A B C D
01.01.2015 0.15 0.17 0.70 0.65
02.01.2015 1.01 0.75 0.01 -0.18
01.02.2015 NA 0.64 1.20 0.1
06.02.2015 0.12 0.54 0.13 1.50
01.03.2016 0.45 0.54 1.89 0.56
Thanks for your help!
I would try this:
library(tidyverse)
check_fn <- function(Z) {
  ifelse((lag(Z, n = 1) > 1 | Z > 1) & ((1 + lag(Z, n = 1)) * (1 + Z) <= 1.5), NA, Z)
}
Y <- X %>%
  mutate_at(vars(2:5), check_fn)
Y[1, 2:5] <- X[1, 2:5]
Y
... which generates:
Date A B C D
1 01.01.2015 0.15 0.17 0.70 0.65
2 02.01.2015 1.01 0.75 0.01 -0.18
3 01.02.2015 NA 0.64 1.20 0.10
4 06.02.2015 0.12 0.54 0.13 1.50
5 01.03.2016 0.45 0.54 1.89 0.56
I hope it helps you. There might for sure be a more elegant solution.
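As a side note, mutate_at() is superseded in current dplyr; the same check written with across() would be (a sketch, reusing check_fn from above):
Y <- X %>% mutate(across(2:5, check_fn))
Y[1, 2:5] <- X[1, 2:5]  # the first row has no previous day, so restore it as above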
> M=matrix(c(0.15,0.17,0.70,0.65,1.01,0.75,0.01,-0.18,-0.50,0.64,1.20,0.1,0.12,0.54,0.13,1.50,0.45,0.54,1.89,0.56),nrow = 5, byrow = TRUE)
> M
[,1] [,2] [,3] [,4]
[1,] 0.15 0.17 0.70 0.65
[2,] 1.01 0.75 0.01 -0.18
[3,] -0.50 0.64 1.20 0.10
[4,] 0.12 0.54 0.13 1.50
[5,] 0.45 0.54 1.89 0.56
> ifelse(rbind(c(T,T,T,T), !(((M[2:5,]>1)|(M[1:4,]>1))&(((1+M[2:5,])*(1+M[1:4,])-1)<.5))), M, NA)
[,1] [,2] [,3] [,4]
[1,] 0.15 0.17 0.70 0.65
[2,] 1.01 0.75 0.01 -0.18
[3,] NA 0.64 1.20 0.10
[4,] 0.12 0.54 0.13 1.50
[5,] 0.45 0.54 1.89 0.56
Sorry, I misread some of your post. I have corrected it so it matches your expected output.
library(data.table)
setDT(dat)
relationship <- function(x) {
  ifelse(((1 + x) * (1 + shift(x)) - 1) < .5 & !is.na(shift(x)) & (x > 1 | shift(x) > 1), NA, x)
}
dat <- dat[, lapply(.SD, relationship), .SDcols = c("A", "B", "C", "D")]
> dat[ , lapply(.SD, relationship), .SDcols = c("A", "B", "C", "D")]
A B C D
1: 0.15 0.17 0.70 0.65
2: 1.01 0.75 0.01 -0.18
3: NA 0.64 1.20 0.10
4: 0.12 0.54 0.13 1.50
5: 0.45 0.54 1.89 0.56
You can cbind the dates back onto the data.table.
I should add, in case there are many more columns this needs to be done on and you don't want to write them all out, that you could do something like this:
Dates = dat$Date
dat[ , "Date" := NULL]
dat = dat[ , lapply(.SD, relationship)]
That will apply the function to every column in the data.table.
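To reattach the dates afterwards, one option is a plain cbind (a sketch using the Dates vector saved above):
dat <- cbind(Date = Dates, dat)  # Date becomes the first column again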

R: How to execute a FOR loop for kmeans

I have an input file with the format below:
RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29
I have to execute kmeans in R separately for the DF with KEY=1, the DF with KEY=2, and so on.
Afterwards the final output CSV should look like:
RN KEY MET1 MET2 MET3 MET4 CLST
1 1 0.11 0.41 0.91 0.17 1
2 1 0.94 0.02 0.17 0.84 1
3 1 0.56 0.64 0.46 0.77 2
4 1 0.57 0.23 0.81 0.09 2
5 2 0.82 0.67 0.39 0.63 1
6 2 0.99 0.90 0.34 0.84 2
7 2 0.83 0.01 0.70 0.29 2
I.e. KEY=1 is to be treated as a separate DF, KEY=2 is to be treated as a separate DF, and so on.
Finally, the clustering output of each DF is to be combined with the KEY column first (since KEY cannot participate in the clustering), and then the per-KEY DFs are to be combined for the final output.
In the above example:
DF1 is
KEY MET1 MET2 MET3 MET4
1 0.11 0.41 0.91 0.17
1 0.94 0.02 0.17 0.84
1 0.56 0.64 0.46 0.77
1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
2 0.82 0.67 0.39 0.63
2 0.99 0.90 0.34 0.84
2 0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudocode:
n <- length(unique(mydf$key))
finaldf <- data.frame()
for (i in 1:n) {
  # fetch partial df for each value of key and run kmeans
  dummydf <- subset(mydf, mydf$key == i)
  KmeansIns <- kmeans(dummydf, 2)
  # combine with cluster result
  dummydf <- data.frame(dummydf, KmeansIns$cluster)
  # combine each small df into the final global df
  finaldf <- rbind(finaldf, dummydf)
}
# now we have finaldf, which can be written to a file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  # your clustering code here
})
where x is a subset of your DF for each KEY.
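A filled-in sketch of that skeleton, assuming two centers per KEY as in the pseudocode:
res <- by(data = DF, INDICES = DF$KEY, FUN = function(x) {
  x$CLST <- kmeans(x[, paste0("MET", 1:4)], centers = 2)$cluster
  x
})
finaldf <- do.call(rbind, res)  # stack the per-KEY results
write.csv(finaldf, "final.csv", row.names = FALSE)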
A solution using data.table:
library(data.table)
setDT(DF)[, CLST := kmeans(.SD, centers = 2)$cluster, by = KEY, .SDcols = 3:6]
DF
# RN KEY MET1 MET2 MET3 MET4 CLST
# 1: 1 1 0.11 0.41 0.91 0.17 2
# 2: 2 1 0.94 0.02 0.17 0.84 1
# 3: 3 1 0.56 0.64 0.46 0.70 1
# 4: 4 1 0.57 0.23 0.81 0.09 2
# 5: 5 2 0.82 0.67 0.39 0.63 2
# 6: 6 2 0.99 0.90 0.34 0.84 2
# 7: 7 2 0.83 0.01 0.70 0.29 1
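One caveat: kmeans() starts from random centers, so the cluster labels (and occasionally the grouping itself) can differ between runs. If you need a reproducible result, fix the seed first, e.g.:
set.seed(42)  # any fixed seed makes the random starting centers repeatable
setDT(DF)[, CLST := kmeans(.SD, centers = 2)$cluster, by = KEY, .SDcols = 3:6]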
# Read data
mdf <- read.table("mydat.txt", header = TRUE)
# Convert to list based on KEY column
mls <- split(mdf, f = mdf$KEY)
# Define columns to use in clustering
myv <- paste0("MET", 1:4)
# Cluster each df item in the list; modify kmeans() args as appropriate
kls <- lapply(X = mls, FUN = function(x) {
  x$clust <- kmeans(x[, myv], centers = 2)$cluster
  x
})
# Make final "global" df
finaldf <- do.call(rbind, kls)
