R within group sum of squares kmeans - r

I have the following code, which is giving me an error:
# Read input dataset from CSV file
input_dataset <-
read.csv("C:\\Users\\sw029693\\Desktop\\Overtime_work_hrs_analytics\\input_dataset.csv", header = TRUE)
wss <- (nrow(input_dataset)-1)*sum(apply(input_dataset,2,var))
which gives the following warnings:
Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
2: In FUN(newX[, i], ...) : NAs introduced by coercion
3: In FUN(newX[, i], ...) : NAs introduced by coercion
4: In FUN(newX[, i], ...) : NAs introduced by coercion
5: In FUN(newX[, i], ...) : NAs introduced by coercion
> wss
[1] NA
> colnames(input_dataset)
[1] "client" "domain" "user_name" "cdf_display" "position" "shift_start"
[7] "shift_end" "shift_length_avg" "patients_seen_cnt"
It looks like wss is NA, and I am not sure why. Any ideas?

K-means only supports numerical data.
Your columns (user_name etc.) are probably not numerical, so var() has to coerce them, which yields NA and makes the whole sum NA.
Bring your data into an appropriate numeric format first.
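As a minimal sketch of what "appropriate format" could mean here: keep only the numeric columns before computing the sum of squares. The data frame below is a made-up stand-in (only the column names come from the question), and dropping non-numeric columns is an assumption; depending on the analysis you may prefer to encode them instead.

```r
# Hypothetical stand-in for input_dataset; only the column names come from the question.
input_dataset <- data.frame(
  user_name         = c("u1", "u2", "u3", "u4"),
  shift_length_avg  = c(8, 10, 12, 9),
  patients_seen_cnt = c(20, 25, 31, 18),
  stringsAsFactors  = FALSE
)

# Keep only numeric columns and drop rows with missing values.
is_num <- vapply(input_dataset, is.numeric, logical(1))
numeric_data <- na.omit(input_dataset[, is_num, drop = FALSE])

# Total sum of squares, i.e. the within-group sum of squares for one cluster.
wss <- (nrow(numeric_data) - 1) * sum(vapply(numeric_data, var, numeric(1)))
print(wss)
```

With only numeric columns left, var() no longer needs to coerce anything, so wss is a finite number rather than NA.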

Related

How do I convert date to plot it in R?

I am trying to plot simple weather data in R. I don't know how to convert the following datetime values. Please help.
> datetime
[1] "2022-01-18" "2022-01-19" "2022-01-20" "2022-01-21" "2022-01-22"
[6] "2022-01-23" "2022-01-24" "2022-01-25" "2022-01-26" "2022-01-27"
[11] "2022-01-28" "2022-01-29" "2022-01-30" "2022-01-31" "2022-02-01"
> plot(datetime, temp)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
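The warnings suggest that datetime is a character vector, which plot() tries to coerce to numbers. A minimal sketch of one way around this, assuming the strings are all in year-month-day form as shown; the temp values here are invented for illustration:

```r
# First few dates from the question; temp is hypothetical example data.
datetime <- c("2022-01-18", "2022-01-19", "2022-01-20", "2022-01-21")
temp <- c(3.1, 4.5, 2.8, 5.0)

# Convert the character strings to Date objects so plot() gets a numeric-like x axis.
dates <- as.Date(datetime, format = "%Y-%m-%d")

plot(dates, temp, type = "b", xlab = "Date", ylab = "Temperature")
```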

Only receive unique warning messages

Warning messages are useful information that I want to see, but I only want to see each one once!
This function throws 2 different warnings and repeats each of them 20 times.
How can I tell R to print only unique warnings? I'm looking for a general solution.
Warning messages:
1: NAs introduced by coercion
2: In sqrt(-1) : NaNs produced
Here is my example:
foobar <- function(n=20) {
for (i in 1:n) {
as.numeric("b")
sqrt(-1)
}
}
foobar()
To return only unique warning strings, use
unique(warnings())
Now, a problem you may have is that your function produces more than 50 warnings, in which case warnings() will not catch them all. To work around this, you can increase nwarnings in options to e.g. 10000, as suggested in the help page of warnings.
options(nwarnings = 10000)
Example:
foobar <- function(n=20) {
warning("First warning")
for (i in 1:n) {
as.numeric("b")
sqrt(-1)
}
warning("Last warning")
}
foobar(60)
unique(warnings())
## Warning messages:
## 1: In foobar(60) : First warning
## 2: NAs introduced by coercion
## 3: In sqrt(-1) : NaNs produced
op <- options(nwarnings = 10000)
foobar(60)
unique(warnings())
## Warning messages:
## 1: In foobar(60) : First warning
## 2: NAs introduced by coercion
## 3: In sqrt(-1) : NaNs produced
## 4: In foobar(60) : Last warning
options(op)
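A different technique, not the one above, is to catch warnings as they are raised and keep only the distinct messages, which avoids the nwarnings limit altogether. This is only a sketch; collect_unique_warnings is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: evaluates expr, muffles its warnings, and records each distinct message once.
collect_unique_warnings <- function(expr) {
  seen <- character(0)
  value <- withCallingHandlers(
    expr,
    warning = function(w) {
      seen <<- union(seen, conditionMessage(w))
      invokeRestart("muffleWarning")  # stop the warning from being queued or printed
    }
  )
  list(value = value, warnings = seen)
}

foobar <- function(n = 20) {
  for (i in 1:n) {
    as.numeric("b")
    sqrt(-1)
  }
  invisible(NULL)
}

res <- collect_unique_warnings(foobar(60))
print(res$warnings)
```

Note that this compares message text only, so the same message raised from two different calls is reported once.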

Error in importing signals with createAffyIntensityFile (GWASTools)

I am using the GWASTools package and I get an error when importing my signal file. I tried to mimic my real data set in the following example:
library(GWASTools)
snp.anno <- 'snpID chromosome position snpName
AX-100676796 1 501997 AX-100676796
AX-100120875 1 503822 AX-100120875
AX-100067350 1 504790 AX-100067350'
snp.anno <- read.table(text=snp.anno, header=T)
signals <- 'probeset_id sample1.CEL sample1.CEL sample1.CEL
AX-100676796-A 2126.7557 1184.8638 1134.2687
AX-100676796-B 427.1864 2013.8512 1495.0654
AX-100120875-A 1775.5816 2013.8512 651.1691
AX-100120875-B 335.9226 2013.8512 1094.7429
AX-100067350-A 2365.7755 2695.0053 2758.1739
AX-100067350-B 2515.4818 2518.2818 28181.289 '
p1summ <- read.table(text=signals, header=T)
write.table(p1summ, "del.txt", sep="\t", col.names=T, row.names=F, quote=F)
p1summ <- createAffyIntensityFile("del.txt", snp.annotation=snp.anno)
Error: all(snp.annotation$snpID == sort(snp.annotation$snpID)) is not TRUE
In addition: Warning messages:
1: In .checkSnpAnnotation(snp.annotation) : coerced snpID to type integer
2: In .checkSnpAnnotation(snp.annotation) :
coerced chromosome to type integer
I also tried probe names with the 'A' and 'B' suffixes; the error was the same:
snp.annoab <- 'snpID chromosome position snpName
AX-100676796-A 1 501997 AX-100676796-A
AX-100676796-B 1 501997 AX-100676796-B
AX-100120875-A 1 503822 AX-100120875-A
AX-100120875-B 1 503822 AX-100120875-B
AX-100067350-A 1 504790 AX-100067350-A
AX-100067350-B 1 504790 AX-100067350-B'
snp.annoab <- read.table(text=snp.annoab, header=T)
p1summ <- createAffyIntensityFile("del.txt", snp.annotation=snp.annoab)
Error: all(snp.annotation$snpID == sort(snp.annotation$snpID)) is not TRUE
In addition: Warning messages:
1: In .checkSnpAnnotation(snp.annotation) : coerced snpID to type integer
2: In .checkSnpAnnotation(snp.annotation) :
coerced chromosome to type integer
In my real dataset the error is slightly different, but it does not work either:
Error: length(snp.annotation$snpID) == length(unique(snp.annotation$snpID)) is not TRUE
In addition: Warning messages:
1: In .checkSnpAnnotation(snp.annotation) : NAs introduced by coercion
2: In .checkSnpAnnotation(snp.annotation) : coerced snpID to type integer
3: In .checkSnpAnnotation(snp.annotation) : NAs introduced by coercion
4: In .checkSnpAnnotation(snp.annotation) :
coerced chromosome to type integer
And the strange thing is that:
> length(snp.annotation$snpID) == length(unique(snp.annotation$snpID))
[1] TRUE
So the error seems to disagree with the result of running the same check by hand (testing whether the lengths are equal). Am I missing some important detail in the format of my inputs? I would be grateful for any help. Thank you!
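One possible reading of the messages above: the warnings say snpID is coerced to type integer before the checks run. Character IDs such as "AX-100676796" become NA under that coercion, so a uniqueness check that passes on the original strings can fail on the coerced values. The sketch below illustrates this in base R only and then builds an annotation with integer, sorted, unique snpID values while keeping the probe name in snpName. Whether this is everything createAffyIntensityFile needs is an assumption that has not been verified here.

```r
ids <- c("AX-100676796", "AX-100120875", "AX-100067350")

# The strings themselves are unique ...
print(length(ids) == length(unique(ids)))

# ... but after integer coercion they are all NA, so the same check fails.
coerced <- suppressWarnings(as.integer(ids))
print(length(coerced) == length(unique(coerced)))

# Sketch of an annotation with integer snpID (assumed requirement, inferred from the messages).
snp.anno <- data.frame(
  snpName    = ids,
  chromosome = c(1L, 1L, 1L),
  position   = c(501997L, 503822L, 504790L),
  stringsAsFactors = FALSE
)
snp.anno <- snp.anno[order(snp.anno$chromosome, snp.anno$position), ]
snp.anno$snpID <- seq_len(nrow(snp.anno))  # integer, sorted, unique
```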

Error while creating a Timeseries plot in R: Error in plot.window(xlim, ylim, log, ...) : need finite 'ylim' values

Here's a sample of my single column data set:
Lines
141,523
146,785
143,667
65,560
88,524
148,422
I read this file as a .csv file, convert it into a ts object and then plot it:
##Read the actual number of lines CSV file
Aclines <- read.csv(file.choose(), header=T, stringsAsFactors = F)
Aclinests <- ts(Aclines[,1], start = c(2013), end = c(2015), frequency = 52)
plot(Aclinests, ylab = "Actual_Lines", xlab = "Time", col = "red")
I get the following error message:
Error in plot.window(xlim, ylim, log, ...) : need finite 'ylim' values
In addition: Warning messages:
1: In xy.coords(x, NULL, log = log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf
I thought this might be because of the "," in the columns and tried to use sapply to take care of that as advised here:
need finite 'ylim' values-error
plot(sapply(Aclinests, function(x)gsub(",",".",x)))
But I got the following error:
Error in plot(sapply(Aclinests, function(x) gsub(",", ".", x))) :
error in evaluating the argument 'x' in selecting a method for function 'plot': Error in sapply(Aclinests, function(x) gsub(",", ".", x)) :
'names' attribute [105] must be the same length as the vector [1]
Here is the head of my original and ts data set if it might help:
> head(Aclines)
Lines
1 141,523
2 146,785
3 143,667
4 65,560
5 88,524
6 148,422
> head(Aclinests)
[1] "141,523" "146,785" "143,667" "65,560" "88,524" "148,422"
Also, if I read the .csv file as:
Aclines <- read.csv(file.choose(), header=T, stringsAsFactors = T)
Then I am able to plot the ts object, but head(Aclinests) gives the output below, which is not consistent with my original data:
> head(Aclinests)
[1] 14 27 17 84 88 36
Please advise on how I can plot this ts object.
The simplest way to avoid this, in my case, is to remove the commas in the Excel file containing the data. This can be done using simple Excel commands and it worked for me.
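The same cleanup can also be done in R rather than in the spreadsheet. This is a sketch that assumes the commas are thousands separators (so "141,523" means 141523); if they are decimal commas, replace them with "." instead. The data frame is rebuilt from the sample shown in the question.

```r
# Sample values from the question, read as character strings.
Aclines <- data.frame(
  Lines = c("141,523", "146,785", "143,667", "65,560", "88,524", "148,422"),
  stringsAsFactors = FALSE
)

# Strip the commas, then convert to numeric before building the ts object.
lines_num <- as.numeric(gsub(",", "", Aclines$Lines, fixed = TRUE))

Aclinests <- ts(lines_num, start = c(2013, 1), frequency = 52)
plot(Aclinests, ylab = "Actual_Lines", xlab = "Time", col = "red")
```

Reading with stringsAsFactors = T only appears to work because the plot then shows the integer factor codes (14, 27, 17, ...) instead of the real values.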

quantmod::chart_Series() bug?

I would like to chart SPX using quantmod::chart_Series() and, below it, draw changes in GDP and a 12-month SMA of the changes in GDP. No matter how I try to do it (whatever combinations I use), either errors occur or quantmod::chart_Series() displays just a partial plot.
require(quantmod)
FRED.symbols <- c("GDPC96")
getSymbols(FRED.symbols, src="FRED")
SPX <- getSymbols("^GSPC", auto.assign=FALSE, from="1900-01-01")
subset="2000/"
chart_Series(SPX, subset=subset)
add_TA(GDPC96)
add_TA(ROC(GDPC96, type="discrete"))
add_TA(SMA(ROC(GDPC96, type="discrete"), n=4), on=3, col="blue")
EDIT: Actually, it seems to me that this is a quantmod::chart_Series() problem when using quarterly data:
subset <- "2000/"
chart_Series(to.quarterly(SPX, drop.time=TRUE), subset=subset)
add_TA(SMA(Cl(to.quarterly(SPX, drop.time=TRUE))))
> subset <- "2000/"
> chart_Series(to.quarterly(SPX, drop.time=TRUE), subset=subset)
> add_TA(SMA(Cl(to.quarterly(SPX, drop.time=TRUE))))
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
This does produce the SPX plot on the main panel, but leaves the second and third panels empty.
Then I tried to play around with giving the data the same index, the same lengths, etc.
chart_Series(head(to.quarterly(SPX, drop.time="TRUE"), -1), subset=subset)
add_TA(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE))
add_TA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"))
add_TA(SMA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"), n=4), on=3, col="blue")
And the result is errors all over:
> chart_Series(head(to.quarterly(SPX, drop.time="TRUE"), -1), subset=subset)
> add_TA(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE))
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
> add_TA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"))
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
> add_TA(SMA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"), n=4), on=3, col="blue")
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
Using
tail(to.quarterly(SPX, drop.time="TRUE"))
tail(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE))
tail(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"))
tail(SMA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"), n=4))
dput(to.quarterly(SPX, drop.time="TRUE"))
dput(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE))
dput(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"))
dput(SMA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"), n=4))
all looks good to me.
My sessionInfo():
> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quantmod_0.3-18 TTR_0.21-0 xts_0.8-7 zoo_1.7-7
[5] Defaults_1.1-1 rj_1.1.0-4
loaded via a namespace (and not attached):
[1] grid_2.15.0 lattice_0.20-0 tools_2.15.0
Any ideas what might be the solution for these issues?
EDIT: This seems to be a quantmod::chart_Series() bug. If I do this:
subset <- "1990/"
test <- cbind(head(to.quarterly(SPX, drop.time="TRUE"), -1)[subset],
to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE)[subset],
ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete")[subset],
SMA(ROC(to.quarterly(GDPC96, drop.time="TRUE", OHLC=FALSE), type="discrete"), n=4)[subset])
test$test <- 1
subset <- "2000/"
chart_Series(OHLC(test), subset=subset)
add_TA(test$test)
add_TA(test$GDPC96)
> test$test <- 1
> subset <- "2000/"
> chart_Series(OHLC(test), subset=subset)
> add_TA(test$test)
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
> add_TA(test$GDPC96)
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
In addition: Warning messages:
1: In as_numeric(H) : NAs introduced by coercion
2: In as_numeric(H) : NAs introduced by coercion
3: In as_numeric(H) : NAs introduced by coercion
> traceback()
14: stop("'x' and 'y' lengths differ") at chart_Series.R#510
13: xy.coords(x, y) at chart_Series.R#510
12: plot.xy(xy.coords(x, y), type = type, ...) at chart_Series.R#510
11: lines.default(ta.x, as.numeric(ta.y[, i]), col = col, ...) at chart_Series.R#510
10: lines(ta.x, as.numeric(ta.y[, i]), col = col, ...) at chart_Series.R#510
9: plot_ta(x = current.chob(), ta = get("x"), on = NA, taType = NULL,
col = 1) at replot.R#238
8: eval(expr, envir, enclos) at replot.R#238
7: eval(aob, env) at replot.R#238
6: FUN(X[[12L]], ...) at replot.R#230
5: lapply(x$Env$actions, function(aob) {
if (attr(aob, "frame") > 0) {
x$set_frame(attr(aob, "frame"), attr(aob, "clip"))
env <- attr(aob, "env")
if (is.list(env)) {
env <- unlist(lapply(env, function(x) eapply(x, eval)),
recursive = FALSE)
}
eval(aob, env)
}
}) at replot.R#230
4: plot.replot(x, ...)
3: plot(x, ...)
2: print.replot(<environment>)
1: print(<environment>)
Any ideas on how to get this fixed?
I had a similar error several days ago. I found that the problem was in add_TA with the line:
ta.x <- as.numeric(na.approx(ta.adj[, 1]))
na.approx uses approx with rule = 1 by default, which leaves trailing NAs in the list if the last timestamp in the original data is before the last timestamp in the TA data. Changing that line to set rule = 2 fixed the problem.
ta.x <- as.numeric(na.approx(ta.adj[, 1], rule=2))
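To see what the rule argument changes, here is a small illustration using base R's approx() directly rather than na.approx; the numbers are made up and only stand in for a series that ends before the requested x values do.

```r
# Known points end at x = 4, but values are requested up to x = 6.
x <- 1:4
y <- c(10, 20, 30, 40)
xout <- 1:6

# rule = 1: points outside the range of x come back as NA.
r1 <- approx(x, y, xout = xout, rule = 1)$y

# rule = 2: points outside the range take the value at the nearest end.
r2 <- approx(x, y, xout = xout, rule = 2)$y

print(r1)
print(r2)
```

With rule = 1 the last two values are NA, which is the kind of trailing gap that leads to the "'x' and 'y' lengths differ" error described above.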
I just wrote a long "answer" confirming your problems, even after some data massaging, and even using the older chartSeries function. Then I realized that add_TA() is perhaps the wrong function. This approach works:
par(mfrow=c(2,1))
chart_Series(SPX)
chart_Series(GDPC96)
(See R/quantmod: multiple charts all using the same y-axis for an alternative approach using the layout command.)
Or with the subset:
par(mfrow=c(2,1))
chart_Series(SPX,subset="2000/")
chart_Series(GDPC96,subset="2000/")
(NB: the two datasets end at different places, so they don't quite line up.)
Incidentally, there is one definite bug in chart_Series with quarterly data: the x-axis labels look like "%n%b%n2010".
q.SPX=to.quarterly(SPX)
chart_Series(q.SPX)
