R, error in kmodes() application to a text matrix

I'm trying to apply the kmodes clustering method (from the klaR package) in R to a text matrix of around 1000 rows and 6 columns.
Unfortunately, I get an error that I cannot make sense of:
kmodes(mat, 5, iter.max=10)
Error in cluster[j] <- which.min(dist) : replacement has length zero
Do you have any ideas about why this is happening?
EDIT: This is the head(mat):
1 aaa ccc iii <NA> 0
2 aaa ddd kkk <NA> 0
3 aaa eee -273 <NA> 0
4 aaa fff lll <NA> 0
5 bbb ggg 67 <NA> 0
6 bbb hhh mmm <NA> 0

You can certainly use kmodes(na.omit(mat), 5, iter.max = 10), but that drops every row with missing values, which may not be desirable.
You could instead check summary(mat) for missing data or NA values and then replace them using random imputation, e.g. with impute() from the Hmisc package:
dataframeName$variable <- with(dataframeName, impute(variable, 'random'))
If a column/variable has too many NA values, you can also drop that column.
See this link:
https://discuss.analyticsvidhya.com/t/how-to-handle-missing-values-of-categorical-variables/310
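If you prefer not to depend on Hmisc, random imputation can be sketched in base R as well. This is a minimal sketch, assuming mat is a data frame of character columns; the small mat built here only stands in for your data:

```r
# Minimal sketch: replace NAs in each character column with a random
# observed (non-NA) value drawn from that same column (base R, no packages).
set.seed(1)
mat <- data.frame(a = c("aaa", NA, "bbb"),
                  b = c(NA, "ccc", "ddd"),
                  stringsAsFactors = FALSE)
mat[] <- lapply(mat, function(x) {
  miss <- is.na(x)
  if (any(miss) && any(!miss)) {
    x[miss] <- sample(x[!miss], sum(miss), replace = TRUE)
  }
  x
})
```

After this, kmodes() should no longer hit the which.min(dist) error, since no distances are computed against NA values.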

I had the same error. It seems to be due to missing data. Remove missing data from the data first, e.g. by using kmodes(na.omit(mat), 5, iter.max=10).

Related

How to do conditional replace value in data

I have the data table below and want to replace any usage other than Web or Mobile with another category, say Others. (OR)
Is there any way to group the usage as Web, Mobile, and everything else as Others without replacing the values, e.g. Web: 1, Mobile: 1, Others: 4? (OR)
Do we need to write a function to do so?
Id Name Usage
1 AAA Web
2 BBB Mobile
3 CCC Manual
4 DDD M1
5 EEE M2
6 FFF M3
Assuming that the 'Usage' is character class, we can use %chin% to create a logical index, negate it (!) and assign (:=) values in 'Usage' to 'Others'. This would be more efficient as we are assigning in place without any copying.
library(data.table)
setDT(df1)[!Usage %chin% c("Web", "Mobile"), Usage := "Others"]
df1
# Id Name Usage
#1: 1 AAA Web
#2: 2 BBB Mobile
#3: 3 CCC Others
#4: 4 DDD Others
#5: 5 EEE Others
#6: 6 FFF Others
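If data.table is not an option, the same recode can be sketched in base R with %in% (a minimal sketch; df1 is rebuilt here from the sample data in the question):

```r
# Base-R sketch of the same recode: anything other than Web/Mobile -> "Others"
df1 <- data.frame(Id = 1:6,
                  Name = c("AAA", "BBB", "CCC", "DDD", "EEE", "FFF"),
                  Usage = c("Web", "Mobile", "Manual", "M1", "M2", "M3"),
                  stringsAsFactors = FALSE)
df1$Usage[!df1$Usage %in% c("Web", "Mobile")] <- "Others"
table(df1$Usage)  # Mobile: 1, Others: 4, Web: 1
```

Unlike the data.table version this copies the column on assignment, which only matters for very large data.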

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in 'possible' called "between", and a column in the 'truth' data frame called "match". For every value from 'possible' that falls between a start/stop pair I'd like a 1, otherwise a 0; likewise, for every row in 'truth' that finds a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important, and I'm not looking to match on ID; instead I want to run through the data frames entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data. Next time, please provide this from your own data set in your post using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
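The "within 2 seconds of start or end" variant mentioned in the question can be sketched the same way by widening each interval. This is a minimal sketch using the Start/Stop/Times column names from the question, with a tiny stand-in dataset:

```r
# Sketch: flag each possible$Times that falls within [Start - 2, Stop + 2]
# of ANY truth interval (i.e. within 2 s of the detection window).
truth    <- data.frame(Start = c(213466, 32238), Stop = c(213468, 32240))
possible <- data.frame(Times = c(32239.76, 32241.14, 111233.93))
tol <- 2
possible$nearAny <- as.integer(sapply(possible$Times, function(t)
  any(t >= truth$Start - tol & t <= truth$Stop + tol)))
possible$nearAny  # 1 1 0
```

This is O(rows of possible x rows of truth); for large data the rolling-join approach above will scale better.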

R reading values of numeric field in file wrongly

R appears to read the values from a file wrongly. You can check this with the following example:
(1) Copy paste the following 10 numbers into a test file (sample.csv)
1000522010609612
1000522010609613
1000522010609614
1000522010609615
1000522010609616
1000522010609617
971000522010609612
1501000522010819466
971000522010943717
1501000522010733490
(2) Read these contents into R using read.csv
X <- read.csv("./sample.csv", header=FALSE)
(3) Print the output
print(head(X, n=10), digits=22)
The output I got was
V1
1 1000522010609612.000000
2 1000522010609613.000000
3 1000522010609614.000000
4 1000522010609615.000000
5 1000522010609616.000000
6 1000522010609617.000000
7 971000522010609664.000000
8 1501000522010819584.000000
9 971000522010943744.000000
10 1501000522010733568.000000
The problem is that rows 7,8,9,10 are not correct (check the sample 10 numbers that we considered before).
What could be the problem? Is there some setting that I am missing in my R terminal?
The numbers in rows 7-10 exceed 2^53, the largest integer R's double-precision numeric type can represent exactly, so they get rounded on read. You could try the bit64 package, which provides a 64-bit integer class:
library(bit64)
x <- read.csv('sample.csv', header=FALSE, colClasses='integer64')
x
# V1
#1 1000522010609612
#2 1000522010609613
#3 1000522010609614
#4 1000522010609615
#5 1000522010609616
#6 1000522010609617
#7 971000522010609612
#8 1501000522010819466
#9 971000522010943717
#10 1501000522010733490
With bit64 loaded, you can also use fread from data.table, which reads large integers as integer64 by default:
library(data.table)
x1 <- fread('sample.csv')
x1
# V1
#1: 1000522010609612
#2: 1000522010609613
#3: 1000522010609614
#4: 1000522010609615
#5: 1000522010609616
#6: 1000522010609617
#7: 971000522010609612
#8: 1501000522010819466
#9: 971000522010943717
#10: 1501000522010733490
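If the values are identifiers rather than quantities you compute with, a simpler workaround is to read the column as text, which preserves every digit without any extra packages. A minimal sketch (the two-line sample.csv here stands in for your file):

```r
# Sketch: read long numeric IDs as character so no precision is lost
writeLines(c("1000522010609612", "971000522010609612"), "sample.csv")
x <- read.csv("sample.csv", header = FALSE, colClasses = "character")
x$V1  # "1000522010609612" "971000522010609612" -- stored exactly as written
```

This trades arithmetic for exactness, which is usually the right call for ID columns.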

Replace values in one data frame from values in another data frame

I need to change individual identifiers that are currently alphabetical to numerical ones. I have created a data frame where each alphabetical identifier is associated with a number:
individuals num.individuals (g4)
1 ZYO 64
2 KAO 24
3 MKU 32
4 SAG 42
What I need is to replace ZYO with the number 64 in my main data frame (g3), and likewise for all the other codes.
My main data frame (g3) looks like this
SAG YOG GOG BES ATR ALI COC CEL DUN EVA END GAR HAR HUX ISH INO JUL
1 2
2 2 EVA
3 SAG 2 EVA
4 2
5 SAG 2
6 2
Now, on a small scale, I can write code to change it, like I did with ATR:
g3$ATR <- as.character(g3$ATR)
g3[g3$target == "ATR" | g3$ATR == "ATR","ATR"] <- 2
But this is time consuming and increases the chance of human error.
I know there are ways to do this on a broad scale with NAs
I think maybe we could do a for loop for this, but I am not good enough to write one myself.
I have also been trying to use this function, which I feel may work, but I am not sure how to build the argument logically; it was posted on the questions board here:
Fast replacing values in dataframe in R
df <- as.data.frame(lapply(df, function(x){replace(x, x < 0, 0)}))
I have tried to work my data into this with
df <- as.data.frame(lapply(g4, function(g3){replace(x, x < 0, 0)}))
Here is one approach using the data.table package:
First, create a reproducible example similar to your data:
require(data.table)
ref <- data.table(individuals=1:4,num.individuals=c("ZYO","KAO","MKU","SAG"),g4=c(64,24,32,42))
g3 <- data.table(SAG=c("","SAG","","SAG"),KAO=c("KAO","KAO","",""))
Here is the ref table:
individuals num.individuals g4
1: 1 ZYO 64
2: 2 KAO 24
3: 3 MKU 32
4: 4 SAG 42
And here is your g3 table:
SAG KAO
1: KAO
2: SAG KAO
3:
4: SAG
And now we do our find and replacing:
g3[ , lapply(.SD,function(x) ref$g4[chmatch(x,ref$num.individuals)])]
And the final result:
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
And if you need more speed, the fastmatch package might help with their fmatch function:
require(fastmatch)
g3[ , lapply(.SD,function(x) ref$g4[fmatch(x,ref$num.individuals)])]
SAG KAO
1: NA 24
2: 42 24
3: NA NA
4: 42 NA
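The same lookup can also be sketched in plain base R with match(), without data.table or fastmatch. A minimal sketch using the reproducible ref/g3 data from above, rebuilt as ordinary data frames:

```r
# Base-R sketch: look each code up in ref and substitute its number
ref <- data.frame(num.individuals = c("ZYO", "KAO", "MKU", "SAG"),
                  g4 = c(64, 24, 32, 42), stringsAsFactors = FALSE)
g3 <- data.frame(SAG = c("", "SAG", "", "SAG"),
                 KAO = c("KAO", "KAO", "", ""), stringsAsFactors = FALSE)
# match() returns NA for codes not in ref (including the empty cells),
# so unmatched entries become NA, as in the data.table output above
g3[] <- lapply(g3, function(x) ref$g4[match(x, ref$num.individuals)])
g3
#   SAG KAO
# 1  NA  24
# 2  42  24
# 3  NA  NA
# 4  42  NA
```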

Removing time series with only zero values from a data frame

I have a data frame with multiple time series identified by uniquer id's. I would like to remove any time series that have only 0 values.
The data frame looks as follows,
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
B 2010/01/01 0
B 2010/01/02 0
B 2010/01/03 0
B 2010/01/04 0
B 2010/01/05 0
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
I want any time series with only 0 values to be removed so that the data frame look as follows,
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
This is a follow up to a previous question that was answered with a really great solution using the data.tables package.
R efficiently removing missing values from the start and end of multiple time series in 1 data frame
If dat is a data.table, then this is easy to write and read :
dat[,.SD[any(value!=0)],by=id]
.SD stands for Subset of Data.
Picking up on Gabor's nice use of ave, but without repeating the same variable name (DF) three times, which can be a source of typo bugs if you have a lot of long or similar variable names, try:
dat[ ave(value!=0,id,FUN=any) ]
The difference in speed between those two may be dependent on several factors including: i) number of groups ii) size of each group and iii) the number of columns in the real dat.
Try this. No packages are used.
DF[ ave(DF$value != 0, DF$id, FUN = any), ]
An easy plyr solution (after library(plyr)) would be
ddply(mydat,"id",function(x) if (all(x$value==0)) NULL else x)
(seems to work OK) but there may be a faster solution with data.table ...
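For completeness, the same group-wise filter can be sketched with dplyr (assuming the id/value column names from the question; the small dat here stands in for the full data frame):

```r
# dplyr sketch: keep only groups where at least one value is non-zero
library(dplyr)
dat <- data.frame(id = c("AAA", "AAA", "B", "B", "CCC", "CCC"),
                  value = c(9, 10, 0, 0, 45, 0))
dat %>%
  group_by(id) %>%
  filter(any(value != 0)) %>%  # drop groups that are all zero (here: "B")
  ungroup()
```

Like the ave and data.table versions, this keeps groups such as CCC that contain some zeros, and drops only groups that are zero throughout.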
