Problem with Ruining c-mean clustering on my data in R program - r

For this data how to fix this problem
> x=data.frame(c(v1="a" ,"b" ,"c" ,"d" ,"e"),
+ v2=c(97 ,90 ,93 ,97 ,90),
+ v3=c( 85 ,91 ,87 ,91 ,93))
> library(e1071)
> f <- cmeans(x, 2)
Error in cmeans(x, 2) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In cmeans(x, 2) : NAs introduced by coercion
2: In cmeans(x, 2) : NAs introduced by coercion
> f
I want to apply c-maen to my data as is illustrated code in above, it contains three vectors: v1,v2 ,v2 I want to apply c-mean label by vector (v1)

If we look at the documentation of ?cmeans,
x - The data matrix where columns correspond to variables and rows to observations.
So, we can convert the data.frame to matrix after removing the character column (1st column)
x1 <- as.matrix(x[-1])
row.names(x1) <- x[,1]
cmeans(x1, 2)
#Fuzzy c-means clustering with 2 clusters
#Cluster centers:
# v2 v3
#1 90.30090 91.85191
#2 95.75436 87.22535
#Memberships:
# 1 2
#a 0.06614213 0.93385787
#b 0.98305641 0.01694359
#c 0.19855988 0.80144012
#d 0.25730888 0.74269112
#e 0.97924422 0.02075578
#Closest hard clustering:
#a b c d e
#2 1 2 2 1
#Available components:
#[1] "centers" "size" "cluster" "membership" "iter" "withinerror" "call"

The k-mean family of partitional clustering algorithm works on the principle of mean which by its nature will accept only numeric values. You are getting an error because, the dataframe consist of both numeric and categorical values, which c-mean() does not like. Also, there is no need to convert the dataframe to matrix because that is not the actual problem.
Therefore,
Alternative approach
Discretize the character variable to assign it numbers and then apply clustering. This way there is no need to drop any variable.
# create empty data frame
df<- setNames(data.frame(matrix(ncol = 5, nrow = 5)), c("a" ,"b" ,"c" ,"d" ,"e"))
# fill values
df$a<- c("aaaa" ,"bbbb" ,"cccc" ,"dddd" ,"eeee")
df$b<- c(97 ,90 ,93 ,97 ,90)
df$c<- c(97 ,90 ,93 ,97 ,90)
df$d<- c( 85 ,91 ,87 ,91 ,93)
df$e<- c( 85 ,91 ,87 ,91 ,93)
# show the dataframe
df
a b c d e
1 aaaa 97 97 85 85
2 bbbb 90 90 91 91
3 cccc 93 93 87 87
4 dddd 97 97 91 91
5 eeee 90 90 93 93
# Discretize the character variable
df$a <- as.numeric( factor(df$a) ) -1
df
a b c d e
1 0 97 97 85 85
2 1 90 90 91 91
3 2 93 93 87 87
4 3 97 97 91 91
5 4 90 90 93 93
# Apply clustering
library(e1071)
cmeans(df, 2)
Fuzzy c-means clustering with 2 clusters
Cluster centers:
a b c d e
1 1.406 95.72 95.72 87.18 87.18
2 2.510 90.36 90.36 91.85 91.85
Memberships:
1 2
[1,] 0.92728 0.07272
[2,] 0.04014 0.95986
[3,] 0.80061 0.19939
[4,] 0.72009 0.27991
[5,] 0.03544 0.96456
Closest hard clustering:
[1] 1 2 1 1 2
Available components:
[1] "centers" "size" "cluster" "membership" "iter"
[6] "withinerror" "call"

Related

Using a function and mapply in R to create new columns that sums other columns

Suppose, I have a dataframe, df, and I want to create a new column called "c" based on the addition of two existing columns, "a" and "b". I would simply run the following code:
df$c <- df$a + df$b
But I also want to do this for many other columns. So why won't my code below work?
# Reproducible data:
martial_arts <- data.frame(gym_branch=c("downtown_a", "downtown_b", "uptown", "island"),
day_boxing=c(5,30,25,10),day_muaythai=c(34,18,20,30),
day_bjj=c(0,0,0,0),day_judo=c(10,0,5,0),
evening_boxing=c(50,45,32,40), evening_muaythai=c(50,50,45,50),
evening_bjj=c(60,60,55,40), evening_judo=c(25,15,30,0))
# Creating a list of the new column names of the columns that need to be added to the martial_arts dataframe:
pattern<-c("_boxing","_muaythai","_bjj","_judo")
d<- expand.grid(paste0("martial_arts$total",pattern))
# Creating lists of the columns that will be added to each other:
e<- names(martial_arts %>% select(day_boxing:day_judo))
f<- names(martial_arts %>% select(evening_boxing:evening_judo))
# Writing a function and using mapply:
kick_him <- function(d,e,f){d <- rowSums(martial_arts[ , c(e, f)], na.rm=T)}
mapply(kick_him,d,e,f)
Now, mapply produces the correct results in terms of the addition:
> mapply(ff,d,e,f)
Var1 <NA> <NA> <NA>
[1,] 55 84 60 35
[2,] 75 68 60 15
[3,] 57 65 55 35
[4,] 50 80 40 0
But it doesn't add the new columns to the martial_arts dataframe. The function in theory should do the following
martial_arts$total_boxing <- martial_arts$day_boxing + martial_arts$evening_boxing
...
...
martial_arts$total_judo <- martial_arts$day_judo + martial_arts$evening_judo
and add four new total columns to martial_arts.
So what am I doing wrong?
The assignment is wrong here i.e. instead of having martial_arts$total_boxing as a string, it should be "total_boxing" alone and this should be on the lhs of the Map/mapply. As the OP already created the 'martial_arts$' in 'd' dataset as a column, we are removing the prefix part and do the assignment
kick_him <- function(e,f){rowSums(martial_arts[ , c(e, f)], na.rm=TRUE)}
martial_arts[sub(".*\\$", "", d$Var1)] <- Map(kick_him, e, f)
-check the dataset now
> martial_arts
gym_branch day_boxing day_muaythai day_bjj day_judo evening_boxing evening_muaythai evening_bjj evening_judo total_boxing total_muaythai total_bjj total_judo
1 downtown_a 5 34 0 10 50 50 60 25 55 84 60 35
2 downtown_b 30 18 0 0 45 50 60 15 75 68 60 15
3 uptown 25 20 0 5 32 45 55 30 57 65 55 35
4 island 10 30 0 0 40 50 40 0 50 80 40 0

Calculating mode with modeest package in R

I am using the below code for calculating the mode of a dataframe:
library(modeest)
apply(df[ ,2:length(df)], 1, mfv)
My data looks like this:
Item A B C
Book001 56 32 56
Book002 95 95 20
Book003 50 89 50
Book004 6 65 40
It gives me the following output:
[[1]]
[1] 56
[[2]]
[1] 95
[[3]]
[1] 50
[[4]]
[1] 6 40 65
This code is perfect only if the data contains a recurring term.
How can I display the mode as NA when there is no recurring term?
Let's try with a custom function:
foo <- function(x){
out <- mfv(x)
if(length(out) > 1) out <- NA
return(out)
}
apply(df[ ,2:length(df)], 1, foo)
# [1] 56 95 50 NA

Filter rows based on values of multiple columns in R

Here is the data set, say name is DS.
Abc Def Ghi
1 41 190 67
2 36 118 72
3 12 149 74
4 18 313 62
5 NA NA 56
6 28 NA 66
7 23 299 65
8 19 99 59
9 8 19 61
10 NA 194 69
How to get a new dataset DSS where value of column Abc is greater than 25, and value of column Def is greater than 100.It should also ignore any row if value of atleast one column in NA.
I have tried few options but wasn't successful. Your help is appreciated.
There are multiple ways of doing it. I have given 5 methods, and the first 4 methods are faster than the subset function.
R Code:
# Method 1:
DS_Filtered <- na.omit(DS[(DS$Abc > 20 & DS$Def > 100), ])
# Method 2: which function also ignores NA
DS_Filtered <- DS[ which( DS$Abc > 20 & DS$Def > 100) , ]
# Method 3:
DS_Filtered <- na.omit(DS[(DS$Abc > 20) & (DS$Def >100), ])
# Method 4: using dplyr package
DS_Filtered <- filter(DS, DS$Abc > 20, DS$Def >100)
DS_Filtered <- DS %>% filter(DS$Abc > 20 & DS$Def >100)
# Method 5: Subset function by default ignores NA
DS_Filtered <- subset(DS, DS$Abc >20 & DS$Def > 100)

Sorting data.frame in r [duplicate]

I am new to R, and want to sort a data frame called "weights". Here are the details:
>str(weights)
'data.frame': 57 obs. of 1 variable:
$ attr_importance: num 0.04963 0.09069 0.09819 0.00712 0.12543 ...
> names(weights)
[1] "attr_importance"
> dim(weights)
[1] 57 1
> head(weights)
attr_importance
make 0.049630556
address 0.090686474
all 0.098185517
num3d 0.007122618
our 0.125433292
over 0.075182467
I want to sort by decreasing order of attr_importance BUT I want to preserve the corresponding row names also.
I tried:
> weights[order(-weights$attr_importance),]
but it gives me a "numeric" back.
I want a data frame back - which is sorted by attr_importance and has CORRESPONDING row names intact. How can I do this?
Thanks in advance.
Since your data.frame only has one column, you need to set drop=FALSE to prevent the dimensions from being dropped:
weights[order(-weights$attr_importance),,drop=FALSE]
# attr_importance
# our 0.125433292
# all 0.098185517
# address 0.090686474
# over 0.075182467
# make 0.049630556
# num3d 0.007122618
Here is the big comparison on data.frame sorting:
How to sort a dataframe by column(s)?
Using my now-preferred solution arrange:
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2))
library(plyr)
arrange(dd,desc(z),b)
b x y z
1 Low C 9 2
2 Med D 3 1
3 Hi A 8 1
4 Hi A 9 1
rankdata.txt
regno name total maths science social cat
1 SUKUMARAN 400 78 89 73 S
2 SHYAMALA 432 65 79 87 S
3 MANOJ 500 90 129 78 C
4 MILYPAULOSE 383 59 88 65 G
5 ANSAL 278 39 77 60 O
6 HAZEENA 273 45 55 56 O
7 MANJUSHA 374 50 99 52 C
8 BILBU 408 81 97 72 S
9 JOSEPHROBIN 374 57 85 68 G
10 SHINY 381 70 79 70 S
z <- data.frame(rankdata)
z[with(z, order(-total+ maths)),] #order function maths group selection
z
z[with(z, order(name)),] # sort on name
z

Apply over all columns and rows of two diffrent dataframes in R

I try to apply a function over all rows and columns of two dataframes but I don't know how to solve it with apply.
I think the following script explains what I intend to do and the way i tried to solve it. Any advice would be warmly appreciated! Please note, that the simplefunction is only intended to be an example function to keep it simple.
# some data and a function
df1<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
df2<-data.frame(name=c("aa","bb","cc","dd","ee"),a=sample(1:50,5),b=sample(1:50,5),c=sample(1:50,5))
simplefunction<-function(a,b){a+b}
# apply on a single row
simplefunction(df1[1,2],df2[1,2])
# apply over all colums
apply(?)
## apply over all columns and rows
# create df to receive results
df3<-df2
# loop it
for (i in 2:5)df3[i]<-apply(?)
My first mapply answer!! For your simple example you have...
mapply( FUN = `+` , df1[,-1] , df2[,-1] )
# a b c
# [1,] 60 35 75
# [2,] 57 39 92
# [3,] 72 71 48
# [4,] 31 19 85
# [5,] 47 66 58
You can extend it like so...
mapply( FUN = function(x,y,z,etc){ simplefunctioncodehere} , df1[,-1] , df2[,-1] , ... other dataframes here )
The dataframes will be passed in order to the function, so in this example df1 would be x, df2 would be y and z and etc would be some other dataframes that you specify in that order. Hopefully that makes sense. mapply will take the first row, first column values of all dataframes and apply the function, then the first row, second column of all data frames and apply the function and so on.
You can also use Reduce:
set.seed(45) # for reproducibility
Reduce(function(x,y) { x + y}, list(df1[, -1], df2[,-1]))
# a b c
# 1 53 22 23
# 2 64 28 91
# 3 19 56 51
# 4 38 41 53
# 5 28 42 30
You can just do :
df1[,-1] + df2[,-1]
Which gives :
a b c
1 52 24 37
2 65 63 62
3 31 90 89
4 90 35 33
5 51 33 45

Resources