Clustering with binary variables


I have a dataset in which some of the variables are binary.
The first column contains names, so applying cluster analysis throws an error:
kc <- kmeans(j1,4) ## j1 is the stored data frame
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message: In storage.mode(x) <- "double" : NAs
introduced by coercion
Here is the head of the data, from dput(j1[1:5,]):
structure(list(OUTPUT_NAME = c("nonsaturation_fba268_2ch_0_out.wav",
"nonsaturation_fba268_2ch_32_out.wav", "substreaminfo_fba268_2ch_96_out.wav",
"substreaminfo_fba268_2ch_201_out.wav", "substreaminfo_fba268_2ch_93_out.wav"
), PEAK_MIPS = c(82.47, 82.5, 82.63, 82.73, 82.73), PRESENTATION = c(0,
0, 0, 0, 0), DTHD_ATMOS_PRE = c(0, 0, 0, 0, 0), FBAFBBDETECTER = c(1,
1, 1, 1, 1), DIAL_NORM = c(31, 31, 31, 31, 31), NORMAL_DRC = c(0,
0, 0, 0, 0), ANALOG_DB_GAIN_REQ = c(0, 0, 0, 0, 0), DECODER_CH_ASSIGN = c(1,
1, 1, 1, 1), DECODER_6_CH_ASSIGN = c(1, 1, 13, 1, 1), DECODER_8_CH_ASSIGN = c(1,
1, 13, 1, 1), DECODER_16_CH_ASSIGN = c(0, 0, 0, 0, 0), CH_MODIFIER = c(0,
0, 0, 0, 0), CH_ASSIGNMENT_TYPE = c(0, 0, 0, 0, 0), FILTER_ORDER = c(0,
0, 0, 0, 0), COEFF_BITS = c(9, 9, 9, 9, 9), COEFF_SHIFT = c(7,
7, 7, 7, 7), STATE_BITS = c(4, 4, 6, 6, 6), STATE_SHIFT = c(0,
0, 0, 0, 0), `31EC_PRIMITIVE_MATRIX_CNT` = c(16, 16, 8, 8, 8),
LSB_BYPASS_COUNT = c(0, 0, 0, 0, 0), DITHER_SCALE = c(1,
1, 1, 1, 1), `31EC_FRAC_BITS` = c(14, 14, 12, 12, 12), INTERPOLATION_USED = c(1,
1, 0, 0, 0), `31EA_31EB_PRIMITIVE_MATIX_CNT` = c(0, 0, 0,
0, 0), `31EA_31EB_FRAC_BITS` = c(14, 14, 12, 12, 12), LSB_BYPASS_USED = c(0,
0, 0, 0, 0), AU_LENGTH = c(937, 937, 937, 937, 937), VARIABLE_RATE = c(1,
1, 1, 1, 1), PEAK_DATA_RATE = c(6000, 6000, 6000, 6000, 6000
), SUBSTREAM_CNT = c(1, 1, 2, 2, 2), EXTENDED_SUBSTREAM_CNT = c(0,
0, 0, 0, 0), SUBSTREAM_INFO = c(20, 20, 40, 24, 24), SPEAKER_LAYOUT = c(0,
0, 0, 0, 0), CONTROL_EN_2 = c(0, 0, 0, 0, 0), CONTROL_EN_6 = c(0,
0, 0, 0, 0), CONTROL_EN_8 = c(0, 0, 0, 0, 0), MIX_LEVEL_2 = c(35,
35, 35, 35, 35), MIX_LEVEL_6 = c(35, 35, 35, 35, 35), MIX_LEVEL_8 = c(35,
35, 35, 35, 35), DIALOGUE_NORM_2 = c(31, 31, 31, 31, 31),
DIALOGUE_NORM_6 = c(31, 31, 31, 31, 31), DIALOGUE_NORM_8 = c(31,
31, 31, 31, 31), SOURCE_FORMAT_6 = c(0, 0, 0, 0, 0), SOURCE_FORMAT_8 = c(0,
0, 0, 0, 0), DRC_STARTUP_GAIN = c(0, 0, 0, 0, 0), DIALOGUE_NORM_16 = c(28,
28, 31, 31, 31), MIX_LEVEL_16 = c(35, 35, 35, 35, 35), CHANNEL_CNT_16 = c(16,
16, 16, 16, 16), DYNAMIC_OBJ_ONLY = c(1, 1, 1, 1, 1), DYNAMIC_CHANNEL_CNT_16 = c(0,
0, 0, 0, 0), LFE_PRE = c(1, 1, 0, 0, 0), CHANNEL_CONTENT_DES_16 = c(0,
0, 0, 0, 0), MIN_CHAN = c(0, 0, 0, 0, 0), MAX_CHAN = c(1,
1, 1, 1, 1), RESTART_SYNC_WORD = c(12778, 12778, 12778, 12778,
12778), MAX_MATRIX_CHAN = c(1, 1, 1, 1, 1), DITHER_SHIFT = c(0,
0, 0, 0, 0), ERROR_PROTECT = c(1, 1, 1, 1, 1), LOSSLESS_PROTECT = c(0,
0, 1, 1, 1), BLOCK_SIZE = c(32, 32, 40, 40, 40), OUTPUT_SHIFT = c(0,
0, 0, 0, 0), QUANT_STEP_SIZE = c(0, 0, 0, 0, 0), HUFF_OFFSET = c(0,
0, 0, 0, 0), HUFF_TYPE = c(1, 1, 0, 2, 2), HUFF_LSBS = c(6,
6, 8, 5, 5), SAMPLE_RATE = c(0, 3, 0, 3, 0), OUTPUT_SAMPLE_COUNT = c(40,
40, 40, 40, 40), RESTART_HEADER_EXISTS = c(0, 0, 0, 0, 0)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))

The first column is not numeric, and kmeans() only accepts numeric data. Because j1 is a tibble, use [[ to check the class of the column itself:
class(j1[[1]])
[1] "character"
You have to remove that column to make kmeans work:
set.seed(1234)
kmeans(j1[,-1],2)
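As a minimal sketch of the same idea, the non-numeric columns can be dropped by type rather than by position, and the cluster labels can then be put back next to the names. The data frame j1_toy and its values are made up for illustration; only the column names are borrowed from the question.

```r
# Toy stand-in for j1 (hypothetical values): one character column, three numeric ones
j1_toy <- data.frame(
  OUTPUT_NAME = c("a.wav", "b.wav", "c.wav", "d.wav", "e.wav", "f.wav"),
  PEAK_MIPS   = c(82.47, 82.50, 82.63, 82.73, 82.75, 82.80),
  LFE_PRE     = c(1, 1, 0, 0, 0, 0),
  BLOCK_SIZE  = c(32, 32, 40, 40, 40, 40),
  stringsAsFactors = FALSE
)

# Select numeric columns by type rather than by position
is_num <- vapply(j1_toy, is.numeric, logical(1))
x <- j1_toy[, is_num, drop = FALSE]

# nstart = 10 tries several random starts and keeps the best solution
set.seed(1234)
kc <- kmeans(x, centers = 2, nstart = 10)

# Put the cluster labels next to the names that were left out of the clustering
result <- data.frame(OUTPUT_NAME = j1_toy$OUTPUT_NAME, cluster = kc$cluster)
print(result)
```

Selecting by type keeps working if more character columns are added later, whereas j1[,-1] only drops the first column.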

Related

Creating column that takes mean of X for specific group id when date1 < date2

In R, for an event study, I'm trying to create a column that holds the mean of ccu_avg for each combination of appid and Eventdate1. One appid can have multiple events, so the data has to be grouped by both appid and Eventdate1.
The difficulty is that the mean should only use rows up until the event date, because the estimation period stops once the event has happened.
The new column should look like est_ccu_avg:
(picture of the dataset for explanation)
https://i.stack.imgur.com/ZPquW.png
Could someone help me figure out the code for this? I've been trying for hours and can't get it to work.
So far I have tried things like this, without success:
study <- study %>%
mutate(est_ccu_avg=
mean(study[unique(study$appid) | study$Eventdate1 >
study$datefinal, "ccu_avg"])
)
Result of dput head:
structure(list(appid = c("105600", "105600", "105600", "105600",
"105600", "105600"), name = c("Terraria", "Terraria", "Terraria",
"Terraria", "Terraria", "Terraria"), ccu_avg = c(26825, 29058,
37842, 37525, 26484, 24377), ccu_min = c(21176, 21620, 28954,
32880, 19648, 19118), ccu_max = c(35827, 41322, 50012, 44071,
33241, 32060), pos_max = c(356186, 356363, 356508, 356712, 356921,
357092), neg_max = c(6756, 6756, 6758, 6768, 6766, 6768), Maj_Upt =
c(0,
0, 0, 0, 0, 0), Min_Upt = c(0, 0, 0, 0, 0, 0), Hotfix = c(0,
0, 0, 0, 0, 0), Bugfix = c(0, 0, 0, 0, 0, 0), Balance = c(0,
0, 0, 0, 0, 0), ExpBranch = c(0, 0, 0, 0, 0, 0), Promo = c(0,
1, 0, 0, 0, 0), Ev_Out = c(0, 0, 0, 0, 0, 0), Ev_In = c(0, 0,
0, 0, 0, 0), isfree = c(0, 0, 0, 0, 0, 0), developers1 = c("Re-Logic",
"Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic"),
publishers1 = c("Re-Logic", "Re-Logic", "Re-Logic", "Re-Logic",
"Re-Logic", "Re-Logic"), metascore = c(83, 83, 83, 83, 83,
83), singleplayer = c(1, 1, 1, 1, 1, 1), multiplayer = c(1,
1, 1, 1, 1, 1), coop = c(1, 1, 1, 1, 1, 1), mmo = c(0, 0,
0, 0, 0, 0), indie = c(1, 1, 1, 1, 1, 1), single_player_gen = c(0,
0, 0, 0, 0, 0), adventure = c(1, 1, 1, 1, 1, 1), casual = c(0,
0, 0, 0, 0, 0), strategy = c(0, 0, 0, 0, 0, 0), rpg = c(1,
1, 1, 1, 1, 1), simulation = c(0, 0, 0, 0, 0, 0), multi_player_gen =
c(0,
0, 0, 0, 0, 0), shooter = c(0, 0, 0, 0, 0, 0), platformer = c(0,
0, 0, 0, 0, 0), ea_min = c(0, 0, 0, 0, 0, 0), ea_max = c(0,
0, 0, 0, 0, 0), scifi = c(0, 0, 0, 0, 0, 0), sports = c(0,
0, 0, 0, 0, 0), racing = c(0, 0, 0, 0, 0, 0), inappurchase = c(0,
0, 0, 0, 0, 0), workshop = c(0, 0, 0, 0, 0, 0), f_release_date =
c("May 16, 2011",
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011",
"May 16, 2011"), l_release_date = c("May 16, 2011", "May 16, 2011",
"May 16, 2011", "May 16, 2011", "May 16, 2011", "May 16, 2011"
), datefinal = structure(c(18942, 18943, 18944, 18945, 18946,
18947), class = "Date"), Eventdate = c("", "", "", "", "",
""), Eventdate1 = structure(c(18949, 18949, 18949, 18949,
18949, 18949), class = "Date"), est_ccu_avg = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

I figured it out. There is probably an easier way, but this is how I did it:
# first keep only the rows where Eventdate1 > datefinal, i.e. the estimation period
estmeans <- study[study$Eventdate1 > study$datefinal,]
# calculate means per appid and Eventdate1
studymeans <- aggregate(estmeans$ccu_avg, list(estmeans$appid,
estmeans$Eventdate1), mean)
# rename the columns for merging
names(studymeans)[1] <- 'appid'
names(studymeans)[2] <- 'Eventdate1'
names(studymeans)[3] <- 'est_ccu_avg'
# merge the data frames; this creates two est_ccu_avg columns, so delete the empty one
studynew <- merge(study, studymeans, by=c("appid", "Eventdate1"))
studynew$est_ccu_avg.x <- NULL

With data.table, you can use the special symbol .BY to refer to the grouping variables:
library(data.table)
setDT(df)[, mean(ccu_avg[datefinal<=.BY$Eventdate1]), by=.(appid, Eventdate1)]
The equivalent in dplyr is cur_group().
library(dplyr)
df %>%
group_by(appid,Eventdate1) %>%
summarize(res = mean(ccu_avg[datefinal<=cur_group()$Eventdate1]))
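For comparison, here is a base R sketch that writes the group mean back onto every row, which is what the est_ccu_avg column in the question asks for. The data frame study_toy and its values are hypothetical. It uses a strict datefinal < Eventdate1 comparison, as in the question title, so the event day itself is excluded; that choice is an assumption and can be switched to <= if the event day belongs to the estimation period.

```r
# Hypothetical toy data: two apps, each with one event date
study_toy <- data.frame(
  appid      = c("A", "A", "A", "A", "B", "B", "B"),
  datefinal  = as.Date("2021-11-10") + c(0, 1, 2, 3, 0, 1, 2),
  Eventdate1 = as.Date(c(rep("2021-11-12", 4), rep("2021-11-11", 3))),
  ccu_avg    = c(10, 20, 30, 40, 100, 200, 300),
  stringsAsFactors = FALSE
)

# One key per (appid, Eventdate1) combination
key <- paste(study_toy$appid, study_toy$Eventdate1)

# Rows that fall inside the estimation period (strictly before the event)
in_window <- study_toy$datefinal < study_toy$Eventdate1

# Mean of ccu_avg per group, computed on the estimation period only
grp_means <- tapply(study_toy$ccu_avg[in_window], key[in_window], mean)

# Look the group mean up for every row; groups without pre-event rows get NA
study_toy$est_ccu_avg <- as.numeric(grp_means[key])
print(study_toy)
```

In this toy data, app A averages its first two days and app B only its first day, and every row of a group receives the same value.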

One-hot coding to numeric [duplicate]

This question already has answers here:
How do I dichotomise efficiently
(5 answers)
How to one hot encode several categorical variables in R
(5 answers)
Closed 9 months ago.
I am working on a project that requires me to one-hot encode a single variable, and I cannot seem to do it correctly.
I simply want to one-hot encode the variable data$Ratings so that the values 1, 2 and 3 each get their own column in the data frame, containing only 0 or 1. E.g., if data$Ratings = 3, then the dummy for 3 would be 1. All the other columns should stay unchanged.
structure(list(ID = c(284921427, 284926400, 284946595, 285755462,
285831220, 286210009, 286313771, 286363959, 286566987, 286682679
), AUR = c(4, 3.5, 3, 3.5, 3.5, 3, 2.5, 2.5, 2.5, 2.5), URC = c(3553,
284, 8376, 190394, 28, 47, 35, 125, 44, 184), Price = c(2.99,
1.99, 0, 0, 2.99, 0, 0, 0.99, 0, 0), AgeRating = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1), Size = c(15853568, 12328960, 674816, 21552128,
34689024, 48672768, 6328320, 64333824, 2657280, 1466515), HasSubtitle = c(0,
0, 0, 0, 0, 1, 0, 0, 0, 0), InAppSum = c(0, 0, 0, 0, 0, 1.99,
0, 0, 0, 0), InAppMin = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppMax = c(0,
0, 0, 0, 0, 1.99, 0, 0, 0, 0), InAppCount = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), InAppAvg = c(0, 0, 0, 0, 0, 1.99, 0, 0, 0, 0),
descriptionTermCount = c(263, 204, 97, 272, 365, 368, 113,
129, 61, 87), LanguagesCount = c(17, 1, 1, 17, 15, 1, 0,
1, 1, 1), EngSupported = c(2, 2, 2, 2, 2, 2, 1, 2, 1, 2),
GenreCount = c(2, 2, 2, 2, 3, 3, 3, 2, 3, 2), months = c(7,
7, 7, 7, 7, 7, 7, 8, 8, 8), monthsSinceUpdate = c(29, 17,
25, 29, 15, 6, 71, 12, 23, 134), GameFree = c(0, 0, 0, 0,
0, 1, 0, 0, 0, 0), Ratings = c(3, 3, 3, 3, 2, 3, 2, 3, 2,
3)), row.names = c(NA, 10L), class = "data.frame")
install.packages("mlbench")
install.packages("neuralnet")
install.packages("mltools")
library(mlbench)
library(dplyr)
library(caret)
library(mltools)
library(tidyr)
data2 <- mutate_if(data, is.factor,as.numeric)
data3 <- lapply(data2, function(x) as.numeric(as.character(x)))
data <- data.frame(data3)
summary(data)
head(data)
str(data)
View(data)
#
dput(head(data, 10))
data %>% mutate(value = 1) %>% spread(data$Ratings, value, fill = 0 )

Is this what you want? I will assume the data frame you supplied is called data:
library(plm)
plm::make.dummies(data$Ratings) # returns a matrix
## 2 3
## 2 1 0
## 3 0 1
# returns the full data frame with dummies added:
plm::make.dummies(data, col = "Ratings")
## [not printed to save space]
There are some options for plm::make.dummies, e.g., you can select the base category via base and you can choose whether to include the base (add.base = TRUE) or not (add.base = FALSE).
The help page ?plm::make.dummies has more examples and explanation as well as a comparison for LSDV model estimation by a factor variable and by explicitly self-created dummies.
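If installing plm only for this is not wanted, a base R sketch with model.matrix() gives a similar result. This is a swapped-in technique, not the answer's method; data_toy and the Ratings_ column prefix are made up for illustration.

```r
# Hypothetical toy data with the Ratings column from the question
data_toy <- data.frame(ID = 1:5, Ratings = c(3, 3, 2, 1, 3))

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ factor(Ratings) - 1, data = data_toy)

# Replace the default "factor(Ratings)3"-style names with readable ones
colnames(dummies) <- paste0("Ratings_", levels(factor(data_toy$Ratings)))

# Keep all original columns and append the dummy columns
data_onehot <- cbind(data_toy, dummies)
print(data_onehot)
```

Only levels that actually occur in the column get a dummy, so for the ten rows shown in the question (Ratings of 2 and 3 only) this would produce two dummy columns.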

Removing characters/words in a column of a large dataframe in R

I am currently struggling to remove words from a large data frame in R.
This is the df:
The first column (GeneID) contains a so-called "ensembl gene ID", e.g. ENSG00000223972.5 in the first row, followed by a "|" and then the real gene name. I want to remove the "ensembl gene ID" including the "|" so that only the real gene name is kept in this column. Is there a smart way to do this, for example with the stringr package?
Cheers!
Edit:
> dput(head(data3))
structure(list(GeneID = c("ENSG00000223972.5|DDX11L1", "ENSG00000227232.5|WASH7P",
"ENSG00000278267.1|MIR6859-1", "ENSG00000243485.5|MIR1302-2HG",
"ENSG00000284332.1|MIR1302-2", "ENSG00000237613.2|FAM138A"),
`DC2-CD5pos-d1` = c(2, 47, 0, 0, 0, 0), `DC2-CD5pos-d2` = c(0,
41, 0, 0, 0, 0), `DC2-CD5pos-d3` = c(2, 31, 0, 0, 0, 0),
`DC2-CD5pos-d4` = c(0, 29, 0, 0, 0, 0), `DC3-d1` = c(1, 36,
0, 0, 0, 0), `DC3-d2` = c(0, 33, 0, 0, 0, 0), `DC3-d3` = c(0,
49, 0, 0, 0, 3), `DC3-d4` = c(0, 27, 0, 0, 0, 0), `DC2-BTLA-S-d1` = c(2,
4, 0, 1, 0, 0), `DC2-BTLA-S-d3` = c(6, 6, 1, 0, 0, 0), `DC2-BTLA-S-d4` = c(2,
1, 0, 0, 0, 0), `DC3-CD163-S-d1` = c(2, 8, 2, 0, 0, 0), `DC3-CD163-S-d3` = c(5,
9, 0, 0, 0, 0), `DC3-CD163-S-d4` = c(0, 5, 0, 0, 0, 0)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
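One possible approach, sketched in base R on three of the IDs from the dput output above: sub() with a regular expression that removes everything up to and including the first "|". The vector name gene_ids is only for illustration.

```r
# Three IDs copied from the dput output above
gene_ids <- c("ENSG00000223972.5|DDX11L1",
              "ENSG00000227232.5|WASH7P",
              "ENSG00000278267.1|MIR6859-1")

# ^[^|]* matches everything before the first "|", and \\| matches the "|" itself
gene_names <- sub("^[^|]*\\|", "", gene_ids)
print(gene_names)
```

On the full data frame this would be data3$GeneID <- sub("^[^|]*\\|", "", data3$GeneID). With stringr, str_remove(data3$GeneID, "^[^|]*\\|") should behave the same way.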

How to set NA values from a matrix to black-coloured tiles in a ggplot heatmap

I am working on the following structure and the following plotting code:
structure(c(NA, 11, 9, 9, 21, 7, 2, 5, 3, 0, 0, 1, 31, NA, 3,
2, 1, 0, 0, 10, 3, 0, 0, 0, 31, 16, NA, 2, 2, 10, 0, 5, 0, 0,
0, 0, 59, 65, 1, NA, 2, 4, 0, 4, 0, 0, 0, 0, 156, 23, 7, 17,
NA, 3, 2, 4, 7, 0, 0, 0, 31, 84, 0, 10, 16, NA, 0, 6, 0, 0, 2,
0, 129, 0, 2, 1, 0, 0, NA, 0, 0, 0, 0, 0, 41, 41, 0, 3, 4, 5,
0, NA, 0, 0, 0, 1, 16, 4, 1, 2, 0, 0, 0, 3, NA, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 1, 12, 2, 0, 0, 6, 0, 0, 0, 0,
NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), .Dim = c(12L,
12L), .Dimnames = list(c("WILL_", "WOULD_", "MAY_", "MIGHT_",
"CAN_", "COULD_", "SHALL_", "SHOULD_", "MUST_", "OUGHT TO_",
"USED TO_", "HAVE TO_"), c("_WILL", "_WOULD", "_MAY", "_MIGHT",
"_CAN", "_COULD", "_SHALL", "_SHOULD", "_MUST", "_OUGHT TO",
"_USED TO", "_HAVE TO")))
breaks <- c(0,1,5,10,50,100,500,100000)
reshape2::melt(structure, value.name = "Freq") %>%
mutate(label = ifelse(is.na(Freq) | Freq == 0, "", as.character(Freq))) %>%
ggplot(aes(Var2, fct_rev(Var1))) +
geom_tile(aes(fill = Freq), color = "black") +
geom_text(aes(label = label), color = "black") +
scale_fill_steps(low = "white", high = "purple", breaks = breaks, na.value = "grey",trans = "log")+
scale_x_discrete(NULL, expand = c(0, 0), position="top") +
scale_y_discrete(NULL, expand = c(0, 0)) +
theme(axis.text.x = element_text(angle=60,vjust = 0.5, hjust = 0))
I am trying to tweak the code so that original NA values (seen on the plot as the tiles forming a diagonal line from the co-occurrence of WILL WILL to HAVE TO HAVE TO, and the X HAVE TO column) are represented as black tiles separately from the other tiles which I would like to keep as they are.
Looking for tips on how to do this as I think I'm doing something wrong with the representation of values at the beginning of my code.
All the best
Cameron

Entire data frame is not displayed while using kable() in R

I am trying to display a data frame using the kable function. The data frame consists of 29 columns, but only 9 columns are displayed and the remaining columns are repeats of those.
The dataframe used is
structure(list(Name = c("Grand Total", "B", "C", "D", "E", "F"
), GrandTotal = c(3416, 297, 410, 326, 125, 29), English = c(1096,
18, 64, 0, 55, 0), Science = c(211, 5, 39, 0, 55, 0), Language = c(149,
5, 0, 0, 10, 0), Maths = c(22, 0, 0, 0, 0, 0), Social = c(0,
0, 0, 0, 0, 0), English = c(211, 5, 39, 0, 55, 0), Science = c(149,
5, 0, 0, 10, 0), Maths = c(0, 0, 0, 0, 0, 0), Social = c(22,
0, 0, 0, 0, 0), English = c(1096, 18, 64, 0, 55, 0), Science = c(211,
5, 39, 0, 55, 0), Language = c(149, 5, 0, 0, 10, 0), Maths = c(22,
0, 0, 0, 0, 0), Social = c(0, 0, 0, 0, 0, 0), English = c(211,
5, 39, 0, 55, 0), Science = c(149, 5, 0, 0, 10, 0), ACIntern = c(0,
0, 0, 0, 0, 0), PAM = c(22, 0, 0, 0, 0, 0), Maths = c(1096, 18,
64, 0, 55, 0), Social = c(211, 5, 39, 0, 55, 0), English = c(149,
5, 0, 0, 10, 0), Science = c(22, 0, 0, 0, 0, 0), Language = c(0,
0, 0, 0, 0, 0), Maths = c(211, 5, 39, 0, 55, 0), Social = c(149,
5, 0, 0, 10, 0), English = c(0, 0, 0, 0, 0, 0), Science = c(22,
0, 0, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")
The code used for displaying the data frame as a table format is as follows
monthSelected <- c("April","May","June")
month1 <- paste0(monthSelected[1],' ',yearSelected)
month2 <- paste0(monthSelected[2],' ',yearSelected)
month3 <- paste0(monthSelected[3],' ',yearSelected)
myHeader <- c(" " = 2, month1 = 9, month2 = 9, month3 = 9)
names(myHeader) <- c(" ", month1, month2, month3)
kable(df[1:ncol(df)],"html") %>%
kable_styling(c("striped", "bordered")) %>%
add_header_above(c(" "=2, "IND" = 5, "US" = 4,"IND" = 5, "US" = 4,"IND" = 5, "US" = 4)) %>%
add_header_above(header = myHeader)
The output displayed is as follows
I can't figure out where I went wrong. Can anyone help me out with this issue?
Also, is it possible to freeze the first two columns when the table is scrolled horizontally?
Thanks in advance!!
