I'm trying to build a clustering model using the k-means method with both continuous and categorical variables.
The goal is to create clusters based on gender, age, occupation, billing plan, cell phone and usage of some applications.
I'm struggling with how to process the categorical data. I know I should turn them into dummies, but I'm not sure how to do it for all the categorical variables at once.
Thank you
The table looks like:
ID Gender Age Occupation Plan Cell phone Amazon Prime GB DL Apple Music GB DL Audible DB DL
C001 NR 56 Student Archaius SAMSUNG 0 0 0.498829165
C002 M 25 Management Malawi HUAWEI 0 0 1
C003 H 32 Professor Archaius Apple 0 0 0.632005841
One possible solution to create dummy variables from categorical variables is the "fastDummies" Package:
library(fastDummies)
df <- data.frame(NR1 = c(1,2,3),
NR2 = c(0.1, 0.5, 0.7),
FA1 = factor(c("A","B","C")),
FA2 = factor(c("5","6","7")))
str(df)
'data.frame': 3 obs. of 4 variables:
$ NR1: num 1 2 3
$ NR2: num 0.1 0.5 0.7
$ FA1: Factor w/ 3 levels "A","B","C": 1 2 3
$ FA2: Factor w/ 3 levels "5","6","7": 1 2 3
# one variable per factor level
fastDummies::dummy_cols(df)
NR1 NR2 FA1 FA2 FA1_A FA1_B FA1_C FA2_5 FA2_6 FA2_7
1 1 0.1 A 5 1 0 0 1 0 0
2 2 0.5 B 6 0 1 0 0 1 0
3 3 0.7 C 7 0 0 1 0 0 1
# encoding with n-1 columns per factor (the first level is dropped; all zeros imply the first level)
fastDummies::dummy_cols(df, remove_first_dummy = TRUE)
NR1 NR2 FA1 FA2 FA1_B FA1_C FA2_6 FA2_7
1 1 0.1 A 5 0 0 0 0
2 2 0.5 B 6 1 0 1 0
3 3 0.7 C 7 0 1 0 1
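If you would rather stay in base R, the same "all categorical columns at once" encoding can be sketched with model.matrix and the result passed to kmeans. This is only a minimal sketch: the `customers` data frame is a hypothetical, shortened version of the table in the question, and scaling the continuous columns before clustering is my assumption, not something the question specifies.

```r
# Hypothetical subset of the asker's table; column names are shortened for the sketch
customers <- data.frame(
  Gender = c("NR", "M", "H"),
  Age = c(56, 25, 32),
  Plan = c("Archaius", "Malawi", "Archaius"),
  AudibleGB = c(0.498829165, 1, 0.632005841),
  stringsAsFactors = TRUE
)

# Find every factor column so they are all encoded in one call
cat_cols <- names(customers)[sapply(customers, is.factor)]

# contrasts = FALSE asks for one indicator column per level (no level dropped)
full_contrasts <- lapply(customers[cat_cols], contrasts, contrasts = FALSE)
dummies <- model.matrix(~ . - 1, data = customers[cat_cols],
                        contrasts.arg = full_contrasts)

# Scale the continuous columns so no single variable dominates the distance
num_cols <- setdiff(names(customers), cat_cols)
X <- cbind(scale(customers[num_cols]), dummies)

set.seed(42)
fit <- kmeans(X, centers = 2)
fit$cluster
```

Keep in mind that k-means uses Euclidean distance, so how the dummy columns are weighted relative to the scaled continuous ones is a modelling choice you may want to revisit.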
Say I collect two samples from two groups but the sample I collect from each subject within each group is a different size.
head(df1)
subject three bias number
<int> <dbl> <dbl> <int>
1 1 0 0.696 69
2 1 1 0.656 32
3 100 0 0.938 64
4 100 1 0.929 28
5 1002 0 0.7 40
6 1002 1 0.345 29
The df1 above is actually aggregated, by subject, from a df that holds one Boolean bias value per observation.
str(df)
'data.frame': 1256 obs. of 3 variables:
$ subject: int 1 1 1 1 1 1 1 1 1 1 ...
$ three : num 0 0 0 0 0 0 0 0 0 0 ...
$ bias : num 1 1 1 1 1 1 1 1 1 1 ...
I want to determine whether there are significant within-subject (column 1 df1) differences in the mean bias between three==0 and three==1 (column2 df1). The bias column (column 3 df1) is averaged over the number of samples (column 4 df1).
In theory, I would like to run a bias-corrected and accelerated (BCa) bootstrap test or a permutation test that can handle different sample sizes.
I'm confused how to implement a valid statistical test when it is a group of groups as described here.
Any insight would be greatly appreciated!
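One possible starting point, sketched under assumptions, is a paired sign-flip permutation test on the per-subject differences. The data below are hypothetical values in the same shape as df1, and the sketch deliberately ignores the differing number of samples per subject (the `number` column), so treat it as an illustration of the mechanics rather than a validated answer.

```r
# Hypothetical data in the same shape as df1 (the `number` column is left out)
df1 <- data.frame(
  subject = rep(c(1, 100, 1002), each = 2),
  three = rep(c(0, 1), times = 3),
  bias = c(0.696, 0.656, 0.938, 0.929, 0.7, 0.345)
)

# One row per subject, with bias.0 and bias.1 side by side
wide <- reshape(df1, idvar = "subject", timevar = "three", direction = "wide")

# Within-subject difference between three == 1 and three == 0
d <- wide$bias.1 - wide$bias.0
obs <- mean(d)

# Under the null the sign of each difference is arbitrary, so flip signs at random
set.seed(1)
perm <- replicate(5000, mean(d * sample(c(-1, 1), length(d), replace = TRUE)))

# Two-sided p-value: share of permuted means at least as extreme as the observed one
p_value <- mean(abs(perm) >= abs(obs))
p_value
```

To account for the unequal sample sizes, one option would be to permute the `three` labels within each subject in the raw, unaggregated df instead of flipping signs on the aggregated means.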
Consider the following toy data frame of my seed study:
site <- c(LETTERS[1:12])
site1 <- rep(site,each=80)
fate <- c('germinated', 'viable', 'dead')
fate1 <- rep(fate,each=320)
number <- c(41:1000)
df <- data.frame(site1,fate1,number)
> str(df)
'data.frame': 960 obs. of 3 variables:
$ site1 : Factor w/ 12 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
$ fate1 : Factor w/ 3 levels "dead","germinated",..: 2 2 2 2 2 2 2 2 2 2 ...
$ number: int 41 42 43 44 45 46 47 48 49 50 ...
I want R to go through all observations which are "dead" and assign "0" to every single one of them. Similarly, I want to assign "1" to all "viable" observations and "2" to all "germinated" observations.
My final data frame would be a single column, somewhat like this:
> year16
[1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0
[38] 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1
All suggestions are highly welcome
As zx8754 mentioned, you can have a look at the properties of a factor.
year16 <- as.numeric(factor(df$fate1, levels = c("dead", "viable", "germinated")))-1
Here I first reorder the levels of df$fate1, so dead is assigned to 1, viable to 2 and germinated to 3. You want the sequence to start at 0, so I subtract 1 after turning the factor into a numeric variable.
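To see why the reordering matters, here is a small sketch on a hypothetical three-element vector: without explicit levels, the codes follow alphabetical order.

```r
fate <- c("germinated", "viable", "dead")

# Default levels are alphabetical: dead = 1, germinated = 2, viable = 3
default_codes <- as.numeric(factor(fate)) - 1

# Explicit levels give the mapping asked for: dead = 0, viable = 1, germinated = 2
wanted_codes <- as.numeric(factor(fate, levels = c("dead", "viable", "germinated"))) - 1

default_codes
wanted_codes
```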
Using case_when from the dplyr library:
library(dplyr)
df$year16 <-
case_when(
levels(df$fate1)[df$fate1] == "dead" ~ 0,
levels(df$fate1)[df$fate1] == "viable" ~ 1,
levels(df$fate1)[df$fate1] == "germinated" ~ 2,
TRUE ~ -1
)
Note: The solutions given by @David and @kath are much more graceful than this, but what I gave above would still work even if we had non-numerical replacements.
Base R solution:
assignnum <- function(x) {
if (x == 'viable') {
z <- 1
} else if (x == 'dead') {
z <- 0
} else if (x == 'germinated') {
z <- 2
}
return(z)
}
df['result'] <- sapply(df$fate1, assignnum)
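A vectorized alternative to the if/else function is a named lookup vector. This is a sketch on a hypothetical `fate` factor; `codes` is a name I made up for the illustration.

```r
fate <- factor(c("germinated", "viable", "dead", "dead"))

# Names are the categories, values are the replacements
codes <- c(dead = 0, viable = 1, germinated = 2)

# Index by name; as.character() avoids indexing by the factor's integer codes
result <- unname(codes[as.character(fate)])
result
```

Any category missing from `codes` comes back as NA, which makes unexpected values easy to spot.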
I have a categorical dataset that I am trying to summarize that has inherent differences in the nature of questions that were asked. The data below represent a questionnaire that had standard close-ended questions, but also questions where one could choose multiple answers from a list. "village" and "income" represent close-ended questions. "responsible.1"...etc... represent a list where the respondent either said yes or no to each.
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both DLNR NA DEQ NA Public
k regular.income DLNR NA NA NA NA
k regular.income DLNR CRM DEQ Mayor NA
l both DLNR NA NA Mayor NA
j both DLNR CRM NA Mayor NA
m regular.income DLNR NA NA NA Public
What I want is a 3-way table output with "village", "income" and the suite of "responsible" variables wrapped up into an ftable. This way, I could use the table with numerous R packages for graphs and analyses.
RESPONSIBLE
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both 2 1 1 1 1
k regular income 2 1 1 1 0
l both 1 0 0 1 0
m regular income 1 0 0 0 1
as.data.frame(table(village, responsible.1)) would get me the first, but I can't figure out how to get the entire thing wrapped up in a nice ftable.
> aggregate(dat[-(1:2)], dat[1:2], function(x) sum(!is.na(x)) )
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 j both 2 1 1 1 1
2 l both 1 0 0 1 0
3 k regular.income 2 1 1 1 0
4 m regular.income 1 0 0 0 1
I'm guessing you actually had another grouping vector, perhaps the first "responsible" column?
I don't really understand the sorting rules but reversing the order of the grouping columns may be closer to what you posted:
> aggregate(dat[-(1:2)], dat[2:1], function(x) sum(!is.na(x)) )
INCOME VILLAGE responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 both j 2 1 1 1 1
2 regular.income k 2 1 1 1 0
3 both l 1 0 0 1 0
4 regular.income m 1 0 0 0 1
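If you specifically need an ftable object rather than a data frame, one way to get there is to stack the "responsible" columns into long form and cross-tabulate. This is a sketch: `dat` is rebuilt here from the rows shown in the question, and `long`, `chosen` and `RESPONSIBLE` are names I introduced for the illustration.

```r
# Rebuild the example data from the question
dat <- data.frame(
  VILLAGE = c("j", "k", "k", "l", "j", "m"),
  INCOME = c("both", "regular.income", "regular.income", "both", "both", "regular.income"),
  responsible.1 = "DLNR",
  responsible.2 = c(NA, NA, "CRM", NA, "CRM", NA),
  responsible.3 = c("DEQ", NA, "DEQ", NA, NA, NA),
  responsible.4 = c(NA, NA, "Mayor", "Mayor", "Mayor", NA),
  responsible.5 = c("Public", NA, NA, NA, NA, "Public"),
  stringsAsFactors = FALSE
)

resp_cols <- grep("^responsible", names(dat), value = TRUE)

# Long form: one row per respondent and "responsible" item, 1 if it was chosen
long <- data.frame(
  VILLAGE = rep(dat$VILLAGE, times = length(resp_cols)),
  INCOME = rep(dat$INCOME, times = length(resp_cols)),
  RESPONSIBLE = rep(resp_cols, each = nrow(dat)),
  chosen = as.integer(!is.na(unlist(dat[resp_cols], use.names = FALSE)))
)

# Sum the indicator over VILLAGE x INCOME x RESPONSIBLE, then flatten
tab <- xtabs(chosen ~ VILLAGE + INCOME + RESPONSIBLE, data = long)
ft <- ftable(tab)
ft
```

Unlike the aggregate output, the ftable also lists village/income combinations that never occur, filled with zeros.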
I have a data frame that I'm working with that contains experimental data. For the purposes of this post we can limit the discussion to 5 columns: ExperimentID, ROI, isContrast, isTreated, and Value. ROI is a text-based factor that indicates where a region-of-interest is drawn, e.g. 'ROI_1', 'ROI_2', etc. isTreated and isContrast are binary fields indicating whether or not some treatment was applied. I want to make a scatter plot comparing the values of, e.g., 'ROI_1' vs. 'ROI_2', which means I need the data paired in such a way that when I plot it the first X value is from Experiment_1 and ROI_1, the first Y value is from Experiment_1 and ROI_2, the next X value is from Experiment_2 and ROI_1, the next Y value is from Experiment_2 and ROI_2, etc. I only want to make this comparison for common values of isContrast and isTreated (i.e. 1 plot for each combination of these variables, so 4 plots altogether).
Subsetting doesn't solve my problem because data from different experiments/ROIs was sometimes entered out of numerical order.
The following code produces a mock data set to demonstrate the problem
expID = c('Bob','Bob','Bob','Bob','Lisa','Lisa','Lisa','Lisa','Alice','Alice','Alice','Alice','Joe','Joe','Joe','Joe','Bob','Bob','Alice','Alice','Lisa','Lisa')
treated = c(0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,1,0,0,0,0,0,0)
contrast = c(0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1)
val = c(1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4,6,7,8,9,10,11)
roi = c(rep('A',16),'B','B','B','B','B','B')
myFrame = data.frame(ExperimentID=expID,isTreated = treated, isContrast= contrast,Value = val, ROI=roi)
ExperimentID isTreated isContrast Value ROI
1 Bob 0 0 1 A
2 Bob 0 1 2 A
3 Bob 1 0 3 A
4 Bob 1 1 4 A
5 Lisa 0 0 1 A
6 Lisa 0 1 2 A
7 Lisa 1 0 3 A
8 Lisa 1 1 4 A
9 Alice 0 0 1 A
10 Alice 0 1 2 A
11 Alice 1 0 3 A
12 Alice 1 1 4 A
13 Joe 0 0 1 A
14 Joe 0 1 2 A
15 Joe 1 0 3 A
16 Joe 1 1 4 A
17 Bob 0 0 6 B
18 Bob 0 1 7 B
19 Alice 0 0 8 B
20 Alice 0 1 9 B
21 Lisa 0 0 10 B
22 Lisa 0 1 11 B
Now let's say I want to scatter plot values for A vs. B. That is to say, I want to plot x vs. y where {(x,y)} = {(Bob's Value from ROI A, Bob's Value from ROI B), (Alice's Value from ROI A, Alice's Value from ROI B), ...} etc., and these all must have the same values for isTreated and isContrast for the comparison to make sense. Now, if I just go and subset, I'll get something like:
> x= myFrame$Value[(myFrame$ROI == 'A') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> x
[1] 1 1 1 1
> y= myFrame$Value[(myFrame$ROI == 'B') & (myFrame$isTreated == 0) & (myFrame$isContrast == 0)]
> y
[1] 6 8 10
Now, as you can see, the values in x correspond to the first rows of Bob, Lisa, Alice and Joe, respectively, but the values in y correspond to Bob, Alice and Lisa, respectively, and there is no value for Joe.
So say I ignored the value for Joe because that data is missing for B and just decided to plot the first 3 values of x vs. the first 3 values of y. The data are still out of order because x = (Bob, Lisa, Alice) but y = (Bob, Alice, Lisa) in terms of where the values are coming from. So I would like to know how to make vectors such that the order is correct and the plot makes sense.
Similar to @Matthew, with ggplot:
The idea is to reshape your data so that the values from ROI=A and ROI=B are in different columns. This can be done (with your sample data) as follows:
library(reshape2)
zz <- dcast(myFrame,
value.var="Value",
formula=ExperimentID+isTreated+isContrast~ROI)
zz
ExperimentID isTreated isContrast A B
1 Alice 0 0 1 8
2 Alice 0 1 2 9
3 Alice 1 0 3 NA
4 Alice 1 1 4 NA
5 Bob 0 0 1 6
6 Bob 0 1 2 7
7 Bob 1 0 3 NA
8 Bob 1 1 4 NA
9 Joe 0 0 1 NA
10 Joe 0 1 2 NA
11 Joe 1 0 3 NA
12 Joe 1 1 4 NA
13 Lisa 0 0 1 10
14 Lisa 0 1 2 11
15 Lisa 1 0 3 NA
16 Lisa 1 1 4 NA
Notice that your sample data is rather sparse (lots of NAs).
To plot:
library(ggplot2)
ggplot(zz,aes(x=A,y=B,color=factor(isTreated))) +
geom_point(size=4)+facet_wrap(~isContrast)
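Produces this: (image not shown: scatter plot of B against A, colored by isTreated, with one panel per isContrast value)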
The reason there are no blue points is that, in your sample data, there are no occurrences of isTreated=1 and ROI=B.
Something like this, perhaps:
myFrameReshaped <- reshape(myFrame, timevar='ROI', direction='wide', idvar=c('ExperimentID','isTreated','isContrast'))
plot(Value.B ~ Value.A, data=myFrameReshaped)
To condition by the isTreated and isContrast variables, lattice comes in handy:
library(lattice)
xyplot(Value.B~Value.A | isTreated + isContrast, data=myFrameReshaped)
Values that are not present for one of the conditions give NA, and are not plotted.
head(myFrameReshaped)
## ExperimentID isTreated isContrast Value.A Value.B
## 1 Bob 0 0 1 6
## 2 Bob 0 1 2 7
## 3 Bob 1 0 3 NA
## 4 Bob 1 1 4 NA
## 5 Lisa 0 0 1 10
## 6 Lisa 0 1 2 11
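If you only want the pairs that exist in both ROIs, a plain merge on the identifying columns is another option. The sketch below rebuilds the mock data from the question; `a`, `b` and `paired` are names introduced for the illustration.

```r
# Mock data from the question
expID <- c(rep(c("Bob", "Lisa", "Alice", "Joe"), each = 4),
           "Bob", "Bob", "Alice", "Alice", "Lisa", "Lisa")
treated <- c(rep(c(0, 0, 1, 1), 4), rep(0, 6))
contrast <- rep(c(0, 1), 11)
val <- c(rep(1:4, 4), 6:11)
roi <- c(rep("A", 16), rep("B", 6))
myFrame <- data.frame(ExperimentID = expID, isTreated = treated,
                      isContrast = contrast, Value = val, ROI = roi)

# Split by ROI, then join on the keys so each row holds a matched (A, B) pair
a <- subset(myFrame, ROI == "A")
b <- subset(myFrame, ROI == "B")
paired <- merge(a, b, by = c("ExperimentID", "isTreated", "isContrast"),
                suffixes = c(".A", ".B"))

# Rows without a partner (e.g. Joe, or any treated case) are dropped by the inner join
paired[c("ExperimentID", "isTreated", "isContrast", "Value.A", "Value.B")]
```

Because the join is done on the keys, the row order of the original data no longer matters.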
I would like to have the frequencies of each level of a categorical variable (row vector) denoting the ecological type (3 levels: H,F,T) of a set of 93 herbaceous plants, for the observed species present (=1), conditional on sites (3 levels: A,B,C), habitats (4 levels: 1,2,3,4) and years (3 levels: 1,2,3).
I know this can be done with tapply(), but the messy part is the logical operation linking the levels of the categorical variable (H,F,T) to the species present (=1) across all species, conditional on the combination of the column factors.
This could be summarized by a 12 x 3 contingency table indicating the number of species of each ecological type (3) per site (3) and habitat (4).
Example of my data (each habitat contains 20 lines): for each species (Sp1 to Sp93), 0 for absent and 1 for present. The vector "type" contains the ecological type of each species.
Site,Habitat,Year,Sp1,Sp2,Sp3,Sp4,Sp5,Sp6,...,Sp93
type= c(H,H,F,T,F,T,H,....T) # vector of length 93
Thank you in advance.
I hope this would help describe my data objects better.
data = read.csv(file = "Veg_06.csv", header = TRUE)
data = data[1:240, -c(1,4:7)]
Ilot #
Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ... each level has 4 sublevels (from "Site") with 20 lines each, adding up to 80 lines by levels.
Site #
Factor w/ 4 levels "Am","Av","CP","CS": 2 2 2 2 2 2 2 2 2 2 ...
Sp #
int [1:240] 0 0 0 0 0 0 0 0 0 0 ... either "0" or "1" for absence or presence of species.
veg #
Factor w/ 3 levels "H","F","T": 3 3 2 2 3 1 2 1 2 1 ... categorical factor indicating type of species.
First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.
It sounds like fundamentally, your data is a collection of year, site, habitat, quadrant(? maybe line, not sure from the description), species with the observation point being that species was observed in that site, habitat, quadrant, and year. For simplicity, a row is present if the species is present.
In addition, there's the concept of type, which is associated with each species.
Analyzing and contingency table
Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.
> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
species quadrant habitat site year
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 1
4 4 1 1 1 1
5 5 1 1 1 1
6 6 1 1 1 1
And let's take a small sample and a large sample
> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]
We can use the ftable function to get this into the shape we want, a contingency table (here year and site in the rows and habitat in the columns, so 9x4), as
> ftable(habitat ~ year + site, data=d.small)
habitat 1 2 3 4
year site
1 1 0 0 1 0
2 0 0 1 1
3 0 1 1 1
2 1 2 1 1 0
2 1 1 0 2
3 0 0 1 0
3 1 2 0 0 1
2 0 1 0 1
3 0 0 0 0
This will count the same species twice if it occurs in two different quadrants of the site/habitat mixture. We can discard the quadrant and unique-ify to get the count across all of them
> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))
Transforming (tidying the source data)
To transform the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well)
> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1 1 1 2 3 0 1 0 0 1
Here I'm making assumptions about what your data look like that seem reasonable
> library(tidyr)
> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
year site habitat quadrant species present
2 1 1 2 3 sp2 1
5 1 1 2 3 sp5 1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
year site habitat quadrant species
2 1 1 2 3 sp2
5 1 1 2 3 sp5
And now you can proceed with the analysis above.
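If tidyr is not available, the same wide-to-long step can be sketched in base R with stack. The names `id_cols`, `sp_cols`, `long` and `present_rows` are mine, and the sketch assumes the species columns all start with "sp".

```r
onerow <- data.frame(year = 1, site = 1, habitat = 2, quadrant = 3,
                     sp1 = 0, sp2 = 1, sp3 = 0, sp4 = 0, sp5 = 1)

id_cols <- c("year", "site", "habitat", "quadrant")
sp_cols <- grep("^sp", names(onerow), value = TRUE)

# Repeat the identifying columns once per species column
ids <- onerow[rep(seq_len(nrow(onerow)), times = length(sp_cols)), id_cols]
rownames(ids) <- NULL

# stack() puts all species columns into one `values` column plus an `ind` label
long <- cbind(ids, stack(onerow[sp_cols]))
names(long)[names(long) == "ind"] <- "species"

# Keep only the species that were present and drop the indicator
present_rows <- subset(long, values == 1, select = -values)
present_rows
```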
Merging in the species type data
Looking at your description a little closer, I think you also want to merge in a parallel vector of species type information.
> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=T)))
> merge(d.small, sp.type)
species quadrant habitat site year type
1 6 16 4 2 3 2
2 27 9 2 2 2 4
3 27 8 4 2 1 4
4 32 18 1 2 2 4
5 33 18 1 1 2 2
6 45 14 4 2 2 3
7 49 6 2 3 1 1
8 54 3 3 2 1 2
9 55 2 1 1 3 3
10 56 2 4 3 1 2
11 56 1 3 1 1 2
12 57 7 2 1 2 1
13 62 18 4 2 2 3
14 70 19 1 1 2 3
15 77 2 3 3 1 4
16 80 7 3 1 2 1
17 81 17 1 1 3 2
18 82 5 2 2 3 3
19 86 9 4 1 3 3
20 87 10 3 3 2 3
And now you can use the subset, unique, and ftable approach above to get the data you need.
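Putting the pieces together, the 12 x 3 table from the question (species types per site and habitat) could be sketched like this. The data are simulated in the same way as above, and the random assignment of H/F/T types is purely hypothetical.

```r
set.seed(100)

# Simulated presence records: one row per species observed in a quadrant
raw <- expand.grid(species = 1:93, quadrant = 1:20, habitat = 1:4,
                   site = 1:3, year = 1:3)
d <- raw[sample(nrow(raw), 1000), ]

# Hypothetical type lookup: one ecological type per species
sp.type <- data.frame(species = 1:93,
                      type = factor(sample(c("H", "F", "T"), 93, replace = TRUE)))

# Attach the type, then count each species once per site/habitat combination
m <- merge(d, sp.type)
u <- unique(m[c("species", "type", "habitat", "site")])

# Rows: site x habitat (12), columns: type (3)
tab <- ftable(type ~ site + habitat, data = u)
tab
```

Adding `year` to the `unique()` columns and to the right-hand side of the formula would condition on years as well.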
Assuming you had a dataframe with (among other things) the columns named: "sites", "habitats", "years":
dfrm <- data.frame( sites = sample( LETTERS[1:3], 20, replace=TRUE),
habitats= sample( factor(1:4), 20, replace=TRUE),
years = sample( factor(paste("Y",1:4, sep="_")), 20, replace=TRUE) )
Then this will give you an additional factor-mode column that encodes the various levels of each row.
dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years))
If you want to keep non-populated levels then do nothing else, since the default (drop=FALSE) retains every possible combination. If you do not want possible levels that have no instances, then use drop=TRUE. Then you can analyze these within individual levels of the three classification variables.
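A small sketch of the difference, using two hypothetical factors so the counts are easy to check by hand:

```r
sites <- factor(c("A", "A", "B"))
habitats <- factor(c("1", "2", "1"))

# Default (drop = FALSE): all 2 x 2 combinations are levels, even the unobserved B.2
all_levels <- levels(interaction(sites, habitats))

# drop = TRUE: only the combinations that actually occur are kept
used_levels <- levels(interaction(sites, habitats, drop = TRUE))

length(all_levels)
length(used_levels)
```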