Clustering in R

I used hclust to cluster my data and cutree to set the number of clusters to 3. Is there any way to examine each cluster? By examine I mean list the cases/observations that are in, e.g., the first cluster. I tried the basic functions I know, such as summary() and list(), but they don't seem relevant. Is there a function that can do this?
If not, cutree returns a vector giving the group/cluster that each observation belongs to, something like this:
1,3,1,2,3,3,1
which indicates that my first observation belongs to group 1, the second belongs to group 3, and so on.
I am thinking about how to extract the positions in that vector where, e.g., group = 1, so that it returns 1, 3 and 7, since observations 1, 3 and 7 belong to group 1.
Or do I need to use a loop to collect all the observations that belong to, e.g., group 1?
Is my question clear?

Does this help to get started?
nclust <- 10
cutreeout <- cutree(hclustOutput, nclust)
Add them as a new column to your dataframe
mydata$cluster <- cutreeout
How many observations are in each cluster?
table(mydata$cluster)
Then you can do more stuff to interpret your clusters, and/or study subsets of your data.
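To list the observations in a given cluster, no loop is needed: which() and split() work directly on the vector that cutree() returns. A minimal sketch, using the example vector from the question in place of real cutree() output:

```r
# Stand-in for cutree() output: one cluster id per observation
groups <- c(1, 3, 1, 2, 3, 3, 1)

# Positions of the observations assigned to cluster 1
in_first <- which(groups == 1)

# All clusters at once: a named list with one vector of positions per cluster
members <- split(seq_along(groups), groups)
members[["3"]]   # positions of the observations in cluster 3
```

With the cluster column added to the data frame as above, mydata[mydata$cluster == 1, ] returns the corresponding rows.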

This is a hint, not the answer. Here's an example of hierarchical clustering in R. You can use functions such as table() and ggplot() to see the observations per cluster.

Related

How to compute questionnaire total score and subscores by summing all and a selection of columns in R?

I'm new to R and I'm having a little issue. I hope some of you can help me!
I have a data.frame containing the answers to a single questionnaire.
The rows indicate the participants.
The first column indicates the participant ID.
The following columns contain the answers to each item of the questionnaire (item.1 up to item.20).
I need to create two new vectors:
total.score <- sum of all 20 values for each participant
subscore <- sum of some of the items
I would like to use a function, like a sum(A:T) in Excel.
Just to recap, I'm using R and not other software.
I already did it by summing each vector just with the symbol +
(data$item.1 + data$item.2 + data$item.3 etc...)
but it is a slow way to do it.
Answers range from 0 to 3 for each item, so I expect a total score ranging from 0 to 60.
Thank you in advance!!
Let's use as an example this data from a national survey with a questionnaire.
If you download the .csv file to your working directory, you can read it with:
data <- read.csv("2016-SpanishSurveyBreastfeedingKnowledge-AELAMA.csv", sep = "\t")
Item names are p01, p02, p03...
Imagine you want a subtotal of the first five questions (from p01 to p05)
You can give a name to the group:
FirstFive <- c("p01", "p02", "p03", "p04", "p05")
I think this is worthwhile because you will probably want to perform more tasks with this group (analysis, adding or deleting a question from the group...), and because it lets you give the group a meaningful name (for instance "knowledge", "attitudes"...).
And then create the subtotal variable:
data$subtotal1 <- rowSums(data[ , FirstFive])
You can check that the new variable is the sum
head(data[ , c(FirstFive, "subtotal1")])
(notice that FirstFive is not quoted, because it is an object outside data, but subtotal1 is quoted, because it is the name of a variable in data)
You can compute more subtotals and use them to compute a global score
You could maybe save some keystrokes if you know that these variables are columns 20 to 24:
names(data)[20:24]
And then sum them as
rowSums(data[ , c(20:24)])
I think this is what you asked for, but I would avoid doing it this way, as it is easier to make mistakes, which can be hard to detect.
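Applied to the column names in the question (item.1, item.2, ...), the same idea can be sketched as follows. The data frame and the choice of items in the subscale are made up for illustration:

```r
# Hypothetical data: 3 participants, 4 items scored 0 to 3
data <- data.frame(id     = 1:3,
                   item.1 = c(0, 1, 3),
                   item.2 = c(2, 1, 3),
                   item.3 = c(1, 0, 3),
                   item.4 = c(3, 2, 3))

# Select every item column by name pattern instead of typing each one
all_items <- grep("^item\\.", names(data), value = TRUE)
# Hypothetical subscale made of two of the items
sub_items <- c("item.1", "item.3")

data$total.score <- rowSums(data[, all_items])
data$subscore    <- rowSums(data[, sub_items])
```

If some answers are missing, rowSums(..., na.rm = TRUE) ignores them; whether that is appropriate depends on how the questionnaire is meant to be scored.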

Compute with the values in the table() function

I am new to R and stuck in computing the proportions of two values.
I got to this point with using the table() function
table(data$subscriptions, data$pickup)
The subscriptions data is divided into casual and registered users per station. Basically, I want to compute the proportion of casual users per station.
Should I be using tapply() to solve this?
Thankful for any help!
There is a function prop.table() that is called on the table to turn counts into proportions. So in your case try something like this:
tab <- table(data$subscriptions, data$pickup)
prop.table(tab, 2)
Here 2 is the margin over which the proportions are calculated. 2 means columns in your case, so the proportions within each column (each pickup station) sum to 1.
Also see help(prop.table)
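A small worked example may make the margin argument clearer. The data below is invented; only the column names subscriptions and pickup come from the question:

```r
# Hypothetical trips: user type and pickup station
data <- data.frame(
  subscriptions = c("casual", "registered", "registered", "registered",
                    "casual", "registered"),
  pickup        = c("A", "A", "A", "A", "B", "B")
)

tab   <- table(data$subscriptions, data$pickup)
props <- prop.table(tab, 2)       # proportions within each station (column)

casual_share <- props["casual", ] # share of casual users per station
```

Each column of props sums to 1, so the "casual" row is the proportion of casual users at each station.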

How to calculate column mean at intervals of row values in R?

I have a data frame with 253 rows (locations on a chromosome in Mbps) and 1 column (allele score at each location). I need to produce a data frame that contains the mean allele score for every 0.5 Mbps along the chromosome. Please help with R code that can do this. Thanks.
The picture in this case is adequate to construct an answer but not adequate to support testing. You should learn to post data in a form that doesn't require re-entry by hand. (That's why you are accumulating negative votes.)
The basic R strategy would be to use cut to create a grouping variable and then use a loop construct to accumulate and apply the mean function. Presumably this is in a dataframe which I will assume is named something specific like my_alleles:
tapply(my_alleles$Allele_score,   # act on this vector
       # in groups defined by this factor
       cut(my_alleles$Location,
           breaks = seq(0, max(my_alleles$Location), by = 0.5)),
       # with this function
       FUN = mean)
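A self-contained variant of the same approach, as a sketch on invented data. The column names Location and Allele_score are the assumed ones from above. It extends the last break past the maximum location, since with breaks that stop below the maximum, cut() assigns NA to anything beyond the last break:

```r
# Hypothetical data: locations and an allele score at each location
my_alleles <- data.frame(
  Location     = c(0.1, 0.4, 0.6, 0.9, 1.2, 1.3),
  Allele_score = c(1, 3, 2, 4, 5, 7)
)

# Round the upper limit up to the next multiple of 0.5 so every row gets a bin
upper <- ceiling(max(my_alleles$Location) / 0.5) * 0.5
bins  <- cut(my_alleles$Location,
             breaks = seq(0, upper, by = 0.5),
             include.lowest = TRUE)   # keep a location of exactly 0

bin_means <- tapply(my_alleles$Allele_score, bins, mean)

# Return the result as a data frame, as the question asked
result <- data.frame(interval   = names(bin_means),
                     mean_score = as.vector(bin_means))
```

Intervals that contain no locations come back as NA in bin_means; they can be dropped with na.omit(result) if they are not wanted.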

How to group data to minimize the variance while preserving the order of the data in R

I have a data frame (760 rows) with two columns, named Price and Size. I would like to put the data into 4/5 groups based on price that would minimize the variance in each group while preserving the order Size (which is in ascending order). The Jenks natural breaks optimization would be an ideal function however it does not take the order of Size into consideration.
Basically, I have data similar to the following (with more rows):
Price=c(90,100,125,100,130,182,125,250,300,95)
Size=c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata=data.frame(Size,Price)
I would like to group the data so as to minimize the variance of Price within each group while respecting 1) the Size value: for example, the first two prices, 90 and 100, cannot be in different groups since they have the same Size, and 2) the order of Size: for example, if group one includes observations 1-2 and group two includes observations 3-9, observation 10 can only go into group two or three.
Can someone please give me some advice? Maybe there is already some such function that I can’t find?
Is this what you are looking for? With the dplyr package, grouping is quite easy. The %>% operator can be read as "then do", so you can chain multiple actions if you like.
See http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html for further information.
library("dplyr")
Price <- c(90,100,125,100,130,182,125,250,300,95)
Size <- c(10,10,10.5,11,11,11,12,12,12,12.5)
mydata <- data.frame(Size, Price) %>%   # "then"
  group_by(Size)                        # group data by Size column
mydata_mean_sd <- mydata %>%            # "then"
  summarise(mean = mean(Price), sd = sd(Price))   # grouped mean and sd, for illustration
I had a similar problem with optimally splitting a day into 4 "load blocks". Adjacent time periods must stick together, of course.
It is not an elegant solution, but I wrote my own function that first splits a sorted series at specified break points, then calculates the sum of squared deviations from the class means, sum(SDCM), for those break points (the criterion underlying the Jenks approach, as described on Wikipedia).
I then iterated through all valid combinations of break points and selected the set that produced the minimum sum(SDCM).
This would quickly become unmanageable as the number of possible break point combinations increases, but it worked for my data set.
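The brute-force search described above could look roughly like this. It is a sketch, not the answerer's actual function: the helper name sdcm and the choice of k = 3 groups are assumptions, and it uses the small example data from the question.

```r
Price <- c(90, 100, 125, 100, 130, 182, 125, 250, 300, 95)
Size  <- c(10, 10, 10.5, 11, 11, 11, 12, 12, 12, 12.5)   # already ascending

# Sum over groups of squared deviations from each group's mean
sdcm <- function(x, g) sum(tapply(x, g, function(v) sum((v - mean(v))^2)))

# A break may only fall after a position where Size changes,
# so observations with the same Size always stay together
candidates <- which(diff(Size) != 0)
k <- 3                                   # number of groups (assumption)

combos <- combn(candidates, k - 1)       # every valid set of k - 1 breaks
scores <- apply(combos, 2, function(b) {
  g <- findInterval(seq_along(Price), b + 1)   # contiguous group ids
  sdcm(Price, g)
})

best   <- combos[, which.min(scores)]    # breaks fall after these positions
groups <- findInterval(seq_along(Price), best + 1) + 1
```

The number of combinations is choose(length(candidates), k - 1), which is why this only stays practical for modest numbers of candidate breaks and groups.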

R: Function: generate and save multiple matrices based on multiple conditions

I am a new R user and an inexperienced coder, and I have a data handling problem. Hopefully someone can help:
I have a data.frame with 3 columns (firm, year, class) and about 50.000 rows. I want to generate and store for every firm a (class x year) matrix with class counts as the elements in the matrix. Every matrix would be automatically named something like firm.name and stored so that I can use them afterwards for computations. Ideally, I'd be able to change the simple class counts into a function of values in columns 4 and 5 (backward and forward citations)
I am looking at 40 firms, 30 years, and about 1500 classes (so many firm-year-class counts are zero).
I realise I can get most of what I need (for counts) by simply using table(class,year,firm) as these columns have the same length. However, I don't know how to either store or access the matrices this function generates...
Any help would be greatly appreciated!
Simon
So, your question is how to deal with a table object?
Example:
#note the assignment operator
mytable <- with(ChickWeight, table(cut(weight, c(0,100,200,Inf)), Diet, Chick))
#access the data for the first chick
mytable[,,1]
#turn the table object into a data.frame
as.data.frame(mytable)
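For the firm/year/class case in the question, the slices of the three-way table can also be pulled out by name and stored in a named list, one matrix per firm. A minimal sketch on invented data (the data frame df and its values are hypothetical):

```r
# Hypothetical stand-in for the firm / year / class data frame
df <- data.frame(
  firm  = c("A", "A", "A", "B", "B"),
  year  = c(2001, 2001, 2002, 2001, 2002),
  class = c("x", "y", "x", "x", "x")
)

# Three-way table of counts: class x year x firm
counts <- with(df, table(class, year, firm))

# One class-by-year matrix per firm, stored in a list named after the firms
firms   <- dimnames(counts)$firm
by_firm <- lapply(firms, function(f) counts[, , f])
names(by_firm) <- firms

by_firm[["A"]]             # class x year counts for firm A
by_firm$B["y", "2001"]     # a single count, indexed by class and year labels
```

Keeping the matrices in one list avoids creating a separately named object per firm, and lapply() or sapply() can then run the same computation over all firms.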
