match and add the cluster number to the original data - r

I am using the standard approach for a hierarchical clustering project:
mydata.dtm <- TermDocumentMatrix(mydata.corpus)
mydata.dtm2 <- removeSparseTerms(mydata.dtm, sparse=0.98)
mydata.df <- as.data.frame(inspect(mydata.dtm2))
mydata.df.scale <- scale(mydata.df)
d <- dist(mydata.df.scale, method = "euclidean") # distance matrix between terms
fit <- hclust(d, method="ward") # "ward.D" in current R releases
groups <- cutree(fit, k=10)
groups
congestion cough ear eye fever flu fluzonenon medicare painpressure physical pink ppd pressure
1 2 3 4 5 6 5 5 5 7 4 8 5
rash screening shot sinus sore sports symptoms throat uti
5 5 6 1 9 7 5 9 10
What I want is to put the group numbers back into a new column in the original data.
I've looked at "approximate string matching within single list - r".
Because the df here is a document matrix, what I get after df <- t(data.frame(mydata.df.scale, cutree(fit, k=10))) is a matrix like
df[1:5,1:5]
congestion cough ear eye fever
[1,] 0 0 0 0 0
[2,] 0 0 0 0 0
[3,] 0 0 0 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 0
Since eye has group number 4 (see the output above), I want to add the number 4 to the new column in the 4th row.
Note that in this case a single document can be mapped to two items in the same group:
df[23,17:21]
sinus sore sports symptoms throat
0 1 0 0 1

Instead of putting the numbers back directly, I use the 0-1 matrix:
label_back <- t(data.frame(mydata.df, cutree(fit, k=10)))
row.names(label_back) <- NULL
#label_back <- label_back[1:(nrow(label_back)-1),] # drop the last row, which holds the appended group numbers
groups.df <- as.data.frame(groups)
groups.df$label <- rownames(groups.df)
for (i in seq_along(colnames(label_back))) {
  ind <- which(colnames(label_back)[i] == groups.df$label) # match names and return index
  label_back[,i] <- groups.df$groups[ind] * label_back[,i] # multiply the 0-1 indicator by the group number
}
Then I find the max value in each row, because some rows contain more than one value:
data_group <- rep(0, nrow(data))
for (i in 1:nrow(data)) {
  data_group[i] <- max(unique(label_back[i,]))
}
data$group <- data_group
I am looking for a more elegant way.
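For a more compact alternative, here is a loop-free sketch. It assumes the rows of data line up with the documents (i.e. the columns of the term-document matrix) and reuses the groups vector from cutree():
doc_term <- t(as.matrix(mydata.df)) # documents in rows, terms in columns
# multiply each term column by that term's group number (zeros stay zero)
weighted <- sweep(doc_term, 2, groups[colnames(doc_term)], `*`)
# one group per document; rows matching several terms keep the max, as above
data$group <- apply(weighted, 1, max)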

Related

incidence matrix for bipartite groups with multiple joins

I was wondering if there's a fast way to get an incidence matrix for such a problem. I've got two data frames with three columns each (the join keys):
df1 <- data.frame(K1=c(1,1,0,1,3,2,2),K2=c(1,2,1,0,2,0,1),K3=c(0,0,3,2,1,3,0))
df2 <- data.frame(K1=c(1,2,0,3),K2=c(0,1,2,0),K3=c(2,0,3,1))
and I need to obtain the corresponding incidence matrix
# IM:
# 1 2 3 4
# 1 1 1 0 0
# 2 1 0 1 0
# 3 0 1 1 0
# 4 1 0 0 0
# 5 0 0 1 1
# 6 0 1 1 0
# 7 0 1 0 0
where entry (i, j) is set to 1 if row i of df1 and row j of df2 share the same non-zero value in the corresponding key column.
I would do it with nested loops:
m <- matrix(0, nrow(df1), nrow(df2)) # the incidence matrix has to be allocated first
for (j in seq_len(nrow(df2)))
  for (k in seq_len(ncol(df2))) {
    if (df2[j,k])
      m[which(df1[,k] == df2[j,k]), j] <- 1
  }
but it's a C-style approach and maybe there's something faster in R. Do you have any other ideas? Besides, when the data frames are quite big (around 50k and 20k rows), I cannot allocate the matrix as it seems too big.
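A vectorized sketch (my own suggestion, not from the original post): compare each key column of the two data frames with outer() and OR the per-column hits together; zero keys never count as a match.
# one outer() comparison per key column, combined elementwise with |
m <- Reduce(`|`, lapply(names(df1), function(k)
  outer(df1[[k]], df2[[k]], function(a, b) a == b & b != 0)))
IM <- 1 * m # logical -> 0/1 incidence matrix
For the 50k x 20k case, collecting the matching (i, j) pairs per key and building the result with Matrix::sparseMatrix() would avoid allocating the dense matrix at all.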

R network graphs

I have data where the X column is the review and the remaining columns are words that most reviews contain. Is it possible to create a graph where nodes would be reviews and edges would be words?
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0
Another idea is to cluster the reviews in the graph according to the words used and their frequency.
Thank you very much. Any help is appreciated.
Here are three approaches to explore the relationships in your data:
par(mfrow=c(1,3))
# 1. two-mode network (reviews + words)
library(igraph)
set.seed(1)
g <- graph_from_data_frame(subset(reshape2::melt(df, 1), !!value, -value)[2:1])
V(g)$type <- bipartite_mapping(g)$type
plot(g, layout = layout_as_bipartite(g)[, 2:1], vertex.color = V(g)$type + 1L)
# 2. just the reviews: words shared by 2+ reviews become labeled edges
library(reshape2)
lst <- with(subset(melt(df, 1), !!value)[2:1], split(X, variable))
lst <- lst[lengths(lst) > 1]                       # drop words used by a single review
lst <- lapply(lst, function(x) t(combn(x, m = 2))) # all review pairs per word
g <- graph_from_edgelist(do.call(rbind, lst), directed = FALSE)
E(g)$label <- rep(names(lst), sapply(lst, nrow))
plot(g)
# 3. review clustering on binary word profiles
library(magrittr) # needed for the %>% pipe
df[-1] %>% dist(method = "binary") %>% hclust %>% plot
Output: three side-by-side plots (the two-mode network, the review network with word-labeled edges, and the clustering dendrogram).
Data:
df <- read.table(header=T, text="
X action age ago amazing american art author back bad beautiful beginning
1 1 1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 1 0 0
4 1 3 0 0 0 2 1 0 0 0 0
5 0 0 1 0 1 0 0 2 0 1 0")
PS: There may be a shortcut to no. 2 (reviews as nodes & words as edges) - feel free to add it.
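One candidate shortcut (an assumption on my part, not part of the original answer) is igraph's bipartite_projection(), run right after step 1 while g is still the two-mode graph:
proj <- bipartite_projection(g) # collapse the two-mode graph
plot(proj$proj2)                # whichever of proj1/proj2 holds the reviews
Unlike the edge-list construction in step 2, the projection does not retain which word produced each edge.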
With your example data one could create a graph consisting of nodes (each review) and edges (two reviews are connected when they use the same word). You could also weight the edges according to how many words two reviews have in common, and use different shapes/colors of the edges to represent the different words.
There are several ways to create a graph with your data. First, you can create an adjacency matrix, where rows and columns each represent a review. The adjacency matrix only records whether there is a common word between two reviews: if two reviews share a common word it takes the value 1, otherwise zero.
The adjacency matrix would look similar to this, where the letters denote column and row labels:
Review A B C D
A 0 1 1 1
B 1 0 0 1
C 1 0 0 1
D 1 1 1 0
With the command graph_from_adjacency_matrix() in the igraph package you could then create a graph and use the plot functions.
Second, you could also create a weight matrix, which counts how many words are shared between two reviews:
Review A B C D
A 0 2 3 1
B 2 0 0 2
C 3 0 0 2
D 1 2 2 0
Using the same command, graph_from_adjacency_matrix( , weighted=TRUE), you could create a graph from that matrix.
You can find a good introduction to network analysis with the igraph package here: http://kateto.net/networks-r-igraph
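A minimal sketch of both calls, using the toy 4x4 matrices above (illustrative values, not computed from the review data):
library(igraph)
# unweighted: 1 means the two reviews share at least one word
adj <- matrix(c(0,1,1,1, 1,0,0,1, 1,0,0,1, 1,1,1,0), 4, 4,
              dimnames = list(LETTERS[1:4], LETTERS[1:4]))
g1 <- graph_from_adjacency_matrix(adj, mode = "undirected")
# weighted: entries count how many words the two reviews share
w <- matrix(c(0,2,3,1, 2,0,0,2, 3,0,0,2, 1,2,2,0), 4, 4,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))
g2 <- graph_from_adjacency_matrix(w, mode = "undirected", weighted = TRUE)
plot(g2, edge.width = E(g2)$weight)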
Third, you could specify the graph from node and edge data frames.
The node data frame would contain a short id for each node, and perhaps the name and any other information you may want to include about the nodes:
id long_review_name
R1 A
R2 B
R3 C
R4 D
The edge data frame collects all the information about the connections between two reviews. First, and most importantly, it records all edges in the columns from and to. Further, it can contain the frequency as a weight on the edges, while type denotes which word connects the two nodes:
from to weight type
R1 R2 1 american
R1 R2 1 age
R1 R3 2 american
R1 R3 1 age
R1 R4 1 age
R2 R4 2 american
To turn the edge and node data frames into a graph, use the command graph_from_data_frame(d=links, vertices=nodes).
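A self-contained sketch with the toy tables above (the object names links and nodes are illustrative, not from the original answer):
library(igraph)
nodes <- data.frame(id = c("R1","R2","R3","R4"),
                    long_review_name = c("A","B","C","D"))
links <- data.frame(from   = c("R1","R1","R1","R1","R1","R2"),
                    to     = c("R2","R2","R3","R3","R4","R4"),
                    weight = c(1, 1, 2, 1, 1, 2),
                    type   = c("american","age","american","age","age","american"))
g <- graph_from_data_frame(d = links, vertices = nodes, directed = FALSE)
plot(g, edge.width = E(g)$weight, edge.label = E(g)$type)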

In R: Sample from a "totals" column, then subtract 1 from sampled column, store value, and resample

I am definitely not an R coder but am trying to stumble my way through this code. I have a dataframe that looks like this--with 200 rows (just 8 shown here).
Ind.ID V1 V2 V3 V4 V5 V6 V7 Captures
1 1 0 0 1 1 0 0 0 2
2 2 0 0 1 0 0 0 1 2
3 3 1 1 0 1 1 0 1 5
4 4 0 0 1 1 0 0 0 2
5 5 1 0 0 0 0 1 0 2
6 6 0 1 1 0 0 0 0 2
7 7 0 0 1 1 1 0 0 3
8 8 1 0 0 0 1 0 0 2
I am trying to sample from the Captures column (which is the sum of the row) and output the Ind.ID value. If there is a 0 in the Captures column, I want it to subtract 1 from i (i=i-1) and resample--to ensure that I get the correct number of samples. I also want to then subtract 1 from the sampled column (i.e., decrease the Captures value by 1 if it was sampled), and then resample. I am trying to get 400 samples (I think the current code will get me only 200, but I can't figure out how to get 400).
I want my output to be:
23
45
197
64
.....
Here's my code:
sess1 <- numeric(200) # create a place for output
for (i in 1:length(dep.pop$Captures)) {
  if (dep.pop[i, 'Captures'] != 0) { # if the value of Captures is not 0, sample and
    sample(dep.pop$Captures, size=1, replace=TRUE) # want to resample the row if Captures > 1
    # code here to decrease the value of the sampled Captures column by 1. create new vector for resampling?
  }
  else {
    if (dep.pop[i, 'Captures'] == 0) { # if the value of Captures = 0
      i <- i - 1 # decrease the value of i by 1 to ensure 200 samples
      sample(dep.pop$Captures, size=1, replace=TRUE) # and resample
    }
    # sess1 <- # store the value from a different column (ID column) that represents the sampled row
  }
}
Thanks!
Assuming sum(dep.pop$Captures) is at least 400, the following code may meet your needs: it expands each Ind.ID by its capture count and samples without replacement, which is equivalent to decreasing Captures by 1 each time an individual is sampled:
sample(rep(dep.pop$Ind.ID, times=dep.pop$Captures), size=400)
If you wish to sample with replacement (so you do not need to worry about the total number of captures) but still want to use the number of captures per individual id as sampling weights, then perhaps
sample(dep.pop$Ind.ID, size=400, replace=TRUE, prob=dep.pop$Captures)
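A quick check on the eight rows shown above (toy data, so the sample size is scaled down from 400; the toy frame only has 20 captures in total):
dep.pop <- data.frame(Ind.ID = 1:8, Captures = c(2, 2, 5, 2, 2, 2, 3, 2))
# without replacement: each id can appear at most Captures times
sample(rep(dep.pop$Ind.ID, times = dep.pop$Captures), size = 8)
# with replacement, weighted by Captures
sample(dep.pop$Ind.ID, size = 8, replace = TRUE, prob = dep.pop$Captures)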

R: clustering documents

I've got a DocumentTermMatrix that looks as follows:
artikel naam product personeel loon verlof
doc 1 1 1 2 1 0 0
doc 2 1 1 1 0 0 0
doc 3 0 0 1 1 2 1
doc 4 0 0 0 1 1 1
In the tm package, it's possible to calculate the Hamming distance between 2 documents. But now I want to cluster all the documents that have a Hamming distance smaller than 3.
So here I would like cluster 1 to be documents 1 and 2, and cluster 2 to be documents 3 and 4. Is there a way to do that?
I saved your table to myData:
myData
artikel naam product personeel loon verlof
doc1 1 1 2 1 0 0
doc2 1 1 1 0 0 0
doc3 0 0 1 1 2 1
doc4 0 0 0 1 1 1
Then I used the hamming.distance() function from the e1071 library. You can use your own distances (as long as they are in matrix form):
library(e1071)
distMat <- hamming.distance(myData)
This was followed by hierarchical clustering using the "complete" linkage method, so that the maximum distance within one cluster can be controlled later:
dendrogram <- hclust(as.dist(distMat), method="complete")
Then select groups according to the maximum distance between points within a group (here, maximum = 5):
groups <- cutree(dendrogram, h=5)
Finally plot the results:
plot(dendrogram) # main plot
points(c(-100, 100), c(5,5), col="red", type="l", lty=2) # add cutting line
rect.hclust(dendrogram, h=5, border=c(1:length(unique(groups)))+1) # draw rectangles
Another way to see the cluster membership for each document is with table:
table(groups, rownames(myData))
groups doc1 doc2 doc3 doc4
1 1 1 0 0
2 0 0 1 1
So the 1st and 2nd documents fall into one group, while the 3rd and 4th fall into another.

How do I calculate (age-specific) mortality rates in R?

Using data in the following form, in which ways can I calculate the (age-specific) mortality rate in the R programming language?
head(data)
## age gender zone Class misc bonus duration death cost
## 1 0 M 1 4 12 1 0.1753 0 0
## 2 4 M 3 6 9 1 0.0000 1 0
## 3 5 F 3 3 18 1 0.4548 0 0
## 4 5 F 4 1 25 1 0.1726 0 0
## 5 6 F 2 1 26 1 0.1808 0 0
## 6 9 F 3 3 8 1 0.5425 0 0
That is, for each age I want to calculate the number of deaths and divide by the total number of exposed individuals in that particular age. I tried the following:
n <- length(data$age)
rate <- c(1:n)
for (i in 1:n) {
  rate[i] <- sum(subset(data, age == i)$death) / length(subset(data, age == i))
}
But this was useless; obviously not all ages from 1 to n are present in the dataset. I am looking for a program, in the spirit of the above, that will do the job.
Because the variable death only takes on the values zero and one, you can calculate the age-specific mortality in one line of code:
tapply(data$death, data$age, mean)
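For illustration, a minimal sketch of what this returns on the six example rows above (since death is 0/1, the mean within each age is the death rate):
data <- data.frame(age = c(0, 4, 5, 5, 6, 9), death = c(0, 1, 0, 0, 0, 0))
tapply(data$death, data$age, mean)
# 0 4 5 6 9
# 0 1 0 0 0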
You can get most of the way there with table(). If we assume that all those not dying are present for 100% of the time (a year, say), and that those dying are present for 1/2 of the time, then we have enough info to calculate exposure from these data. I'm not sure what your duration column is, but you haven't really described the data.
# cheap version of your data:
DF <- data.frame(age = c(0,4,5,5,6,9), death = c(0,1,0,0,0,0))
(DAT <- table(DF$death,DF$age))
0 4 5 6 9
0 1 0 2 1 1
1 0 1 0 0 0
# weight these two rows for components of exposure:
Exposure <- colSums(DAT * c(1,.5))
# rates are the ratio of death counts in each age to exposure to risk in each age:
Rates <- DAT["1",] / Exposure
If you then go on to calculate a lifetable, this is the so-called Mx or mx column.
