Manipulating cutree object in R to segment original dataframe - r

I'm using R's built-in correlation matrix and hierarchical clustering methods to segment daily sales data into 10 clusters. Then, I'd like to create agglomerated daily sales data by cluster. I've got as far as creating a cutree() object, but am stumped on extracting only the column names in the cutree object where the cluster number is 1, for example.
For simplicity's sake, I'll use the EuStockMarkets data set and cut the tree into 2 segments; bear in mind that I'm working with thousands of columns here, so the solution needs to be scalable:
data=as.data.frame(EuStockMarkets)
corrMatrix<-cor(data)
dissimilarity<-round(((1-corrMatrix)/2), 3)
distSimilarity<-as.dist(dissimilarity)
hirearchicalCluster<-hclust(distSimilarity)
treecuts<-cutree(hirearchicalCluster, k=2)
Now I get stuck. I want to extract only the column names from treecuts where the cluster number is equal to 1, for example. But the object that cutree() returns is not a data frame, which makes subsetting difficult. I've tried to convert treecuts into a data frame, but R does not create a column for the row names; all it does is coerce the numbers into a single column named treecuts.
I would want to do the following operations:
....Code that converts treecuts into a data frame called "treeIDs" with the
columns "Index" and "Cluster"......
cluster1Columns<-colnames(treeIDs[Cluster==1, ])
cluster1DF<-data[ , (colnames(data) %in% cluster1Columns)]
rowSums(cluster1DF)
...and voila, I'm done.
Thoughts/suggestions?

Here is the solution:
names(treecuts)[treecuts == 1]
[1] "DAX" "SMI" "FTSE"
If you also want, say, cluster 2 (or several clusters at once), you can use %in%:
names(treecuts)[treecuts %in% c(1, 2)]
[1] "DAX" "SMI" "CAC" "FTSE"
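The full workflow from the question can be sketched as follows. This is only one possible approach: it relies on the fact that treecuts is a named integer vector with one entry per column of data (not per row), and the names colsByCluster and clusterSums are introduced here purely for illustration.

```r
# Rebuild the objects from the question
data <- as.data.frame(EuStockMarkets)
hc <- hclust(as.dist(round((1 - cor(data)) / 2, 3)))
treecuts <- cutree(hc, k = 2)

# One row per original column: its name and its cluster number
treeIDs <- data.frame(Index = names(treecuts), Cluster = unname(treecuts))

# Column names grouped by cluster (a list with one element per cluster)
colsByCluster <- split(names(treecuts), treecuts)

# Row sums of the original data within each cluster;
# drop = FALSE keeps single-column clusters as data frames
clusterSums <- sapply(colsByCluster,
                      function(cols) rowSums(data[, cols, drop = FALSE]))
```

clusterSums then has one row per day and one column per cluster, so the same code should scale to thousands of columns and any value of k without naming clusters individually.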

Why not just
data$clusterID <- treecuts
then subset data as usual?

Related

How to dynamically create and name data frames in a for loop

I am trying to generate data frame subsets for each respondent in a data frame using a for loop.
I have a large data frame with columns titled "StandardCorrect", "NameProper", "StartTime", "EndTime", "AScore", and "StandardScore" and several thousand rows.
I want to make a subset data frame for each person's name so I can generate statistics for each respondent.
I tried using a for loop
for(name in 1:length(NamesList)){ name <- DigiNONA[DigiNONA$NameProper == NamesList[name], ] }
NamesList is just a list containing all the levels of NameProper (which is a factor variable).
All I want the loop to do on each iteration is generate a new data frame named "NamesList[name]" that contains the subset of the main data frame where NameProper matches the name in the list for that iteration.
This seems like it should be simple; I just can't figure out how to get R to dynamically generate data frames with different names on each iteration.
Any advice would be appreciated, thank you.
The advice to use assign for this purpose is technically feasible, but it is widely discouraged by experienced users of R. Instead, create a single list with named elements, each of which contains the data for a single individual. That way you don't need to keep a separate data object with the names of the resulting objects for later access.
named_Dlist <- setNames( split( DigiNONA, DigiNONA$NameProper),
NamesList)
This would allow you to access individual dataframes within the named_Dlist object:
named_Dlist[[ NamesList[1] ]] # The dataframe with the first person in that NamesList vector.
It's probably better to use the term list only for true R lists and not for atomic character vectors.
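As a minimal sketch of how such a list can then be used for per-respondent statistics: the small DigiNONA data frame below is a made-up stand-in with only two of the question's columns, and byName and meanScores are illustrative names.

```r
# Hypothetical stand-in for the real DigiNONA data frame
DigiNONA <- data.frame(
  NameProper = factor(c("Ann", "Bob", "Ann", "Cy", "Bob")),
  AScore     = c(10, 7, 8, 9, 5)
)

# split() already names each element after the factor level
byName <- split(DigiNONA, DigiNONA$NameProper)

# One statistic per respondent, returned as a named numeric vector
meanScores <- vapply(byName, function(d) mean(d$AScore), numeric(1))
```

Individual respondents are then available as byName[["Ann"]], and any further summary can be computed with another lapply or vapply over the same list.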

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
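Note that !colSums(x=='') keeps only columns that contain no empty strings at all, which is fine for the example because its columns are either completely filled or completely empty. If partially filled columns can occur, a slightly more explicit variant that drops a column only when every entry is empty might look like this (trimmed is an illustrative name, and stringsAsFactors = FALSE is set only to make the comparison with "" unambiguous):

```r
# The example data from the question
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))),
                      stringsAsFactors = FALSE)

# For each group, keep the columns that are not entirely empty
trimmed <- lapply(split(myDF, myDF$V1), function(x) {
  keep <- vapply(x, function(col) !all(col == ""), logical(1))
  x[, keep, drop = FALSE]
})
```

The result is a list with one trimmed data frame per level of V1, ready for a further lapply that reshapes each element.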

how to make groups of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to compute a specific statistic for each of these groupings and then compare which subset produces the better statistic. My problem is how to produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time (order does not matter, so these are combinations rather than permutations).
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
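The hand-written index matrix above can probably be avoided by calling combn once for every subset size and concatenating the results. The following base-R sketch assumes a plain data frame of simulated values; vars, subsets, df and subsetMeans are illustrative names, and the mean is only a placeholder for whatever statistic is being compared.

```r
vars <- paste0("V", 1:8)

# All non-empty subsets: combinations of size 1, 2, ..., 8, flattened into one list
subsets <- unlist(
  lapply(seq_along(vars), function(m) combn(vars, m, simplify = FALSE)),
  recursive = FALSE
)

# Simulated data standing in for the real data frame
set.seed(1)
df <- as.data.frame(matrix(rnorm(100 * 8), ncol = 8,
                           dimnames = list(NULL, vars)))

# One statistic per subset of columns
subsetMeans <- vapply(subsets,
                      function(v) mean(unlist(df[, v, drop = FALSE])),
                      numeric(1))
```

Because the empty subset is never generated, there is no null first element to skip, and length(subsets) is 2^8 - 1 = 255.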

In R, how do I join and subset SpatialPolygonsDataFrame?

I'm trying to figure out my way on how to perform (so easy in GIS) operations in R.
Let's take some example polygon data set from spdep package
library("spdep")
c <- readShapePoly(system.file("etc/shapes/columbus.shp", package="spdep")[1])
plot(c)
I've managed to figure out that I can choose polygons with logical statements using subset. For instance:
cc <- subset(c, c@data$POLYID<5)
plot(cc)
Now, let's suppose I have another data frame that I'd like to join to my spatial data:
POLYID=1:9
TO.LINK =101:109
link.data <- data.frame(POLYID=POLYID, TO.LINK=TO.LINK)
Using these two datasets, how can I get two spatial data frames:
First, consisting of polygons that have their ID in the second data frame
Second, consisting of the opposite set - polygons that do not exist in the second data frame.
How could I get to this point?
This will probably work. First, you want your relevant IDs.
myIDs <- link.data$POLYID
Then, use subset as you've pointed out:
subset(c, POLYID %in% myIDs)
subset(c, !(POLYID %in% myIDs))
Note that this assumes that your first dataframe, c, also has a relevant column called POLYID.
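The subsetting above does not yet attach TO.LINK to the polygons. A minimal sketch of the join logic, shown on a plain data frame so that it runs without any spatial packages: attrs is a hypothetical stand-in for the attribute table c@data, and inLink, matched and unmatched are illustrative names.

```r
# Hypothetical stand-in for the attribute table of the spatial object
attrs <- data.frame(POLYID = 1:12)
link.data <- data.frame(POLYID = 1:9, TO.LINK = 101:109)

# TRUE for rows whose ID appears in link.data
inLink <- attrs$POLYID %in% link.data$POLYID

# Look up TO.LINK by ID; IDs without a match get NA, and row order is preserved
attrs$TO.LINK <- link.data$TO.LINK[match(attrs$POLYID, link.data$POLYID)]

matched   <- attrs[inLink, ]
unmatched <- attrs[!inLink, ]
```

Using match() rather than merge() keeps the rows in their original order, which matters because the attribute table has to stay aligned with the polygons. The same logical vector should also be usable to subset the spatial object itself, assuming it supports row-style indexing such as c[inLink, ].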

Binning column and getting corresponding values from other column in R

I have two columns of paired values in a data frame. I want to bin the data in one column using the cut2 function from the Hmisc package so that there are at least, say, 25 data points in each bin. However, I also need the corresponding values from the other column. Is there a convenient way to do that in R? The column I have to bin is B.
A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
.........
It's not clear what you mean by wanting the "corresponding values of the other column". The first part is easy to accomplish using the g (# of groups) argument:
dfrm$Agrp <- cut2(dfrm$A, g=trunc(length(dfrm$A)/25) )
You can aggregate means or medians of B within Agrp's using tapply or ave or one of the Hmisc summary functions. There are several worked examples in one of today's questions: How to get Summary statistics by group as well as many other examples of using those functions or aggregate or the pkg:plyr functions.
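As a rough illustration of the aggregation step, here is a base-R sketch using the six rows shown in the question. To stay free of package dependencies it uses cut() with quantile breaks as a stand-in for cut2(), which is an assumption of this example and not part of the answer; groupMeans and groupMeansDF are illustrative names.

```r
dat <- data.frame(
  A = c(-10.834510, 11.012966, -16.491415, -14.485036, 2.629104, -3.597291),
  B = c(1.680173, 1.866603, 1.868667, 1.900002, 1.960929, 2.005348)
)

# Stand-in for cut2(): two bins split at the median of A
dat$Agrp <- cut(dat$A, breaks = quantile(dat$A, probs = c(0, 0.5, 1)),
                include.lowest = TRUE)

# Mean of B within each bin of A, as a named array ...
groupMeans <- tapply(dat$B, dat$Agrp, mean)

# ... or as a data frame with one row per bin
groupMeansDF <- aggregate(B ~ Agrp, data = dat, FUN = mean)
```

Swapping mean for median, or for any other summary function, works the same way with both tapply and aggregate.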
Given that the number of B values will not necessarily be constant across groups the only way I can think to deliver the individual values by A-grouped-value would be with split. I added an extra row to illustrate that a non-even split might need to return a list rather than a more "rectangular" object :
library(Hmisc)  # provides cut2()
dat <- read.table(text="A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348\n 3.5943 3.796", header=TRUE)
dat$Agrp <- cut2(dat$A, g=trunc(length(dat$A)/3) )
split(dat$B, dat$Agrp)
#-----
$`[-16.49, 2.63)`
[1] 1.680173 1.868667 1.900002 2.005348
$`[ 2.63,11.01]`
[1] 1.866603 1.960929 3.796000
If you want the vector of values on which the splits were done then that can be accomplished by using regex on levels(dat$Agrp).
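A minimal sketch of that regex step, assuming the level labels have the form shown in the output above; the labels are typed in by hand here so the example runs without Hmisc, and lv, nums and breaks are illustrative names.

```r
# Level labels as printed above; in practice this would be levels(dat$Agrp)
lv <- c("[-16.49, 2.63)", "[ 2.63,11.01]")

# Pull every (possibly negative) decimal number out of each label
nums <- regmatches(lv, gregexpr("-?[0-9]+\\.?[0-9]*", lv))

# Flatten, convert to numeric, and drop the repeated inner boundary
breaks <- sort(unique(as.numeric(unlist(nums))))
```

Here breaks contains the three cut points -16.49, 2.63 and 11.01, rounded to the precision used in the labels.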
