Reordering data by hclust in R


I saw this code here: http://learnr.wordpress.com/2009/08/10/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-9/
hc1 <- hclust(dist(USArrests, method = "canberra"))
hc1 <- as.dendrogram(hc1)
ord.hc1 <- order.dendrogram(hc1)
hc2 <- reorder(hc1, state.region[ord.hc1])
ord.hc2 <- order.dendrogram(hc2)
region.colors <- trellis.par.get("superpose.polygon")$col
USArrests2 <- melt(t(scale(USArrests)))
USArrests2$X2 <- factor(USArrests2$X2, levels = state.name[ord.hc2])
But I'm confused by the fourth line, which uses the state.region variable.
The ordering ord.hc1 was generated from USArrests, which seems to have nothing to do with state.region. So why is state.region used for reordering instead of a column of the USArrests data frame?

Look at the help file for state.region:
?state.region
The first sentence under Details is:
R currently contains the following "state" data sets.
Note that all data are arranged according to alphabetical
order of the state names.
This means we can move between the USA data sets, since they are all in the same order: the state in the first row of USArrests is the same state as the first entry of state.region, and so on for every position.
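A quick way to check this yourself, as a minimal sketch using only the built-in data sets (same_order and region_by_state are illustrative names, not part of the original code):

```r
# USArrests row names and state.name are both in alphabetical order,
# so position i refers to the same state in each of them
same_order <- identical(rownames(USArrests), state.name)
same_order

# Because of that, state.region can be labelled with the USArrests row names
region_by_state <- setNames(as.character(state.region), rownames(USArrests))
head(region_by_state)
```

That shared ordering is why an index computed from USArrests can be used to subset state.region.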

Related

Is there an R function that can create an adjacency list / edge list / adjacency dataframe from a csv, to then use in iGraph?

I am trying to perform a Social Network Analysis of Congressional Roll Call data. The data I have comes as a csv, from voteview.com, and has the following format:
[Image: format of the csv]
There are many unique bills (identified by roll number) that I need to loop through to see how often politicians (identified by icpsr) agree in their vote (given by cast_code).
However, I am really unsure how to loop through this data frame, check whether two politicians voted the same way on a given bill, and then add that to a new data frame with three columns [politician 1|politician 2|weight (how many times they voted the same on unique bills)].
I have produced the following code when there was just a single bill being considered, which was able to get me a network map:
library(dplyr)   # for %>% and filter()
library(igraph)  # for graph_from_data_frame()
#1. creating a dataframe with all the yayers and one with all the nayers
yay_list <- S117 %>% filter(cast_code == '1')
nay_list <- S117 %>% filter(cast_code == '6')
#2. a list of the icpsr numbers who agree for yay and nay
y_list <- list(yay_list$icpsr)
n_list <- list(nay_list$icpsr)
#3. trying to use this list to make an igraph graph - BUT it does not recognise it
# I am not sure where to go next
make_ring(yay_list)
a1 <- as_adj_list(y_list)
#4. Alternative method - using only columns for icpsr & cast_code
# this will make an edge/adjacency style data frame
foo <- S117[, c("icpsr", "cast_code")]
library(plyr)
# define a function returning the edges for a single group
group.edges <- function(x) {
edges.matrix <- t(combn(x, 2))
colnames(edges.matrix) <- c("Sen_A", "Sen_B")
edges.df <- as.data.frame(edges.matrix)
return(edges.df)
}
# apply the function above to each group and bind altogether
all.edges <- do.call(rbind, lapply(unstack(foo), group.edges))
# add weights if needed
#all.edges$weight <- 1
#all.edges <- aggregate(weight ~ Sen_A + Sen_B, all.edges, sum)
all.edges
#convert to a dataframe for igraph
df <- data.frame(all.edges)
df
# use igraph function on new dataframe and plot
g <- graph_from_data_frame(df)
print(g, e=TRUE, v=TRUE)
plot(g)
# a plot is produced, which is good, but I do not know how to do this for
# a situation where there are multiple bills - it seems very complicated
Does anyone have any advice on how I would create a similar style edge list data frame, ideally with weights (as there are many bills in the data frame not just 1)?
The weight should show how many times politicians vote the same way (either yay or nay) on unique bills.
Thanks!
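One possible way to get a weighted edge list across many bills, as a rough sketch in base R rather than a definitive answer. The toy data and the column names rollnumber, icpsr and cast_code are assumptions about the voteview csv; codes 1 and 6 are taken as yay and nay, as in the code above.

```r
# Toy stand-in for the voteview data (column names are assumed)
votes <- data.frame(
  rollnumber = c(1, 1, 1, 2, 2, 2),
  icpsr      = c(101, 102, 103, 101, 102, 103),
  cast_code  = c(1, 1, 6, 6, 6, 6)
)

# Keep only yay (1) and nay (6) votes
votes <- votes[votes$cast_code %in% c(1, 6), ]

# Self-join: pair every vote with every other vote on the same bill
pairs <- merge(votes, votes, by = "rollnumber", suffixes = c("_a", "_b"))

# Keep each unordered pair once, and only where both cast the same code
agree <- pairs[pairs$icpsr_a < pairs$icpsr_b &
               pairs$cast_code_a == pairs$cast_code_b, ]

# Count the number of bills on which each pair agreed
agree$weight <- 1
edges <- aggregate(weight ~ icpsr_a + icpsr_b, data = agree, FUN = sum)
edges
```

The resulting edges data frame has the [politician 1|politician 2|weight] shape and could be passed to graph_from_data_frame(edges, directed = FALSE), which treats the extra column as an edge attribute. Note that the self-join grows with the square of the number of voters per bill, so it may need chunking for very large data.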

error with dfidx: the two indexes don't define unique observations

I have collected data from a survey in order to perform a choice-based conjoint analysis.
I preprocessed and cleaned the data with Python in order to use it in R.
However, when I apply the function dfidx to the dataset I get the following error: the two indexes don't define unique observations.
I really do not understand why. Before creating the .csv file I checked for duplicates with the pandas call final_df.duplicated().sum(), and its output was 0, meaning that there were no duplicates.
Can someone please help me understand what I am doing wrong?
Here is the code:
df <- read.csv('.../survey_results.csv')
df <- df[,-c(1)]
df$Platform <- as.factor(df$Platform)
df$Deposit <- as.factor(df$Deposit)
df$Fees <- as.factor(df$Fees)
df$Financial_Instrument <- as.factor(df$Financial_Instrument)
df$Leverage <- as.factor(df$Leverage)
df$Social_Trading <- as.factor(df$Social_Trading)
df.mlogit <- dfidx(df, idx = list(c("resp.id","ques"), "position"), shape='long')
Here is the link to the dataset that I am using: https://github.com/AlbertoDeBenedittis/conjoint-survey-shiny/blob/main/survey_results.csv
Thank you in advance for your time.

The function dfidx() is built for data frames "for which observations are defined by two (potentialy nested) indexes" (ref).
I don't think this function is built for more than two indexes. In your df, the rows are unique only when all three columns you mention (resp.id, ques and position) are considered together.
One solution to this problem is to combine the two columns resp.id and ques into one (called, for example, resp.id.ques) with paste(...):
df$resp.id.ques <- paste(df$resp.id, df$ques, sep="_")
Then you can write the following line which should work just fine:
df.mlogit <- dfidx(df, idx = list("resp.id.ques", "position"))
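To see why combining the columns helps, here is a small sketch on made-up data (the toy data frame is hypothetical and only reuses the column names from the question). It uses base R's anyDuplicated(), which returns 0 when no row is repeated.

```r
# Hypothetical toy data: 2 respondents, up to 2 questions, 2 alternatives each
toy <- data.frame(
  resp.id  = c(1, 1, 1, 1, 2, 2),
  ques     = c(1, 1, 2, 2, 1, 1),
  position = c(1, 2, 1, 2, 1, 2)
)

# Two columns alone do not identify a row: a non-zero result means duplicates
dup_two <- anyDuplicated(toy[, c("resp.id", "position")])

# All three columns together do identify a row
dup_three <- anyDuplicated(toy[, c("resp.id", "ques", "position")])

# After pasting resp.id and ques together, two indexes are enough
toy$resp.id.ques <- paste(toy$resp.id, toy$ques, sep = "_")
dup_combined <- anyDuplicated(toy[, c("resp.id.ques", "position")])

c(dup_two = dup_two, dup_three = dup_three, dup_combined = dup_combined)
```

Running the same checks on the real df before calling dfidx() should show whether the chosen index columns define unique observations.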

Combine imputed data by group in r using mice

My question is a follow-up to this question on imputation by group using "mice":
multiple imputation and multigroup SEM in R
The code in the answer works fine as far as the imputation part goes. But afterwards I am left with a list of several complete data sets rather than a single one. The sample looks as follows:
# Set up data frame
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
# Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
df
# Impute values by group:
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
df.clean
As you can see, df.clean is a list of 3, one element per group. But each element contains a complete data set of the kind I am looking for.
The original answer suggests using rbind() on the data in df.clean, which leaves me with a new data set of 45 observations (3x the original size).
Here is the original code for the last step:
imputed.both <- do.call(args = df.clean, what = rbind)
Which data is the "right" one? And why the last step?
Thanks a bunch!

There's a bug in the code. I have an edited version below that works:
#Set up data frame
set.seed(12345)
df.g1<-data.frame(ID=rep("A",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,10,20)),x3=floor(runif(5,100,150)))
df.g2<-data.frame(ID=rep("B",5),x1=floor(runif(5,0,2)),x2=floor(runif(5,25,50)),x3=floor(runif(5,200,250)))
df.g3<-data.frame(ID=rep("C",5),x1=floor(runif(5,4,5)),x2=floor(runif(5,75,99)),x3=floor(runif(5,500,550)))
df<-rbind(df.g1,df.g2,df.g3)
#Introduce NAs
df$x1[rbinom(15,1,0.1)==1]<-NA
df$x2[rbinom(15,1,0.1)==1]<-NA
df$x3[rbinom(15,1,0.1)==1]<-NA
# check NAs
colSums(is.na(df))
#Impute values by group:
# the bug was here: mice() must be called on x (the group subset), not on df
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
imputed.both <- do.call(args = df.clean, what = rbind)
dim(imputed.both)
# returns 15,4
In the code in the question, you have
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(df,m=5)))
dim(do.call(rbind,df.clean))
#this returns 45,4
The function is defined with the argument "x", but its body uses df from the global environment. Hence you impute on the complete df once per group, and binding the three results gives 45 rows.
So to answer your question, if you do this step:
split(df,df$ID)
you split your data frame into a list of data frames, each containing only the rows for A, B or C. Then if you lapply over this list, you get
df.clean<-lapply(split(df,df$ID), function(x) mice::complete(mice(x,m=5)))
names(df.clean)
lapply(df.clean,dim)
Each item of the list df.clean contains the imputed subset of the original df for one ID (A, B or C). Now you combine this list back into a single data.frame using:
imputed.both <- do.call(rbind,df.clean)
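The scoping issue can be reproduced without mice at all. In this small sketch the identity function stands in for mice::complete(mice(...)), so nothing is actually imputed; the data frame is made up for illustration.

```r
# Made-up data: 3 groups of 5 rows each
df <- data.frame(ID = rep(c("A", "B", "C"), each = 5), x1 = 1:15)
parts <- split(df, df$ID)

# Wrong: the function ignores its argument x and returns the global df
wrong <- lapply(parts, function(x) df)

# Right: the function works on x, the subset for one group
right <- lapply(parts, function(x) x)

dim(do.call(rbind, wrong))  # 45 rows: the full df three times
dim(do.call(rbind, right))  # 15 rows: each group once
```

So the 15-row result is the one to use; the rbind() step only stacks the per-group pieces back together.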

How to create a table in R populated with 1s and 0s to show presence of values from another table?

I'm working with data regarding people and what class of medicine they were prescribed. It looks something like this (the actual data is read in via txt file):
library(data.table)  # for as.data.table()
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"),ncol=2,byrow=TRUE)
colnames(test) <- c("id","med")
test <- as.data.table(test)
test <- unique(test[, 1:2])
test
The table has about 5 million rows, 45k unique patients, and 49 unique medicines. Some patients have multiples of the same medicines, which I remove. Not all patients have every medicine. I want to make each of the 49 unique medicines into separate columns, and have each unique patient be a row, and populate the table with 1s and 0s to show if the patient has the medicine or not.
I was trying to use spread or dcast, but there's no value column. I tried to work around this by adding a column of 1s:
test$true <- rep(1, nrow(test))
And then using tidyr
library(tidyr)
test_wide <- spread(test, med, true, fill = 0)
My original data produced the error below, but I'm not sure why this example data doesn't reproduce it:
Error: `var` must evaluate to a single number or a column name, not a list
Please let me know what I can do to make this a better reproducible example. Sorry, I'm really new to this.

It looks like you are trying to do one-hot encoding here. For this, please refer to the "onehot" package.
Code for reference:
library(onehot)
test <- matrix(c(1,"a",1,"a",1,"b",2,"a",2,"c"),ncol=2,byrow=TRUE)
colnames(test) <- c("id","med")
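# stringsAsFactors = TRUE so that med becomes a factor (not the default since R 4.0)
test <- as.data.frame(test, stringsAsFactors = TRUE)
str(test)
# go through character so the ids keep their values rather than factor codes
test$id <- as.numeric(as.character(test$id))
str(test)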
encoder <- onehot(test)
finaldata <- predict(encoder,test)
finaldata
Make sure that all the columns you want encoded are of type factor. Also, I have taken the liberty of changing the data.table to a data.frame.
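If the goal is one row per patient, a base R alternative is to cross-tabulate id against med and turn the counts into 0/1. This is a different technique from the onehot package above, shown only as a sketch on the example data (counts and presence are illustrative names).

```r
# Example data from the question, including the repeated (1, "a") row
test <- data.frame(
  id  = c(1, 1, 1, 2, 2),
  med = c("a", "a", "b", "a", "c")
)

# Count how often each id/med combination occurs
counts <- table(test$id, test$med)

# Turn counts into presence (1) / absence (0); repeated rows still give 1
presence <- (unclass(counts) > 0) * 1L
presence
```

Because any count above zero becomes 1, the duplicates do not need to be removed first. Whether the full cross-tabulation fits comfortably in memory for the real data is something to check, although 45k patients by 49 medicines is a fairly small matrix.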

How to get the previous name of data set in R?

Suppose I assign data = Abortion (the Abortion data set from the ltm package). I have a function where one of the inputs is data.
When using the function, I write:
function.name(data=Abortion)
To write the summary of the results, I want the name of the data set I used; in this case it is Abortion.
How can I get that name back?
More generally, suppose I have an object named abc. I assign xyz = abc; how can I now get the name abc back?

I suggest rethinking your approach. I assume you are trying to loop through different data sets and collect results. Try the following example:
#dummy data
dat1 <- runif(10)
dat2 <- runif(10)
dat3 <- runif(10)
#my function
myfunc <- function(data) max(data)
#make a list - here the list is created manually; it could be built automatically, e.g.:
# lapply(list.files(),read.table)
all_dat <- list(dat1,dat2,dat3)
#add names to list
names(all_dat) <- c("dat1","dat2","dat3")
#loop through dat1,2,3
sapply(all_dat,myfunc)
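If you only need the name of the object that was passed for a single call, base R's deparse(substitute()) idiom can capture it inside the function. A minimal sketch (summarise_with_name is a made-up function name for illustration):

```r
# Hypothetical function that records the name of the object passed as data
summarise_with_name <- function(data) {
  # substitute() captures the unevaluated argument; deparse() turns it into text
  data_name <- deparse(substitute(data))
  list(name = data_name, max = max(data))
}

dat1 <- c(3, 1, 2)
res <- summarise_with_name(data = dat1)
res$name
res$max
```

This works for the function-argument case. For a plain assignment such as xyz = abc, only the value is copied, so the name abc cannot be recovered from xyz afterwards.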
