Hello again, Stack Overflowers! Hope you are well.
I am working on a project and am essentially trying to create a decision tree. The data is for a bank's marketing campaign, specifically how well the campaign incentivized customers to open a term deposit.
Anyhow, I've worked through the coding with some assistance from online resources and hit a wall on one part.
One of the columns is the term deposit amount for each customer, and I plotted the data to visualize it (plot attached). Since the data is so dispersed, I wanted to discretize it. I used the following code:
BankTraining$TDepositAMTD <- cut(BankTraining$TermDepositAMT, right = FALSE,
                                 breaks = c(0, 5000, 10000, 15000, 20000,
                                            max(BankTraining$TermDepositAMT)))
The y-axis is the number of observations and the x-axis is the dollar amount of the term deposits.
However, viewing the column after this step I see:
table(BankTraining$TDepositAMTD)

      [0,5e+03)   [5e+03,1e+04) [1e+04,1.5e+04) [1.5e+04,2e+04)   [2e+04,3e+04)
           5213            8631            8367            1698            3121
Now, clearly this is no good. Once the decision tree is created, it shows these weird categories, which I cannot interpret.
Could someone shed light on this issue please? Much gratitude for your help.
Since it seems you are not happy with the cuts you are producing, have a go at it with:
library(Hmisc)
Groups <- cut2(data, g = 5) # g is the number of groups or levels I want
The package Hmisc can be found here.
As for your weird categories, we would need to see which packages/algorithms you use and how you call them, as these categories may be a product of your binning and some default behaviour. Happy to edit when more information is available.
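For what it's worth, here is a minimal sketch of both routes, assuming the BankTraining data frame from the question; the g = 5 grouping and the dig.lab value are illustrative choices, not recommendations:

library(Hmisc)

# Route 1: let cut2 choose the breaks so each of ~5 groups holds a similar number of customers
BankTraining$TDepositAMTD <- cut2(BankTraining$TermDepositAMT, g = 5)

# Route 2: keep the original breaks but make the labels readable;
# dig.lab = 6 prints up to six significant digits instead of the 5e+03-style notation,
# and include.lowest = TRUE keeps the maximum value inside the last bin when right = FALSE
BankTraining$TDepositAMTD <- cut(BankTraining$TermDepositAMT, right = FALSE,
                                 breaks = c(0, 5000, 10000, 15000, 20000,
                                            max(BankTraining$TermDepositAMT)),
                                 include.lowest = TRUE, dig.lab = 6)

table(BankTraining$TDepositAMTD)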
I am (desperately) trying to generate a few graphs from some data that I know has a correlation value of over 0.50, and I get these two graphs.
Needless to say, I am no statistician nor have I played with this subject before.
Here are the 2 graphs that I get:
What can be said about the 2 graphs individually? I am super confused by the outcome.
Hard to say without knowing the full scope and context of your data. A few remarks:
There is some upper limit in the first graph above which all data points should be considered 'out of range' (you can't have 40 years with a company if you're 30 years old).
Pay attention to Simpson's paradox and make sure you have the right segmentation of your data (and check it).
2nd graph: if you only have values of 3 and 4 on the y-axis, there is no use in plotting grid lines and values for 3.2 etc. (it implies a precision that isn't there).
2nd graph: it seems there is some 'rule' that says you need a rating of 4 to get a certain percentage salary raise.
But again, more (business) context and info on the data is needed to help you out further.
Good day.
I am three months into R and RStudio but am getting the hang of things. I am implementing a SOM solution on 38k records/observations using the kohonen package's supersom, following Self-Organising Maps for Customer Segmentation using R.
My data has no missing values but almost 60 columns, many of them dummyVars (I received the data in this format).
I have removed the one character column (URL).
My Y column (as I understand it) is "shares" (how many times it was shared).
My data only consists of numerical values (the dummyVars are of course 1 or 0).
I have centered and scaled my data (the entire data frame).
As per the example I followed, I did convert the entire DF to a matrix (a sketch of these steps follows below).
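A minimal sketch of the preprocessing and training just described, assuming the 38k records are already in a numeric data frame df; the grid size and rlen below are placeholders, not recommendations:

library(kohonen)

# centre and scale every column, then convert to the matrix supersom expects
data_matrix <- as.matrix(scale(df))

# a single-layer supersom; xdim/ydim and rlen are illustrative values only
som_grid  <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal")
som_model <- supersom(list(data_matrix), grid = som_grid, rlen = 500)

# the training-progress curve referred to below
plot(som_model, type = "changes")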
My problem is that my SOM takes ages to train, even with multi-core processing, and my training-progress graph does not reach a nice flat-ish plateau; it does come down nicely but stays very erratic. All my other graphs show extremely high node populations and there is no nice clustering. I have even tried 500 iterations with a 100x100 grid ;-(
I think/guess it is because of the huge number of columns, mostly dummyVars, e.g. dayOfWeek.Monday, dayOfWeek.Tuesday, category.LifeStile, category.Computers, etc.
What am I to do?
Should I convert the dummyVars back into another format? How, and why?
Please do not just give me a section of code, as I would like to understand why I need to do what.
Thanx
I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix.
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I am not sure how to go on. I have read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are right for me?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it's very messy and I don't know how to read it or how to go on. It looks like this:
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a scree plot. I read that it shows a curve from which you can see how many clusters are appropriate.
I also performed a cluster analysis and chose 2-20 clusters, but the output is so long that I have no idea how to handle it or what is important to look at.
To determine the "optimal number of clusters" several methods are available, although it is a controversial topic.
The kgs function (the Kelley-Gardner-Sutcliffe penalty, from the maptree package) is helpful to get the optimal number of clusters.
Following your code one would do:
library(maptree)  # provides the kgs penalty function

clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the minimum value of op_k, as you can see in the plot.
You can get it with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
Check this page for more methods.
Hope it helps you.
Edit
To find which is the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Plus
Also see this post for the perfect graphical answer from @Ben.
Edit
op_k[which(op_k == min(op_k))]
still gives the penalty value. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package.
Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches).
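For completeness, a minimal sketch of that dendextend route, reusing the distance object from the question (the $k element is how I recall find_k exposing the chosen number of clusters; worth double-checking against the package docs):

library(dendextend)

dend <- as.dendrogram(hclust(distance))

# estimate the number of clusters via average silhouette width
k_opt <- find_k(dend)
k_opt$k

# colour the branches with the chosen number of clusters
plot(color_branches(dend, k = k_opt$k))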
I'm fairly new to R and haven't been able to find an answer for this. Someone else asked a similar question, but no solution was ever reported. If I should have posted this question on a different Stack Exchange, I apologize and will delete it if it can't be migrated.
Using data I pulled from the FDIC on US-based financial institutions and their total asset holdings, I would like to create a basic network graph where each node is sized proportionally relative to the other nodes in the graph. Each node would also be labeled with the name of the financial institution.
The edges of the graph actually don't matter for now, but I want each node connected to the network by at least one edge.
As of now, I've already successfully created a very basic network with 8 banks, connected by edges I randomly assigned, as shown here (I apparently can't embed pictures yet, sorry about that):
My .csv file will be formatted as:
id, bank, assets
1, JP Morgan Chase, 16928000
2, Bank of America, 19075000
... ... ...
The file for the graph I already created is the same as above, except without the assets column. It also had only 8 banks, whereas the file I hope to use will have 25.
Like I already said, as for edges, I just randomly assigned some. If someone knows an easier way of creating random edges that connect the nodes I create, please let me know (see the sketch after the format below). Otherwise, this is how my file is formatted as of now:
to, from
1, 2
1, 3
...
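On the random-edges side question above: a minimal sketch, assuming the 25 numeric node ids from the nodes file, that draws one random partner per node so every node ends up with at least one edge:

set.seed(42)   # illustrative seed, not required
n <- 25        # number of banks in the nodes file

# one edge per node, each to a randomly chosen other node
links <- data.frame(
  from = 1:n,
  to   = sapply(1:n, function(i) sample(setdiff(1:n, i), 1))
)
head(links)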
And I created the graph I linked with the following commands:
> nodes <- read.csv("~/foo/foo/foo.csv")
> links <- read.csv("~/blah/taco/burrito/blah.csv")
> net <- graph_from_data_frame(d=links, vertices = nodes, directed = F)
> class(net)
> net
IGRAPH UN-- 8 10 --
+ attr: name (v/c), bank (v/c)
+ edges (vertex names):
[1] 1--2 1--3 1--4 1--5 2--3 2--4 2--7 4--5 5--8 7--8
> plot(net, main = "Financial Intermediaries", edge.arrow.size=.4, vertex.size=25, vertex.label.cex=1.5, vertex.label.color="black", vertex.label=V(net)$bank)
I hope I was clear with my problem and gave the necessary details/code. If not, please just let me know and I'll post it up here. Like I said, I'm really new to R (I literally picked it up today, lol), and much of the code I've used so far was more or less taken from Katya Ognyanova's examples/presentations on her blog.
For the sake of clarity, I'm currently using RStudio (most recent stable) and R v3.2.5.
I have only been using the igraph package, but if what I want can't be done with it, I am more than willing to switch to a different package. That said, I would like to stay with R (unless there really is something so much easier that it can't be ignored; I would like to stick with and learn R).
Thank you for any and all help, I really appreciate it.
As @Osssan linked to in the comments, there was a partial solution floating around.
That said, I think I created more of a 'hack' solution than a proper one with what I gleaned from the previous question. Here is what I did.
In my csv file, I had four columns. The third column held the assets for a given bank. NOTE: since I didn't know how to do the data manipulation inside R, I had to adjust the asset values beforehand so that they did not produce nodes covering the entire graph. With my solution, you will NOT get nodes that are sized relative to each other automatically; you must do that scaling first.
Since I wanted to create a network with nodes (banks) whose size varies with their respective asset holdings, I created a separate vector like so:
> df <- read.csv("~/blah/blah/blah.csv", colClasses = c("NULL","NULL", NA, "NULL"))
What this command does is read in the csv file and, via colClasses, skip every column marked "NULL", keeping only the column marked NA (here, the assets column). I then plugged this vector into the plot function as such:
> plot(net, main = "Financial Intermediaries", edge.arrow.size=.4, vertex.size=as.matrix(df), vertex.label.color="black")
where I make a matrix using as.matrix(df) and pass it to vertex.size=. Given a single-column data frame, R is able to coerce it to the appropriate matrix.
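For reference, a minimal sketch of doing that rescaling inside R instead of in the csv, assuming the full four-column file with an assets column and the net object built earlier; the 5-25 range is just an illustrative choice of vertex sizes:

nodes  <- read.csv("~/foo/foo/foo.csv")   # id, bank, assets, ...
assets <- nodes$assets

# rescale assets linearly into an illustrative 5-25 vertex-size range
sizes <- 5 + 20 * (assets - min(assets)) / (max(assets) - min(assets))

plot(net, main = "Financial Intermediaries",
     vertex.size = sizes,
     vertex.label = V(net)$bank,
     vertex.label.color = "black")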
I still have to do some relabeling and connecting with edges, but the graphing worked. I graphed the largest 26 commercial banks by total asset holdings (and adjusted them to % of total commercial bank assets in the US), so you will see that the size of the nodes increases from 26 down to 1. Here's the output.
Like I said, this solution works, but I am far from sure whether it would be considered proper or kosher. I welcome anyone to edit this solution so that it clarifies what is actually happening with my code, and/or to post a proper/optimized solution if it exists. I'm going to give this post a solid few days before marking it solved, as I would still like to get a solid answer on this confusing problem.
P.S. If anyone knows of a way to force nodes not to overlap, I would appreciate a comment explaining how to do that. If you look at my picture, you'll see that the effect of dwarfing the other nodes is diminished when the largest node is covered by its closely sized peers.
I have created a plot in R using googleVis, specifically gvisMotionChart, plotting a number of variables.
I am primarily using the line graph, and it is all good when I view the graph with all variables; however, when I select some of the individual variables it zooms in such that some of the plot for that variable is no longer on the graph. I know it should zoom in to view just this variable and can exclude other variables (which is a good feature), but it zooms in too much, so the variable I am after is not entirely on the graph.
This doesn't happen with all variables, and I can get around it by also selecting other variables on either side of the one I want to view, but it would be good if I could fix this. Has anyone come across a similar problem before and knows a way around it?
Thanks in advance
EDIT: I have an example of this using the Batting data from the Lahman package. (I know nothing about baseball, so the analysis probably doesn't make sense; in fact, looking at the results, it almost certainly doesn't, but it demonstrates my point.) If you run the following code:
library(Lahman)
library(googleVis)
recent <- subset(Batting, yearID > 2000)
homeruns <- aggregate(HR ~ stint + yearID, data = recent, FUN = sum)
avgHR <- mean(homeruns$HR)
homeruns$HR <- homeruns$HR - avgHR
m <- gvisMotionChart(data = homeruns, idvar = "stint", timevar = "yearID")
plot(m)
Then select the line graph and subset on number 2; the top part of the graph is cut off.
It seems to be Google's bug. I could even reproduce this same error in their "Visualization Playground" (https://code.google.com/apis/ajax/playground/?type=visualization#motion_chart) by making part of the data negative.
I've already reported the issue as a bug: https://code.google.com/p/google-visualization-api-issues/issues/detail?id=1479
May the force be with them!
I just had the same problem with a Sankey plot. I resolved it by deleting entries with value == 0. However, I just tried to reproduce your example and could not reproduce your bug, so perhaps this has already been solved?