NMDS (non-metric multidimensional scaling): coding a data set in R

I have a data set of lizard retreat sites that I'd like to examine using an NMDS in R to determine which variables are likely important. I'm a novice with R and was told I need to code the data so R can read it. I'm using OS X 10.9.5 (13F1911) and R 3.3.3 (GUI 1.69, Mavericks build 7328).
I'm not sure how to attach the data file, so I've copied the output of head(data) here:
data <- data.frame(newdataset)
head(data)
Hide.. | PIT | Year | Species | Alive.Partial.Dead | Standing.half.fallen.fallen | X..days.obs | Total...of.day.occupied | Height | Diameter | Angle | Aspect
1 | 91A1 | 2004 | Hog Doctor | A | S | 6 | 6 | 4.2 | ? | . | ?
2 | 91A1 | 2004 | Mammie | A | S | 4 | 4 | 1.8 | 5-10cm | 90 | SW
3 | COFE | 2004 | Tabebuia riparia | A | S | 17 | 16 | 3 | 5-10cm | 0 | ENE
4 | COFE | 2004 | Columar cactus | P | Fallen | 2 | 2 | 0 | 5-10cm | 90 | S
5 | COFE | 2004 | ? | D | Fallen | 4 | 3 | 0.2 | 5-10cm | 60 | ?
6 | COFE | 2004 | Eugenia sp (check greeny fruit) | P | S | 7 | 7 | 3.5 | 10-20cm | 0 | W
As you can see I managed to read the data into R, but I'm not sure what comes next. I know I need to convert my data.frame(newdataset) to a distance matrix, but I am unclear whether I have to code or create levels for some of the variables, e.g., whether the retreat site (selected by the lizard) was in a tree that was either 1. alive, 2. partially dead, or 3. dead.
A little more about the variables:
1. Hide (retreat): identifies each retreat selected by lizards, i.e., one lizard may use a single retreat or multiple retreats.
2. PIT: Passive Internal Transponder identification number uniquely identifying each lizard.
3. Year: the year the data were collected.
4. Species: the tree species in which a retreat was located or, in the case of a single lizard, the substrate (rock) used.
5. Alive.Partial.Dead: whether the tree was alive, partially dead, or dead.
6. Standing.half.fallen.fallen: whether the tree was standing upright, leaning over, or lying on the ground.
7. X..days.obs: the number of days a lizard was observed using a particular retreat site.
8. Total...of.day.occupied: the total number of days a retreat site was known to be used.
9. Height: the height of the retreat site from the ground.
10. Diameter: the diameter of the section of tree containing the retreat site.
11. Angle: the angle of the retreat site relative to the ground.
12. Aspect: the compass aspect (orientation) of the retreat site.
Thank you to anyone who can give some advice on this problem.
Cheers
Rick
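One possible starting point (a sketch, not a full solution): with mixed categorical and numeric variables like these, a common route is to code the categorical columns as factors, turn the '?' / '.' placeholders into NA, compute a Gower dissimilarity matrix with daisy() from the cluster package (it copes with mixed numeric and factor columns), and run the NMDS on that with metaMDS() from vegan. The column names below are taken from the head(data) output above.
# Sketch only: assumes 'data' is the data frame shown above
library(cluster)  # daisy() for Gower dissimilarity on mixed data
library(vegan)    # metaMDS() for the NMDS itself
# Treat '?' and '.' entries as missing values
data[] <- lapply(data, function(x) replace(x, x %in% c("?", "."), NA))
# Code the categorical variables as factors and make the measurements numeric
data$Alive.Partial.Dead <- factor(data$Alive.Partial.Dead, levels = c("A", "P", "D"))
data$Standing.half.fallen.fallen <- factor(data$Standing.half.fallen.fallen)
data$Diameter <- factor(data$Diameter)
data$Aspect   <- factor(data$Aspect)
data$Height   <- as.numeric(as.character(data$Height))
data$Angle    <- as.numeric(as.character(data$Angle))
# Keep the habitat variables (drop the ID columns) and build the dissimilarity matrix
vars <- data[, c("Alive.Partial.Dead", "Standing.half.fallen.fallen",
                 "Height", "Diameter", "Angle", "Aspect")]
d <- daisy(vars, metric = "gower")
# Run the NMDS on the dissimilarity object and inspect stress and the plot
nmds <- metaMDS(d, k = 2, trymax = 50)
nmds$stress
plot(nmds)
If that runs, vegan's envfit() can then be used to see which of the variables correlate with the resulting ordination axes.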

Related

Constrained K-means, R

I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers for a whole night but with no result. Would anyone have ideas on this problem using R? Or is there a package I should look at? Thanks.
More background info:
I am trying to replicate clusters of relationships using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are old people, and they sometimes report inaccurate age or education information. My main challenge is that I want only one of each cluster label in each survey year. For example, I do not want to see two rows labelled cluster 3 in survey year 2000. My data looks like this:
survey year | relationship | gender | education level | birth year | k-means cluster
2000 | 41 (first daughter) | 0 | 3 | 1997 | 1
2003 | 41 (first daughter) | 0 | 3 | 1997 | 1
2000 | 42 (second daughter) | 0 | 4 | 1999 | 2
2003 | 42 (second daughter) | 0 | 4 | 1999 | 2
2000 | 42 (third daughter) | 0 | 5 | 1999 | 2
2003 | 42 (third daughter) | 0 | 5 | 2001 | 3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is a panel survey asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family's demographic information, such as birth year and education level, we might otherwise need to delete a big part of the data where it does not match across waves.
(E.g., a respondent reported that his first son was 30 years old in 1997 but said he was 29 years old in 1999; this record is therefore problematic.) My task is to save as much data as possible when the imprecision is not that high.
Therefore I first mutated columns to check the consistency of each family member's records (e.g., birth year error %in% c(-1, 2)). Next, I run k-means only where family members are detected to be imprecise. In this way I keep much of the data. Although I did not solve the problem above, it occurs rarely enough that I can almost ignore it or drop those observations.
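A hedged sketch of one possible post-processing idea (the toy data, column names, and k value below are assumptions based on the table above): run ordinary kmeans() first, then, within each survey year, reassign labels by solving a small assignment problem with clue::solve_LSAP(), so no label is used twice in the same year. This only works when the number of observations in a year is at most the number of clusters.
# Sketch only: enforce "each cluster label at most once per survey year"
library(clue)  # solve_LSAP() solves the linear sum assignment problem
df <- data.frame(                      # toy data mimicking the table above
  year  = c(2000, 2003, 2000, 2003, 2000, 2003),
  edu   = c(3, 3, 4, 4, 5, 5),
  birth = c(1997, 1997, 1999, 1999, 1999, 2001)
)
k  <- 3
X  <- as.matrix(df[, c("edu", "birth")])  # scaling the columns first is usually wise
km <- kmeans(X, centers = k, nstart = 20)
df$cluster <- NA_integer_
for (yr in unique(df$year)) {
  idx <- which(df$year == yr)             # requires length(idx) <= k
  # cost[i, j] = distance from observation i (in this year) to cluster centre j
  D    <- as.matrix(dist(rbind(X[idx, , drop = FALSE], km$centers)))
  cost <- D[seq_along(idx), length(idx) + seq_len(k), drop = FALSE]
  df$cluster[idx] <- as.integer(solve_LSAP(cost))  # a distinct label per row
}
df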

R: optimal sorting/allocation/distribution of items

I'm hoping someone may be able to help with a problem I'm trying to solve using R.
Individuals can submit requests for items. The minimum number of requests per person is one. There is a recommended maximum of five, but people can submit more in exceptional circumstances. Each item can only be allocated to one individual.
Each item has a 'desirability'/quality score ranging from 10 (high quality) down to 0 (low quality). The idea is to allocate items, in line with requests, such that as many high quality items as possible are allocated. It is less important that individuals have an equitable spread of requests met.
Everyone has to have at least one request met. The next priority is to see whether we can bring anyone who is over the recommended limit back within it by allocating their requests to others. After that, the priority is to look at where the item would rank in each individual's request list based on quality score, and allocate it to the person for whom it would rank highest (e.g., if it would be first in one person's list and third in another's, give it to the former).
Effectively I'd need a sorting algorithm of some kind that:
1. Identifies where an item has been requested more than once.
2. Checks all the requests of everyone making said request.
3. If that request is the only one a person has made, gives it to them (if this scenario applies to more than one person, it should be flagged in some way).
4. If all requesters have made more than one request, checks whether any have made more than five requests; if they have, the item can be taken off them.
5. If all are within the recommended limit, sees where the request would rank (based on quality score) and gives it to the person in whose list it would rank highest.
The process needs to check that the last step isn't happening to people so many times that it leaves them without any requests, so it effectively has to process one item at a time.
Does anyone have any ideas about how to approach this? I can think of all kinds of ways I could arrange the data to make it easy to identify where this needs to happen, but not how to automate the process itself. Thanks in advance for any help.
The data (at least the bits needed for this process) looks like the below:
Item ID Person ID Item Score
1 AAG 9
1 AAK 8
2 AAAX 8
2 AN 8
2 AAAK 8
3 Z 8
3 K 8
4 AAC 7
4 AR 5
5 W 10
5 V 9
6 AAAM 7
6 AAAL 7
7 AAAAN 5
7 AAAAO 5
8 AB 9
8 D 9
9 AAAAK 6
9 AAAAC 6
10 A 3
10 AY 3
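For what it's worth, a rough greedy sketch in R (a simplification of the rules above, not a complete solution: the flagging of multiple single-request people and the over-five rule are ignored). It visits items from highest score down and gives each to the requester who currently holds the fewest items, which helps keep anyone from ending up with nothing; the data frame below is a cut-down copy of the table above.
# Sketch only: a simplified greedy pass over the example data
req <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
item person score
1 AAG 9
1 AAK 8
2 AAAX 8
2 AN 8
2 AAAK 8
3 Z 8
3 K 8
4 AAC 7
4 AR 5
5 W 10
5 V 9
")
alloc <- character(0)                                            # item -> person
count <- setNames(rep(0L, length(unique(req$person))), unique(req$person))
# Visit items in decreasing order of their best score
for (it in unique(req$item[order(-req$score)])) {
  cand <- req[req$item == it, ]
  # Prefer whoever holds the fewest items so far (so nobody is starved),
  # then the requester for whom this item scores highest
  cand <- cand[order(count[cand$person], -cand$score), ]
  winner <- cand$person[1]
  alloc[as.character(it)] <- winner
  count[winner] <- count[winner] + 1L
}
alloc  # which person gets each item
count  # how many items each person received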

Grouping words that are similar

CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
I want to get either:
CompanyName2
Kraft
Kraft
Kraft
nestle
nestle
general motors
general motors
Dow
Dow
But would be absolutely fine with:
CompanyName2
1
1
1
2
2
3
3
I see algorithms for getting the distance between two words, so if I had just one weird name I would compare it to all the other names and pick the one with the lowest distance. But I have thousands of names and want to group them all.
I do not know anything about Elasticsearch, but would one of the functions in the elastic package, or some other function, help me out here?
I'm sorry there's no programming here. I know. But this is way out of my area of normal expertise.
Solution: use string distance
You're on the right track. Here is some R code to get you started:
install.packages("stringdist") # install this package
library("stringdist")
CompanyName <- c('Kraft', 'Kraft Foods', 'Kfraft', 'nestle', 'nestle usa', 'GM', 'general motors', 'the dow chemical company', 'Dow')
CompanyName = tolower(CompanyName) # otherwise case matters too much
# Calculate a string distance matrix; LCS is just one option
?"stringdist-metrics" # see others
sdm = stringdistmatrix(CompanyName, CompanyName, useNames=T, method="lcs")
Let's take a look. These are the calculated distances between strings, using the Longest Common Subsequence metric (try others, e.g. cosine or Levenshtein). They all measure, in essence, how many characters the strings have in common. Their pros and cons are beyond this Q&A. You might look into something that gives a higher similarity value to two strings that contain the exact same substring (like dow).
sdm[1:5,1:5]
kraft kraft foods kfraft nestle nestle usa
kraft 0 6 1 9 13
kraft foods 6 0 7 15 15
kfraft 1 7 0 10 14
nestle 9 15 10 0 4
nestle usa 13 15 14 4 0
Some visualization
# Hierarchical clustering
sdm_dist = as.dist(sdm) # convert to a dist object (you essentially already have distances calculated)
plot(hclust(sdm_dist))
If you want to group them explicitly into k groups, use k-medoids.
library("cluster")
clusplot(pam(sdm_dist, 5), color=TRUE, shade=F, labels=2, lines=0)
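To get the group labels back, and a representative name per group as in the desired CompanyName2, you can keep the pam result in an object; a small sketch using the sdm_dist and CompanyName objects created above (fit is just a name chosen here):
# Extract k-medoids group labels and use each group's medoid as the canonical name
fit <- pam(sdm_dist, k = 5)
data.frame(CompanyName,
           group     = fit$clustering,
           canonical = CompanyName[fit$id.med][fit$clustering])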

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set in which each row is a segment, for example:
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
1          | 2           | 3           | 100         | 3           | 0.1
2          | 0           | 6           | 150         | 5           | 0.3
3          | 0           | 3           | 200         | 6           | 0.56
4          | 1           | 4           | 103         | 4           | 0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense the segments (for example, in the above case, the 4 segments into 2 big segments) based on the 4 attributes and the target variable. I am currently dealing with 15k segments and want only 10 segments, with each final segment based on the target and also having a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID in SPSS (if not using autogrow) will generally split the data into a 70:30 ratio, where it builds the tree on 70% of the data and tests on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a few big segments as explained before. My question is whether I can use CART (rpart in R) for the same. There is an explicit 'subset' option in the rpart function in R, but I am not sure whether omitting it will ensure that CART uses 100% of my data. I am relatively new to R, hence a very basic question.
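As far as I understand rpart, if you leave 'subset' (and 'weights') out, it fits the tree on every row of the data you pass in; there is no built-in 70:30 split. A hedged sketch of how the condensing might look, using the column names from the example table above (the control values are illustrative only and would need tuning on the real 15k segments):
# Sketch only: rpart uses all rows when 'subset' is omitted
library(rpart)
seg <- data.frame(                       # stand-in for the real segment data
  Attribute_1 = c(2, 0, 0, 1),
  Attribute_2 = c(3, 6, 3, 4),
  Attribute_3 = c(100, 150, 200, 103),
  Attribute_4 = c(3, 5, 6, 4),
  Target      = c(0.10, 0.30, 0.56, 0.23)
)
fit <- rpart(Target ~ Attribute_1 + Attribute_2 + Attribute_3 + Attribute_4,
             data = seg, method = "anova",
             control = rpart.control(minsplit = 2, cp = 0, maxdepth = 4))
# Each terminal node is one condensed "big segment"; tune cp / maxdepth
# (or prune the fitted tree) until roughly 10 leaves remain
seg$big_segment <- fit$where
table(seg$big_segment)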

Excel: Select data for graph

To put it simply, I have three columns in Excel like the ones below:
Vehicle x y
1 10 10
1 15 12
1 12 9
2 8 7
2 11 6
3 7 12
x and y are the coordinates of customers assigned to the corresponding vehicle. This file is the output of a program I run in advance. The list will always be sorted by vehicle, but the number of customers assigned to vehicle "k" may change from one experiment to the next.
I would like to plot a graph containing 3 series, one for each vehicle, where the customers of each vehicle would appear (as dots in 2D based on their x- and y- values) in different color.
In my real file, I have 12 vehicles and 3200 customers, and the ranges change from one experiment to the next, so I would like to automate the process, i.e. copy and paste the list into my Excel sheet and see the graph appear automatically (if this is possible).
Thanks in advance for your time and effort.
EDIT: There is a similar post here: Use formulas to select chart data, but it requires the use of VB. Moreover, I am not sure whether it has actually been answered.
You could try this free online tool: www.cloudyexcel.com/excel-to-graph/
