Constrained K-means, R - r

I am currently using k-means to cluster my data, but I want each cluster to appear only once in each given year. I searched for a whole night without finding an answer. Does anyone have ideas on how to approach this in R, or is there a package I should look at? Thanks.
More background:
I am trying to reconstruct the clusters of relationships using the reported gender, education level and birth year. I am doing this because the data come from a survey whose respondents are elderly, and they sometimes report inaccurate age or education information. My main challenge is that I want each cluster label to appear only once in each survey year. For example, I do not want to see two observations labelled cluster 3 in survey year 2000. My data look like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41( first daughter)   0       3                1997        1
2003         41( first daughter)   0       3                1997        1
2000         42( second daughter)  0       4                1999        2
2003         42( second daughter)  0       4                1999        2
2000         42( third daughter)   0       5                1999        2
2003         42( third daughter)   0       5                2001        3
Thanks in advance.
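One way to read this requirement is as an assignment problem: within each survey year, every observation gets a different cluster label, chosen so that the total squared distance to the cluster centers is as small as possible. Below is a minimal base R sketch of that idea, not a tested solution for the real survey. The helper names perms() and assign_unique(), the two features (education level, birth year) and the center values are all assumptions made for illustration. The brute-force search over permutations is only practical for a handful of clusters; a Hungarian-algorithm solver such as clue::solve_LSAP() would be the usual replacement for larger k.

```r
# All permutations of a vector (base R; only sensible for small k)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# Assign each row of x to a distinct center, minimizing total squared distance.
# x: n x p matrix of observations from ONE survey year; centers: k x p matrix.
assign_unique <- function(x, centers) {
  n <- nrow(x)
  k <- nrow(centers)
  stopifnot(n <= k)
  cost <- outer(seq_len(n), seq_len(k),
                Vectorize(function(i, j) sum((x[i, ] - centers[j, ])^2)))
  best <- NULL
  best_cost <- Inf
  for (p in perms(seq_len(k))) {
    labels <- p[seq_len(n)]
    total <- sum(cost[cbind(seq_len(n), labels)])
    if (total < best_cost) {
      best_cost <- total
      best <- labels
    }
  }
  best
}

# Hypothetical centers (education level, birth year), e.g. taken from kmeans()$centers
centers <- rbind(c(3, 1997), c(13 / 3, 1999), c(5, 2001))

# The three survey-year-2000 rows from the table above
x_2000 <- rbind(c(3, 1997), c(4, 1999), c(5, 1999))

# Plain nearest-center labels repeat cluster 2
nearest <- apply(x_2000, 1, function(r) which.min(colSums((t(centers) - r)^2)))
nearest                          # -> 1 2 2

# Constrained labels use each cluster once
constrained <- assign_unique(x_2000, centers)
constrained                      # -> 1 2 3
```

Running assign_unique() once per survey year (for example with split() on the year column) would give labels that never repeat within a year, at the price of sometimes moving an observation away from its nearest center.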
--Update--
A more detailed description of the task:
The data set is a panel survey that asks elderly respondents about their health status and their relationships (including sons, daughters and neighbors). Since these respondents are sometimes imprecise about their family's demographic information, such as birth year and education level, we might need to delete a large part of the data if the reports do not match across waves.
(For example, A reported that his first son was 30 years old in 1997 but 29 years old in 1999, so this record could be problematic.) My task is to save as much data as possible when the imprecision is not too large.
Therefore I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I ran k-means only where family members were detected to be imprecise. In this way, I saved much of the data. Although I did not solve the problem above, it occurs so rarely that I can almost ignore or drop the affected observations.
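The precision check described above can be sketched roughly as follows in base R. This is only an illustration: the column names, the toy rows and the tolerance of 2 years are assumptions, not the actual survey variables or the rule used in the update.

```r
# Toy reports: respondent A gives inconsistent ages for the same son,
# respondent B is consistent (both rows are made up for illustration)
reports <- data.frame(
  respondent   = c("A", "A", "B", "B"),
  survey_year  = c(1997, 1999, 1997, 1999),
  reported_age = c(30, 29, 40, 42)
)

# Birth year implied by each report
reports$implied_birth_year <- reports$survey_year - reports$reported_age

# Spread of implied birth years within each respondent
reports$birth_year_spread <- ave(
  reports$implied_birth_year, reports$respondent,
  FUN = function(v) max(v) - min(v)
)

# Flag family members whose reports disagree by more than the tolerance
tolerance <- 2  # assumed threshold, in years
reports$imprecise <- reports$birth_year_spread > tolerance
reports
```

Here A's reports imply birth years 1967 and 1970, a spread of 3 years, so both rows are flagged; only flagged members would then be passed to the clustering step.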

Related

Portfolio sorts with incomplete data

I have panel data on stock returns, where after a certain year the coverage universe of stocks doubled. It looks a bit like this:
Year Stock 1 Stock 2 Stock 3 Stock 4
2000 5.1% 0.04% NA NA
2001 3.6% 9.02% NA NA
2002 5.0% 12.09% NA NA
2003 -2.1% -9.05% 1.1% 4.7%
2004 7.1% 1.03% 4.2% -1.1%
.....
Of course, I am trying to maximize my observations in both the time series and the cross-section as much as possible. However, I am not sure which of these 3 ways to sort would be the most "academically honest":
1. Sort the years through 2002 using only stocks 1 and 2, and incorporate the remaining stocks in the calculations once they become available in 2003.
2. Only include those stocks in the calculations that have been available since 2000, i.e. stocks 1 and 2. Ignore the remaining stocks altogether since we do not have the full return profile.
3. Start the sort in year 2003, to have a larger cross-section.
The reason why our coverage universe expands in 2003 is simply because the data provider I am using changed their methodology in that year and decided to track more stocks. Stocks 3 and 4 do exist before 2003, but I cannot use their past return data since I need to follow my data provider (for the second variable I am sorting on).
Thanks all!
I am using the portsort package in R, but it does not seem to work well with NAs.
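For what it is worth, a sort that simply uses whichever stocks are available in each year can be written in a few lines of base R, independent of any package. The sketch below uses the returns from the table above; the signal matrix (the second sorting variable) and the helper sort_year() are made up for illustration, and the two-portfolio median split is just one possible breakpoint rule.

```r
years  <- as.character(2000:2004)
stocks <- paste0("Stock", 1:4)

# Returns in percent, copied from the table above
returns <- matrix(
  c( 5.1,  0.04,  NA,   NA,
     3.6,  9.02,  NA,   NA,
     5.0, 12.09,  NA,   NA,
    -2.1, -9.05, 1.1,  4.7,
     7.1,  1.03, 4.2, -1.1),
  nrow = 5, byrow = TRUE, dimnames = list(years, stocks)
)

# Hypothetical sorting variable with the same coverage as the returns
signal <- matrix(
  c(1, 2, NA, NA,
    2, 1, NA, NA,
    1, 2, NA, NA,
    1, 2,  3,  4,
    4, 3,  2,  1),
  nrow = 5, byrow = TRUE, dimnames = list(years, stocks)
)

# Split one year's available stocks at the median signal and
# average the returns in the low and high portfolios
sort_year <- function(ret, sig) {
  ok   <- !is.na(ret) & !is.na(sig)
  high <- ok & sig > median(sig[ok])
  low  <- ok & !high
  c(low = mean(ret[low]), high = mean(ret[high]), n = sum(ok))
}

sorted <- t(sapply(years, function(y) sort_year(returns[y, ], signal[y, ])))
sorted
```

The n column makes the change in coverage explicit (2 stocks through 2002, 4 afterwards), which is useful to report alongside the portfolio returns whichever of the three designs is chosen.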

Propensity Score Matching with panel data

I am trying to use MatchIt to perform Propensity Score Matching (PSM) for my panel data. The data is panel data that contains multi-year observations from the same group of companies.
The data describe a list of bonds together with the financial data of their issuers and the bond terms, such as issue date, coupon rate, maturity, and bond type. For instance:
Firmnames       Year  ROA  Bond_type
AAPL US Equity  2015  0.3  0
AAPL US Equity  2015  0.3  1
AAPL US Equity  2016  0.3  0
AAPL US Equity  2017  0.3  0
C US Equity     2015  0.3  0
C US Equity     2016  0.3  0
C US Equity     2017  0.3  0
......
I already know how to match the observations on the criteria I want, and I use exact = "Year" to make sure I match observations from the same year. The problem I am facing now is that observations from the same company get matched together, which is not what I want. The code I used:
matchit(Bond_type ~ Year + Amount_Issued + Cpn + Total_Assets_bf + AssetsEquityRatio_bf + Asset_Turnover_bf, data = rdata, method = "nearest", distance = "glm", exact = "Year")
However, as you can see in the second row of my sample, there may be two observations in one year from the same company, due to the nature of my study (a company can issue bonds more than once a year). The only difference between them is the Bond_type. Therefore, MatchIt will of course treat them as the best treatment and control pair and match these two observations together, since they have the same ROA and other matching factors in that year.
I can think of two ways to solve this:
1. Remove the observations from the same year and company. However, removing observations might bias the results and ruin the study.
2. Prevent MatchIt from matching observations from the same company (i.e., with the same Firmnames).
The second approach would be better since it will not lead to bias, but I don't know whether MatchIt can do this. I hope someone can give me some advice on this, or if there is a better solution to this problem, please share it with me. Thanks in advance!
Note: If there's any further information I should provide, please let me know. This is my first time asking a question here!
This is not possible with MatchIt at the moment (though it's an interesting idea and not hard to implement, so I may add it as a feature).
In the optmatch package, which performs optimal pair and full matching, there is a constraint that can be added called "anti-exact matching", which sounds exactly like what you want. Units with the same value of the anti-exact matching variable will not be matched with each other. This can be implemented using optmatch::antiExactMatch().
In the Matching package, which performs nearest neighbor and genetic matching, the restrict argument can be supplied to the matching function to restrict certain matches. You could manually create the restriction matrix by restricting all pairs of observations in the same company and then supply the matrix to Match().
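As a rough illustration of the last suggestion, the restriction matrix can be built in base R before any matching is run. The sketch assumes the three-column row format (observation i, observation j, and a negative value to forbid the pair) as I understand it from the Matching documentation, so check ?Match before relying on it; the toy data frame only mimics the question's columns.

```r
# Toy data mimicking the question's layout (values are illustrative)
rdata <- data.frame(
  Firmnames = c("AAPL US Equity", "AAPL US Equity", "AAPL US Equity",
                "C US Equity", "C US Equity"),
  Year      = c(2015, 2015, 2016, 2015, 2016),
  Bond_type = c(0, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

treated <- which(rdata$Bond_type == 1)
control <- which(rdata$Bond_type == 0)

# Every treated/control pair, then keep only pairs from the same company
pairs     <- expand.grid(i = treated, j = control)
same_firm <- rdata$Firmnames[pairs$i] == rdata$Firmnames[pairs$j]

# One row per forbidden pair: (row index i, row index j, -1)
restrict <- cbind(pairs$i[same_firm], pairs$j[same_firm], -1)
restrict
```

The resulting matrix would then be passed as the restrict argument of Match(), alongside whatever treatment vector and covariates are used for the matching itself.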

nMDs non-metric multi-dimensional scaling coding a data set

I have a data set of lizard retreat sites that I'd like to examine using an NMDS in R to determine which variables are likely to be important. I'm a novice with R and was told I need to code the data so R can read it. I'm using OS X 10.9.5 (13F1911) with R 3.3.3 (GUI 1.69 Mavericks build 7328).
I'm not sure how to attach the data file, so I've copied the output of head(data) here:
data <- data.frame(newdataset)
head(data)
Hide.. PIT Year Species Alive.Partial.Dead Standing.half.fallen.fallen X..days.obs Total...of.day.occupied Height Diameter Angle Aspect
1 1 91A1 2004 Hog Doctor A S 6 6 4.2 ? . ?
2 2 91A1 2004 Mammie A S 4 4 1.8 5-10cm 90 SW
3 3 COFE 2004 Tabebuia riparia A S 17 16 3 5-10cm 0 ENE
4 4 COFE 2004 Columar cactus P Fallen 2 2 0 5-10cm 90 S
5 5 COFE 2004 ? D Fallen 4 3 0.2 5-10cm 60 ?
6 6 COFE 2004 Eugenia sp (check greeny fruit) P S 7 7 3.5 10-20cm 0 W
As you can see, I managed to read the data into R, but I'm not sure what comes next. I know I need to convert my data.frame(newdataset) to a distance matrix, but I am unclear whether I have to code or create levels for some of the variables, e.g., whether the retreat site (selected by the lizard) was in a tree that was 1. Alive, 2. Partially Dead, or 3. Dead.
A little more about the variables: 1. Hide (retreat) identifies each retreat selected by lizards, i.e., one lizard may use a single retreat or multiple retreats. 2. PIT is the Passive Internal Transponder identification number uniquely identifying each lizard. 3. Year is the year the data were collected. 4. Species refers to the tree species in which a retreat was located or, in the case of a single lizard, the substrate (rock) used. 5. Identifies whether the tree was alive, partially alive or dead. 6. Identifies whether the tree was standing upright, leaning over, or lying on the ground. 7. The number of days a lizard was observed using a particular retreat site. 8. The total number of days a retreat site was known to be used. 9. The height of the retreat site from the ground. 10. The diameter of the section of tree containing the retreat site. 11. The angle of the retreat site relative to the ground. 12. The aspect (compass direction) of the retreat site.
Thank you to anyone that can give some advice with this problem.
Cheers
Rick
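A minimal base R sketch of the "coding" step might look like the following. It uses a few columns from the head() output above; treating "?" and "." as missing values, the factor labels, and the ordering of the diameter classes are assumptions about how the data were recorded.

```r
# A subset of the columns shown in head(data), typed in by hand
lizards <- data.frame(
  Alive.Partial.Dead          = c("A", "A", "A", "P", "D", "P"),
  Standing.half.fallen.fallen = c("S", "S", "S", "Fallen", "Fallen", "S"),
  Height   = c("4.2", "1.8", "3", "0", "0.2", "3.5"),
  Diameter = c("?", "5-10cm", "5-10cm", "5-10cm", "5-10cm", "10-20cm"),
  Angle    = c(".", "90", "0", "90", "60", "0"),
  Aspect   = c("?", "SW", "ENE", "S", "?", "W"),
  stringsAsFactors = FALSE
)

# Treat "?" and "." as missing (assumption about the recording convention)
lizards[lizards == "?" | lizards == "."] <- NA

# Categorical variables become factors with explicit levels
lizards$Alive.Partial.Dead <- factor(
  lizards$Alive.Partial.Dead,
  levels = c("A", "P", "D"), labels = c("alive", "partial", "dead")
)
lizards$Standing.half.fallen.fallen <- factor(lizards$Standing.half.fallen.fallen)
lizards$Aspect <- factor(lizards$Aspect)

# Size classes are ordered (only the classes seen in the preview are listed)
lizards$Diameter <- factor(
  lizards$Diameter, levels = c("5-10cm", "10-20cm"), ordered = TRUE
)

# Measurements become numeric
lizards$Height <- as.numeric(lizards$Height)
lizards$Angle  <- as.numeric(lizards$Angle)

str(lizards)
```

With mixed numeric and categorical columns coded this way, a mixed-type dissimilarity such as Gower's (for example cluster::daisy() with metric = "gower") is one option for the distance matrix, which could then be handed to an NMDS routine such as vegan::metaMDS().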

Using full name and maiden name strings (and birthdays) to match individuals across time

I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this for those whose marital status (maiden name) does not change pretty easily in R--stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
However, I'm stuck on how to incorporate the maiden name to this procedure. Any suggestions?
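One possible direction, sketched here in base R rather than data.table, is to key on a "birth surname" that does not change at marriage: the maiden name when one is recorded, otherwise the current last name. The toy rows below are loosely modeled on the preview; the 2001 pre-marriage row and the assumption that nee is missing before marriage are hypothetical.

```r
people <- data.frame(
  first_name = c("eileen", "eileen", "sarah", "eileen"),
  last_name  = c("dxxxx", "aaldxxxx", "aaxxxx", "aaldxxxx"),
  nee        = c(NA, "dxxxx", "gexxxx", "dxxxx"),
  birth_year = c(1977, 1977, 1974, 1977),
  year       = c(2001, 2002, 2003, 2003),
  stringsAsFactors = FALSE
)

# Surname at birth: maiden name if recorded, else the current last name
people$birth_surname <- ifelse(
  is.na(people$nee) | people$nee == "", people$last_name, people$nee
)

# Time-independent ID from first name, birth surname and birth year
key <- paste(people$first_name, people$birth_surname, people$birth_year, sep = "|")
people$id <- match(key, unique(key))
people
```

In this toy example all three eileen rows share one ID even though her last name changes between 2001 and 2002. Hyphenated surnames and people whose maiden name is recorded inconsistently would still need separate handling.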
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments for possible improvements to the code so far.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
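For the stragglers, one simple option in base R is an edit-distance check with adist(). The sketch below is only an illustration: the name vectors are made up around the examples above, and the cutoff of one edit is an assumption that would need tuning (and should probably be combined with an exact match on birth year and last name to limit false positives).

```r
unmatched <- c("tonya", "jenifer", "stacey")   # names still without an ID
known     <- c("tanya", "jennifer", "linda")   # names already assigned IDs

# Levenshtein distance between every unmatched and every known name
d <- adist(unmatched, known)

# Closest known name for each unmatched name
best <- apply(d, 1, which.min)
candidates <- data.frame(
  unmatched = unmatched,
  candidate = known[best],
  distance  = d[cbind(seq_along(unmatched), best)],
  stringsAsFactors = FALSE
)

# Accept only near-identical spellings (assumed cutoff: one edit)
candidates$accept <- candidates$distance <= 1
candidates
```

Here "tonya"/"tanya" and "jenifer"/"jennifer" are each one edit apart and would be accepted, while "stacey" has no close candidate and stays unmatched.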
The basic approach is to build cumulatively: assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, etc. As to how to match, the idea is to expand the matching criteria slowly, since the more robust the match, the lower the chances of mismatching accidentally (I'm particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id <- function(yr, key_from, key_to = key_from,
                   mdis, msch, mard, init, mexp, step) {
  # Want to exclude anyone who is matched
  existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
  # Get the most recent prior observation of all
  #   unmatched teachers, excluding those teachers
  #   who cannot be uniquely identified by the
  #   current key setting
  unmatched <-
    full_data[.(1996:(yr - 1))
              ][!teacher_id %in% existing_ids,
                .SD[.N], by = teacher_id,
                .SDcols = c(key_from, "teacher_id")
                ][, if (.N == 1L) .SD, keyby = key_from
                  ][, (flags) := list(mdis, msch, mard, init, mexp, step)]
  # Merge, reset keys
  setkey(setkeyv(
    full_data, key_to)[year == yr & is.na(teacher_id),
                       (update_cols) := unmatched[.SD, update_cols, with = FALSE]],
    year)
  full_data[.(yr), (update_cols) := lapply(.SD, function(x) na.omit(x)[1]),
            by = id, .SDcols = update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy, c("first_name_clean", "last_name_clean", "birth_year"),
       mdis = TRUE, msch = TRUE, mard = FALSE, init = FALSE, mexp = FALSE, step = 3L)
The final step is to assign new IDs:
current_max <- full_data[.(yy), max(teacher_id, na.rm = TRUE)]
new_ids <-
  setkey(full_data[year == yy & is.na(teacher_id), .(id = unique(id))
                   ][, add_id := .I + current_max], id)
setkey(setkey(full_data, id)[year == yy & is.na(teacher_id),
                             teacher_id := new_ids[.SD, add_id]], year)

How to retrieve/calculate citation counts and/or citation indices from a list of authors?

I have a list of authors.
I wish to automatically retrieve/calculate the (ideally yearly) citation index (h-index, m-quotient,g-index, HCP indicator or ...) for each author.
Author Year Index
first 2000 1
first 2001 2
first 2002 3
I can calculate all of these metrics given the citation counts for each paper of each researcher.
Author Paper Year Citation_count
first 1 2000 1
first 2 2000 2
first 3 2002 3
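Given a table like the one above, the indices themselves are short to compute. The base R sketch below shows the h-index and g-index under their usual definitions; the function names are my own, and the counts are the three rows from the example. A yearly index would additionally need citation counts as of each year, which this table does not contain.

```r
# h-index: largest h such that h papers have at least h citations each
h_index <- function(citations) {
  s <- sort(citations, decreasing = TRUE)
  sum(s >= seq_along(s))
}

# g-index: largest g such that the top g papers have at least g^2 citations in total
g_index <- function(citations) {
  s <- sort(citations, decreasing = TRUE)
  max(c(0, which(cumsum(s) >= seq_along(s)^2)))
}

papers <- data.frame(
  Author         = c("first", "first", "first"),
  Paper          = 1:3,
  Year           = c(2000, 2000, 2002),
  Citation_count = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# One value per author
h_by_author <- tapply(papers$Citation_count, papers$Author, h_index)
g_by_author <- tapply(papers$Citation_count, papers$Author, g_index)
h_by_author   # first: 2
g_by_author   # first: 2
```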
Despite my efforts, I have not found an API/scraping method capable of this.
My institution has access to a number of services including Web of Science.
Effectively the main problem is to build the citation graph. Once you have that you can compute any metrics you want (e.g. h-index, g-index, PageRank).
Supposing you have a collection of papers (that you've retrieved in some way), you can extract the citations from each of them and build the citation graph. You might find ParsCit useful: it is an open-source CRF Reference String and Logical Document Structure Parsing Package that is also used by CiteSeerX and works well.
