How to define the "breaks" to classify raster data - R

First time posting a question here. This forum has helped me countless times, but now I feel my R skills are not strong enough to do the job.
My problem is: I have a spatial data frame with multiple attributes, such as Grid_code (pixel values, integer), Sub_Population (character) and Origin_year (integer). I need to find the break values, in this case 3 break values that place 1/4 of the pixels in each class, giving 4 classes.
Also, these breaks will vary with each unique combination of Sub_population and Origin_year.
SubPop Origin grid_code
AL 2008 4.730380
AL 2008 5.552315
AL 2008 5.968850
AL 2008 5.128384
AL 2009 6.927450
AL 2009 7.135734
ALCentral 2008 7.381087
ALCentral 2008 6.232927
ALCentral 2009 6.431800
ALCentral 2009 6.690246
ALCentral 2009 6.794144
That said, the breaks that allocate the pixels into 4 different classes (1/4 of the pixels in each class) will be a single unique set for each combination of Sub_population and Origin_year.
What I'm thinking of doing:
For each unique combination of Sub_population and Origin_year I'll create a data frame:
cstands_spdf_split <- cstands_select_df[which(
  cstands_select_df$SubPop == "AL" & cstands_select_df$Origin == 2008), ]
Now I need to know how to define the breaks for this unique combination. I was thinking of using the split function with quantiles, but I don't know how this can be done...
In time, as I learn more, I'll update this script to run as a function.
Any feedback is appreciated.
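For the per-combination quartiles, here is a minimal sketch in base R, assuming the column names from the example above (SubPop, Origin, grid_code) and the data frame cstands_select_df:
# Quartile breaks (25th, 50th, 75th percentiles) for every SubPop/Origin combination
breaks_by_group <- lapply(
  split(cstands_select_df$grid_code,
        list(cstands_select_df$SubPop, cstands_select_df$Origin),
        drop = TRUE),
  quantile, probs = c(0.25, 0.50, 0.75))

# Assign each pixel to class 1-4 using its own group's quartiles
cstands_select_df$class <- ave(
  cstands_select_df$grid_code,
  cstands_select_df$SubPop, cstands_select_df$Origin,
  FUN = function(x) cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
                        include.lowest = TRUE, labels = FALSE))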

Related

Portfolio sorts with incomplete data

I have a panel data of stock returns, where after a certain year the coverage universe of stocks doubled. It looks a bit like this:
Year Stock 1 Stock 2 Stock 3 Stock 4
2000 5.1% 0.04% NA NA
2001 3.6% 9.02% NA NA
2002 5.0% 12.09% NA NA
2003 -2.1% -9.05% 1.1% 4.7%
2004 7.1% 1.03% 4.2% -1.1%
.....
Of course, I am trying to maximize my observations both in the time series and in the cross-section as much as possible. However, I am not sure which of these 3 ways to sort would be the most "academically honest":
Sort the years until 2001 using only stocks 1 and 2, and incorporate the remaining stocks in the calculations once they become available in 2003.
Only include those stocks in calculations that have been available since 2000, i.e. stocks 1 and 2. Ignore the remaining stocks altogether since we do not have the full return profile.
Start the sort in year 2003, to have a larger cross-section.
The reason why our coverage universe expands in 2003 is simply because the data provider I am using changed their methodology in that year and decided to track more stocks. Stocks 3 and 4 do exist before 2003, but I cannot use their past return data since I need to follow my data provider (for the second variable I am sorting on).
Thanks all!
I am using the portsort() package in R, but it does not seem to work well with NAs.
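This is not a portsort()-specific fix, but a hedged sketch of how the three subsetting choices could look on a plain returns matrix (the matrix below just re-creates the example, scaled to decimals):
# Re-create the example panel: years as rows, stocks as columns, returns as decimals
R <- matrix(c( 5.1,  0.04,  NA,   NA,
               3.6,  9.02,  NA,   NA,
               5.0, 12.09,  NA,   NA,
              -2.1, -9.05,  1.1,  4.7,
               7.1,  1.03,  4.2, -1.1) / 100,
            ncol = 4, byrow = TRUE,
            dimnames = list(2000:2004, paste0("Stock", 1:4)))

R_option2 <- R[, colSums(is.na(R)) == 0]   # option 2: only stocks observed over the full sample
R_option3 <- R[complete.cases(R), ]        # option 3: only years with the full cross-section
# Option 1 would sort year by year on whichever stocks are non-missing in that year.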

Constrained K-means, R

I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers for a whole night but with no result. Would anyone have ideas on this problem using R? Or is there any package I should look at? Thanks.
More background info:
I am trying to replicate the clustering of relationships using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are old people, and they sometimes report inaccurate age or education info. My main challenge now is that I want to have only one cluster label in each survey year. For example, I do not want to see two occurrences of cluster 3 in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41 (first daughter)   0       3                1997        1
2003         41 (first daughter)   0       3                1997        1
2000         42 (second daughter)  0       4                1999        2
2003         42 (second daughter)  0       4                1999        2
2000         42 (third daughter)   0       5                1999        2
2003         42 (third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family's demographic information, such as birth year, education level, etc., we might need to delete a big part of the data when records do not match.
(e.g., a respondent reported that his first son was 30 years old in 1997 but said his first son was 29 years old in 1999; this record would therefore be problematic). My task is to save as much data as possible when the imprecision is not that high.
Therefore, I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I run k-means when family members are detected to be imprecise. In this way, I save much of the data. Although I did not solve the problem above, it occurs rarely enough that I can almost ignore or drop those observations.
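One possible way to approximate the "each cluster label at most once per survey year" constraint, sketched under several assumptions (column names survey_year, gender, education_level, birth_year; dat is a plain data frame; each year has at most k rows per family): run ordinary k-means first, then within each year re-assign rows to distinct clusters with the Hungarian algorithm from the clue package, minimising distance to the k-means centres.
library(clue)   # provides solve_LSAP() (Hungarian algorithm)

feats <- c("gender", "education_level", "birth_year")
k <- 3
X  <- scale(as.matrix(dat[, feats]))   # drop any constant columns before scaling
km <- kmeans(X, centers = k, nstart = 25)

dat$cluster <- NA_integer_
for (yr in unique(dat$survey_year)) {
  idx  <- which(dat$survey_year == yr)
  cost <- matrix(0, nrow = length(idx), ncol = k)
  for (j in seq_len(k)) {
    # squared distance from each row in this year to centre j
    cost[, j] <- rowSums(sweep(X[idx, , drop = FALSE], 2, km$centers[j, ])^2)
  }
  # one-to-one assignment of this year's rows to clusters
  dat$cluster[idx] <- as.integer(solve_LSAP(cost))
}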

How to add only one observation at a time amongst several observations in R?

Say I have observations for several periods of financial data; how can I create a function in R that adds only one observation at a time throughout my dataset, so that I can compare how a single observation impacts my original data?
Say for instance that I have something like this:
Apple Microsoft Tesla Amazon
2010 0.8533719 0.8078440 0.2620114 0.1869552
2011 0.7462573 0.5127501 0.5452448 0.1369686
2012 0.7580671 0.5062639 0.7847919 0.8362821
2013 0.3154078 0.6960258 0.7303597 0.6057027
2014 0.4741735 0.3906580 0.4515726 0.1396147
2015 0.4230036 0.4728911 0.1262413 0.7495193
2016 0.2396552 0.5001825 0.6732861 0.8535837
2017 0.2007575 0.8875209 0.5086837 0.2211072
#And I define my original covariance matrix as follows:
cov.m <- cov(x[1:5,])
#I would like to add only one new observation at a time, so the results should be:
cov(x[1:5,]), cov(x[1:6,]), cov(x[1:7,]), cov(x[1:8,])
I have tried using rbind and a repeat loop, but it seems like I still have to specify every row to include in rbind, which is quite tedious if I want to test on, say, 100+ observations, since I would then need to specify all of them manually, and the repeat loop would be of no use in that case either.
Does this get you closer to your expected output?
lapply(5:nrow(x), function(y) cov(x[1:y, ]))
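A small extension of the one-liner above (assuming x is the matrix shown in the question): name each covariance matrix by the last row it includes, so the effect of each added observation is easy to inspect.
cov_list <- lapply(5:nrow(x), function(y) cov(x[1:y, ]))
names(cov_list) <- paste0("rows_1_to_", 5:nrow(x))

# e.g. how the estimate changes when the 6th observation (2015) is added
cov_list[["rows_1_to_6"]] - cov_list[["rows_1_to_5"]]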

Are there simple ways to lag (by group) in data frames without workarounds like data tables, xts, zoo, dplyr etc in R?

Whenever I want to lag in a data frame I realize that something that should be simple is not. While the problem has been asked and answered many times (see p.s.), I did not find a simple solution that I can remember until the next time I need to lag. In general, lagging does not seem to be a simple thing in R, as the multiple workarounds testify. I run into this problem often, and it would be very helpful to have some basic R solutions which do not need extra packages. Could you provide your simple solution for lagging?
If that is not possible, could you at least provide your workaround here so we can choose amongst second-best alternatives? One collection already exists here.
Also, in all the blog posts on this subject I see people complain about how unexpectedly difficult lagging is, so how can we get a simple lag function for data frames into R Core? This must be extremely disappointing for anyone coming from Stata or EViews. Or am I missing something, and there is a simple built-in solution?
say we want to lag "value" by 3 "year"s for each "country" here:
Data <- data.frame(year=c(rep(2010:2015,2)),country=c(rep("AT",6),rep("DE",6)),value=rnorm(12))
to create L3 like:
year country value L3
2010 AT 0.3407 NA
2011 AT -1.7981 NA
2012 AT -0.8390 NA
2013 AT -0.6888 0.3407
2014 AT -1.1019 -1.7981
2015 AT -0.8953 -0.8390
2010 DE 0.5877 NA
2011 DE -1.0204 NA
2012 DE -0.6576 NA
2013 DE 0.6620 0.5877
2014 DE 0.9579 -1.0204
2015 DE -0.7774 -0.6576
And we neither want to change the nature of our data (to ts or data.table) nor immerse ourselves in three new packages when the deadline is tonight and our supervisor uses Stata and thinks lagging is easy ;-) (it's not, I just want to be prepared...)
p.s.:
without groups
with data.table: Lag in dataframe or How to create a lag variable within each group?
time series are straightforward
If the question is how to provide a column with the prior third year's value without using packages, then try this:
prior_year3 <- function(x, k = 3) head(c(rep(NA, k), x), length(x))
transform(Data, prior_year_value = ave(value, country, FUN = prior_year3))
giving:
year country value prior_year_value
1 2010 AT -1.66562121 NA
2 2011 AT -0.04950063 NA
3 2012 AT 1.55930293 NA
4 2013 AT -0.40462394 -1.66562121
5 2014 AT 0.78602610 -0.04950063
6 2015 AT 0.73912916 1.55930293
7 2010 DE 1.03710539 NA
8 2011 DE -1.13370942 NA
9 2012 DE -1.20530981 NA
10 2013 DE 1.66870572 1.03710539
11 2014 DE 1.53615793 -1.13370942
12 2015 DE -0.09693335 -1.20530981
That said, to use R effectively you do need to learn how to use the key packages.
Try slide from the DataCombine package, it's simple:
slide(Data, Var = 'value', GroupVar = 'country', slideBy = -3)
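For completeness, a compact base-R variant in the same spirit as the ave() answer above (assuming the data are sorted by year within country and each country has more than 3 years):
# Pad with NAs and drop the last 3 values, per country
Data$L3 <- ave(Data$value, Data$country,
               FUN = function(x) c(rep(NA, 3), head(x, -3)))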

Using full name and maiden name strings (and birthdays) to match individuals across time

I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this pretty easily in R for those whose marital status (maiden name) does not change--stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
However, I'm stuck on how to incorporate the maiden name to this procedure. Any suggestions?
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments on possible improvements to the code.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
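For the inexact first-name stragglers ("tonya" vs. "tanya"), here is a rough sketch of one option using base R's adist() (Levenshtein distance); the helper name, the id column, and the distance threshold are all assumptions, not part of the pipeline below:
# For a still-unmatched row, look for an already-identified record with the same
# last name, maiden name and birth year whose first name is within a small edit distance.
fuzzy_first_name_id <- function(row, matched, max_dist = 2) {
  cand <- matched[matched$last_name  == row$last_name &
                  matched$nee        == row$nee &
                  matched$birth_year == row$birth_year, ]
  if (nrow(cand) == 0L) return(NA_integer_)
  d <- adist(row$first_name, cand$first_name)   # Levenshtein distances, 1 x nrow(cand)
  if (min(d) <= max_dist) cand$id[which.min(d)] else NA_integer_
}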
The basic approach is to build up cumulatively--assign IDs in the first year, then look for matches in the second year and assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, etc. As for how to match, the idea is to slowly expand the matching criteria--the more robust the match, the lower the chance of an accidental mismatch (I'm particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id <- function(yr, key_from, key_to = key_from,
                   mdis, msch, mard, init, mexp, step){
  # Want to exclude anyone who is matched
  existing_ids <- full_data[.(yr), unique(na.omit(teacher_id))]
  # Get the most recent prior observation of all
  #   unmatched teachers, excluding those teachers
  #   who cannot be uniquely identified by the
  #   current key setting
  unmatched <-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N], by = teacher_id,
                .SDcols = c(key_from, "teacher_id")
              ][, if (.N == 1L) .SD, keyby = key_from
              ][, (flags) := list(mdis, msch, mard, init, mexp, step)]
  # Merge, reset keys
  setkey(setkeyv(
    full_data, key_to)[year == yr & is.na(teacher_id),
                       (update_cols) := unmatched[.SD, update_cols, with = F]],
    year)
  full_data[.(yr), (update_cols) := lapply(.SD, function(x) na.omit(x)[1]),
            by = id, .SDcols = update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
mdis=T,msch=T,mard=F,init=F,mexp=F,step=3L)
The final step is to assign new IDs:
current_max <- full_data[.(yy), max(teacher_id, na.rm = TRUE)]
new_ids <-
  setkey(full_data[year == yy & is.na(teacher_id), .(id = unique(id))
                   ][, add_id := .I + current_max], id)
setkey(setkey(full_data, id)[year == yy & is.na(teacher_id),
                             teacher_id := new_ids[.SD, add_id]], year)