When defining a hierarchy you can specify the Fact Aggr. Type as "Member and Ancestors" or "Member only".
In my understanding "Member and Ancestors" looks like a double count (or a triple count if the hierarchy has three levels).
As this is the default, my understanding is probably wrong. Do you have a small example of both cases?
Country City Qty
Swiss Geneva 2
Swiss Lausanne 3
France Lyon 4
France Ferney 5
Values are not counted several times:
Member and Ancestors (the default): a member aggregates the facts defined for itself and its descendants.
Member only: a member aggregates the facts defined for itself only.
Let's do an example, for a dimension
2013
2013 Q1
2013 Q2
2013 Q3
and for facts that bind all levels of the date dimension :
date value
2013 3 ' not common, but it makes the difference
2013 Q1 1
2013 Q2 2
2013 Q3 2
The value for ( [2013],[Value] ) will be
Member and Ancestors: 8 (3+1+2+2)
Member only: 3 (we ignore the children)
This is handy if a hierarchy does not aggregate its children. You can see this as a 'fake' hierarchy.
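The two aggregation types can be illustrated with a small Python sketch. This is not icCube code; the `children` and `facts` dictionaries and both function names are hypothetical and simply mirror the example above.

```python
# Hypothetical in-memory model of the example: one year with three quarters,
# and facts bound to every level of the date dimension.
children = {"2013": ["2013 Q1", "2013 Q2", "2013 Q3"]}
facts = {"2013": 3, "2013 Q1": 1, "2013 Q2": 2, "2013 Q3": 2}


def member_only(member):
    """'Member only': use the facts bound to the member itself."""
    return facts.get(member, 0)


def member_and_ancestors(member):
    """'Member and Ancestors': the member's own facts plus those of its descendants."""
    total = facts.get(member, 0)
    for child in children.get(member, []):
        total += member_and_ancestors(child)
    return total


print(member_and_ancestors("2013"))  # 8 (3 + 1 + 2 + 2)
print(member_only("2013"))           # 3
```

Each fact is added exactly once, which is why nothing is double counted.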
I'm having difficulty using data.table operations to correctly manipulate my data. The goal is, by group, to create a number of rows based on the values of two date columns. I've altered the data here to protect it, but it gets the idea across.
head(my_data_table, 6)
team_name play_name first_detected last_detected PlayID
1: Baltimore Play Action 2016 2017 41955-58
2: Washington Four Verticals 2018 2020 54525-52
3: Dallas O1 Trap 2019 2019 44795-17
4: Dallas Play Action 2020 2020 41955-58
5: Dallas Power Zone 2020 2020 54782-29
6: Dallas Bubble Screen 2018 2018 52923-70
The goal is to turn it into this
team_name play_name year PlayID
1: Baltimore Play Action 2016 41955-58
2: Baltimore Play Action 2017 41955-58
3: Washington Four Verticals 2018 54525-52
4: Washington Four Verticals 2019 54525-52
5: Washington Four Verticals 2020 54525-52
6: Dallas O1 Trap 2019 44795-17
...
n: Dallas Bubble Screen 2018 52923-70
The code I tried for this purpose is the following:
my_data_table[,.(PlayID, year = seq(first_detected,last_detected,by=1)), by = .(team_name, play_name)]
When I run this code, I get:
Error in seq.default(first_detected_ever, last_detected_ever, by = 1) :
'from' must be of length 1
Two other attempts also failed
my_data_table[,.(PlayID, year = seq(min(first_detected),max(last_detected),by=1)), by = .(team_name, play_name)]
my_data_table[,.(PlayID, year = list(seq(min(first_detected),max(last_detected),by=1))), by = .(team_name, play_name)]
which both result in something that looks like
by year PlayID
1: Baltimore Washington Dallas Play Action 2011, 2012, 2013, 2014, 2015, 2016 ... 41955-58
...
In as.data.table.list(jval, .named = NULL) :
Item 3 has 2 rows but longest item has 38530489; recycled with remainder.
I haven't found any clear answers on why this is happening. It seems that, when I pass "first_detected" and "last_detected", it somehow interprets them as the entire range of the column's values, even though I pass by = .(team_name, play_name), which always identifies exactly one row (I have verified this). Within each "by" group there should therefore be only one value of first_detected and last_detected. I've done something similar before, but the difference was that I wasn't using a "by = .(x,y,z,...)" grouping and instead applied the operation to each row. Could anyone help me understand why I am unable to get the desired output with this data.table method?
Despite struggling with this for hours, I managed to solve my own question only a short while later.
The code
my_data_table[,.(PlayID, year = first_detected:last_detected), by = .(team_name, play_name)]
Produces the desired result, creating, by group, a row that has each year inclusive, so long as first_detected and last_detected are integers.
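For readers who want to see the expansion logic outside data.table, here is a minimal Python sketch of the same idea: one output row per year in the inclusive range. The `expand_years` function is a hypothetical helper, and the rows are the first three from the sample above.

```python
# Three rows from the sample data, as plain dictionaries.
rows = [
    {"team_name": "Baltimore", "play_name": "Play Action",
     "first_detected": 2016, "last_detected": 2017, "PlayID": "41955-58"},
    {"team_name": "Washington", "play_name": "Four Verticals",
     "first_detected": 2018, "last_detected": 2020, "PlayID": "54525-52"},
    {"team_name": "Dallas", "play_name": "O1 Trap",
     "first_detected": 2019, "last_detected": 2019, "PlayID": "44795-17"},
]


def expand_years(rows):
    """Return one row per year between first_detected and last_detected, inclusive."""
    expanded = []
    for row in rows:
        # range() excludes its upper bound, hence the + 1.
        for year in range(row["first_detected"], row["last_detected"] + 1):
            expanded.append({
                "team_name": row["team_name"],
                "play_name": row["play_name"],
                "year": year,
                "PlayID": row["PlayID"],
            })
    return expanded


expanded = expand_years(rows)
for row in expanded:
    print(row["team_name"], row["play_name"], row["year"], row["PlayID"])
```

As with the `:` operator in R, this assumes the two year columns hold integers.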
Stata and R:
I have two cross-sectional datasets I'm merging. The two datasets have an equal number of countries, and only one dataset has zero missing years (year). The problem is that the missing years are simply not recorded, so I need to make a new variable that would add the years where there is no other data. Otherwise, I cannot merge the datasets according to the two keys, country and year.
Not so -- in Stata (and I would be surprised at a problem in R, but others must speak to that).
Missing observations -- in this context and any similar better called absent -- are not a problem. Here's a demonstration. merge is smart enough to notice gaps and make them explicit as missings. You could "fix" them yourself ahead of the merge, but that is pointless.
clear
input state year y
1 2019 1
1 2020 2
2 2019 3
2 2020 4
end
save tomerge
clear
input state year x
1 2019 42
2 2019 84
end
merge 1:1 state year using tomerge
list
Results
. merge 1:1 state year using tomerge
Result Number of obs
-----------------------------------------
Not matched 2
from master 0 (_merge==1)
from using 2 (_merge==2)
Matched 2 (_merge==3)
-----------------------------------------
.
. list
+----------------------------------------+
| state year x y _merge |
|----------------------------------------|
1. | 1 2019 42 1 Matched (3) |
2. | 2 2019 84 3 Matched (3) |
3. | 1 2020 . 2 Using only (2) |
4. | 2 2020 . 4 Using only (2) |
+----------------------------------------+
Otherwise put, 1:1 as syntax specifies the overall pattern and doesn't rule out 0:1 or 1:0 matches. merge will actually append if identifiers don't match at all. You do need the key variables to exist under identical names in both datasets.
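The same behaviour can be sketched in Python for readers who use neither Stata nor R. This is an illustration of the idea, not of Stata's implementation: `full_merge` is a hypothetical helper that keeps every key found in either dataset and records where each row came from.

```python
# The two datasets from the demonstration, keyed by (state, year).
master = {(1, 2019): 42, (2, 2019): 84}                               # variable x
using = {(1, 2019): 1, (1, 2020): 2, (2, 2019): 3, (2, 2020): 4}      # variable y


def full_merge(master, using):
    """Keep every (state, year) key from either side; absent values become None."""
    merged = []
    for key in sorted(set(master) | set(using)):
        in_master = key in master
        in_using = key in using
        if in_master and in_using:
            status = "matched"
        elif in_master:
            status = "master only"
        else:
            status = "using only"
        state, year = key
        merged.append((state, year, master.get(key), using.get(key), status))
    return merged


merged = full_merge(master, using)
for row in merged:
    print(row)
```

The rows for 2020 appear with a missing x, so there is no need to create the absent years beforehand.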
I am currently using k-means to cluster my data; however, I want each cluster to appear only once in each given year. I have searched for answers for a whole night without result. Does anyone have ideas on how to approach this problem in R? Or is there a package I should look at? Thanks.
More background information:
I am trying to replicate the clusters of relationships using the reported gender, education level and birth year. I am doing this because this is survey data whose respondents are elderly and sometimes report inaccurate age or education information. My main challenge now is that I want each cluster label to appear only once in each survey year. For example, I do not want two observations labelled cluster 3 in survey year 2000. My data looks like this:
survey year  relationship          gender  education level  birth year  k-means cluster
2000         41( first daughter)   0       3                1997        1
2003         41( first daughter)   0       3                1997        1
2000         42( second daughter)  0       4                1999        2
2003         42( second daughter)  0       4                1999        2
2000         42( third daughter)   0       5                1999        2
2003         42( third daughter)   0       5                2001        3
Thanks in advance.
--Update--
A more detailed description of the task:
The data set is panel survey data asking elders about their health status and their relationships (incl. sons, daughters, neighbors). Since these older people are sometimes imprecise about their family's demographic information such as birth year, education level, etc., we might need to delete a big part of the data if it does not match.
(E.g., A reported that his first son was 30 years old in 1997, but said his first son was 29 years old in 1999; this data could therefore be problematic.) My task is to save as much data as possible if the imprecision is not that high.
Therefore I first mutated columns to check the precision of each family member (e.g., birth year error %in% c(-1,2)). Next, I run k-means if the family members are detected to be imprecise. In this way, I save much of the data. Although I did not solve the above problem, it occurs so rarely that I can almost ignore or drop these observations.
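The precision check described above can be sketched roughly as follows in Python. This is only an illustration of the idea, not the code I actually ran: the function names and the `tolerance` parameter are hypothetical, and the reports are the first-son example from the update.

```python
def implied_birth_years(reports):
    """reports: list of (survey_year, reported_age) pairs for one family member."""
    return [survey_year - age for survey_year, age in reports]


def is_consistent(reports, tolerance):
    """True if the implied birth years differ by at most `tolerance` years.

    `tolerance` is a hypothetical threshold chosen by the analyst.
    """
    years = implied_birth_years(reports)
    return max(years) - min(years) <= tolerance


# First son reported as 30 years old in 1997 and 29 years old in 1999.
first_son = [(1997, 30), (1999, 29)]
print(implied_birth_years(first_son))          # [1967, 1970]
print(is_consistent(first_son, tolerance=1))   # False
```

Members flagged as inconsistent are the ones that would then be passed on to the clustering step.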
MarriageLicen
Year Month Amount
1 2011 Jan 742
2 2011 Feb 796
3 2011 Mar 1210
4 2011 Apr 1376
BusinessLicen
Month Year MARRIAGE_LICENSES
1 Jan 2011 754
2 Feb 2011 2706
3 Mar 2011 2689
4 Apr 2011 738
My question is: how can we predict the number of Marriage Licenses (Y) issued by the city using the number of Business Licenses (X)?
And how can we join the two datasets together?
I was told that you can join them using the combined key of Month and Year.
But I have been struggling with this question for several days.
There are three options here.
The first is to just be direct. I'm going to assume you have the labels swapped around for the data frames in your example (it doesn't make a whole lot of sense to have a MARRIAGE_LICENSES variable in the BusinessLicen data frame, if I'm following what you are trying to do).
You can model the relationship between those two variables with:
my.model <- lm(MarriageLicen$MARRIAGE_LICENSES ~ BusinessLicen$Amount)
The second (not very rational) option would be to make a new data frame explicitly, since it looks like you have an exact match on each of your rows:
new.df <- data.frame(marriage.licenses=MarriageLicen$MARRIAGE_LICENSES, business.licenses=BusinessLicen$Amount)
my.model <- lm(marriage.licenses ~ business.licenses, data=new.df)
Finally, if you don't actually have the perfect alignment shown in your example you can use merge.
my.df <- merge(BusinessLicen, MarriageLicen, by=c("Month", "Year"))
my.model <- lm(MARRIAGE_LICENSES ~ Amount, data=my.df)
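To make the join-then-fit logic explicit, here is a rough Python sketch of the third option using only the standard library. It follows the same assumption as above (the data frame labels in the question are swapped), and `inner_join` and `fit_line` are hypothetical helpers, not part of any package.

```python
# Values from the question, keyed by (Month, Year). Following the assumption
# above, the MARRIAGE_LICENSES column is treated as the marriage data (Y)
# and the Amount column as the business data (X).
business = {("Jan", 2011): 742, ("Feb", 2011): 796,
            ("Mar", 2011): 1210, ("Apr", 2011): 1376}
marriage = {("Jan", 2011): 754, ("Feb", 2011): 2706,
            ("Mar", 2011): 2689, ("Apr", 2011): 738}


def inner_join(left, right):
    """Keep only keys present in both tables, like merge() with default settings."""
    return [(key, left[key], right[key]) for key in left if key in right]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


joined = inner_join(business, marriage)
xs = [b for _, b, _ in joined]   # business licenses (X)
ys = [m for _, _, m in joined]   # marriage licenses (Y)
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

Any month that is present in only one of the two tables is dropped by the join, which is the same behaviour as merge with its default arguments.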
I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this for those whose marital status (maiden name) does not change pretty easily in R--stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
However, I'm stuck on how to incorporate the maiden name to this procedure. Any suggestions?
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments for possible improvements to the code so far.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
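One possible direction for the stragglers, shown here as a Python sketch rather than R, is a string-similarity ratio with a cutoff. `difflib.SequenceMatcher` is part of the Python standard library; the `is_probable_match` helper and the 0.75 threshold are assumptions for illustration and would need tuning against real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 (no overlap) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def is_probable_match(a, b, threshold=0.75):
    """Hypothetical rule: treat two names as the same person if they are similar enough."""
    return similarity(a, b) >= threshold


for left, right in [("tonya", "tanya"), ("jenifer", "jennifer"), ("eileen", "stacey")]:
    print(left, right, round(similarity(left, right), 3), is_probable_match(left, right))
```

A looser threshold catches more misspellings but also raises the risk of merging different people, so it would make sense to apply it only after the exact-match steps and together with birth year.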
The basic approach is to build cumulatively--assign IDs in the first year, then look for matches in the second year; assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, etc. As to how to match, the idea is to slowly expand the matching criteria--the idea being that the more robust the match, the lower the chances of mismatching accidentally (particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id<-function(yr,key_from,key_to=key_from,
                 mdis,msch,mard,init,mexp,step){
  #Want to exclude anyone who is matched
  existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
  #Get the most recent prior observation of all
  #  unmatched teachers, excluding those teachers
  #  who cannot be uniquely identified by the
  #  current key setting
  unmatched<-
    full_data[.(1996:(yr-1))
              ][!teacher_id %in% existing_ids,
                .SD[.N],by=teacher_id,
                .SDcols=c(key_from,"teacher_id")
                ][,if (.N==1L) .SD,keyby=key_from
                  ][,(flags):=list(mdis,msch,mard,init,mexp,step)]
  #Merge, reset keys
  setkey(setkeyv(
    full_data,key_to)[year==yr&is.na(teacher_id),
                      (update_cols):=unmatched[.SD,update_cols,with=F]],
    year)
  full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
            by=id,.SDcols=update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
mdis=T,msch=T,mard=F,init=F,mexp=F,step=3L)
The final step is to assign new IDs:
current_max<-full_data[.(yy),max(teacher_id,na.rm=T)]
new_ids<-
setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
teacher_id:=new_ids[.SD,add_id]],year)
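The cumulative idea (match against earlier years using progressively looser keys, then give new IDs to whoever is left) can be condensed into a short Python sketch. It is a simplification of the approach, not a translation of the R code: `assign_ids` and `key_sets` are hypothetical names, the check that a key identifies a unique person is omitted, and the last record is a made-up example of a surname change.

```python
def assign_ids(records, key_sets):
    """Assign a time-independent ID to each record.

    records:  list of dicts, sorted by year.
    key_sets: tuples of field names, strictest match first.
    """
    lookups = [dict() for _ in key_sets]   # one lookup table per key set
    next_id = 1
    ids = []
    for rec in records:
        found = None
        # Try the strictest key first, then progressively looser ones.
        for fields, lookup in zip(key_sets, lookups):
            key = tuple(rec[f] for f in fields)
            if key in lookup:
                found = lookup[key]
                break
        if found is None:                  # unmatched: assign a new ID
            found = next_id
            next_id += 1
        # Register this record under every key so later years can match it.
        for fields, lookup in zip(key_sets, lookups):
            lookup[tuple(rec[f] for f in fields)] = found
        ids.append(found)
    return ids


records = [
    {"first_name": "eileen", "last_name": "aaldxxxx", "nee": "dxxxx", "birth_year": 1977, "year": 2002},
    {"first_name": "eileen", "last_name": "aaldxxxx", "nee": "dxxxx", "birth_year": 1977, "year": 2003},
    {"first_name": "sarah", "last_name": "aaxxxx", "nee": "gexxxx", "birth_year": 1974, "year": 2003},
    # Hypothetical record: same person as above after a change of last name.
    {"first_name": "sarah", "last_name": "bxxxx", "nee": "gexxxx", "birth_year": 1974, "year": 2004},
]
key_sets = [
    ("first_name", "last_name", "birth_year"),   # strict: name unchanged
    ("first_name", "nee", "birth_year"),         # looser: match on maiden name
]
ids = assign_ids(records, key_sets)
print(ids)  # [1, 1, 2, 2]
```

Because every record is registered under all key sets, a person who changes last name is still found through the maiden-name key in later years.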