Generate variable by country for missing years? - r

Stata and R:
I have two cross-sectional datasets I'm merging. The two datasets have the same number of countries, but only one of them has no missing years (year). In the other, the missing years are simply not recorded, so I think I need to create the observations for the years where there is no other data; otherwise I cannot merge the datasets on the two keys, country and year.

Not so -- in Stata, at least (and I would be surprised at a problem in R, but others must speak to that).
Missing observations -- in this and similar contexts better called absent -- are not a problem. merge is smart enough to notice the gaps and make them explicit as missing values. You could "fix" them yourself ahead of the merge, but that is pointless. Here's a demonstration.
clear
input state year y
1 2019 1
1 2020 2
2 2019 3
2 2020 4
end
save tomerge
clear
input state year x
1 2019 42
2 2019 84
end
merge 1:1 state year using tomerge
list
Results
. merge 1:1 state year using tomerge
    Result                      Number of obs
    -----------------------------------------
    Not matched                             2
        from master                         0  (_merge==1)
        from using                          2  (_merge==2)

    Matched                                 2  (_merge==3)
    -----------------------------------------
.
. list
+----------------------------------------+
| state year x y _merge |
|----------------------------------------|
1. | 1 2019 42 1 Matched (3) |
2. | 2 2019 84 3 Matched (3) |
3. | 1 2020 . 2 Using only (2) |
4. | 2 2020 . 4 Using only (2) |
+----------------------------------------+
Put another way, the 1:1 syntax specifies the overall pattern; it doesn't rule out 0:1 or 1:0 matches. merge will in effect append observations whose identifiers don't match anything at all. You do need the key variables to exist under identical names in both datasets.
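For comparison, here is a minimal R sketch of the same demonstration, using base merge() (dplyr::full_join() behaves the same way); the data frames simply mirror the Stata example above:

tomerge <- data.frame(state = c(1, 1, 2, 2),
                      year  = c(2019, 2020, 2019, 2020),
                      y     = 1:4)
master  <- data.frame(state = c(1, 2),
                      year  = c(2019, 2019),
                      x     = c(42, 84))
# all = TRUE keeps unmatched rows from both sides and fills the gaps
# with NA, just as Stata flags them with _merge==2
merge(master, tomerge, by = c("state", "year"), all = TRUE)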

Related

Tidy one unique identifier across several time period tables in r

I want to import data and tidy it in R. I have achieved some of the results I want using functions in Excel, but it is tedious and must be redone by hand each time I get a new Excel file with updated data. I have an Excel file with a separate worksheet for each time period. This file is updated multiple times each year, keeping the same layout but adding data, including additional time period worksheets. Each worksheet follows the same format, as follows:
Student_ID| Major_ID | Gender | Age | Semester_Registered | Marital_Status | Home_State
20130001 | 10022 | M | 22 | 3 | S | AZ
20130002 | 10022 | F | 23 | 5 | M | CA
20140001 | 10022 | M | 21 | 3 | M | CA
20140004 | 10034 | F | 24 | 4 | S | AZ
These would be the first few records of a given time period worksheet, let's say 2016_Semester_1.

- Student_ID is assigned to a student when they register for classes and serves as a unique identifier.
- Major_ID corresponds to a table with Major_ID, Major_Name, and Campus. The codes stay the same for each worksheet, but a student can change majors or change campus, so Major_ID could differ for a given student from one time period to another.
- Gender and Age are self-explanatory.
- Semester_Registered is a number from 1 to 8. When a student first registers for classes, they are in Semester_Registered 1; in the second semester of their first year they move on to 2; in the first semester of their sophomore year they are in 3; and so on, up to 8 in the second semester of their senior year. However, some students do not move through the sequence at the normal rate, for example if they have to repeat a semester due to failed courses, or if they have to leave the university for a time to earn more money before returning to continue their studies.
- Marital_Status is S for Single, M for Married, D for Divorced, or W for Widowed.
- Home_State is the two-letter abbreviation for the US state the student is from, mainly needed to see whether the student qualifies for in-state tuition rates, but also useful for reports showing where most students come from, to focus marketing activities on those states.
The Excel workbook that I have contains a worksheet for each academic semester from 2014_1 to 2019_1. I want to consolidate the data and tidy it in two main ways. First, I want to make a new table for each freshman class: one table for those who were in Semester_Registered 1 in the 2014_1 semester, another for the 2015_1 semester, and so on up through 2019_1. The headers for these tables would look like this:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Major_ID_End | Age_End | Marital_Status_End | Still_Enrolled
All of the records in a given table would have the same First_Semester value, such as 2014_1 or 2015_1.

- Student_ID is the identifier.
- Major_ID_Start is the Major_ID the student had in First_Semester.
- Gender probably only needs to be collected once, from First_Semester.
- Age_Start and Marital_Status_Start are their respective values as listed in First_Semester.
- Final_Semester_Time requires looking through each time period worksheet until the given Student_ID no longer appears on the list of registered students. For students who graduate, this should be the time period in which Semester_Registered equals 8; some students drop out before graduation, so for them it shows the time period in which they were last registered.
- Final_Semester_Registered shows the value of Semester_Registered in Final_Semester_Time. This should be 8 if the student graduated; if not, it shows how far the student advanced in their studies before dropping out.
- Graduated_On_Time is either true or false: true if the student reached Semester_Registered 8 exactly 4 years after their First_Semester year, such as a student who started their freshman year in 2014_1 and graduated at the end of 2018_2.
- Graduated_Late is also true or false, and is true if the student reached Semester_Registered 8 at some point more than 4 years after their First_Semester year.
- Major_ID_End shows the Major_ID registered in the last semester in which the given Student_ID appears in the list of registered students, and is useful to compare with Major_ID_Start to see whether the student changed majors.
- Age_End and Marital_Status_End record their respective values in the time period of Final_Semester_Time.
- Still_Enrolled is true or false, and is true if the Student_ID is still present in the latest time period worksheet. At present this would be 2019_1, but it would be ideal for this to update in the future to use the latest time period sheet included in the data (in a few months, for example, we will put in new data that includes 2019_2).
Second, I want a table simply showing Student_ID of students who are no longer registered in the latest time period. This would have column headers as follows:
First_Semester | Student_ID | Major_ID_Start | Gender | Age_Start | Marital_Status_Start | Final_Semester_Time | Final_Semester_Registered | Graduated_On_Time | Graduated_Late | Dropped_Out | Major_ID_End | Age_End | Marital_Status_End
The columns are the same as in the previous example, except for Dropped_Out, which is true if the student's Final_Semester_Registered is less than 8, and false otherwise. The key point is that this table should only include Student_IDs where Still_Enrolled is false; it serves as a consolidated list of all the students who used to be enrolled in the university but no longer are, allowing comparison between those who graduated on time, those who graduated late, and those who dropped out.
I have achieved some of these results using Excel, but it is a drawn-out manual process that must be redone every time the data file updates. Excel has also become fairly slow at loading the file and recalculating the formulas, so I would like to move this to R. For reference, though, here are some of the formulas I used in Excel, to give an idea of what might be adaptable to R.
I have a consolidated table with each Student_ID as a row and it includes columns like:
Student_ID | Major_ID_2014_1 | Major_ID_2014_2 | Major_ID_2015_1 | Semester_Registered_2014_1 | Semester_Registered_2014_2 | Semester_Registered_2015_1 | Final_Semester_Time | Final_Semester_Registered | Age_Start | Age_End
This is abbreviated: the real table includes both Major_ID and Semester_Registered columns from 2014_1 up through 2019_1, but here I am just showing up to 2015_1 to give the idea.
The formula for Major_ID_2014_1 is =IFERROR(INDEX(Semester_2014_1,MATCH(Student_ID_Cell,Student_IDs_2014_1,0)),"") where Semester_2014_1 and Student_IDs_2014_1 are named ranges in the worksheet for the time period 2014_1, covering the relevant rows. A similar formula uses a different named range for the rows related to Semester_Registered.

Then I can use something like =IF(SUMPRODUCT(1/COUNTIF(F3:R3,F3:R3))<3,FALSE,TRUE) on the range of Major_ID cells from 2014_1 to 2019_1 (each in its own column) to see whether the Major_ID changed (meaning the student changed majors or changed campuses), and I can use a MAX() formula over the range of Semester_Registered columns to find the highest semester the student reached.

A formula like =LOOKUP(2,1/(V3:AH3<>""),$V$2:$AH$2), which goes over the same range of Semester_Registered columns where the second row has headers like 2014_1, 2014_2, etc., returns the last column that is not blank (thus the last time period in which the student was registered). This can then be used with INDIRECT() to reference a named data set (I had to manually name all the data sets in each worksheet by time period), like =IFERROR(INDEX(INDIRECT(CONCATENATE("DATA_",AK3)),MATCH(T3,INDIRECT(CONCATENATE("Student_IDs_",AK3)),0),4),"") where AK3 contains the Final_Semester_Time, like 2014_1.
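As a rough starting point, here is a minimal R sketch of the consolidation step, using readxl, dplyr, and purrr. It is an outline under assumptions, not a tested solution: the file name students.xlsx is hypothetical, the sheet names are assumed to follow the 2014_1 pattern, and the columns are assumed to match the headers shown above.

library(readxl)
library(dplyr)
library(purrr)

path   <- "students.xlsx"            # hypothetical file name
sheets <- excel_sheets(path)         # e.g. "2014_1", "2014_2", ..., "2019_1"

# Stack every worksheet into one long table, tagging rows with their sheet
long <- map_dfr(sheets, function(s) {
  read_excel(path, sheet = s) %>% mutate(Time_Period = s)
})

# One row per student, replacing the INDEX/MATCH and LOOKUP formulas
per_student <- long %>%
  group_by(Student_ID) %>%
  arrange(Time_Period, .by_group = TRUE) %>%  # "YYYY_S" names sort correctly
  summarise(
    First_Semester            = first(Time_Period),
    Major_ID_Start            = first(Major_ID),
    Age_Start                 = first(Age),
    Marital_Status_Start      = first(Marital_Status),
    Final_Semester_Time       = last(Time_Period),
    Final_Semester_Registered = last(Semester_Registered),
    Major_ID_End              = last(Major_ID),
    Age_End                   = last(Age),
    Marital_Status_End        = last(Marital_Status),
    Still_Enrolled            = last(Time_Period) == max(sheets)
  )

The cohort tables and the no-longer-enrolled table would then be simple filters on per_student, e.g. filter(per_student, First_Semester == "2014_1") or filter(per_student, !Still_Enrolled).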

CART Methodology for data with mutually exhaustive rows

I am trying to use CART to analyse a data set in which each row is a segment, for example:
Segment_ID | Attribute_1 | Attribute_2 | Attribute_3 | Attribute_4 | Target
1          | 2           | 3           | 100         | 3           | 0.1
2          | 0           | 6           | 150         | 5           | 0.3
3          | 0           | 3           | 200         | 6           | 0.56
4          | 1           | 4           | 103         | 4           | 0.23
Each segment has a certain population from the base data (irrelevant to my final use).
I want to condense the segments, in the above case the 4 segments into 2 big segments, based on the 4 attributes and on the target variable. I am currently dealing with 15k segments and want to end up with only about 10, each final segment based on the target and also having a sensible attribute distribution.
Now, pardon me if I am wrong, but CHAID in SPSS (if not using autogrow) will generally split the data 70:30, building the tree on 70% of the data and testing on the remaining 30%. I can't use this approach since I need all my segments in the data to be included. I essentially want to club these segments into a few big segments, as explained above. My question is whether I can use CART (rpart in R) for this. There is an explicit 'subset' option in the rpart function in R, but I am not sure whether omitting it ensures that CART uses 100% of my data. I am relatively new to R, hence a very basic question.
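To the direct question: omitting subset does mean rpart fits on all rows. A minimal sketch under that reading, with column names as in the example above and a hypothetical data frame called segments:

library(rpart)

# rpart grows the tree on every row of `data` unless you pass `subset`,
# so there is no 70:30 holdout split of the kind CHAID in SPSS applies
fit <- rpart(Target ~ Attribute_1 + Attribute_2 + Attribute_3 + Attribute_4,
             data    = segments,        # hypothetical: one row per segment
             method  = "anova",         # regression tree for a numeric target
             control = rpart.control(cp = 0.001))

# Each row's leaf is recorded in fit$where; rows sharing a leaf form one
# condensed segment. Tune cp (or prune) until about 10 leaves remain.
segments$big_segment <- factor(fit$where)
table(segments$big_segment)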

Friedman test unreplicated complete block design error

I'm having trouble running a Friedman test over my data.
I'm trying to run a Friedman test using this command:
friedman.test(mean ~ isi | expId, data=monoSum)
On the following database (https://www.dropbox.com/s/2ox0y1b4gwld0ai/monoSum.csv):
> monoSum
expId isi N mean
1 m80B1 1 10 100.000000
2 m80B1 2 10 73.999819
3 m80B1 3 10 45.219362
4 m80B1 4 10 116.566174
. . . . .
18 m80L2 2 10 82.945491
19 m80L2 3 10 57.675480
20 m80L2 4 10 207.169277
. . . . . .
25 m80M2 1 10 100.000000
26 m80M2 2 10 49.752687
27 m80M2 3 10 19.042592
28 m80M2 4 10 150.411035
It gives me back the error:
Error in friedman.test.default(c(100, 73.9998193095267, 45.2193621626293, :
not an unreplicated complete block design
I figure it gives the error because, when monoSum$isi==1, the value of mean is always 100. Is this correct?
However, monoSum$isi==1 is always 100 because it is the control group on which all the other monoSum$isi groups are normalized. I cannot assume a normal distribution, so I cannot run a rmANOVA…
Is there a way to run a friedman test on this data or am I missing a very essential point here?
Many thanks in advance!
I don't get an error if I run your dataset:
Friedman rank sum test
data: mean and isi and expId
Friedman chi-squared = 17.9143, df = 3, p-value = 0.0004581
However, you have to make sure that expId and isi are coded as factors. Run these commands:
monoSum$expId <- factor(monoSum$expId)
monoSum$isi <- factor(monoSum$isi)
Then run the test again. This has worked for me with a similar problem.
I know this is pretty old but for future generations (see also: me when I forget and google this again):
You can determine which values are missing in your dataframe by running table(groups, blocks), or in the case of this question table(monoSum$isi, monoSum$expId). This will return a table of 0s and 1s; the missing records are in the cells with 0s.
I ran into this problem after trying to remove the blocks that had incomplete results; taking a subset of the data did not remove the blocks for some reason.
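A likely explanation for that: subsetting a data frame keeps the factor's original levels, so friedman.test() still sees the removed blocks. Dropping the unused levels fixes it; a minimal sketch with the names from this question (the subset condition itself is hypothetical):

complete <- subset(monoSum, expId != "m80B1")  # drop an incomplete block
complete <- droplevels(complete)               # discard now-empty factor levels
table(complete$isi, complete$expId)            # every cell should now be 1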
Just thought I would mention that I found this post because I was getting a similar error message, and the above suggestions did not solve it. Strangely, I had to sort my dataframe so that, block by block, the groups appeared in the same order (i.e. I could not have the following:
Block 1 A
Block 1 B
Block 2 B
Block 2 A
It has to appear as A, B, A, B)
I ran into the same cryptic error message in R, though in my case it was resolved when I applied as.matrix() to what was originally a dataframe imported from a CSV file using read.csv().
I also had a missing data point in my original data set, and I found that when my data was transformed into a matrix for the friedman.test() call, the entire row containing the missing data point was omitted automatically.
Using the function as.matrix() to transform my dataframe is the magic that got the function to run for me.
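For reference, a minimal sketch of that fix, assuming the CSV is laid out with one column per group and one row per block (the file name is hypothetical):

scores <- read.csv("scores.csv")   # hypothetical file: rows = blocks, columns = groups
m <- as.matrix(scores)
friedman.test(m)                   # friedman.test() also accepts a data matrix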
I had this exact error too with my dataset.
It turns out that the function friedman.test() accepts data frames (e.g. those created by data.frame()) but not tibbles (those created by dplyr and other modern tools). The solution for me was to convert my dataset to a data frame first.
D_fri <- D_all %>% dplyr::select(FrustrationEpisode, Condition, Participant)
D_fri <- as.data.frame(D_fri)
str(D_fri) # confirm the object should now be a 'data.frame'
friedman.test(FrustrationEpisode ~ Condition | Participant, D_fri)
I ran into this problem too. Fixed mine by removing the NAs.
# My data (called layers) looks like:
| resp.no | av.l.all | av.baseem | av.base |
| 1 | 1.5 | 1.3 | 2.3 |
| 2 | 1.4 | 3.2 | 1.4 |
| 3 | 2.5 | 2.8 | 2.9 |
...
| 1088 | 3.6 | 1.1 | 3.3 |
# Remove NAs
layers1 <- na.omit(layers)
# Re-organise the data so the scores are stacked, with a column added holding
# the original column name as a factor. gather() comes from tidyr and
# convert_as_factor() from rstatix, so load both first.
library(tidyr)
library(rstatix)
layers2 <- layers1 %>%
  gather(key = "layertype", value = "score", av.l.all, av.baseem, av.base) %>%
  convert_as_factor(resp.no, layertype)
# Data now looks like this
| resp.no | layertype | score |
| 1 | av.l.all | 1.5 |
| 1 | av.baseem | 1.3 |
| 1 | av.base | 2.3 |
| 2 | av.l.all | 1.4 |
...
| 1088 | av.base | 3.3 |
# Then do Friedman test
friedman.test(score ~ layertype | resp.no, data = layers2)
Just want to share what my problem was: my ID factor did not have the correct levels after doing pivot_longer(), which produced the same error. I rebuilt the levels and it worked: df$ID <- as.factor(as.character(df$ID))
Reviving an old thread with new information. I ran into a similar problem after removing NAs. My group and block were factors before the NA removal. However, after removing NAs, the factors retained the levels before the removal even though some levels were no longer in the data!
Running the friedman.test() with the as.matrix() trick (e.g., friedman.test(a ~ b | c, as.matrix(df))) was fine but running frdAllPairsExactTest() or friedman_effsize() would throw the not an unreplicated complete block design error. I ended up re-factoring the group and block (i.e., dropping the levels that were no longer in the data, df$block <- factor(df$block)) to make things work. After the re-factor, I did not need the as.matrix() trick, either.

How can I get a moving average for each day of the week?

I get data into Graphite with a granularity of an hour. For example:
2013-12-06-12:00 15
2013-12-08-09:00 14
2013-12-09-12:00 3
2013-12-13-00:00 10
2013-12-14-08:00 20
2013-12-14-09:00 1
2013-12-15-00:00 5
2013-12-16-00:00 11
2013-12-16-02:00 12
... and so on
Now, I'd like to graph this as the "evolution of the value for every day of the week", so the actual value displayed is the sum (or average) of the values for that particular day of the week over some number of weeks (let's say 2 weeks, for example).
My graph would look like that if I only look at the last week :
^ 21
20| |
| |
| 12.5| 13
10| | | 9.5 |
| | | | |
| | | | |
0+--------------------------------->
Mon Tue Wed Thu Fri Sat Sun
12 13 14 15 16
So, for example, for the "Friday" point, it takes today's values (11+12) and the value of last Friday (3) and averages the two: ((11+12)+3)/2.
Is this possible, and how?
summarize(your.metric.goes.here, "1week", "sum") will summarize the data in 1-week intervals by summing them. You can also use avg, max, and min there.
As far as semantics go: timers usually need to be averaged when summarized, and counters need to be summed.
Example: if you measure lap counts and lap times when you run every day and want a weekly summary, you average the lap times of the seven days into the one weekly lap time. With lap counts, it makes more sense to know the total, so you sum them.
On a different note: timeStack and timeShift are used when you want to compare last month's data with this month's on the same timeline. You can also timeShift summarized data.
I think specifically what you are looking for is a combination of both timeStack and averageSeries. For example:
averageSeries(timeStack(your.metric.here,"1week", 0, 2))
Where the last two arguments are the range of "1week" series you'd like to incorporate (so it gets a series for this week and each of the previous 2 weeks).

count distinct values in spreadsheet

I have a Google spreadsheet with a column that looks like this:
City
----
London
Paris
London
Berlin
Rome
Paris
I want to count the appearances of each distinct city (so I need the city name and the number of appearances).
City | Count
-------+------
London | 2
Paris | 2
Berlin | 1
Rome | 1
How do I do that?
Link to Working Examples
Solution 0
This can be accomplished using pivot tables.
Solution 1
Use the UNIQUE formula to get all the distinct values, then use COUNTIF to get the count of each value. See the working example link at the top to see exactly how this is implemented.
Unique Values      Count
=UNIQUE(A3:A8)     =COUNTIF(A3:A8;B3)
                   =COUNTIF(A3:A8;B4)
                   ...
Solution 2
If you set up your data as such:
City
----
London 1
Paris 1
London 1
Berlin 1
Rome 1
Paris 1
Then the following will produce the desired result.
=sort(transpose(query(A3:B8,"Select sum(B) pivot (A)")),2,FALSE)
I'm sure there is a way to get rid of the second column since all values will be 1. Not an ideal solution in my opinion.
via http://googledocsforlife.blogspot.com/2011/12/counting-unique-values-of-data-set.html
Other Possibly Helpful Links
http://productforums.google.com/forum/#!topic/docs/a5qFC4pFZJ8
You can use the query function, so if your data were in col A where the first row was the column title...
=query(A2:A,"select A, count(A) where A != '' group by A order by count(A) desc label A 'City'", 0)
yields
City count
London 2
Paris 2
Berlin 1
Rome 1
Link to working Google Sheet.
https://docs.google.com/spreadsheets/d/1N5xw8-YP2GEPYOaRkX8iRA6DoeRXI86OkfuYxwXUCbc/edit#gid=0
=iferror(counta(unique(A1:A100))) counts the number of unique cells from A1 to A100.
Not exactly what the user asked, but an easy way to just count unique values:
Google introduced a new function to count unique values in just one step, and you can use this as an input for other formulas:
=COUNTUNIQUE(A1:B10)
This works if you just want the count of unique values in e.g. the following range
=counta(unique(B4:B21))
This is similar to Solution 1 from #JSuar...
Assume your original city data is a named range called dataCity. In a new sheet, enter the following:
A | B
----------------------------------------------------------
1 | =UNIQUE(dataCity) | Count
2 | | =DCOUNTA(dataCity,"City",{"City";$A2})
3 | | [copy down the formula above]
4 | | ...
5 | | ...
=UNIQUE({filter(Core!L8:L27,isblank(Core!L8:L27)=false),query(ArrayFormula(countif(Core!L8:L27,Core!L8:L27)),"select Col1 where Col1 <> 0")})
Where Core!L8:L27 is the list in the question.
