What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
             "Michigan, Western",
             "Minnesota",
             "Mississippi, Northern",
             "Mississippi, Southern",
             "Missouri, Eastern",
             "Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions, outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate the regions, taking the sum, mean, maximum, etc. of outcome by state, and generate a new data frame at the state level. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem on its own. Is there something else in R that is built for this? It seems like grep could be used to generate the new column "states" as part of an application-specific program. It seems like this would already be out there somewhere, though.

The reason this isn't straightforward is that the structure of your data is not consistent, so there couldn't be a library function built specifically for it.
Your state/region column is basically an index column, and you want to index across part of it. tapply is designed for this; there's just no built-in that derives the partial index automatically for this specific scenario. You can do it without creating a new column, though:
tapply(testset$outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() call just removes the comma and everything after it, leaving the state name to serve as the index.
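For a data frame result like the desired output, the same gsub() trick feeds aggregate() as well, so aggregate() does solve it once the state column is derived:
testset$state <- gsub(",.*$", "", testset$regions)
aggregate(outcome ~ state, data = testset, FUN = sum)
#         state outcome
# 1    Michigan      21
# 2   Minnesota      17
# 3 Mississippi      24
# 4    Missouri      30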

Related

Create data frame of names of 50 states in R

I'm working on a problem where I'm trying to map each state to a region for some data analysis. It seems the first thing I need to do is create a data frame containing the names of all 50 states. Is there a way to do this without explicitly naming each state and inputting it into a row of the data frame?
Sample data:
region_key <- as.data.frame("")
colnames(region_key) <- c("state")
region_key$region <- ""
region_key$state <- "AL"
I create an empty data frame, create a "state" and "region" column, then populate the two-letter state abbreviations in the above fashion. Is there a way to both populate the data frame with the state abbreviations and classify them by region (e.g. Alabama would be "South")?
Expected output:
head(region_key)
state region
1 AL South
Thanks in advance for your help!
Figured out my problem based on the comment from @alistair, thank you.
Solution:
region_key <- data.frame(state.abb, state.region)
head(region_key)
state.abb state.region
1 AL South
2 AK West
3 AZ West
4 AR South
5 CA West
6 CO West
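To get the state/region column names from the expected output, you can simply name the columns in the same call (state.abb and state.region are built-in R datasets):
region_key <- data.frame(state = state.abb, region = state.region)
head(region_key, 1)
#   state region
# 1    AL  South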

How do I classify this data by year and region in R? [closed]

So I have the following data: 5 regions, and years 1998-2009. What I'd like to do is to classify countries each year by region. I'm new to R, so the only step I've taken so far is the following:
finalData$Region = factor(finalData$Region, levels = c('Former Socialist Bloc', 'Independent', 'Western Europe','Scandinavia', 'Former Yugoslavia'), levels = c(1, 2, 3, 4, 5))
but I get this error:
Error in factor(finalData$Region, levels = c("Former Socialist Bloc",
: formal argument "levels" matched by multiple actual arguments
Could you please tell me how to fix this, and suggest an approach to the classification? Thank you!
This is a terribly formulated question. But I am going to imagine this is roughly what you want to do to give you an idea.
You have samples (rows) which are countries and you have some variables (columns) which are observations about these samples. You want to use all/multiple variables (multivariate analysis) to cluster the countries. If this is what you want to do, then below is one approach.
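But first, a note on the error itself: factor() was given the levels argument twice. If the intent was to recode the five region names to the numbers 1 to 5, the second vector was presumably meant to be labels=, a sketch:
finalData$Region <- factor(finalData$Region,
                           levels = c("Former Socialist Bloc", "Independent",
                                      "Western Europe", "Scandinavia",
                                      "Former Yugoslavia"),
                           labels = c(1, 2, 3, 4, 5))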
I am creating a data.frame with a pseudo dataset.
dfr <- data.frame(Country = c("USA","UK","Germany","Austria","Taiwan","China","Japan","South Korea"),
                  Year = factor(c(2009,2009,2009,2010,2010,2010,2010,2011)),
                  Language = c("English","English","German","German","Chinese","Chinese","Japanese","Korean"),
                  Region = c("North America","Europe","Europe","Europe","Asia","Asia","Asia","Asia"))
> head(dfr)
Country Year Language Region
1 USA 2009 English North America
2 UK 2009 English Europe
3 Germany 2009 German Europe
4 Austria 2010 German Europe
5 Taiwan 2010 Chinese Asia
6 China 2010 Chinese Asia
The first thing you want to do is move the country names out to row names, because country names are the sample labels, not observations.
rownames(dfr) <- dfr$Country
dfr$Country <- NULL
Now you want all the remaining variables to be numeric or factors. Do that manually and carefully; I only have categorical observations here. Finally, we want to recode all factors to integers, so that our final data.frame contains only numbers.
# factor() first, then integer codes; this also handles character columns
# in R >= 4.0, where data.frame() no longer converts strings to factors
dfr1 <- as.data.frame(lapply(dfr, function(x) as.numeric(factor(x))))
rownames(dfr1) <- rownames(dfr)
> head(dfr1)
Year Language Region
USA 1 2 3
UK 1 2 2
Germany 1 3 2
Austria 2 3 2
Taiwan 2 1 1
China 2 1 1
Now run some dimension-reduction or clustering method. Here, for example, a PCA.
pc <- prcomp(dfr1)
pcval <- pc$x
> head(pcval)
PC1 PC2 PC3
USA -1.04369951 1.2743507 0.36120850
UK -0.87597336 0.5087910 -0.25990844
Germany 0.06243255 0.8258430 -0.39728520
Austria 0.36452903 0.2660249 0.37429849
Taiwan -1.34455665 -1.1336389 0.02793507
China -1.34455665 -1.1336389 0.02793507
Combine the output principal components with original data.
pcval1 <- cbind(pcval,dfr)
rownames(pcval1) <- rownames(dfr)
> head(pcval1)
PC1 PC2 PC3 Year Language Region
USA -1.04369951 1.2743507 0.36120850 2009 English North America
UK -0.87597336 0.5087910 -0.25990844 2009 English Europe
Germany 0.06243255 0.8258430 -0.39728520 2009 German Europe
Austria 0.36452903 0.2660249 0.37429849 2010 German Europe
Taiwan -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
China -1.34455665 -1.1336389 0.02793507 2010 Chinese Asia
What PCA is and what is going on here is clearly out of the scope of this answer. In short, it creates some new variables based on all your observed variables.
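If you want a quick sense of how much of the total variation each of these new variables captures, you can inspect:
# proportion of variance explained by each principal component
summary(pc)$importance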
Scatterplot the principal components 1 and 2. Colour points by some variable. Say "Region". Add country names as text labels.
library(ggplot2)
pcval1$Country <- rownames(pcval1)
ggplot(pcval1, aes(x = PC1, y = PC2, colour = Region)) +
  geom_point(size = 3) +
  geom_text(aes(label = Country), hjust = 1.5) +
  theme_bw(base_size = 15)
Now we see that countries have clustered together based on the observations in your dataset. We have the countries roughly grouping by Region. Obviously, there may or may not be any clustering depending on your data.
This is just an example. If you blindly follow this, you may be violating all sorts of statistical assumptions and what-not. You have to take into account what kind of data distributions you are dealing with and what clustering algorithm is suitable for that data etc.

R Leaflet - Change density to a column name of my own

I have been working on leaflet in R.
https://rstudio.github.io/leaflet/choropleths.html
The above US map shows the density of each state. The format of the data is GeoJSON. I want to remove the density variable and instead pass my own column name with the corresponding value. (For example, when you hover over New Mexico it shows density: 17.16; instead I want it to display mycolumnname: value.)
This is a pretty common need in working with leaflet. There are a few ways to do this, but this is the simplest in my mind:
All of the information you would like to plot is stored in the section of the SpatialPolygonsDataFrame found at states@data; you can see it by looking at the head of that data frame section.
I made a data frame (a traditional R data frame) using the state names from the original SpatialPolygonsDataFrame (named states in your code above) and created my_var.
a <- data.frame(States = states@data$name)
a$my_var <- round(runif(52, 15, 185), 2)
These are the first few rows of my new data frame, which is like yours but has data OTHER than density in it.
head(a)
States my_var
1 Alabama 120.33
2 Alaska 179.41
3 Arizona 67.92
4 Arkansas 30.57
5 California 72.26
6 Colorado 56.33
Now that you have this data frame you can call up the library maptools and do a polygon cbind as follows:
library(maptools)  # provides spCbind()
states2 <- spCbind(states, a$my_var)
Now look at the head of states2 (you could instead name it states, replacing the original states SpatialPolygonsDataFrame; I kept both to compare before and after):
head(states2@data)
id name density data.my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
2 04 Arizona 57.050 81.05
3 05 Arkansas 56.430 124.68
4 06 California 241.700 138.19
5 08 Colorado 49.330 103.78
This added the data.my_var variable to the spatial data frame. Now you can use find/replace to go through your code and replace the references to density with data.my_var, and the new variable will be used.
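For instance, here is a minimal sketch of the hover-label part, assuming the states2 object built above and mirroring the structure of the RStudio choropleth example (the palette settings are placeholders):
library(leaflet)

pal <- colorBin("YlOrRd", domain = states2$data.my_var)
labels <- sprintf("<strong>%s</strong><br/>my_var: %g",
                  states2$name, states2$data.my_var)

leaflet(states2) %>%
  addPolygons(fillColor = ~pal(data.my_var),
              weight = 1, color = "white", fillOpacity = 0.7,
              label = lapply(labels, htmltools::HTML))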
Important things to consider
Your data has 50 state names, but the spatial data frame has 52 rows; you will need to add the missing regions to your data frame before binding them. They must be the same length AND in the same order.
If you grab the names like this:
a <- data.frame(States = states@data$name)
from the states object, you can then left-merge your data onto States; this keeps the order of a, and the cells for regions that have no data in your data set simply remain empty.
Use merge to be sure that the data lines up properly (assuming the state-name column in your data is called name):
a <- merge(a, your_data, by.x = "States", by.y = "name", all.x = TRUE)  # left merge
Also, once they are merged and you have checked that states@data$name is in the same order as a$States, you can use any name you want as the new heading in the SpatialPolygonsDataFrame by extracting the data into a vector with the desired name prior to binding:
my_var <- a$my_var
states2 <- spCbind(states, my_var)
this will leave you with a data frame which looks like this:
id name density my_var
0 01 Alabama 94.650 58.01
1 02 Alaska 1.264 99.01
This leaves a clean column name that is easier to address from inside leaflet, without long strings.
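With the clean name in place, the label can even be built in a formula directly (same hypothetical data as above):
leaflet(states2) %>%
  addPolygons(label = ~paste0("my_var: ", my_var))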

How do I replace values in an R dataframe column with a corresponding value?

Ok, so I have a data frame that I downloaded from Pew Research Center. One of the columns (called 'cregion') contains a series of numbers from 1-56, with each number corresponding to a geographic location in the U.S. Most of these locations are states, and the additional 6 are at the sub-state level. So, for example, the number '1' corresponds to 'Alabama', and '11' corresponds to the 'District Of Columbia'.
What I'd like to do is replace each of those numbers in the 'cregion' column with the ACTUAL name of the region it corresponds to. Unfortunately, there is no column in this data frame that I can use to swap the values, as the key for which number corresponds to which region exists completely separately (in a Word document). I'm new to R, and while I've been searching for a few hours for the best way to go about this, I can't seem to find a method that would work (or I just don't understand the explanation). Can anybody suggest a method to me?
If you have a vector of the state names as strings called statevec whose ith element corresponds to cregion i, and your data frame is named dat, just do
dat <- data.frame(cregion = sample(1:50), stuff = runif(50))
head(dat)
# cregion stuff
#1 25 0.665843896
#2 11 0.144631131
#3 13 0.691616240
#4 28 0.507454243
#5 9 0.416535139
#6 30 0.004196311
statevec <- state.name
dat$cregion <- statevec[dat$cregion]
head(dat)
# cregion stuff
#1 Missouri 0.665843896
#2 Hawaii 0.144631131
#3 Illinois 0.691616240
#4 Nevada 0.507454243
#5 Florida 0.416535139
#6 New Jersey 0.004196311
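One caveat: the Pew numbering runs from 1 to 56 and is not plain alphabetical state order (the question says 11 is the District Of Columbia, while state.name[11] is "Hawaii"), so for the real data statevec is better built as a named lookup typed in from the Word-document key. A sketch with only the two pairs the question gives; the remaining 54 entries must come from the key:
key <- c("1" = "Alabama", "11" = "District Of Columbia")
dat$cregion <- unname(key[as.character(dat$cregion)])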

Using full name and maiden name strings (and birthdays) to match individuals across time

I've got a set of 20 or so consecutive individual-level cross-sectional data sets which I would like to link together.
Unfortunately, there's no time-stable ID number; there are, however, fields for first, last, and maiden names, as well as year of birth--this should allow for a pretty high (90-95%) match rate, I presume.
Ideally, I would create a time-independent ID for each unique individual.
I can do this pretty easily in R for those whose marital status (maiden name) does not change--stack the data sets to get a long panel, then do something to the effect of:
unique(dt,by=c("first_name","last_name","birth_year"))[,id:=.I]
(I'm of course using R data.table), then merging back to the full data.
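For what it's worth, a minimal toy version of that step (names taken from the preview below); data.table's .GRP assigns the IDs in one pass, with no merge back needed:
library(data.table)
dt <- data.table(first_name = c("eileen", "eileen", "sarah"),
                 last_name  = c("aaldxxxx", "aaldxxxx", "aaxxxx"),
                 birth_year = c(1977, 1977, 1974),
                 year       = c(2002, 2003, 2003))
# one time-independent id per unique (first, last, birth year) combination
dt[, id := .GRP, by = .(first_name, last_name, birth_year)]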
However, I'm stuck on how to incorporate the maiden name to this procedure. Any suggestions?
Here's a preview of the data:
first_name last_name nee birth_year year
1: eileen aaldxxxx dxxxx 1977 2002
2: eileen aaldxxxx dxxxx 1977 2003
3: sarah aaxxxx gexxxx 1974 2003
4: kelly aaxxxx nxxxx 1951 2008
5: linda aarxxxx-gxxxx aarxxxx 1967 2008
---
72008: stacey zwirxxxx kruxxxx 1982 2010
72009: stacey zwirxxxx kruxxxx 1982 2011
72010: stacey zwirxxxx kruxxxx 1982 2012
72011: stacey zwirxxxx kruxxxx 1982 2013
72012: jill zydoxxxx gundexxxx 1978 2002
UPDATE:
I've done a lot of chipping and hammering at the problem; here's what I've got so far. I would appreciate any comments for possible improvements to the code so far.
I'm still completely missing something like 3-5% of matches due to inexact matches ("tonya" vs. "tanya", "jenifer" vs. "jennifer"); I haven't come up with a clean way of doing fuzzy matching on the stragglers, so there's room for better matching in that direction if anyone's got a straightforward way to implement that.
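For those stragglers, one rough sketch of fuzzy matching with base R's adist(), using invented example names and accepting the best candidate within edit distance 1:
new_names <- c("tonya", "jenifer")
old_names <- c("tanya", "jennifer", "sarah")

d    <- adist(new_names, old_names)           # Levenshtein distance matrix
best <- apply(d, 1, which.min)                # closest candidate per new name
hit  <- d[cbind(seq_along(best), best)] <= 1  # accept only distance <= 1
data.frame(new = new_names, match = ifelse(hit, old_names[best], NA))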
The basic approach is to build cumulatively--assign IDs in the first year, then look for matches in the second year; assign new IDs to the unmatched. Then for year 3, look back at the first 2 years, etc. As to how to match, the idea is to slowly expand the matching criteria--the idea being that the more robust the match, the lower the chances of mismatching accidentally (particularly worried about the John Smiths).
Without further ado, here's the main function for matching a pair of data sets:
get_id<-function(yr,key_from,key_to=key_from,
mdis,msch,mard,init,mexp,step){
#Want to exclude anyone who is matched
existing_ids<-full_data[.(yr),unique(na.omit(teacher_id))]
#Get the most recent prior observation of all
# unmatched teachers, excluding those teachers
# who cannot be uniquely identified by the
# current key setting
unmatched<-
full_data[.(1996:(yr-1))
][!teacher_id %in% existing_ids,
.SD[.N],by=teacher_id,
.SDcols=c(key_from,"teacher_id")
][,if (.N==1L) .SD,keyby=key_from
][,(flags):=list(mdis,msch,mard,init,mexp,step)]
#Merge, reset keys
setkey(setkeyv(
full_data,key_to)[year==yr&is.na(teacher_id),
(update_cols):=unmatched[.SD,update_cols,with=F]],
year)
full_data[.(yr),(update_cols):=lapply(.SD,function(x)na.omit(x)[1]),
by=id,.SDcols=update_cols]
}
Then I basically go through the 19 years yy in a for loop, running 12 progressively looser matches, e.g. step 3 is:
get_id(yy,c("first_name_clean","last_name_clean","birth_year"),
mdis=T,msch=T,mard=F,init=F,mexp=F,step=3L)
The final step is to assign new IDs:
current_max<-full_data[.(yy),max(teacher_id,na.rm=T)]
new_ids<-
setkey(full_data[year==yy&is.na(teacher_id),.(id=unique(id))
][,add_id:=.I+current_max],id)
setkey(setkey(full_data,id)[year==yy&is.na(teacher_id),
teacher_id:=new_ids[.SD,add_id]],year)
