Using dplyr to make a dataframe look like output from ftable -- assigning a value to certain elements using group_by - r

Say for instance I have the following tall dataframe df:
state <- state.abb[1:10]
county <- letters[1:10]
zipcode <- sample(1000:9999, 5)
library(data.table)
df <- data.frame(CJ(state, county, zipcode))
colnames(df) <- c("state", "county", "zip")
df[1:15,]
state county zip
1 AK a 2847
2 AK a 2913
3 AK a 3886
4 AK a 6551
5 AK a 8447
6 AK b 2847
7 AK b 2913
8 AK b 3886
9 AK b 6551
10 AK b 8447
11 AK c 2847
12 AK c 2913
13 AK c 3886
14 AK c 6551
15 AK c 8447
For purposes of presentation, it might look nicer like this:
state county zip
1 AK a 2847
2 2913
3 3886
4 6551
5 8447
6 b 2847
7 2913
8 3886
9 6551
10 8447
11 c 2847
12 2913
13 3886
14 6551
15 8447
I use dplyr frequently to create crosstabs instead of using base R's table or ftable functions so that I can pipe the output into xtable to make a nice HTML presentation.
To make this look like output from ftable, I want to set all elements but the first unique one from each of the columns I grouped by to "". I know I can use group_by to perform similar operations as this using dplyr, but it doesn't seem to play nice with indices, which is the only method I'm envisioning to accomplish this task:
library(dplyr)
df <- group_by(df, state, county)
df[-1,] <- ""
Should I be thinking about this differently, or is there some handy dplyr syntax to do this? Thanks.

Here is one way. First, group the data by state. Any duplicated county will be "" in the first mutate(). Then, ungroup the data. Given the county, a appears at the beginning of each state, whichever rows with a are ones you want to keep state names. Otherwise, you want "". This is done in the second mutate().
group_by(df, state) %>%
mutate(county = order_by(county, ifelse(!duplicated(county), county, ""))) %>%
ungroup() %>%
mutate(state = ifelse(county == "a", state, ""))
# state county zip
#1 AK a 2429
#2 3755
#3 6108
#4 8364
#5 9577
#6 b 2429
#7 3755
#8 6108
#9 8364
#10 9577
In data.table, the code above could be something like these.
setDT(df)[, county := ifelse(!duplicated(county), county, ""), by = state][,
state := ifelse(county == "a", state, "")]
setDT(df)[, county := ifelse(!duplicated(county), county, ""), by = state][
county != "a", state := ""]

Related

converting an abbreviation into a full word

I am trying to avoid writing a long nested ifelse statement in excel.
I am working on two datasets, one where I have abbreviations and county names.
Abbre
COUNTY_NAME
1 AD Adams
2 AS Asotin
3 BE Benton
4 CH Chelan
5 CM Clallam
6 CR Clark
And another data set that contains the county abbreviation and votes.
CountyCode Votes
1 WM 97
2 AS 14
3 WM 163
4 WM 144
5 SJ 21
For the second table, how do I convert the countycode (abbreviation) into the full spelled-out text and add that as a new column?
I have been trying to solve this unsuccessfully using grep, match, and %in%. Clearly I am missing something and any insight would be greatly appreciated.
We can use a join
library(dplyr)
library(tidyr)
df2 <- df2 %>%
left_join(Abbre %>%
separate(COUNTY_NAME, into = c("CountyCode", "FullName")),
by = "CountyCode")
Or use base R
tmp <- read.table(text = Abbre$COUNTY_NAME, header = FALSE,
col.names = c("CountyCode", "FullName"))
df2 <- merge(df2, tmp, by = 'CountyCode', all.x = TRUE)
Another base R option using match
df2$COUNTY_NAME <- with(
df1,
COUNTY_NAME[match(df2$CountyCode, Abbre)]
)
gives
> df2
CountyCode Votes COUNTY_NAME
1 WM 97 <NA>
2 AS 14 Asotin
3 WM 163 <NA>
4 WM 144 <NA>
5 SJ 21 <NA>
A data.table option
> setDT(df1)[setDT(df2), on = .(Abbre = CountyCode)]
Abbre COUNTY_NAME Votes
1: WM <NA> 97
2: AS Asotin 14
3: WM <NA> 163
4: WM <NA> 144
5: SJ <NA> 21

Mutate DF1 based on DF2 with a check

nubie here with a dataframe/mutate question... I want to update a dataframe (df1) based on data in another dataframe (df2). For one offs I've used MUTATE so I figure this is the way to go. Additionally I would like a check function added (TRUE/FALSE ?) to indicate if the the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence I want R to read the data in df1 and change df1 based on the match in df2 chaining out to the full state name and replace. Lastly if the data in df1 was update mark as "TRUE" (N.Y. to NEW YORK) and "FALSE" if not updated (CALL vs CAL)
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
This vector should match all the abbreviated state names in df1 with those in df2. Where there's no match, you end up with a missing value:
Then the following code using dplyr should produce the df3 you requested.
library(dplyr)
df3 <- df1 %>%
mutate(New_State = df2$New_State[match_vec]) %>%
mutate(Test = !is.na(match_vec)) %>%
mutate(New_State = ifelse(is.na(New_State),
State, New_State)) %>%
select(New_State, Test)

efficiently creating a panel data.frame from cross sections with unharmonized column names

I need to create a panel data set (long format) from multiple yearly (cross-sectional) data sets. The variables of interest have different names in the single data sets and i need to harmonize them.
I loaded the dataframes to a list and now want to manipulate the names using lapply or a chunk of code that allows binding the dataframes. I can see several ways of doing this, but would like to use one which works with little code on a large list of data.frames, so that I can do this for several variables and easily change specifics later on.
So what I am looking for is either a way to rename the columns, so that I able to simple use bind_rows() from dplyr or an equivalent method, or a way to rename and bind the datasets in one step. Since I need to do this for several variables it might be safer to keep the two steps apart.
To illustrate, here an example:
a <- data.frame(id=c("Marc", "Julia", "Rico"), year=2000:2002, laborincome=1:3)
b <- data.frame(id=c("Marc", "Julia", "Rico"), earningsfromlabor=2:4, year=2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
newnanme <- "income"
Desired result:
data.frame(id=c("Marc", "Julia", "Rico"), income=c(1,2,3,2,3,4), year=2000:2005)
id income year
1 Marc 1 2000
2 Julia 2 2001
3 Rico 3 2002
4 Marc 2 2003
5 Julia 3 2004
6 Rico 4 2005
We could use setnames from data.table
library(data.table)
do.call(rbind, Map(setnames, dflist, old = equivalent_vars, new = newnanme))
# id year income
#1 Marc 2000 1
#2 Julia 2001 2
#3 Rico 2002 3
#4 Marc 2003 2
#5 Julia 2004 3
#6 Rico 2005 4
Or we can use the :=
library(dplyr)
library(purrr)
map2_df(dflist, equivalent_vars, ~ .x %>%
rename(!! (newnanme) := !! .y)) %>%
select(id, income, year)
# id income year
#1 Marc 1 2000
#2 Julia 2 2001
#3 Rico 3 2002
#4 Marc 2 2003
#5 Julia 3 2004
#6 Rico 4 2005

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge to datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
We can give column names in merge by mentioning "by" in merge statement
merge(x,y, by=c(col1, col2 names))
in merge statement
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!

Reshape long to wide where most columns have multiple values

I have data as below:
IDnum zipcode City County State
10011 36006 Billingsley Autauga AL
10011 36022 Deatsville Autauga AL
10011 36051 Marbury Autauga AL
10011 36051 Prattville Autauga AL
10011 36066 Prattville Autauga AL
10011 36067 Verbena Autauga AL
10011 36091 Selma Autauga AL
10011 36703 Jones Autauga AL
10011 36749 Plantersville Autauga AL
10011 36758 Uriah Autauga AL
10011 36480 Atmore Autauga AL
10011 36502 Bon Secour Autauga AL
I have a list of zipcodes, the cities they encompass, and counties/states they are located in. IDnum = numeric value for county and state, combined. List is in format you see now, I need to reshape it from long to wide / vertical to horizontal, where the IDnum variable becomes the unique identifier, and all other possible value combinations become wide variables.
IDnum zip1 city1 county1 state1 zip2 city2 county2
10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga
This is just sample of the dataset, it encompasses every zip in the USA and includes more variables. I have seen other questions and answers similar to this one, but not where there are multiple values in almost every column.
There are commands in SPSS and STATA that will reshape data this way, in SPSS I can run a Restructure/Cases to Vars command that turns 11 variables in my initial dataset into about 1750, b/c one county has over 290 zips and it replicates most of the other variables 290+ times. This will create many blanks, but I need it to be reshaped into one very long horizontal file.
I have looked at reshape and reshape2, and am hung up on the 'default to length' error message. I did get melt/dcast to sorta work, but this creates one variable that is a list of all values, rather than creating variables for each value.
melted_dupes <- melt(zip_code_list_dupes, id.vars= c("IDnum"))
HRZ_dupes <- dcast(melted_dupes, IDnum ~ variable, fun.aggregate = list)
I have tried tidyr and dplyr but got lost in syntax. Am a little surprised there isn't a command the data similar to built in commands in other packages, making me assume there is, and I just haven't figured it out.
Any help is appreciated.
You can do this with the base function reshape after adding in a consecutive count by IDnum. Assuming your data is stored in a data.frame named df:
df2 <- within(df, count <- ave(rep(1,nrow(df)),df$IDnum,FUN=cumsum))
Provides a new column of the consecutive count named "time". And now we can reshape to wide format
reshape(df2,direction="wide",idvar="IDnum",timevar="count")
IDnum zipcode.1 City.1 County.1 State.1 zipcode.2 City.2 County.2 State.2 zipcode.3 City.3 County.3 State.3 zipcode.4 City.4 County.4 State.4
1 10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga AL 36051 Marbury Autauga AL 36051 Prattville Autauga AL
(output truncated, goes all the way to zipcode.12, etc.)
There might be a more efficient way, but try the following.
I used my own (example) dataset, very similar to yours.
Run the process step by step to see how it works, as you'll have to modify some things in the code.
library(dplyr)
library(tidyr)
# get example data
dt = data.frame(id = c(1,1,1,2,2),
zipcode = c(4,5,6,7,8),
city = c("A","B","C","A","C"),
county = c("A","B","C","A","C"),
state = c("A","B","C","A","C"))
dt
# id zipcode city county state
# 1 1 4 A A A
# 2 1 5 B B B
# 3 1 6 C C C
# 4 2 7 A A A
# 5 2 8 C C C
# get maximum number of rows for a single id
# this will help you get the wide format
max_num_rows = max((dt %>% count(id))$n)
# get names of columns to reshape
col_names = names(dt)[-1]
dt %>%
group_by(id) %>%
mutate(nrow = paste0("row",row_number())) %>%
unite_("V",col_names) %>%
spread(nrow, V) %>%
unite("z",matches("row")) %>%
separate(z, paste0(col_names, sort(rep(1:max_num_rows, ncol(dt)-1))), convert=T) %>%
ungroup()
# # A tibble: 2 × 13
# id zipcode1 city1 county1 state1 zipcode2 city2 county2 state2 zipcode3 city3 county3 state3
# * <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr>
# 1 1 4 A A A 5 B B B 6 C C C
# 2 2 7 A A A 8 C C C NA <NA> <NA> <NA>

Resources