Having trouble merging/joining two datasets on two variables in R

I realize many questions about merging datasets have already been asked and answered here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge two datasets using two variables while keeping all of the data from each. I've tried merge and all of the join operations from dplyr, as well as cbind, and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. The other thing that happens, as when I use full_join in dplyr or all = TRUE in merge, is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join on both Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S., and retain the data from both n and Population. From there I can divide Population by n and get a per capita figure for each county. I just can't figure out how to do it while keeping all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates each state and county, putting n in one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?

We can specify the columns to merge on by passing the by argument to merge:
merge(x, y, by = c("col1", "col2"))
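Applied to the data frames in the question (names countyLicense and countyPops taken from the code above), a minimal sketch might look like this; all = TRUE is optional and keeps unmatched rows from both sides:
# merge on both key columns so duplicate county names in different states don't collide
countyPerCap <- merge(countyLicense, countyPops,
                      by = c("Primary_State", "Primary_County"),
                      all = TRUE)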

I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
                          by = c("Primary_State", "Primary_County"), copy = TRUE)
Those three lines did the trick. Thanks everyone!
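For reference, base R (3.2.0 and later) ships trimws(), which can replace the custom helper; an equivalent one-liner, assuming the same countyPops data frame:
# strip trailing whitespace from the Census county names before joining
countyPops$Primary_County <- trimws(countyPops$Primary_County, which = "right")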

Related

How to Merge Shapefile and Dataset?

I want to create a spatial map showing drug mortality rates by US county, but I'm having trouble merging the drug mortality dataset, crude_rate, with the shapefile, usa_county_df. Can anyone help out?
I've created a key variable, "County", in both sets to merge on but I don't know how to format them to make the data mergeable. How can I make the County variables correspond? Thank you!
head(crude_rate, 5)
Notes County County.Code Deaths Population Crude.Rate
1 Autauga County, AL 1001 74 975679 7.6
2 Baldwin County, AL 1003 440 3316841 13.3
3 Barbour County, AL 1005 16 524875 Unreliable
4 Bibb County, AL 1007 50 420148 11.9
5 Blount County, AL 1009 148 1055789 14.0
head(usa_county_df, 5)
long lat order hole piece id group County
1 -97.01952 42.00410 1 FALSE 1 0 0.1 1
2 -97.01952 42.00493 2 FALSE 1 0 0.1 2
3 -97.01953 42.00750 3 FALSE 1 0 0.1 3
4 -97.01953 42.00975 4 FALSE 1 0 0.1 4
5 -97.01953 42.00978 5 FALSE 1 0 0.1 5
crude_rate$County <- as.factor(crude_rate$County)
usa_county_df$County <- as.factor(usa_county_df$County)
merge(usa_county_df, crude_rate, "County")
[1] County long lat order hole
[6] piece id group Notes County.Code
[11] Deaths Population Crude.Rate
<0 rows> (or 0-length row.names)
My take on this. First, you cannot expect a full answer with code because you did not provide a link to your data. Next time, please provide a full description of the problem along with the data.
I just used the data you provided here to illustrate.
require(tidyverse)
# Load the data
crude_rate = read.csv("county_crude.csv", header = TRUE)
usa_county = read.csv("usa_county.csv", header = TRUE)
# Create a "county_join" key in crude_rate to left_join on with the usa_county data.
# Note that the key must have the same data type and the same values in both tables.
crude_rate = crude_rate %>%
  mutate(county_join = c(1:5))
# Join the data frames using a left join on the County and county_join variables
df_all = usa_county %>%
  left_join(crude_rate, by = c("County" = "county_join")) %>%
  distinct(order, hole, piece, id, group, .keep_all = TRUE)
Data link: county_crude
Data link: usa_county
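If your shapefile carries actual county names rather than a row index, another option is to build real key columns by splitting crude_rate$County (e.g. "Autauga County, AL") into a county name and a state abbreviation; a hypothetical sketch (the helper columns county_name and state_abb are made up for illustration):
# split "Autauga County, AL" into separate key columns (hypothetical helper columns)
crude_rate$county_name <- sub(" County, [A-Z]{2}$", "", crude_rate$County)  # "Autauga"
crude_rate$state_abb   <- sub("^.*, ", "", crude_rate$County)               # "AL"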

How to Merge Uneven Data Frames With Real Data

Problem:
I have two data sets of different sizes that I would like to merge together without dropping rows or inserting NAs. To compare this to an Excel situation: you would have five columns, and you would drag down three of them to fill the blank space left by the rows inserted when you added your data to the 4th and 5th columns.
Example Data Set
(zipcode = a and step3 = b in my brainstorming code further down)
> head(zipcode_joincsv)
zip city abv latitude longitude median mean pop
226 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
227 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
228 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
229 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
230 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
231 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
> head(step3_df)
tolower.state.name. state.abb
1 alabama AL
2 alaska AK
3 arizona AZ
4 arkansas AR
5 california CA
6 colorado CO
Desired Result:
One data frame where each zipcode/city combo is combined with its state's pop and income. The column they have in common is the abbreviation column.
tolower.state.name. zip city abv latitude longitude median mean pop
1 alabama 01749 Hudson AL 42.38981 -71.55791 76500 85689 18081
2 alabama 01752 Marlborough AL 42.35091 -71.54753 71835 89002 36273
3 alabama 01754 Maynard AL 42.43078 -71.45594 76228 82167 10414
4 alabama 01756 Mendon AL 42.09201 -71.54474 102625 117692 5257
5 alabama 01757 Milford AL 42.14918 -71.52149 68565 82206 26877
6 alabama 01760 Natick AL 42.29076 -71.35368 90673 113933 31763
7 alaska data from these rows
8 arizona data from these rows
9 arkansas data from these rows
10 california data from these rows
11 colorado data from these rows
I've contemplated using something like
sqldf ("SELECT a.Zip, a.City, a.State Abv, a.Lat, a.Long, a.median, a.mean, a.pop, b.state.name, b.states.abb, b.pop, b.income
FROM a a
LEFT JOIN b b using (abv)")
I know that is probably not going to work; even if it did, every row in a without a matching row in b would get an NA, whereas what I would like is for every abv of NY to have the state's average income and total population copied down the line, then for every AR and every AL, etc., until the two data sets are one and a ggplot using all of the data can be created.
dplyr::left_join(a, b, by="abv") should work.
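Spelled out against the column names shown in the question (abv in zipcode_joincsv, state.abb in step3_df), a minimal sketch might be the following; swap in full_join() if you also want states that have no matching zipcode rows:
library(dplyr)

# keep every zipcode row and attach the matching state row from step3_df
joined <- left_join(zipcode_joincsv, step3_df, by = c("abv" = "state.abb"))
head(joined)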

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequency of medals by Team, while excluding NA? At the same time, the total frequency for each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
df1 %>%
  filter(!is.na(Medal)) %>%
  count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

Reshape long to wide where most columns have multiple values

I have data as below:
IDnum zipcode City County State
10011 36006 Billingsley Autauga AL
10011 36022 Deatsville Autauga AL
10011 36051 Marbury Autauga AL
10011 36051 Prattville Autauga AL
10011 36066 Prattville Autauga AL
10011 36067 Verbena Autauga AL
10011 36091 Selma Autauga AL
10011 36703 Jones Autauga AL
10011 36749 Plantersville Autauga AL
10011 36758 Uriah Autauga AL
10011 36480 Atmore Autauga AL
10011 36502 Bon Secour Autauga AL
I have a list of zipcodes, the cities they encompass, and the counties/states they are located in (IDnum = a numeric value for county and state, combined). The list is in the format you see now; I need to reshape it from long to wide / vertical to horizontal, so that IDnum becomes the unique identifier and all other possible value combinations become wide variables.
IDnum zip1 city1 county1 state1 zip2 city2 county2
10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga
This is just sample of the dataset, it encompasses every zip in the USA and includes more variables. I have seen other questions and answers similar to this one, but not where there are multiple values in almost every column.
There are commands in SPSS and Stata that will reshape data this way. In SPSS I can run a Restructure/Cases to Vars command that turns the 11 variables in my initial dataset into about 1,750, because one county has over 290 zips and it replicates most of the other variables 290+ times. This creates many blanks, but I need the data reshaped into one very long horizontal file.
I have looked at reshape and reshape2, and am hung up on the 'defaulting to length' message. I did get melt/dcast to sort of work, but it creates one variable that is a list of all values, rather than creating a variable for each value.
melted_dupes <- melt(zip_code_list_dupes, id.vars= c("IDnum"))
HRZ_dupes <- dcast(melted_dupes, IDnum ~ variable, fun.aggregate = list)
I have tried tidyr and dplyr but got lost in the syntax. I am a little surprised there isn't a command to reshape the data this way, similar to the built-in commands in other packages, which makes me assume there is one and I just haven't figured it out.
Any help is appreciated.
You can do this with the base function reshape after adding in a consecutive count by IDnum. Assuming your data is stored in a data.frame named df:
df2 <- within(df, count <- ave(rep(1,nrow(df)),df$IDnum,FUN=cumsum))
This provides a new column named count (serving as the time variable) with a consecutive count within each IDnum. Now we can reshape to wide format:
reshape(df2,direction="wide",idvar="IDnum",timevar="count")
IDnum zipcode.1 City.1 County.1 State.1 zipcode.2 City.2 County.2 State.2 zipcode.3 City.3 County.3 State.3 zipcode.4 City.4 County.4 State.4
1 10011 36006 Billingsley Autauga AL 36022 Deatsville Autauga AL 36051 Marbury Autauga AL 36051 Prattville Autauga AL
(output truncated, goes all the way to zipcode.12, etc.)
There might be a more efficient way, but try the following.
I used my own (example) dataset, very similar to yours.
Run the process step by step to see how it works, as you'll have to modify some things in the code.
library(dplyr)
library(tidyr)
# get example data
dt = data.frame(id = c(1,1,1,2,2),
                zipcode = c(4,5,6,7,8),
                city = c("A","B","C","A","C"),
                county = c("A","B","C","A","C"),
                state = c("A","B","C","A","C"))
dt
# id zipcode city county state
# 1 1 4 A A A
# 2 1 5 B B B
# 3 1 6 C C C
# 4 2 7 A A A
# 5 2 8 C C C
# get maximum number of rows for a single id
# this will help you get the wide format
max_num_rows = max((dt %>% count(id))$n)
# get names of columns to reshape
col_names = names(dt)[-1]
dt %>%
  group_by(id) %>%
  mutate(nrow = paste0("row", row_number())) %>%
  unite_("V", col_names) %>%
  spread(nrow, V) %>%
  unite("z", matches("row")) %>%
  separate(z, paste0(col_names, sort(rep(1:max_num_rows, ncol(dt)-1))), convert = T) %>%
  ungroup()
# # A tibble: 2 × 13
# id zipcode1 city1 county1 state1 zipcode2 city2 county2 state2 zipcode3 city3 county3 state3
# * <dbl> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr>
# 1 1 4 A A A 5 B B B 6 C C C
# 2 2 7 A A A 8 C C C NA <NA> <NA> <NA>
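In current tidyr (1.0.0 and later), the same long-to-wide reshape can be written more directly with pivot_wider(); a minimal sketch using the example dt above:
library(dplyr)
library(tidyr)

dt %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%   # consecutive count within each id
  ungroup() %>%
  pivot_wider(names_from = row,
              values_from = c(zipcode, city, county, state),
              names_sep = "")
# one row per id, with columns zipcode1 ... state3 (column order may differ from the reshape() output)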

Calculate rows with same title

Since my other question got closed, here is the required data.
What I'm trying to do is have R total the last column, count, against the city column so I can map the data, so I need some kind of code to do this matching. I want to show how many participants (in count) are in the state of, e.g., Hawaii (HI).
zip city state latitude longitude count
96860 Pearl Harbor HI 24.859832 -168.021815 36
96863 Kaneohe Bay HI 21.439867 -157.74772 39
99501 Anchorage AK 61.216799 -149.87828 12
99502 Anchorage AK 61.153693 -149.95932 17
99506 Elmendorf AFB AK 61.224384 -149.77461 2
what I've tried is
match <- c(match(datazip$state, datazip$number))
but I'm really at a loss for a solution, since I don't even know how to describe this concisely. My plan afterwards is to make a choropleth map with the data, and believe me, by now I've seen almost all the pages that try to give advice, so your help is very much appreciated. Thanks!
# I read your sample data to a data frame
> df
zip city state latitude longitude count
1 96860 Pearl_Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe_Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf_AFB AK 61.22438 -149.7746 2
# If you want to sum the number of counts by state
library(plyr)
> ddply(df, .(state), transform, count2 = sum(count))
zip city state latitude longitude count count2
1 99501 Anchorage AK 61.21680 -149.8783 12 31
2 99502 Anchorage AK 61.15369 -149.9593 17 31
3 99506 Elmendorf_AFB AK 61.22438 -149.7746 2 31
4 96860 Pearl_Harbor HI 24.85983 -168.0218 36 75
5 96863 Kaneohe_Bay HI 21.43987 -157.7477 39 75
Maybe aggregate would be a nice and simple solution for you:
df
zip city state latitude longitude count
1 96860 Pearl Harbor HI 24.85983 -168.0218 36
2 96863 Kaneohe Bay HI 21.43987 -157.7477 39
3 99501 Anchorage AK 61.21680 -149.8783 12
4 99502 Anchorage AK 61.15369 -149.9593 17
5 99506 Elmendorf AFB AK 61.22438 -149.7746 2
aggregate(df$count,by=list(df$state),sum)
Group.1 x
1 AK 31
2 HI 75
aggregate(df$count,by=list(df$city),sum)
Group.1 x
1 Anchorage 29
2 Elmendorf AFB 2
3 Kaneohe Bay 39
4 Pearl Harbor 36
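If you prefer dplyr, the same state totals can be computed with group_by() and summarise(); a minimal sketch on the df shown above:
library(dplyr)

# total participants per state
df %>%
  group_by(state) %>%
  summarise(total = sum(count))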
