Binding dataframes with matching country names

I have two data frames of country data.
df1 has all the countries of the world.
df2 has a subset of countries but has the populations in one of its columns.
I want to take the population data and add it to df1 where the country names are a match.
If df1$Column1 == df2$Column1 (same country name), then populate df1$Column2 (currently empty) with the value from df2$Column2 (the country's population) in the row for that matching country.
I tried to merge the two using the column "NAME", which they both have for country names:
total <- merge(map,Co2_2x, by="NAME")
The columns are all there, but I get empty rows in my new data frame.
I'd like to be able to say: "for this row and column position in df1 (the country), get the matching row in df2 (same country name) and column X (population data), then put that value in this row and column Y of df1 (the new population column for the matched country)". There must be an easier way :-)
Here is my code. I'd like to fill map$Measure with data from Co2_2x$premium where the countries match.
library(XML)
library(raster)
library(rgdal)
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")
polygons
map <- as.data.frame(polygons)
map$Measure <- 0
library(rvest)
Co2 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions")
Co2_2x <- Co2 %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
names(Co2_2x)[2] <- "premium"
names(Co2_2x)[1] <- "NAME"
total <- merge(map,Co2_2x, by="NAME")
Thanks!

To keep the rows of the first dataset that have no match in the second, add the all.x = TRUE option (see the documentation of merge() for details):
total <- merge(map, Co2_2x, by = "NAME", all.x = TRUE)
These rows will then appear with NA in the second dataset columns.
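If the goal is to fill the existing Measure column in place rather than build a merged data frame, base R's match() can do the lookup. The sketch below uses small made-up data frames; map_df and co2_df are hypothetical stand-ins for map and Co2_2x.

```r
# Hypothetical toy data standing in for map and Co2_2x
map_df <- data.frame(NAME = c("France", "Germany", "Spain"),
                     Measure = NA_real_)
co2_df <- data.frame(NAME = c("Germany", "France"),
                     premium = c(2.5, 1.5))

# For each country in map_df, find its row position in co2_df (NA if absent)
idx <- match(map_df$NAME, co2_df$NAME)

# Copy the matched values across; unmatched countries stay NA
map_df$Measure <- co2_df$premium[idx]
```

Countries without a match (Spain in this toy data) keep NA, which mirrors what all.x = TRUE gives with merge().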
If the matching doesn't seem to work, make sure the matching variable (in your case, NAME) is written exactly the same way in both datasets (letter case, leading or trailing spaces, and so on).
This answer provides a fine way of doing so.
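As a minimal sketch of that kind of clean-up, assuming the mismatches are only letter case and surrounding whitespace, one could build a normalized key in both data frames and merge on it. The data and the helper clean_key() below are made up for illustration.

```r
# Hypothetical toy data with inconsistent case and stray spaces
map_df <- data.frame(NAME = c("France", "Germany", "Spain"),
                     stringsAsFactors = FALSE)
co2_df <- data.frame(NAME = c(" france", "Germany "),
                     premium = c(1.5, 2.5),
                     stringsAsFactors = FALSE)

# Hypothetical helper: trim whitespace and lower-case the names
clean_key <- function(x) tolower(trimws(x))

map_df$key <- clean_key(map_df$NAME)
co2_df$key <- clean_key(co2_df$NAME)

# Merge on the cleaned key, keeping every row of map_df
total <- merge(map_df, co2_df[, c("key", "premium")],
               by = "key", all.x = TRUE)
```

Real country lists often differ in more than case and spacing (for example "Russia" versus "Russian Federation"), so some names may still need a manual lookup table.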

You can use the sqldf library in R.
Just follow the code below. You'll be able to merge (join) the two datasets that you have:
library(sqldf)
merged_data <- sqldf("select a.country, b.population from df1 as a
left join df2 as b on (a.country = b.country) group by 1")
Thanks and happy R-programming!!!

Related

Adding a column of a dataframe to another dataframe if they match in another column

For a university project, I'm working with large stock price data frames.
I have two dataframes.
Dataframe df1 includes the daily close prices over a certain time. The header includes the stock's shortcut.
Dataframe df2 includes the stock's shortcut in the first column and the industry name of the stock's firm in the second column. It is IMPORTANT to know that df2 contains more values than df1 (but every value in df1 should be in df2).
Is there a way to insert the second column of df2 into the first row of df1 where they match (i.e. value from df1 header = df2 first column)?
# Example Code
df1=as.data.frame(matrix(runif(20,min=0,max=1), nrow = 4))
df1
df2 <- as.data.frame(c("V1","V829","V2","V3","V493","V4","V5","V6","V992","V7"))
df2$insert <- c("test1","test2","test3","test4","test5","test6","test7","test8","test9","test10")
names(df2) <- c("Column2","test")
df1
df2
# Now insert/combine df2$test in (or over) df1[1,] as a row, if names(df1) and df2$Column2 matches
[Image: DataFrame df1]
[Image: DataFrame df2]
Thank you for your answers guys!
Nino

I would recommend you reshape your df1 into long format (see Reshaping data.frame from wide to long format).
library(dplyr)  # for %>% and left_join()
library(tidyr)  # for gather() and spread()
df1_long <- df1 %>% gather(Instrument, value, -X)
I would organize the data this way because it makes it easier to use left_join() to match the data frames (see the description of mutating joins on the data wrangling cheat sheet).
df <- left_join(df1_long, df2, by = "Instrument")
If you want you can then make your dataframe wide again using the spread() function, which is the reverse of gather().
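For illustration, here is a sketch of the same idea applied to the example data from the question. It assumes a recent tidyr and uses pivot_longer() in place of gather(); the day column is a hypothetical row identifier added only so the long table keeps track of which row each price came from.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df1 <- as.data.frame(matrix(runif(20, min = 0, max = 1), nrow = 4))
df2 <- data.frame(
  Column2 = c("V1", "V829", "V2", "V3", "V493", "V4", "V5", "V6", "V992", "V7"),
  test = paste0("test", 1:10),
  stringsAsFactors = FALSE
)

# Wide to long: one row per (day, stock) pair; 'day' is a hypothetical id
df1_long <- df1 %>%
  mutate(day = row_number()) %>%
  pivot_longer(-day, names_to = "Column2", values_to = "price")

# Attach the label from df2 to every price of the matching stock
joined <- left_join(df1_long, df2, by = "Column2")
```

Stocks that appear only in df2 (such as V829) are simply ignored by the left join, which fits the case where df2 has more entries than df1.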
For the future I recommend you generate a reproducible example, rather than linking image files of your dataframes, as the links might expire, and it makes it generally less likely to get an answer on Stack Overflow.

Changing a Column to an Observation in a Row in R

I am currently struggling to transition one of my columns in my data to a row as an observation. Below is a representative example of what my data looks like:
library(tidyverse)
test_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409),
abstention=rep(52199))
As seen above, the abstention column exists at the end of my data frame, and I would like my data to look like the following:
library(tidyverse)
desired_df <- tibble(unit_name=rep("Chungcheongbuk-do"),unit_n=rep(2),
can=c("Cho Bong-am","Lee Seung-man","Lee Si-yeong","Shin Heung-woo","abstention"),
pev1=rep(510014),vot1=rep(457815),vv1=rep(445955),
ivv1=rep(11860),cv1=c(25875,386665,23006,10409,52199))
Here, abstentions are treated like a candidate, in the can column. Thus, the rest of the data is maintained, and the abstention values are their own observation in the cv1 column.
I have tried using pivot_wider, but I am unsure how to use the arguments to get what I want. I have also considered using t() to transpose the column into a row, but I am having a hard time slotting it back into my data. Any help is appreciated! Thanks!

Here's a strategy that will work if you have multiple unit_names:
test_df %>%
  group_split(unit_name) %>%
  map(function(group_data) {
    # Take the first row of the group and turn it into the abstention row
    slice(group_data, 1) %>%
      mutate(can = "abstention", cv1 = abstention) %>%
      add_row(group_data, .) %>%
      select(-abstention)
  }) %>%
  bind_rows()
Basically we split the data up by unit_name, then we grab the first row for each group and move the values around. Append that as a new row to each group, and then re-combine all the groups.
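A different route to the same result, sketched here as an alternative rather than as the method above, is to build the abstention rows in one step with distinct() and append them with bind_rows(). It assumes every column other than can and cv1 is constant within a unit, as in the example data.

```r
library(dplyr)

# Example data from the question
test_df <- tibble(unit_name = "Chungcheongbuk-do", unit_n = 2,
                  can = c("Cho Bong-am", "Lee Seung-man", "Lee Si-yeong", "Shin Heung-woo"),
                  pev1 = 510014, vot1 = 457815, vv1 = 445955,
                  ivv1 = 11860, cv1 = c(25875, 386665, 23006, 10409),
                  abstention = 52199)

# One row per unit, with abstention recast as a "candidate"
abstention_rows <- test_df %>%
  distinct(unit_name, unit_n, pev1, vot1, vv1, ivv1, abstention) %>%
  mutate(can = "abstention") %>%
  rename(cv1 = abstention)

# Drop the old column and append the new rows (columns are matched by name)
result <- test_df %>%
  select(-abstention) %>%
  bind_rows(abstention_rows)
```

With several units the abstention rows all land at the end of the table, so a final sort by unit_name may be wanted to keep each unit's rows together.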

R: Adding a column from one dataset to another based on matching multiple columns

I have two datasets:
DS1 - contains a list of subjects with columns for name, ID number and employment status.
DS2 - contains the same list of subject names and ID numbers, but some of the ID numbers are missing in this second dataset.
It also contains a third column for Education Level.
I want to merge the Education column onto the first dataset. I have done this using the merge function by ID number, but because some of the ID numbers are missing in the second dataset, I want to merge the remaining education levels by name as a secondary option. Is there a way to do this using dplyr/tidyverse?

There are two ways you can do this. Choose the one based on your preference.
1st option:
library(dplyr)  # needed for %>%
# Left join twice, renaming the education column each time so no '.x'/'.y' duplicates appear
finalDf <- DS1 %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(ID, EducationLevel1 = EducationLevel), by = "ID") %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel2 = EducationLevel), by = "Name") %>%
  # Prefer the ID match; fall back to the name match when it is missing
  dplyr::mutate(FinalEducationLevel = ifelse(is.na(EducationLevel1), EducationLevel2, EducationLevel1))
2nd option:
library(dplyr)  # needed for %>%
# First find the records whose ID is present in the 2nd dataset
commonIds <- DS1 %>%
  dplyr::inner_join(DS2 %>%
                      dplyr::select(ID, EducationLevel), by = "ID")
# Now the records whose ID was not present in DS2, matched by name instead
idsNotPresent <- DS1 %>%
  dplyr::filter(!ID %in% commonIds$ID) %>%
  dplyr::left_join(DS2 %>%
                     dplyr::select(Name, EducationLevel), by = "Name")
# Bind these two data frames to get the final one
finalDf <- dplyr::bind_rows(commonIds, idsNotPresent)
Let me know if this works.
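As a compact variation on the first option, dplyr's coalesce() can pick the first non-missing value instead of ifelse(). The sketch below uses made-up data; the column names follow the question, but the values and the intermediate names by_id, by_name, edu_by_id and edu_by_name are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: Bob's ID is missing in DS2
DS1 <- tibble(Name = c("Ann", "Bob", "Cy"),
              ID = c(1, 2, 3),
              Employment = c("Full", "Part", "None"))
DS2 <- tibble(Name = c("Ann", "Bob", "Cy"),
              ID = c(1, NA, 3),
              EducationLevel = c("BSc", "MSc", "PhD"))

# Lookup tables: one keyed by ID (missing IDs dropped), one keyed by Name
by_id   <- DS2 %>% filter(!is.na(ID)) %>% select(ID, edu_by_id = EducationLevel)
by_name <- DS2 %>% select(Name, edu_by_name = EducationLevel)

final <- DS1 %>%
  left_join(by_id, by = "ID") %>%
  left_join(by_name, by = "Name") %>%
  # Use the ID match when available, otherwise the name match
  mutate(EducationLevel = coalesce(edu_by_id, edu_by_name)) %>%
  select(-edu_by_id, -edu_by_name)
```

This assumes names are unique in DS2; duplicated names would duplicate rows in the second join.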

The second option in makeshift-programmer's answer worked for me. Thank you so much. I had to play around with it for my actual data sets, but the basic structure worked very well and it was easy to adapt.

Inner_join() adding rows together instead of unique rows

I'm concerned that something is wrong with my R/RStudio. I'm trying to do an inner_join() to get the intersection of male and female baby names from the babynames package, but the result of my inner_join() has more rows than my subset of male names with the following code:
library(babynames)
library(dplyr)
malenames <- babynames %>%
  filter(sex == "M")
girlnames <- babynames %>%
  filter(sex == "F")
names <- inner_join(girlnames, malenames, by ="name")
To clarify, I'm seeing 786372 rows for malenames and 1138293 rows for girlnames. What could be going wrong? Thank you in advance for your guidance.

You need to join on both name and year, otherwise each (year, name) pair in girlnames gets matched with every row with a matching name in malenames:
names <- inner_join(girlnames, malenames, by = c("name", "year"))
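The effect is easy to reproduce on a tiny made-up table (the data below is hypothetical, not taken from babynames). Recent dplyr versions may also print a many-to-many warning for the first join.

```r
library(dplyr)

# Hypothetical toy data: one name recorded in two years for each sex
girls <- tibble(year = c(2000, 2001), name = "Alex", n = c(10, 12))
boys  <- tibble(year = c(2000, 2001), name = "Alex", n = c(20, 25))

# Joining on name only pairs every girls row with every boys row
by_name <- inner_join(girls, boys, by = "name")

# Joining on name and year pairs only rows from the same year
by_name_year <- inner_join(girls, boys, by = c("name", "year"))

nrow(by_name)       # 2 x 2 combinations
nrow(by_name_year)  # one row per shared (name, year)
```

The first join grows multiplicatively with the number of years a name appears in, which is why the result can exceed either input.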

Compare two data frame values for retrieve extra values from one them in R

I have two data frames: df1 with 76349 rows and 4 columns (long, lat, country, year), and df2 with 2999 rows and 2 columns (long, lat). All the coordinates in df2 also appear in df1. I need to obtain the country and year values from df1 for the coordinates in df2. I have tried to solve this using the merge function. The output appears correct, showing the country and year values from df1 for the coordinates identical to those in df2; however, the number of rows in the output is larger than in df2 (the reference data). I tried removing NA values and duplicated values, but the output is still larger than df2.
How can I obtain the country and year values in df1 for exactly the coordinates in df2?
I use the command:
x = merge(df1,df2, by=c('long','lat'))
Thanks for helping!
Here is the link for data download:
https://www.dropbox.com/sh/zr9n56by0qs3h4l/AABjUO6wVi4zzrY2LWHH5P65a?dl=0

The package dplyr has several join options that may be helpful.
If I understand your question, the function 'inner_join' in that package should return what you want, i.e.:
library(dplyr)
x = inner_join(df2, df1)
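One possible reason the result is still larger than df2, offered here as an assumption since the actual data is not shown, is that df1 holds several rows for the same coordinates (for example one per year). The sketch below uses made-up data to show how to check for that and how to keep a single df1 row per coordinate pair.

```r
library(dplyr)

# Hypothetical toy data: the first coordinate pair occurs twice in df1
df1 <- tibble(long = c(10, 10, 20), lat = c(50, 50, 60),
              country = c("A", "A", "B"), year = c(1990, 2000, 1995))
df2 <- tibble(long = c(10, 20), lat = c(50, 60))

# The join returns one row per matching df1 row, so 3 rows for 2 coordinates
x <- inner_join(df2, df1, by = c("long", "lat"))

# Coordinates that occur more than once in df1
dup_coords <- df1 %>% count(long, lat) %>% filter(n > 1)

# Keep only the first df1 row per coordinate pair before joining
x_one <- df2 %>%
  left_join(distinct(df1, long, lat, .keep_all = TRUE), by = c("long", "lat"))
```

Whether dropping the extra rows is appropriate depends on the data; if the repeated coordinates carry different years, all of them may be legitimate matches.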