Inner_join() adding rows together instead of unique rows


I'm concerned that something is wrong with my R/RStudio. I'm trying to use inner_join() to get the intersection of male and female baby names from the babynames package, but the result of the join has more rows than my subset of male names. This is the code:
library(babynames)
library(dplyr)
malenames <- babynames %>%
  filter(sex == "M")
girlnames <- babynames %>%
  filter(sex == "F")
names <- inner_join(girlnames, malenames, by = "name")
To clarify, malenames has 786372 rows and girlnames has 1138293 rows, yet the result of the join has more rows than malenames. What could be going wrong? Thank you in advance for your guidance.

You need to join on both name and year, otherwise each (year, name) pair in girlnames gets matched with every row with a matching name in malenames:
names <- inner_join(girlnames, malenames, by = c("name", "year"))
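To see why joining on name alone inflates the row count, here is a minimal sketch on made-up data. The toy data frames girls and boys are assumptions for illustration and are not taken from babynames. Recent dplyr versions may also warn about a many-to-many relationship in the first join.

```r
library(dplyr)

# Hypothetical toy data: one shared name recorded in two years for each sex
girls <- data.frame(year = c(2000, 2001), name = "Alex", n = c(10, 12))
boys  <- data.frame(year = c(2000, 2001), name = "Alex", n = c(50, 55))

# Joining on name only pairs every girls row with every boys row of that name
by_name <- inner_join(girls, boys, by = "name")

# Joining on name and year pairs each row with at most one counterpart
by_name_year <- inner_join(girls, boys, by = c("name", "year"))

nrow(by_name)       # 4 (2 x 2 combinations)
nrow(by_name_year)  # 2
```

With the real data the same multiplication happens for every name that appears in many years, which is how the result can exceed the size of either input.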

Related

Lead and lag issue using dplyr

I have a data frame with 365 rows, one for each day of the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?

Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
  ungroup() %>%
  mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across():
covid_pivot %>%
  ungroup() %>%
  mutate(across(-date, ~lag(.x)))
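The effect of the grouping can be reproduced on a small made-up data frame. This is only a sketch: toy and its values are assumptions for illustration, not the asker's data.

```r
library(dplyr)

# Hypothetical toy data: one row per date, as in the question
toy <- data.frame(date = as.Date("2020-01-01") + 0:2, Cook = c(5, 7, 9))

# Grouped by date, every group has a single row, so lag() has nothing to look back at
grouped_lag <- toy %>%
  group_by(date) %>%
  mutate(Cook = lag(Cook)) %>%
  ungroup()

# Without grouping, lag() looks back across the whole column
plain_lag <- toy %>%
  mutate(Cook = lag(Cook))

grouped_lag$Cook  # NA NA NA
plain_lag$Cook    # NA  5  7
```

Calling group_vars(covid_pivot) is a quick way to check whether a data frame is still grouped.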

Merging many columns in R

I have an issue with merging many columns by the same ID. I know that this is possible for two lists, but I need to combine all the species columns into one, so that the first column is species (combined), followed by w, w.1, w.2, w.3, w.4... The species columns all contain the same species but not in the same order, so I can't just drop every other column, because the w values would then no longer be associated with the right species. This is an extremely large dataset of 10000 rows and 2000 columns, so the solution needs to be automated. Dataset attached.
Thank you for any help

If your data is in a frame called dt, you can use lapply() along with bind_rows() like this:
library(dplyr)
library(tidyr)
bind_rows(
  lapply(seq(1, ncol(dt), 2), function(x) {
    dt[, c(x, x + 1)] %>%
      rename_with(~c("Species", "value")) %>%
      mutate(w = colnames(dt)[x + 1])
  })
) %>%
  pivot_wider(id_cols = Species, names_from = w)
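As a quick check of the approach, here is a sketch that runs the same steps on a tiny made-up frame with two Species/w pairs. The column names and values are assumptions for illustration, and values_from is written out explicitly although "value" is the default.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the second Species column lists the species in a different order
dt <- data.frame(
  Species   = c("a", "b", "c"),
  w         = c(1, 2, 3),
  Species.1 = c("c", "a", "b"),
  w.1       = c(30, 10, 20)
)

wide <- bind_rows(
  # Take each (Species, w) pair of columns and stack the pairs in long form
  lapply(seq(1, ncol(dt), 2), function(x) {
    dt[, c(x, x + 1)] %>%
      rename_with(~ c("Species", "value")) %>%
      mutate(w = colnames(dt)[x + 1])
  })
) %>%
  # Spread back out: one row per species, one column per original w column
  pivot_wider(id_cols = Species, names_from = w, values_from = value)

wide
```

Each species ends up on one row with its own w and w.1 values, regardless of the order in which the species appeared in the original columns.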

How do I get rid of multiple columns with the same name in R?

I'm gathering SAT scores by school districts in Texas and their amount of education spending. The data for SAT scores come in csv files that are split by year. I want to consolidate the scores into my dataframe that has the amount of education spending without creating multiple columns for Total, Math score, Reading score, etc.
I've tried the different types of join functions, semi_join, full_join, left_join, etc. but none of these seems to address the issue I am having.
temp1 <- left_join(temp, sat17, by = c("District", "year")) %>%
  left_join(., sat16, by = c("District", "year")) %>%
  left_join(., sat15, by = c("District", "year")) %>%
  left_join(., sat14, by = c("District", "year")) %>%
  left_join(., sat13, by = c("District", "year")) %>%
  left_join(., sat12, by = c("District", "year")) %>%
  left_join(., sat11, by = c("District", "year"))
The output gives me columns Math.x, Math.y, Total.x, Total.y, and so on for each joined dataframe. Also, sat17 includes a column called ERW, instead of Reading because the test changed that year. I want to keep ERW separate, and the rest of the Reading, Math, and Total scores to line up under one of each column.

I think what you want is to bind them together, that is, to stack them one on top of the other.
Try:
do.call(rbind, dfs) # dfs is the list of data frames; rbind() requires identical column names
or using dplyr, which fills columns missing from a data frame with NA:
library(dplyr)
bind_rows(dfs)

Explanation
dplyr automatically renames any columns that are not join keys but share a name across the two data sets.
In your case, since you only join by=c("District", "year"), any other columns that have the same name get renamed.
Columns from the starting data set get .x appended to their names, while columns from the data set being left joined get .y appended.
Solution
If you want to have Math, Reading, and Total each in a single column, then you need to stack the data sets on top of each other with dplyr::bind_rows():
combined_sat <- dplyr::bind_rows(sat17, sat16, sat15, sat14, sat13, sat12, sat11)
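A minimal sketch of how this behaves when one year has ERW instead of Reading, followed by a single join onto the spending data. All values below, and the spending column name, are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: the 2017 file has ERW where earlier files have Reading
sat16 <- data.frame(District = "A", year = 2016, Math = 500, Reading = 480)
sat17 <- data.frame(District = "A", year = 2017, Math = 510, ERW = 520)
temp  <- data.frame(District = "A", year = c(2016, 2017), spending = c(100, 110))

# Stacking keeps one Math column; ERW and Reading stay separate, filled with NA where absent
combined_sat <- bind_rows(sat17, sat16)
names(combined_sat)  # "District" "year" "Math" "ERW" "Reading"

# One join is then enough, so no .x/.y suffixes are created
temp1 <- left_join(temp, combined_sat, by = c("District", "year"))
```

Because each (District, year) pair appears once in the stacked scores, the join adds columns without adding rows.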

Or, if you want to bind them at the .csv level to begin with, put all your files into a subdirectory called "data". You can try something like this:
library(tidyverse) # loads purrr, readr, tidyr, dplyr and tibble
binded_data <- tibble(filenames = list.files("data", pattern = "\\.csv$", full.names = TRUE)) %>%
  mutate(yearly_sat = map(filenames, read_csv)) %>%
  unnest(cols = yearly_sat)

Data frame subset according to matching values in R

I have a data.frame with information on racing performance on horses. I have a variable Competition.year that has a "Total" row and then a row for each year the horse competed. I also have a variable Competition.age that describes the age the horses were in each specific year they competed.
I am trying to create a subsetted df based on each horse's best racing time and the age it was when it achieved that time. The race time in the "Total" row is the horse's best one. So I need to tell R to find the yearly row whose race time equals the time in the Total row, and to include the age from that row in the new data frame. I am super new to R, so I have no idea where to even begin. I've tried some things I've seen on other questions but I can't get it right. Any help would be much appreciated!
My df looks like this:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
The desired df should have 223 rows (since that is the total number of horses I have) with columns Name, Competition.year=="Total", Time.record.auto.start, Time.record.volt.start and Competition.age

Firstly, I had to change your sample data to make sure all 5 variables only had 14 observations each. I did this by removing the final NA in the Competition.age variable. I also had to swap around the 94.3 and 98.3 values in the Time.record.volt.start variable so that the values lined up with what was expected in the Total column for the horse with Name equal to Pelson Poika.
Here is the corrected data:
travdata <- data.frame(
"Name"=c(rep("Muuttuva",3),rep("Pelson Poika",7),rep("Muusan Muisto",4)),
"Competition.year" = c("Total",2005,2004,"Total",2003,2004,2006,2005,2002,2001,2008,2010,"Total",2009),
"Time.record.auto.start"=c(93.5,NA,93.5,96.5,NA,NA,104.2,96.5,NA,96.6,NA,NA,NA,NA),
"Time.record.volt.start"=c(92.5,98.4,92.5,94.3,NA,105.3,98.3,94.3,102.1,99.1,107.5,NA,107.5,NA),
"Competition.age"=c(NA,6,7,NA,4,5,6,7,8,9,NA,5,6,7))
And here is a simple dplyr solution, which I think does what you want.
library(dplyr)
# Best times per horse, taken from the "Total" rows
df1 <- travdata %>%
  group_by(Name) %>%
  filter(Competition.year == "Total") %>%
  select(Name, Time.record.auto.start, Time.record.volt.start)
# Yearly rows, which carry the competition age
df2 <- travdata %>%
  filter(Competition.year != "Total")
# Keep the yearly rows whose times equal the horse's best times
df3 <- inner_join(
  df1,
  df2,
  by = c("Name", "Time.record.auto.start", "Time.record.volt.start")
)
The dataframe df3 should return what you were after.

Binding dataframes with matching country names

I have two data frames of country data.
df1 has all the countries of the world.
df2 has a subset of countries but has the populations in one of its columns.
I want to take the population data and add it to df1 where the country names are a match.
If df1$Column1 = df2$Column1 (same country name) then populate df1$Column2 (currently empty) with the information from df2$Column2 (country's population) where the row is the one for that country match.
I tried to merge the two using the column "NAME", which they both have for country names:
total <- merge(map,Co2_2x, by="NAME")
the columns are all there but I get empty rows in my new dataframe.
I'd like to be able to say "for this row and column matrix position in df1 (the country), get the row (country name match in df2) and column X (population data). Then put it in this row and column Y matrix position in df1 (new population column in df1 for the matched country name)"... There must be an easier way :-)
Here is my code. I'd like to fill map$Measure with data from Co2_2x$premium where the countries match.
library(XML)
library(raster)
library(rgdal)
download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip",destfile="TM_WORLD_BORDERS_SIMPL-0.3.zip")
unzip("TM_WORLD_BORDERS_SIMPL-0.3.zip",exdir=getwd())
polygons <- shapefile("TM_WORLD_BORDERS_SIMPL-0.3.shp")
polygons
map <- as.data.frame(polygons)
map$Measure <- 0
library(rvest)
Co2 <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_carbon_dioxide_emissions")
Co2_2x <- Co2 %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table()
names(Co2_2x)[2] <- "premium"
names(Co2_2x)[1] <- "NAME"
total <- merge(map,Co2_2x, by="NAME")
Thanks!

To keep the rows of the first dataset that have no match in the other dataset, you just need to add the all.x = TRUE option, as follows (have a look at the documentation for details):
total <- merge(map, Co2_2x, by = "NAME", all.x = TRUE)
These rows will then appear with NA in the columns coming from the second dataset.
If the matching doesn't seem to work, you may want to make sure that your matching variable (in your case, NAME) is filled in exactly the same way in the two datasets (letter case, possible spaces at the extremities...).
This answer provides a fine way of doing so.
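One possible way to normalize the key before merging is sketched below in base R. The data frames map_df and co2_df and their values are made up for illustration, and the helper column key is an assumption, not part of the original data.

```r
# Hypothetical toy data: same countries, but inconsistent case and stray spaces
map_df <- data.frame(NAME = c("France", "Peru", "Chad"))
co2_df <- data.frame(NAME = c(" france", "PERU "), premium = c(1.5, 2.5))

# Build a normalized key: strip surrounding whitespace and lower-case the name
map_df$key <- tolower(trimws(map_df$NAME))
co2_df$key <- tolower(trimws(co2_df$NAME))

# all.x = TRUE keeps every row of map_df; unmatched countries get NA for premium
total <- merge(map_df, co2_df[, c("key", "premium")], by = "key", all.x = TRUE)
total
```

Rows that still come back as NA after this step usually differ in spelling rather than in formatting (for example, alternative country names), and need to be reconciled by hand.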

You can use the sqldf library in R.
Just follow the code below. You'll be able to merge (join) the two datasets that you have:
library(sqldf)
merged_data <- sqldf("select a.country, b.population from df1 as a
  left join df2 as b on (a.country = b.country) group by 1")
Thanks and happy R-programming!!!
