How do I merge two datasets using R without getting duplicate values?

I'm trying to merge two datasets in R. The first dataset is called AcademicData and the other one is called MathsData. When I merge the datasets, I get thousands of duplicate rows. Below are the code and a picture of the resulting merged table, called total. I'm trying to merge the datasets by the variable "gender".
Here's the code.
setwd("H:/Data application/x14484252-DAD Project")
MathsData <- read.csv("Math-Students.csv", header = TRUE, na.strings = c(""),
                      stringsAsFactors = TRUE)
AcademicData <- read.csv("Academic-Performance.csv", header = TRUE,
                         na.strings = c(""), stringsAsFactors = TRUE)
total <- merge(MathsData, AcademicData, by="gender", all.x=TRUE)
As you can see from the image, the merge creates 93,435 rows in the table called total. (image: Table)
Here's an image of the first dataset in Excel.
(image: Academic Dataset)
Here's an image of the second dataset in Excel.
(image: MathsData)
I want to merge the two datasets by gender without creating duplicate rows in the table called total.

You could do this:
library(data.table)
setDT(MathsData); setDT(AcademicData)
MathsData[AcademicData, mult = "first", on = "gender", nomatch=0L]
Since you did not provide reproducible data, I couldn't test the code, but I think it should work.
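For context on where the extra rows come from: gender has only a few distinct values, so merge() pairs every row of one table with every row of the other that shares the same value. Here is a minimal base R sketch with invented data (the columns score_m and score_a are hypothetical) that shows the effect, plus one way to keep a single match per row, in the same spirit as mult = "first" above:

```r
# Toy data, invented for illustration only
maths <- data.frame(gender = c("F", "F", "M"),
                    score_m = c(10, 12, 9),
                    stringsAsFactors = FALSE)
academic <- data.frame(gender = c("F", "F", "M", "M"),
                       score_a = c(70, 65, 80, 75),
                       stringsAsFactors = FALSE)

# Many-to-many join: every "F" row pairs with every "F" row, same for "M"
total <- merge(maths, academic, by = "gender", all.x = TRUE)
nrow(total)        # 2 x 2 + 1 x 2 = 6 rows

# Keep only the first academic row per gender before merging,
# so each maths row matches at most one academic row
academic_first <- academic[!duplicated(academic$gender), ]
total_first <- merge(maths, academic_first, by = "gender", all.x = TRUE)
nrow(total_first)  # 3 rows, one per maths row
```

Whether taking the first match is meaningful depends on the data. Gender alone does not identify a student, so a real fix probably needs a key that is unique in at least one of the two tables.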

Related

Transform a csv or excel table (with rows in one order and eg 20 columns with head) into another table with same rows in other pre-established order

I want to transform a CSV or Excel table (with rows in one order and, e.g., 20 columns with a header) into another table with the same rows in a different, pre-established order.
Thank you very much
Suppose your table looks a bit like this once you've loaded it into R:
# Packages
library(tidyverse)
# Sample Table
df <- tibble(names = c("Jane", "Boris", "Michael", "Tina"),
             height = c(167, 175, 182, 171),
             age = c(26, 45, 32, 51),
             occupation = c("Teacher", "Construction Worker", "Salesman", "Banker"))
If all you want to do is reorder the columns, you can do the following:
df <- df %>%
  select(occupation, height, age, names)
There are other ways to do this, especially if you only want to move one or two columns out of your 20. But if you want to rearrange all of them, this will do it.
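If the goal is to reorder the rows rather than the columns, one possible approach is to index the table with match() against a vector holding the desired order. This is only a sketch in base R. It assumes the pre-established order is available as a vector of values from a key column (the vector wanted_order is hypothetical):

```r
df <- data.frame(names = c("Jane", "Boris", "Michael", "Tina"),
                 height = c(167, 175, 182, 171),
                 age = c(26, 45, 32, 51),
                 stringsAsFactors = FALSE)

# Hypothetical pre-established row order, expressed through the key column
wanted_order <- c("Tina", "Jane", "Michael", "Boris")

# match() returns, for each wanted name, its row position in df
df_reordered <- df[match(wanted_order, df$names), ]
df_reordered$names
```

This relies on the key column being unique; with repeated keys, match() only finds the first occurrence.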

R - adding a variable from another dataset with different # rows

I'm currently working in R on a survey of schools, and I would like to add a variable with the population of the city each school is in.
In the first dataset I have all the survey respondents, including a variable "city_name". I have found a list of cities with their populations online and imported it into R.
What I would now like to do is add a variable to data_set_1 called Pop_city that equals the city population when city_name appears in both datasets. It might be relevant that the first dataset has around 1200 rows while the second one has around 36000 rows.
I've tried several things including the following:
data_set_1$Pop_city = ifelse(data_set_1$city_name == data_set_2$city_name, data_set_2$Pop_city, 0)
Any clues?
Thanks!!
You need to merge the two datasets:
new_df <- merge(data_set_1, data_set_2, by="city_name")
The result will be a data frame containing only the matching rows and all columns of both data frames. In your case that is 1200 rows, assuming that all cities in data_set_1 are also in data_set_2 and that each city_name appears only once in data_set_2. If you also want to keep the non-matching rows of data_set_1, you can use the all.x option:
new_df <- merge(data_set_1, data_set_2, by="city_name", all.x=TRUE)
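To make the difference between the two calls concrete, here is a small sketch with invented data (the city names, schools and population figures are placeholders, not real values):

```r
# Toy stand-ins for data_set_1 and data_set_2
data_set_1 <- data.frame(city_name = c("Alpha", "Beta", "Delta"),
                         school = c("S1", "S2", "S3"),
                         stringsAsFactors = FALSE)
data_set_2 <- data.frame(city_name = c("Beta", "Alpha", "Gamma"),
                         Pop_city = c(200, 100, 300),
                         stringsAsFactors = FALSE)

# Default merge: only cities present in both data frames survive
inner <- merge(data_set_1, data_set_2, by = "city_name")
nrow(inner)  # 2

# all.x = TRUE: every row of data_set_1 is kept; unmatched cities get NA
left <- merge(data_set_1, data_set_2, by = "city_name", all.x = TRUE)
nrow(left)   # 3
left$Pop_city[left$city_name == "Delta"]  # NA
```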
Here are two ways you could try using dplyr. The first looks up each city's population with match():
library(dplyr)
data_set_1 %>%
  mutate(Pop_city = ifelse(city_name %in% data_set_2$city_name,
                           data_set_2$Pop_city[match(city_name, data_set_2$city_name)],
                           0))
The second uses a left_join:
data_set_1 %>%
  left_join(data_set_2, by = "city_name")
perhaps followed by select(all_of(names(data_set_1)), Pop_city).

rbind three data bases using Rbind function

I know it's a newbie question. I have three xlsx files containing three databases with the same 14 variables; it's a cross-section data panel.
All I want is to concatenate them into one single database called eplt.
First, I import them:
library(dplyr)
library(ggplot2)
library(xlsx)
##Import the three data bases
epl_data<-read.xlsx("Notes_ETAB2016-2017.xlsx",sheetIndex = 1,header = TRUE)
epl_data2<-read.xlsx("Notes_ETAB2017-2018.xlsx",sheetIndex = 1,header = TRUE)
epl_data3<-read.xlsx("Notes_ETAB2018-2019.xlsx",sheetIndex = 1,header = TRUE)
## to render the number of rows in each of them
nrow(epl_data)
nrow(epl_data2)
nrow(epl_data3)
# I want to rbind the three sets together
eplt<-rbind(epl_data,epl_data2,epl_data3)
The total number of rows is 29441, but when I apply rbind to bind them all together I get this error:
> eplt<-rbind(epl_data,epl_data2,epl_data3)
Error in match.names(clabs, names(xi)) :
names do not match previous names
But the names of the variables in the three sets are the same.
Could someone please help? I only want to bind 25000 observations and keep the remaining 4441 to compare with the predictions of a multiple regression model.
Thanks in advance.
The third data frame doesn't have the same column names as the first two: Svt isn't in upper case.
One way is to apply the names of one data frame to the others (this assumes the columns are in the same order in all three):
colnames(epl_data2) <- colnames(epl_data)
colnames(epl_data3) <- colnames(epl_data)
But I recommend the janitor package whenever your data comes from Excel files, since variable name issues are common there. This package ensures consistently formatted column names:
epl_data <- janitor::clean_names(epl_data)
epl_data2 <- janitor::clean_names(epl_data2)
epl_data3 <- janitor::clean_names(epl_data3)
After that, the rbind should work.
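The error and the fix can be reproduced on a small scale. This sketch uses two invented data frames (the ID column and the values are made up) whose only difference is the case of one column name:

```r
# Two toy data frames; the second spells the column "Svt" instead of "SVT"
a <- data.frame(ID = 1:2, SVT = c(12, 15))
b <- data.frame(ID = 3:4, Svt = c(9, 14))

# rbind() on data frames matches columns by name, so the case
# difference is enough to raise "names do not match previous names"
res <- try(rbind(a, b), silent = TRUE)
inherits(res, "try-error")  # TRUE

# Normalising the case of the names makes them line up
names(a) <- tolower(names(a))
names(b) <- tolower(names(b))
combined <- rbind(a, b)
nrow(combined)  # 4
```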
As already mentioned, you have a mismatch in the variable name 'SVT'. Here is an alternative that makes the column names lower case and binds the files together into one data frame.
library(dplyr)
library(purrr)
# The pattern matches all three Notes_ETAByyyy-yyyy.xlsx files
eplt <- list.files(pattern = 'Notes_ETAB\\d+-\\d+\\.xlsx') %>%
  map_df(~readxl::read_excel(.x) %>% rename_with(~tolower(.)))

Merging using two columns is bringing mismatching results on the second column using R

I would like to merge the table on the left (the first data frame, data1) with the table on the right (the second data frame, data2) using R. I tried the following code but keep getting mismatched population.id columns.
merged <- merge(data1, data2, by.x=c("article.id", "population.id"),
                by.y=c("article id", "population.id"))

Match observations between two datasets by ID

I am working with the following data: http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv
What I want to do is train my algorithm to predict correctly if a person will drop out in the subsequent period.
data1 <- subset(data, YEAR==1984)
data2 <- subset(data, YEAR==1985)
didtheydrop <- as.integer(data1$id)
didtheydrop <- lapply(didtheydrop, function(x) as.integer(ifelse(x==data2$id, 0, 1)))
This created a large list with the values that I think I wanted, but I'm not sure. In the end, I would like to append this variable to the 1984 data and then use that to create my model.
What can I do to ensure that the appropriate values are compared? The list lengths aren't the same, and it's also not the case that they appear in the correct order (i.e. respondents 3 - 7 do not respond in 1984 but they appear in 1985)
Assuming data1 and data2 are two data frames (this is unclear, because it appears that you extracted them from a larger single data frame called data), I think it is better to merge them and work with a single data frame. That is, if there is a single larger data frame, do not subset it; just delete the columns you do not need. If data1 and data2 are two separate data frames, merge them and work with only one.
There are multiple ways to do this in R.
You should review the merge function by calling ?merge in your console and reading the function description.
Essentially, to merge two dataframes, you should do something like:
# columnID is the name of the variable that holds the ID;
# if the name differs between data1 and data2, use by.x and by.y instead
merge(data1, data2, by = columnID)
Then decide which unmatched rows to keep with the parameters all.x, all.y, and all: all.x = TRUE keeps every row of data1 even if no match is found in data2, all.y = TRUE keeps every row of data2 even if no match is found in data1, and all = TRUE keeps every row of both regardless of whether there is a matching ID in the other data frame.
merge() is part of base R, so it is available in any installation of R.
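As a quick illustration of how these parameters change the result, here is a sketch with two invented data frames that share only some IDs (the columns x and y are hypothetical):

```r
d1 <- data.frame(ID = c(1, 2, 3), x = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ID = c(2, 3, 4), y = c("B", "C", "D"),
                 stringsAsFactors = FALSE)

n_inner <- nrow(merge(d1, d2, by = "ID"))                # IDs 2, 3
n_left  <- nrow(merge(d1, d2, by = "ID", all.x = TRUE))  # IDs 1, 2, 3
n_right <- nrow(merge(d1, d2, by = "ID", all.y = TRUE))  # IDs 2, 3, 4
n_full  <- nrow(merge(d1, d2, by = "ID", all = TRUE))    # IDs 1, 2, 3, 4
c(n_inner, n_left, n_right, n_full)
```

Rows without a match in the other data frame get NA in the columns that come from it.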
You can also use dplyr package, which makes the type of join even more explicit:
inner_join(data1, data2, by = "ID")
left_join(data1, data2, by = "ID")
right_join(data1, data2, by = "ID")
full_join(data1, data2, by = "ID")
This is a good reference for dplyr joins: https://rpubs.com/williamsurles/293454
Hope it helps
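For the specific goal in the question (flagging 1984 respondents who are missing in 1985), a merge may not even be needed. One possible alternative, not part of the answer above, is to test membership with %in%, which does not care about row order or differing lengths. This sketch uses invented IDs and assumes the identifier column is called id as in the question's code:

```r
# Toy stand-ins for the 1984 and 1985 subsets
data1 <- data.frame(id = c(1, 2, 8, 9), YEAR = 1984)
data2 <- data.frame(id = c(2, 3, 4, 9), YEAR = 1985)

# 1 if the 1984 respondent does not appear in 1985, 0 otherwise
data1$dropped <- as.integer(!(data1$id %in% data2$id))
data1$dropped
```

The result is a plain integer column with one value per 1984 row, so it can be used directly as the outcome variable of a model.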
