I have three datasets that I would like to merge. My data is on companies in the S&P 500 and their corporate political activity. The datasets are named PAC, Lobby, and BoardData. All three have three columns in common ("ultorg", "sector", and "subind"), plus other columns unique to each dataset.
I would like to merge the three Excel files so that each of those common columns appears only once, with all of the other variables appended.
I have tried doing this on my own, but I run into a few problems. Specifically, I get several copies of the ultorg/sector/subind columns (the variables the datasets have in common), and entries are repeated in places where they shouldn't be. For example, my board data only goes back to 2015 but my lobbying data goes back to 2000. Using the incorrect/incomplete code below, I get rows where a company's board data from 2015 is filled in for the years 2000-2015. I would like the years without a Board entry (2000-2015) to simply contain NA.
Here's the current code.
library(tidyverse)
library(janitor)
library(glue)
setwd("~/Desktop/thesis")
library(readxl)
PAC <- read_excel("PAC.xlsx")
Lobby <- read_excel("Lobby.xlsx")
BoardData <- read_excel("BoardData.xlsx")
alldata <- left_join(PAC, Lobby, by="ultorg")
alldata <- left_join(alldata, BoardData, by="ultorg")
Thank you so much for any help you are able to give me! I really appreciate it and am able to answer any questions regarding my data.
Merging by ultorg, sector, and subind will work. If the datasets also share a column that indicates the date (for example, a year), you should add that column to the join keys as well; otherwise rows from one year get matched to every year of the same company. The choice between full_join, left_join, and the other joins depends on your purpose. The code below is one example you could try.
BoardData %>%
full_join(PAC, by = c("ultorg", "sector", "subind")) %>%
full_join(Lobby, by = c("ultorg", "sector", "subind"))
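To illustrate the point about a date column, here is a minimal sketch on made-up data. It assumes each dataset has a `year` column, which the question does not confirm, and the values and the `lobby_spend`, `pac_spend`, and `board_size` columns are hypothetical. With `year` among the join keys, years without board data come out as NA instead of repeating the 2015 values.

```r
library(dplyr)

# Join keys: the three shared columns plus an ASSUMED shared `year` column
keys <- c("ultorg", "sector", "subind", "year")

# Hypothetical stand-ins for the three spreadsheets
Lobby <- tibble(ultorg = "Acme", sector = "Tech", subind = "Software",
                year = c(2000, 2015), lobby_spend = c(10, 20))
PAC <- tibble(ultorg = "Acme", sector = "Tech", subind = "Software",
              year = c(2000, 2015), pac_spend = c(1, 2))
BoardData <- tibble(ultorg = "Acme", sector = "Tech", subind = "Software",
                    year = 2015, board_size = 9)

# Joining on all keys keeps one copy of each shared column
alldata <- Lobby %>%
  full_join(PAC, by = keys) %>%
  full_join(BoardData, by = keys) %>%
  arrange(ultorg, year)

alldata
```

In this sketch the row for 2000 has NA in `board_size`, because no board row exists for that company and year.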
I usually use dplyr to filter data. I now have a huge dataset (62176 entries) of banks operating in different countries. I'd like to subset/filter that dataset for Eurozone banks only.
I haven't found any approach other than typing out the names of all the Eurozone countries and then creating a new dataset with filter.
Is there any workaround for this problem?
Thank you!
Without the data we can't give you clear answers. However, given my understanding of the problem, below are some methods.
Assuming your dataset already has a column holding each bank's operating country, you could create a vector of the countries you are interested in by hand and then filter the dataset for rows that match:
# manually assign countries to a vector (spelling must match how the countries are listed in your data)
# only a few Eurozone members are shown here; extend the vector with the rest
euro_countries <- c("Germany", "France", "Italy", "Spain")
# then filter the dataset for rows that match; I made up the column names as I don't know your data
dataframe %>% filter(op_country %in% euro_countries)
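As a rough, self-contained illustration of the vector-and-filter approach, the sketch below uses a made-up `banks` table; the table, its `bank` column, and the country values are all hypothetical. It also lists the country names that did not match the vector, which can help spot spelling differences between your data and the list you typed.

```r
library(dplyr)

# Hypothetical data; `op_country` follows the made-up column name used above
banks <- data.frame(
  bank = c("Bank A", "Bank B", "Bank C", "Bank D"),
  op_country = c("Germany", "Poland", "France", "United Kingdom")
)

# Partial list for illustration only; extend it to cover every country you need
euro_countries <- c("Germany", "France", "Italy", "Spain")

# Keep only rows whose country appears in the vector
eurozone_banks <- banks %>% filter(op_country %in% euro_countries)

# Countries present in the data but absent from the vector
unmatched <- setdiff(unique(banks$op_country), euro_countries)
```

Checking `unmatched` before trusting the result is cheap: a country spelled differently in the data would silently be dropped by the filter.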
Alternatively, depending on your dataset, you can use the very helpful countrycode package in R, which ships a lookup table, countrycode::codelist. Join your country column against the matching column in that table, then filter on its continent column for countries in "Europe". Note that this keeps all European countries, not only those in the Eurozone.
# join your dataset with the codelist table (this depends on the country column in your dataset)
dataframe <- left_join(x = dataframe, y = countrycode::codelist, by = c("op_country" = "country.name.en"))
#filter your dataset with the new column
dataframe %>% filter(continent=="Europe")
I have two CSV files: one contains contact information for survey respondents we have yet to survey, and the other contains information for survey respondents we have already contacted. I need to make a new CSV file of the contacts we have not contacted, and I believe I can do this by merging the two files in R and creating a new file that excludes the already-contacted respondents. I am very new to R and want to sharpen my skills, but this task is a little over my head and I would appreciate any help or advice.
Since I don't know your data, let's assume you have already read it in using read.csv(). df1 is the first dataset, containing all people and their contact info. df2 contains their answers. Both share the same ID, stored in the column "customer_ID" (without a unique ID this won't work, so you may have to rename a column or two).
# this is my dummy data
df1 <- data.frame("customer_ID" = 1:100,
                  "address" = rep(c("saturn", "mars"), 50))
df2 <- data.frame("customer_ID" = c(1:25, 75:99),
                  "likes_apples" = rep(c(TRUE, FALSE), 25))
You could extract the ID's from both tables and combine them. I converted it into a data frame so we can use dplyr.
df_combined <- data.frame("customer_ID" = c(df1$customer_ID, df2$customer_ID))
When you have the IDs together, you can group the data by "customer_ID" and count the number of data points per group. You store only those IDs which occur only once.
library(dplyr)
once_only <- df_combined %>%
  group_by(customer_ID) %>%
  filter(n() == 1)
You can then filter the data frame with the contact info on whether the ID is contained in the filtered data points:
df1[df1$customer_ID %in% once_only$customer_ID,]
I bet there are many better ways to do this, but I hope it helped anyway!
Edit: Okay, so I obviously learned something new from the comments. A much easier method would be:
anti_join(df1, df2, by="customer_ID")
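To tie the anti_join back to the original goal of producing a new CSV file, here is a small sketch on made-up data. The data frames and the output file name are hypothetical, and the file is written to a temporary directory so nothing on disk is overwritten.

```r
library(dplyr)

# Hypothetical stand-ins for the two CSV files after read.csv()
all_contacts <- data.frame(
  customer_ID = 1:6,
  address = c("saturn", "mars", "saturn", "mars", "saturn", "mars")
)
contacted <- data.frame(
  customer_ID = c(2L, 5L),
  likes_apples = c(TRUE, FALSE)
)

# Keep the rows of all_contacts that have no matching ID in contacted
not_contacted <- anti_join(all_contacts, contacted, by = "customer_ID")

# Write the result out; the file name is an assumption for this example
out_file <- file.path(tempdir(), "not_contacted.csv")
write.csv(not_contacted, out_file, row.names = FALSE)
```

Only the columns of the first data frame are kept, so the output has the same layout as the original contact list.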
I have a wide dataset which makes it really difficult to manipulate the data in the way I need. It looks like the dummy table below:
Dummy_table_unsorted
Essentially, as seen in the table, the information held in one row is at user level: there is a user id, followed by all the animals owned by that user in the same row. I would like the data at animal level instead, so that a user can have multiple rows, one for each of their animals. I have pasted a table below of what I would like it to look like:
Dummy_table_sorted
Is there a simple way to do this? I have an idea as to how, but it is very long-winded. I thought I could subset the columns relating to one animal at a time and then merge the datasets back together. The problem is that, in my data, one person can have up to 100 animals, which makes this approach very long-winded.
Please can someone offer a suggestion or a package/command that would allow me to change this wide dataset into a long one?
Thank You.
First, you should provide data that someone can easily insert into R. Screenshots are not helpful and increase the amount of work a person needs to perform to help you.
The data as you have it can be split and then recombined with bind_rows or rbind. I would subset the data into three data frames, rename the columns, and bind them. Assuming your original data is called df:
df1 <- df[,c(1:4)]
df2 <- df[,c(1,5:7)]
df3 <- df[,c(1,8:10)]
# rename columns to match
new_names <- c('user id', 'animal', 'colour', 'legs')
names(df1) <- new_names
names(df2) <- new_names
names(df3) <- new_names
# stack the three pieces (bind_rows comes from dplyr)
library(dplyr)
remade <- bind_rows(df1, df2, df3)
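With up to 100 animals per user, splitting by hand gets tedious. A different technique, not the one used in the answer above, is tidyr's pivot_longer. The sketch below assumes the columns follow a naming pattern such as `animal_1`, `colour_1`, `legs_1`, `animal_2`, and so on; the real column names are only visible in the screenshots, so both this pattern and the `user_id` name are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per user, one column group per animal
df <- tibble(
  user_id  = c(1, 2),
  animal_1 = c("dog", "cat"),
  colour_1 = c("brown", "black"),
  legs_1   = c(4, 4),
  animal_2 = c("bird", NA),
  colour_2 = c("green", NA),
  legs_2   = c(2, NA)
)

long <- df %>%
  pivot_longer(
    cols = -user_id,
    # ".value" turns the part before "_" into output columns (animal, colour, legs);
    # the part after "_" goes into a new `pet` column
    names_to = c(".value", "pet"),
    names_sep = "_"
  ) %>%
  # drop the empty slots of users who own fewer animals
  filter(!is.na(animal))

long
```

If the real columns use a different separator or order, `names_sep` would need to change accordingly, or the columns could be renamed first.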
This is my very first time posting a question here. If anything I'm asking about is vague or unclear, or if I forgot to add extra information for context, feel free to let me know. Thank you.
MY QUESTION:
I just made a data frame with multiple columns. How do I create a new data frame that keeps only the rows where two columns hold the same value, and excludes all rows where they don't match (while keeping any other columns I want from the screenshots)?
SCREENSHOTS OF MY CURRENT DATA FRAME: ONE, TWO (This isn't the entire data frame since the list is huge, just parts of it.) Notice how each state has multiple 'counties' under it.
THIS IS AN EXAMPLE OF WHAT I WANT MY FINAL DATA FRAME TO LOOK LIKE. In my new data frame, I want to exclude all rows where the Location name does not match the State name (so I will get rid of all counties and anything that isn't the State name).
e.g. I want a new data frame that keeps rows such as California = California, while excluding rows where the values don't match, such as California = San Juan County.
I want to code all of this using dplyr.
Thank you!
If I understand your somewhat vague question well:
library(dplyr)
df %>% filter(column1 == column2)
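For a concrete picture, here is the same filter on a made-up table; the `state`, `location`, and `value` columns are hypothetical. The as.character() calls are a precaution: if both columns are stored as factors with different level sets, comparing them directly with == raises an error.

```r
library(dplyr)

# Hypothetical data: each state appears once as itself and once per county
df <- data.frame(
  state = c("California", "California", "Utah", "Utah"),
  location = c("California", "San Diego County", "Utah", "San Juan County"),
  value = c(1, 2, 3, 4)
)

# Keep only rows where the location is the state itself
states_only <- df %>%
  filter(as.character(state) == as.character(location))

states_only
```

Rows where either column is NA are dropped as well, since filter keeps only rows where the condition is TRUE.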
This assumes you don't have NAs in your numeric data; if you do, turn them into 0 before running the code below.
library(dplyr)
new_df = df %>%
  filter(any_drinking.state == any_drinking.location) %>%
  mutate(both_sexes_2012 = any_drinking.females_2012 + any_drinking.males_2012,
         diff = any_drinking.males_2012 - any_drinking.females_2012) %>%
  rename(females_2012 = any_drinking.females_2012,
         males_2012 = any_drinking.males_2012,
         state = any_drinking.state,
         location = any_drinking.location)
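The step of turning NAs into 0 could look like the sketch below, which uses dplyr's coalesce inside across on made-up values. The numbers are hypothetical, and whether a missing value should really count as 0 depends on your data; leaving the NAs in place would instead make the sum and difference NA for those rows.

```r
library(dplyr)

# Hypothetical data using the column names from the code above
df <- data.frame(
  any_drinking.state = c("Utah", "Utah"),
  any_drinking.location = c("Utah", "San Juan County"),
  any_drinking.females_2012 = c(NA, 30),
  any_drinking.males_2012 = c(40, 45)
)

# Replace NA with 0 in every column whose name ends in "_2012"
df_clean <- df %>%
  mutate(across(ends_with("_2012"), ~ coalesce(.x, 0)))

# The sum is now defined even where one value was missing
new_df <- df_clean %>%
  filter(any_drinking.state == any_drinking.location) %>%
  mutate(both_sexes_2012 = any_drinking.females_2012 + any_drinking.males_2012)
```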
I have two data frames. One contains the region names for different regional codes. The other contains GTU data with the regional codes. I want to replace the regional codes in the second data frame with the region names from the first, using R. Please help!
If your data.frames are called df1 and df2, then you could try
df_final <- merge(df1, df2, by = "regional code")
Then df_final will contain everything (also region names and regional code).
Later you can delete the regional code column if you want by using
df_final <- df_final[, !(colnames(df_final) %in% c("regional code"))]
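Here is a small end-to-end sketch of the merge on made-up data. The `regions` and `gtu` tables and their values are hypothetical, and check.names = FALSE is used so that the column name can contain a space, as in the answer above. all.x = TRUE keeps rows of the second table even when a code has no matching region name.

```r
# Hypothetical lookup table: one row per regional code
regions <- data.frame(
  "regional code" = c(1, 2, 3),
  "region name" = c("North", "South", "East"),
  check.names = FALSE
)

# Hypothetical data that refers to regions by code
gtu <- data.frame(
  "regional code" = c(2, 2, 3, 1),
  value = c(10, 20, 30, 40),
  check.names = FALSE
)

# Attach the region name to every row of gtu
df_final <- merge(gtu, regions, by = "regional code", all.x = TRUE)

# Drop the code column now that the name is present
df_final <- df_final[, setdiff(colnames(df_final), "regional code"), drop = FALSE]

df_final
```

Note that merge sorts the result by the key column by default, so the row order may differ from the original data.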