Merging data and receiving a big loss of data - R

I've been preparing my data, and somehow I have far fewer observations after merging my data sets.
Since I don't have longitude and latitude in my data, I've been using the following code after downloading the zipcode package (tel1 is my data set containing zip codes):
result <- merge(zipcode, tel1, by.x = 'zip', by.y = 'zip_code')
Before merging I had 195,956 observations; after merging it dropped to 180,090, and I don't understand why.
My understanding was that I just merged them where zip was equal to zip_code, adding the information from the zipcode data set to my data frame tel1.
Afterwards I wanted to remove the rows that contain NA, because the merge couldn't fill in values for them. I used this code:
final <- result[complete.cases(result),]
Then my number of observations dropped to 51,006, which I just can't believe; there can't be that many mismatches in my data.
Is there other code I should use instead?
Afterwards I tried to delete the duplicates with the code
last <- with(final, final[order(state, latitude, longitude), ])
but the number of observations stayed the same (51,006).
What did I do wrong? Or is there a way to get my data into an Excel file after merging so I could manually check whether there really are so many mismatches?
Thanks

You can use the all argument to merge:
merged <- merge(zipcode, tel1, by.x = 'zip', by.y = 'zip_code', all.y = TRUE)
However, for rows where no match is found in the zipcode data, the added columns will contain NAs. So if you then drop the NA rows (with complete.cases() or something to that effect), you will wind up with the same "data loss".
Check the zip codes for the rows that have NAs in the latitude and longitude columns after the merge:
merged[is.na(merged$latitude) | is.na(merged$longitude), ]
My guess is that they aren't valid zip codes, or that the list of zip codes you have is incomplete.
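If you want to see which codes fail to match before dropping anything, here is a minimal sketch in base R (assuming the zipcode and tel1 objects from the question):
# distinct zip codes in tel1 with no counterpart in the lookup table
unmatched <- setdiff(tel1$zip_code, zipcode$zip)
length(unmatched)  # how many distinct codes fail to match
head(unmatched)    # peek at a few of them
# export the merged result so it can be checked manually in Excel
write.csv(merged, "merged.csv", row.names = FALSE)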

Related

How to filter rows in R for Eurozone countries easily?

I usually use dplyr to filter data. I now have a huge dataset (62,176 entries) of banks operating in different countries. I'd like to subset/filter that dataset for Eurozone banks only.
I haven't found any approach other than pasting in all the names of the Eurozone countries and then creating a new dataset with filter.
Is there any workaround for this problem?
Thank you!
Without the data we can't give you a definitive answer; however, given my understanding of the problem, below are some methods.
Assuming your dataset already has a column with each bank's operating country, you could create a vector of the countries you are interested in and then filter the dataset for rows that match:
# manually assign countries to a vector (these must match how the countries are spelled in your data)
euro_countries <- c("Germany", "France", "Italy", "Spain")
# then filter the dataset for rows that match; the column name op_country is made up since I don't know your data
dataframe %>% filter(op_country %in% euro_countries)
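For a self-contained illustration, here is a toy version with made-up bank data:
library(dplyr)

banks <- data.frame(bank = c("Bank A", "Bank B", "Bank C"),
                    op_country = c("Germany", "Poland", "France"))
euro_countries <- c("Germany", "France", "Italy", "Spain")

banks %>% filter(op_country %in% euro_countries)
#>     bank op_country
#> 1 Bank A    Germany
#> 2 Bank C     France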
Alternatively, depending on your data, you can use the very helpful countrycode package, whose built-in countrycode::codelist table you can join against your dataset's country column, and then filter on countrycode::codelist$continent for countries in "Europe" (note that this selects all of Europe, not just the Eurozone).
# join your dataset with the codelist table; the join column depends on how countries are written in your data
library(dplyr)
library(countrycode)
dataframe <- left_join(x = df, y = codelist, by = c("op_country" = "country.name.en"))
# filter your dataset with the new column
dataframe %>% filter(continent == "Europe")
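A quick sanity check on that join, again with made-up data:
library(dplyr)
library(countrycode)

df <- data.frame(bank = c("Bank A", "Bank B"),
                 op_country = c("Germany", "Brazil"))

df %>%
  left_join(codelist, by = c("op_country" = "country.name.en")) %>%
  filter(continent == "Europe") %>%
  select(bank, op_country, continent)
#>     bank op_country continent
#> 1 Bank A    Germany    Europe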

Merging DataFrames without Duplicate Columns in R

I have three data sets that I would like to merge. My data is on companies in the S&P 500 and their corporate political activity. One dataset is named PAC, one is named Lobby, and one is named BoardData. The datasets all have three columns in common: "ultorg", "sector", and "subind", as well as other columns unique to each dataset.
I would like to merge the three Excel documents so that there is only one copy of each shared column, with all of the other variables appended to it.
I have tried doing this on my own, but I run into a few problems. Specifically, I get several columns for ultorg/sector/subind (the variables the datasets have in common), and there are entries repeating in places where they shouldn't. For example, my board data only goes back to 2015, but my lobbying data goes back to 2000. Using the incorrect/incomplete code below, I get rows where a company's board data from 2015 is filled in for the years 2000-2015. I would like the years without a board entry to simply have NA entered instead.
Here's the current code.
library(tidyverse)
library(janitor)
library(glue)
library(readxl)
setwd("~/Desktop/thesis")
PAC <- read_excel("PAC.xlsx")
Lobby <- read_excel("Lobby.xlsx")
BoardData <- read_excel("BoardData.xlsx")
alldata <- left_join(PAC, Lobby, by = "ultorg")
alldata <- left_join(alldata, BoardData, by = "ultorg")
Thank you so much for any help you are able to give me! I really appreciate it and am able to answer any questions regarding my data.
Merging by ultorg, sector, and subind will work, and if there is a common column that indicates the date, you should consider adding it to the join keys as well. The choice between full_join, left_join, etc. depends on your purpose. The code below is one example you may try:
BoardData %>%
  full_join(PAC, by = c("ultorg", "sector", "subind")) %>%
  full_join(Lobby, by = c("ultorg", "sector", "subind"))
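If all three files also share a date column, including it in the keys stops a single board row from being recycled across years; a sketch, assuming that column is called year:
library(dplyr)

alldata <- BoardData %>%
  full_join(PAC, by = c("ultorg", "sector", "subind", "year")) %>%
  full_join(Lobby, by = c("ultorg", "sector", "subind", "year"))
Years present in Lobby but absent from BoardData then simply get NA in the board columns, which is the behavior asked for.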

Merging Two Datasets on Matched Column in R

I'm an R beginner, and I'm trying to merge two datasets but losing data in the process. I might be totally off base with what I'm doing.
The first dataset is the Dewey Decimal System, and the data looks like this:
[image of 10 rows of data from this set]
I've named this dataset DDC
The next dataset is a list of books ordered during a particular time period:
[image of 10 rows of the book ordering dataset]
I've named this dataset DOA
I'm unsure how to include the data here other than as an image.
(Can also provide the .csvs if needed)
I would like to merge the sets based on the first three digits of the call number.
To achieve this I've created a new variable in both sets called Call_Category2 that takes the first three digits of the call number value to be matched.
DDC$Call_Category2 = str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
This dataset is just over 1,000 rows. It is padded because the 000 to 099 Dewey Decimal classifications were dropping their leading 0s.
DOA_data = transform(DOA_data, Call_Category2 = substr(Call_Category, 1,3))
This dataset is about 24000 rows.
I merge the sets and create a new set called DOA_Call
DOA_Call = merge(DDC, DOA_data, all.x = TRUE)
When I head() the data the merge seems to be working properly, but about 10,000 rows do not get the DDC data added; they just stay in their original state. This is about 40% of my total dataset, so it is pretty substantial. My first instinct was that it was only putting DDC rows in once, but that would mean I would be missing 23,000 rows, which I'm not.
Am I doing something wrong with the merge or could it be an issue with the data not being clean enough?
Let me know if more information is needed!
I don't necessarily need code, pointers on what direction to troubleshoot in would be very helpful!
This is my best attempt with the information you provide. You will need:
functions such as left_join from dplyr (see https://dplyr.tidyverse.org/reference/join.html),
the stringr package to handle some of the variables (https://stringr.tidyverse.org/),
and some familiarity with the tidyverse.
Please keep in mind that the best way to ask on Stack Overflow is by providing a minimal reproducible example.
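A minimal sketch of that approach, reusing the key-building steps from the question (the data frame and column names are taken from the question itself):
library(dplyr)
library(stringr)

# build the three-digit key in both tables
DDC$Call_Category2 <- str_pad(DDC$Call_Category, width = 3, side = "left", pad = "0")
DOA_data$Call_Category2 <- str_sub(DOA_data$Call_Category, 1, 3)

# keep every ordered book; unmatched call numbers get NA in the DDC columns
DOA_Call <- left_join(DOA_data, DDC, by = "Call_Category2")

# rows that failed to match (Call_Category.y is DDC's copy of the
# duplicated Call_Category name after the join)
DOA_Call[is.na(DOA_Call$Call_Category.y), ]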

How to find and replace values of a data frame in R?

I have two data frames. One contains the region names for different regional codes. The other has certain GTU data with the regional code given. I want to replace the regional codes in the second data frame with the region names from the first, using R. Please help!
If your data frames are called df1 and df2, then you could try
df_final <- merge(df1, df2, by = "regional code")
Then df_final will contain everything (both the region names and the regional codes).
Later you can delete the regional code column, if you want, using
df_final <- df_final[, !(colnames(df_final) %in% c("regional code"))]
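A toy example (the column names here are assumptions based on the question):
# lookup table: regional code -> region name
df1 <- data.frame(`regional code` = c(1, 2),
                  `region name` = c("North", "South"),
                  check.names = FALSE)
# data table carrying the codes
df2 <- data.frame(`regional code` = c(1, 1, 2),
                  gtu = c(10, 20, 30),
                  check.names = FALSE)

df_final <- merge(df1, df2, by = "regional code")
df_final <- df_final[, !(colnames(df_final) %in% c("regional code"))]
df_final
#>   region name gtu
#> 1       North  10
#> 2       North  20
#> 3       South  30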

Look up data frame with values stored in another data frame

I have 15 data frames containing information about patient visits for a group of patients. Example below. They are named FA.OFC1, FA.OFC2, etc.
ID sex date age.yrs important.var etc...
xx_111 F xx.xx.xxxx x.x x
I am generating a summary data frame (sev.scores) which contains information about the most severe episode a patient has across all recorded data. I have successfully used the which.max function to get the most severe episode but now need additional information about that particular episode.
To get the additional information, I recreated the name of the data frame I will need to look up by pasting information after the max return:
max data frame
8 df2
Specifically, the names() function gave me the name of the column with the most severe episode (in the summary data frame sev.scores), which also tells me which data frame to look up:
cols <- c(5, 8, 11, 14, 17, 20, 23, 26, 29, 32, 35, 38, 41, 44, 47, 50)
sev.scores[52:53] <- as.data.frame(cbind(
  row.names(sev.scores[cols]),
  apply(sev.scores[cols], 1, function(x) names(sev.scores[cols])[which(x == max(x))])
))
However now I would like to figure out how to tell R to take the data frame name stored in the column and search that data frame for the entry in the 5th column.
So in the example above the information about the most severe episode is stored in data frame 2 (df2) and I need to take information from the 5th record (important.var) and return it to this summary data frame.
UPDATE
I have now stored these data frames in a list but am still having trouble extracting the information I would like.
I found the following example for getting the max value from each element of a list:
lapply(L1, function(x) x[which.max(abs(x))])
How can I adapt this for a factor which is present in all elements of the list?
e.g. something like:
lapply(my_dfs[[all elements]]["factor of interest"], function(x) x[which.max(abs(x))])
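In concrete terms, that adaptation might look like the following (the column name important.var is borrowed from the example at the top of the question):
# pull the row with the largest important.var from each data frame in the list
lapply(my_dfs, function(df) df[which.max(df$important.var), ])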
If I may suggest a fundamentally different approach: concatenate all your data.frames into one (rbind), and add a separate column that describes the nature of the original data.frame. For this, it’s necessary to know in which regard the original data.frames differed (e.g. by disease type; since I don’t know your data, let’s stick with this for my example).
Furthermore, you need to ensure that your data is in tidy data format. This is an easy requirement to satisfy, because your data should be in this format anyway!
Then, once you have all the data in a single data.frame, you can create the summary trivially by selecting the most severe episode for each patient:
sev_scores = all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(FactorOfInterest))
Note that this code uses the ‹dplyr› package. You can perform an equivalent analysis using different packages (e.g. ‹data.table›) or base R functions, but I strongly recommend dplyr: The resulting code is generally easier to understand.
Rather than your sev.scores table, which has columns referring to rows and data.frame names, the sev_scores I created above will contain the actual data for the most severe episode for each patient ID.
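A minimal sketch of the concatenation step (only two of the 15 data frames are shown, and the source column name is an illustration):
library(dplyr)

# gather the visit tables into a named list, then bind them into one
# data frame, recording which table each row came from
visit_dfs <- list(FA.OFC1 = FA.OFC1, FA.OFC2 = FA.OFC2)  # ... up to FA.OFC15
all_data <- bind_rows(visit_dfs, .id = "source")

sev_scores <- all_data %>%
  group_by(ID) %>%
  filter(row_number() == which.max(important.var)) %>%
  ungroup()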
