Unable to perform merge: what is the difference in these dataframes? - r

I have two dataframes, annotatedFile and subOutFile, that contain similar data. I am retrieving annotatedFile from an xlsx file using readxl::read_xlsx. subOutFile is retrieved using read.delim2 from a tab-separated text file. They contain similar columns, but annotatedFile has an extra column, Accuracy, that I want to merge into the subOutFile dataframe.
This is what the data frames look like:
My merge command was:
subOutFile = subOutFile %>% merge(subOutFile, annotatedFile[,c("StimName", "Accuracy")], by = "StimName", all.x = TRUE)
From the images above, you can see that the structure of the two dataframes looks different. One shows the vector-like notification [1:180] and the other does not. Is there something different about these dataframes which is why I am not able to perform the merge? Or is there another reason?

When you write df1 %>% merge(df1, df2), df1 is passed one time too many: the pipe already supplies it as the first argument.
It's either df1 <- merge(df1, df2) or df1 <- df1 %>% merge(df2). For the latter, there is a shortcut, but you will have to load the magrittr package: df1 %<>% merge(df2).
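To make the fix concrete, here is a minimal sketch. The two data frames are small hypothetical stand-ins for the ones in the question (only StimName and Accuracy come from the original post; the RT and Notes columns are made up for illustration).

```r
# Hypothetical stand-ins for the data frames in the question
subOutFile <- data.frame(StimName = c("a", "b", "c"),
                         RT = c(510, 620, 480))
annotatedFile <- data.frame(StimName = c("a", "b"),
                            Accuracy = c(1, 0),
                            Notes = c("x", "y"))

# Pass each data frame exactly once; all.x = TRUE keeps every row of subOutFile
merged <- merge(subOutFile,
                annotatedFile[, c("StimName", "Accuracy")],
                by = "StimName", all.x = TRUE)
merged
```

Rows of subOutFile with no match in annotatedFile are kept, with NA in the Accuracy column.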

Related

Adding a column of a dataframe to another dataframe if they match in another column

For a university project, I'm working with large stock price dataframes.
I have two dataframes.
Dataframe df1 includes the daily close prices over a certain time. The header includes the stock's shortcut.
Dataframe df2 includes the stock's shortcut in the first column and the industry name of the stock's firm in the second column. It is important to know that df2 contains more values than df1 (but every value in df1 should be in df2).
Is there a way to insert the second column of df2 into the first row of df1 where they match (that is, where a df1 header equals a value in the first column of df2)?
# Example Code
df1=as.data.frame(matrix(runif(20,min=0,max=1), nrow = 4))
df1
df2 <- as.data.frame(c("V1","V829","V2","V3","V493","V4","V5","V6","V992","V7"))
df2$insert <- c("test1","test2","test3","test4","test5","test6","test7","test8","test9","test10")
names(df2) <- c("Column2","test")
df1
df2
# Now insert/combine df2$test in (or over) df1[1,] as a row, if names(df1) and df2$Column2 matches
Thank you for your answers guys!
Nino
I would recommend you reshape your df1 into long format (see Reshaping data.frame from wide to long format).
library(tidyr)
library(dplyr) # needed for left_join() below
df1_long <- df1 %>% gather(Instrument, value, -X)
I would organize the data this way because it makes it easier to use left_join() to match the data frames (see the description of mutating joins on the data wrangling cheat sheet).
df <- left_join(df1_long, df2, by = "Instrument")
If you want you can then make your dataframe wide again using the spread() function, which is the reverse of gather().
For the future, I recommend you provide a reproducible example rather than linking image files of your dataframes: the links might expire, and images generally make it less likely that you will get an answer on Stack Overflow.
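Applied to the example data from the question, the reshape-then-join idea could look roughly like the sketch below. It assumes tidyr's pivot_longer() (the newer replacement for gather()), shortens df2 to seven rows, and introduces a hypothetical day column to keep track of the original row.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # only so the random prices are repeatable
df1 <- as.data.frame(matrix(runif(20), nrow = 4))  # columns V1..V5
df2 <- data.frame(
  Column2 = c("V1", "V829", "V2", "V3", "V493", "V4", "V5"),
  test    = paste0("test", 1:7)
)

# Wide to long: one row per (day, stock) pair; "day" is a made-up row id
df1_long <- df1 %>%
  mutate(day = row_number()) %>%
  pivot_longer(-day, names_to = "Column2", values_to = "price")

# Attach the matching label from df2 to every price
joined <- left_join(df1_long, df2, by = "Column2")
joined
```

Rows of df2 that have no counterpart in df1 (V829 and V493 here) simply do not appear in the result.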

Can I use join with across from dplyr?

I do not want to use all variables in data frame. I was thinking of something like this, but it comes up with an error.
df1 %>%
full_join(df2, by = 'DATE':'Vz').
Both data frames contain the same variables from DATE to Vz. I am interested in bringing the non-zero values of df2 to df1.
Thank you.
You can join by multiple columns with dplyr. Let me know if this answers your question:
library(dplyr)
full_join(df1, df2,
by=colnames(df1)[which(colnames(df1)=="DATE"):which(colnames(df1)=="Vz")])
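Here is a small self-contained sketch of the same idea, with the key columns computed first so they are easy to inspect. The data frames and the extra1/extra2 columns are hypothetical; only DATE and Vz come from the question.

```r
library(dplyr)

df1 <- data.frame(DATE = c("2020-01-01", "2020-01-02"),
                  Vx = c(1, 2), Vz = c(3, 4),
                  extra1 = c("a", "b"))
df2 <- data.frame(DATE = c("2020-01-02", "2020-01-03"),
                  Vx = c(2, 5), Vz = c(4, 6),
                  extra2 = c(TRUE, FALSE))

# All column names from DATE through Vz, by position in df1
key_cols <- names(df1)[match("DATE", names(df1)):match("Vz", names(df1))]

res <- full_join(df1, df2, by = key_cols)
res
```

This relies on the key columns being contiguous in df1 and having the same names in df2.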

Merging of dataframes with different number of columns

I have these two dataframes.
DF1:
DF2:
I want my output DF to be DF1 along with the value of X1 from DF2. That is, this is how I want the output to look:
I have tried using merge and join, but am unable to get this required output. The primary problem seems to be due to the fact that the ID in DF1 has multiple matches in DF2. The resulting dataframe I get has all the rows, somewhat like this:
How do I fix this?
Thanks.
(apologies for table images, I wasn't able to figure out how to create a table on the fly)
You can use match to return the first hit in DF2.
DF1$X1 <- DF2$X1[match(DF1$ID, DF2$ID)]
Keep unique values in terms of ID in the second data frame and then join:
library(tidyverse)
DF2 <- DF2 %>%
distinct(ID, .keep_all = TRUE) %>%
select(ID, X1)
res <- DF1 %>%
inner_join(DF2, by = "ID")
glimpse(res)
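Since the original tables were images, here is a sketch of both approaches on made-up data (the ID values and the value column are assumptions). It uses left_join() rather than inner_join() so that IDs in DF1 without a match in DF2 are kept with NA instead of being dropped.

```r
library(dplyr)

DF1 <- data.frame(ID = c(1, 2, 3), value = c("a", "b", "c"))
DF2 <- data.frame(ID = c(1, 1, 2, 2, 4), X1 = c(10, 11, 20, 21, 40))

# Approach 1: match() returns the position of the first hit for each ID
by_match <- DF1
by_match$X1 <- DF2$X1[match(DF1$ID, DF2$ID)]

# Approach 2: keep the first row per ID in DF2, then join
by_join <- DF1 %>%
  left_join(distinct(DF2, ID, .keep_all = TRUE) %>% select(ID, X1),
            by = "ID")
```

Both results have exactly one row per row of DF1, because duplicates in DF2 are resolved before (or instead of) joining.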

How do I get rid of multiple columns with the same name in R?

I'm gathering SAT scores by school districts in Texas and their amount of education spending. The data for SAT scores come in csv files that are split by year. I want to consolidate the scores into my dataframe that has the amount of education spending without creating multiple columns for Total, Math score, Reading score, etc.
I've tried the different types of join functions, semi_join, full_join, left_join, etc. but none of these seems to address the issue I am having.
temp1<-left_join(temp, sat17, by= c("District","year"))%>%
left_join(., sat16, by=c("District","year"))%>%
left_join(., sat15, by=c("District","year"))%>%
left_join(., sat14, by=c("District","year"))%>%
left_join(., sat13, by=c("District","year"))%>%
left_join(., sat12, by=c("District","year"))%>%
left_join(., sat11, by=c("District","year"))
The output gives me columns Math.x, Math.y, Total.x, Total.y, and so on for each joined dataframe. Also, sat17 includes a column called ERW, instead of Reading because the test changed that year. I want to keep ERW separate, and the rest of the Reading, Math, and Total scores to line up under one of each column.
I think that what you want to do is bind them together, that is, stack them one on top of the other.
Try:
do.call(rbind, dfs) # dfs is the list of dataframes
or using dplyr
library(dplyr)
bind_rows(dfs, .id = NULL)
Explanation
dplyr is automatically going to rename any columns that you don't join by and have a matching column name in the joined data set.
In your case, since you only want to join by=c("District", "year"), any other columns that have the same name will get renamed.
The columns of the starting data set get .x appended to the end of their name, while the columns being left joined get .y appended to the end of their name.
Solution
If you want to have Math, Reading, and Total all in the same column, then you need to stack the data sets on top of each other with dplyr::bind_rows()
combined_sat <- dplyr::bind_rows(sat17, sat16, sat15, sat14, sat13, sat12, sat11)
Or, if you want to bind them at the .csv level to begin with, put all your files into a subdirectory called "data". You can try something like this:
setwd("./data/")
library(purrr)
library(tidyverse)
binded_data <- tibble(filenames = list.files()) %>%
mutate(yearly_sat = map(filenames, read_csv)) %>%
unnest(yearly_sat)
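To show how stacking avoids the .x/.y columns, here is a minimal sketch with two made-up years and a made-up spending table (all values and the spend column are assumptions; only the column names District, year, Math, Reading, ERW and Total come from the question).

```r
library(dplyr)

sat16 <- data.frame(District = c("A", "B"), year = 2016,
                    Math = c(500, 520), Reading = c(490, 510),
                    Total = c(990, 1030))
sat17 <- data.frame(District = c("A", "B"), year = 2017,
                    Math = c(510, 530), ERW = c(505, 515),
                    Total = c(1015, 1045))
spending <- data.frame(District = c("A", "A", "B", "B"),
                       year = c(2016, 2017, 2016, 2017),
                       spend = c(1, 2, 3, 4))

# Stack the yearly files; columns missing in a given year are filled with NA
combined_sat <- bind_rows(sat17, sat16)

# A single join against the stacked scores, so no .x/.y suffixes appear
result <- left_join(spending, combined_sat, by = c("District", "year"))
result
```

ERW stays in its own column and is NA for the years that report Reading instead.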

How to use purrr with dplyr to filter list elements and export lists into Excel

I'm fairly new to working with lists in R and have a quick question that also involves using purrr. Below are two small sample data frames as an example.
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals <- c("Cat","Cat","Dog","Rat","Bird")
Living <- c("House","Condo","Condo","Apartment","House")
Data1 <- data.frame(Client1,Animals,Living)
Client1 <- c("John","Chris","Yutaro","Dean","Andy")
Animals2 <- c("Cat","Dog","Dog","Rat","Cat")
Living2 <- c("House","Apartment","Apartment","Family","Apartment")
Data2 <- data.frame(Client1,Animals2,Living2)
Bonus if you can include how to rename list elements at once instead of using the two lines below:
names(Data1)[1:3] <- c("Client","Animals","Living")
names(Data2)[1:3] <- c("Client","Animals","Living")
Next, I want to filter each data frame by Animals and then export each one to a spreadsheet, using the two lines of code below:
Data1 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data1.csv")
Data2 %>% filter(Animals=="Cat") %>% write.csv(.,file="Data2.csv")
However, to be more efficient I can join both data frames into a list and use purrr to filter each at the same time.
DataList <- list(Data1,Data2)
DataList %>% map(~filter(.,Animals=="Cat"))
For the above code, I will use multiple ~filter lines for each animal, so not sure if there's a more efficient way that will avoid writing many different lines of code while still using purrr and dplyr?
Also, how do I use write.csv with purrr. I can either export the list into one spreadsheet, but I'm not sure how to break up the list so that it exports properly. Also, I can export each list element into separate spreadsheets. It would be great to see a solution for both of these situations.
If I understand your question correctly, you want to write a separate file for each of the Animals of both the data frames:
DataList <- list(Data1, Data2)
library(purrr)
a <- DataList %>% map(., function(x) {
colnames(x) <- c("Client","Animals","Living")
x
}) %>% map(., function(x) {
split(x, x$Animals)
}) %>% flatten(.)
names(a) <- paste0("Data", (1:length(a)))
lapply(1:length(a), function(x) write.csv(a[[x]],
file = paste0(names(a[x]), ".csv"),
row.names = FALSE))
We first dump both the data frames in DataList, then rename the columns for both the data frames with the first map, then split both the data frames by Animals, and finally flatten the nested list.
I wish I could do this without breaking the chain, but I couldn't find another way.
From here, we first rename the elements of the list, then use lapply to loop over all the elements in the list and apply write.csv on each of them.
You mentioned Excel - you can just as easily replace write.csv with any of the functions for writing excel files from R
Here is one option, involving binding the two datasets together before re-splitting.
library(purrr)
library(dplyr)
DataList %>%
map(~setNames(.x, c("Client","Animals","Living"))) %>%
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id") %>%
split(list(.$id, .$Animals), drop = TRUE) %>%
map(~select(.x, -id) %>%
write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
row.names = FALSE))
The first map line shows how to rename the columns of all the datasets in a list at once via setNames.
DataList %>%
map(~setNames(.x, c("Client","Animals","Living")))
I then set the names of the datasets in the list via setNames. While stacking the datasets together into a single data.frame via dplyr's bind_rows, these names are added as a new column, id.
setNames(c("Data1", "Data2")) %>%
bind_rows(.id = "id")
The last step is to split the combined data.frame by id and Animals before writing each split into a separate csv file. Information is pulled out of the dataset for naming the individual files by dataset and animal (this was the reason to name the elements of DataList). I removed the id variable via select prior to writing the files, as it may be extraneous to your needs.
split(list(.$id, .$Animals), drop = TRUE) %>%
map(~select(.x, -id) %>%
write.csv(file = paste0(unique(.x$id), unique(.x$Animals), ".csv"),
row.names = FALSE))
This can all be done without putting these into a single data.frame, but I had trouble with naming the files at the end.
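One possible way around the naming problem without combining the data, sketched under the assumption that the list elements are named and the columns are already renamed: purrr's iwalk() passes both the element and its name to the function, so the file name can be built from the names at each level. The shortened data and the use of tempdir() as the output folder are assumptions for illustration.

```r
library(purrr)

Data1 <- data.frame(Client = c("John", "Chris", "Yutaro"),
                    Animals = c("Cat", "Cat", "Dog"),
                    Living = c("House", "Condo", "Condo"))
Data2 <- data.frame(Client = c("John", "Chris", "Yutaro"),
                    Animals = c("Cat", "Dog", "Dog"),
                    Living = c("House", "Apartment", "Apartment"))
DataList <- list(Data1 = Data1, Data2 = Data2)

out_dir <- tempdir()  # assumption: a disposable output folder

# Outer loop: one data frame and its name; inner loop: one animal and its name
iwalk(DataList, function(df, df_name) {
  iwalk(split(df, df$Animals), function(piece, animal) {
    write.csv(piece,
              file.path(out_dir, paste0(df_name, animal, ".csv")),
              row.names = FALSE)
  })
})
```

This writes one file per dataset and animal, for example Data1Cat.csv and Data2Dog.csv.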

Resources