I have a 9801 by 3 reference table.
The first 2 columns of this table are defined as follows.
x1 = x2 = seq(0.01,0.99,0.01)
x12 = data.matrix(expand.grid(x1,x2))
The 3rd column contains the outcome values.
Now I have another n by 3 matrix where the 1st and 2nd columns are selected rows of the above matrix 'x12' and the 3rd column is to be filled. I would like to fill in the 3rd column of the 2nd table by looking up the same combination of the 1st and 2nd columns in the 1st table and taking the value from its 3rd column.
How can I do this?
You can do this with the merge function:
# Original data frame
x1 = x2 = seq(0.01,0.99,0.01)
x12 = expand.grid(x1,x2)
# Add a fake "outcome"
x12$outcome = rnorm(nrow(x12))
# New data frame with 100 random rows and the first two columns of x12
x12new = x12[sample(1:nrow(x12), 100), c(1,2)]
# Merge the outcome values from x12 into x12new
x12new = merge(x12new, x12, by=c("Var1","Var2"), all.x=TRUE)
by tells merge which columns must match when comparing the two data frames. all.x=TRUE tells merge to keep all rows from the first data frame, x12new in this case, even if they don't have a match in the second data frame (not an issue here, but you'll often want to make sure you don't lose any rows when merging).
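Since the question stores both tables as matrices, the same lookup can also be sketched without data frames by using match on pasted keys. This is only a sketch: ref, new, key_ref and key_new are illustrative names, the outcome column holds placeholder values, and it assumes the rows of the second table were copied from the first, so the numeric values are bit-for-bit identical.

```r
# Reference matrix: the 99 x 99 grid plus a placeholder outcome column
x1 <- x2 <- seq(0.01, 0.99, 0.01)
ref <- cbind(data.matrix(expand.grid(x1, x2)), outcome = seq_len(9801))

# Second matrix: a few rows of the grid with an empty third column
set.seed(1)
pick <- sample(nrow(ref), 5)
new <- cbind(ref[pick, 1:2], outcome = NA_real_)

# Build one string key per row and find each new row in the reference
key_ref <- paste(ref[, 1], ref[, 2])
key_new <- paste(new[, 1], new[, 2])
idx <- match(key_new, key_ref)

# Copy the outcome across; unmatched rows would stay NA
new[, 3] <- ref[idx, 3]
```

Unlike merge, this keeps the rows of the second matrix in their original order.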
One other thing to note is that, unlike vlookup in Excel, merge will increase the number of rows in the new, merged data frame if there are multiple rows that match the criteria. For example, see what happens when you merge values from df2 into df1:
df1 = data.frame(x = c(1,2,3,4), z=c(10,20,30,40))
df2 = data.frame(x = c(1,1,1,2,3), y=c("a","b","c","a","c"))
merge(df1, df2, by="x", all.x=TRUE)
  x  z    y
1 1 10    a
2 1 10    b
3 1 10    c
4 2 20    a
5 3 30    c
6 4 40 <NA>
You can also use left_join from the dplyr package (other types of joins are available as well):
library(dplyr)
left_join(df1, df2, by="x")
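If Excel-like vlookup behaviour is what you want (first match only, row count unchanged), a minimal base R sketch uses match instead of a join. It assumes that taking the first matching row of df2 is acceptable for your data.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), z = c(10, 20, 30, 40))
df2 <- data.frame(x = c(1, 1, 1, 2, 3), y = c("a", "b", "c", "a", "c"))

# match() returns the position of the FIRST match in df2$x (NA if none),
# so df1 keeps exactly its original four rows
df1$y <- df2$y[match(df1$x, df2$x)]
```

Here x = 1 picks up only "a", the first of its three matches, and x = 4 gets NA.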
I am trying to merge two data sets that share a "Breed" column representing dog breeds. Data set 1 has dog traits and a score for each; data set 2 has the same breeds as data set 1 with their popularity rank in America from 2013 to 2020. I have trouble merging the two data sets into one. The result either shows NA for the 2013-2020 rank information or contains duplicate rows for the same breed, one row with the data from data set 1 and another with the data from data set 2. The closest I can get is merge(x, y, by = 'row.names', all = TRUE), which brings in all the data correctly but leaves two duplicated columns, Breed.x and Breed.y. I am looking for a way to end up with a single Breed column and all the data filled in correctly.
Here is the data I am using. breed_traits is data set 1, and breed_rank_all is data set 2, which I want to merge into breed_traits.
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
This is the call that gave the most correct result, but it leaves both Breed.x and Breed.y columns:
breed_total <- merge(breed_traits, breed_rank_all, by = c('row.names') , all =TRUE)
breed_total
I tried a left join as well, but it shows NA for the 2013-2020 ranks:
library(dplyr)
breed_traits |> left_join(breed_rank_all, by = c('Breed'))
I also tried this, which returns duplicated rows for the same breed:
merge(breed_traits, breed_rank_all, by = c('row.names', 'Breed'), all = TRUE)
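The question does not show why joining on 'Breed' fails. One common cause of NA after a left join on a text key is that keys which look identical differ in invisible characters, for example a non-breaking space versus an ordinary space. Whether that applies to these files is an assumption that would need checking; the sketch below uses hypothetical toy data (traits, ranks) and a hypothetical helper clean_key to show the idea of normalizing the key before merging.

```r
# Hypothetical toy data: the first key in `traits` contains a
# non-breaking space (\u00a0) instead of an ordinary space
traits <- data.frame(Breed = c("Retrievers\u00a0(Labrador)", "Bulldogs"),
                     shedding = c(4, 3))
ranks <- data.frame(Breed = c("Retrievers (Labrador)", "Bulldogs"),
                    rank_2020 = c(1, 5))

# Replace non-breaking spaces with ordinary spaces and trim both ends
clean_key <- function(x) trimws(gsub("\u00a0", " ", x, fixed = TRUE))

traits$Breed <- clean_key(traits$Breed)
ranks$Breed <- clean_key(ranks$Breed)

# Joining on the cleaned key gives one Breed column and no NA ranks
breed_total <- merge(traits, ranks, by = "Breed", all.x = TRUE)
```

Joining on "Breed" alone, rather than on row names, is what keeps the result to a single Breed column.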
I am trying to merge two dataframes in R, joining them by the one column that they share.
Here are screenshots of the two dataframes, and I am merging on the column "INC_KEY".
This is the code I have written to merge the two dataframes:
dp <- inner_join(d,p,by="INC_KEY")
d has 177156 observations, and p has 1641137 observations, but the final merged dataframe has 8416113 observations, which does not make sense to me. I have also tried changing the inner_join function above to the merge function, but I still get the same result. I am wondering how to fix this code so that the merged dataframe has a realistic number of observations - thanks so much for any help!
You most probably have duplicates in either d or p or both of them. Try keeping only one row for each unique INC_KEY value before joining.
library(dplyr)
dp <- inner_join(d %>% distinct(INC_KEY, .keep_all = TRUE),
p %>% distinct(INC_KEY, .keep_all = TRUE),by="INC_KEY")
This can happen if your INC_KEY is not a unique identifier. Here is a simplified example:
library(dplyr)
df1 <- data.frame(key = c("A", "B", "C", "A"),
val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"),
val2 = 1:5)
inner_join(df1, df2, by = "key")
Joining, by = "key"
  key val1 val2
1   A    1    1
2   B    2    2
3   B    2    5
4   C    3    3
5   C    3    4
6   A    4    1
Because there are two values of "A" in the key column in df1, both rows match the one row of df2 with "A". The one row in df1 with a key of "C" matches both rows with the key of "C" in df2. This is the expected behavior of an inner join with duplicated key values. The join returns all rows in the second data.frame that match each row in the first data.frame. If there are multiple matches, they are all returned.
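The row count of such a join can be predicted before running it: for every key present in both tables, multiply the number of occurrences in each table, then add up the products. The sketch below checks this for the example above, using base R's merge in place of inner_join so that it runs without dplyr; n1, n2, common and expected_rows are illustrative names.

```r
df1 <- data.frame(key = c("A", "B", "C", "A"), val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"), val2 = 1:5)

# Occurrences of each key in each table
n1 <- table(df1$key)
n2 <- table(df2$key)

# Only keys present in both tables contribute to an inner join
common <- intersect(names(n1), names(n2))

# A: 2 * 1, B: 1 * 2, C: 1 * 2
expected_rows <- sum(as.integer(n1[common]) * as.integer(n2[common]))

# merge() performs an inner join by default
joined <- merge(df1, df2, by = "key")
```

Running the same count on INC_KEY in d and p should show which keys are responsible for the inflated result.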
If you want one row per INC_KEY, then you need to do something to your original data before the join, especially if the rows are not complete duplicates.
The key column INC_KEY has duplicates in at least one of your tables. inner_join will then output a table with additional rows, depending on the number of duplicates found, minus the rows whose INC_KEY is missing in either d or p.
If you expect your new table to have the same number of rows as table d, then you need to aggregate the information in table p first, grouped by INC_KEY. Then you can perform the inner_join.
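A minimal sketch of aggregating before joining, written in base R (aggregate and merge) rather than dplyr so it runs without extra packages. The columns age and proc_code and the choice to collapse values into one string are assumptions for illustration; the right aggregation depends on what p actually contains.

```r
# Hypothetical toy versions of d (one row per key) and p (repeated keys)
d <- data.frame(INC_KEY = c(1, 2, 3), age = c(30, 41, 55))
p <- data.frame(INC_KEY = c(1, 1, 2, 4),
                proc_code = c("a", "b", "c", "d"))

# Collapse p to one row per INC_KEY by pasting the codes together
p_agg <- aggregate(proc_code ~ INC_KEY, data = p,
                   FUN = function(v) paste(v, collapse = ";"))

# Inner join: at most one output row per row of d
dp <- merge(d, p_agg, by = "INC_KEY")
```

Rows of d with no match in p (INC_KEY 3 here) are still dropped by the inner join; use all.x = TRUE to keep them.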
I have a list of data frames, with each data frame named after patient ID.
df.list <- (1297, 2468, 3323, 4453, 4785, 6489, 7338, 8244, 9345, etc.)
Each data frame has data like this (this is very simplified, but it gets the point across):
A B C D
1 8 4 2
3 4 6 8
I want to merge all of the data frames in the list so that all A values are in one column, all B values in another, etc.
However, I also want to add a new column that tells me which patient each row came from. So I would like to extract the name of the data frame (which is the patient ID) that the data in each row came from and add it as a new column in the merged data frame. I plan on merging with rbind, but I do not know how to add another column with the patient ID information.
The goal is to have the following information in the final data frame:
A B C D Patient ID
Any help is appreciated!
Thanks!
Using the input data shown in reproducible form in the Note below, rbind the data frames together. The row names will contain the ID followed by a suffix indicating the row number so we can get the desired data frame, df2, like this:
df2 <- do.call("rbind", mget(df.list))
df2$id <- sub("[.].*", "", rownames(df2))
rownames(df2) <- NULL
Note: We assume this input data:
df.list <- c(1297, 2468, 3323, 4453, 4785, 6489, 7338, 8244, 9345)
df.list <- as.character(df.list)
Lines <- "A B C D
1 8 4 2
3 4 6 8"
df <- read.table(text = Lines, header = TRUE)
for(nm in df.list) assign(nm, df)
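If the data frames are already held in a named list rather than as separate objects in the workspace, the ID column can be added before binding instead of being parsed out of the row names. A sketch under that assumption, with dfs, with_id and combined as illustrative names:

```r
df <- data.frame(A = c(1, 3), B = c(8, 4), C = c(4, 6), D = c(2, 8))

# Assumed input: a list whose names are the patient IDs
dfs <- list("1297" = df, "2468" = df)

# Add each list name as an id column, then stack the pieces
with_id <- Map(function(d, id) cbind(d, id = id), dfs, names(dfs))
combined <- do.call(rbind, with_id)
rownames(combined) <- NULL
```

This avoids relying on the "ID.row" row-name format, so it also works if an ID itself contains a dot.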
I have a dataset in which I wish to sum each value in column n, with its corresponding value in column (n+(ncol/2)); i.e., so I can sum a value in column 1 row 1 with a value in column 12 row 1, for a dataset with 22 columns, and repeat this until column 11 is summed with column 22. The solution needs to work for hundreds of rows.
How do I do this using R, while ignoring the column names?
Suppose your data is
d <- setNames(as.data.frame(matrix(rnorm(100 * 22), nc = 22)), LETTERS[1:22])
You can do a simple matrix addition using numbers to select the columns:
output <- d[, 1:11] + d[, 12:22]
so, e.g.
all.equal(output[,1], d[,1] + d[,12])
# [1] TRUE
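The same idea can be written without hard-coding 11 and 22, assuming only that the number of columns is even. In this sketch half is an illustrative name and the data is a small random example.

```r
set.seed(42)
d <- as.data.frame(matrix(rnorm(5 * 6), ncol = 6))

# Split point; stop early if the columns cannot be paired up
half <- ncol(d) %/% 2
stopifnot(ncol(d) == 2 * half)

# Column i of the first half is added to column i + half
output <- d[, seq_len(half)] + d[, half + seq_len(half)]
```

Columns are selected by position, so the column names play no part; the result simply carries the names of the first half.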
I have several data frames that I need to merge into the one data frame to rule them all. The master data frame will end up with thousands of columns. All of the data frames have an ID column to join on. One problem is that hundreds of columns are duplicated across data frames. Another problem is that a handful of those columns contain inconsistent values. I would like to find a way to:
Combine all data frames, keeping only 1 "master column" of data if there are duplicate column names and the values do not conflict between data frames
Keep both columns of data if they share the same name but have conflicting values.
Are there any packages that can help automate this? Or am I going to be stuck writing a lot of code/manually checking data?
I wrote the package safejoin, which solves this very succinctly:
#devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
Consider the following data frames: A is identical in both, B differs between df1 and df2, and C and D each appear in only one data frame.
df1 <- data.frame(id = 1:2, A = 3:4, B= 5:6, C = 7:8)
df2 <- data.frame(id = 1:2, A = 3:4, B= 9:10, D = 11:12)
library(tidyverse)
safe_full_join(df1, df2, by = "id", conflict = ~ if(identical(.x, .y)) .x else
map2( .x, .y,~tibble(df1=.x,df2=.y))) %>%
unnest(.sep="_")
#   id A C  D B_df1 B_df2
# 1  1 3 7 11     5     9
# 2  2 4 8 12     6    10
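For comparison, roughly the same result can be sketched in base R without safejoin. This is a different technique from the one above, not a reimplementation of the package: merge with suffixes, then collapse each shared column whose two copies are identical. The names shared and m are illustrative.

```r
df1 <- data.frame(id = 1:2, A = 3:4, B = 5:6, C = 7:8)
df2 <- data.frame(id = 1:2, A = 3:4, B = 9:10, D = 11:12)

# Non-key columns that appear in both data frames
shared <- setdiff(intersect(names(df1), names(df2)), "id")

# Full join; shared columns come back as <name>_df1 and <name>_df2
m <- merge(df1, df2, by = "id", all = TRUE, suffixes = c("_df1", "_df2"))

for (col in shared) {
  a <- m[[paste0(col, "_df1")]]
  b <- m[[paste0(col, "_df2")]]
  if (identical(a, b)) {
    # No conflict: keep a single master column under the original name
    m[[col]] <- a
    m[[paste0(col, "_df1")]] <- NULL
    m[[paste0(col, "_df2")]] <- NULL
  }
}
```

Note that an id present in only one data frame yields NA on the other side, so this sketch treats that column as conflicting and keeps both copies.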