I am working on a project in R. I have two data frames, and each has multiple entries per employee ID. For example, employee ID 1 has multiple entries in both table 1 and table 2, so neither table has a primary key.
I want to merge these two tables for better analysis. When I try to merge them, the result contains every combination of rows for each ID, which distorts the data in the resulting table.
Can anyone please suggest a way out?
You can merge two tables with the merge command.
by = "employeeid" lets you specify the key column. If you have more than one key column, use by = c("employeeid", "period").
table3 <- merge(table1, table2, by = "employeeid")
?merge will give you more options.
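A minimal sketch of why the row count grows, and how a second key column can stop it. The toy tables and the `sales` and `hours` columns are invented for illustration; only `employeeid` and `period` come from the answer above.

```r
# Hypothetical data: employee 1 appears twice in each table.
table1 <- data.frame(employeeid = c(1, 1, 2),
                     period     = c("Q1", "Q2", "Q1"),
                     sales      = c(10, 20, 30))
table2 <- data.frame(employeeid = c(1, 1, 2),
                     period     = c("Q1", "Q2", "Q1"),
                     hours      = c(5, 6, 7))

# Merging on employeeid alone pairs every table1 row with every
# table2 row that shares the ID: 2 x 2 rows for employee 1, 1 for employee 2.
by_id <- merge(table1, table2, by = "employeeid")
nrow(by_id)    # 5

# Adding period makes the key unique in both tables, so rows pair one to one.
by_both <- merge(table1, table2, by = c("employeeid", "period"))
nrow(by_both)  # 3
```

This only helps if some combination of columns really does identify a row in both tables.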
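One idea is to wrangle your data so that there are no multiple entries left.
Another is to summarize your data so there is only one row per employee in each table.
A third is to use a full join to connect all matching IDs. Note that a full join keeps every row from both tables, but repeated IDs are still paired in every combination.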
https://dplyr.tidyverse.org/reference/join.html
library(dplyr)
full_join(df1, df2, by = "EmployeeID")
Check out the DPLYR "Data Transformation Cheat Sheet" https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
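The "summarize first" idea can be sketched as follows. This uses base R's `aggregate()` as a stand-in for a dplyr summary, and the `sales` and `hours` columns and the choice of `sum` are assumptions made up for the example.

```r
# Hypothetical data with repeated EmployeeID values in both tables.
df1 <- data.frame(EmployeeID = c(1, 1, 2), sales = c(10, 20, 30))
df2 <- data.frame(EmployeeID = c(1, 1, 2), hours = c(5, 6, 7))

# Collapse each table to one row per employee before joining.
df1_sum <- aggregate(sales ~ EmployeeID, data = df1, FUN = sum)
df2_sum <- aggregate(hours ~ EmployeeID, data = df2, FUN = sum)

# EmployeeID is now unique in both tables, so the merge cannot multiply rows.
# all = TRUE keeps employees that appear in only one table (a full join).
merged <- merge(df1_sum, df2_sum, by = "EmployeeID", all = TRUE)
```

Whether a sum, a mean, or the latest entry is the right summary depends on what the repeated rows represent.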
Related
I am trying to join two tables in R. Since the join key is not unique in either table, the result can contain duplicate rows. To avoid these duplicates, I would like to add some conditions to the join statement.
My hypothetical approach is as follows: I would like to join DF and GF by id, keeping only the rows that meet these conditions: DF's Incomedate = GF's Outgoingdate, or DF's Incomedate = GF's Outgoingdate + 1 day and DF's REFID = 1000.
```
DF_new <- DF %>%
left_join(GF, by c('id'='id','Incomedate'='Outgoingdate','Incomedate'=as.date('Outgoingdate'+1),REFId=1000)
```
I can do it in SQL as follows:
DF LEFT JOIN GF on DF.[id]=GF.[id] and (DF.Incomedate=GF.Outgoingdate or
DATEADD(DAY,1,GF.[Outgoingdate])=DF.Incomedate and (REFID=1000))
How can I replicate it in R?
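One way to approximate a conditional left join in base R is to build all candidate pairs on `id`, keep the pairs that satisfy the condition, and then left-join those back onto `DF`. This is only a sketch: the sample data and the helper column `row_id` are hypothetical, and the condition is read as "same date, or one day later with REFID 1000", which is an assumption about the intended logic.

```r
# Hypothetical sample data.
DF <- data.frame(id = c(1, 1, 2),
                 Incomedate = as.Date(c("2020-01-02", "2020-01-05", "2020-01-03")),
                 REFID = c(1000, 1000, 2000))
GF <- data.frame(id = c(1, 2),
                 Outgoingdate = as.Date(c("2020-01-01", "2020-01-02")))

# Tag each DF row so matches can be joined back afterwards.
DF$row_id <- seq_len(nrow(DF))

# Step 1: every DF/GF pair that shares an id.
cand <- merge(DF, GF, by = "id")

# Step 2: keep pairs meeting the (assumed) date/REFID condition.
keep <- cand$Incomedate == cand$Outgoingdate |
  (cand$Incomedate == cand$Outgoingdate + 1 & cand$REFID == 1000)
matched <- cand[keep, c("row_id", "Outgoingdate")]

# Step 3: left join, so DF rows without a qualifying match get NA.
DF_new <- merge(DF, matched, by = "row_id", all.x = TRUE)
```

Filtering directly after a left join would drop the unmatched `DF` rows, which is why the matches are joined back in a separate step.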
I have recently started using R again, and I'm trying to compare two Excel tables (let's call them table 1 and 2) with very different data. The only thing they have in common is one column (let's name it col1), which holds the gene ID.
My goal is to find and keep all the rows of table 1 in which the data in col1 exactly matches the data in table2.
For example, table1 contains 10 columns and col1 contains the geneID. Table2 contains only 5 columns and col2 contains the geneID. I want to compare those two columns, keep the matches, and get a data.frame containing the whole rows of table1 that I want to keep.
I hope I'm clear? English is not my first language ^^
Thanks a lot !
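# Join table1 to table2 on the gene ID column of each table.
# all.x = TRUE keeps every row of table1 (unmatched rows get NA);
# drop it to keep only the rows whose gene ID also appears in table2.
merge(x = table1,
      y = table2,
      by.x = "column_name_table1",
      by.y = "column_name_table2",
      all.x = TRUE)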
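If only the rows of table1 are wanted, without any columns from table2, a plain `%in%` filter may be enough. A small sketch with invented data; the column names `col1` and `col2` follow the question, everything else is hypothetical.

```r
# Hypothetical tables: gene IDs live in col1 of table1 and col2 of table2.
table1 <- data.frame(col1 = c("g1", "g2", "g3"), value = c(1, 2, 3))
table2 <- data.frame(col2 = c("g2", "g3", "g9"), other = c("a", "b", "c"))

# Keep whole rows of table1 whose gene ID occurs anywhere in table2$col2.
kept <- table1[table1$col1 %in% table2$col2, ]
```

Because this never combines the two tables, repeated gene IDs in table2 cannot duplicate rows of table1.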
I have two big dataframes: DBa and DBb. All columns of DBb are in DBa.
I want to merge these two dataframes by all of DBb's columns.
I'm trying:
new <- merge(DBa, DBb, by=colnames(DBb))
but it gives me the error:
Elements listed in `by` must be valid column names in x and y
How can I do it?
I don't think you are looking to merge the data frames; you should stack them on top of each other with rbind. With merge you put two data frames next to each other, and you only need one common column (the key), which should be unique, otherwise the results will be a mess.
So use row bind (rbind). Both data frames must have the same column names (rbind matches data frame columns by name), and one data frame must not have more columns than the other.
new_data <- rbind(data1, data2)
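If a merge is what is wanted after all, the error message suggests that at least one name in colnames(DBb) is not an exact column name of DBa. A hedged sketch for checking that, with invented data in which the mismatch is a capital letter:

```r
# Hypothetical data: DBb's "Grp" differs from DBa's "grp" only by case.
DBa <- data.frame(id = 1:3, grp = c("a", "b", "c"), extra = c(10, 20, 30))
DBb <- data.frame(id = 2:3, Grp = c("b", "c"))

# Names in DBb that DBa does not have; these are what break `by`.
missing_in_DBa <- setdiff(colnames(DBb), colnames(DBa))

# Merge only on the names that really exist in both data frames.
common <- intersect(colnames(DBa), colnames(DBb))
new <- merge(DBa, DBb, by = common)
```

Stray whitespace or differences in case are typical causes, so printing `missing_in_DBa` is a quick first check.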
I have two tables:
One (table1) contains the ring numbers of caught birds plus all the information associated with that ringing (morphological characteristics, dates, locations etc). The other (table2) has all the ring numbers from a different campaign, which I have already searched and trimmed down to the duplicates between the two.
Because there are a lot of rings (>600), it would be time consuming to go one by one from one list to the other and copy and paste the entire row of information into a new table.
I want to extract all the rows of table1 whose Ring column matches a ring value in table2, and obtain a new table with only the extracted rows.
I tried to code this for one of the rings, just to see if it would select by ring number directly on table1, but it didn't work: newtbl<-as.data.frame(table1[table1$ring==L45523,]
There should be a way of pulling the list of ring numbers from table2 and selecting only those in table1.
table1 looks like this
Hope this is possible. Thank you in advance!
This sounds like a classic relational join scenario. See the notes on relational data in R4DS here.
If you want the columns in table2 to also be pulled through in to your results then use:
library(dplyr)
results <- table1 %>% inner_join(table2, by = "Ring.No.")
If you just want those records from table1 that match a ring number in table2 you can try:
library(dplyr)
results <- table1 %>% semi_join(table2, by = "Ring.No.")
Note that if the ring number column is called something else in table2, you can use the more complete by = ... syntax (replace the placeholder with the actual column name in table2):
library(dplyr)
results <- table1 %>% semi_join(table2, by = c("Ring.No." = "name_of_ring_column_in_table2"))
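The difference between the two joins shows up when a ring number repeats in table2. The sketch below uses base R stand-ins (`merge()` for the inner join, an `%in%` filter for the semi join) and invented data; the `wing` and `campaign` columns are hypothetical.

```r
# Hypothetical data: ring L45523 appears twice in table2.
table1 <- data.frame(Ring.No. = c("L45523", "L45524", "L45525"),
                     wing     = c(70, 72, 68))
table2 <- data.frame(Ring.No. = c("L45523", "L45523", "L45525"),
                     campaign = c("A", "B", "A"))

# Inner-join behaviour: one output row per matching pair,
# so L45523 shows up twice and table2's columns come along.
joined <- merge(table1, table2, by = "Ring.No.")

# Semi-join behaviour: each table1 row is kept at most once,
# and only table1's columns are returned.
filtered <- table1[table1$Ring.No. %in% table2$Ring.No., ]
```

With duplicates already removed from table2, as the question describes, both approaches return the same set of rings.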
You can try using the dplyr package:
table1 %>% filter(Ring.No. %in% unique(table2$Ring.No.))
I am not sure of the structure of your table2, so adapt the code depending on whether it's a list, a data.frame or something else.
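As for the single-ring attempt in the question, one likely reason it failed is that L45523 was not quoted, so R looks for an object with that name (the call is also missing a closing parenthesis). A minimal sketch, assuming the ring column is called `Ring.No.` and holds text values:

```r
# Hypothetical two-row table1.
table1 <- data.frame(Ring.No. = c("L45523", "L45524"), wing = c(70, 72))

# The ring number is a character value, so it must be quoted.
# Row-subsetting this data frame already returns a data frame,
# so the as.data.frame() wrapper is not needed here.
newtbl <- table1[table1$Ring.No. == "L45523", ]
```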
Why not merge the two tables on RING, since that seems to be a column name common to both tables, and then filter on the ring value? Otherwise, post an excerpt of your two tables.
newtable <- merge(table1, table2, by.x = "RING", by.y = "RING", all = TRUE)
subset(newtable, RING == "YYYY")
Adapt the code to suit your desired results.
I have two tables which I need to compare
Table 1:XLOC IDs
Column A: Xloc id
Column B: gene id
Table 2: Ensembl IDs
Column A: Ensembl id
Column B: gene Id
In both tables, there are identical Gene ids (names e.g. cpa6). In table 1 there are 25000 entries, in table 2 there are 46000 entries.
I need to insert the Ensembl IDs from ColA of Table 2 into ColC of Table 1 wherever the gene IDs in column B match, and create an output file with the new data, e.g.
Table 1
ENS0002 cpa6
Table 2:
Xloc0014 cpa6
Output file, Table 3:
ENS0002 cpa6 Xloc0014
The columns are not in the same order and cannot be sorted alphabetically etc. The remaining 21000 entries without corresponding Xlocs I will get rid of (but can easily do this post-output).
Does anyone know how to do this relatively easily in R, Excel, or other software?
N.B. Both tables can not be sorted into the same order, so I really need to use a formula/script/bash to do this.
Try this. I have created example data frames to show how you can merge and keep only the values that exist in both tables.
As you can see, the new table contains only the IDs that exist in both tables, and it has 3 columns: the key plus the value column from each table (col2.x and col2.y).
If you want to keep only the rows that exist in both of your tables, merge on the gene ID column, for example newTable <- merge(tab1, tab2, by = "gen_id").
tab1 <- data.frame(col1=c("id1","id2","id3","id4"),col2=c(1,2,3,4))
tab2 <- data.frame(col1=c("id1","id2","id3","id5","id7"),col2=c(1,3,3,5,6))
newTable <- merge(tab1,tab2,by = "col1")
If you want to keep all rows from table1, even those that don't exist in table2, use this.
newTable <- merge(tab1, tab2, by = "col1", all.x = TRUE)
This keeps all the rows of table1; col2.y holds the value from table2 where a match exists and NA otherwise.
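Applied to the tables in the question, the same merge might look like the sketch below. The column names `xloc_id`, `ensembl_id` and `gene_id` and the extra rows are assumptions; only the cpa6 row comes from the question.

```r
# Hypothetical extracts of the two tables.
xloc <- data.frame(xloc_id = c("Xloc0014", "Xloc0020"),
                   gene_id = c("cpa6", "abc1"))
ens  <- data.frame(ensembl_id = c("ENS0002", "ENS0099"),
                   gene_id    = c("cpa6", "zzz9"))

# Inner merge on the shared gene ID; row order in the inputs does not matter.
table3 <- merge(ens, xloc, by = "gene_id")

# Reorder the columns to match the requested layout: Ensembl ID, gene, Xloc ID.
table3 <- table3[, c("ensembl_id", "gene_id", "xloc_id")]

# To save the result, something like
# write.table(table3, "table3.txt", sep = "\t", quote = FALSE, row.names = FALSE)
# could be used; the file name is a placeholder.
```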
In R I would use the merge function, merging on the gene ID column: merge(Table1, Table2, by = "gene_id"), where gene_id stands for the name of the column that holds the gene names (e.g. cpa6).
However, I have done this in Excel before, and it worked well too, using the VLOOKUP function. You just need to use an IF function in Excel with a nested VLOOKUP inside:
=IF(ISERROR(VLOOKUP(cell with gene name in Table1, array of cells that contain the gene names in Table2, number of the column in the array in Table2, FALSE so they match exactly)), output if not found, output if found).
Example:
=IF(ISERROR(VLOOKUP(C4,List1!A$2:A$1000,1,FALSE)), "Does NOT exist in List 1","Exists in List 1")
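For comparison, an exact-match lookup in the spirit of VLOOKUP can be sketched in R with `match()`. The data and column names below are invented for illustration.

```r
# Hypothetical extracts of the two tables.
xloc <- data.frame(xloc_id = c("Xloc0014", "Xloc0020"),
                   gene_id = c("cpa6", "abc1"))
ens  <- data.frame(ensembl_id = c("ENS0002", "ENS0099"),
                   gene_id    = c("cpa6", "zzz9"))

# match() returns, for each gene in xloc, the position of its first
# exact match in ens$gene_id, or NA when there is none.
idx <- match(xloc$gene_id, ens$gene_id)

# Use those positions to pull the Ensembl ID into a new column;
# genes without a match end up with NA.
xloc$ensembl_id <- ens$ensembl_id[idx]
```

Like VLOOKUP, this returns only the first match per gene, so it suits tables where each gene ID appears once in the lookup table.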