I am running the merge function in R:
Example:
DF <- merge(DF1, DF2, by = c("Date", "Time"), all.x = TRUE)
However, when I run the code, I get duplicated rows!
How can I get unique rows from the function, and why am I getting these duplicated rows?
We can take only the unique rows of DF1 and DF2 and then merge. Note that unique() removes only rows that are identical in every column.
DF <- merge(unique(DF1), unique(DF2), by = c("Date", "Time"), all.x = TRUE)
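As for why the duplicates appear: merge returns one output row for every matching pair of rows, so if a Date/Time key occurs more than once in DF2, each matching DF1 row is repeated. A minimal sketch with made-up data (the Value and Reading columns are hypothetical) that shows the effect and, as one possible fix, keeps only the first DF2 row per key:

```r
# Hypothetical data: the key (Date, Time) appears twice in DF2 with different readings
DF1 <- data.frame(Date = c("2020-01-01", "2020-01-02"),
                  Time = c("10:00", "10:00"),
                  Value = c(1, 2))
DF2 <- data.frame(Date = c("2020-01-01", "2020-01-01", "2020-01-02"),
                  Time = c("10:00", "10:00", "10:00"),
                  Reading = c(5, 6, 7))

# Each DF1 row is paired with every DF2 row sharing its key: 3 rows, not 2
dup_merge <- merge(DF1, DF2, by = c("Date", "Time"), all.x = TRUE)

# unique(DF2) would not help here because the two rows differ in Reading.
# Keep only the first DF2 row per key before merging.
DF2_first <- DF2[!duplicated(DF2[c("Date", "Time")]), ]
DF <- merge(DF1, DF2_first, by = c("Date", "Time"), all.x = TRUE)
```

Whether keeping the first row per key is appropriate depends on the data; aggregating the duplicates may be a better choice in some cases.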
Related
I have two data frames: df1 has 300,000 rows and df2 has 100,000 rows. A few values in df1 are repeated (as can be seen from the dimensions of the data) because I have to do a graphical analysis on the data. df2 contains metadata for the values in the rows of df1.
dput(df1[1:5, ])
c("ENSG00000272905.1", "ENSG00000269148.1", "ENSG00000272905.1",
"ENSG00000204581.2", "ENSG00000158486.12")
dput(df2[1:5, ])
structure(list(ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206",
"ENSG00000007174", "ENSG00000009724", "ENSG00000009844"), hgnc_symbol = c("ZMYND10",
"SPPL2B", "DNAH9", "MASP2", "VTA1"), gene_biotype = c("protein_coding",
"protein_coding", "protein_coding", "protein_coding", "protein_coding"
)), row.names = c(NA, 5L), class = "data.frame")
I want to match each row in df1 and store its metadata (given in df2) in the corresponding columns. My expected result is:
dput(df3[1:5, ])
c("ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000419.11 ENSG00000000419 DPM1 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod",
"ENSG00000000460.15 ENSG00000000460 C1orf112 protein_cod"
)
I tried the match function, but it returned NA because the values in column 1 of df1 carry a decimal suffix. I also tried the %in% operator, but that returned "Error:incorrect dimension".
What should the script look like so that I can subset my data without omitting repeated values?
merge() automatically joins the data frames on their common variable names, but you would most likely want to specify df3 <- merge(df1, df2, by = "ensembl_gene_id") to make sure that you are matching only on the fields you want.
I'm always a fan of the dplyr package (part of the tidyverse).
You will likely need something like this:
# unique() drops duplicates
df3 <- inner_join(unique(df1), df2, by = "ensembl_gene_id")
Alternatively, you could just filter for the desired rows:
df3 <- df2 %>% filter(ensembl_gene_id %in% pull(df1, ensembl_gene_id))
Edit: just reread the question, ignore unique. Also, the second method will drop the duplicates too.
You just want df3 <- inner_join(df1, df2, by = "ensembl_gene_id")
Try the following code:
library(dplyr)
# Strip the version suffix (everything from the first dot) so the IDs match df2
result <- df1 %>%
  mutate(ensembl_gene_id = sub('\\..*', '', ensembl_gene_id)) %>%
  inner_join(df2, by = 'ensembl_gene_id')
result
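The same idea can be sketched in base R. The data below are a small stand-in modelled on the question: the column name gene_id_version, the version suffixes, and the non-matching last ID are assumptions made for illustration. Using match() instead of a join keeps every df1 row, including repeats, in its original order and leaves NA where df2 has no entry:

```r
# Hypothetical inputs; gene_id_version and the version suffixes are assumptions
df1 <- data.frame(gene_id_version = c("ENSG00000004838.9", "ENSG00000005206.12",
                                      "ENSG00000004838.9", "ENSG00000999999.1"))
df2 <- data.frame(ensembl_gene_id = c("ENSG00000004838", "ENSG00000005206"),
                  hgnc_symbol = c("ZMYND10", "SPPL2B"),
                  gene_biotype = c("protein_coding", "protein_coding"))

# Strip the version suffix (everything from the first dot onward)
df1$ensembl_gene_id <- sub("\\..*$", "", df1$gene_id_version)

# For each df1 row, find the position of its ID in df2 (NA if absent)
pos <- match(df1$ensembl_gene_id, df2$ensembl_gene_id)

# Attach the metadata columns; repeated IDs in df1 are kept
df3 <- cbind(df1, df2[pos, c("hgnc_symbol", "gene_biotype")])
rownames(df3) <- NULL
```

This behaves like a left join; drop the rows where pos is NA to get the equivalent of the inner join above.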
I'm coding in R. I have a big data frame (df1) and a little data frame (df2). df2 is a subset of df1, but in a random order. I need to know the row indices of df1 which occur in df2. All of the specific cell values have lots of duplicates: Tapirus terrestris shows up more than once, as does each ModType value. I tried experimenting with which() and grepl() but couldn't get my code to work.
df1 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Panthera onca', 'Leopardus tigrinus' , 'Leopardus tigrinus'),
ModType = c('ANN', 'GAM', 'GAM','RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1019_s3_sd','CHELSAbio1015_s4_sd','CHELSAbio1015_s4_sd'))
df2 <- data.frame(
SpeciesName = c('Tapirus terrestris', 'Leopardus tigrinus'),
ModType = c('ANN', 'RF'),
Variable_scale = c('aspect_s2_sd', 'CHELSAbio1015_s4_sd'))
The output should be 1, 4, because rows 1 and 4 of df1 occur in df2.
You can create an index column in df1 and merge the datasets.
df1$index <- 1:nrow(df1)
df3 <- merge(df1, df2)
df3$index
#[1] 4 1
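You can use match. Note that matching on SpeciesName alone returns the first df1 row for each species, so for the example data it picks row 3 rather than row 4 for Leopardus tigrinus: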
df1[match(df2$SpeciesName, df1$SpeciesName), ]
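To match on all three columns rather than on a single one, one option is to build a combined key per row and match on that. A sketch using the question's data; the "|" separator is an assumption that this character never occurs in the values:

```r
df1 <- data.frame(
  SpeciesName = c("Tapirus terrestris", "Panthera onca", "Leopardus tigrinus", "Leopardus tigrinus"),
  ModType = c("ANN", "GAM", "GAM", "RF"),
  Variable_scale = c("aspect_s2_sd", "CHELSAbio1019_s3_sd", "CHELSAbio1015_s4_sd", "CHELSAbio1015_s4_sd"))
df2 <- data.frame(
  SpeciesName = c("Tapirus terrestris", "Leopardus tigrinus"),
  ModType = c("ANN", "RF"),
  Variable_scale = c("aspect_s2_sd", "CHELSAbio1015_s4_sd"))

key_cols <- c("SpeciesName", "ModType", "Variable_scale")

# One key string per row, built from all key columns
key1 <- do.call(paste, c(df1[key_cols], sep = "|"))
key2 <- do.call(paste, c(df2[key_cols], sep = "|"))

idx <- match(key2, key1)          # first matching df1 row for each df2 row, in df2 order
all_idx <- which(key1 %in% key2)  # every df1 row that occurs in df2, in df1 order
```

match() returns only the first hit per df2 row, so all_idx is the safer choice if df1 can contain fully identical rows.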
Another option is the tidyverse:
library(dplyr)
df1 %>%
  mutate(index = row_number()) %>%
  inner_join(df2)
I have 2 separate data frames. The first data frame, df1, has 7 tickers (two of which are identical) and 7 unique dates that correspond to unique company-specific events.
The second data frame, df2, has returns for the tickers from 01.01.2000 until 10.09.2020. I want to merge the two data frames so that df2 gets a new column holding the date of the company-specific event for every observation of a given ticker.
df2 <- left_join(df2, df1, by = "Ticker")
I do, however, have a problem with the two identical tickers. When I merge the data frames, the duplicated ticker leads to the company-specific event date being incorrect for one of the events. Does anyone know how I can handle this?
Thanks in advance! Hope everyone is doing fine
Update: Below you can find information on the 2 data frames
library(lubridate)
library(quantmod)
df1 <- data.frame(Ticker = c("DGX", "FAF", "TM", "COF", "FB", "MSFT", "FB"),
                  Attack_date = ymd(c("2019-05-14", "2019-05-24", "2019-03-29", "2019-07-29", "2019-09-04", "2020-01-22", "2018-03-19")))
tickers <- unique(df1$Ticker)
getSymbols(Symbols = tickers, return.class = "zoo",
           from = "2003-01-02",
           to = "2020-09-07") # Downloading data for all tickers in breach_2
df2 <- data.frame(Company = character(),
                  Date = as.Date(character()),
                  Adjusted = numeric(),
                  Attack_Date = as.Date(character()),
                  stringsAsFactors = FALSE) # Creating df to "paste" all returns and dates in
for (i in seq_along(tickers)) {
  df <- cbind(Company = rep(tickers[i], nrow(get(tickers[i]))),
              fortify.zoo(get(tickers[i]))[, c(1, 7)])
  colnames(df)[c(2, 3)] <- c("Date", "Adjusted") # Renaming columns
  df2 <- rbind(df2, df) # Binding the rows in df_tickers and df together to create new df
}
After this I want to merge df1 and df2 together by adding the column named "Attack_Date" from df1 to df2.
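Because FB appears twice in df1, a join on the ticker alone pairs every FB return row with both event dates. One way to keep the two events apart is to give each event its own identifier before merging. The sketch below uses the event table from the question, but the returns data frame, its dates, its prices, and the Event_id column are made-up stand-ins for illustration (no data are downloaded):

```r
# Event table from the question
df1 <- data.frame(
  Ticker = c("DGX", "FAF", "TM", "COF", "FB", "MSFT", "FB"),
  Attack_date = as.Date(c("2019-05-14", "2019-05-24", "2019-03-29", "2019-07-29",
                          "2019-09-04", "2020-01-22", "2018-03-19")))

# Hypothetical stand-in for the downloaded returns: two dates per ticker, made-up prices
tickers <- unique(df1$Ticker)
returns <- data.frame(
  Ticker = rep(tickers, each = 2),
  Date = rep(as.Date(c("2020-01-02", "2020-01-03")), times = length(tickers)))
returns$Adjusted <- seq_len(nrow(returns))

# Give every event its own ID so the two FB events stay distinguishable
df1$Event_id <- seq_len(nrow(df1))

# One output row per (return row, event) pair: FB rows appear twice, once per event
merged <- merge(returns, df1, by = "Ticker", all.x = TRUE)
```

Whether the doubled FB rows are wanted depends on the analysis; if each event needs its own copy of the return series, Event_id lets you group or filter by event afterwards.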
When I merge data frames, I write this code:
library(readxl)
df1 <- read_excel("C:/Users/PC/Desktop/precipitaciones_4Q.xlsx")
df2 <- read_excel("C:/Users/PC/Desktop/libro_copia_1.xlsx")
df1 = data.frame(df1)
df2 = data.frame(df2)
df1$codigo = toupper(df1$codigo)
df2$codigo = toupper(df2$codigo)
dat = merge.data.frame(df1,df2,by= "codigo", all.y = TRUE,sort = TRUE)
The data contain rainfall by county, and df1 has fewer counties than df2. I want to paste the counties that have rainfall data from df1 into df2.
The problem occurs when the county data are pasted into df2: repeated counties appear.
Instead of "id", you must specify the join column names from the first and second tables.
You can use the data.table package and the code below:
library(data.table)
dat <- merge(df1, df2, by.x = "Columna1", by.y = "prov", all.y = TRUE)
Also, you can use the funion function, which works on data.tables, so convert the data frames first:
dat <- funion(as.data.table(df1), as.data.table(df2))
or rbind function:
dat <- rbind(df1, df2)
dat <- unique(dat)
Note: for funion and rbind, the column names and the number of columns of the two data frames need to be the same.
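Repeated counties after a merge often mean that the join column is not unique in one of the tables, so it may help to check that first. A sketch with made-up data; the codigo values and the lluvia and nombre columns are hypothetical:

```r
# Hypothetical data: codigo "A02" occurs twice in df1
df1 <- data.frame(codigo = c("A01", "A02", "A02"), lluvia = c(10, 20, 25))
df2 <- data.frame(codigo = c("A01", "A02", "A03"), nombre = c("x", "y", "z"))

# Keys that occur more than once in each table
dup_keys_df1 <- unique(df1$codigo[duplicated(df1$codigo)])
dup_keys_df2 <- unique(df2$codigo[duplicated(df2$codigo)])

# The duplicated key produces two output rows for the same county
dat <- merge(df1, df2, by = "codigo", all.y = TRUE)
```

If dup_keys_df1 or dup_keys_df2 is non-empty, the corresponding rows need to be aggregated or deduplicated before the merge can return one row per county.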
I have two data frames and I want to merge them using two columns like the ones below:
a <- data.frame(A = c("Ali", "Should Be", "Calif"))
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"))
Could you please let me know if it is possible to do this in R?
One way would be to lowercase your character values using tolower from base R and then do a merge:
library(dplyr) # for mutating
a <- a %>%
  mutate(A = tolower(A))
b <- b %>%
  mutate(B = tolower(B))
df3 <- merge(a, b, by.x = "A", by.y = "B")
df3
          A
1       ali
2     calif
3 should be
Is this what you needed?
Edit: The dplyr bit is of course not necessary. If everything is to be done in base R, a$A <- tolower(a$A) and b$B <- tolower(b$B), as suggested in the comments, work just as well.
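If the original spellings should survive the merge, one variation is to merge on a temporary lowercase key instead of overwriting the columns. A sketch in base R; the helper column name key is an assumption for illustration:

```r
a <- data.frame(A = c("Ali", "Should Be", "Calif"))
b <- data.frame(B = c("ALI", "CALIF", "SHOULD BE"))

# Temporary lowercase key in both data frames
a$key <- tolower(a$A)
b$key <- tolower(b$B)

# Merge on the key, then drop it so only the original columns remain
ab <- merge(a, b, by = "key")
ab$key <- NULL
```

The result keeps A and B side by side with their original capitalization.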