R: Merge dataframes, identical ticker leading to incorrect result

I have 2 separate dataframes. The first dataframe, df1, has 7 tickers (two of which are identical) and 7 unique dates that correspond to unique company-specific events.
The second dataframe, df2, has returns for the tickers from 01.01.2000 until 10.09.2020. I want to merge the two dataframes so that df2 gets a new column holding the date of the company-specific event for every observation of a given ticker.
df2 <- left_join(df2, df1, by = "Ticker")
I do, however, have a problem with the two identical tickers. When I merge the dataframes, the duplicated ticker leads to an incorrect company-specific event date for one of the events. Does anyone know how I can handle this?
Thanks in advance! Hope everyone is doing fine
Update: Below you can find information on the 2 dataframes
library(lubridate)
library(quantmod)
df1 <- data.frame(Ticker = c("DGX", "FAF", "TM", "COF", "FB", "MSFT", "FB"),
Attack_date = ymd(c("2019-05-14", "2019-05-24", "2019-03-29", "2019-07-29", "2019-09-04", "2020-01-22", "2018-03-19")))
tickers <- unique(df1$Ticker)
getSymbols(Symbols = tickers, return.class = "zoo",
from = "2003-01-02",
to = "2020-09-07") # Downloading data for all tickers in breach_2
df2 <- data.frame(Company = character (),
Date = as.Date(character()),
Adjusted = numeric(),
Attack_Date = as.Date(character()),
stringsAsFactors = F) # Creating df to "paste" all returns and dates in
for (i in 1:length(tickers)){
df <- cbind(Company = rep(tickers[i], nrow(get(tickers[i]))),
fortify.zoo(get(tickers[i])) [, c(1,7)])
colnames(df) [c(2,3)] <- c("Date", "Adjusted") # Renaming columns
df2 <- rbind(df2, df)# Binding the rows in df_tickers and df together to create new df
}
After this I want to merge df1 and df2 together by adding the column named "Attack_Date" from df1 to df2.
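A minimal sketch of what the join does with the repeated ticker, using base merge and a small made-up table in place of the downloaded prices. The `returns` table, its `Adjusted` values and the `Event_id` column are assumptions for illustration and are not part of the question. Because "FB" appears twice in df1, every FB row is matched to both events, so each FB observation appears once per attack date, and an explicit event identifier keeps the two copies apart.

```r
# Event table from the question (dates built with as.Date instead of lubridate)
df1 <- data.frame(
  Ticker = c("DGX", "FAF", "TM", "COF", "FB", "MSFT", "FB"),
  Attack_date = as.Date(c("2019-05-14", "2019-05-24", "2019-03-29",
                          "2019-07-29", "2019-09-04", "2020-01-22",
                          "2018-03-19")),
  stringsAsFactors = FALSE
)
# Hypothetical identifier: one id per event, so repeated tickers stay distinct
df1$Event_id <- seq_len(nrow(df1))

# Made-up stand-in for df2: two trading days for two tickers (values are fake)
returns <- data.frame(
  Company = rep(c("FB", "MSFT"), each = 2),
  Date = rep(as.Date(c("2020-01-02", "2020-01-03")), times = 2),
  Adjusted = c(100, 101, 200, 202),
  stringsAsFactors = FALSE
)

# The key columns have different names, hence by.x / by.y
merged <- merge(returns, df1, by.x = "Company", by.y = "Ticker")
merged <- merged[order(merged$Event_id, merged$Date), ]
print(merged)
```

In this sketch the two FB days show up four times, twice for each event, while the MSFT days show up once. Filtering on `Event_id` then gives one clean price series per event.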

Related

Combining rows based on conditions and saving others (in R)

I have a question regarding combining columns based on two conditions.
I have two datasets from an experiment where participants had to type in a code, answer about their gender and eyetracking data was documented. The experiment happened twice (first: random1, second: random2).
eye <- c(1000,230,250,400)
gender <- c(1,2,1,2)
code <- c("ABC","DEF","GHI","JKL")
random1 <- data.frame(code,gender,eye)
eye2 <- c(100,250,230,450)
gender2 <- c(1,1,2,2)
code2 <- c("ABC","DEF","JKL","XYZ")
random2 <- data.frame(code2,gender2,eye2)
Now I want to combine the two dataframes. All rows where code and gender match should be combined (so columns are added). The code and gender variables of those two rows should become one each (gender3 and code3), and the eyetracking data should be split into eye_first for random1 and eye_second for random2.
All rows without a perfect match on code and gender should go into a new dataset.
#this is what the combined data looks like
gender3 <- c(1,2)
eye_first <- c(1000,400)
eye_second <- c(100, 230)
code3 <- c("ABC", "JKL")
random3 <- data.frame(code3,gender3,eye_first,eye_second)
#this is what the data without match should look like
gender4 <- c(2,1,2)
eye4 <- c(230,250,450)
code4 <- c("DEF","GHI","XYZ")
random4 <- data.frame(code4,gender4,eye4)
I would greatly appreciate your help! Thanks in advance.
Use the same column names for your 2 data.frames and use merge
random1 <- data.frame(code = code, gender = gender, eye = eye)
random2 <- data.frame(code = code2, gender = gender2, eye = eye2)
df <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"))
For your second request, you can use anti_join from dplyr
df2 <- merge(random1, random2, by = c("code", "gender"), suffixes = c("_first", "_second"), all = TRUE) # all = TRUE : keep rows with ids that are only in one of the 2 data.frame
library(dplyr)
anti_join(df2, df, by = c("code", "gender"))
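The same two results can be sketched in base R only, as a hedged alternative to the dplyr call above. The helper `key` and the `source` column are hypothetical names introduced here. Note that this version keeps every unmatched row from both tables, so it returns four rows (DEF occurs once per table with different genders), one more than the question's expected random4.

```r
random1 <- data.frame(code = c("ABC", "DEF", "GHI", "JKL"),
                      gender = c(1, 2, 1, 2),
                      eye = c(1000, 230, 250, 400),
                      stringsAsFactors = FALSE)
random2 <- data.frame(code = c("ABC", "DEF", "JKL", "XYZ"),
                      gender = c(1, 1, 2, 2),
                      eye = c(100, 250, 230, 450),
                      stringsAsFactors = FALSE)

# Rows where both code and gender agree, one eye column per session
matched <- merge(random1, random2, by = c("code", "gender"),
                 suffixes = c("_first", "_second"))

# Stack both tables, remembering where each row came from
stacked <- rbind(
  data.frame(random1, source = "random1", stringsAsFactors = FALSE),
  data.frame(random2, source = "random2", stringsAsFactors = FALSE)
)

# Hypothetical helper: one string per (code, gender) pair
key <- function(d) paste(d$code, d$gender, sep = "_")

# Keep only rows whose pair never made it into the matched table
unmatched <- stacked[!key(stacked) %in% key(matched), ]

print(matched)
print(unmatched)
```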

Conditional operations between factor level pairs

I have a dataframe (df) that contains Start times and End times for observations of different IDs:
df <- structure(list(ID = 1:4, Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00", "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00", "2021-05-13 8:19:00", "2021-05-13 8:12:00")),
class = "data.frame", row.names = c(NA,
-4L))
I want to create a new dataframe that shows the latest Start time and the earliest End time for each possible pairwise comparison between the levels of ID.
I was able to accomplish this by making a duplicate column of ID called ID2, using tidyr::expand to expand them, and saving that in an object called Pairs:
library(dplyr)
library(tidyr) # expand() comes from tidyr, not dplyr
df$ID2 <- df$ID
Pairs <-
df%>%
expand(ID, ID2)
Making two new objects a and b that store the Start and End times for each comparison separately, and then combining them into df2:
a <- left_join(df, Pairs, by = 'ID')%>%
rename(StartID1 = Start, EndID1 = End, ID2 = ID2.y)%>%
select(-ID2.x)
b <- left_join(Pairs, df, by = "ID2")%>%
rename(StartID2 = Start, EndID2 = End)%>%
select(ID2, StartID2, EndID2)
df2 <- cbind(a,b)
df2 <- df2[,-4]
and finally using dplyr::if_else to find the LatestStart time and the EarliestEnd time for each of the comparisons:
df2 <-
df2%>%
mutate(LatestStart = if_else(StartID1 > StartID2, StartID1, StartID2),
EarliestEnd = if_else(EndID1 > EndID2, EndID2, EndID1))
This seems like such a simple task to perform. Is there a more concise way to achieve this from df without creating all of these extra objects?
For such computations, outer usually comes in handy:
df %>%
mutate(across(c("Start", "End"), lubridate::ymd_hms)) %>%
{
data.frame(
ID1 = rep(.$ID, each = nrow(.)),
ID2 = rep(.$ID, nrow(.)),
LatestStart = outer(.$Start, .$Start, pmax),
EarliestEnd = outer(.$End, .$End, pmin)
)
}
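As a hedged alternative that avoids matrix-valued columns, the pairwise comparison can also be sketched in base R by indexing the time vectors with all index pairs from expand.grid. The names `idx` and `pair_times` and the choice of UTC as time zone are assumptions made for this example. Like the expand-based version in the question, it includes each ID paired with itself.

```r
df <- data.frame(
  ID = 1:4,
  Start = c("2021-05-12 13:22:00", "2021-05-12 13:25:00",
            "2021-05-12 13:30:00", "2021-05-12 13:42:00"),
  End = c("2021-05-13 8:15:00", "2021-05-13 8:17:00",
          "2021-05-13 8:19:00", "2021-05-13 8:12:00"),
  stringsAsFactors = FALSE
)

# Parse the strings as date-times (time zone fixed to UTC as an assumption)
start <- as.POSIXct(df$Start, tz = "UTC")
end <- as.POSIXct(df$End, tz = "UTC")

# All ordered pairs of row indices, self-pairs included
idx <- expand.grid(i = seq_len(nrow(df)), j = seq_len(nrow(df)))

pair_times <- data.frame(
  ID1 = df$ID[idx$i],
  ID2 = df$ID[idx$j],
  LatestStart = pmax(start[idx$i], start[idx$j]),  # later of the two starts
  EarliestEnd = pmin(end[idx$i], end[idx$j])       # earlier of the two ends
)

print(head(pair_times))
```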

How to rename multiple columns with different column names and different order in several dataframes based on a dictionary in R

I am working on merging multiple datasets from different sources. The columns of each dataset (as dataframes) have different names and are in different orders. I have created a dictionary that contains all the different names and the common name I want to rename the original names with. How do I rename the original column names using the dictionary in R? I specifically want to use a dictionary because I may add more datasets (with different column names) in the future and it would be easy to adapt the dictionary.
I know I can manually rename every column but there are many (like 30) and they may change with the addition of new datasets.
df1 <- data.frame(site = c(1:6), code = c(rep("A",3), rep("B", 3)), result = c(20:25))
df2 <- data.frame(site_no = c(10:19), day = c(1:10), test = c(rep("A", 5), rep("B", 5)), value = c(1:10))
dict <- data.frame(oldName = c("site", "code", "result", "site_no", "day", "test", "value"), newName = c("site_number", "parameter", "result", "site_number", "day", "parameter", "result"))
I would like to rename the columns in df1 and df2 based on the dict dataframe, which contains the old names (all the column names from df1 and df2) and the new names (the common names to use).
The result would be:
colnames(df1)
"site_number" "parameter" "result"
colnames(df2)
"site_number" "day" "parameter" "result"
We can match the names of the respective df to the oldname, then extract the newname at the matched indices:
names(df1) = with(dict,newName[match(names(df1),oldName)])
names(df2) = with(dict,newName[match(names(df2),oldName)])
print(df1)
print(df2)
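One caveat with the match approach is that a column missing from the dictionary ends up with the name NA. A minimal sketch of a guard for that case follows; the function name `rename_with_dict` and the extra data frame `df3` with its `extra` column are hypothetical and only serve the illustration.

```r
dict <- data.frame(
  oldName = c("site", "code", "result", "site_no", "day", "test", "value"),
  newName = c("site_number", "parameter", "result", "site_number", "day",
              "parameter", "result"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename via the dictionary, keep unknown names unchanged
rename_with_dict <- function(d, dict) {
  idx <- match(names(d), dict$oldName)
  new_names <- as.character(dict$newName)[idx]
  # Columns without a dictionary entry keep their original name
  new_names[is.na(idx)] <- names(d)[is.na(idx)]
  names(d) <- new_names
  d
}

df1 <- data.frame(site = 1:6, code = c(rep("A", 3), rep("B", 3)), result = 20:25)
# Made-up data frame with one column that the dictionary does not cover
df3 <- data.frame(site_no = 1:2, extra = c("x", "y"))

names(rename_with_dict(df1, dict))
names(rename_with_dict(df3, dict))
```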
We can use rename_all after placing the datasets in a list. It is better to have those datasets in a list instead of having them in the global environment
library(dplyr)
library(purrr)
out <- mget(ls(pattern = "^df\\d+$")) %>%
map(~ .x %>%
rename_all(~ as.character(dict$newName)[match(., dict$oldName)]))
If we want, we can change the column names in the original objects with list2env
list2env(out, .GlobalEnv)
names(df1)
#[1] "site_number" "parameter" "result"
names(df2)
#[1] "site_number" "day" "parameter" "result"

Repeated values when joining data frames in R

When I merge dataframes, I write this code:
library(readxl)
df1 <- read_excel("C:/Users/PC/Desktop/precipitaciones_4Q.xlsx")
df2 <- read_excel("C:/Users/PC/Desktop/libro_copia_1.xlsx")
df1 = data.frame(df1)
df2 = data.frame(df2)
df1$codigo = toupper(df1$codigo)
df2$codigo = toupper(df2$codigo)
dat = merge.data.frame(df1,df2,by= "codigo", all.y = TRUE,sort = TRUE)
The data has rainfall by county; df1 has fewer counties than df2. I want to add the rainfall data from df1 to the matching counties in df2.
The problem is that when the county data is pasted into df2, repeated counties appear.
Instead of "id", you must specify the column names to join on from the first and second tables.
You can use the data.table package and code below:
library(data.table)
dat <- merge(df1, df2, by.x = "Columna1", by.y = "prov", all.y = TRUE)
Also, you can use the funion function:
dat <- funion(df1, df2)
or the rbind function:
dat <- rbind(df1, df2)
dat <- unique(dat)
Note: the column names and the number of columns of the two dataframes need to be the same.
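The question's data is not shown, so the following is only a sketch of one common cause of repeated rows after a merge: a key value that occurs more than once in one of the tables. All names and values here (`rain`, `counties`, `mm`, `name`) are made up for the example; only the key column name `codigo` comes from the question.

```r
# Made-up rainfall table in which the key "A1" occurs twice
rain <- data.frame(codigo = c("A1", "A1", "B2"),
                   mm = c(10, 10, 25),
                   stringsAsFactors = FALSE)
# Made-up county table with one row per county
counties <- data.frame(codigo = c("A1", "B2", "C3"),
                       name = c("Alpha", "Beta", "Gamma"),
                       stringsAsFactors = FALSE)

# Every match is returned, so county A1 appears twice in the result
with_repeats <- merge(rain, counties, by = "codigo", all.y = TRUE)

# Keep the first row per key before merging to get one row per county
rain_unique <- rain[!duplicated(rain$codigo), ]
one_per_county <- merge(rain_unique, counties, by = "codigo", all.y = TRUE)

print(with_repeats)
print(one_per_county)
```

Whether dropping duplicates is appropriate depends on why the key repeats; if the repeated rows carry different measurements, aggregating them first may be the better choice.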

Merging List of data frames into a single data frame, or avoiding it altogether

I have a dataset like:
Company,Product,Users
MSFT,Office,1000
MSFT,VS,4000
GOOG,gmail,3203
GOOG,appengine,45454
MSFT,Windows,1500
APPL,iOS,6000
APPL,iCloud,3442
I'm writing a function to return a data frame with the nth product for each company, ranked by "Users", so the output of rankcompany(1) should be:
Company Prodcut Users
APPL APPL iOS 6000
GOOG GOOG appengine 45454
MSFT MSFT VS 4000
The function looks like:
rankcompany <- function(num=1){
#Read data file
company_data <- read.csv("company.csv",stringsAsFactors = FALSE)
#split by company
split_data <- split(company_data, company_data$Company)
#sort and select the nth row
selected <- lapply(split_data, function(df) {
df <- df[order(-df$Users, df$Product),]
df[num,]
})
#compose output data frame
#this part needs to be smarter??
len <- length(selected)
selected_df <- data.frame(Company=character(len),Prodcut=character(len), Users=integer(len),stringsAsFactors = FALSE)
row.names(selected_df) <- names(selected)
for (n in names(selected)){
print(str(selected[[n]]))
selected_df[n,] <- selected[[n]][1,]
}
selected_df
}
I split the input data frame into a list, then perform the sorting and selection, then try to merge the result into the output data frame "selected_df".
I'm new to R and I think the merging can be done in a smarter way. Or should I avoid splitting in the first place? Any suggestions?
Thanks
You can do it in a much simpler way with dplyr:
library(dplyr)
rankcompany <- function(d, num=1) {
# sort by Users (descending), then take the num-th row within each company
d %>% group_by(Company) %>% arrange(desc(Users)) %>% slice(num)
}
And then you can do :
rankcompany(d,2)
or :
d %>% rankcompany(1)
Based on the comment from @DMT
I replaced the merging code with:
selected_df <- rbindlist(selected)
selected_df <- as.data.frame(selected_df)
row.names(selected_df) <- names(selected)
selected_df
And it works fine.
If you like the clarity of split and lapply you can use a much shorter version of your function.
rankcompany <- function(N){
# split the data by company (company_data as read in the question)
byCompany <- split(company_data, company_data$Company)
ranks <- lapply(byCompany,
function(x)
{
# row whose Users value has rank N within the company
r <- which(rank(-x$Users)==N)
x[r,]
})
do.call("rbind", ranks)
}
> rankcompany(1)
Company Product Users
APPL APPL iOS 6000
GOOG GOOG appengine 45454
MSFT MSFT VS 4000
If you are using rbindlist, you may not need to convert to data.frame before doing this:
library(data.table) ## 1.9.2+
n <- 1L
setDT(company_data)[order(-Users), .SD[n], keyby=Company]
# Company Product Users
#1: APPL iOS 6000
#2: GOOG appengine 45454
#3: MSFT VS 4000
setDT converts the data.frame to a data.table by reference (without any additional copy or memory usage). Then we sort the data.table in descending order by the Users column, group by Company, and for each group obtain the nth row from the Subset of Data (.SD) for that group.
In your case, perhaps,
DT <- rbindlist(selected)
DT[order(-Users), .SD[n], keyby=Company]
But the previous solution is a much more efficient and easier one-liner to solve the issue.
data
company_data <- structure(list(Company = c("MSFT", "MSFT", "GOOG", "GOOG", "MSFT",
"APPL", "APPL"), Product = c("Office", "VS", "gmail", "appengine",
"Windows", "iOS", "iCloud"), Users = c(1000L, 4000L, 3203L, 45454L,
1500L, 6000L, 3442L)), .Names = c("Company", "Product", "Users"
), class = "data.frame", row.names = c(NA, -7L))
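With the data above, the split-and-recombine idea discussed in this thread can also be sketched in base R without any packages. The function name `rank_company` is hypothetical, and the tie-breaking by Product follows the ordering used in the question. If a company has fewer than num products, this sketch returns a row of NA for it.

```r
company_data <- data.frame(
  Company = c("MSFT", "MSFT", "GOOG", "GOOG", "MSFT", "APPL", "APPL"),
  Product = c("Office", "VS", "gmail", "appengine", "Windows", "iOS", "iCloud"),
  Users = c(1000L, 4000L, 3203L, 45454L, 1500L, 6000L, 3442L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: num-th product per company, ranked by Users (descending)
rank_company <- function(d, num = 1) {
  # Sort once: most users first, ties broken by product name
  d <- d[order(-d$Users, d$Product), ]
  # Pick the num-th row inside each company
  picked <- lapply(split(d, d$Company), function(g) g[num, , drop = FALSE])
  # Stack the one-row data frames; the list names become the row names
  do.call(rbind, picked)
}

rank_company(company_data, 1)
rank_company(company_data, 2)
```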
