Merging with Partially Similar Columns in R

I'm trying to merge two data frames by matching their tickers. However, in my first data frame, some companies are listed with two tickers (as below):
Company Ticker Returns
1800-Flowers FLWS, FWC .01
First Busey Corp BUSE .02
First Bancshare Inc FBSI 0
In the second data frame, each row has only one ticker:
Ticker Other Info
FLWS 50
BUSE 60
FBSI 20
How can I merge these two data frames in R so that it recognizes that the FLWS row from data frame 2 belongs to the first row of data frame 1, because that row contains the ticker?
Please note that in data frame 1, most companies have only 1 ticker listed, many have 2, and some companies have 3 or 4 tickers.
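A sketch of one way to handle this, assuming the data frames are named df1 and df2, the multiple tickers are comma separated as shown, and df2's second column is called OtherInfo: split the ticker column into one row per ticker with tidyr::separate_rows(), then join on the single ticker.
library(tidyr)
library(dplyr)
# df1: Company, Ticker (possibly "FLWS, FWC"), Returns
# df2: Ticker, OtherInfo (column name assumed)
df1_long <- df1 %>%
  separate_rows(Ticker, sep = ",\\s*")   # one row per individual ticker
merged <- df1_long %>%
  inner_join(df2, by = "Ticker")         # match each ticker against df2
Companies with several tickers will then appear on several rows of the merged result, one per ticker found in df2.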

Related

How to combine sets of similar data under common columns (a specific case)

Here's a portion of a dataset that I have of daily closes for certain stocks within a common period of time in .xlsx format:
What I need is an R script that would produce something like this:
So I need a row for each stock for every day in the time period, with the corresponding price in the third column, as above. I have more than 100 stocks over a period of 4 years, so that makes more than 100 rows for each day across those 4 years; for example, a hundred rows for the day 5.01.2015, and so forth.
I'm still very new to R so help is very much appreciated.
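A sketch of one approach, assuming the .xlsx file has a Date column followed by one column per stock (the file name and column names here are made up): read the sheet with readxl and reshape it to long format with tidyr::pivot_longer().
library(readxl)
library(tidyr)
library(dplyr)
# read the wide sheet: one Date column plus one column per stock (names assumed)
prices_wide <- read_excel("closes.xlsx")
# reshape so there is one row per stock per day
prices_long <- prices_wide %>%
  pivot_longer(-Date, names_to = "Stock", values_to = "Close") %>%
  arrange(Date, Stock)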

R read csv with observations in columns

In every example I have seen so far for reading CSV files in R, the variables are in columns and the observations (individuals) are in rows. In an introductory statistics course I am taking there is an example table where the (many) variables are in rows and the (few) observations are in columns. Is there a way to read such a table so that you get a data frame in the usual "orientation"?
Here is a solution that uses the tidyverse. First, we gather the data into narrow format tidy data, then we spread it back to wide format, setting the first column as a key for the gathered observations by excluding it from gather().
We'll demonstrate the technique with state level summary data from the U.S. Census Bureau.
I created a table of population data for four states, where the states (observations) are in columns, and the variables are listed in rows of the table.
To make the example reproducible, we entered the data into Excel and saved it as a comma-separated values file, which we reproduce here as a character vector in R and read with read.csv().
textFile <- "Variable,Georgia,California,Alaska,Alabama
population2018Estimate,10519475,39557045,737438,4887871
population2010EstimatedBase,9688709,37254523,710249,4780138
pctChange2010to2018,8.6,6.2,3.8,2.3
population2010Census,8676653,37253956,710231,4779736"
# load tidyverse libraries
library(tidyr)
library(dplyr)
# read the comma-separated text into a data frame
data <- read.csv(text = textFile, stringsAsFactors = FALSE)
# first gather to narrow format, then spread back to wide format
data %>%
  gather(state, value, -Variable) %>%
  spread(Variable, value)
...and the results:
  state      pctChange2010to2018 population2010Census population2010EstimatedBase population2018Estimate
1 Alabama                    2.3              4779736                     4780138                4887871
2 Alaska                     3.8               710231                      710249                 737438
3 California                 6.2             37253956                    37254523               39557045
4 Georgia                    8.6              8676653                     9688709               10519475

Comparing multiple data frames based on unique values in one column and finding overlapping values in second column in multiple data frames in R

I wanted to ask for advice on a problem I am having in trying to identify intersecting values in multiple data frames. In my mind this is a bit complex, and I can't figure out how to do it using the normal intersect function.
I have several data frames (up to 12) with multiple columns that show gene changes over time (for example, 5 time points) and how other genes correlate with this change (i.e., other genes that also go down or up in a manner that correlates with the reference gene). The analysis takes each gene one at a time, uses that gene as a reference, and tests every single gene against it to see whether its pattern of change over time correlates with the reference gene. This is repeated for every single gene. So taking one data frame as an example, the results would appear as follows.
Column 1 contains genes that serve as the reference gene; this value can occur multiple times if other genes correlate with changes over time in this gene. For example, if genes b, c and d correlate with gene a, the first two columns look as follows:
a b
a c
a d
The same goes for gene b, and so on and so forth, 20,000 times (the number of genes)! Hope this makes sense?
b a
b c
b d
The analysis above is carried out on multiple different samples, so I will get up to 12 data frames, which are different samples, each with results detailed as above.
Objective (and apologies in advance that I do not have code, as I am not entirely sure where to start!), which I am thinking might best be served by a function: for gene 'x' in column 1, in every single data frame, I would like to see whether column 2 has overlapping values.
Taking the example above, multiple data frames may look like this:
df1
a b
a c
a d
df2
a d
a c
a e
df3
a d
a e
a f
So comparing the data frames, the function would identify that for gene a there is one column 2 value common to all data frames: gene d.
Similarly, the function would carry out this overlap analysis for every single gene: gene a, b, c, etc.
The output would be, for every single gene in column 1, the column 2 values that occur for that gene across all the data frames.
I am pasting head(analysis):
Feature1 Feature2 delay pBefore pAfter corBefore
1 ENSMUSG00000001525 ENSMUSG00000026211 0 0.1093914984 0.1093914984 0.7161907
2 ENSMUSG00000001525 ENSMUSG00000055653 -1 0.0916478944 0.1047749696 0.7414240
3 ENSMUSG00000001525 ENSMUSG00000003038 0 0.0006810160 0.0006810160 0.9786161
plus many more genes in Feature1, each with genes in Feature2 associated with it.
This data frame would be one sample, and I would have a separate result for each of the other samples.
I would really appreciate any hints on how to write code to achieve this goal. In addition, it would be nice to be able to restrict the overlap to genes that meet a threshold, e.g. pBefore >= 0.8, or similarly for the delay column, etc.
Many thanks for taking the time to read this!
If I understand correctly, you can concatenate all 12 data frames as
df_final = pd.concat([df1, df2, ..., df12])
Then find the combinations of genes present in all 12 data frames:
df_n = df_final.groupby(['A','B']).size().reset_index(name = 'count')
As there are 12 data frames,
df_n[df_n['count']==12]
will give you the pairs of genes present in all 12 data frames.
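The snippet above is pandas; a rough R sketch of the same idea, assuming the first two columns of each sample's data frame are named Feature1 and Feature2 as in head(analysis) and the samples are df1 through df12, is to stack the data frames, count each gene pair, and keep the pairs that appear in all 12.
library(dplyr)
# stack the per-sample data frames (df1 ... df12 assumed; only three shown here)
all_samples <- bind_rows(df1, df2, df3)   # add the remaining data frames
# count how many samples each (reference gene, correlated gene) pair occurs in,
# then keep the pairs present in all 12 samples
# (this assumes a pair appears at most once within a sample)
common_pairs <- all_samples %>%
  count(Feature1, Feature2, name = "n_samples") %>%
  filter(n_samples == 12)
A threshold such as pBefore >= 0.8 could be applied with filter() on the stacked data before counting.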

Matching timestamped data to closest time in another dataset leads to different amount of rows

I have a timestamp in one data frame that I am trying to match to the closest timestamp in a second data frame, for the purpose of extracting data from the second data frame.
Earlier I found that I can try data.table's rolling join with the "nearest" option:
library(data.table) # v1.9.6+
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
# [1] 5 7 7 8
However, this results in a vector that is 2 rows longer than the data file.
Is this because the timestamps of 2 observations were equally close to 2 timestamps in the reference data (one earlier, one later)?
Is there an option for the case where an observation's time falls exactly between 2 timestamps? Is there a way to choose one of the 2 timestamps, or to find out which observations have two possible matches?
If I merge the result with the data, the last 2 observations of the result are unused, which I figure changes my data.
Thank You!
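One thing worth checking (a sketch, assuming reference and data are as in the question, each with a datetime column): duplicated timestamps in reference can produce more than one match per observation, and mult = "first" keeps a single match for each row of data.
library(data.table)
setDT(reference)
setDT(data)
# timestamps that occur more than once in the reference table
reference[duplicated(datetime)]
# keep only the first matching reference row for each observation in data
reference[data, refvalue, roll = "nearest", on = "datetime", mult = "first"]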

Faster version of split{base} for very large data frame

I have a data frame with roughly three million rows (each row is a booking) and a small number of columns, one of them being customer id. My goal is to split this data frame by customer id into a list of data frames, such that each data frame contains all the bookings of one customer. So I tried
cstmr_list <- split(df, f = df$cstmr_id)
but cancelled it after half an hour because it took too long. Next, I split only the indexes with
idx_list <- split(seq(nrow(df)), f = df$cstmr_id)
which took less than 10 seconds. Now I want to use idx_list to pull out the corresponding rows of df. Does anyone know how to do this?
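A sketch of two options, reusing df and idx_list from the question: subset df with each group of row numbers, or let data.table's split method group by the column directly, which tends to be much faster at this size.
# option 1: build the list of data frames from the index list
cstmr_list <- lapply(idx_list, function(i) df[i, , drop = FALSE])
# option 2: data.table's split method groups by a column directly
library(data.table)
cstmr_list <- split(as.data.table(df), by = "cstmr_id")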
