Subsetting duplicate company identifiers in company-level data - R

I have a dataset that contains the financial information of various companies, such as revenue and profit for the year. Each company is given a unique ID. I would like to subset the dataset to include all the duplicated company IDs and their related financial information.
I have tried the function duplicated(), but it either hangs or produces an error because the dataframe has over 200 million records. Could anyone help me with this? Thank you, as I am still unfamiliar with R.
I have tried the following:
Duplicate <- DF[duplicated(DF$ID),]
where DF is the name of the dataframe and ID is the company ID, but the code would not run and I am stuck.
ID  Revenue  Profit
 1       50      10
 2       20      10
 2       20      10
 2       20      10
 3       10       0
 4       20       5
 4       20       5
I want a dataframe that includes all the rows for companies 2 and 4, since their IDs are duplicated.

The function duplicated(DF$ID) returns a logical vector with one element per row of DF, indicating whether the ID at each position has been seen before. So for the IDs in your example, duplicated will return
1 2 2 2 3 4 4
F F T T F F T
and hence your code line returns the subset of rows where the ID is a duplicate, excluding the first occurrence of each ID.
For me it is unclear whether you need just a reduced list of which IDs appear multiple times, the entire rows of the duplicated records (including or excluding the first occurrence), or whether you care about duplicate records rather than just duplicate IDs.
To get which IDs appear multiple times, count them:
count <- table(DF$ID)
names(count[count > 1])
Note: names() returns a character vector.
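For illustration, here is a minimal sketch on the small example above (the example DF is reconstructed as a data frame purely to make the snippet runnable):
DF <- data.frame(ID = c(1, 2, 2, 2, 3, 4, 4),
                 Revenue = c(50, 20, 20, 20, 10, 20, 20),
                 Profit = c(10, 10, 10, 10, 0, 5, 5))
count <- table(DF$ID)
names(count[count > 1])
#> [1] "2" "4"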
To get the records where the IDs appear multiple times, we can:
Duplicate <- DF[DF$ID %in% as.integer(names(count[count > 1])), ]
# wrapped in as.integer, as I suspect your ID column is an integer vector
or, try with dplyr:
library(dplyr)
Duplicate <- DF %>%
  group_by(ID) %>%
  add_count() %>%
  filter(n > 1)
This might be quicker when you have 200 million rows.
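An equivalent formulation, which avoids materialising the extra n column (a sketch, assuming a reasonably recent dplyr):
Duplicate <- DF %>%
  group_by(ID) %>%
  filter(n() > 1) %>%
  ungroup()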
Update:
To get only the first(*) occurrence of each ID, simply
DF[!duplicated(DF$ID),]
*first occurrence depends entirely on the ordering of the data.frame.
But do note that you should make sure the whole records really are duplicates, so you will want to check whether rows sharing an ID differ in any other column.
If there is a set of columns you regard as sufficient to define a duplicate, try dplyr::distinct:
library(dplyr)
DF %>% distinct(ID, Revenue, .keep_all = TRUE)
If you do not include .keep_all = TRUE, only the named columns will be returned.
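On the small example data frame from above, deduplicating on ID and Revenue keeps one row per company:
DF %>% distinct(ID, Revenue, .keep_all = TRUE)
#>   ID Revenue Profit
#> 1  1      50     10
#> 2  2      20     10
#> 3  3      10      0
#> 4  4      20      5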
You have 200 million rows
At this point I would say you've fallen outside the scope of R - at least using R naively.
If you have access to a database server (MySQL, MS SQL Server, PostgreSQL, etc.), you should definitely leverage it! I assume you are just starting on the data cleaning, and these database engines are built for working with data of this size.
If you do not, you can offload it into an SQLite database file with RSQLite-package. It allows you to run many of the same operations with SQL. For your case, we can retrieve the set of rows with duplicated IDs with:
WITH cte AS (
  SELECT ID
  FROM DF
  GROUP BY ID
  HAVING count(*) > 1
)
SELECT *
FROM DF
INNER JOIN cte USING (ID);
This assumes you have loaded your data frame into a table named DF and created an appropriate index on ID.
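As a minimal sketch of that workflow with DBI and RSQLite (the file name and index name are my own choices, nothing prescribed):
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "companies.sqlite")  # on-disk database file, name is arbitrary
dbWriteTable(con, "DF", DF)                              # load the data frame into table DF
dbExecute(con, "CREATE INDEX idx_df_id ON DF (ID);")     # index the column we group and join on

Duplicate <- dbGetQuery(con, "
  WITH cte AS (
    SELECT ID FROM DF GROUP BY ID HAVING count(*) > 1
  )
  SELECT * FROM DF INNER JOIN cte USING (ID);
")
dbDisconnect(con)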
To get DISTINCT rows you do not have the same luxury as dplyr::distinct, but you can do:
SELECT DISTINCT *
FROM DF;
If you instead want to retrieve all columns while deduplicating on only a subset of columns, a window function could be the way (unless you have a distinct row ID -- which SQLite implicitly always has):
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY ID, column2, ...) AS rownumber
FROM DF
WHERE rownumber = 1;
This will not run as written, because a window-function alias cannot be referenced in the WHERE clause of the same query, so it would have to be wrapped in a subquery or view. Alternatively, using SQLite's implicit rowid you could do:
WITH cte AS (
SELECT min(_rowid_) AS rowid
FROM DF
GROUP BY ID, column2, ...
)
SELECT DF.*
FROM DF
INNER JOIN cte USING (rowid);

Related

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. First two columns are IDs (ID1 and ID2) referring to the same item and the third column is a count of how many times items with these two IDs appear. The dataframe has many rows so I want to use binary search to first find the appropriate row where both IDs match and then add 1 to the cell under the count column in that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which function is very inefficient. Because I have to do this within a for loop more than a trillion times, it takes a lot of time. Also, there is only one row in the data frame with this ID combination; while the which function goes through all the rows, a function that stops once it finds the correct row should suffice. I have looked into using data.table and setkey for this purpose but do not know how to implement that for my purpose. Thank you in advance.
Indeed you can use data.table and setkeyv to key on both columns (setkey(DF, ID1, ID2) would also work; setkeyv just takes the column names as a character vector).
library(data.table)
# example data, including a Count column
DF <- data.frame(ID1 = sample(1:100, 100000, replace = TRUE),
                 ID2 = sample(1:100, 100000, replace = TRUE),
                 Count = 0L)
# convert DF to a data.table
DF <- as.data.table(DF)
# key on both ID1 and ID2, in that order, so lookups use binary search
setkeyv(DF, c("ID1", "ID2"))
# random x and y values
x <- 10
y <- 18
# look up the row where ID1 == x and ID2 == y and add 1 to its Count column by reference
DF[.(x, y), Count := Count + 1L]
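If the (x, y) pairs you loop over are themselves available as a table, the entire for loop can usually be replaced by a single join-update; a sketch, where updates stands in for a hypothetical table of pairs:
updates <- data.table(ID1 = c(10, 10, 20), ID2 = c(18, 18, 30))  # hypothetical pairs to tally
upd <- updates[, .(add = .N), by = .(ID1, ID2)]                  # how often each pair occurs
DF[upd, Count := Count + add, on = .(ID1, ID2)]                  # one vectorised update by reference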

R equivalent to SQL query - sum of integer column where date column between parameters

Hi there, I'm trying to obtain the sum of an integer column where the date given in a separate column is between given parameters.
The following SQL query does what I want; however, it is far too slow in sqldf, so I need to find an R equivalent. The data is of hospital episodes, where the Stay column indicates the number of days spent in hospital for that episode. The df table contains the IndexDate and EndDate for each patient (AnonID).
SELECT m.*, b.Sum
FROM df m
LEFT JOIN
  (SELECT AnonID, SUM(a.Stay) AS Sum
   FROM
     (SELECT e.*, d.IndexDate, d.EndDate
      FROM Episodes e
      LEFT JOIN df d
        ON e.AnonID = d.AnonID) a
   WHERE a.AdmissionDate BETWEEN CAST(a.IndexDate AS datetime2) AND CAST(a.EndDate AS datetime2)
   GROUP BY AnonID) b
  ON m.AnonID = b.AnonID
The dplyr library is one of the most used data manipulation packages for R.
In your particular case we need:
left_join for LEFT JOIN
filter for the WHERE clause
group_by for the GROUP BY
summarise (or summarize) to compute aggregates such as SUM
%>% for piping, purely aesthetic but it makes the code easier to read
Putting all that together, you should have something like:
library(dplyr)
eps_in_range <- episodes %>%
left_join(df, by="AnonID") %>%
filter(AdmissionDate >= IndexDate,
AdmissionDate <= EndDate) %>%
group_by(AnonID) %>%
summarise(stay_sum = sum(Stay))
df %>%
left_join(eps_in_range)
It is hard to make sure this is 100% correct without seeing the data or understanding what you want to achieve, but hopefully this is enough to get you started. There are a lot of dplyr resources out there; I suggest you run the pipes one by one to understand what is happening.
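One follow-up worth noting: with the final left_join, patients who have no episodes inside their date range come back with an NA sum. If you would rather see 0, something along these lines (a sketch, assuming 0 is the value you want):
df %>%
  left_join(eps_in_range, by = "AnonID") %>%
  mutate(stay_sum = coalesce(stay_sum, 0))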

Insert all missing rows into data table for a range of values for 2 columns

I am interested in inserting all missing rows into a data table for a new range of values for 2 columns.
For example, dt1[, a] has some values from 1 to 5, as does dt1[, b], but I'd like not only all pairwise combinations of the existing values to be present in columns a and b, but all combinations over a newly defined range, e.g. 1 to 7 instead.
library(data.table)
# Example data.table
dt1 <- data.table(a = c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5),
                  b = c(1,3,4,5,1,2,3,4,1,2,3,1,2,3,4,5,3,4,5),
                  c = sample(1:10, 19, replace = TRUE))
setkey(dt1, a, b)
# CJ in data.table will create all rows needed to ensure all
# pairwise combinations of the nominated columns are present.
dt1[CJ(a, b, unique = TRUE)]
The above is great, but will only use the values already present in the nominated columns. I'd like the inserted rows to give me all combinations over a new, nominated range, e.g. 1 to 7. There would be 49 rows.
# the following is a temporary workaround
template <- data.table(a1=rep(1:7,each=7),b1=rep(1:7,7))
setkey(template,a1,b1)
full <- dt1[template]
Instead of only the values already existing in the columns, we can pass a range of values into 'CJ' for both 'a' and 'b':
dt1[CJ(a = 1:7, b = 1:7)]
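A quick check of the result, plus an optional fill for the newly inserted combinations (filling c with 0 is my assumption, not something the question specifies):
full <- dt1[CJ(a = 1:7, b = 1:7)]
nrow(full)               # 49 rows: every (a, b) pair in 1:7 x 1:7
full[is.na(c), c := 0L]  # combinations that were not in dt1 get c = 0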

Redshift join with metadata table and select columns

I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active, with 'Y' as the value for some of the rows. Let us call this table config. The config table looks like below:
table_name           column_name
interaction_summary  name_id
tag_transaction      name_id
interaction_summary  direct_preference
bulk_sent            email_image_click
crm_dm               web_le_click
Now I want to be able to map the table names from this table to the actual table and fetch values for the corresponding column. name_id will be the key here which will be available in all tables. My output should look like below:
name_id direct_preference email_image_click web_le_click
1 Y 1 2
2 N 1 2
The solution needs to be dynamic, so that even if the table list grows tomorrow, the new tables are accommodated. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I made this work by using the listagg function and string concatenation, and then wrote the output to a dataframe in R. This dataframe has 'n' select queries, one per row.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
        select distinct table_name,
               listagg(col_name, ',') within group (order by col_name)
                 over (partition by table_name) as col_names
        from attribute_config
        where active = 'Y'
        order by table_name
       )
  group by 1")))
Once done, I took each row of this dataframe in turn and fetched its output as below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution, but it works! It fetches about 17M records in under 1 second.
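Rather than assigning each row by hand, the generated queries can also be run in one loop; a sketch along the lines of the code above:
results <- lapply(df$q1, function(q) as.data.frame(tbl(conn, sql(q))))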

Match values in each group of a data.table column to values in a vector

I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although I manage to get most things done, now I'm stuck with this problem:
I have a data table, table1, in which the first column (labels) is a group ID, and the second column, o.cell, is an integer. The key is on "labels"
I have another data table, table2, containing a single column: "cell".
Now, I'm trying to find, for each group in table1, the values from the column "o.cell" that are in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in the "o.cell" column of table1 can be found in more than one group.
This seems like a simple task, but I can't find the right way to do it. Depending on how I structure my call, it either gives me a different result than what I expect, or it never completes and I have to kill the R task because it's frozen (my machine has 24 GB of RAM).
Here's an example of one of the "variant" of the calls I have tried:
overlap <- table1[, list(over.cell =
o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
by = labels]
I'm pretty sure this is the wrong way to use data tables for this task, and on top of that I can't get the result I want.
I will greatly appreciate any help. Thanks.
Sounds like this is your set up:
dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)
And you simply want to do a simple merge:
setkey(dt1, o.cell)
dt1[dt2]
# o.cell labels
#1: 4 b
#2: 5 a
#3: 6 b
#4: 7 a
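If you need the matches organised per group, as in your original by = labels call, the join result can simply be grouped afterwards; a sketch:
overlap <- dt1[dt2, nomatch = 0L][, .(over.cell = o.cell), by = labels]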
