I am attempting to add a column from one table to another. I can't use a simple join because the column's data type changes, so I am using a data function/R script to add the specified column.
I can get the proper column to populate, but I cannot get the row order to be preserved (i.e. the wrong records are being returned for a given identifier, PROPNUM).
# join tables to ensure proper number of records
newTable <- merge(OutputTable, InputTable, by = "PROPNUM")
#populate column with values from merged table
OutputColumn <- newTable[, 4]
# sample output - the output column is not order preserved
PROPNUM  OutputColumn
A        B
A        B
C        A
C        A
B        C
B        C
If you want the order to match the input table, sort the merged table:
# sort newTable
newTable <- newTable[order(newTable$column1, newTable$column2, ...), ]
Merging (or joining) does not preserve the input row order.
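If all you need is one column aligned to OutputTable's existing row order, a lookup with match() sidesteps the reordering problem entirely. A minimal sketch with made-up tables (the column names other than PROPNUM are placeholders, not the asker's real schema):

```r
# Toy stand-ins: OutputTable keeps the row order we must preserve;
# InputTable holds the column we want, one row per PROPNUM
OutputTable <- data.frame(PROPNUM = c("A", "A", "C", "C", "B", "B"))
InputTable  <- data.frame(PROPNUM = c("A", "B", "C"),
                          Value   = c("valA", "valB", "valC"))

# match() returns, for each PROPNUM in OutputTable, the matching row
# index in InputTable, so the result stays in OutputTable's order
OutputTable$OutputColumn <- InputTable$Value[match(OutputTable$PROPNUM,
                                                   InputTable$PROPNUM)]
```

Unlike merge(), this never re-sorts the rows, so it is safe when the output order is significant.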
I am trying to merge two data sets that share a "Breed" column of dog breeds. data1 has dog traits and a score for each; data2 has the same breeds with their popularity rank in America from 2013-2020. When I try to merge the two data sets into one, it either shows NA for the 2013-2020 rank information or it shows duplicate rows for the same breed: one row with the data from data set 1 and another row with the data from data set 2. The closest I can get is by using merge(x, y, by = 'row.names', all = TRUE), which brings all the data in correctly but leaves two duplicated columns, Breed.x and Breed.y. I am looking for a way to end up with a single Breed column and all the data in correctly.
Here is the data I am using. breed_traits is data set 1, and breed_rank_all is data set 2 that I want to merge into breed_traits:
breed_traits <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_traits.csv')
trait_description <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/trait_description.csv')
breed_rank_all <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-01/breed_rank.csv')
This is the call that came closest to working, but it leaves the duplicated Breed.y column:
breed_total <- merge(breed_traits, breed_rank_all, by = c('row.names') , all =TRUE)
breed_total
I tried a left join as well, but it shows NA for the 2013-2020 ranks:
library(dplyr)
breed_traits |> left_join(breed_rank_all, by = c('Breed'))
This is another one I tried; it returns duplicated rows for the same breed:
merge(breed_traits, breed_rank_all, by = c('row.names', 'Breed'), all = TRUE)
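One thing worth checking before any join: in these TidyTuesday files the Breed values in breed_traits contain non-breaking spaces, so they never compare equal to the plain-space names in breed_rank_all, which is why the left join fills the rank columns with NA. A hedged sketch, assuming the mismatch is whitespace-only, using tiny stand-in tables rather than the real CSVs:

```r
library(dplyr)
library(stringr)

# Minimal stand-ins for the two tables: the trait table's Breed holds
# a non-breaking space (\u00a0), the rank table's a plain space
breed_traits   <- tibble(Breed = "Retrievers\u00a0(Labrador)", Affectionate = 5)
breed_rank_all <- tibble(Breed = "Retrievers (Labrador)", `2020 Rank` = 1)

# Normalize the key in both tables before joining: replace
# non-breaking spaces, then squeeze any runs of whitespace
clean_breed <- function(x) str_squish(str_replace_all(x, "\u00a0", " "))

breed_total <- breed_traits |>
  mutate(Breed = clean_breed(Breed)) |>
  left_join(breed_rank_all |> mutate(Breed = clean_breed(Breed)),
            by = "Breed")
```

Joining on the cleaned Breed key yields a single Breed column with the rank columns populated, with no need for the row.names workaround.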
In TABLE 1 (LEFT SIDE), I have a list of words in a column (WordEng), and each word has a unique ID in a parallel column (WordEngID).
In TABLE 2 (RIGHT SIDE), I have a list of word IDs, which are simply the unique IDs of the words in TABLE 1. I want to copy words from TABLE 1 into a column in TABLE 2 according to their unique IDs. In Table 1 no two words have the same ID, while in Table 2 the same ID can appear in multiple rows.
For example,
there is a word "cut" with ID "1807" in Table 1, and I want to copy it into the "words" column of every Table 2 row that has ID "1807" (five rows in this case).
You need this UPDATE statement:
UPDATE Table2
SET words = (SELECT t1.WordEng FROM Table1 t1 WHERE t1.WordEngID = Table2.WordEngID)
WHERE words IS NULL
If you want all the values of words updated, even those that are not NULL, remove the WHERE clause.
I am interested in inserting all missing rows into a data.table for a new range of values in 2 columns.
For example, dt1[,a] has some values from 1 to 5, as does dt1[,b], but I'd like not only all pairwise combinations to be present in columns a and b, but all combinations over a newly defined range, e.g. 1 to 7 instead.
# Example data.table
library(data.table)
dt1 <- data.table(a = c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4,4,5,5,5),
                  b = c(1,3,4,5,1,2,3,4,1,2,3,1,2,3,4,5,3,4,5),
                  c = sample(1:10, 19, replace = TRUE))
setkey(dt1,a,b)
# CJ in data.table will create all rows to ensure all
# pair wise combinations are present (using the nominated columns).
dt1[CJ(a,b,unique=T)]
The above is great but will only use the values already present in the nominated columns. I'd like the inserted rows to give me all combinations over a new, nominated range, e.g. 1 to 7. There would be 49 rows.
# the following is a temporary workaround
template <- data.table(a1=rep(1:7,each=7),b1=rep(1:7,7))
setkey(template,a1,b1)
full <- dt1[template]
Instead of the already existing values in the columns, we can pass a range of values into 'CJ'. Supplying 1:7 for both 'a' and 'b' gives all 49 combinations:
dt1[CJ(a = 1:7, b = 1:7, unique = TRUE)]
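A runnable sketch with a small made-up table, confirming that passing the nominated range for both columns to CJ produces the full 49-row grid:

```r
library(data.table)
set.seed(1)
# Toy table with gaps: only a few (a, b) pairs are present
dt1 <- data.table(a = c(1, 1, 2, 3, 5),
                  b = c(1, 4, 2, 3, 5),
                  c = sample(1:10, 5))
setkey(dt1, a, b)

# CJ over the nominated range 1:7 for both columns builds every
# combination in that range, not just combinations of existing values;
# missing pairs come back with c = NA
full <- dt1[CJ(a = 1:7, b = 1:7, unique = TRUE)]
nrow(full)  # 49
```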
I have created a subset of the pg_table_def table with table_name, col_name and data_type. I have also added a column active with 'Y' as the value for some of the rows. Let us call this table config. Table config looks like below:
table_name           column_name
interaction_summary  name_id
tag_transaction      name_id
interaction_summary  direct_preference
bulk_sent            email_image_click
crm_dm               web_le_click
Now I want to be able to map the table names from this table to the actual tables and fetch values for the corresponding columns. name_id will be the key here, which is available in all tables. My output should look like below:
name_id  direct_preference  email_image_click  web_le_click
1        Y                  1                  2
2        N                  1                  2
The solution needs to be dynamic, so that even if the table list grows tomorrow, the new tables are accommodated. Since I am new to Redshift, any help is appreciated. I am also considering doing the same via R using the dplyr package.
I understood that dynamic queries don't work with Redshift.
My objective was to pull any new table that comes in and use their columns for regression analysis in R.
I made this work by using the LISTAGG function and a concatenation operation, and then wrote the output to a dataframe in R. This dataframe has 'n' select queries as different rows.
Below is the format:
df <- as.data.frame(tbl(conn, sql("
  select 'select ' || col_names || ' from ' || table_name as q1
  from (
    select distinct table_name,
           listagg(col_name, ',') within group (order by col_name)
             over (partition by table_name) as col_names
    from attribute_config
    where active = 'Y'
    order by table_name
  )
  group by 1")))
Once done, I assigned every row of this dataframe to a new dataframe and fetched the output using below:
df1 <- tbl(conn,sql(df[1,]))
I know this is a roundabout solution, but it works! It fetches about 17M records in under 1 second.
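The per-row fetch can be looped so every generated query runs without assigning each row by hand. A sketch of that pattern; since I can't reach Redshift here, conn is a throwaway SQLite stand-in with toy tables, and the df of generated SELECT statements is written out literally rather than produced by the LISTAGG query:

```r
library(DBI)
library(RSQLite)

# Stand-in for the Redshift connection: an in-memory SQLite database
# with two toy tables keyed by name_id
conn <- dbConnect(SQLite(), ":memory:")
dbWriteTable(conn, "interaction_summary",
             data.frame(name_id = 1:2, direct_preference = c("Y", "N")))
dbWriteTable(conn, "bulk_sent",
             data.frame(name_id = 1:2, email_image_click = c(1, 1)))

# df mimics the dataframe of generated SELECT statements, one per table
df <- data.frame(q1 = c(
  "select name_id, direct_preference from interaction_summary",
  "select name_id, email_image_click from bulk_sent"))

# Run every generated query, then merge the results on the name_id key
results  <- lapply(df$q1, function(q) dbGetQuery(conn, q))
combined <- Reduce(function(x, y) merge(x, y, by = "name_id"), results)

dbDisconnect(conn)
```

The Reduce/merge step joins all per-table results into one wide dataframe on name_id, which is the shape the question asks for.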
I recently started to use the data.table package to identify values in a table's column that conform to some conditions. Although I managed to get most things done, now I'm stuck on this problem:
I have a data table, table1, in which the first column (labels) is a group ID and the second column, o.cell, is an integer. The key is on "labels".
I have another data table, table2, containing a single column: "cell".
Now I'm trying to find, for each group in table1, the values from the column "o.cell" that are also in the "cell" column of table2. table1 has some 400K rows divided into 800+ groups of unequal sizes. table2 has about 1.3M rows of unique cell numbers. Cell numbers in the "o.cell" column of table1 can appear in more than one group.
This seems like a simple task, but I can't find the right way to do it. Depending on how I structure my call, it either gives me a different result than I expect, or it never completes and I have to kill the R session because it's frozen (my machine has 24 GB of RAM).
Here's an example of one of the variants of the call I have tried:
overlap <- table1[, list(over.cell =
o.cell[!is.na(o.cell) & o.cell %in% table2$cell]),
by = labels]
I'm pretty sure this is the wrong way to use data tables for this task, and on top of that I can't get the result I want.
I will greatly appreciate any help. Thanks.
Sounds like this is your setup:
library(data.table)
dt1 = data.table(labels = c('a','b'), o.cell = 1:10)
dt2 = data.table(cell = 4:7)
Then what you want is a simple merge:
setkey(dt1, o.cell)
dt1[dt2]
# o.cell labels
#1: 4 b
#2: 5 a
#3: 6 b
#4: 7 a
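With current data.table versions the same join can be written ad hoc with on=, without setting a key first; nomatch = NULL drops cells that have no match instead of returning NA rows. A sketch, redefining the toy tables inline so it runs on its own:

```r
library(data.table)
dt1 <- data.table(labels = c('a', 'b'), o.cell = 1:10)
dt2 <- data.table(cell = 4:7)

# Ad-hoc join: match dt2$cell against dt1$o.cell without a setkey();
# nomatch = NULL drops unmatched cells rather than padding with NA
overlap <- dt1[dt2, on = .(o.cell = cell), nomatch = NULL]
```

This preserves dt1's original row order, which a keyed join would have re-sorted by o.cell.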