I've got a data.table DT that I'd like to write to DB2 and update using the ibmdbR package.
I upload the first batch using as.ida.data.frame.
> DT<- data.table(A = c(111,222,333,444), MONTH= c('2018-01', '2018-02', '2018-03', '2018-04'), B= c(11,22,33,44))
> DT
A MONTH B
1: 111 2018-01 11
2: 222 2018-02 22
3: 333 2018-03 33
4: 444 2018-04 44
> db2_test <- as.ida.data.frame(DT, table='myschema.TEST', clear.existing=FALSE, case.sensitive=FALSE,
rownames=NULL, dbname='DB_NAME', asAOT=FALSE)
This creates a DB2 table named TEST in my schema in the database.
Then I try to update TEST based on column MONTH using another data.table DT2 by doing:
> DT2 <- data.table(A = c(999,888), MONTH = c('2018-01', '2019-02'), B = c(99,77))
> DT2
A MONTH B
1: 999 2018-01 99
2: 888 2019-02 77
> idaUpdate(myconnection, updf = 'myschema.TEST', dfrm = DT2, idaIndex = 'MONTH')
Error in sqlUpdate(db2Conn, dfrm, updf, index = idaIndex, fast = ifelse(idaIsOracleMode(), :
[RODBC] Failed exec in Update02000 100 [IBM][CLI Driver][DB2/NT64] SQL0100W No row was found for FETCH, UPDATE or DELETE; or the result of a query is an empty table. SQLSTATE=02000
Although I receive this error, when I look at the data in the TEST table in the DB2 database, the first entry has changed, which is expected:
A MONTH B
999 2018-01 99
222 2018-02 22
333 2018-03 33
444 2018-04 44
So I think the error comes from the second entry in DT2: there are no rows in TEST with MONTH = '2019-02', so it fails.
But I thought the point of updating with an index column was to replace the rows that match on the index column and add the rows that don't?
How can I update TEST properly with DT2, so that rows are updated when the MONTH already exists, but new rows are added when no rows in TEST match the MONTH column of DT2?
Basically, how can I append data properly to a DB2 table from an R object?
I never had issues with AWS. DB2 is a nightmare.
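For what it's worth, here is a minimal sketch of the workaround I would try; as far as I know ibmdbR has no built-in upsert, so this is an assumption rather than the package's documented method. The idea is to split DT2 into months that already exist in TEST (update those with idaUpdate) and months that don't (insert those with plain SQL over the same RODBC channel, assumed to be the `myconnection` returned by idaConnect). The column names A, MONTH and B are taken from the example above.
library(data.table)
library(RODBC)  # ibmdbR connects through RODBC, so the same channel can run plain SQL
# Which MONTH values already exist in the target table?
existing <- idaQuery("SELECT DISTINCT MONTH FROM myschema.TEST")$MONTH
to_update <- DT2[MONTH %in% existing]   # keys that already exist -> update
to_append <- DT2[!MONTH %in% existing]  # new keys -> insert
# Update only the rows that have a matching MONTH, so idaUpdate() never fails
if (nrow(to_update) > 0)
  idaUpdate(myconnection, updf = 'myschema.TEST', dfrm = to_update, idaIndex = 'MONTH')
# Insert the genuinely new rows with plain SQL
for (k in seq_len(nrow(to_append)))
  sqlQuery(myconnection, sprintf(
    "INSERT INTO myschema.TEST (A, MONTH, B) VALUES (%s, '%s', %s)",
    to_append$A[k], to_append$MONTH[k], to_append$B[k]))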
Related
The main table is large.
It has certain undesired values that I want to override.
I am writing the keys and the new value (NA) to override into a lookup table.
Both have 2 keys (session_id and datetime), not one unique key.
Other similar questions go into replacing an NA with a value, but I want to replace a value with an NA, i.e. clear the cells' contents.
The 2 keys limit the use of match(), which can handle only one key and only first occurrences.
left_join or merge operations would create a new large dataframe with an added column, fill it with NA for every non-matching row, and then require some 'coalescing' into an NA value, which I guess doesn't exist.
I don't want to remove the entire row, as there are many other columns with their own values. I just want to delete the value from those cells.
I think that, in short, it is just an assignment operation to a filtered subset based on the 2 keys. Something like:
table[ lookup_paired_keys(session_ids, lookup_datetimes) ] <- NA
Below is a sample dataset with undesired 0 values to replace with NA. The real dataset may contain other kinds of values.
table <- read.table(text = "
session_id datetime CaloriesDaily
1233815059 2016-05-01 5555
8583815123 2016-05-03 4444
8512315059 2016-05-04 2432
8583815059 2016-05-12 0
6290855005 2016-05-10 0
8253242879 2016-04-30 0
1503960366 2016-05-20 0
1583815059 2016-05-19 2343
8586545059 2016-05-20 1111
1290855005 2016-05-11 5425
1253242879 2016-04-25 1234
1111111111 2016-05-09 6542", header = TRUE)
table$datetime = as.POSIXct(table$datetime, tz='UTC')
table
lookup <- read.table(text = "
session_id datetime CaloriesDaily
8583815059 2016-05-12 NA
6290855005 2016-05-10 NA
8253242879 2016-04-30 NA
1503960366 2016-05-12 NA", header = TRUE)
lookup$datetime = as.POSIXct(lookup$datetime, tz='UTC')
lookup$CaloriesDaily = as.numeric(lookup$CaloriesDaily)
lookup
SOLVED
After reading the accepted answer, I want to share the final outcome.
Since my main table is a data.table and I got some warnings regarding nomenclature, be aware that I am no expert, but this works with the example dataset above and with my own.
lookup_by : Standard Lookup operation
lookup_by <- function(table, lookup, by) {
merge( table, lookup, by=by )
}
### usage ###
keys = c('session_id','datetime')
lookup_by( table, lookup, keys)
Adopted solution: match_by
Like match() but with keys.
It returns vectors with the row numbers where the keys match.
So that an assignment like table[ ..matches.. ] <- NA is possible.
match_by <- function(table, lookup, by) {
  # keep only the key columns and remember the original row numbers
  table <- setDT(table)[, ..by]
  table$idx1 <- 1:nrow(table)
  lookup <- setDT(lookup)[, ..by]
  lookup$idx2 <- 1:nrow(lookup)
  # inner join on the keys; the surviving idx pairs are the matching rows
  m <- merge(table, lookup, by = by)
  return(m[, c('idx1', 'idx2')])
}
### usage ###
keys = c('session_id','datetime')
rows = match_by( table, lookup, keys)
overrides <- c(lookup[ rows$idx2, 'CaloriesDaily' ])
table[ rows$idx1, 'CaloriesDaily' ] <- overrides
table
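For reference, a possible data.table-native alternative (not part of the accepted answer): an update join assigns NA on exactly the rows whose two keys match the lookup, by reference, without building index vectors first.
library(data.table)
# For rows of `table` that match `lookup` on both keys, overwrite CaloriesDaily with NA
setDT(table)[lookup, on = .(session_id, datetime), CaloriesDaily := NA]
table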
Here’s a solution using dplyr::semi_join() and dplyr::anti_join() to split your dataframe based on whether the id and date keys match your lookup table. I then assign NAs in just the subset with matching keys, then row-bind the subsets back together. Note that this solution doesn’t preserve the original row order.
library(dplyr)
table_ok_vals <- table %>%
  anti_join(lookup, by = c("session_id", "datetime"))

table_replaced_vals <- table %>%
  semi_join(lookup, by = c("session_id", "datetime")) %>%
  mutate(CaloriesDaily = NA_real_)

table <- bind_rows(table_ok_vals, table_replaced_vals)
table
Output:
session_id datetime CaloriesDaily
1 1233815059 2016-05-01 5555
2 8583815123 2016-05-03 4444
3 8512315059 2016-05-04 2432
4 1503960366 2016-05-20 0
5 1583815059 2016-05-19 2343
6 8586545059 2016-05-20 1111
7 1290855005 2016-05-11 5425
8 1253242879 2016-04-25 1234
9 1111111111 2016-05-09 6542
10 8583815059 2016-05-12 NA
11 6290855005 2016-05-10 NA
12 8253242879 2016-04-30 NA
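If preserving the original row order matters, a possible alternative (assuming dplyr >= 1.1.0, which added the unmatched argument) is dplyr::rows_update(), which patches the matching rows in place:
library(dplyr)
# Patch matching rows with the lookup values (here NA) and keep the row order;
# unmatched = "ignore" skips lookup keys that have no counterpart in `table`
table <- rows_update(table, lookup,
                     by = c("session_id", "datetime"),
                     unmatched = "ignore")
table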
I have a dataframe price1 in R that has four columns:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
There are ten Car names in all in price1, so the above is just to give an idea of the structure. Each car name should have 54 observations corresponding to 54 weeks. But there are some weeks for which no observation exists (e.g., Weeks 3 and 4 in the above case). For these missing weeks, I need to plug in information from another dataframe, price2:
Name AveragePrice AverageRebate
Car 1 20000 500
Car 2 20000 400
Car 3 20000 400
---- ---- ---
Car 10 20400 450
So, I need to identify the missing weeks for each Car name in price1, capture the row corresponding to that Car name in price2, and insert a row into price1 for each missing week. I just can't wrap my head around a possible approach, so unfortunately I do not have a code snippet to share. Most of my searches on SO lead me to answers about handling missing values, which is not what I am looking for. Can someone help me out?
I am also indicating the desired output below:
Name Week Price Rebate
Car 1 1 20000 500
Car 1 2 20000 400
Car 1 3 20200 410
Car 1 4 20300 420
Car 1 5 20000 400
---- -- ---- ---
Car 1 54 20400 450
---- -- ---- ---
Car 10 54 21400 600
Note that the output now has Car 1 info for Weeks 3 and 4, which I should fetch from price2. The final output should contain 54 observations for each of the 10 car names, so 540 rows in total.
Try this, good luck:
library(data.table)
carNames <- paste('Car', 1:10)
df <- data.table(Name = rep(carNames, each = 54), Week = rep(1:54, times = 10))
df <- merge(df, price1, by = c('Name', 'Week'), all.x = TRUE)
df <- merge(df, price2, by = 'Name', all.x = TRUE)
df[, `:=`(Price  = ifelse(is.na(Price),  AveragePrice,  Price),
          Rebate = ifelse(is.na(Rebate), AverageRebate, Rebate))]
df[, 1:4]
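Not from the original answer, but a possibly tidier variant of the same fill-in: update only the missing rows by reference instead of running ifelse() over the whole columns.
# Fill the NAs created by the left joins directly, by reference
df[is.na(Price),  Price  := AveragePrice]
df[is.na(Rebate), Rebate := AverageRebate]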
So if I understand your problem correctly, you basically have 2 dataframes and you want to make sure the dataframe "price1" has the correct row values (the names of the cars) in the 'Name' column?
Here's what I would do, but it probably isn't the optimal way:
# create a loop with length = number of rows in your frame
for (i in 1:nrow(price1)) {
  # check if the value is NA
  if (is.na(price1$Price[i])) {
    # if it is NA, replace it with the corresponding values in price2
    j <- match(price1$Name[i], price2$Name)
    price1$Price[i]  <- price2$AveragePrice[j]
    price1$Rebate[i] <- price2$AverageRebate[j]
  }
}
Hope this helps (:
If I understand your question correctly, you only want to see what is in the 2nd table and not in the first. You will just want to use an anti_join. Note that the order you feed the tables into the anti_join matters.
library(tidyverse)
complete_table <-
  price2 %>%
  anti_join(price1)
To expand your first table to cover all 54 weeks, use complete(), or you can even fudge it and right_join a table that you purposely build with all 54 weeks in it. Then anything that doesn't join to this second table gets an NA in that column.
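A rough sketch of that complete()-then-fill idea (column names taken from the question; Week = 1:54 is an assumption about the full week range):
library(dplyr)
library(tidyr)
price1 %>%
  complete(Name, Week = 1:54) %>%                  # one row per Name x Week
  left_join(price2, by = "Name") %>%               # bring in the averages
  mutate(Price  = coalesce(Price,  AveragePrice),  # fill the gaps
         Rebate = coalesce(Rebate, AverageRebate)) %>%
  select(Name, Week, Price, Rebate)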
Sorry that my question is a little vague. I have two separate datasets (data1 as the first one and data2 as the second one) as follows:
Data 1:
Area Yr AllRev Totalcalls
A 2012 1021597.78 835
B 2013 1002968.21 833
c 2014 730345.93 65
d 2015 251956.26 232
e 2012 22408.71 25
...
Data 2:
Yr TotRev TotCalls
2012 160038596.0 131064
2013 399750664.0 312651
...
Now I want to add a column "RevPercent" to data1 which will calculate the following value for each row:
100*data1$AllRev/data2$TotRev
However, if Yr == 2012 in data1, I want it to read TotRev for 2012 from data2 to calculate the aforementioned value. I wrote the following line of code, but I definitely am getting an error:
data1 <- cbind(data1,100*round(data1[,3]/data2[data2[,1]==data2[,2],2],4))
And the error is as follows:
In data2[, 1] == data2[,2] :
longer object length is not a multiple of shorter object length
Any help is appreciated.
Thanks
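Not an answer from this thread, but a minimal sketch of one way to get a year-matched denominator: look up each row's Yr in data2 with match(), then divide by the matching TotRev (rounding to 4 digits as in the attempt above).
data1$RevPercent <- round(100 * data1$AllRev / data2$TotRev[match(data1$Yr, data2$Yr)], 4)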
I'm having trouble applying a simple data.table join example to a larger (10GB) data set. merge() works just fine on data.frames with the larger dataset, although I'd love to take advantage of the speed in data.table. Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Here is the simple example (derived from this thread: Join of two data.tables fails).
# The data of interest.
(DT <- data.table(id = c(rep(1154:1155, 2), 1160),
price = c(1.99, 2.50, 15.63, 15.00, 0.75),
key = "id"))
id price
1: 1154 1.99
2: 1154 15.63
3: 1155 2.50
4: 1155 15.00
5: 1160 0.75
# Lookup table.
(lookup <- data.table(id = 1153:1160,
version = c(1,1,3,4,2,1,1,2),
yr = rep(2006, 4),
key = "id"))
id version yr
1: 1153 1 2006
2: 1154 1 2006
3: 1155 3 2006
4: 1156 4 2006
5: 1157 2 2006
6: 1158 1 2006
7: 1159 1 2006
8: 1160 2 2006
# The desired table. Note: lookup[DT] works as well.
DT[lookup, allow.cartesian = T, nomatch=0]
id price version yr
1: 1154 1.99 1 2006
2: 1154 15.63 1 2006
3: 1155 2.50 3 2006
4: 1155 15.00 3 2006
5: 1160 0.75 2 2006
The larger data set consists of two data.frames: temp.3561 (the dataset of interest) and temp.versions (the lookup dataset). They have the same structure as DT and lookup (above), respectively. Using merge() works well, however my application of data.table is clearly flawed:
# Merge data.frames: works just fine
long.merged <- merge(temp.versions, temp.3561, by = "id")
# Convert the data.frames to data.tables
DTtemp.3561 <- as.data.table(temp.3561)
DTtemp.versions <- as.data.table(temp.versions)
# Merge the data.tables: doesn't work
setkey(DTtemp.3561, id)
setkey(DTtemp.versions, id)
DTlong.merged <- merge(DTtemp.versions, DTtemp.3561, by = "id")
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), :
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate
key values in i, each of which join to the same group in x over and over again. If that's ok,
try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the
large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-
help for advice.
DTtemp.versions has the same structure as lookup (in the simple example), and the key "id" consists of 779,473 unique values (no duplicates).
DTtemp.3561 has the same structure as DT (in the simple example) plus a few other variables, but its key "id" only has 829 unique values despite the 7,946,667 observations (lots of duplicates).
Since I'm just trying to add version numbers and years from DTtemp.versions to each observation in DTtemp.3561, the merged data.table should have the same number of observations as DTtemp.3561 (7,946,667). Specifically, I don't understand why merge() generates "excess" observations when using data.table but not when using data.frame.
Likewise
# Same error message, but with 12,055,777 observations
altDTlong.merged <- DTtemp.3561[DTtemp.versions]
# Same error message, but with 11,277,332 observations
alt2DTlong.merged <- DTtemp.versions[DTtemp.3561]
Including allow.cartesian=T and nomatch=0 doesn't drop the "excess" observations.
Oddly, if I truncate the dataset of interest to have 10 observations, merge() works fine on both data.frames and data.tables.
# Merge short DF: works just fine
short.3561 <- temp.3561[-(11:7946667),]
short.merged <- merge(temp.versions, short.3561, by = "id")
# Merge short DT
DTshort.3561 <- data.table(short.3561, key = "id")
DTshort.merged <- merge(DTtemp.versions, DTshort.3561, by = "id")
I've been through the FAQ (http://datatable.r-forge.r-project.org/datatable-faq.pdf, and 1.12 in particular). How would you suggest thinking about this?
Could anyone point out what I'm misunderstanding about data.table (and the error message in particular)?
Taking your question directly. The error message
Join results in 11277332 rows; more than 7946667 = max(nrow(x),nrow(i)). Check for duplicate key values in i...
states that the result of your join has more rows than the usual case expects. This means the lookup table's key has duplicates, which results in multiple matches in the join.
If this doesn't answer your question, you should restate it.
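A quick way to confirm where the duplication actually is (a sketch using the object names from the question):
library(data.table)
# ids that appear more than once in the lookup table
DTtemp.versions[, .N, by = id][N > 1L]
# compare the number of unique ids with the number of rows
uniqueN(DTtemp.versions$id)
nrow(DTtemp.versions)
# if the lookup really should be one row per id, deduplicate it before joining:
# unique(DTtemp.versions, by = "id")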
Working through an R tutorial that I'm having a hard time understanding.
The directory is a folder with numerous CSV files. The function takes as id one or more of the files and returns the number of records in each.
My function:
complete <- function(directory, id = 1:332) {
  csvfiles <- sprintf("/Users/myname/Desktop/%s/%03d.csv", directory, id)
  nrows <- sapply(csvfiles, function(f) nrow(read.csv(f)))
  data.frame(ID = sprintf('%03d', id),
             countrows = sapply(csvfiles, function(x) length(count.fields(x))),
             row.names = id)
}
Then complete("specdata", 100:105)
Returns
ID countrows
100 100 1097
101 101 731
102 102 1462
103 103 3653
104 104 2558
105 105 2192
What must I do so that the left-most column is a sequence starting at 1? So that, for example, the first record would be 1, 100 & 1097, and the second record 2, 101 & 731?
The first apparent column is just the names of the rows (look at e.g. ncol(specdata)). You can rename rows as follows:
row.names(specdata) <- 1:nrow(specdata)
Inside the function, use this inside the data.frame() call:
row.names = 1:length(id)
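For clarity, the data.frame() call from the question with that change applied would look like this:
data.frame(ID = sprintf('%03d', id),
           countrows = sapply(csvfiles, function(x) length(count.fields(x))),
           row.names = 1:length(id))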