result_create(conn@ptr, statement) : Result too large - r

t1_DA <- sqldf("select decile,
count(decile) as count, avg(pred_spent) as avg_pred_spent,
avg(exp(total_spent)) as avg_total_spent,
avg(log(pred_spent)) as ln_avg_pred_spent,
avg(total_spent) as ln_avg_total_spent
from t1
group by decile
order by decile desc")
I am running a linear regression on a file, and when I run this part of the code I get the following error:
Error in result_create(conn@ptr, statement) : Result too large
Is there any way to overcome this error?

As mentioned, by default sqldf uses the SQLite dialect, which does not support extensive mathematical and statistical functions such as exp and log. Admittedly, a clearer error message than Result too large would help users debug (perhaps worth an issue for the author, @ggrothendieck?).
However, to integrate these outputs into your aggregate query, consider creating those columns before running sqldf. Either transform or within lets you assign new columns without repeatedly referencing the data frame with the $ assignment approach.
t1 <- transform(t1, exp_total_spent = exp(total_spent),
                    # natural log of the original column, matching the ln_ alias
                    log_pred_spent = log(pred_spent)
)
# ALTERNATIVE
t1 <- within(t1, {exp_total_spent <- exp(total_spent)
                  log_pred_spent <- log(pred_spent)
})
t1_DA <- sqldf("select decile,
count(decile) as count,
avg(pred_spent) as avg_pred_spent,
avg(exp_total_spent) as avg_total_spent,
avg(log_pred_spent) as ln_avg_pred_spent,
avg(total_spent) as ln_avg_total_spent
from t1
group by decile
order by decile desc")
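As a side check before precomputing, and purely as an assumption that the answer above does not confirm: the wording Result too large resembles a C library range-error message, so it may be worth looking for rows where exp(total_spent) overflows or where pred_spent is not positive. A minimal sketch in base R, using a made-up t1 whose values are hypothetical (only the column names come from the question):

```r
# Hypothetical toy data; only the column names come from the question.
t1 <- data.frame(
  decile      = c(1, 1, 2, 2),
  total_spent = c(2.5, 3.1, 800, 4.0),  # 800 is deliberately too large for exp()
  pred_spent  = c(12, 0, 30, 45)        # 0 has no finite logarithm
)

# Rows where exp() overflows to Inf in double precision
overflow_rows <- which(!is.finite(exp(t1$total_spent)))

# Rows where log() would return -Inf or NaN
bad_log_rows <- which(t1$pred_spent <= 0)

overflow_rows
bad_log_rows
```

If either vector is non-empty, those rows may need to be filtered or capped before the query, regardless of whether exp and log are computed in R or in SQL.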

Related

R - passing variables to ddply, referring to columns by name or index, quick syntax problem

I am aware there are threads out there on similar problems, but after hours of trying I couldn't get my script to run.
I would like to run ddply within a custom function, passing the column names to it as variables.
I have tried referencing the column either by its index or by its string name, but no luck:
colnum2 <- which(colnames(z) == "TAZ")
colnum2 <- "TAZ"
output <- ddply(df, .(AColumnNameThatIsFixed), summarise,
ANewColInSummary = sum(.(colnum2) == 0))
This does not run
hey <- ddply(z, .(AColumnNameThatIsFixed), summarise,
ANewColInSummary = sum(z[[colnum2]] == 0))
This does not give the right result as it does not evaluate per row.
The data and summary is obviously more complex but this shows the fundamental problem.

How to use aggregate( ) to count NA values and using tapply() as an alternative

I am new to R and trying to prepare for an exam in R which will take place in one week.
On one of the homework questions, I am trying to solve a single problem in as many ways as possible (having more tools always comes in handy in a time-constrained coding exam).
The problem is the following. In my dataset, "ckm_nodes.csv":
The variable adoption date records the month in which the doctor began prescribing tetracycline, counting from November 1953. If the doctor did not begin prescribing it by month 17, i.e. February 1955, when the study ended, this is recorded as Inf. If it's not known when or if the doctor adopted tetracycline, their value is NA. Answer the following. (a) How many doctors began prescribing tetracycline in each month of the study? (b) How many never prescribed it during the study? (c) How many are NAs?
I was trying to use the aggregate( ) function to count the number of doctors starting to prescribe in each month. My base code is:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length),
which works except for the NA values.
I wondered whether there is a way to make the aggregate function count the NA values, so I read the R documentation on the aggregate( ) function, which says the following:
na.action
a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
So I googled how to solve this problem and set "na.action = NULL". However, when I run this code, here is what happens:
aggregate(nodes$adoption_date, by = nodes["adoption_date"], length, na.action = NULL)
Error in FUN(X[[i]], ...) :
2 arguments passed to 'length' which requires 1
I tried reordering the arguments:
aggregate(nodes$adoption_date, length, by = nodes["adoption_date"], na.action = NULL)
Error in FUN(X[[i]], ...) :
2 arguments passed to 'length' which requires 1
But it doesn't work either.
Any idea how to fix this?
***************** tapply()
Additionally, I was wondering if one can use the "tapply" function to solve Q1 on the homework. I tried
count <- function(data){
return(length(data$adoption_date))
}
count_tetra <- tapply(nodes,nodes$adoption_date,count)
Error in tapply(nodes, nodes$adoption_date, count) : arguments must have same length
************** loops
I am also wondering how I can use a loop to achieve the same goal.
I can start by sorting the vector:
nodes_sorted <- nodes[order(nodes$adoption_date),]
Then, write a for loop, but how...?
The goal is to get a vector count, where each element of count corresponds to a value for the number of prescriptions.
Thanks!
Example data:
nodes <- data.frame(
adoption_date = rep(c(1:17,NA,Inf), times = c(rep(5,17),20,3))
)
Have you looked at data.table? I believe something like this does the trick.
library(data.table)
# convert nodes to data.table
setDT(nodes)
# count occurrences for each value of adoption_date (NA and Inf get their own rows)
nodes[, .N, by = adoption_date]
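For the base R routes the question asks about, here is a sketch under the assumption that NA should be counted as its own group. table() accepts a useNA argument, and tapply() expects a vector (not the whole data frame) plus a grouping factor, where addNA() turns NA into a regular level. The data is the example from the question; the variable names are illustrative.

```r
nodes <- data.frame(
  adoption_date = rep(c(1:17, NA, Inf), times = c(rep(5, 17), 20, 3))
)

# (a)-(c) in one call: one count per month, plus Inf and NA
counts_table <- table(nodes$adoption_date, useNA = "ifany")

# tapply needs a vector as its first argument; addNA() keeps the NA group
grp <- addNA(factor(nodes$adoption_date))
counts_tapply <- tapply(nodes$adoption_date, grp, length)

# Direct answers to (b) and (c)
n_never <- sum(is.infinite(nodes$adoption_date))
n_na    <- sum(is.na(nodes$adoption_date))

counts_table
```

The original tapply attempt failed because its first argument was the data frame itself, whose length is its number of columns rather than its number of rows.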

LHS:RHS vs functional in data.table

Why does the functional ':=' form not aggregate unique rows using 'by', while the LHS:RHS form does aggregate using 'by'? Below is a .csv file of 20 rows of data with 58 variables. A simple copy and paste with delim = .csv works. I am still trying to find the best way to post sample data to SO. The 2 variants of my code are:
prodMatrix <- so.sample[, ':=' (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does not aggregate the rowID using by---
prodMatrix <- so.sample[, (Count = .N), by = eval(names(so.sample)[2:28])]
---this version does aggregate the rowID using by---
"CID","NetIncome_length_Auto Advantage","NetIncome_length_Certificates","NetIncome_length_Comm. Share Draft","NetIncome_length_Escrow Shares","NetIncome_length_HE Fixed","NetIncome_length_HE Variable","NetIncome_length_Holiday Club","NetIncome_length_IRA Certificates","NetIncome_length_IRA Shares","NetIncome_length_Indirect Balloon","NetIncome_length_Indirect New","NetIncome_length_Indirect RV","NetIncome_length_Indirect Used","NetIncome_length_Loanline/CR","NetIncome_length_New Auto","NetIncome_length_Non-Owner","NetIncome_length_Personal","NetIncome_length_Preferred Plus Shares","NetIncome_length_Preferred Shares","NetIncome_length_RV","NetIncome_length_Regular Shares","NetIncome_length_S/L Fixed","NetIncome_length_S/L Variable","NetIncome_length_SBA","NetIncome_length_Share Draft","NetIncome_length_Share/CD Secured","NetIncome_length_Used Auto","NetIncome_sum_Auto Advantage","NetIncome_sum_Certificates","NetIncome_sum_Comm. Share Draft","NetIncome_sum_Escrow Shares","NetIncome_sum_HE Fixed","NetIncome_sum_HE Variable","NetIncome_sum_Holiday Club","NetIncome_sum_IRA Certificates","NetIncome_sum_IRA Shares","NetIncome_sum_Indirect Balloon","NetIncome_sum_Indirect New","NetIncome_sum_Indirect RV","NetIncome_sum_Indirect Used","NetIncome_sum_Loanline/CR","NetIncome_sum_New Auto","NetIncome_sum_Non-Owner","NetIncome_sum_Personal","NetIncome_sum_Preferred Plus Shares","NetIncome_sum_Preferred Shares","NetIncome_sum_RV","NetIncome_sum_Regular Shares","NetIncome_sum_S/L Fixed","NetIncome_sum_S/L Variable","NetIncome_sum_SBA","NetIncome_sum_Share Draft","NetIncome_sum_Share/CD Secured","NetIncome_sum_Used Auto","totNI","Count","totalNI"
93,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,212.97,0,0,0,-71.36,0,0,0,49.01,0,0,67.42,6,404.52
114,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,4,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-285.44,0,0,0,49.01,0,0,-221.89,90,-19970.1
1112,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,-101.55,0,-71.36,0,0,0,98.02,0,0,-14.66,28,-410.48
5366,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
6078,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,7,0,0,0,1,0,0,0,0,0,0,0,0,-17.44,0,0,0,0,0,0,0,14.54,0,0,0,0,0,-499.52,0,0,0,49.01,0,0,-453.41,3,-1360.23
11684,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
47358,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,-14.43,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-85.79,3194,-274013.26
193761,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-71.36,0,0,0,49.01,0,0,-123.9,9973,-1235654.7
232530,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
604897,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1021309,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1023633,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1029726,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,60.23,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,37.88,8688,329101.44
1040005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1040092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,49.01,0,0,-22.35,77631,-1735052.85
1064453,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,14.54,0,212.97,0,0,0,-142.72,0,0,0,0,0,0,84.79,49,4154.71
1067508,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,-123.2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-194.56,4162,-809758.72
1080303,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-71.36,0,0,0,0,0,0,-71.36,43262,-3087176.32
1181005,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,2,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-142.72,0,0,0,98.02,0,0,-146.25,614,-89797.5
1200484,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-101.55,0,-285.44,0,0,0,0,0,0,-386.99,50,-19349.5
Because := operates by reference. This means it does not make an in-memory copy of your dataset but updates it in place.
An aggregation of your dataset is a copy, distinct from its original unaggregated form.
You can read more about this in the Reference semantics vignette.
It is a design concept in data.table that := is used to update by reference, while the other forms (.(), list() or a direct expression) are used to query data, and a query is not a by-reference operation. A by-reference operation cannot aggregate rows; it can only calculate aggregates and put them into the dataset in place. A query can aggregate the dataset because the query result is not the same object in memory as the original data.table.
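The difference can be seen on a small table. This is a minimal sketch assuming the data.table package is installed; dt, grp and Count are made-up names for illustration, not the columns from the question.

```r
library(data.table)

# Hypothetical six-row table with three groups
dt <- data.table(id = 1:6, grp = c("a", "a", "b", "b", "b", "c"))

# Query form: returns a new, aggregated data.table with one row per group
agg <- dt[, .(Count = .N), by = grp]

# By-reference form: adds Count to dt in place, keeping all six rows
dt[, Count := .N, by = grp]

nrow(agg)  # one row per group
nrow(dt)   # unchanged row count
```

If the aggregated shape is wanted after a by-reference update, something like unique(dt[, .(grp, Count)]) would be one way to collapse the rows.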

How to pass the name of dataframe inside a list to sqldf?

I am annotating several data frames using the sqldf package.
The annotation data is in a data frame, annot.
I use INNER JOIN to select the corresponding information by id value.
To automate the process, I wrote the code below:
prepareAnnot <- function(x){
annoted <- sqldf("SELECT x.*,
annot.*
FROM x INNER JOIN annot
ON x.id = annot.id;")
return(annoted)}
I put 5 data frames (A,B,C,D,E) into a list and want to apply the prepareAnnot function
and save the annotated data in a new data frame with the suffix "annotated":
myresults <- list(A=A,B=B,C=C,D=D,E=E)
for (i in seq_along(myresults)){
assign (paste(names(myresults)[i],"annotated",sep="_"),prepareAnnot(myresults[i]))
}
However, it seems the prepareAnnot function cannot recognize the data frame name in my list,
and I get the error message below:
Error in sqliteExecStatement(con, statement, bind.data) :
RS-DBI driver: (error in statement: no such table: x)
How should I correctly pass the data frame name inside the list to the function ?
I cannot replicate your error. Also, it's not a good idea to use assign() like that. If you have a bunch of variables that are related, it's better just to keep them in a list so that you can run vectorized operations over them easily. Here's a working example
annot <- data.frame(id=1:10, n=letters[1:10])
prepareAnnot <- function(x) {
sqldf("select x.*, annot.n from x INNER JOIN annot ON x.id = annot.id")
}
myresults <- list(A=data.frame(id=1:3), B=data.frame(id=4:7))
annotated <- lapply(myresults, prepareAnnot)
annotated
Tested with "sqldf_0.4-7.1".
I can get that same error if one of the elements in myresults is not a data.frame. Be sure to check
sapply(myresults, class)
to see that they are all proper data.frames.
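One way a non-data.frame can reach the function, offered as a likely explanation rather than something reproduced here: in the question's loop, myresults[i] uses single brackets, which return a one-element list, whereas myresults[[i]] returns the data frame itself. A short base R sketch, reusing the small list from the example above:

```r
myresults <- list(A = data.frame(id = 1:3), B = data.frame(id = 4:7))

# Single brackets keep the list wrapper around the element
single_class <- class(myresults[1])

# Double brackets extract the element itself
double_class <- class(myresults[[1]])

single_class
double_class
```

lapply passes each element as if extracted with double brackets, which may be why the list-based version works where the assign() loop did not.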

Batch-rename variables in R without For loop

I have a table of survey questions:
ques <- data.frame(stringsAsFactors=F,
question_id=c("q1","q2"),
question_nm=c("computer_hrs","exercise_hrs"),
question_txt=c("How many hours did you spend last week on a computer?",
"How many hours did you spend last week on exercise?")
)
The question_nm is a short description string, already checked to be valid as a variable name.
I have a table of responses:
resp <- data.frame(stringsAsFactors=F,
respondent_id=c(1,2),
respondent_nm=c("Joe Programmer","Jill Athlete"),
q2=c(1,100), #column order in resp not guaranteed same as row order in ques
q1=c(100,1)
)
In order to have meaningful response variable names I wish to replace the names q1 and q2 with computer_hrs and exercise_hrs.
Note that you would get the wrong answer with:
names(resp)[ques$question_id %in% names(resp)] <- ques$question_nm #wrong
due to the column order in responses not matching the row order in questions. (I know I could fix that by sorting each order.)
I can do this with a for loop...
for (q in ques$question_id){
names(resp)[names(resp)==q] <- ques$question_nm[ques$question_id==q]
}
... but given a function that returned the mapping of the elements of ques$question_id to names(resp), similar to %in% but returning the position rather than T/F, I could do it without the For loop. Unfortunately, the only way I know to write that function involves a For loop.
Is there a way to accomplish this replacement without the loop?
Try:
names(resp)[match(ques[,1], names(resp))] <- ques$question_nm
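To see why this works: match() returns, for each question_id, its position in names(resp), which is the position-returning counterpart of %in% that the question asks for. The sketch below uses the question's data (question_txt omitted for brevity) and adds a guard for ids that are missing from resp; that guard is an addition for illustration, not part of the answer.

```r
ques <- data.frame(stringsAsFactors = FALSE,
  question_id = c("q1", "q2"),
  question_nm = c("computer_hrs", "exercise_hrs")
)
resp <- data.frame(stringsAsFactors = FALSE,
  respondent_id = c(1, 2),
  respondent_nm = c("Joe Programmer", "Jill Athlete"),
  q2 = c(1, 100),
  q1 = c(100, 1)
)

# Position of each question_id among the column names of resp
idx <- match(ques$question_id, names(resp))

# Skip ids that have no matching column (match() returns NA for those)
found <- !is.na(idx)
names(resp)[idx[found]] <- ques$question_nm[found]

names(resp)
```

Without the guard, an id absent from resp would put an NA into the subscript and the assignment would fail.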
