R data.table usage as a parameter in a package

I have a problem using a data.table as a parameter to a function.
If I define the function in the script I'm working in, it works - see fn_good below.
If I define the function (identically) as part of a package I've made, it doesn't work fully. It seems that the column names are not recognized. Commands within the function such as tables() or x[1:5, 1:2] work fine; it is just that the column names can't be used as they were in fn_good.
The other functions in my package work OK.
Any ideas?
Many thanks.
R version: 3.0.0
cd<-data.table(PY=1992:2001,DV=1:10,IN=2000)
fn_good<-function(x) {x[1:5, list(PY, DV)]}
fn_good(x=cd)
PY DV
1: 1992 1
2: 1993 2
3: 1994 3
4: 1995 4
5: 1996 5
fn_in_Package_Bad
function (x)
{
x[1:5, list(PY, DV)] #identical to above
}
<environment: namespace:RBasicChainLadder>
fn_in_Package_Bad(x=cd)
Error in `[.data.frame`(x, i, j) : object 'PY' not found

To make the package data.table-aware I had to add
Depends: data.table
to the package DESCRIPTION file.
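The `[.data.frame` in the error is the telltale sign: inside a namespace that data.table does not recognize as data.table-aware, `[` falls back to data.frame subsetting, so the column names in j are not found. A minimal sketch of the two declarations involved (the behaviour depends on the data.table version; in older releases only Depends was recognized, while later releases also accept Imports together with an import in NAMESPACE):

```
# DESCRIPTION
Depends: data.table
# -- or, in later data.table versions --
Imports: data.table

# NAMESPACE (when using Imports)
import(data.table)
```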

Related

How to use package adespatial as same as sPCA package in ade4

I used the package adegenet with the function spca to understand whether there are geographical patterns in my genetic data.
vcf<- read.table("AMZ.012") #samples per line
vcf_m<-as.matrix(vcf)
# Add coordinates of samples
xy <-read.table("CoordAMZ_m.csv", sep=",") #geo coordinates for each sample
The matrix "vcf" contains 0s and 1s (1 means the information is there, 0 means no information); each line is a different sample, as in the following example:
0 1 0 1
1 1 0 1
1 1 0 0
I ran sPCA using adegenet package in R, following the example:
mySpca <- spca(vcf_m, xy, ask=FALSE, type=5, scannf=FALSE)
The result was:
This function is now deprecated. Please use the 'multispati' function in the 'adespatial' package.
I tried to use this new function, but I have no idea how to use it in the same way as spca and get similar results. I am expecting something like the output in this pdf (http://adegenet.r-forge.r-project.org/files/tutorial-spca.pdf), page 7.
I would be very happy if someone could help me.
Thanks.
I am not sure if you are still interested, but it is very simple. First have a look at args(multispati) to see what is required.
The first argument required is a multi-variate ordination analysis using dudi in the ade4 package.
I created a dudi.pca.
Then a listw is required.
This is simply a weighted connection network similar to what you use in sPCA but in a different format.
You can use chooseCN just like in your sPCA.
Then convert your myCN to a listw using the nb2listw function.
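The steps above can be sketched as follows (object names are from the question; the argument values are illustrative, and type = 5 in chooseCN is a distance-based network that may additionally require d1/d2 bounds):

```r
library(ade4)       # dudi.pca
library(adegenet)   # chooseCN
library(spdep)      # nb2listw
library(adespatial) # multispati

pca1 <- dudi.pca(vcf_m, scannf = FALSE, nf = 2)  # ordinary PCA as a dudi object
myCN <- chooseCN(xy, type = 1, ask = FALSE)      # connection network, as in spca()
lw   <- nb2listw(myCN)                           # convert the nb object to a listw

mySpca <- multispati(pca1, lw, scannf = FALSE)   # spatial PCA, replacing spca()
```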
Hope this helps!
Cheers,
Kat

What are the possible reasons for the "Invalid .internal.selfref detected" warning message when using na.locf() with data.table?

I have a large dataset (3667856 x 20) which produces the warning message below when I run:
library(data.table)
library(zoo)
data[, new_quant_PD := na.locf(QUANT_PD,na.rm=FALSE), by=c('OBLIGOR_ID','PORTFOLIO','OBLIGATION_NUMBER')]
Warning messages:
1: In `[.data.table`(data, , `:=`(new_quant_PD, na.locf(QUANT_PD, ... :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
In order to understand the situation better, I created the following simpler (yet similar) example:
tmp = data.table(name=c('Zhao','Zhao','Zhao','Qian','Qian','Sun','Sun','Li','Li','Li'),score=c('B+',NA,'B',NA,NA,NA,'A',NA,'A-',NA))
tmp
name score
1: Zhao B+
2: Zhao NA
3: Zhao B
4: Qian NA
5: Qian NA
6: Sun NA
7: Sun A
8: Li NA
9: Li A-
10: Li NA
tmp[,new_score:=na.locf(score,na.rm=FALSE),by='name']
tmp
name score new_score
1: Zhao B+ B+
2: Zhao NA B+
3: Zhao B B
4: Qian NA NA
5: Qian NA NA
6: Sun NA NA
7: Sun A A
8: Li NA NA
9: Li A- A-
10: Li NA A-
This smaller example does not generate a warning message at all.
In theory I can loop over all combinations of OBLIGOR_ID, PORTFOLIO, and OBLIGATION_NUMBER and find out which one(s) is (are) causing the trouble, but data is only part of an 81,293,658-row dataset that I have. I don't think I can afford that much loop time in R.
Any suggestion is greatly appreciated!
Good question but it is not reproducible because we can't see where the object data came from. This step is critically important in helping you.
The warning message (that I wrote) is included in your question, so that's good. But it appears as one single long line. Here it is again, broken into sentences so we can easily read it:
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference.
At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr.
Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting.
If this message doesn't help, please report to datatable-help so the root cause can be fixed.
The second sentence starts "At an earlier point ...". So, where did this data object come from? What are the reproducible steps to create this particular data? Do any of the hints already suggested right there in the warning message help at all? It would really help if you showed us that you read the warning message and tried its hints when you ask the question.
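As a concrete illustration of the hints in the warning, the by-reference set* functions it recommends look like this (a sketch on a toy table, since the original data object is not shown):

```r
library(data.table)
DT <- data.table(a = 1:3, b = 4:6)

setnames(DT, "a", "id")        # rename by reference; avoid names(DT) <- ...
setattr(DT, "source", "demo")  # set an attribute by reference; avoid attr(DT, ...) <-
setkey(DT, id)                 # set the key by reference; avoid key(DT) <- ...

DT[, c := id + b]              # := can now add a column without triggering the warning
```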

R - Object not found error when using ddply

I'm applying ddply to the following data frame. The point is to apply the ecdf function to the yearly_test_count values of rows that have the same country.
> head(test)
country yearly_test_count download_speed
1 AU 1 2.736704
2 AU 6 3.249486
3 AU 6 2.287267
4 AU 6 2.677241
5 AU 6 1.138213
6 AU 6 3.205364
This is the script I used:
house_total_year_ecdf <- ddply(test, c("country"), mutate,
ecdf_val = ecdf(yearly_test_count)(yearly_test_count)*length(yearly_test_count))
But I received the following error:
Error in eval(substitute(expr), envir, enclos) :
object 'yearly_test_count' not found
==================================================================
I tried using the function ecdf alone with yearly_test_count column and it works:
ecdf(test$yearly_test_count)(test$yearly_test_count)*length(test$yearly_test_count)
Does anyone have any idea why this doesn't work when using ddply?
This is weird since the script worked before; now I run the script again and encounter the mentioned error. I'm not sure whether this issue is related to differences in versions of R or of the package.
Any help is much appreciated! :)
One option would be using ave from base R:
test$ecdf_val <- with(test, ave(yearly_test_count, country,
FUN = function(x) ecdf(x)(x)*length(x)))
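A frequent cause of this exact error is loading dplyr after plyr: dplyr's mutate then masks plyr's, and ddply hands the grouped data to the wrong function, producing "object not found". Qualifying the function explicitly sidesteps the masking (a sketch using the test data frame above):

```r
library(plyr)

# plyr::mutate evaluates ecdf_val inside each country's sub-data-frame,
# even when dplyr is also loaded and masking mutate
house_total_year_ecdf <- ddply(test, "country", plyr::mutate,
  ecdf_val = ecdf(yearly_test_count)(yearly_test_count) *
             length(yearly_test_count))
```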

%Rpush >> lists of complex objects (e.g. pandas DataFrames in IPython Notebook)

Once again, I am having a great time with Notebook and the emerging rmagic infrastructure, but I have another question about the bridge between the two. Currently I am attempting to pass several subsets of a pandas DataFrame to R for visualization with ggplot2. Just to be clear upfront, I know that I could pass the entire DataFrame and perform additional subsetting in R. My preference, however, is to leverage the data management capability of Python and the subset-wise operations I am performing are just easier and faster using pandas than the equivalent operations in R. So for the sake of efficiency and morbid curiosity...
I have been trying to figure out if there is a way to push several objects at once. The wrinkle is that sometimes I don't know in advance how many items will need to be pushed. To retain flexibility, I have been populating dictionaries with DataFrames throughout the front end of the script. The following code provides a reasonable facsimile of what I am working through (I have not converted via com.convert_to_r_dataframe for simplicity, but my real code does take this step):
import numpy as np
import pandas as pd
from pandas import DataFrame

%load_ext rmagic

d1 = DataFrame(np.arange(16).reshape(4, 4))
d2 = DataFrame(np.arange(20).reshape(5, 4))
d_list = [d1, d2]
names = ['n1', 'n2']
d_dict = dict(zip(names, d_list))

for name in d_dict.keys():
    exec '%s=d_dict[name]' % name

%Rpush n1
As can be seen, I can assign a static name and push the DataFrame into the R namespace individually (as well as in a 'list' >> %Rpush n1 n2). What I cannot do is something like the following:
for name in d_dict.keys():
    %Rpush d_dict[name]
That snippet raises an exception >> KeyError: u'd_dict[name]'. I also tried to deposit the dynamically named DataFrames in a list, but the list references end up pointing to the data rather than the object reference:
df_list = []
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
    exec 'df_list.append(%s)' % name
print df_list

for df in df_list:
    %Rpush df
[ 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15,
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19]
%Rpush did not throw an exception when I looped through the list's contents, but the DataFrames could not be found in the R namespace. I have not been able to find much discussion of this topic beyond talk about the conversion of lists to R vectors. Any help would be greatly appreciated!
Rmagic's push uses the name that you give it both to look up the Python variable, and to name the R variable it creates. So it needs a valid name, not just any expression, on both sides.
There's a trick you can do to get the name from a Python variable:
d1=DataFrame(np.arange(16).reshape(4,4))
name = 'd1'
%Rpush {name}
# equivalent to %Rpush d1
But if you want to do more advanced things, it's best to get hold of the r object and use that to put your objects in. Rmagic is just a convenience wrapper over rpy2, which is a full API. So you can do:
from rpy2.robjects import r
r.assign('a', 1)
You can mix and match which interface you use - rmagic and rpy2 are talking to the same instance of R.
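Putting the {name} trick together with the dictionary from the question, the loop can be written as follows (an IPython-notebook sketch; it assumes every key of d_dict is a valid Python and R identifier):

```
for name in d_dict:
    globals()[name] = d_dict[name]  # bind each DataFrame to a real variable
    %Rpush {name}                   # rmagic interpolates `name` before pushing
```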

fast join data.table (potential bug, checking before reporting)

This might be a bug; in that case, I will delete this question and report it as a bug. I would like someone to take a look to make sure I'm not doing something incorrect, so I don't waste the developers' time.
test = data.table(mo=1:100, b=100:1, key=c("mo", "b"))
mo = 1
test[J(mo)]
That returns the entire test data.table instead of the correct result returned by
test[J(1)]
I believe the error might be coming from test having a column with the same name, mo, as the variable being joined on. Does anyone else get the same problem?
This is a scoping issue, similar to the one discussed in data.table-faq 2.13 (warning, pdf). Because test contains a column named mo, when J(mo) is evaluated, it returns that entire column rather than the value of mo found in the global environment, which it masks. (This scoping behavior is, of course, quite nice when you want to do something like test[mo<4]!)
Try this to see what's going on:
test <- data.table(mo=1:5, b=5:1, key=c("mo", "b"))
mo <- 1
test[browser()]
Browse[1]> J(mo)
# mo
# 1: 1
# 2: 2
# 3: 3
# 4: 4
# 5: 5
# Browse[1]>
As suggested in the linked FAQ, a simple solution is to rename the indexing variable:
MO <- 1
test[J(MO)]
# mo b
# 1: 1 6
(This will also work, for reasons discussed in the documentation of i in ?data.table):
mo <- data.table(1)
test[mo]
# mo b
# 1: 1 6
This is not a bug, but documented behaviour afaik. It's a scoping issue:
test[J(globalenv()$mo)]
# mo   b
# 1: 1 100
