How to interpolate values in a whole database with the approx() function in R?

I have recorded values from different treatments at different moments. I did a linear interpolation on these points with the approx() function and obtained the predicted values. So far I have only done this for one repetition belonging to one treatment; now I want to do it for the whole database. The approach I decided on was to create a new column (see "polyname" below) that combines treatment and block, and then run approx() with polyname as the criterion for splitting the database, but I could not work out how to do this (see the error output below). Any help will be really appreciated.
Here is the script and a link to the database.
http://www.filedropper.com/dataexample
names(dataexample)
# Summarize the structure of the variables
str(dataexample)
# Convert x to numeric
dataexample$x <- as.numeric(as.character(dataexample$x))
str(dataexample)
# Create a new column (polyname) combining treatment and block, separated by ","
dataexample$polyname <- paste(dataexample$treat, dataexample$block, sep=",")
# Split the database and run approx() with the new column polyname
model1<-lapply (split(dataexample, dataexample$polyname), approx(x, y, method="linear", xout=7:148, yleft=0, yright=0, rule = 1, f = 0, ties = mean))
model1
Error output:
> #Split the database and run approx function
> model1<-lapply (split(dataexample, dataexample$polyname), approx(x, y, method="linear", xout=7:148, yleft=0, yright=0, rule = 1, f = 0, ties = mean))
Error in xy.coords(x, y) : object 'y' not found
> model1
Error: object 'model1' not found
Thanks in advance.
Regards.
Matías.

Related

Error: negative length vectors are not allowed

I have a relatively big dataframe in R called df, about 2.9 GB in size, with 3701578 rows and 94 columns. I am trying to run the following command with the pls package to perform a principal component regression (pcr):
library(pls)

set.seed(1)
y_cols = tail(colnames(df), 1)            # select the last column as the dependent variable
x_cols = colnames(df)[-c(1, 2, 93, 94)]   # PCA applied only to columns 3 to 92, whose components become the regressors for pcr
formula = as.formula(
  paste0("`", y_cols, "`", " ~ ", paste(paste0("`", x_cols, "`"), collapse = " + "))
)                                         # avoids writing the formula out by hand
model <- pcr(formula = formula, data = df[df$date < 19801231, ], scale = FALSE, center = FALSE)
I get the following error:
Error in array(0, dim = c(npred, nresp, ncomp)): negative length vectors are not allowed
Traceback:
1. pcr(formula = formula, data = df[df$date < 19801231, ], scale = FALSE,
. center = FALSE)
2. eval(cl, parent.frame())
3. eval(cl, parent.frame())
4. pls::mvr(formula = formula, data = df[df$date < 19801231, ],
. scale = FALSE, center = FALSE, method = "svdpc")
5. fitFunc(X, Y, ncomp, Y.add = Y.add, center = center, ...)
6. array(0, dim = c(npred, nresp, ncomp))
Slicing the dataframe as in the formula gives a smaller dataframe of 751024 rows × 94 columns. At first I thought (based on similar cases I found online) that this could be due to a memory limit, but I have around 1000 GB of RAM available, so that is definitely not the case. Oddly, I have no problem running the same command on the entire dataframe df. Creating a new object, e.g. new <- df[df$date < 19801231, ], and then running the code does not help either. I did manage to get it running by setting the (relatively few) missing values in new to zero. Yet if I keep the missing data, the pcr command runs smoothly on the entire (bigger) df. Does anybody have an idea about this behavior?
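Since the behavior seems tied to the missing values, here is a minimal sketch of how the missingness in the subset can be inspected, reusing x_cols and y_cols from the code above:
# count missing values per predictor in the date-filtered subset
new <- df[df$date < 19801231, ]
colSums(is.na(new[, x_cols]))
# rows with at least one missing value among the model variables
sum(!complete.cases(new[, c(x_cols, y_cols)]))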

Calculate the Pearson correlation between two lists

I have many equally structured text files containing experimental data (641*976). First I set the working directory and collect the files in a list. I actually generate two different lists: file.listx, containing my sample data, and file.listy, containing the reference data. Afterwards I rearrange the data in order to run the correlation analysis. The code below shows how I generate the "x" list; the "y" list was generated in exactly the same way from the reference data.
file.listx <- list.files(pattern = "*.txt", full.names = TRUE)
datalist <- lapply(file.listx, FUN = read.table, header = FALSE, sep = "\t", skip = 2)

cmbn <- expand.grid(1:641, 1:977)   # every (row, column) position
flen <- length(datalist)

# for each position, collect the value at that position from every file
x <- lapply(1:nrow(cmbn), function(t, lst, cmbn) {
  sapply(1:flen, function(i, t1, lst1, cmbn1) {
    lst1[[i]][cmbn1$Var1[t1], cmbn1$Var2[t1]]
  }, t, lst, cmbn)
}, datalist, cmbn)
Now I want to calculate the Pearson correlation between the two lists.
http://www.datasciencemadesimple.com/pearson-function-in-excel/
According to the Pearson correlation formula, my "x" corresponds to the sample and my "y" to the reference.
cor(x, y, method = "pearson")
Then an error message pops up saying that 'x' must be numeric, and I do not know how to solve this. When I use
x = as.numeric(x)
the list structure gets lost. The following approach does not solve the problem either:
x = as.matrix(x)
How can I convert my list into a numeric type without losing its structure? I want to calculate the Pearson correlation between the two lists.
Here is code to generate two dummy lists, so that the error can be reproduced.
x = list(4:10, 10:16, 32:38, 100:106) # sample
y = list(10:16, 20:26, 40:46, 110:116) # reference
cor(x, y, method = "pearson")
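For illustration, with the dummy lists above, either of the following might be what is needed, depending on whether one coefficient per pair of elements or a single overall coefficient is wanted (just a sketch, not tested against the real 641*976 files):
# (a) one Pearson coefficient per pair of list elements
mapply(function(a, b) cor(a, b, method = "pearson"), x, y)
# (b) a single overall coefficient across all values
cor(unlist(x), unlist(y), method = "pearson")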

Can I get unwtd.count included when running svymean from the R survey package?

I've written an R script to loop through a bunch of variables in a survey and output weighted values, CVs, CIs etc.
I would like it to also output the unweighted observations count.
I know it's a bit of a lazy question, because I can calculate unweighted counts on my own and join them back in; I'm just trying to replicate a Stata script that would return 'obs':
svy:tab jdvariable, per cv ci obs column format(%14.4g)
This is my calculated values table:
myresult_year_calc <- svyby(make.formula(newmetricname),                     # variable to pass to the function
                            by = ~year,                                      # grouping
                            design = subset(csurvey, geoname %in% jv_geo),   # design object with subset definition
                            vartype = c("ci", "cvpct"),                      # report variation as CI and CV percentage
                            na.rm.all = TRUE,
                            FUN = svymean                                    # function from the survey package
)
By passing unwtd.count as FUN instead of svymean, I get the counts I want.
myresult_year_obs <- svyby(make.formula(newmetricname),                      # variable to pass to the function
                           by = ~year,                                       # grouping
                           design = subset(csurvey, geoname %in% jv_geo),    # design object with subset definition
                           vartype = c("ci", "cvpct"),                       # report variation as CI and CV percentage
                           na.rm.all = TRUE,
                           FUN = unwtd.count
)
Honestly, in writing this question I made it 98% of the way to a solution, but I'll ask anyway in case someone knows a more efficient way.
myresult_year_calc and myresult_year_obs both return what I expect, and if I use merge(myresult_year_calc, myresult_year_obs, by = "year") I get the table I want. In this example, though, it just gives me one count per year, instead of one count for 'Yes' responses and one count for 'No'.
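Spelled out, the merge step I mean is just this (assuming both svyby results carry a year column):
# join weighted estimates and unweighted counts on the shared year column
myresult_year <- merge(myresult_year_calc, myresult_year_obs, by = "year")
head(myresult_year)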
Is there any way to get both means and unweighted counts with a single command?
I figured this out by creating a second design object with weights = ~0. When I ran svyby using the svytotal function with this unweighted design, it followed the formula.
dsgn2 <- svydesign(ids = ~0,
                   weights = ~0,
                   data = data,
                   na.rm = TRUE)

unweighted_n <- svyby(~interaction(group1, group2), ~as.factor(mean_rating),
                      design = dsgn2, FUN = svytotal, na.rm = TRUE)

Looping through list for each element computation in R

I am new to R and trying to get something done, so apologies in advance if this is a clumsy way of doing it.
I am trying to get the coefficients and significance of the x-values with respect to the y-values; the values in X are the criteria whose relevance is being tested.
I need to find the positive or negative relevance/confidence for each of the outcomes listed in myList. Rather than putting one column into Y manually, I want to iterate through the list and get a result for each column.
library(rms)
parameters <- read.csv(file = "C:/Users/manjaria/Documents/Lek papers/validation_csv.csv", header = TRUE)
#attach(parameters)
myList <- c("name1", "name2", "name3", "name4", "name5")

for (cnt in seq(length(myList))) {
  Y <- cbind(myList[cnt])
  X <- cbind(age, female, income, employed, traveldays, modesafety, prPoolsize)
  XVar <- c("age", "female", "income", "employed", "traveldays", "modesafety", "prPoolsize")

  summary(Y)
  summary(X)
  table(Y)

  ddist <- datadist(XVar)
  options(datadist = 'ddist')

  ologit <- lrm(Y ~ X, data = parameters)
  print(ologit)

  fitted <- predict(ologit, newdata = parameters, type = "fitted.ind")
  colMeans(fitted)
}
I encounter:
Error in model.frame.default(formula = Y ~ X, data = parameters, na.action = function (frame) :
variable lengths differ (found for 'X')
If I don't loop, and instead use a static name for Y like Y <- cbind(name1), it works well.
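For comparison, here is a minimal sketch of what I think the loop needs to do, building the formula from each outcome name in myList (purely illustrative; it assumes the outcomes in myList and all of the predictors are columns of parameters):
library(rms)

xvars <- c("age", "female", "income", "employed", "traveldays", "modesafety", "prPoolsize")
rhs   <- paste(xvars, collapse = " + ")

ddist <- datadist(parameters)
options(datadist = "ddist")

results <- lapply(myList, function(yname) {
  f      <- as.formula(paste(yname, "~", rhs))   # e.g. name1 ~ age + female + ...
  ologit <- lrm(f, data = parameters)
  colMeans(predict(ologit, newdata = parameters, type = "fitted.ind"))
})
names(results) <- myList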

Apply nls() function to multiple subsets

I need to run a non-linear least squares regression on an entire data set, and then repeat the regression on several subsets of that data set. I can do this for a single subset; for example (where y is a generic logistic equation, and x is a vector from 1 to 20):
example = nls(x ~ y, subset = c(2:20))
but I want to do this for 3:20, 4:20, 5:20, etc. I tried a for loop:
datasubsets <- sapply(2:19, seq, to = 20)
for (i in 1:19) {
  example[i] <- nls(x ~ y, subset = datasubsets[i])
}
but I receive "Error in xj[i] : invalid subscript type 'list'". I would very much like to avoid having to copy and paste nls() 20 times. Any help is much appreciated.
This does the job: sapply(2:19, function(jj) nls(x ~ y, subset = jj:20)).
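Expanded a little, the same idea can keep the fits in a named list; here is a sketch with made-up exponential data in place of the unspecified logistic model, and with the starting index capped so every subset keeps enough points for nls():
set.seed(1)
dat <- data.frame(x = 1:20)
dat$y <- 2 * exp(0.15 * dat$x) + rnorm(20, sd = 0.2)   # made-up nonlinear data

# one fit per starting index
fits <- lapply(2:15, function(start) {
  nls(y ~ a * exp(b * x), data = dat, subset = start:20,
      start = list(a = 1, b = 0.1))
})
names(fits) <- paste0("from_", 2:15)

sapply(fits, function(m) coef(m)["b"])   # e.g. compare one parameter across subsets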
