I have a list of data frames, and I would like to fit a nonlinear model to each data frame in the list and get back a list of the best-fit objects. I am trying to use lapply, but I am having problems with the parameters.
# define a function for D
fncTtr <- function(n,d) (d/n)*((sqrt(1+2*(n/d))-1))
# define a function for best fit
bestFitD <- function(dat,fun) {
res <- nls(dat$ttr~fun(n,d),data=dat,start=list(d=25),trace=T)
return(res)
}
resL <- lapply(dData2,bestFitD,dat=dData2,fun=fncTtr)
When I execute this code I am getting the following error:
Error in FUN(X[[1L]], ...) : unused argument(s) (X[[1]])
I want lapply to go through each data frame in dData2 and execute the bestFitD function on it. How should I define the parameters for the function in lapply? The dData2 object is a list. I am passing it as a parameter to bestFitD, but this function expects a single data frame. How can I define this parameter? When I execute bestFitD by itself with one data frame, it runs correctly.
Example of a dData2 list whose components are data frames:
$`1`
n ttr d id
1 35 0.6951 27.739 1
2 36 0.6925 28.072 1
3 37 0.6905 28.507 1
4 38 0.6887 28.946 1
5 39 0.6790 28.003 1
6 40 0.6703 27.247 1
7 41 0.6566 25.735 1
8 42 0.6605 26.981 1
9 43 0.6567 27.016 1
10 44 0.6466 26.026 1
11 45 0.6531 27.667 1
12 46 0.6461 27.128 1
13 47 0.6336 25.751 1
14 48 0.6225 24.636 1
15 49 0.6214 24.992 1
16 50 0.6248 26.011 1
$`2`
n ttr d id
17 35 0.6951 27.739 2
18 36 0.6925 28.072 2
19 37 0.6905 28.507 2
20 42 0.6605 26.981 2
The following code seems to be ok:
res <- bestFitD(dData2[[1]],fncTtr)
but when I execute the following:
res <- bestFitD(dData2[[2]],fncTtr)
I am getting the following error:
Error in model.frame.default(formula = ~dat + ttr + n, data = dat) :
invalid type (list) for variable 'dat'
Why? Both are data frames!
But it seems that there is something strange with the second component.
Just get rid of the dat$ in your nls function call. I believe it's looking for dat$dat$ttr, which will obviously break. That is, your bestFitD function should be:
bestFitD <- function(dat,fun) {
res <- nls(ttr~fun(n,d),data=dat,start=list(d=25),trace=T)
return(res)
}
Now, call using lapply as:
resL <- lapply(dData2, bestFitD, fun = fncTtr)
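To make the mechanics concrete: lapply passes each list element as the first argument of the function and forwards any further named arguments unchanged, which is why dat must not be supplied a second time. The following is a minimal sketch, assuming the two sample data frames from the question (only the n and ttr columns are reproduced, and trace is left off to keep the output quiet):

```r
fncTtr <- function(n, d) (d / n) * (sqrt(1 + 2 * (n / d)) - 1)

bestFitD <- function(dat, fun) {
  # ttr and n are looked up in dat; fun is found through the formula's environment
  nls(ttr ~ fun(n, d), data = dat, start = list(d = 25))
}

# Sample data copied from the question
dData2 <- list(
  `1` = data.frame(
    n   = 35:50,
    ttr = c(0.6951, 0.6925, 0.6905, 0.6887, 0.6790, 0.6703, 0.6566, 0.6605,
            0.6567, 0.6466, 0.6531, 0.6461, 0.6336, 0.6225, 0.6214, 0.6248)
  ),
  `2` = data.frame(
    n   = c(35, 36, 37, 42),
    ttr = c(0.6951, 0.6925, 0.6905, 0.6605)
  )
)

# Each element of dData2 becomes dat; fun is passed along through "..."
resL <- lapply(dData2, bestFitD, fun = fncTtr)
dEst <- sapply(resL, coef)   # one estimate of d per data frame
```

resL is then a list of nls objects, one per data frame, so the usual extractors (coef, summary, predict) can be applied with another lapply or sapply.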
This should work:
resL <- lapply(dData2, function(x,fun){
bestFitD(x,fun)
},fun='fncTtr')
where I rewrote bestFitD using do.call:
bestFitD <- function(dat,fun){
res <- nls(ttr~do.call(fun,list(n,d)), data=dat,
start=list(d=25),trace=T)
res
}
0.003237457 : 25
0.0009393089 : 26.77943
0.0009362902 : 26.84895
0.0009362902 : 26.84898
0.001282807 : 25
4.771935e-05 : 27.64267
4.389588e-05 : 27.80729
4.389584e-05 : 27.80781
EDIT
My solution can be simplified to the following (similar, but not identical, to Anthony's solution):
lapply(dData2, bestFitD, fun = 'fncTtr')
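The reason a character string works here is that do.call accepts either a function object or the name of a function. A small sketch, assuming the fncTtr definition from the question and using the first row of the sample data as input:

```r
fncTtr <- function(n, d) (d / n) * (sqrt(1 + 2 * (n / d)) - 1)

direct  <- fncTtr(35, 27.739)
by_name <- do.call("fncTtr", list(35, 27.739))  # function looked up by its name
by_obj  <- do.call(fncTtr, list(35, 27.739))    # function object passed directly
```

All three calls evaluate the same function with the same arguments, so either form can be handed to bestFitD.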
So my situation is the following: I have a large data frame which contains the data I should use in matching analyses. I should, however, match inside subgroups that are defined by certain areas. Because I didn't want to do that "manually" for each subgroup (there are too many), I came up with an approach that divides the initial data frame into sub-data frames, each containing the information for one unique treated area plus the control areas, and saves these data frames into a list. After this, I performed matching on the data frames in the list using the matchit function from R's MatchIt package. Here is a heavily simplified example of what the list of data frames looks like:
> list_df
$A
name treatment cov1 cov2 cov3 var
1 A 1 13.65933 200.5809 13 1000.1185
2 A 1 15.80334 233.8301 13 1010.1038
3 A 1 15.16098 215.1046 13 999.8548
4 A 1 16.45487 185.4957 13 997.5585
5 A 1 15.55230 193.5955 13 1001.2822
9 U 0 16.33895 175.6502 13 999.0682
10 U 0 18.05787 197.6041 13 1003.2781
11 U 0 14.29088 229.5446 13 1002.9567
12 U 0 16.32195 238.9975 13 998.9453
13 U 0 15.25240 217.5467 13 1004.0581
14 U 0 14.69154 219.9963 13 999.3270
15 U 0 14.88606 153.6038 15 989.6423
16 U 0 14.34472 212.5205 15 994.6094
17 U 0 14.66233 231.1179 15 999.7775
18 U 0 14.69155 240.4084 15 994.9280
19 U 0 15.63663 198.3323 10 1007.4225
20 U 0 15.19980 183.5846 10 997.6229
$B
name treatment cov1 cov2 cov3 var
6 B 1 15.66004 187.1542 15 1004.2311
7 B 1 13.89696 197.5548 15 995.6478
8 B 1 16.17403 204.9423 15 1001.5157
9 U 0 16.33895 175.6502 13 999.0682
10 U 0 18.05787 197.6041 13 1003.2781
11 U 0 14.29088 229.5446 13 1002.9567
12 U 0 16.32195 238.9975 13 998.9453
13 U 0 15.25240 217.5467 13 1004.0581
14 U 0 14.69154 219.9963 13 999.3270
15 U 0 14.88606 153.6038 15 989.6423
16 U 0 14.34472 212.5205 15 994.6094
17 U 0 14.66233 231.1179 15 999.7775
18 U 0 14.69155 240.4084 15 994.9280
19 U 0 15.63663 198.3323 10 1007.4225
20 U 0 15.19980 183.5846 10 997.6229
In the real data, I have seven covariates, two of which are matched using the exact method.
Here is the code for matching, combining matchit (with Mahalanobis distance) and lapply:
library(MatchIt)
m_obj_Mah <- lapply(area_list,
function(x){
matchit(Treatment ~ Cov1 + Cov2 + Cov3 + Cov4 + Cov5,
data=x, method="nearest", exact = ~ Cov6 + Cov7, distance="mahalanobis")
}
)
In the code above, everything works fine. However, when I try to extract the matched datasets, I get the error:
m_data_Mah <- lapply(m_obj_Mah,
function(x) {match.data(x)})
Error in eval(object$call$data, envir = env) : object 'x' not found
The weirdest thing is that I used the same approach to do nearest-neighbour propensity score matching with calipers on the same dataset, and the error didn't appear. The error apparently has something to do with defining the function using x as the name for each data frame in lapply, but I can't come up with a solution (either looping through the areas in another way or defining x in lapply somehow differently). Any suggestions?
And sorry that I didn't provide any data. It would be quite complicated to generate a realistic dataset and I cannot share the original. I can try to come up with some kind of a dummy dataset if it's absolutely necessary.
Please see this issue, which asks the same question, and the documentation for match.data(), which answers it (see the data argument).
This is an inherent limitation of match.data(), but the solution is simple and documented: supply the original dataset to the data argument of match.data(), like so:
m_data_Mah <- lapply(seq_along(area_list), function(i) {
match.data(m_obj_Mah[[i]], data = area_list[[i]])})
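The error message in the question (eval(object$call$data, envir = env)) suggests that the data is recovered by re-evaluating the expression stored in the original call. The mechanism can be sketched in base R. Note that toy_fit() and toy_data() below are hypothetical stand-ins written for this illustration, not MatchIt functions, and area_df is an arbitrary name for the lapply variable:

```r
# Hypothetical stand-in for a fitting function that remembers its own call
toy_fit <- function(data) {
  structure(list(call = match.call()), class = "toy_fit")
}

# Hypothetical stand-in for an extractor that looks the data up again later
toy_data <- function(object, data = NULL, env = parent.frame()) {
  # Without an explicit data argument, re-evaluate the stored expression
  if (is.null(data)) data <- eval(object$call$data, envir = env)
  data
}

dfs  <- list(A = data.frame(v = 1:2), B = data.frame(v = 3:4))
fits <- lapply(dfs, function(area_df) toy_fit(data = area_df))

# The stored expression is just the symbol area_df, which no longer exists
msg <- tryCatch(lapply(fits, toy_data), error = function(e) conditionMessage(e))

# Supplying the original data explicitly avoids the lookup altogether
ok <- lapply(seq_along(fits), function(i) toy_data(fits[[i]], data = dfs[[i]]))
```

Here msg holds the "object not found" message, while ok contains the two original data frames.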
If you are using version 4.2.0 or higher of MatchIt, using exact will automatically match within subgroups of the exact matching variables (i.e., it will perform separate matching procedures within each one) when using method = "nearest". Setting verbose = TRUE will show which level is currently being matched. You can also use the new rbind() method to combine the matched datasets together (in older versions, you will create statistical errors by using rbind()).
I'm trying to use a new R package called waldo (see at the tidyverse blog too) that is designed to compare data objects to find differences. The waldo::compare() function returns an object that is, according to the documentation:
a character vector with class "waldo_compare"
The main purpose of this function is to be used within the console, leveraging coloring features to highlight outstanding values that are not equal between data objects. However, while just examining in console is useful, I do want to take those values and act on them (filter them out from the data, etc.). Therefore, I want to programmatically extract the outstanding values. I don't know how.
Example
Generate a vector of length 10:
set.seed(2020)
vec_a <- sample(0:20, size = 10)
## [1] 3 15 13 0 16 11 10 12 6 18
Create a duplicate vector, and add an additional value (4) as an 11th element.
vec_b <- vec_a
vec_b[11] <- 4
vec_b <- as.integer(vec_b)
## [1] 3 15 13 0 16 11 10 12 6 18 4
Use waldo::compare() to test the differences between the two vectors
waldo::compare(vec_a, vec_b)
## `old[8:10]`: 12 6 18
## `new[8:11]`: 12 6 18 4
The beauty is that it's highlighted in the console:
But now, how do I extract the different value?
I can try to assign waldo::compare() to an object:
waldo_diff <- waldo::compare(vec_a, vec_b)
And then what? When I try waldo_diff[[1]] I get:
[1] "`old[8:10]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \n`new[8:11]`: \033[90m12\033[39m \033[90m6\033[39m \033[90m18\033[39m \033[34m4\033[39m"
and for waldo_diff[[2]] it's even worse:
Error in waldo_diff[3] : subscript out of bounds
Any idea how I could programmatically extract the outstanding values that appear in the "new" vector but not in the "old"?
As a disclaimer, I didn't know anything about this package until you posted, so this is far from an authoritative answer. You can't easily extract the differing values from the result of compare(), because it returns an ANSI-formatted string ready for pretty printing. Instead, the workhorses for vectors seem to be the internal functions ses() and ses_context(), which return the indices of the differences between the two objects. The difference between them seems to be that ses_context() splits the result into a list of non-contiguous differences.
waldo:::ses(vec_a, vec_b)
# A tibble: 1 x 5
x1 x2 t y1 y2
<int> <int> <chr> <int> <int>
1 10 10 a 11 11
The results show that there is an addition in the new vector beginning and ending at position 11.
The following simple function is very limited in scope and assumes that only additions in the new vector are of interest:
new_diff_additions <- function(x, y) {
res <- waldo:::ses(x, y)
res <- res[res$t == "a",] # keep only additions
if (nrow(res) == 0) {
return(NULL)
} else {
Map(function(start, end) {
d <- y[start:end]
`attributes<-`(d, list(start = start, end = end))
},
res[["y1"]], res[["y2"]])
}
}
new_diff_additions(vec_a, vec_b)
[[1]]
[1] 4
attr(,"start")
[1] 11
attr(,"end")
[1] 11
At least for the simple case of comparing two vectors, you’ll be better off
using diffobj::ses_dat() (which is from the package that waldo uses
under the hood) directly:
waldo::compare(1:3, 2:4)
#> `old`: 1 2 3
#> `new`: 2 3 4
diffobj::ses_dat(1:3, 2:4)
#> op val id.a id.b
#> 1 Delete 1 1 NA
#> 2 Match 2 2 NA
#> 3 Match 3 3 NA
#> 4 Insert 4 NA 3
For completeness, to extract additions you could do e.g.:
extract_additions <- function(x, y) {
ses <- diffobj::ses_dat(x, y)
y[ses$id.b[ses$op == "Insert"]]
}
old <- 1:3
new <- 2:4
extract_additions(old, new)
#> [1] 4
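For the simple case in the question, base R set operations offer a package-free cross-check. This is a different technique from the edit-script approach above: it ignores order and duplicates, so it only answers "which values occur in the new vector but not in the old one". A sketch, with the vectors from the question typed in literally rather than sampled:

```r
vec_a <- c(3L, 15L, 13L, 0L, 16L, 11L, 10L, 12L, 6L, 18L)
vec_b <- c(vec_a, 4L)

added_values    <- setdiff(vec_b, vec_a)       # values that occur only in the new vector
added_positions <- which(!(vec_b %in% vec_a))  # where those values sit in the new vector
```

If the position of a change matters (for example, the same value inserted twice), the ses-based functions above remain the better fit.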
I am attempting to prepare a dist structure for use with one of the clustering libraries in R. The input is a file containing fixed-size vectors of integers. Here is a sample input, although the real input will have more than 7 columns in each row.
54540,22060,52234,49984,34093,16412,46758
47075,41863,26267,8577,45153,6133,31558
33741,50895,45860,49942,47074,30793,10483
24437,5884,44081,34891,41438,23705,30392
28575,50826,36723,27807,28152,31804,3303
12936,26572,5576,1758,5484,12165,19950
2086,33345,4810,15722,594,34,15562
19701,46650,38306,33724,24992,55188,53023
I want to read them in and have R interpret them as vectors, which should then be passed to myfun for the purposes of computing edit distance as the distance metric.
However, it appears that data does not have the appropriate format, so myfun is being called with incorrect parameters.
I currently have the following code
library(cluster)
library(proxy)
myfun <- function(x,y) {
numDiffs <- 0;
for (i in x) {
if (x[i] != y[i])
numDiffs <- numDiffs + 1;
}
return(numDiffs);
}
summary(pr_DB)
pr_DB$set_entry(FUN = myfun, names = c("myfun", "vectorham"))
pr_DB$get_entry("MYFUN")
data <- read.table("Sample.txt", header=FALSE, sep=",")
x <- dist(x = data, method = "MYFUN")
When I run this code I get the following error:
Error in if (x[i] != y[i]) numDiffs <- numDiffs + 1 :
missing value where TRUE/FALSE needed
Calls: dist ... .proxy_external -> do.call -> .External -> <Anonymous>
I have tried manually looking at data and it appears to be a matrix rather than a vector of vectors, but I'm not sure how to fix this.
The problem is the line for (i in x) {. I think you mean for (i in seq_along(x)) {. Anyway, try using this instead:
myfun <- function(x,y) sum(x != y)
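To see why the original loop produces NA: for (i in x) iterates over the values of x, not over its positions, so those values end up being used as indices. A short sketch with made-up three-element vectors (the first row of the sample data, shortened):

```r
x <- c(54540, 22060, 52234)
y <- c(54540, 41863, 26267)

# for (i in x) would set i to 54540, 22060, ... and then evaluate x[54540]
out_of_range <- x[x[1]]        # NA: x has no element number 54540

myfun_loop <- function(x, y) {
  numDiffs <- 0
  for (i in seq_along(x)) {    # i runs over the positions 1, 2, 3
    if (x[i] != y[i]) numDiffs <- numDiffs + 1
  }
  numDiffs
}

myfun_vec <- function(x, y) sum(x != y)   # same count without an explicit loop
```

Both functions count two differing positions for these vectors; the vectorised form is shorter and avoids the indexing mistake entirely.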
EDIT: The following code:
data <- read.table(header=FALSE, sep=",", text="
54540,22060,52234,49984,34093,16412,46758
47075,41863,26267,8577,45153,6133,31558
33741,50895,45860,49942,47074,30793,10483
24437,5884,44081,34891,41438,23705,30392
28575,50826,36723,27807,28152,31804,3303
12936,26572,5576,1758,5484,12165,19950
2086,33345,4810,15722,594,34,15562
19701,46650,38306,33724,24992,55188,53023
")
library(cluster)
library(proxy)
myfun <- function(x,y) sum(x != y)
summary(pr_DB)
pr_DB$set_entry(FUN = myfun, names = c("myfun", "vectorham"))
pr_DB$get_entry("MYFUN")
x <- dist(x = data, method = "MYFUN")
gives me the following result:
> x
1 2 3 4 5 6 7
2 7
3 7 7
4 7 7 7
5 7 7 7 7
6 7 7 7 7 7
7 7 7 7 7 7 7
8 7 7 7 7 7 7 7
> data
V1 V2 V3 V4 V5 V6 V7
1 54540 22060 52234 49984 34093 16412 46758
2 47075 41863 26267 8577 45153 6133 31558
3 33741 50895 45860 49942 47074 30793 10483
4 24437 5884 44081 34891 41438 23705 30392
5 28575 50826 36723 27807 28152 31804 3303
6 12936 26572 5576 1758 5484 12165 19950
7 2086 33345 4810 15722 594 34 15562
8 19701 46650 38306 33724 24992 55188 53023
Which I believe is correct.
A couple of problems with this:
1. If x and y in myfun have different lengths, you could get "missing value where TRUE/FALSE needed", because one of the compared elements will be NA.
2. stats::dist doesn't accept user-defined functions as far as I know (see http://stat.ethz.ch/R-manual/R-patched/library/stats/html/dist.html); the dist from the proxy package, which your code loads, does.
3. What Ferdinand said about "i in x". His suggestion is good.
And a matrix can be treated much like a vector of vectors: just pull out a row as data[rowNum, ]. Further, a "dist structure" essentially holds the lower triangle of a distance matrix; as.matrix() converts it to a full matrix.
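To illustrate the row-wise view without relying on proxy, the pairwise counts can be computed by hand and wrapped with stats::as.dist. This is only a sketch, assuming the first three rows of the sample input; for large inputs the double loop would be slow:

```r
dat <- read.table(header = FALSE, sep = ",", text = "
54540,22060,52234,49984,34093,16412,46758
47075,41863,26267,8577,45153,6133,31558
33741,50895,45860,49942,47074,30793,10483
")

m <- as.matrix(dat)     # each row is one input vector
n <- nrow(m)
D <- matrix(0, n, n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- sum(m[i, ] != m[j, ])   # number of positions that differ
  }
}

d    <- stats::as.dist(D)   # dist object: stores the lower triangle only
full <- as.matrix(d)        # back to a full symmetric matrix
```

The resulting d can be passed to clustering functions that expect a dist object, such as hclust.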
I'm trying to create a function that automatically creates polynomials of a zoo object. Coming from Python, the typical way to do it is to create a list outside a for loop, and then append to the list inside the loop. Following this, I wrote the code below in R:
library("zoo")
example<-zoo(2:8)
polynomial<-function(data, name, poly) {
##creating the catcher object that the polynomials will be attached to
returner<-data
##running the loop
for (i in 2:poly) {
#creating the polynomial
poly<-data^i
##print(paste(name, i), poly) ##done to confirm that paste worked correctly##
##appending the returner object
merge.zoo(returner, assign(paste(name, i), poly))
}
return(returner)
}
#run the function
output<-polynomial(example, "example", 4)
However, when I run the function, R throws no exceptions, but the output object does not have any additional data beyond what I originally created in the example zoo object. I suspect I'm misunderstanding merge.zoo, or perhaps I'm not allowed to dynamically reassign the names of the polynomials inside the loop.
Thoughts?
As for the error in your code: you are not assigning the result of merge.zoo back to returner.
However, I think there is a better way to achieve what you want.
example <- zoo(2:8)
polynomial <- function(data, name, poly) {
res <- zoo(sapply(1:poly, function(i) data^i))
names(res) <- paste(name, 1:poly)
return(res)
}
polynomial(example, "example", 4)
## example 1 example 2 example 3 example 4
## 1 2 4 8 16
## 2 3 9 27 81
## 3 4 16 64 256
## 4 5 25 125 625
## 5 6 36 216 1296
## 6 7 49 343 2401
## 7 8 64 512 4096
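The missing assignment is the general point: R functions return a new object rather than modifying their argument in place, so the result has to be stored on every pass through the loop. A sketch of the Python-style "append in a loop" pattern, deliberately using base cbind on a plain matrix instead of merge.zoo so that it runs without the zoo package:

```r
x <- 2:8
returner <- cbind(`example 1` = x)

for (i in 2:4) {
  # cbind returns a new matrix; without the assignment the result is discarded
  returner <- cbind(returner, x^i)
  colnames(returner)[i] <- paste("example", i)
}
```

The same pattern applies to merge.zoo: writing returner <- merge.zoo(returner, poly) inside the loop keeps the added column.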
I am new to R. I have a function as follows:
funItemAverRating = function()
{
itemRatingNum = array(0, itemNum);
print("begin");
apply(input, 1, function(x)
{
itemId = x[2]+1;
itemAverRating[itemId] <<- itemAverRating[itemId] + x[3];
itemRatingNum[itemId] <<- itemRatingNum[itemId] + 1;
}
);
}
In this function input is a n*3 data frame, n is ~6*(10e+7), itemRatingNum is a vector of size ~3*(10e+5).
My question is why the apply function is so slow (it would take nearly an hour to finish)? Also, as the function runs, it uses more and more memory. But as you can see, the variables are all defined outside the apply function. Can anybody help me?
cheng
It's slow because you call high-level R functions many times.
You have to vectorize your function, meaning that most operations (like <- or +1) should be applied to whole vectors at once rather than to one element at a time.
For example it looks to me that itemRatingNum holds frequencies of input[[2]] (second column of input data.frame) which could be replaced by:
tb <- table(input[[2]]+1)
itemRatingNum[as.integer(names(tb))] <- tb
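A small check that the table-based version gives the same counts as an element-by-element loop. The data frame below is made up for illustration (column names user, item and rating are assumptions; only the column order matches the question):

```r
itemNum <- 6
input <- data.frame(user = 1:4, item = c(0, 2, 2, 5), rating = c(3, 4, 5, 1))

# Loop version, for comparison only
loopCount <- integer(itemNum)
for (k in seq_len(nrow(input))) {
  id <- input[k, 2] + 1
  loopCount[id] <- loopCount[id] + 1L
}

# Vectorised version from the answer
itemRatingNum <- integer(itemNum)
tb <- table(input[[2]] + 1)                   # frequency of each 1-based item index
itemRatingNum[as.integer(names(tb))] <- tb    # items that never occur stay at 0
```

On four rows the difference is invisible, but the vectorised version makes a single pass in compiled code instead of one R-level function call per row.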
Don't do that. You're following a logic that is completely not R-like. If I understand it right, you want to add to a certain itemAverRating vector a value from a third column in some input dataframe.
What itemRatingNum is doing, is rather obscure. It does not end up in the global environment, and it just becomes a vector filled with frequencies at the end of the loop. As you define itemRatingNum within the function, the <<- assignment will also assign it within the local environment of the function, and it will get destroyed when the function ends.
Next, you should give your function input, and get some output back. Never assign to the global environment if it's not necessary. Your function is equivalent to the following function, which is a whole lot faster and takes input and returns output:
funItemAverRating = function(x,input){
sums <- rowsum(input[,3],input[,2])
sumid <- as.numeric(rownames(sums))+1
x[sumid]+c(sums)
}
FUNCTION EDITED PER MAREK'S COMMENT
Which works like :
# make data
itemNum <- 10
set.seed(12)
input <- data.frame(
a1 = rep(1:10,itemNum),
a2 = sample(9:0,itemNum*10,TRUE),
a3 = rep(10:1,itemNum)
)
itemAverRating <- array(0, itemNum)
itemAverRating <- funItemAverRating(itemAverRating,input)
itemAverRating
0 1 2 3 4 5 6 7 8 9
39 65 57 36 62 33 98 62 60 38
If I try your code, I get :
> funItemAverRating()
[1] "begin"
...
> itemAverRating
[1] 39 65 57 36 62 33 98 62 60 38
Which is the same. If you want itemRatingNum, then just do :
> itemRatingNum <- table(input[,2])
0 1 2 3 4 5 6 7 8 9
6 11 11 8 10 6 18 9 13 8
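Since the function name suggests that an average rating is the eventual goal, the two vectorised pieces (sums per item and counts per item) can be combined. This is only a sketch under that assumption, with a made-up four-row input and assumed column names; tabulate is used here instead of table so that items without ratings get an explicit zero:

```r
itemNum <- 4
input <- data.frame(user = 1:4, item = c(0, 2, 2, 3), rating = c(4, 2, 4, 5))

sums  <- rowsum(input$rating, input$item)             # one row per item id that occurs
total <- numeric(itemNum)
total[as.integer(rownames(sums)) + 1] <- sums[, 1]    # ids are 0-based, R indices 1-based

counts   <- tabulate(input$item + 1, nbins = itemNum) # ratings per item, zeros included
averages <- total / counts                            # NaN for items without any rating
```

Items that were never rated come out as NaN, which makes them easy to spot or replace afterwards.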