Problem: I have several (10+) custom functions, each defining a step in the workflow. I want to run a nested function of these steps over a large data frame for n (50+) periods iteratively. My current function achieves the result but it is too slow and not very elegant.
Example Input
id x_1975 z_1975
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 NA
6 6 6 NA
Step 1:
Compare initial x values (x_1975) against a cutoff=3. If x is greater than 3, then the z value should be "Y".
Step 2:
If z value is "Y", then x value in next year should be x times 2. Otherwise, it should be x times 5. Although the z values can be skipped altogether, I need the categorical column to create summary stats.
Note:
The data set I am working with has 20 variables that need to be calculated using similar logic.
Desired Output
id x_1975 z_1975 x_1976 z_1976 x_1977 z_1977 x_1978
1 1 1 <NA> 5 Y 10 Y 20
2 2 2 <NA> 10 Y 20 Y 40
3 3 3 <NA> 15 Y 30 Y 60
4 4 4 Y 8 Y 16 Y 32
5 5 5 Y 10 Y 20 Y 40
6 6 6 Y 12 Y 24 Y 48
What I have tried:
I tried putting the data in long format, but found it complicated to iterate over rows.
I pre-allocated all columns with the appropriate class. That reduced the run time a little, but not enough.
I have been trying to use purrr::compose to nest all the functions, but I have not been able to make it work.
Reproducible Example
library(dplyr)
library(purrr)
# Create Data Frame
n <- 6
dat <- data.frame(id=1:n,
x_1975=seq(1,6,1),
z_1975=NA)
cut_off <- 3
# Functions
# Set a value for "z_" variables in period t by comparing the "x_" value in period t against the cut_off value.
func_1 <- function(dat,yr){
# pre-define variables
z <- paste0("z_",yr)
x <- paste0("x_",yr)
# Calculate values for "z_" in period t
dat <- dat %>% mutate(!!sym(z):=
case_when(!!sym(x)>cut_off ~ "Y",
TRUE~as.character(NA)
))
}
# Calculate the value for "x_" variables in period t+1 based on "z_" variables in period t.
func_2 <- function(dat,yr){
# pre-define variables
x <- paste0("x_",yr+1)
x_lag <- paste0("x_",yr)
z <- paste0("z_",yr)
# Calculate "x_" value for t+1
dat <- dat %>% mutate(!!sym(x):=case_when(
!!sym(z)=="Y"~!!sym(x_lag)*2,
TRUE~!!sym(x_lag)*5
))
}
# Join function 1 and function 2 together. The joined function needs to iterate over the `dat` from beginning year to ending year
joined_func <- function(dat,beginning,ending){
for (year in seq(beginning,ending,1)){
dat <- func_1(dat,year)
# Output of step 1 is used as input for step 2
dat <- func_2(dat,year)
}
return(dat)
}
# Run the code (1975 to 1977 here; 1975 to 2025 on the real data). data_output has the desired output, but I need to reduce the runtime.
data_output <- joined_func(dat,1975,1977)
# Tried to use the compose function from purrr, but I get an error.
my_funs <- c(func_1, func_2)
f1 <- invoke(compose, my_funs)
joined_func_2 <- function(dat,beginning,ending){
for (year in seq(beginning,ending,1)){
dat <- f1(dat,year=year)
}
}
data_output_2 <- joined_func_2(dat,1975,1977)
# Error message:
# Error in f1(dat, year = year) : unused argument (year = year).
Questions
a) How do I make purrr::compose work? b) Is there any other way to make this more efficient?
Would really appreciate if someone could help me on this!
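A possible direction, offered as a sketch rather than a tested solution for the real data: purrr::compose() passes extra arguments only to the first function it calls, and every later function receives just the previous result, so yr never reaches the second step. The call also names the argument year while both functions expect yr, which is what the "unused argument" message points to. One way around both issues is to wrap the two steps in a single per-year function and fold it over the years. The version below uses base R only; step_year is a hypothetical name, and it assumes the same column naming (x_<year>, z_<year>) and cut_off as the example.

```r
# Sketch only: step_year is a hypothetical helper, not part of the question.
cut_off <- 3

# One period: derive z_<yr> from x_<yr>, then x_<yr+1> from both.
step_year <- function(dat, yr) {
  x_now <- dat[[paste0("x_", yr)]]
  z_now <- ifelse(x_now > cut_off, "Y", NA_character_)
  dat[[paste0("z_", yr)]] <- z_now
  # "Y" rows are doubled, all other rows are multiplied by 5
  dat[[paste0("x_", yr + 1)]] <- ifelse(is.na(z_now), x_now * 5, x_now * 2)
  dat
}

dat <- data.frame(id = 1:6, x_1975 = seq(1, 6, 1), z_1975 = NA)

# Fold the step over the years; each call receives the previous result.
data_output <- Reduce(step_year, 1975:1977, dat)
data_output
```

With the example data this gives x_1978 values of 20, 40, 60, 32, 40, 48, matching the desired output. Avoiding mutate() and tidy evaluation inside the loop may cut overhead, but whether it is fast enough for 50+ periods and 20 variables would need to be measured.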
I have a function that I've executed many times but is now throwing an error, which I do not understand. I'm trying to apply the function over a list.
I did not write the function and I have limited experience with functions. So, I'm not sure how to troubleshoot the code.
function:
myfun<-function(Year, SampleID, Species, Abundance, resamps) {
library(vegan)
counter<-1
simbaseline<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simbaseline)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simnext<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simnext)<-c("Year", "Jaccard","Horn","Bray","Pearson")
simhind<-data.frame(array(NA,dim=c(length(unique(Year)),5)))
names(simhind)<-c("Year", "Jaccard","Horn","Bray","Pearson")
counter2<-1
# getting vector with number of samples per year
nsamples<-c()
for(y in unique(Year)){
nsamples<-c(nsamples, length(unique(SampleID[Year==y])))
}
t<-1
minsample<-min(nsamples)
for(repeats in 1:resamps){
raref<-data.frame(array(NA,dim=c(1,3)))
names(raref)<-c("Year","Species","Abundance")
for(y in unique(Year)){
#getting samples for this year
samps<-unique(SampleID[Year==y])
# re-sampling samples to equalize number of samples
sam<-as.character(sample(samps,minsample,replace=T))
# getting data that belongs to bootstrapped samples
rarefyear<-data.frame(SampleID[which(SampleID %in% sam & Year == y)],
Species[which(SampleID %in% sam & Year == y)],
Abundance[which(SampleID %in% sam & Year == y)])
names(rarefyear)<-c("SampleID", "Species", "Abundance")
# calculating pooled abundances of each species to store
spabun<-tapply(as.numeric(rarefyear[,3]),rarefyear[,2],sum)
spar<-data.frame(rep(y, length(spabun)),names(spabun),spabun, row.names=NULL)
names(spar)<-c("Year","Species","Abundance")
raref<-rbind(raref,spar)
counter<-counter+1
}
# calculating year by species table of abundance
rareftabtemp<-with(raref,tapply(Abundance,list(Year,Species),function(x)x))
rareftabtemp[is.na(rareftabtemp)]<-0
Pearsoncor<-cor(t(log(rareftabtemp+1)), method="pearson")
# calculating between year similarities (NOT DISTANCE!) with Jaccard, Morisita-Horn, Bray and Pearson correlations
Jacsim<-as.matrix(1-vegdist(rareftabtemp, method="jaccard"))
Hornsim<-as.matrix(1-vegdist(rareftabtemp, method="horn"))
Braysim<-as.matrix(1-vegdist(rareftabtemp, method="bray"))
n<-length(unique(Year))
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
simnext[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[row(Jacsim)-col(Jacsim)==1],
Hornsim[row(Hornsim)-col(Hornsim)==1],
Braysim[row(Braysim)-col(Braysim)==1],
Pearsoncor[row(Pearsoncor)-col(Pearsoncor)==1])
# added hindcasting
simhind[counter2:(counter2+n-2),]<-
cbind(unique(Year)[1:(n-1)],
Jacsim[row(Jacsim) %in% 1:(max(row(Jacsim))-1) &
col(Jacsim)==max(col(Jacsim))],
Hornsim[row(Hornsim)%in%1:(max(row(Hornsim))-1) &
col(Hornsim)==max(col(Hornsim))],
Braysim[row(Braysim)%in%1:(max(row(Braysim))-1) &
col(Braysim)==max(col(Braysim))],
Pearsoncor[row(Pearsoncor)%in%1:(max(row(Pearsoncor))-1) &
col(Pearsoncor)==max(col(Pearsoncor))])
counter2<-counter2+n
}
baselinesim<-data.frame(unique(Year)[2:n],
tapply(simbaseline$Jaccard,simbaseline$Year,mean),
tapply(simbaseline$Horn,simbaseline$Year,mean),
tapply(simbaseline$Bray,simbaseline$Year,mean),
tapply(simbaseline$Pearson,simbaseline$Year,mean))
names(baselinesim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
nextsim<-data.frame(unique(Year)[2:n],
tapply(simnext$Jaccard,simnext$Year,mean),
tapply(simnext$Horn,simnext$Year,mean),
tapply(simnext$Bray,simnext$Year,mean),
tapply(simnext$Pearson,simnext$Year,mean))
names(nextsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
hindcastsim<-data.frame(unique(Year)[1:(n-1)],
tapply(simhind$Jaccard,simhind$Year,mean),
tapply(simhind$Horn,simhind$Year,mean),
tapply(simhind$Bray,simhind$Year,mean),
tapply(simhind$Pearson,simhind$Year,mean))
names(hindcastsim)<-c("Year", "Jaccard","Horn","Bray","Pearson")
a<-list(baselinesim,nextsim,hindcastsim)
return(a)
}
error:
Error in `[<-.data.frame`(`*tmp*`, counter2:(counter2 + n - 2), , value = c(NA, : replacement has 2 items, need 5
Traceback
6. stop(sprintf(ngettext(m, "replacement has %d item, need %d",
       "replacement has %d items, need %d"), m, n * p), domain = NA)
5. `[<-.data.frame`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
       2009), .Dim = 2:1))
4. `[<-`(`*tmp*`, counter2:(counter2 + n - 2), , value = structure(c(NA,
       2009), .Dim = 2:1))
3. myfun(x$Year, x$Bay, x$Species, x$Abundance, 20)
2. FUN(X[[i]], ...)
1. lapply(summer.split, function(x) myfun(x$Year, x$Bay,
       x$Species, x$Abundance, 20))
Again, the function has worked before.
Someone seems to have asked a similar question before, and it was answered by Marat Talipov, but I'm not experienced enough to make sense of the solution.
The answer was:
This error pops up when you're unlucky and i <- runif(n) < 1/2 consists only of FALSE, i.e. no permutations happen. You need to add a check in the swap function to fix this problem.
R error in '[<-.data.frame'... replacement has # items, need #
A subset of my data can be found here:
https://fil.email/sI4Kyhaj
The data was split by "Bay" to generate the list
Note that the function may not throw an error on a different machine because it seems to occur periodically.
The problem is located in the piece of code below (and in the similar indexed reassignments of data frames elsewhere in your code):
simbaseline[counter2:(counter2+n-2),]<-
cbind(unique(Year)[2:n],Jacsim[2:n],Hornsim[2:n],Braysim[2:n],Pearsoncor[2:n])
What happened is that the number of rows you are trying to assign on the left-hand side does not equal the number of rows on the right-hand side. To avoid this, you can use an intermediate data frame filled with, e.g., NAs, and then assign that temporary data frame to your target data frame. Please see the simulated code below for how it can be done:
# simulation
df <- data.frame(i = 1:10, l = letters[1:10], stringsAsFactors = FALSE)
head(df)
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 6 f
# with error
df[1:5, ] <- cbind(1:3, c("a", "b", "c"))
# Error in `[<-.data.frame`(`*tmp*`, 1:5, , value = c("1", "2", "3", "a", :
# replacement has 6 items, need 10
# without error
dftemp_in <- cbind(1:3, c("a", "b", "c"))
dftemp_out <- df[1:5, ]
dftemp_out[] <- NA
dftemp_out[seq(nrow(dftemp_in)), ] <- dftemp_in
df[1:5, ] <- dftemp_out
df
# i l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 <NA> <NA>
# 5 <NA> <NA>
# 6 6 f
# 7 7 g
# 8 8 h
# 9 9 i
# 10 10 j
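A further observation, hedged because I have not run the function on the linked data: the value shown in the traceback, c(NA, 2009), is what unique(Year)[2:n] returns when a list element contains only a single year. In R, 2:n counts down to c(2, 1) when n is 1, so the right-hand side ends up with the wrong shape. The sketch below reproduces that with a made-up years vector and shows a seq_len()-based index that is empty instead; all variable names are illustrative only.

```r
# Made-up input: a group whose rows all fall in one year
years <- c(2009, 2009, 2009)
n <- length(unique(years))   # 1

idx_bad <- 2:n               # counts down: 2 1
from_bad <- unique(years)[idx_bad]   # NA 2009, as in the traceback

idx_ok <- seq_len(n)[-1]     # integer(0): nothing to compare against
from_ok <- unique(years)[idx_ok]     # numeric(0)

# A guard along these lines could skip such groups before calling the function
has_enough_years <- n >= 2   # FALSE here
```

If this is the cause, filtering the list so that every element has at least two distinct years before the lapply() call might avoid the error, though that is an assumption to check against the data.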
I have a data frame called "Region_Data", which I created by applying some functions to my data.
I want to use this data frame as input and subset it with the following function that I wrote. The function should return the subset data frame:
Region_Analysis_Function <- function(Input_Region){
Subset_Region_Data = subset(Region_Data, Region == "Input_Region" )
Subset_Region_Data
}
However, when I create this function and then execute it using:
Region_Analysis_Function("North West")
I get 0 observations when I execute this code (though I know that there are xx number of observations in the data frame.)
I read that there is something called global / local environment, but I'm not really clear on that.
How do I solve this issue? Thank you so much in advance!!
When you try to subset your data using subset(Region_Data, Region == "Input_Region" ), "Input_Region" is interpreted as a string literal rather than being evaluated to the value it represents. This means that unless the column Region in your object Region_Data contains some rows with the literal value "Input_Region", your function will return a zero-row subset. Removing the quotes will solve this, and changing == to %in% will make your function more general. Consider the following data set,
mydf <- data.frame(
x = 1:5,
y = rnorm(5),
z = letters[1:5])
##
R> mydf
x y z
1 1 -0.4015449 a
2 2 0.4875468 b
3 3 0.9375762 c
4 4 -0.7464501 d
5 5 0.8802209 e
and the following 3 functions,
qfoo <- function(Z) {
subset(mydf, z == "Z")
}
foo <- function(Z) {
subset(mydf, z == Z)
}
##
bar <- function(Z) {
subset(mydf, z %in% Z)
}
where qfoo represents the approach used in your question, foo implements the first change I noted, and bar implements both changes.
The latter two functions will work when the input value is a scalar,
R> qfoo("c")
[1] x y z
<0 rows> (or 0-length row.names)
##
R> foo("c")
x y z
3 3 0.9375762 c
##
R> bar("c")
x y z
3 3 0.9375762 c
but only the third will work if it is a vector:
R> foo(c("a","c"))
x y z
1 1 -0.4015449 a
Warning messages:
1: In is.na(e1) | is.na(e2) :
longer object length is not a multiple of shorter object length
2: In `==.default`(z, Z) :
longer object length is not a multiple of shorter object length
##
R> bar(c("a","c"))
x y z
1 1 -0.4015449 a
3 3 0.9375762 c
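Applied to the function in the question, the two changes might look like the sketch below. The contents of Region_Data here are invented for illustration (the real data frame was not posted); only the Region column name and the function name come from the question.

```r
# Invented stand-in for the real Region_Data
Region_Data <- data.frame(
  Region = c("North West", "South East", "North West", "London"),
  Value  = c(10, 20, 30, 40)
)

# Input_Region is used unquoted, so its value is compared, not its name;
# %in% also allows a vector of regions.
Region_Analysis_Function <- function(Input_Region) {
  subset(Region_Data, Region %in% Input_Region)
}

one_region  <- Region_Analysis_Function("North West")
two_regions <- Region_Analysis_Function(c("North West", "London"))
```

Here one_region has two rows and two_regions has three.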
I've got a seemingly simple question that I can't answer: I've got three vectors:
x <- c(1,2,3,4)
weight <- c(5,6,7,8)
y <- c(1,1,1,2,2,2)
I want to create a new vector that replicates the values of weight for each time an element in x matches y such that it produces the following new weight vector associated with y:
y_weight <- c(5,5,5,6,6,6)
Any thoughts on how to do this (either loop or vectorized)? Thanks
You want the match function.
match(y, x)
returns the indices of the matches; then use that to build your new weight vector:
weight[match(y, x)]
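To make the intermediate result visible, here is the same lookup run on the vectors from the question, as a small sketch. The last lines show what happens for a value with no counterpart in x (the y2 vector is made up for illustration): match() returns NA, and so does the lookup.

```r
x <- c(1, 2, 3, 4)
weight <- c(5, 6, 7, 8)
y <- c(1, 1, 1, 2, 2, 2)

idx <- match(y, x)         # position of each y value in x: 1 1 1 2 2 2
y_weight <- weight[idx]    # 5 5 5 6 6 6

# Made-up vector with a value (9) that does not occur in x
y2 <- c(2, 9)
y2_weight <- weight[match(y2, x)]   # 6 NA
```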
#Using plyr
library(plyr)
df<-as.data.frame(cbind(x,weight)) # converting to dataframe
df<-rename(df,c(x="y")) # rename x as y for joining dataframes
y<-as.data.frame(y) # converting to dataframe
mydata <- join(df, y, by = "y",type="right")
> mydata
y weight
1 1 5
2 1 5
3 1 5
4 2 6
5 2 6
6 2 6
I have a dataset like this:
x
A B
1 x 2
2 y 4
3 z 4
4 x 4
5 x 4
6 x 3
......
I want to know which values of column A occur more than a given number of times (for example, 3).
I will probably need to group these values into a temporary table like this:
x y z
4 1 1
and then call another method (which I don't know) that gives me this result:
x
because only the value x occurs more than 3 times in my table.
Can R do this efficiently?
data<-data.frame(factor(c("x","y","z","x","x","x")),c(2,4,4,4,4,3))
To get the count of each letter, do
table(data[,1])
and to get the names of the levels that occur more than 3 times:
names(table(data[,1]))[table(data[,1]) > 3]
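For the six rows shown in the question, the counting approach can be checked with a small sketch. The column names A and B are taken from the question's printout; dat, counts and frequent are illustrative names.

```r
# The six rows printed in the question
dat <- data.frame(
  A = c("x", "y", "z", "x", "x", "x"),
  B = c(2, 4, 4, 4, 4, 3)
)

counts <- table(dat$A)                 # x: 4, y: 1, z: 1
frequent <- names(counts)[counts > 3]  # values of A seen more than 3 times
frequent
```

Storing the table once in counts avoids computing it twice.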
I don't know if I understand you correctly... what is the B column for?
Does this work for you?
set.seed(1234)
A <- sample(c("x", "y", "z"), 20, replace = TRUE)
Ad <- data.frame(table(A))
with(Ad, A[Freq >= 7])
[1] x y