I am trying to create new variables in a data frame that represent multiple lags. It currently holds one time series, "series", and I would like to create 10 different variables, each representing a certain lag of "series". So the resulting data frame would have the original variable "series" plus 10 variables named (1, 2, 3, 4, ... 10), each holding that number of lags. I am currently trying this with a for loop:
for (i in 1:max.lag){
lag.death$"i" <- lag(tscampos, i)
}
But after reading here, I suspect I might want to use one of the apply functions? Any ideas?
There you go: this function gives you a lagged version of your series whenever you need it. (I find that better than storing each lagged copy of the same series in 10 different columns.)
lag.death = data.frame(series = floor(runif(10,0,100)));
lag.death$series
lagit4me = function(serie,lag){
n = length(serie);
pad = rep(0,lag);
return(c(pad,serie)[1:n]);
}
lagit4me(lag.death$series,1);
lagit4me(lag.death$series,3);
You can then tweak it to allow negative lags, etc.
(But if you really need it:)
allIn1 = lapply(0:10,lagit4me,serie=lag.death$series);
allIn1 = data.frame(allIn1);
names(allIn1) = 0:10;
allIn1
Enjoy :)
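If you would rather pad with NA than with 0, so that the missing leading values are not mistaken for real observations, here is a minimal base R sketch. The toy vector, the helper name lag_na and the lag_ column prefix are my own assumptions, not part of the question:

```r
# Toy series standing in for the real data (assumption).
series <- c(5, 3, 8, 1, 9, 2)
max.lag <- 3

# Shift x down by k positions, padding the start with NA.
lag_na <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

# One column per lag; sapply returns a matrix with max.lag columns.
lags <- sapply(1:max.lag, function(k) lag_na(series, k))
colnames(lags) <- paste0("lag_", 1:max.lag)

# Original series plus the lagged columns in one data frame.
lag.death <- data.frame(series = series, lags)
lag.death
```

Column names that start with a letter are easier to work with than bare numbers, since lag.death$lag_2 works without backticks.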
You can also use purrr::map(), similar to lapply() above. This uses dplyr::lag(), instead of lagit4me()
library(dplyr)
library(purrr)
num.lags <- 0:10
list.lags <-
purrr::map(
.x = num.lags,
.f = ~ dplyr::lag(series, .x)
)
Note that you need to name the list elements before coercing the list to a data frame (tibble).
chr.lags <- paste0("lag_", num.lags)
names(list.lags) <- chr.lags
tbl.lags <-
dplyr::bind_rows(list.lags)
This produces a tbl with 11 variables: the input variable (lag_0) and 10 lagged variables (with NAs).
print(tbl.lags)
I created the following function that takes 3 numeric parameters: size of longitude (in degrees), size of latitude (in degrees) and year. The function creates squares (grids) of the size given by the first two parameters and then allocates the observations in the dataset over those grids, separated by year (the third parameter). The function is working as intended.
To use the function to construct a 2x2 assemblage (the grid with all the observations in it) for the year 2009, I call:
assemblage_2009 <- CreateAssembleage(2, 2, 2009)
However, I would like to create assemblages iteratively for the years 2009 to 2018.
I tried a for loop with i in 2009:2018 without much success. I also tried lapply, again without much success.
Any ideas from more experienced R users?
The function:
CreateAssembleage <- function(size_long, size_lat, year){
# create a dataset to hold only values with the chosen year
data_grid_year <- dplyr::filter(data_grid, Year == year)
# Create vectors to hold the columns (easier to work with)
Longitude <- data_grid_year$Longitude
Latitude <- data_grid_year$Latitude
dx <- size_long # set up the dimensions (easier to change here than inside the code)
dy <- size_lat
# construct the grids
gridx <- seq(min(Longitude), max(Longitude), by = dx) # the values we discussed for the big square
gridy <- seq(min(Latitude), max(Latitude), by = dy)
# take the data and create 3 new columns (x, y, cell) by finding the specified data inside the constructed grids
grid_year <- data_grid_year %>%
mutate(
x = findInterval(Longitude, gridx),
y = findInterval(Latitude, gridy),
cell = paste(x, y, sep = ",")) %>%
relocate(Sample_Id, Latitude, Longitude, x, y, cell) # bring forward the new columns
### Create the assemblage
data_temp <- grid_year %>%
group_by(cell) %>% # group by the same route id
select(-c(Sample_Id, Latitude, Longitude, Midpoint_Date_Local,
Year, Month, Chlorophyll_Index, x, y)) %>% # remove unneeded columns
summarise(across(everything(), sum)) # calculate the sum
return(data_temp) #return the result
}
Thank you all for any ideas.
I cannot check whether your function works since I don't have any of your data. That said, there are multiple ways to call a function n times and save the output.
Since you didn't specify the problem, I have to assume you are struggling to run the function in a loop and save the output.
I'll also have to assume that, first, your function works and, second, size_long and size_lat are always set to 2. If you have something else in mind, you'll have to make clearer what you want.
Some options:
Create a list with the output using lapply. Note that for the call below to work, you have to give size_long and size_lat default values of 2 in the function definition and make year the first argument.
years <- 2009:2018
results <- lapply(years, CreateAssembleage)
Create a list with the output using a for loop:
results <- list()
for(i in 2009:2018){
results[[paste0("assemblage_", i)]] <- CreateAssembleage(size_long = 2, size_lat = 2, year = i)
}
If need be, create multiple variables, one for each year:
for(i in 2009:2018){
do.call("<-", list(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2,
year = i)))
}
Same as the previous option, but using assign:
for(i in 2009:2018){
assign(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2, year = i))
}
Note that if you want to alter not only year but also the other variables each time, e.g., change size_lat for each iteration, you'll have to use mapply instead of lapply, or, in case of the loops, you'll have to create vectors (or a dataframe) with the other variables as well and adjust your loop.
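As a rough sketch of the last two points: lapply forwards extra named arguments to the function, so the definition does not strictly have to change, and Map (a thin wrapper around mapply) can vary several arguments at once. I don't have the real data, so toy_assemblage below is a hypothetical stand-in with the same argument order as CreateAssembleage, and the size_lats values are made up:

```r
# Hypothetical stand-in for CreateAssembleage(): same arguments, trivial body.
toy_assemblage <- function(size_long, size_lat, year) {
  data.frame(size_long = size_long, size_lat = size_lat, year = year)
}

years <- 2009:2018

# Named extra arguments are passed through by lapply; each element of
# `years` then fills the one remaining argument, `year`.
results <- lapply(years, toy_assemblage, size_long = 2, size_lat = 2)
names(results) <- paste0("assemblage_", years)

# Varying size_lat as well: Map walks over the vectors in parallel
# and recycles the length-one size_long.
size_lats <- rep(c(2, 4), length.out = length(years))
results_varied <- Map(toy_assemblage,
                      size_long = 2, size_lat = size_lats, year = years)
```

With the named list you can then pull out a single year with results[["assemblage_2012"]].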
Edit: As suggested by MrFlick, I changed the order of the options and added the assign-option. Loops are easier to understand for most beginners, but they can be annoyingly slow for large datasets. So it is probably best to get used to lapply.
I am converting the for-loops in my R model, which has multiple input datasets, to apply functions. In the for-loop I use the current loop value to retrieve values from other datasets. I would like to replicate this with an apply function (over the columns of a dataset), but I am struggling to get at the index of the current column inside apply so that I can retrieve the matching values from the other data.
The apply function refers to the column through the function's argument, which is fine, and I have tried to use colname (after naming my columns by number) but have not had any joy. Below is an example dataset and for loop with what I'd like to achieve (simplified somewhat). The length of the vectors and the number of columns in the tabular dataset will always be equal.
iteration<-1:3
df <- data.frame("column1" = 6:10, "column2" = 12:16, "column3" = 31:35)
variable1<-rnorm(3,mean = 25)
variable2<-rnorm(3, mean = 0.21)
outcome<-numeric()
for (i in iteration) {
intermediate<-(mean(df[,i])*variable1[i])^variable2[i]
outcome<-c(outcome,intermediate)
}
outcome
The expected result is outcome above; I am trying to reproduce it with apply.
What I imagine it to be is this:
apply(df, 2, function(x) (mean(x)*variable1[colnumber(x)])^variable2[colnumber(x)]
or perhaps
apply(df, 2, function(x) (mean(x)*variable1[x])^variable2[x])
but these two obviously do not work.
First-time user, so apologies for any etiquette issues, but I found the answer to my own problem using the purrr package. Maybe this helps someone else:
library(purrr)
pmap(list(df, variable1, variable2), function(df, variable1, variable2) (mean(df)*variable1)^variable2)
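For anyone who prefers to stay in base R, here is a small sketch of two equivalents, checked against the original loop. The seed and the outcome_* names are my own additions. The first version iterates over column positions, which gives you the index the question was after; the second walks over the columns and both vectors in parallel, much like pmap:

```r
set.seed(1)  # only so the example is reproducible
df <- data.frame(column1 = 6:10, column2 = 12:16, column3 = 31:35)
variable1 <- rnorm(3, mean = 25)
variable2 <- rnorm(3, mean = 0.21)

# Reference result from the original for loop.
outcome_loop <- numeric(ncol(df))
for (i in seq_along(df)) {
  outcome_loop[i] <- (mean(df[[i]]) * variable1[i])^variable2[i]
}

# Option 1: iterate over column positions, so the index i is available.
outcome_idx <- sapply(seq_along(df), function(i) {
  (mean(df[[i]]) * variable1[i])^variable2[i]
})

# Option 2: iterate over the columns and both vectors in parallel.
outcome_par <- mapply(function(col, v1, v2) (mean(col) * v1)^v2,
                      df, variable1, variable2)
```

Both return a plain numeric vector rather than a list; mapply additionally keeps the column names as element names.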
I have a function in R which returns a list with N columns each of M rows of multiple types - date, numeric and char. I am using sapply to create multiple copies of these lists, which then end up in a top level list. I would like to concatenate the underlying lists together to produce a single list of N columns and M * number of list rows.
I've been trying different combinations of do.call, sapply, rbind, c, etc, but I think I'm missing something pretty fundamental. Below is a simple script that mimics the problem and shows the desired outcome. I've used 3 variables here, but the number of variables is arbitrary.
# Set up test function
testfun <- function(varName)
{
currDate = seq(as.Date('2018-12-31'), as.Date('2019-01-10'), "days")
t1 = runif(11)
t2 = runif(11)
groupNum = c(rep(1,5), rep(2,6))
varName = rep(varName, 11)
dataout= data.frame(currDate, t1, t2, groupNum, varName)
}
# create 3 test variables and run the data
varNames = c('test1', 'test2', 'test3')
tmp = sapply(varNames, testfun)
# I would like it to look like the following, but for any given number of variables
desiredAnswer <- rbind(as.data.frame(tmp[,1]),as.data.frame(tmp[,2]),as.data.frame(tmp[,3]))
The final answer will later be used to create a data table and feed ggplot with the varName as facets.
I'm happy to use any method to get the desired results, there's no reason the function needs to produce a list instead of say a data.frame. I'm certain I'm doing something dumb, any help appreciated.
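One possible approach, as a sketch rather than a definitive answer: sapply tries to simplify its result, which here turns the data frames into a matrix of list cells. lapply keeps each data frame intact, and do.call(rbind, ...) then stacks however many there are. The name pieces is my own:

```r
# Same test function as in the question, returning the data frame directly.
testfun <- function(varName) {
  data.frame(
    currDate = seq(as.Date("2018-12-31"), as.Date("2019-01-10"), "days"),
    t1 = runif(11),
    t2 = runif(11),
    groupNum = c(rep(1, 5), rep(2, 6)),
    varName = rep(varName, 11)
  )
}

varNames <- c("test1", "test2", "test3")

# lapply returns a plain list of data frames (no simplification).
pieces <- lapply(varNames, testfun)

# rbind all list elements in one call, for any number of variables.
desiredAnswer <- do.call(rbind, pieces)
```

The result is an ordinary data frame with one row per date and variable, so varName can be used directly for faceting.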
I am trying to compare multiple columns in two different dataframes in R. This has been addressed previously on the forum (Compare group of two columns and return index matches R), but this is a different scenario: I am trying to check whether a column in dataframe 1 falls within the range given by 2 columns in dataframe 2. Functions like match, merge, join, intersect won't work here. I have been trying to use purrr::pluck but didn't get far. The dataframes are of different sizes.
Below is an example:
temp1.df <- mtcars
temp2.df <- data.frame(
Cyl = sample (4:8, 100, replace = TRUE),
Start = sample (1:22, 100, replace = TRUE),
End = sample (1:22, 100, replace = TRUE)
)
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df$Cyl <- as.character(temp2.df$Cyl)
My attempt:
temp1.df <- temp1.df %>% mutate (new_mpg = case_when (
temp1.df$cyl %in% temp2.df$Cyl & temp2.df$Start <= temp1.df$mpg & temp2.df$End >= temp1.df$mpg ~ 1
))
Error:
Error in mutate_impl(.data, dots) :
Column `new_mpg` must be length 32 (the number of rows) or one, not 100
Expected Result:
Compare temp1.df$cyl and temp2.df$Cyl. If they match, then -->
Check if temp1.df$mpg is between temp2.df$Start and temp2.df$End -->
if it is, then create a new variable new_mpg with value of 1.
It's hard to show the exact expected output here.
I realize I could loop over each row of temp1.df, but the original temp2.df has over 250,000 rows. An efficient solution would be much appreciated.
Thanks
temp1.df$new_mpg<-apply(temp1.df, 1, function(x) {
temp<-temp2.df[temp2.df$Cyl==x[2],]
ifelse(any(apply(temp, 1, function(y) {
dplyr::between(as.numeric(x[1]),as.numeric(y[2]),as.numeric(y[3]))
})),1,0)
})
Note that this makes some assumptions about the organization of your actual data. In particular, I can't refer to the column names within apply, so I'm using indexes, which may very well change. You might want to rearrange your data between receiving it and calling apply, or select the columns explicitly in the call, e.g., apply(temp1.df[,c("mpg","cyl")], ...).
At any rate, this breaks your data set into rows, and each row is compared to the subset of the second dataset with the same Cyl value. Within this subset, it checks whether the mpg for this row falls between Start and End in any row (using between from dplyr), and returns 1 if yes (or 0 if no). All these ones and zeros are then returned as a (named) vector, which can be placed into temp1.df$new_mpg.
I'm guessing there's a way to do this with rowwise, but I could never get it to work properly...
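A possible variant that avoids the nested apply and the positional indexes, offered as a sketch under the same assumptions about the data: split temp2.df by Cyl once, then do one vectorised range check per row of temp1.df. The helper name in_any_range and the seed are my own, and I have not timed this on 250,000 rows:

```r
set.seed(1)  # only so the example is reproducible
temp1.df <- mtcars
temp1.df$cyl <- as.character(temp1.df$cyl)
temp2.df <- data.frame(
  Cyl   = as.character(sample(4:8, 100, replace = TRUE)),
  Start = sample(1:22, 100, replace = TRUE),
  End   = sample(1:22, 100, replace = TRUE)
)

# Split the ranges by Cyl once, so each lookup only sees matching rows.
ranges <- split(temp2.df[c("Start", "End")], temp2.df$Cyl)

# 1 if mpg lies inside at least one Start..End range for this cyl, else 0.
in_any_range <- function(cyl, mpg) {
  r <- ranges[[cyl]]
  if (is.null(r)) return(0)  # no ranges at all for this cyl value
  as.numeric(any(r$Start <= mpg & mpg <= r$End))
}

temp1.df$new_mpg <- mapply(in_any_range, temp1.df$cyl, temp1.df$mpg,
                           USE.NAMES = FALSE)
```

Because the columns are passed by name, reordering the columns of either data frame does not change the result.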
I am trying to apply a series of simple functions to many sequentially labelled variables AND bind the newly created variables to the same data frame. I managed to do the first part (largely with the help of a previous answer) but not the second part.
dat <- data.frame(x1=sample(c(0:1)), av1 = sample(10) , av2 = sample(10) , av3 = sample(10),av4=sample(10))
dat$t1<-ifelse(dat$x1==1,dat$av1*2/7,dat$av1*5/7)
dat$t2<-ifelse(dat$x1==1,dat$av2*2/7,dat$av2*5/7)
dat$t3<-ifelse(dat$x1==1,dat$av3*2/7,dat$av3*5/7)
dat$t4<-ifelse(dat$x1==1,dat$av4*2/7,dat$av4*5/7)
dat
Basically, I would like to repeat these ifelse statements over all of av1, av2, av3, ... to create corresponding variables labelled tu1, tu2, tu3, ... without re-typing the function each time. For example:
dat <- cbind(dat, sapply(dat[grep("av", names(dat))], function(col) { ifelse(dat$x1==0, col*2/7, col*5/7) } ) )
However, now all the new variables are also labelled as av. I guess I can change the names of columns afterwards, e.g.:
names( dat)[10:13] <- gsub("av", "tu", names(dat)[10:13])
Because I keep adding/removing variables beforehand in my code those column numbers keep changing. Is there a way for me to create, attach and relabel new variables simultaneously? Or is there a better way of applying the same function over sequentially labelled variables?
You might try something like this:
out <- ifelse(matrix(dat$x1,nrow(dat),sum(grepl("av",colnames(dat)))) == 1,
as.matrix(dat[,grepl("av",colnames(dat))]) * 2 / 7,
as.matrix(dat[,grepl("av",colnames(dat))]) * 5 / 7)
colnames(out) <- paste0("tu",seq_len(ncol(out)))
That's a bit more compact than it needs to be, since I'm doing all the coercion in one go. It might be clearer to extract the piece of dat that you need, and create the indicator matrix separately.
Another option would be to melt your data frame and operate on it by groups and then recast it to wide format.
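If the goal is to create, attach and rename in one step without relying on column positions, something along these lines might do. It is only a sketch: the seed, the larger x1 sample and the av_cols / tu_cols names are my own, and I am using the x1 == 1 rule from the first block of the question:

```r
set.seed(42)  # only so the example is reproducible
dat <- data.frame(x1 = sample(0:1, 10, replace = TRUE),
                  av1 = sample(10), av2 = sample(10),
                  av3 = sample(10), av4 = sample(10))

# Find the source columns by name and derive the new names from them.
av_cols <- grep("^av", names(dat), value = TRUE)
tu_cols <- sub("^av", "tu", av_cols)

# Assigning a list to dat[tu_cols] creates all new columns at once,
# already carrying the right names.
dat[tu_cols] <- lapply(dat[av_cols], function(col) {
  ifelse(dat$x1 == 1, col * 2 / 7, col * 5 / 7)
})
```

Since the columns are matched by name, adding or removing other variables beforehand does not break the renaming.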