R - Create (Mutate) a new column as a function of past observations - r
OK, so I have a fairly large data set of around 500 observations and 3 variables. The first column refers to time.
For a test data set I am using:
dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
1,1.8,3.5,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")
Time Var1 Var2
1 1 1.0 2.0
2 2 1.8 4.8
3 3 3.5 6.5
4 4 3.8 8.8
5 5 5.6 10.6
6 6 6.2 12.2
7 7 7.8 14.8
8 8 8.2 16.2
9 9 9.8 18.8
10 10 10.1 20.1
What I need to do is create a new column in which each observation is the slope, with respect to time, of some past points. For example, taking the 3 most recent points, it would be something like:
slopeVar1[i]=slope(Var1[(i-2):i],Time[(i-2):i]) #Not real code
slopeVar2[i]=slope(Var2[(i-2):i],Time[(i-2):i]) #Not real code
Time Var1 Var2 slopeVar1 slopeVar2
1 1 1 2 NA NA
2 2 1.8 4.8 NA NA
3 3 3.5 6.5 1.25 2.25
4 4 3.8 8.8 1.00 2.00
5 5 5.6 10.6 1.05 2.05
6 6 6.2 12.2 1.20 1.70
7 7 7.8 14.8 1.10 2.10
8 8 8.2 16.2 1.00 2.00
9 9 9.8 18.8 1.00 2.00
10 10 10.1 20.1 0.95 1.95
I actually got as far as using a for() loop, but for really large data sets (>100,000 rows) it starts taking too long.
The for() loop that I used is shown below:
#CREATE DATA FRAME
rm(dat)
dat=as.data.frame(matrix(c(1,2,3,4,5,6,7,8,9,10,
1,1.8,3.333,3.8,5.6,6.2,7.8,8.2,9.8,10.1,
2,4.8,6.5,8.8,10.6,12.2,14.8,16.2,18.8,20.1),10,3))
colnames(dat)=c("Time","Var1","Var2")
dat
plot(dat)
#CALCULATE SLOPE OF n POINTS FROM i TO i-n.
#In this case I am taking just 3 points, but it should
#be possible to change the number of points taken.
attach(dat)
n=3 #number for points to take slope
l=dim(dat[1])[1] #number of iterations
y=0
x=0
slopeVar1=NA
slopeVar2=NA
for (i in 1:l) {
if (i<n) {slopeVar1[i]=NA} #For the rows where there are not enough previous observations, it outputs NA
if (i>=n) {
y1=Var1[(i-n+1):i] #y data sets for calculating slope of Var1
y2=Var2[(i-n+1):i]#y data sets for calculating slope of Var2
x=Time[(i-n+1):i] #x data sets for calculating slope of Var1&Var2
z1=lm(y1~x) #Temporal value of slope of Var1
z2=lm(y2~x) #Temporal value of slope of Var2
slope1=as.data.frame(z1[1]) #Temporal value of slope of Var1
slopeVar1[i]=slope1[2,1] #Populating string of slopeVar1
slope2=as.data.frame(z2[1])#Temporal value of slope of Var2
slopeVar2[i]=slope2[2,1] #Populating string of slopeVar2
}
}
slopeVar1 #Checking results.
slopeVar2
(result=cbind(dat,slopeVar1,slopeVar2)) #Binds original data with new calculated slopes.
This code actually outputs what I want; but again, for really large data sets it is quite inefficient.
This quick rollapply implementation seems to speed it up somewhat:
library("zoo")
slope_func = function(period) {
y1=period[,2] #y values for calculating slope of Var1
y2=period[,3] #y values for calculating slope of Var2
x=period[,1] #x values (Time) for calculating slope of Var1&Var2
z1=lm(y1~x) #Linear fit for Var1 in this window
z2=lm(y2~x) #Linear fit for Var2 in this window
c(slopeVar1=unname(coef(z1)[2]), slopeVar2=unname(coef(z2)[2])) #Return both slopes for this window
}
start = Sys.time()
rollapply(dat[1:3], FUN=slope_func, width=3, by.column=FALSE)
end=Sys.time()
print(end-start)
Time difference of 0.04980111 secs
The OP's previous implementation took 0.2666121 secs for the same data.
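If calling lm() in every window is still too slow, the slope can also be computed from rolling sums, because the least-squares slope of n points is (n*Sxy - Sx*Sy)/(n*Sxx - Sx^2). What follows is only a sketch in base R under that assumption: roll_slope is a made-up helper name, not something from zoo or any other package, and the data is the 10-row test set from the question.

```r
# roll_slope is a hypothetical helper, not part of any package.
# It returns the least-squares slope of y on x over the trailing n points.
roll_slope <- function(x, y, n = 3) {
  # Trailing sum over the last n values; the first n-1 entries are NA.
  rsum <- function(v) as.numeric(stats::filter(v, rep(1, n), sides = 1))
  sx  <- rsum(x)
  sy  <- rsum(y)
  sxy <- rsum(x * y)
  sxx <- rsum(x * x)
  (n * sxy - sx * sy) / (n * sxx - sx^2)
}

dat <- data.frame(
  Time = 1:10,
  Var1 = c(1, 1.8, 3.5, 3.8, 5.6, 6.2, 7.8, 8.2, 9.8, 10.1),
  Var2 = c(2, 4.8, 6.5, 8.8, 10.6, 12.2, 14.8, 16.2, 18.8, 20.1)
)
dat$slopeVar1 <- roll_slope(dat$Time, dat$Var1, n = 3)
dat$slopeVar2 <- roll_slope(dat$Time, dat$Var2, n = 3)
print(dat)
```

Nothing is fitted per row here, so the cost is a handful of vectorised passes over the data. One caveat: with very large Time values the subtraction in the formula can lose precision, so centring Time first may be worth considering.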
Related
Make a loop function by including gender into the algorithm
I have the following data set:
Age<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
R<-c(2,2.1,2.2,3.4,3.5,4.2,4.7,4.8,5,5.6,NA, 5.9, NA)
sex<-c(1,0,1,1,1,1,1,0,0,0,NA, 0,1)
df1<-data.frame(Age,R,sex)
# Second dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.F<-data.frame(Age2, Mspline)
# Third dataset:
Age2<-seq(2,20,0.25)
Mspline<-rnorm(73)
df2.M<-data.frame(Age2, Mspline)
I was wondering how I can include gender in the calculation and combine these two algorithms into a single loop function. What I need is:
If sex=1 then use the following function to calculate Time
last = dim(df2.F)[1]
fM.F<-approxfun(df2.F$Age2, df2.F$Mspline, yleft = df2.F$Mspline[1] , yright = df2.F$Mspline[last])
df1$Time<-fM.F(df1$Age)
and if sex=0 then use this function to calculate Time
last = dim(df2.M)[1]
fM.M<-approxfun(df2.M$Age2, df2.M$Mspline, yleft = df2.M$Mspline[1] , yright = df2.M$Mspline[last])
df1$Time<-fM.M(df1$Age)
I mean: read the first record in df1; if it is female (say, with age=4.1) then time=fM.F(4.1), but if the gender is male then apply fM.M to its age, so time=fM.M(4.1).
You can create a function that takes the Age vector, the sex value, and the male- and female-specific data frames, and selects the frame to use based on the sex value.
f <- function(age, s, m,f) {
  if(is.na(s)) return(NA)
  if(s==0) df = m else df = f
  last = dim(df)[1]
  fM<-approxfun(df$Age2, df$Mspline, yleft = df$Mspline[1] , yright = df$Mspline[last])
  fM(age)
}
Now, just apply the function by group, using pull(cur_group(),sex) to get the sex value for the current group.
library(dplyr)
df1 %>%
  group_by(sex) %>%
  mutate(time = f(Age, pull(cur_group(),sex), df2.M, df2.F))
Output:
     Age     R   sex   time
   <dbl> <dbl> <dbl>  <dbl>
 1   2     2       1 -0.186
 2   2.1   2.1     0  1.02
 3   2.2   2.2     1 -1.55
 4   3.4   3.4     1 -0.461
 5   3.5   3.5     1  0.342
 6   4.2   4.2     1 -0.560
 7   4.7   4.7     1 -0.114
 8   4.8   4.8     0  0.247
 9   5     5       0 -0.510
10   5.6   5.6     0 -0.982
11  NA    NA      NA NA
12   5.9   5.9     0 -0.231
13  NA    NA       1 NA
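For comparison, the same selection can be sketched in base R without grouping: build both interpolation functions once and pick between them row by row with ifelse(). This is only a sketch under a few assumptions: set.seed(1) is added purely to make the random Mspline values repeatable, and rule = 2 is used as a shorthand for clamping to the first and last Mspline values, which is what the yleft/yright arguments in the question do.

```r
set.seed(1)  # assumption: added only so the random splines are repeatable

df1 <- data.frame(
  Age = c(2, 2.1, 2.2, 3.4, 3.5, 4.2, 4.7, 4.8, 5, 5.6, NA, 5.9, NA),
  sex = c(1, 0, 1, 1, 1, 1, 1, 0, 0, 0, NA, 0, 1)
)
df2.F <- data.frame(Age2 = seq(2, 20, 0.25), Mspline = rnorm(73))
df2.M <- data.frame(Age2 = seq(2, 20, 0.25), Mspline = rnorm(73))

# rule = 2 returns the nearest end value outside the range of Age2.
fM.F <- approxfun(df2.F$Age2, df2.F$Mspline, rule = 2)
fM.M <- approxfun(df2.M$Age2, df2.M$Mspline, rule = 2)

# sex == 1 uses the female curve, sex == 0 the male curve, as in the question.
# Rows with a missing sex or a missing Age come out as NA.
df1$Time <- ifelse(df1$sex == 1, fM.F(df1$Age), fM.M(df1$Age))
print(df1)
```

Both curves are evaluated for every row and one result is then discarded, which is wasteful for large inputs but keeps the code to a single vectorised line.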
Repeatedly apply a conditional summary to groups in a dataframe
I have a large dataframe that looks like this:
group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
...
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32
...
The dataframe is already sorted by group_id and then distance. I want to know the efficient dplyr or data.table equivalent of the following operations:
Within each group_id:
Let the unique and sorted values of distance within the current group_id be d1,d2,...,d_n.
For each d in d1,d2,...,d_n:
Compute some function f on all values of metric whose distance value is less than d. The function f is a custom user-defined function that takes in a vector and returns a scalar. Assume that the function f is well defined on an empty vector.
So, in the example above, the desired dataframe would look like:
group_id distance_less_than metric
1 1.1 f(empty vector)
1 1.7 f(0.85, 0.37)
1 2.3 f(0.85, 0.37, 0.93)
...
1 7.9 f(0.85, 0.37, 0.93, 0.45,...,0.29)
2 2.5 f(empty vector)
2 2.8 f(0.78)
...
Notice how distance values can be repeated, like the value 1.1 under group 1. In such cases, both of the rows should be excluded when the distance is less than 1.1 (in this case this results in an empty vector).
A possible approach is to use the non-equi join available in data.table. The left table is the unique set of combinations of group_id and distance, and the right table supplies all the rows whose distance is less than the left table's distance.
f <- sum
DT[unique(DT, by=c("group_id", "distance")), on=.(group_id, distance<distance), allow.cartesian=TRUE,
    f(metric), by=.EACHI]
output:
   group_id distance   V1
1:        1      1.1   NA
2:        1      1.7 1.22
3:        1      2.3 2.15
4:        1      6.3 2.60
5:        1      7.9 2.89
6:        2      2.5   NA
7:        2      2.8 0.78
data:
library(data.table)
DT <- fread("group_id distance metric
1 1.1 0.85
1 1.1 0.37
1 1.7 0.93
1 2.3 0.45
1 6.3 0.29
1 7.9 0.12
2 2.5 0.78
2 2.8 0.32")
I don't think this would be faster than the data.table option, but here is one way using dplyr:
library(dplyr)
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .])))
where f is your function. map_dbl expects the return type of the function to be double. If your function has a different return type, you might want to use map_int, map_chr or the like.
If you want to keep only one entry per distance, you can remove the repeats using filter and duplicated:
df %>%
  group_by(group_id) %>%
  mutate(new = purrr::map_dbl(distance, ~f(metric[distance < .]))) %>%
  filter(!duplicated(distance))
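To sanity-check either answer on the small example, a plain base R version of the same definition can serve as a reference. This is only a sketch, not a fast solution: it uses f <- sum as in the data.table answer, the column name value is made up, and it calls f on a genuinely empty vector for the smallest distance in each group, so those rows give sum(numeric(0)) = 0 here.

```r
DT <- data.frame(
  group_id = c(1, 1, 1, 1, 1, 1, 2, 2),
  distance = c(1.1, 1.1, 1.7, 2.3, 6.3, 7.9, 2.5, 2.8),
  metric   = c(0.85, 0.37, 0.93, 0.45, 0.29, 0.12, 0.78, 0.32)
)
f <- sum  # stand-in for the user-defined function

# For every group, apply f to the metrics with a strictly smaller distance.
res <- do.call(rbind, lapply(split(DT, DT$group_id), function(g) {
  d <- unique(g$distance)
  data.frame(
    group_id           = g$group_id[1],
    distance_less_than = d,
    value              = vapply(d, function(di) f(g$metric[g$distance < di]), numeric(1))
  )
}))
rownames(res) <- NULL
print(res)
```

This is quadratic in the group size, so it is best kept for verifying results on a sample rather than for the full dataframe.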
extracting letters from LSD.test output in agricolae
I'm using LSD.test from the agricolae package. Below is a reproducible example:
library('agricolae')
group <- c(1,1,1,2,2,2,3,3,3)
variable <- c(1,2,1.5,10,11,12,22,23,21)
df <- data.frame(cbind(group,variable))
model <- aov(variable~group,data=df)
output <- LSD.test(model,"group",p.adj="bonferroni")
output
I'm getting the output below, which is great:
$statistics
    MSerror Df Mean       CV  t.value      MSD
  0.8035714  7 11.5 7.794969 3.127552 2.289134
$parameters
        test  p.ajusted name.t ntr alpha
  Fisher-LSD bonferroni  group   3  0.05
$means
  variable std r        LCL       UCL Min Max   Q25  Q50   Q75
1      1.5 0.5 3  0.2761907  2.723809   1   2  1.25  1.5  1.75
2     11.0 1.0 3  9.7761907 12.223809  10  12 10.50 11.0 11.50
3     22.0 1.0 3 20.7761907 23.223809  21  23 21.50 22.0 22.50
$comparison
NULL
$groups
  variable groups
3     22.0      a
2     11.0      b
1      1.5      c
attr(,"class")
[1] "group"
I wanted to extract the median and the letter from this output. To extract the median of group 3, for example, I used
output [[5]][[1]][[1]]
which gives this output:
[1] 22
Up to here, everything is fine. Now I need to extract the letter as well. I tried the following code:
output [[5]][[2]][[1]]
[1] a
Levels: a b c
My question is: is there any way to get rid of the Levels: a b c statement in the output and get only the letter? Many thanks in advance.
as.character(output [[5]][[2]][[1]])
Solved it, thanks to @Tim Biegeleisen's comment.
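The reason this works can be seen without agricolae: the Levels line is just how R prints a factor, and as.character() turns the element into a plain string. A tiny sketch with a made-up factor (grp stands in for the groups column in the output above, which is an assumption about how it is stored):

```r
# grp is a stand-in for the letters column; the real one comes from LSD.test.
grp <- factor(c("a", "b", "c"))

print(grp[1])                   # a factor element prints with a "Levels:" line
letter <- as.character(grp[1])  # a plain character string, no levels attached
print(letter)
```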
Save every iteration of for loop
I have asked this question before and solved the problem with Saga's help. I am working on a simulation study. I have to reorganize my results and continue with the analysis. I have a data matrix that contains many results, like this:
> data
It S X Y F
1 1 0.5 0.8 2.39
1 2 0.3 0.2 1.56
2 1 1.56 2.13 1.48
3 1 2.08 1.05 2.14
3 2 1.56 2.04 2.45
.......
It is the iteration.
S is the second iteration running inside It.
X is the X coordinate obtained from a method.
Y is the Y coordinate obtained from a method.
F is the F statistic.
My problem is that I have to find the minimum F value for every iteration. So I have to store every iteration in a different matrix or data frame and find the minimum F value. I have tried many things, but none worked. Any help or ideas will be appreciated.
EDIT: Updated table information
This was the solution:
library(dplyr)
data %>% group_by(It) %>% slice(which.min(F))
# A tibble: 3 x 5
# Groups: It [3]
It S X Y F
1 1 2 0.30 0.20 1.56
2 2 1 1.56 2.13 1.48
3 3 1 2.08 1.05 2.14
However, I will continue with another for loop, and I want to select the X values that satisfy the conditions above. For example, when I use
data$X[i]
this code doesn't select the values of X (0.30, 1.56, 2.08). It selects the original values from "data" before grouping. How can I solve this problem?
I hope this is what you are expecting:
> library(dplyr)
> data %>% group_by(It) %>% slice(which.min(F))
# A tibble: 3 x 5
# Groups:   It [3]
     It     S     X     Y     F
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2  0.30  0.20  1.56
2     2     1  1.56  2.13  1.48
3     3     1  2.08  1.05  2.14
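On the follow-up about data$X[i] still returning the original values: the pipeline above only prints its result and leaves data unchanged, so the filtered rows have to be stored in a new object before indexing. A base R sketch of that idea, using the five rows shown in the question (best is a made-up name for the stored result):

```r
data <- data.frame(
  It = c(1, 1, 2, 3, 3),
  S  = c(1, 2, 1, 1, 2),
  X  = c(0.5, 0.3, 1.56, 2.08, 1.56),
  Y  = c(0.8, 0.2, 2.13, 1.05, 2.04),
  F  = c(2.39, 1.56, 1.48, 2.14, 2.45)
)

# Keep the row with the smallest F within each iteration, and store the result.
best <- do.call(rbind, lapply(split(data, data$It), function(d) d[which.min(d$F), ]))
rownames(best) <- NULL

# Index the stored result, not the original data.
for (i in seq_len(nrow(best))) {
  print(best$X[i])
}
```

The same applies to the dplyr version: assigning its result, for example best <- data %>% group_by(It) %>% slice(which.min(F)), and then using best$X[i] should behave the same way.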
Find where species accumulation curve reaches asymptote
I have used the specaccum() command to develop species accumulation curves for my samples. Here is some example data:
site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)
species_counts<-rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)
accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)
In order to ensure I have sampled sufficiently, I need to make sure the curve of the species accumulation plot reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).
results <- with(accum, data.frame(sites, richness, sd))
Produces this:
   sites richness        sd
1      1     3.46 0.9991916
2      2     4.94 1.6625403
3      3     5.94 1.7513054
4      4     7.05 1.6779918
5      5     8.03 1.6542263
6      6     8.74 1.6794660
7      7     9.32 1.5497149
8      8     9.92 1.3534841
9      9    10.51 1.0492422
10    10    11.00 0.8408750
11    11    11.35 0.7017295
12    12    11.67 0.4725816
13    13    12.00 0.0000000
I feel like I'm getting there. I could generate an lm of sites vs. richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:
species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
                           5,0,9,0,5,0,0,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,6,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,0,9,0,0,6,6,0,0,0,0,0,
                           5,0,9,0,0,0,6,6,0,0,0,0,
                           5,0,9,0,0,0,0,0,7,0,0,3,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,6,
                           5,0,9,0,0,0,5,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,1,9,0,0,0,0,0,0,0,0,0),
                         byrow=TRUE,nrow=13)
It is always a good idea to call set.seed() before running randomization tests (and to let us know that specaccum is in the vegan package):
set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)
Extract the richness and sites components from within the returned object and compute d(richness)/d(sites). Note that the slope vector is one element shorter than the original sites/richness vectors, so be careful if you're trying to match up slopes with particular numbers of sites.
(slopes <- with(accum,diff(richness)/diff(sites)))
## [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31
In this case, the slope never actually goes below 0.3, so this code for finding the first time that the slope falls below 0.3:
which(slopes<0.3)[1]
returns NA.
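If only the last segment matters (sites 12 to 13, as in the question's definition), the check can be reduced to the final element of the slope vector. A sketch using the richness values printed in the question's results table, so that it runs without vegan; last_slope and reached_asymptote are made-up names, and the numbers will differ from run to run unless a seed is set:

```r
# Richness values copied from the question's results table.
sites    <- 1:13
richness <- c(3.46, 4.94, 5.94, 7.05, 8.03, 8.74, 9.32,
              9.92, 10.51, 11.00, 11.35, 11.67, 12.00)

slopes <- diff(richness) / diff(sites)  # one element shorter than richness
last_slope <- tail(slopes, 1)           # slope between sites 12 and 13
reached_asymptote <- last_slope < 0.3

print(last_slope)
print(reached_asymptote)
```

With these particular values the last slope is 12.00 - 11.67 = 0.33, so the 0.3 threshold is not met, which agrees with the conclusion above.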