For Loop Function in R - r

I have been struggling to figure out why I am not returning the correct values to my data frame from my function. I want to loop through a vector of my data frame and create a new column by a calculation within the vector's elements. Here's what I have:
# x will be the data frame's vector
y <- function(x){
new <- c()
for (i in x){
new <- c(new, x[i] - x[i+1])
}
return (new)
}
So here I want to create a new vector that returns the next element subtracted from current element. Now, when I apply it to my data frame
df$new <- lapply(df$I, y)
I get all NAs. I know I'm missing something completely obvious...
Also, how would I execute the function that resets itself if df$ID changes so I am not subtracting elements from two different df$IDs? For example, my data frame will have
ID I Order new
1001 5 1 1
1001 6 2 -2
1001 4 3 -2
1001 2 4 NA
1005 2 1 6
1005 8 2 0
1005 8 3 -2
1005 6 4 NA
Thanks!

Avoid the loop and use diff. Everything is vectorized here so it's easy.
df$new <- c(diff(df$I), NA)
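As a minimal sketch of why the original loop misbehaves: `for (i in x)` iterates over the values of `x` rather than its positions, and `lapply` calls the function once per element. The example below uses a hypothetical helper, `next_minus_current`, and hypothetical sample values taken from the first ID group. It loops over positions and confirms that the result matches `diff`.

```r
# Hypothetical sample values (first ID group from the question)
x <- c(5, 6, 4, 2)

# Loop over positions, not values; the last element has no successor, so it stays NA
next_minus_current <- function(x) {
  out <- rep(NA_real_, length(x))
  for (i in seq_len(max(length(x) - 1, 0))) {
    out[i] <- x[i + 1] - x[i]
  }
  out
}

loop_result <- next_minus_current(x)
diff_result <- c(diff(x), NA)
print(loop_result)
```

Call the function on the whole column (`next_minus_current(df$I)`), not through `lapply`, so that it sees the full vector.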
But I don't understand your example result. Why are some 0 values changed to NA and some are not? And shouldn't 8-2 be 6 and not -6? I think that needs to be clarified.
If the 0 values need to be changed to NA, just do the following after the above code.
df$new[df$new == 0] <- NA
A one-liner of the complete process, that returns the new data frame, can be
within(df, { new <- c(diff(I), NA); new[new == 0] <- NA })
Update: In response to your comments below, here is my updated answer.
> M <- do.call(rbind, Map(function(x) { x$z <- c(diff(x$I), NA); x },
split(dat, dat$ID)))
> rownames(M) <- NULL
> M
ID I Order z
1 1001 5 1 1
2 1001 6 2 -2
3 1001 4 3 -2
4 1001 2 4 NA
5 1005 2 1 6
6 1005 8 2 0
7 1005 8 3 -2
8 1005 6 4 NA
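A shorter base R variant of the same per-ID idea can be sketched with `ave`, which applies a function within each group and returns the results in the original row order. The data frame below is rebuilt from the question's table; the column name `new` is just the one used in the question.

```r
# Data rebuilt from the question's example table
df <- data.frame(
  ID    = rep(c(1001, 1005), each = 4),
  I     = c(5, 6, 4, 2, 2, 8, 8, 6),
  Order = rep(1:4, times = 2)
)

# Within each ID: next value minus current value, NA for the last row of the group
df$new <- ave(df$I, df$ID, FUN = function(v) c(diff(v), NA))
print(df)
```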

The dplyr library makes it very easy to do things separately for each level of a grouping variable, in your case ID. We can use diff as @Richard Scriven recommends, and use dplyr::mutate to add a new column.
> library(dplyr)
> df %>% group_by(ID) %>% mutate(new2 = c(diff(I), NA))
Source: local data frame [8 x 5]
Groups: ID
ID I Order new new2
1 1001 5 1 1 1
2 1001 6 2 -2 -2
3 1001 4 3 -2 -2
4 1001 2 4 NA NA
5 1005 2 1 6 6
6 1005 8 2 0 0
7 1005 8 3 -2 -2
8 1005 6 4 NA NA

Rather than a loop, you would be better off using a vector version of the math. The exact indices will depend on what you want to do with the last value... (Note this line is not placed into your for loop, but just gives the result.)
df$new = c(df$I[-1],NA) - df$I
Here you will be subtracting the original df$I from a shifted version that omits the first value [-1] and appends a NA at the end.
EDIT per comments: If you don't want to subtract across df$ID, you can blank out that subset of cells after subtraction:
df$new[df$ID != c(df$ID[-1],NA)] = NA
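The shift-and-subtract approach with the ID boundary blanked out can be sketched end to end as follows. The data are rebuilt from the question's table, and the helper vector `last_of_id` is a hypothetical name introduced here; it marks the last row of each ID explicitly so the logical index contains no NA.

```r
# Data rebuilt from the question's example table
df <- data.frame(
  ID = rep(c(1001, 1005), each = 4),
  I  = c(5, 6, 4, 2, 2, 8, 8, 6)
)

# Next value minus current value, ignoring ID for the moment
df$new <- c(df$I[-1], NA) - df$I

# TRUE on the last row of each ID (the final row is always a group end)
last_of_id <- c(df$ID[-1] != df$ID[-nrow(df)], TRUE)

# Blank out differences that would cross an ID boundary
df$new[last_of_id] <- NA
print(df)
```

This assumes the rows are already sorted so that each ID forms one contiguous block.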

Related

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.60
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dplyr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works, but only if I need to run it once on the last observation and only if I apply it to one group. I have 209 groups and many days that I would need to run this over. I am not sure how to combine the first for loop with the division function, and I am also totally lost as to how to do this by group and move on to the next last value. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a dataframe with two columns, divres and ind. The latter is the row index that will be used to calculate obs_n, as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of division function created above using data$Num as input
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the dataframe data, use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired dataframe data2:
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
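Note that division(data$Num) above is applied to the whole column, so it relies on no value in one group being at least double a value in an earlier group. A minimal base R sketch that applies the same search strictly within each group might look like the following. The helper name `doubling_one_group` is hypothetical and not part of the answer above; the data are the question's.

```r
data <- data.frame(
  Group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  Days  = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4),
  Num   = c(10, 12, 23, 30, 34, 40, 50, 60, 2, 4, 8, 12)
)

# For one group: find the most recent earlier row whose Num is at most
# half of the current Num, and record the ratio and that row's day
doubling_one_group <- function(g) {
  n <- nrow(g)
  divres <- rep(NA_real_, n)
  obs_n <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    hits <- which(g$Num[i] / g$Num[seq_len(i - 1)] >= 2)
    if (length(hits) > 0) {
      j <- max(hits)
      divres[i] <- g$Num[i] / g$Num[j]
      obs_n[i] <- g$Days[j]
    }
  }
  cbind(g, divres = divres, obs_n = obs_n)
}

# Split by Group, process each piece, and stack the pieces back together
result <- do.call(rbind, lapply(split(data, data$Group), doubling_one_group))
rownames(result) <- NULL
print(result)
```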
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA  1  2  2  3  4
Compare with rank (default ties.method="average"). Note that NAs are ranked last by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0
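Another way to turn a day back into a row position, assuming each day appears at most once per group, is `match`, which looks the value up directly instead of ranking it. The sketch below uses hypothetical data for a single group with gaps in `Days`; the `obs_n` values are written out by hand for these numbers.

```r
# Hypothetical single group with non-consecutive days
Days <- c(1, 3, 4, 8)
Num  <- c(10, 12, 25, 30)

# Day of the most recent earlier value that is at most half the current one:
# 25/12 >= 2 and 30/12 >= 2, so both point at day 3
obs_n <- c(NA, NA, 3, 3)

# Position of each obs_n day within Days (NA stays NA)
rows <- match(obs_n, Days)
divres <- Num / Num[rows]
print(rows)
print(divres)
```

Inside a grouped `mutate`, the equivalent expression would be `Num / Num[match(obs_n, Days)]`.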

Comparing items in a list to a dataset in R

I have a large dataset (8,000 obs) and about 16 lists with anywhere from 120 to 2,000 items. Essentially, I want to check to see if any of the observations in the dataset match an item in a list. If there is a match, I want to include a variable indicating the match.
As an example, if I have data that look like this:
dat <- as.data.frame(1:10)
list1 <- c(2:4)
list2 <- c(7,8)
I want to end with a dataset that looks something like this
Obs Var List
1 1
2 2 1
3 3 1
4 4 1
5 5
6 6
7 7 2
8 8 2
9 9
10 10
How do I go about doing this? Thank you!
Here is one way to do it using boolean sum and %in%. If several match, then the last one is taken here:
dat <- data.frame(Obs = 1:10)
list_all <- list(c(2:4), c(7,8))
present <- sapply(1:length(list_all), function(n) dat$Obs %in% list_all[[n]]*n)
dat$List <- apply(present, 1, FUN = max)
dat$List[dat$List == 0] <- NA
dat
> dat
Obs List
1 1 NA
2 2 1
3 3 1
4 4 1
5 5 NA
6 6 NA
7 7 2
8 8 2
9 9 NA
10 10 NA
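An alternative sketch, assuming it is acceptable to keep the first matching list rather than the last, builds one lookup table from all lists and uses `match`. The names `lookup`, `value`, and `list_id` are hypothetical and introduced only for this example.

```r
dat <- data.frame(Obs = 1:10)
list_all <- list(2:4, c(7, 8))

# One row per list item, tagged with the index of the list it came from
lookup <- data.frame(
  value   = unlist(list_all),
  list_id = rep(seq_along(list_all), lengths(list_all))
)

# match() returns the first position found, or NA when there is no match
dat$List <- lookup$list_id[match(dat$Obs, lookup$value)]
print(dat)
```

With 16 lists this avoids building an intermediate logical matrix, at the cost of reporting only one list per observation.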

Replacing Value in Next Row with Output from Function Before Applying Function to Next Row

I'm looking at trying to apply a function in R to each row, while updating each row with the output of the function from the previous row. I know that's a mouthful, but here's an example. Let's say I had dataframe, df:
df<- data.frame(a=c(10,15,20,25,30), b=c(2,4,5,7,10))
And I had a function, funR, that just took the difference between column a and column b:
funR<- function(argA, argB){
c<- argA-argB
return(c)
}
Now a simplified version of what I'd be going for is let's say I apply the function to the first row and get 10 - 2 = 8. I would then want to replace the second row of column a with this output before applying the function to that row, so instead of 15 - 4 I'd be doing 8 - 4. I would then replace 20 in row 3 with 4, and so on and so on.
Edit to show expected output:
a b
1 10 2
2 8 4
3 4 5
4 -1 7
5 -8 10
Any help would be greatly appreciated!
This is really a one-liner in base R:
Method 1:
for (i in 1:(nrow(df) - 1)) df$a[i + 1] <- df$a[i] - df$b[i];
df;
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
Here we implement the recursion relation a[i+1] = a[i] - b[i] in a simple for loop. The for loop will be very fast, as we directly overwrite existing entries in df.
Method 2
Or alternatively:
df$a <- df$a[1] - cumsum(c(0, df$b))[1:length(df$a)];
df;
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
This is based on the expanded recursion relation, where you can see that e.g. a[4] = a[1] - (b[1] + b[2] + b[3]), and so on.
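As a quick check of that expansion, the sketch below computes the column both ways on the question's data and compares the results. The variable names `a_loop` and `a_closed` are introduced here for illustration.

```r
df <- data.frame(a = c(10, 15, 20, 25, 30), b = c(2, 4, 5, 7, 10))

# Recursion: a[i + 1] = a[i] - b[i]
a_loop <- df$a
for (i in seq_len(nrow(df) - 1)) {
  a_loop[i + 1] <- a_loop[i] - df$b[i]
}

# Expanded form: a[k] = a[1] - (b[1] + ... + b[k - 1])
a_closed <- df$a[1] - cumsum(c(0, head(df$b, -1)))

print(a_loop)
print(a_closed)
```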
We can also do this with accumulate from purrr
library(purrr)
library(dplyr)
df %>%
mutate(a = accumulate(b[-n()], `-`, .init = a[1]))
# a b
#1 10 2
#2 8 4
#3 4 5
#4 -1 7
#5 -8 10
Here is a faster version if you want to maintain the use of the function funR.
df<- data.frame(a=c(10,15,20,25,30), b=c(2,4,5,7,10))
funR<- function(argA, argB){
n = length(argA)
argC = c(argA[1], argB)
accumdiff <- function(x){
Reduce(function(x1,x2) x1-x2, x, accumulate=TRUE)}
argC = c(argA[1],accumdiff(argC)[c(-1)])
rev(rev(argC)[-1])
}
df$a <- funR(df$a, df$b)
df
# a b
# 1 10 2
# 2 8 4
# 3 4 5
# 4 -1 7
# 5 -8 10

R - Comparing values in a column and creating a new column with the results of this comparison. Is there a better way than looping?

I'm a beginner in R. Although I have read a lot in manuals and here on this board, I have to ask my first question. It's a little bit the same as here but not really the same, and I don't understand the explanation there. I have a dataframe with hundreds of thousands of rows and 30 columns. But for my question I created a simpler dataframe that you can use:
a <- sample(c(1,3,5,9), 20, replace = TRUE)
b <- sample(c(1,NA), 20, replace = TRUE)
df <- data.frame(a,b)
Now I want to compare the values of the last column (here column b), so that I'm looking iteratively at the value of each row to see whether it is the same as the value in the next row. If it is the same I want to write a 0 as the value in a new column in the same row, otherwise it should be a 1 as the value of the new column.
Here you can see my code, that's not working, because the rows of the new column only contain 0:
m<-c()
for (i in seq(along=df[,1])){
ifelse(df$b[i] == df$b[i+1],m <- 0, m <- 1)
df$mov <- m
}
The result, what I want to get, looks like the example below. What's the mistake? And is there a better way than creating loops? Maybe looping could be very slow for my big dataset.
a b mov
1 9 NA 0
2 1 NA 1
3 1 1 1
4 5 NA 0
5 1 NA 0
6 3 NA 0
7 3 NA 1
8 5 1 0
9 1 1 0
10 3 1 0
11 1 1 0
12 9 1 0
13 1 1 1
14 5 NA 0
15 9 NA 0
16 9 NA 0
17 9 NA 0
18 5 NA 0
19 3 NA 0
20 1 NA 0
Thank you for your help!
There are a couple things to consider in your example.
First, to avoid a loop, you can create a copy of the vector that is shifted by one position. (There are about 20 ways to do this.) Then when you test vector B vs C it will do element-by-element comparison of each position vs its neighbor.
Second, equality comparisons don't work with NA -- they always return NA. So NA == NA is not TRUE it is NA! Again, there are about 20 ways to get around this, but here I have just replaced all the NAs in the temporary vector with a placeholder that will work for the tests of equality.
Finally, you have to decide what you want to do with the last value (which doesn't have a neighbor). Here I have put 1, which is your assignment for "doesn't match its neighbor".
So, depending on the range of values possible in b, you could do
c = df$b
z = length(c)
c[is.na(c)] = 'x' # replace NA with value that will allow equality test
df$mov = c(1 * !(c[1:(z-1)] == c[2:z]), 1) # compare each value with its successor; add 1 to the end for the last value
You could do something like this to mark the ones which match
df$bnext <- c(tail(df$b,-1),NA)
df$bnextsame <- ifelse(df$bnext == df$b | (is.na(df$b) & is.na(df$bnext)),0,1)
There are plenty of NAs here because there are plenty of NAs in your column b as well and any comparison with NA returns an NA and not a TRUE/FALSE. You could add a df[is.na(df$bnextsame),"bnextsame"] <- 0 to fix that.
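A sketch of an NA-aware comparison that needs neither a placeholder value nor a follow-up fix is shown below. It treats two NAs as equal and an NA next to a number as different. The sample vector `b` is hypothetical, and the trailing 1 follows the "last value has no neighbour" convention used above.

```r
# Hypothetical sample column
b <- c(NA, NA, 1, NA, 1, 1)

cur <- b[-length(b)]  # every value except the last
nxt <- b[-1]          # every value except the first

# TRUE when both are NA, or when both are present and equal; never NA
same <- (is.na(cur) & is.na(nxt)) |
  (!is.na(cur) & !is.na(nxt) & cur == nxt)

# 0 = same as the next row, 1 = different; the last row gets 1
mov <- c(as.integer(!same), 1L)
print(mov)
```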
You can use a "rolling equality test" with zoo 's rollapply. Also, identical is preferred to ==.
#identical(NA, NA)
#[1] TRUE
#NA == NA
#[1] NA
library(zoo)
df$mov <- c(rollapply(df$b, width = 2,
FUN = function(x) as.numeric(!identical(x[1], x[2]))), "no_comparison")
#`!` because you want `0` as `TRUE` ;
#I added a "no_comparison" to last value as it is not compared with any one
df
# a b mov
#1 5 1 0
#2 1 1 0
#3 9 1 1
#4 5 NA 1
#5 9 1 1
#.....
#19 1 NA 0
#20 1 NA no_comparison

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID. For the example, it returns a vector of five times 2 and four times 9. Since all values except the first in each subset should be NA, the function "is.na<-" sets to NA all values at the logical index defined by c(FALSE, !diff(ID)). This index is TRUE wherever an ID is identical to the preceding one, which assumes the rows are sorted by ID.
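The same steps can be written out one at a time, which may be easier to follow than the nested call. This is a sketch on the question's data, assuming the rows are sorted by ID; the intermediate names `group_min` and `same_as_prev` are introduced here.

```r
A <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  Value = c(2, 2, 3, 4, 10, 9, 9, 12, 13)
)

# Minimum of Value within each ID, repeated on every row of that ID
group_min <- ave(A$Value, A$ID, FUN = min)

# TRUE when a row has the same ID as the row before it
same_as_prev <- c(FALSE, diff(A$ID) == 0)

# Keep the minimum only on the first row of each ID
A$Start <- ifelse(same_as_prev, NA, group_min)
print(A)
```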
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
nas <- rep(NA, length(x))
nas[which.min(x)] <- min(x, na.rm=TRUE)
nas
}
This function makes a vector of NAs and puts the minimum of Value at the position of its first occurrence within each ID group (the first row here, because the values are sorted).
transform(dat, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
You can achieve this with a tapply one-liner
df$Start<-as.vector(unlist(tapply(df$Value,df$ID,FUN = function(x){ return (c(min(x),rep("",length(x)-1)))})))
I keep going back to this question and the above answers helped me greatly.
There is a basic solution for beginners too:
A$Start<-NA
A[!duplicated(A$ID),]$Start<-A[!duplicated(A$ID),]$Value
Thanks.
