Nested For Loop in R to iterate between columns

I'm trying to generate a dataset in which the first column changes every three rows, and whenever the first column changes, the second and third columns change with it. An example is given below. I'm a little confused as to how I can achieve this with a nested for loop.
df = NULL
for (CDRID in 1:3) {
  for (STARTDATE in 20200517:20200519) {
    for (PRIIDENTITY in 4:6) {
      df1 = rbind(df, data.frame("CDR_ID" = CDRID, "START_DATE" = STARTDATE, "PRI_IDENTITY" = PRIIDENTITY))
    }
  }
}

df <- data.frame(ID=1:3, START_DATE=20200517:20200519, PRI_IDENTITY=4:6)
df[rep(seq_len(nrow(df)), each=3),]
# ID START_DATE PRI_IDENTITY
# 1 1 20200517 4
# 1.1 1 20200517 4
# 1.2 1 20200517 4
# 2 2 20200518 5
# 2.1 2 20200518 5
# 2.2 2 20200518 5
# 3 3 20200519 6
# 3.1 3 20200519 6
# 3.2 3 20200519 6
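If the 1.1, 1.2 style row names are unwanted, they can be reset after the expansion. This is only a small sketch of the same rep() indexing; n_rep and expanded are names made up for illustration.

```r
# Same data as above
df <- data.frame(ID = 1:3, START_DATE = 20200517:20200519, PRI_IDENTITY = 4:6)

n_rep <- 3  # hypothetical name: how many times each row is repeated

# Repeat every row index n_rep times, then index the data frame with it
expanded <- df[rep(seq_len(nrow(df)), each = n_rep), ]

# Drop the 1.1, 1.2, ... row names and fall back to 1..9
rownames(expanded) <- NULL
expanded
```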
A thought: I'm assuming that 20200517:20200519 is intended to be a range of dates. As an integer sequence it is fragile, because it knows nothing about days wrapping between months. In R, it is better to use proper Date objects. With that, try:
df <- data.frame(ID=1:3, START_DATE=seq.Date(as.Date("2020-05-17"), as.Date("2020-05-19"), by="days"), PRI_IDENTITY=4:6)
df[rep(seq_len(nrow(df)), each=3),]
# ID START_DATE PRI_IDENTITY
# 1 1 2020-05-17 4
# 1.1 1 2020-05-17 4
# 1.2 1 2020-05-17 4
# 2 2 2020-05-18 5
# 2.1 2 2020-05-18 5
# 2.2 2 2020-05-18 5
# 3 3 2020-05-19 6
# 3.1 3 2020-05-19 6
# 3.2 3 2020-05-19 6
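To illustrate the month-wrapping concern, here is a small sketch with a range that crosses from May into June. The end points are made up for the example and are not from the question.

```r
# Integer "dates": the colon operator just counts, so it walks through
# values such as 20200532 that are not real calendar days
int_dates <- 20200530:20200602
length(int_dates)

# Date objects: seq() steps by calendar day and rolls over the month end
real_dates <- seq(as.Date("2020-05-30"), as.Date("2020-06-02"), by = "day")
length(real_dates)
real_dates
```

The integer version produces 73 values for what should be four days, while the Date version produces exactly the four calendar days.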
As akrun suggested, if you are using (or amenable to using) packages from the tidyverse, then instead of the rep(seq_len(... part, you can do
# df <- data.frame(...)
tidyr::uncount(df, 3)

Related

How to divide all previous observations by the last observation iteratively within a data frame column by group in R and then store the result

I have the following data frame:
data <- data.frame("Group" = c(1,1,1,1,1,1,1,1,2,2,2,2),
"Days" = c(1,2,3,4,5,6,7,8,1,2,3,4), "Num" = c(10,12,23,30,34,40,50,60,2,4,8,12))
I need to take the last value in Num and divide it by all of the preceding values. Then, I need to move to the second to the last value in Num and do the same, until I reach the first value in each group.
Edited based on the comments below:
In plain language and showing all the math, starting with the first group as suggested below, I am trying to achieve the following:
Take 60 (last value in group 1) and:
Day Num Res
7 60/50 1.2
6 60/40 1.5
5 60/34 1.76
4 60/30 2
3 60/23 2.61
2 60/12 5
1 60/10 6
Then keep only the row that has the value 2, as I don't care about the others (I want the value that is greater or equal to 2 that is the closest to 2) and return the day of that value, which is 4, as well. Then, move on to 50 and do the following:
Day Num Res
6 50/40 1.25
5 50/34 1.47
4 50/30 1.67
3 50/23 2.17
2 50/12 4.17
1 50/10 5
Then keep only the row that has the value 2.17 and return the day of that value, which is 3, as well. Then, move on to 40 and do the same thing over again, move on to 34, then 30, then 23, then 12, the last value (or Day 1 value) I don't care about. Then move on to the next group's last value (12) and repeat the same approach for that group (12/8, 12/4, 12/2; 8/4, 8/2; 4/2)
I would like to store the results of these divisions but only the most recent result that is greater than or equal to 2. I would also like to return the day that result was achieved. Basically, I am trying to calculate doubling time for each day. I would also need this to be grouped by the Group. Normally, I would use dplyr for this but I am not sure how to link up a loop with dplyr to take advantage of group_by. Also, I could be overlooking lapply or some variation thereof. My expected dataframe with the results would ideally be this:
data2 <- data.frame(divres = c(NA,NA,2.3,2.5,2.833333333,3.333333333,2.173913043,2,NA,2,2,3),
obs_n =c(NA,NA,1,2,2,2,3,4,NA,1,2,2))
data3 <- bind_cols(data, data2)
I have tried this first loop to calculate the division but I am lost as to how to move on to the next last value within each group. Right now, this is ignoring the group, though I obviously have not told it to group as I am unclear as to how to do this outside of dplyr.
for(i in 1:nrow(data))
data$test[i] <- ifelse(!is.na(data$Num), last(data$Num)/data$Num[i] , NA)
I also get the following error when I run it:
number of items to replace is not a multiple of replacement length
To store the division, I have tried this:
division <- function(x){
if(x>=2){
return(x)
} else {
return(FALSE)
}
}
for (i in 1:nrow(data)){
data$test[i]<- division(data$test[i])
}
Now, this approach works, but only if I need to run it once on the last observation and only if I apply it to one group. I have 209 groups and many days that I would need to run this over. I am not sure how to put together the first for loop with the division function, and I am also totally lost as to how to do this by group and move on to the next last values. Any suggestions would be appreciated.
You can modify your division function to handle a vector and return a data frame with two columns, divres and ind. The latter is the row index that will be used to calculate obs_n, as shown below:
division <- function(x){
lenx <- length(x)
y <- vector(mode="numeric", length = lenx)
z <- vector(mode="numeric", length = lenx)
for (i in lenx:1){
y[i] <- ifelse(length(which(x[i]/x[1:i]>=2))==0,NA,x[i]/x[1:i] [max(which(x[i]/x[1:i]>=2))])
z[i] <- ifelse(is.na(y[i]),NA,max(which(x[i]/x[1:i]>=2)))
}
df <- data.frame(divres = y, ind = z)
return(df)
}
Check the output of the division function created above, using data$Num as input:
> division(data$Num)
divres ind
1 NA NA
2 NA NA
3 2.300000 1
4 2.500000 2
5 2.833333 2
6 3.333333 2
7 2.173913 3
8 2.000000 4
9 NA NA
10 2.000000 9
11 2.000000 10
12 3.000000 10
Use cbind to combine the above output with the data frame data, then use pipes and mutate from dplyr to look up the obs_n value in Days using ind, and select the appropriate columns to generate the desired data frame data2:
library(dplyr)
data2 <- cbind.data.frame(data, division(data$Num)) %>% mutate(obs_n = Days[ind]) %>% select(-ind)
Output
> data2
Group Days Num divres obs_n
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 2.300000 1
4 1 4 30 2.500000 2
5 1 5 34 2.833333 2
6 1 6 40 3.333333 2
7 1 7 50 2.173913 3
8 1 8 60 2.000000 4
9 2 1 2 NA NA
10 2 2 4 2.000000 1
11 2 3 8 2.000000 2
12 2 4 12 3.000000 2
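As far as I can tell, division(data$Num) above runs over the whole column, not per group. It matches the expected result here because the second group's values are too small to be double anything in the first group, but that would not hold in general. A base R sketch of a group-wise variant follows. last_doubling is a hypothetical rewrite of the same idea, and the split/lapply wrapper is my own assumption about how to apply it per Group.

```r
data <- data.frame(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
                   Days  = c(1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4),
                   Num   = c(10, 12, 23, 30, 34, 40, 50, 60, 2, 4, 8, 12))

# For each position i, find the latest position j <= i with x[i] / x[j] >= 2
last_doubling <- function(x) {
  n <- length(x)
  divres <- rep(NA_real_, n)
  ind <- rep(NA_integer_, n)
  for (i in seq_len(n)) {
    hits <- which(x[i] / x[seq_len(i)] >= 2)
    if (length(hits) > 0) {
      ind[i] <- max(hits)
      divres[i] <- x[i] / x[ind[i]]
    }
  }
  data.frame(divres = divres, ind = ind)
}

# Apply the helper separately within each group, then stack the pieces
pieces <- lapply(split(data, data$Group), function(g) {
  res <- last_doubling(g$Num)
  cbind(g, divres = res$divres, obs_n = g$Days[res$ind])
})
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because the index is looked up inside each group's own Days vector, values from other groups can never be picked as the doubling point.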
You can create a function with a for loop to get the desired day as given below. Then use that to get the divres in a dplyr mutation.
obs_n <- function(x, days) {
lst <- list()
for(i in length(x):1){
obs <- days[which(rev(x[i]/x[(i-1):1]) >= 2)]
if(length(obs)==0)
lst[[i]] <- NA
else
lst[[i]] <- max(obs)
}
unlist(lst)
}
Then use dense_rank to obtain the row number corresponding to each obs_n. This is needed in case the days are not consecutive, i.e. have gaps.
library(dplyr)
data %>%
group_by(Group) %>%
mutate(obs_n=obs_n(Num, Days), divres=Num/Num[dense_rank(obs_n)])
# A tibble: 12 x 5
# Groups: Group [2]
Group Days Num obs_n divres
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 NA NA
2 1 2 12 NA NA
3 1 3 23 1 2.3
4 1 4 30 2 2.5
5 1 5 34 2 2.83
6 1 6 40 2 3.33
7 1 7 50 3 2.17
8 1 8 60 4 2
9 2 1 2 NA NA
10 2 2 4 1 2
11 2 3 8 2 2
12 2 4 12 2 3
Explanation of dense ranks (from Wikipedia).
In dense ranking, items that compare equally receive the same ranking number, and the next item(s) receive the immediately following ranking number.
x <- c(NA, NA, 1,2,2,4,6)
dplyr::dense_rank(x)
# [1] NA NA 1 2 2 3 4
Compare with rank (default method="average"). Note that NAs are included at the end by default.
rank(x)
[1] 6.0 7.0 1.0 2.5 2.5 4.0 5.0
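For readers without dplyr, dense ranks can also be sketched in base R by matching each value against the sorted unique values. This is an illustration of the definition, not how dplyr implements it.

```r
x <- c(NA, NA, 1, 2, 2, 4, 6)

# sort() drops NAs by default, so the lookup table is 1, 2, 4, 6;
# match() then returns each value's position in that table (NA stays NA)
dense <- match(x, sort(unique(x)))
dense
```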

Placing multiple outputs from each function call using apply into a row in a dataframe in R

I have a function that I repeat, changing the argument each time, using apply/sapply/lapply.
Works great.
I want to return a data set, where each row contains two (or more) variables from each iteration of the function.
Instead I get an unusable list.
do <-function(x){
a <- x+1
b <- x+2
cbind(a,b)
}
over <- 1:6
final <- lapply(over, do)
Any suggestions?
Without changing your function do, you can use sapply and transpose it.
data.frame(t(sapply(over, do)))
# X1 X2
#1 2 3
#2 3 4
#3 4 5
#4 5 6
#5 6 7
#6 7 8
If you want to use do in current form with lapply, we can do
do.call(rbind.data.frame, lapply(over, do))
You could also try
as.data.frame(Reduce(rbind, final))
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8
See ?Reduce and ?rbind for information about what they'll do.
You could also modify your final expression as
final <- as.data.frame(Reduce(rbind, lapply(over, do)))
#final
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8
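If changing the function is an option, another sketch is to return a named vector and let vapply check the shape of every result. plus_one_two is a hypothetical replacement for do, not part of the original question.

```r
# Hypothetical variant of the function: returns a named numeric vector
plus_one_two <- function(x) c(a = x + 1, b = x + 2)

over <- 1:6

# vapply insists that every call returns a numeric vector of length 2;
# the result is a 2 x 6 matrix with rows "a" and "b"
m <- vapply(over, plus_one_two, FUN.VALUE = numeric(2))

# Transpose so each call becomes one row, then convert to a data frame
final <- as.data.frame(t(m))
final
```

Compared with sapply, vapply fails loudly if one call returns something of a different length or type, which can be useful when the function is more complicated than this toy example.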

alternative for loop in data frame

I am facing a problem with the amount of time needed to run my code. Basically, I have several columns and a key value in the last column (which I compute as the row mean in the reproducible example). I want each value to become 1 when it is below the key value and 2 when it is above.
Is there an easier way to do this?
a <- c(1,3,5,6,4)
b <- c(10,4,24,5,3)
df <- data.frame (a,b)
df$mean <- rowMeans (df)
for (i in 1:5){
df[i,1:2] [df[i,1:2]<df$mean[i]] <- 1
df[i,1:2] [df[i,1:2]>df$mean[i]] <- 2
}
Thank you in advance
You can simply do,
df[1:2] <- (df[1:2] > df$mean) + 1 #removed as.integer as per @akrun's comment
Which gives,
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
Prefer vectorized operations over explicit loops in R whenever possible!
Alternative Solution using mutate_each from dplyr
df %>% mutate_each(funs(ifelse(mean>.,1,2)), 1:2)
Also gives
a b mean
1 1 2 5.5
2 1 2 3.5
3 1 2 14.5
4 2 1 5.5
5 2 1 3.5
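Since mutate_each() and funs() are deprecated in current dplyr, here is a column-wise sketch in base R that does the same comparison. value_cols is a name introduced for illustration.

```r
a <- c(1, 3, 5, 6, 4)
b <- c(10, 4, 24, 5, 3)
df <- data.frame(a, b)
df$mean <- rowMeans(df)

value_cols <- c("a", "b")  # hypothetical name for the columns to recode

# Compare each column with the row-wise key value in one vectorized step:
# 2 when the value is above the mean, otherwise 1
df[value_cols] <- lapply(df[value_cols], function(col) ifelse(col > df$mean, 2, 1))
df
```

As with the one-liner above, values exactly equal to the key value end up as 1.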

R sum consecutive duplicate rows and remove all but first

I am stuck with a probably simple question: how to sum consecutive duplicate rows and remove all but the first row. And, if there is an NA in between two duplicates (such as 2, NA, 2), also sum them and remove all but the first entry.
So far so good, here is my sample data
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
sample output
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
4 NA 1.5
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
Now I want to
1.) sum the "time" column of consecutive ia's - i.e., sum the time if the number 1 occurs twice or more right after each other, in my case here sum first and second row of column time to 4.5+2.4.
2.) if there is a NA in between two numbers (ia column) which are the same (i.e., ia = 2, NA, 2), then also sum all of those times.
3.) keep only the first occurrence of the ia, and delete the rest.
In the end, I would want to have something like this:
a
ia time
1 1 6.9
3 2 6.3
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
I found this for summing, but it does not take into account the consecutive factor
aggregate(time~ia,data=a,FUN=sum)
and I found this for deleting
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs.
If I have a pattern of 1-NA-2, then the NA should NOT be counted with either of them; in this case the NA row should be removed.
With data.table (as RHertel suggested for na.locf):
library(data.table)
library(zoo)
setDT(a)[na.locf(ia, fromLast=T)==na.locf(ia), sum(time), cumsum(c(T,!!diff(na.locf(ia))))]
# id V1
#1: 1 6.9
#2: 2 6.3
#3: 3 20.4
#4: 4 7.3
#5: 5 2.3
#6: 6 4.3
You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia))
#> ia time
#> 1 1 4.5
#> 2 1 2.4
#> 3 2 3.6
#> 4 2 1.5
#> 5 2 1.2
#> 6 1 4.9
#> 7 1 6.4
#> 8 1 4.4
#> 9 1 4.7
#> 10 2 7.3
#> 11 1 2.3
#> 12 2 4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia)) %>%
mutate(change = ia != lag(ia, default = FALSE)) %>%
group_by(group = cumsum(change), ia) %>%
summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#> group ia time
#> (int) (dbl) (dbl)
#> 1 1 1 6.9
#> 2 2 2 6.3
#> 3 3 1 20.4
#> 4 4 2 7.3
#> 5 5 1 2.3
#> 6 6 2 4.3
If you want to get rid of the group column, use the additional lines:
result %>%
ungroup() %>%
select(-group)
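# df here is the sample data frame (called a in the question)
nas <- which(is.na(df$ia))
add.index <- sapply(nas, function(x) {
  logi <- which(as.logical(df$ia))   # positions of the non-NA values
  aft <- logi[logi > x][1]           # first non-NA position after the NA
  fore <- tail(logi[logi < x], 1)    # last non-NA position before the NA
  if (df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df),]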
First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.
Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.
df$grps <- cumsum(c(F, !df$ia[-length(df$ia)] == df$ia[-1]))+1
aggregate(time ~ grps, df, sum)
# grps time
# 1 1 6.9
# 2 2 6.3
# 3 3 20.4
# 4 4 7.3
# 5 5 2.3
# 6 6 4.3
This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.
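The question also asked to keep the first row of each run, with its original row name. A base R sketch of that, assuming the same sample data, could look like this. fill_forward is a hypothetical helper standing in for zoo's na.locf, and the other names are made up as well.

```r
ia <- c(1, 1, 2, NA, 2, 1, 1, 1, 1, 2, 1, 2)
time <- c(4.5, 2.4, 3.6, 1.5, 1.2, 4.9, 6.4, 4.4, 4.7, 7.3, 2.3, 4.3)
a <- data.frame(ia, time)

# Hypothetical helper: carry the last non-NA value forward
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))
  idx[idx == 0] <- NA          # leading NAs have nothing to carry forward
  x[!is.na(x)][idx]
}

fwd <- fill_forward(a$ia)
bwd <- rev(fill_forward(rev(a$ia)))

# Keep a row only if filling from both sides agrees (drops 1-NA-2 cases)
keep <- !is.na(fwd) & !is.na(bwd) & fwd == bwd
b <- a[keep, ]
b$ia <- fwd[keep]

# One run per block of consecutive equal values
runs <- rle(b$ia)
first_row <- cumsum(c(1, head(runs$lengths, -1)))       # first row of each run
run_id <- rep(seq_along(runs$lengths), runs$lengths)    # run number per row

result <- data.frame(
  ia = runs$values,
  time = as.numeric(tapply(b$time, run_id, sum)),
  row.names = rownames(b)[first_row]
)
result
```

The row names of result are those of the first row in each run, which is what the desired output in the question shows.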

get z standardized score within each group

Here is the data.
set.seed(23)
data <- data.frame(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
ID group value
1 1 1 0.4133934
2 2 2 0.6444651
3 3 3 0.1350871
4 4 1 0.5924411
5 5 2 0.3439465
6 6 3 0.3673059
7 7 1 0.3202062
8 8 2 0.8883733
9 9 3 0.7506174
10 10 1 0.3301955
11 11 2 0.7365258
12 12 3 0.1502212
I want to get z-standardized scores within each group, so I tried:
library(weights)
data_split<-split(data, data$group) #split the dataframe
stan<-lapply(data_split, function(x) stdz(x$value)) #compute z-scores within group
However, this looks wrong, because I want to add the result as a new variable following 'value'.
How can I do that? Kindly provide some suggestions (sample code). Any help is greatly appreciated.
Use this instead:
within(data, stan <- ave(value, group, FUN=stdz))
No need to call split nor lapply.
One way using data.table package:
library(data.table)
library(weights)
set.seed(23)
data <- data.table(ID=rep(1:12), group=rep(1:3,times=4), value=(rnorm(12,mean=0.5, sd=0.3)))
setkey(data, ID)
dataNew <- data[, list(ID, stan = stdz(value)), by = 'group']
the result is:
group ID stan
1: 1 1 -0.6159312
2: 1 4 0.9538398
3: 1 7 -1.0782747
4: 1 10 0.7403661
5: 2 2 -1.2683237
6: 2 5 0.7839781
7: 2 8 0.8163844
8: 2 11 -0.3320388
9: 3 3 0.6698418
10: 3 6 0.8674548
11: 3 9 -0.2131335
12: 3 12 -1.3241632
I tried Ferdinand.Kraft's solution but it didn't work for me. I think the stdz function isn't included in the basic R install. Moreover, the within part troubled me in a large dataset with many variables. I think the easiest way is:
data$value.s <- ave(data$value, data$group, FUN=scale)
Add the new column while in your function, and have the function return the whole data frame.
stanL<-lapply(data_split, function(x) {
x$stan <- stdz(x$value)
x
})
stan <- do.call(rbind, stanL)
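If neither the weights package nor scale is wanted, the z-score can be written out by hand and passed to ave. This is a sketch only: z_score is a hypothetical helper, and I am assuming the plain (x - mean) / sd formula is what is meant by z-standardizing here.

```r
set.seed(23)
data <- data.frame(ID = 1:12,
                   group = rep(1:3, times = 4),
                   value = rnorm(12, mean = 0.5, sd = 0.3))

# Hypothetical helper: plain z-score using the sample standard deviation
z_score <- function(v) (v - mean(v)) / sd(v)

# ave() applies the helper within each group and returns the values
# in the original row order, so they can be added as a new column
data$stan <- ave(data$value, data$group, FUN = z_score)
data
```

Within each group the new column should then have mean 0 and standard deviation 1, which is a quick way to check the result.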

Resources