If() statement in R - r

I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but in the same time I have to add +1 only to max values in df$fz till that moment when when sum(df$fz) is lower than 450
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)

From what I understood from #Oksana's question and comments, we probably can do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind thous subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89

The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134

Related

How to merge a single measurement into a dataframe of multiple measurements in R

I have a long dataframe of multiple measurements per ID, at different time points for variables BP1 and BP2.
ID <- c(1,1,1,2,2,2,3,3,4)
Time <- c(56,57,58,61,62,64,66,67,72)
BP1 <- c(70,73,73,74,75,76,74,74,70)
BP2 <- c(122,122,123,126,124,121,130,132,140)
df1 <- data.frame(ID, Time, BP1, BP2)
I would like to merge another dataframe (df2), which contains a single measurement for BP1 and BP2 per ID.
ID <- c(1,2,3,4)
Time <- c(55, 60, 65, 70)
BP1 <- c(70, 72, 73, 74)
BP2 <- c(120, 124, 130, 134)
df2 <- data.frame(ID, Time, BP1, BP2)
How do I combine these dataframes so that the Time variable is in order, and the dataframe looks like this:
Any help greatly appreciated, thank you!
In base R, use rbind() to combine and order() to sort, then clean up the rownames:
df3 <- rbind(df1, df2)
df3 <- df3[order(df3$ID, df3$Time), ]
rownames(df3) <- seq(nrow(df3))
df3
Or, using dplyr:
library(dplyr)
bind_rows(df1, df2) %>%
arrange(ID, Time)
Result from either approach:
ID Time BP1 BP2
1 1 55 70 120
2 1 56 70 122
3 1 57 73 122
4 1 58 73 123
5 2 60 72 124
6 2 61 74 126
7 2 62 75 124
8 2 64 76 121
9 3 65 73 130
10 3 66 74 130
11 3 67 74 132
12 4 70 74 134
13 4 72 70 140

Average over rows pairs and paste the value based on condition

In R, I have a df such as:
a b c
1 124 70 aa
2 129 67 aa
3 139 71 aa
4 125 77 aa
5 125 82 aa
6 121 69 aa
7 135 68 bb
8 137 72 bb
9 137 78 bb
10 140 86 bb
I want to iterate along rows within columns (a, b), computing the mean of all rows pairs, and paste this mean to the same two rows of new columns (a_new, b_new) if the difference between these two rows is >=12. Otherwise just copy the old value. This behaviour should be restricted to groups as marked by another column (c), i.e it should not happen if two rows are from different groups.
In this example, it happens in row 3 (cos in column a, difference with next (4th) row is 14) and in row 5 (cos in column b, difference with next row is 13). However, this should not happen with row 6 cos row 7 is in another c group.
Thus, resulting df would look like:
a b c a_new b_new
1 124 70 aa 124 70
2 129 67 aa 129 67
3 139 71 aa 132 71
4 125 77 aa 132 68
5 125 82 aa 125 75.5
6 121 69 aa 121 75.5
7 135 68 bb 135 68
8 137 72 bb 137 72
9 137 78 bb 137 78
10 140 86 bb 140 86
I've been struggling to do this for a while, figured out that perhaps lag function could be used, but no success. Help would be much appreciated (be it base R, or dplyr, or whatever)
Dput:
structure(list(a = c(124, 129, 139, 125, 125, 121, 135, 137,
137, 140), b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86), c = c("aa",
"aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
We can write a function which works for one chunk.
apply_fun <- function(x) {
inds <- which(abs(diff(x)) >= 12)
if(length(inds))
x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
rep(mean(x[c(i, i + 1)]), 2)))
return(x)
}
and then apply it for multiple columns by group.
library(dplyr)
df %>% group_by(c) %>% mutate_at(vars(a, b), list(new = apply_fun))
# a b c a_new b_new
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 124 70 aa 124 70
# 2 129 67 aa 129 67
# 3 139 71 aa 132 71
# 4 125 77 aa 132 77
# 5 125 82 aa 125 75.5
# 6 121 69 aa 121 75.5
# 7 135 68 bb 135 68
# 8 137 72 bb 137 72
# 9 137 78 bb 137 78
#10 140 86 bb 140 86
What I understood is to apply to each group given by the indicator column "c" the procedure commented in the code below:
pairAverage <- function(x) {
# x should be a numeric vector of length > 1
if (is.vector(x) & is.numeric(x) & length(x) > 1) {
# copy data to an aux vector
aux <- x
# get differences of lag 1
dh<-diff(x, 1)
# get means of consecutive pairs
med <- c(x$a[2:length(x)] - dh/2)
# get positions (index) of abs(means) >= 12
idx <- match(med[abs(dh) >= 12], med)
# need 2 reps of each mean to replace consecutive values of x
valToRepl <- med[sort(rep(idx,2))]
# ordered indexes pairs of consecutive elements of x to be replaced
idxToRepl <- sort(c(idx,idx+1))
# replace pairs of values
aux[idxToRepl] <- valToRepl
return(aux)
} else {
# do nothing
warning("paramater x should be a numeric vector of length > 1")
return(NULL)
}
}
pairAverageByGroups <- function(x, gr) {
if (is.vector(x) & is.numeric(x) & length(x) == length(gr)) {
x.ls <- split(x, as.factor(gr))
output <- unlist(lapply(x.ls, pairAverage))
names(output) <- NULL
output
} else {
# do nothing
warning("paremater x should be a numeric vector of length > 1")
return(NULL)
}
}
pairAverageByGroups(dd$a, dd$c)
[1] 124 129 132 132 125 121 135 137 137 140

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). the problem I have is as follows. My data looks like this (it is held in CSV):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData2, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completion I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)

R - Sum range over lookback period, divided sum of look back - excel to R

I am looking to workout a percentage total over a look back range in R.
I know how to do this in excel with the following formula:
=SUM(B2:B4)/SUM(B2:B4,C2:C4)
This is summing column B over a range of today looking back 3 lines. It then divides this sum buy the total sum of column B + C again looking back 3 lines.
I am looking to achieve the same calculation in R to run across my matrix.
The output would look something like this:
adv dec perct
1 69 376
2 113 293
3 270 150 0.355625492
4 74 371 0.359559402
5 308 96 0.513790386
6 236 173 0.491255962
7 252 134 0.663886572
8 287 129 0.639966969
9 219 187 0.627483444
This is a line of code I could perhaps add the look back range too:
perct <- apply(data.matrix[,c('adv','dec')], 1, function(x) { (x[1] / x[1] + x[2]) } )
If i could get [1] to sum the previous 3 line range and
If i could get [2] to also sum the previous 3 line range.
Still learning how to apply forward and look back periods within R. So any additional learning on the answer would be appreciated!
Here are some approaches. The first 3 use rollsumr and/or rollapplyr in zoo and the last one uses only the base of R.
1) rollsumr Create a matrix with rollsumr whose columns contain the rollling sums, convert that to row proportions and take the "adv" column. Finally assign that to a new column frac in DF. This approach has the shortest code.
library(zoo)
DF$frac <- prop.table(rollsumr(DF, 3, fill = NA), 1)[, "adv"]
giving:
> DF
adv dec frac
1 69 376 NA
2 113 293 NA
3 270 150 0.3556255
4 74 371 0.3595594
5 308 96 0.5137904
6 236 173 0.4912560
7 252 134 0.6638866
8 287 129 0.6399670
9 219 187 0.6274834
1a) This variation is similar except instead of using prop.table we write out the ratio. The code is longer but you may find it clearer.
m <- rollsumr(DF, 3, fill = NA)
DF$frac <- with(as.data.frame(m), adv / (adv + dec))
1b) This is a variation of (1) that is the same except it uses a magrittr pipeline:
library(magrittr)
DF %>% rollsumr(3, fill = NA) %>% prop.table(1) %>% `[`(TRUE, "adv") -> DF$frac
2) rollapplyr We could use rollapplyr with by.column = FALSE like this. The result is the same.
ratio <- function(x) sum(x[, "adv"]) / sum(x)
DF$frac <- rollapplyr(DF, 3, ratio, by.column = FALSE, fill = NA)
3) Yet another variation is to compute the numerator and denominator separately:
DF$frac <- rollsumr(DF$adv, 3, fill = NA) /
rollapplyr(DF, 3, sum, by.column = FALSE, fill = NA)
4) base This uses embed followed by rowSums on each column to get the rolling sums and then uses prop.table as in (1).
DF$frac <- prop.table(sapply(lapply(rbind(NA, NA, DF), embed, 3), rowSums), 1)[, "adv"]
Note: The input used in reproducible form is:
Lines <- "adv dec
1 69 376
2 113 293
3 270 150
4 74 371
5 308 96
6 236 173
7 252 134
8 287 129
9 219 187"
DF <- read.table(text = Lines, header = TRUE)
Consider an sapply that loops through the number of rows in order to index two rows back:
DF$pred <- sapply(seq(nrow(DF)), function(i)
ifelse(i>=3, sum(DF$adv[(i-2):i])/(sum(DF$adv[(i-2):i]) + sum(DF$dec[(i-2):i])), NA))
DF
# adv dec pred
# 1 69 376 NA
# 2 113 293 NA
# 3 270 150 0.3556255
# 4 74 371 0.3595594
# 5 308 96 0.5137904
# 6 236 173 0.4912560
# 7 252 134 0.6638866
# 8 287 129 0.6399670
# 9 219 187 0.6274834

Custom sorting of a dataframe in R

I have a binomail dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently soreted in a way to show the instances belonging to 0 group then the ones belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? I mean to show a row that belongs to the 0 group, the row after belonging to the 1 group then the zero group and so on...
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there's unbalanced numbers of 0's and 1's, it will leave them at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)

Resources