R: Looping through rows per factor with a function - r

I need to calculate a New value based on Value2 and the previous, newly computed value of the same New column.
In addition, I want to do this per Group.
I previously wrote a for loop, but it stopped working when I added a second loop for the groups.
for(j in 1:(*EACHGROUP*){
for(i in 2:nrow(DAT) #Start at second place in each Group
DAT$New[i] <- ( DAT$New[i-1]*10^(DAT$Value2[i]) )}
Dummy data
Year <- c(1980,1990,2000,2005,1993,2008,1999,2003,2005)
Group <- c("A","A","A","A","B","B","C","C","C")
Value2 <- c(0,0.25,0.1,-0.3,.5,0.7,-0.8,0.01,0.2)
New <- c(1,1,1,1,1,1,1,1,1)
# build the data frame directly; cbind() would coerce every column to character
DAT <- data.frame(Year, Group, Value2, New)
Output:
Year Group Value2 New
1 1980 A 0 1
2 1990 A 0.25 1
3 2000 A 0.1 1
4 2005 A -0.3 1
5 1993 B 0.5 1
6 2008 B 0.7 1
7 1999 C -0.8 1
8 2003 C 0.01 1
9 2005 C 0.2 1
How can I continue with this approach?
Or should I use "dplyr" for example to do this more easily?
Desired result
Year Group Value2 New
1 1980 A 0 1
2 1990 A 0.25 1.78
3 2000 A 0.1 2.24
4 2005 A -0.3 1.12
5 1993 B 0.5 1
6 2008 B 0.7 5.01
7 1999 C -0.8 1
8 2003 C 0.01 1.02
9 2005 C 0.2 1.62
Best regards

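Here is a dplyr way, with an auxiliary function fun.
library(dplyr)

# Carry the previous value forward: x[i] = x[i - 1] * 10^y[i]
# (assumes New and Value2 are numeric columns)
fun <- function(x, y){
  for(i in seq_along(x)[-1]){
    x[i] <- x[i - 1] * 10^y[i]
  }
  x
}
DAT %>%
  group_by(Group) %>%
  mutate(New = fun(New, Value2))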
## A tibble: 9 x 4
## Groups: Group [3]
# Year Group Value2 New
# <dbl> <chr> <dbl> <dbl>
#1 1980 A 0 1
#2 1990 A 0.25 1.78
#3 2000 A 0.1 2.24
#4 2005 A -0.3 1.12
#5 1993 B 0.5 1
#6 2008 B 0.7 5.01
#7 1999 C -0.8 1
#8 2003 C 0.01 1.02
#9 2005 C 0.2 1.62
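Since every step only multiplies by a power of 10, the loop can also be collapsed into a cumulative sum of the exponents. A minimal base R sketch, assuming each group starts from its first New value and that Value2 is numeric (the helper names expo and start are only illustrative):

```r
DAT <- data.frame(
  Year   = c(1980, 1990, 2000, 2005, 1993, 2008, 1999, 2003, 2005),
  Group  = c("A", "A", "A", "A", "B", "B", "C", "C", "C"),
  Value2 = c(0, 0.25, 0.1, -0.3, 0.5, 0.7, -0.8, 0.01, 0.2),
  New    = 1
)

# exponent accumulated from the 2nd row of each group onwards
expo <- ave(DAT$Value2, DAT$Group, FUN = function(v) cumsum(v) - v[1])
# starting value of each group, repeated over the group's rows
start <- ave(DAT$New, DAT$Group, FUN = function(x) x[1])

DAT$New <- start * 10^expo
DAT
```

Rounded to two decimals this should reproduce the New column shown above, without an explicit loop over rows.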

Related

How to create a dummy variable that takes the value 1 if all values of a variable within a group exceed a certain value

I have a data set like the one below:
dat <- data.frame (id = c(1,1,1,1,1,2,2,2,2,2),
year = c(2015, 2016, 2017,2018, 2019, 2015, 2016, 2017, 2018, 2019),
ratio=c(0.6,0.6,0.65,0.7,0.8,0.4,1,0.5,0.3,0.7))
dat
id year ratio
1 1 2015 0.60
2 1 2016 0.60
3 1 2017 0.65
4 1 2018 0.70
5 1 2019 0.80
6 2 2015 0.40
7 2 2016 1.00
8 2 2017 0.50
9 2 2018 0.30
10 2 2019 0.70
I'd like to create a dummy variable that takes the value of 1 if all the values of the ratio variable within each id exceed 0.5. The resulting data set should be as follows:
dat
id year ratio dummy
1 1 2015 0.60 1
2 1 2016 0.60 1
3 1 2017 0.65 1
4 1 2018 0.70 1
5 1 2019 0.80 1
6 2 2015 0.40 0
7 2 2016 1.00 0
8 2 2017 0.50 0
9 2 2018 0.30 0
10 2 2019 0.70 0
Any help is much appreciated. Thank you!
You already have a base R answer in the comments. Here are dplyr and data.table versions.
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(dummy = as.integer(all(ratio > 0.5))) %>%
  ungroup()
# id year ratio dummy
# <dbl> <dbl> <dbl> <int>
# 1 1 2015 0.6 1
# 2 1 2016 0.6 1
# 3 1 2017 0.65 1
# 4 1 2018 0.7 1
# 5 1 2019 0.8 1
# 6 2 2015 0.4 0
# 7 2 2016 1 0
# 8 2 2017 0.5 0
# 9 2 2018 0.3 0
#10 2 2019 0.7 0
data.table -
library(data.table)
setDT(dat)[, dummy := as.integer(all(ratio > 0.5)), id]
dat
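For completeness, a base R sketch with ave, which applies a function within each id and spreads the result over that group's rows. This is not necessarily the answer given in the comments, just one way it could be done:

```r
dat <- data.frame(id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  year = c(2015, 2016, 2017, 2018, 2019, 2015, 2016, 2017, 2018, 2019),
                  ratio = c(0.6, 0.6, 0.65, 0.7, 0.8, 0.4, 1, 0.5, 0.3, 0.7))

# all(r > 0.5) gives one TRUE/FALSE per id; ave() recycles it across the group
dat$dummy <- as.integer(ave(dat$ratio, dat$id,
                            FUN = function(r) all(r > 0.5)))
dat
```

Note that FUN has to be passed by name, because ave treats unnamed extra arguments as grouping variables.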

Filter a group of a data.frame based on multiple conditions

I am looking for an elegant way to filter the values of a specific group of a big data.frame based on multiple conditions.
My data frame looks like this.
data=data.frame(group=c("A","B","C","A","B","C","A","B","C"),
time= c(rep(1,3),rep(2,3), rep(3,3)),
value=c(0.2,1,1,0.1,10,20,10,20,30))
group time value
1 A 1 0.2
2 B 1 1.0
3 C 1 1.0
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
For time point 1 only, I would like to keep just the values that are bigger than 0.1 and smaller than 1; rows at other time points should be left untouched.
I want my data.frame to look like this.
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Any help is highly appreciated.
With dplyr you can do
library(dplyr)
data %>% filter(!(time == 1 & (value <= 0.1 | value >= 1)))
# group time value
# 1 A 1 0.2
# 2 A 2 0.1
# 3 B 2 10.0
# 4 C 2 20.0
# 5 A 3 10.0
# 6 B 3 20.0
# 7 C 3 30.0
Or, if you have too much free time and decide to avoid dplyr:
# TRUE for time-1 rows whose value lies strictly between 0.1 and 1
ind <- with(data, time == 1 & value > 0.1 & value < 1)
# equivalent, but needlessly verbose:
# ind <- ifelse(data$time == 1 & data$value > 0.1 & data$value < 1, TRUE, FALSE)
# drop the time-1 rows that fail the condition
data <- data[!(data$time == 1 & !ind), ]
group time value
1 A 1 0.2
4 A 2 0.1
5 B 2 10.0
6 C 2 20.0
7 A 3 10.0
8 B 3 20.0
9 C 3 30.0
Another simple option would be to use subset twice and then append the results in a row wise manner.
rbind(
subset(data, time == 1 & value > 0.1 & value < 1),
subset(data, time != 1)
)
# group time value
# 1 A 1 0.2
# 4 A 2 0.1
# 5 B 2 10.0
# 6 C 2 20.0
# 7 A 3 10.0
# 8 B 3 20.0
# 9 C 3 30.0
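The same filter can also be written as a single logical index in base R: keep a row if it is not at time 1, or if its value lies strictly between 0.1 and 1. A small sketch on the example data (the name kept is only illustrative):

```r
data <- data.frame(group = c("A", "B", "C", "A", "B", "C", "A", "B", "C"),
                   time  = c(rep(1, 3), rep(2, 3), rep(3, 3)),
                   value = c(0.2, 1, 1, 0.1, 10, 20, 10, 20, 30))

# rows outside time 1 pass automatically; time-1 rows must satisfy the range
kept <- data[data$time != 1 | (data$value > 0.1 & data$value < 1), ]
kept
```

Unlike the rbind approach, this keeps the rows in their original order even when time 1 is not the first block of the data.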

how to convert wide dataframe columns to long format using sparklyr? [duplicate]

I have a dataframe with columns id, price1, price2, price3, prob1, prob2, prob3. I want to convert it from wide format to long format, with the price and prob columns stacked.
library(dplyr)
library(data.table)
a <- data.table("id" = c(1,2,4),
"price1"=c(1.2,2.44,5.6),
"price2"=c(7.6,8,65),
"price3"=c(1.2,4.5,7.8),
"prob1"=c(0.1,0.3,0.5),
"prob2"=c(0.3,0.35,0.75),
"prob3"=c(0.18,0.31,0.58))
> a
id price1 price2 price3 prob1 prob2 prob3
1 1 1.20 7.6 1.2 0.1 0.30 0.18
2 2 2.44 8.0 4.5 0.3 0.35 0.31
3 4 5.60 65.0 7.8 0.5 0.75 0.58
I want to transform the table a as
b <- data.table("id"=c(1,1,1,2,2,2,4,4,4),
"order"=c(1,2,3,1,2,3,1,2,3),
"price"=c(1.20,7.6,1.2,2.44,8.0,4.5,5.60,65.0,7.8),
"prob"=c(0.1,0.30,0.18,0.3,0.35,0.31,0.5,0.75,0.58))
> b
id order price prob
1: 1 1 1.20 0.10
2: 1 2 7.60 0.30
3: 1 3 1.20 0.18
4: 2 1 2.44 0.30
5: 2 2 8.00 0.35
6: 2 3 4.50 0.31
7: 4 1 5.60 0.50
8: 4 2 65.00 0.75
9: 4 3 7.80 0.58
Here, order indicates the sequence number of the price and prob values; without it, they would get shuffled.
I want to do this transformation in sparklyr.
You can use pivot_longer specifying names_pattern.
tidyr::pivot_longer(a, cols = -id,
names_to = c('.value', 'order'),
names_pattern = '(.*?)(\\d+)')
# A tibble: 9 x 4
# id order price prob
# <dbl> <chr> <dbl> <dbl>
#1 1 1 1.2 0.1
#2 1 2 7.6 0.3
#3 1 3 1.2 0.18
#4 2 1 2.44 0.3
#5 2 2 8 0.35
#6 2 3 4.5 0.31
#7 4 1 5.6 0.5
#8 4 2 65 0.75
#9 4 3 7.8 0.580
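To see what the names_pattern regular expression does, it may help to apply it to the column names directly. The sketch below uses base R's regexec and regmatches purely as an illustration; it is an assumption that this mirrors how tidyr matches the names internally, but the two capture groups should split the names the same way:

```r
cols <- c("price1", "price2", "price3", "prob1", "prob2", "prob3")

# group 1 (lazy) takes the text before the digits, group 2 takes the digits
m <- regmatches(cols, regexec("(.*?)(\\d+)", cols, perl = TRUE))

# each element of m is c(full match, group 1, group 2); keep the two groups
parts <- do.call(rbind, m)[, 2:3]
colnames(parts) <- c(".value", "order")
parts
```

The first group becomes the output column name (the special '.value' entry in names_to) and the second group becomes the order column.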

for-loop inside mutate and append result

I have a simple for-loop which works as I would like on vectors. I would like to use it on a column of a dataframe, grouped by another column of the dataframe, e.g.:
# here is my for-loop working as expected on a simple vector:
vect <- c(0.5, 0.7, 0.1)
res <- vector(mode = "numeric", length = 3)
for (i in 1:length(vect)) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
[1] 1.9411537 0.9715143 5.5456579
And here is pseudo-code trying to do it on a column of a dataframe:
#Example data
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
num1 = 1:3, vect = c(0.5, 0.7, 0.1), num3 = NA)
my.df
let num1 vect num3
1 A 1 0.5 NA
2 A 2 0.7 NA
3 A 3 0.1 NA
4 B 1 0.5 NA
5 B 2 0.7 NA
6 B 3 0.1 NA
7 C 1 0.5 NA
8 C 2 0.7 NA
9 C 3 0.1 NA
# My attempt:
require(tidyverse)
my.df <- my.df %>%
group_by(let) %>%
mutate(for (i in 1:length(vect)) {
num3[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
})
What the result should look like (but my pseudo-code above doesn't work):
let num1 vect num3
1 A 1 0.5 1.9411537
2 A 2 0.7 0.9715143
3 A 3 0.1 5.5456579
4 B 1 0.5 1.9411537
5 B 2 0.7 0.9715143
6 B 3 0.1 5.5456579
7 C 1 0.5 1.9411537
8 C 2 0.7 0.9715143
9 C 3 0.1 5.5456579
I feel like I am not using tidyverse logic by trying to have a for-loop inside mutate; any suggestions much appreciated.
The simple solution is to create a custom function and pass that to mutate. A working solution:
custom_func <- function(vec) {
  # one result per element of the input, whatever the group size
  res <- numeric(length(vec))
  for (i in seq_along(vec)) {
    # use the argument vec, not a global vect
    res[i] <- sum(exp(-2 * (vec[i] - vec[-i])))
  }
  res
}
library(tidyverse)
my.df %>%
group_by(let) %>%
mutate(num3 = custom_func(vect))
#> # A tibble: 9 x 4
#> # Groups: let [3]
#> let num1 vect num3
#> <fct> <int> <dbl> <dbl>
#> 1 A 1 0.5 1.94
#> 2 A 2 0.7 0.972
#> 3 A 3 0.1 5.55
#> 4 B 1 0.5 1.94
#> 5 B 2 0.7 0.972
#> 6 B 3 0.1 5.55
#> 7 C 1 0.5 1.94
#> 8 C 2 0.7 0.972
#> 9 C 3 0.1 5.55
I'm wondering whether a more elegant version of the custom function is possible - perhaps someone smarter than me can tell you whether purrr::map, for example, could provide an alternative.
We can use map_dbl from purrr and apply the formula for calculation.
library(dplyr)
library(purrr)
my.df %>%
group_by(let) %>%
mutate(num3 = map_dbl(seq_along(vect), ~ sum(exp(-2 * (vect[.] - vect[-.])))))
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
#1 A 1 0.5 1.94
#2 A 2 0.7 0.972
#3 A 3 0.1 5.55
#4 B 1 0.5 1.94
#5 B 2 0.7 0.972
#6 B 3 0.1 5.55
#7 C 1 0.5 1.94
#8 C 2 0.7 0.972
#9 C 3 0.1 5.55
You can turn your for-loop into a sapply-call and then use it in mutate.
sapply takes a function and applies it to each element of a list or vector. In this case I'm looping over the number of elements in each group (n()).
my.df %>%
group_by(let) %>%
mutate(num3 = sapply(1:n(), function(i) sum(exp(-2 * (vect[i] - vect[-i])))))
# A tibble: 9 x 4
# Groups: let [3]
# let num1 vect num3
# <fct> <int> <dbl> <dbl>
# 1 A 1 0.5 1.94
# 2 A 2 0.7 0.972
# 3 A 3 0.1 5.55
# 4 B 1 0.5 1.94
# 5 B 2 0.7 0.972
# 6 B 3 0.1 5.55
# 7 C 1 0.5 1.94
# 8 C 2 0.7 0.972
# 9 C 3 0.1 5.55
This is essentially equivalent to the very wrong-looking for-loop inside a mutate call. In this case, however, I'd prefer the custom function provided by A. Stam.
my.df %>%
group_by(let) %>%
mutate(num3 = {
res <- numeric(length = n())
for (i in 1:n()) {
res[i] <- sum(exp(-2 * (vect[i] - vect[-i])))
}
res
})
You can also replace sapply with purrr's map_dbl.
Or using data.table
library(data.table)
setDT(my.df)[, num3 := unlist(lapply(seq_len(.N),
function(i) sum(exp(-2 * (vect[i] - vect[-i]))))), let]
my.df
# let num1 vect num3
#1: A 1 0.5 1.9411537
#2: A 2 0.7 0.9715143
#3: A 3 0.1 5.5456579
#4: B 1 0.5 1.9411537
#5: B 2 0.7 0.9715143
#6: B 3 0.1 5.5456579
#7: C 1 0.5 1.9411537
#8: C 2 0.7 0.9715143
#9: C 3 0.1 5.5456579
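The calculation can also be vectorised without any loop or apply call. A sketch in base R, assuming the groups are small enough that an n-by-n matrix per group is acceptable (pair_sum is just an illustrative name): outer builds all pairwise differences, and subtracting 1 removes the i == i term, which is exp(0).

```r
my.df <- data.frame(let = rep(LETTERS[1:3], each = 3),
                    num1 = 1:3, vect = c(0.5, 0.7, 0.1))

pair_sum <- function(v) {
  # entry [i, j] of outer(v, v, "-") is v[i] - v[j]
  rowSums(exp(-2 * outer(v, v, "-"))) - 1
}

# apply within each level of let
my.df$num3 <- ave(my.df$vect, my.df$let, FUN = pair_sum)
my.df
```

The same pair_sum could be used inside a grouped mutate or a data.table by-expression instead of ave.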

creating time to improvement of +1 variable in r?

I want to create a "time to improvement of +1" variable. I have longitudinal data in long format, measured at baseline and at 3, 6 and 9 months. How do I go about it in R? The improvement is measured from the baseline.
The data is like this:
sno time WHZ
1 0 -0.5
1 3 1.4
1 6 -0.7
1 9 2.2
2 0 -0.63
2 3 0.7
2 6 -2.64
2 9 2.1
expected output
sno time WHZ impr First time to imp
1 0 -0.5 0 3
1 3 1.4 1.9 3
1 6 -0.7 -0.2 3
1 9 2.2 2.7 3
2 0 -0.63 0 3
2 3 0.7 1.33 3
2 6 -2.64 -2.01 3
2 9 2.1 2.73 3
The code I was trying to use to first create the improvement variable:
library(dplyr)
data %>%
group_by(sno)%>%
mutate(ImprvWHZ = data$WHZ - lag(data$WHZ, default = data$WHZ[1]))
If I understand the question correctly, here is a dplyr solution.
library(dplyr)
dat %>%
group_by(sno) %>%
mutate(Improv = WHZ - WHZ[1],
TimeToImprov = ifelse(Improv > 1, time - time[1], NA))
## A tibble: 8 x 5
## Groups: sno [2]
# sno time WHZ Improv TimeToImprov
# <int> <int> <dbl> <dbl> <int>
#1 1 0 -0.5 0 NA
#2 1 3 1.4 1.9 3
#3 1 6 -0.7 -0.200 NA
#4 1 9 2.2 2.7 9
#5 2 0 -0.63 0 NA
#6 2 3 0.7 1.33 3
#7 2 6 -2.64 -2.01 NA
#8 2 9 2.1 2.73 9
And here is a base R solution.
res <- lapply(split(dat, dat$sno), function(DF){
DF$Improv <- DF$WHZ - DF$WHZ[1]
DF$TimeToImprov <- ifelse(DF$Improv > 1, DF$time - DF$time[1], NA)
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# sno time WHZ Improv TimeToImprov
#1 1 0 -0.50 0.00 NA
#2 1 3 1.40 1.90 3
#3 1 6 -0.70 -0.20 NA
#4 1 9 2.20 2.70 9
#5 2 0 -0.63 0.00 NA
#6 2 3 0.70 1.33 3
#7 2 6 -2.64 -2.01 NA
#8 2 9 2.10 2.73 9
DATA.
dat <- read.table(text = "
sno time WHZ
1 0 -0.5
1 3 1.4
1 6 -0.7
1 9 2.2
2 0 -0.63
2 3 0.7
2 6 -2.64
2 9 2.1
", header = TRUE)
