Inquiring a better way to write code in R - r

I am new to R, and I'd like help in finding a better way to write the following code I've written. Any help would be appreciated.
df$rank[between(df$score,0,1.2)] <- 1
df$rank[between(df$score,1.2,2.1)] <- 2
df$rank[between(df$score,2.1,2.9)] <- 3
df$rank[between(df$score,2.9,3.7)] <- 4
df$rank[between(df$score,3.7,4.5)] <- 5
df$rank[between(df$score,4.5,5.4)] <- 6

You can use cut:
df$rank <- cut(x = df$score,c(0,1.2,2.1,2.9,3.7,4.5,5.4,Inf),FALSE)

library(dplyr)
set.seed(1234)
df <- data.frame(rank = rep(0, 15),
score = runif(15, 0, 6))
df
#> rank score
#> 1 0 0.68222047
#> 2 0 3.73379643
#> 3 0 3.65564840
#> 4 0 3.74027665
#> 5 0 5.16549230
#> 6 0 3.84186363
#> 7 0 0.05697454
#> 8 0 1.39530304
#> 9 0 3.99650255
#> 10 0 3.08550685
#> 11 0 4.16154775
#> 12 0 3.26984901
#> 13 0 1.69640150
#> 14 0 5.54060091
#> 15 0 1.75389504
df %>%
mutate(rank = case_when(between(score, 0, 1.2) ~ 1,
between(score, 1.2, 2.1) ~ 2,
between(score, 2.1, 2.9) ~ 3,
between(score, 2.9, 3.7) ~ 4,
between(score, 3.7, 4.5) ~ 5,
between(score, 4.5, 5.4) ~ 6))
#> rank score
#> 1 1 0.68222047
#> 2 5 3.73379643
#> 3 4 3.65564840
#> 4 5 3.74027665
#> 5 6 5.16549230
#> 6 5 3.84186363
#> 7 1 0.05697454
#> 8 2 1.39530304
#> 9 5 3.99650255
#> 10 4 3.08550685
#> 11 5 4.16154775
#> 12 4 3.26984901
#> 13 2 1.69640150
#> 14 NA 5.54060091
#> 15 2 1.75389504
Created on 2018-04-29 by the reprex package (v0.2.0).

As you didn't add a reproducible example, I created a little one (but keep in mind you should always add an example).
Using ifelse from base you could do this way:
df = data.table(rank = c(1.2, 3.3, 2.5, 3.7, 5.8, 6, 3, 1.1, 0.5))
df$rank2 = ifelse(df$rank>0 & df$rank<=1.2, 1,
ifelse(df$rank>1.2 & df$rank<=2.1, 2,
ifelse(df$rank>2.1 & df$rank<=2.9, 3,
ifelse(df$rank>2.9 & df$rank<=3.7, 4,
ifelse(df$rank>3.7 & df$rank<=4.5, 5, 6)))))
The last ifelse should be your maximun rank value, so the "no" argument will be the last range.
If this is a reocurring problem you should create a function.
Hope it helps.

Related

R dataframe with special cumsum

I have a dateframe like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
# Limits for desired cumulative sum (CumSum)
maxCumSum <- 8
minCumSum <- 0
What I would like to calculate is a cumulative sum of value by group (grp) within the values of maxCumSum and minCumSum. The respective table dt2 should look something like this:
grp t value CumSum
a 1 -1 0
a 2 5 5
a 3 9 8
a 4 -15 0
a 5 6 6
b 1 5 5
b 2 1 6
b 3 7 8
b 4 -11 0
b 5 9 8
Think of CumSum as a water storage with has a certain maximum capacity and the level of which cannot sink below zero.
The normal cumsum does obviously not do the trick since there are no limitations to maximum or minimum. Has anyone a suggestion how to achieve this? In the real dataframe there are of course more than 2 groups and far more than 5 times.
Many thanks!
What you can do is create a function which calculate the cumsum until it reach the max value and start again at the min value like this:
df <- data.frame(grp = c(rep("a", 5), rep("b", 5)), t = c(1:5, 1:5), value = c(-1, 5, 9, -15, 6, 5, 1, 7, -11, 9))
library(dplyr)
maxCumSum <- 8
minCumSum <- 0
f <- function(x, y) max(min(x + y, maxCumSum), minCumSum)
df %>%
group_by(grp) %>%
mutate(CumSum = Reduce(f, value, 0, accumulate = TRUE)[-1])
#> # A tibble: 10 × 4
#> # Groups: grp [2]
#> grp t value CumSum
#> <chr> <int> <dbl> <dbl>
#> 1 a 1 -1 0
#> 2 a 2 5 5
#> 3 a 3 9 8
#> 4 a 4 -15 0
#> 5 a 5 6 6
#> 6 b 1 5 5
#> 7 b 2 1 6
#> 8 b 3 7 8
#> 9 b 4 -11 0
#> 10 b 5 9 8
Created on 2022-07-04 by the reprex package (v2.0.1)

Using mclapply over a list of lists

Suppose I have the below list of lists :
users_days_delays <- list(users, days, delays)
where
users <- list(1, 2, 3)
days <- list(1, 2, 3, 4)
delays <- list(0, 100)
I have the below function
compute_time <- function(user_day_delay){
user <- user_day_delay[[1]]
days <- user_day_delay[[2]]
delays <- user_day_delay[[3]]
time_per_user <- days+delays - 180*user
return(time_per_user)
}
Is there a way for me to use mclapply on the users_days_delays list of lists?
For ex. can I do :
time_per_users <- mclapply(users_days_delays, compute_time)
Thank you!
library(tidyverse)
compute_time <- function(user, day, delay){
time_per_user <- day + delay - 180*user
return(time_per_user)
}
df <- expand_grid(
user = c(1, 2, 3),
day = c(1, 2, 3, 4),
delay = c(0, 100),
)
df$time_per_user <- pmap_dbl(df, compute_time)
print(df)
#> # A tibble: 24 x 4
#> user day delay time_per_user
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 0 -179
#> 2 1 1 100 -79
#> 3 1 2 0 -178
#> 4 1 2 100 -78
#> 5 1 3 0 -177
#> 6 1 3 100 -77
#> 7 1 4 0 -176
#> 8 1 4 100 -76
#> 9 2 1 0 -359
#> 10 2 1 100 -259
#> # ... with 14 more rows
Created on 2022-03-04 by the reprex package (v2.0.1)

Is that possible to get the index of unselected rows of data frame in R?

I want the indices of the unselected rows when using sample() in R. Consider the following case.
df <- data.frame(id = c(1,1,2,2,3,3),
v1 = c(2,2,9,4,7,1),
v2 = c(3,5,8,5,8,5))
ss <- ceiling(0.5*nrow(df)) #size
set.seed(123)
rid <- sample(seq_len(nrow(df)),size=ss,replace=F)
Now, the rows 3,6,2 are randomly selected. Is there a way to know indices of unselected rows (1,4,5)?
Thanks!
You can use df[-rid,]:
df <- data.frame(
id = c(1, 1, 2, 2, 3, 3),
v1 = c(2, 2, 9, 4, 7, 1),
v2 = c(3, 5, 8, 5, 8, 5)
)
ss <- ceiling(0.5 * nrow(df)) # size
set.seed(123)
rid <- sample(seq_len(nrow(df)), size = ss, replace = F)
rid
#> [1] 3 6 2
df
#> id v1 v2
#> 1 1 2 3
#> 2 1 2 5
#> 3 2 9 8
#> 4 2 4 5
#> 5 3 7 8
#> 6 3 1 5
df[rid,]
#> id v1 v2
#> 3 2 9 8
#> 6 3 1 5
#> 2 1 2 5
df[-rid, ]
#> id v1 v2
#> 1 1 2 3
#> 4 2 4 5
#> 5 3 7 8
rownames(df[-rid, ])
#> [1] "1" "4" "5"
Created on 2021-11-05 by the reprex package (v2.0.1)

Transforma data using mutate and ncol?

I am trying to create a script that allows me to transform my data. At the moment I have two different scripts depending on the two ways I usually get the data; either in presence/absence form or with the abundance class written next to each case.
Script 1:
mutate(data, abundance= case_when(data[,2]== 1 ~ 1,
data[,3]==1 ~ 2,
data[,4]==1 ~ 3,
data[,5]==1 ~ 3,
data[,6]==1 ~ 4,
data[,7]==1 ~ 4,
data[,8]==1 ~ 4,
data[,9]== 1 ~ 4,
data[,10]==1 ~ 5,
data[,11]==1 ~ 5,
data[,12]==1 ~ 5,
data[,13]==1 ~ 5,
data[,14]==1 ~ 5,
TRUE ~ NA_real_))
Script 2:
mutate(data, abundance= case_when(data[,2]=="< 1" ~ 2,
data[,2]=="1 - <5%" ~ 3,
data[,2]=="5 - <10%" ~ 3,
data[,2]=="10 - <20%" ~ 4,
data[,2]=="20 - <30%" ~ 4,
data[,2]=="30 - <40%" ~ 4,
data[,2]=="40 - <50%" ~ 4,
data[,2]=="50 - <60%" ~ 5,
data[,2]=="60 - <70%" ~ 5,
data[,2]=="70 - <80%" ~ 5,
data[,2]=="80 - <90%" ~ 5,
data[,2]=="90 - 100%" ~ 5,
TRUE ~ NA_real_)))
Therefore, my data have either two columns or fourteen. I have been thinking about how I could make r distinguish between these two possibilities, and although this option works, the results are not as expected since it adds the value of 1 to all cases.
mutate(data, abundance= ifelse(ncol(data)>2, case_when(data[,2]== 1 ~ 1, data[,3]==1 ~ 2,
data[,4]==1 ~ 3,
data[,5]==1 ~ 3,
data[,6]==1 ~ 4,
data[,7]==1 ~ 4,
data[,8]==1 ~ 4,
data[,9]== 1 ~ 4,
data[,10]==1 ~ 5,
data[,11]==1 ~ 5,
data[,12]==1 ~ 5,
data[,13]==1 ~ 5,
data[,14]==1 ~ 5,
TRUE ~ NA_real_), case_when(data[,2]=="< 0,1" ~ 1,
data[,2]=="< 1" ~ 2,
data[,2]=="1 - <5%" ~ 3,
data[,2]=="5 - <10%" ~ 3,
data[,2]=="10 - <20%" ~ 4,
data[,2]=="20 - <30%" ~ 4,
data[,2]=="30 - <40%" ~ 4,
data[,2]=="40 - <50%" ~ 4,
data[,2]=="50 - <60%" ~ 5,
data[,2]=="60 - <70%" ~ 5,
data[,2]=="70 - <80%" ~ 5,
data[,2]=="80 - <90%" ~ 5,
data[,2]=="90 - 100%" ~ 5,
TRUE ~ NA_real_)))
I hope you can understand me and thank you very much for your help.
The problem with your solution is in the ifelse. The condition has to be of the same length as the data. A naive solution is to replicate the condition to the length of the data ifelse(rep(ncol(data) > 2, nrow(data)), ..., ...)
Below is the solution with some dummy data:
library(tidyverse)
set.seed(123)
# dummy data
a <- data.frame(V1 = round(runif(10)*10), V2= round(runif(10)*10))
b <- cbind(a,a)
colnames(b) <- paste0("V", 1:4)
# test of "case_when" for each input data format
mutate(b, abundance = case_when(b[,1] == 1 ~ 1,
b[,2] == 1 ~ 2,
TRUE ~ NA_real_))
#> V1 V2 V3 V4 abundance
#> 1 3 10 3 10 NA
#> 2 8 5 8 5 NA
#> 3 4 7 4 7 NA
#> 4 9 6 9 6 NA
#> 5 9 1 9 1 2
#> 6 0 9 0 9 NA
#> 7 5 2 5 2 NA
#> 8 9 0 9 0 NA
#> 9 6 3 6 3 NA
#> 10 5 10 5 10 NA
mutate(a, abundance = case_when(a[,2] == 1 ~ 1,
a[,2] == 2 ~ 2))
#> V1 V2 abundance
#> 1 3 10 NA
#> 2 8 5 NA
#> 3 4 7 NA
#> 4 9 6 NA
#> 5 9 1 1
#> 6 0 9 NA
#> 7 5 2 2
#> 8 9 0 NA
#> 9 6 3 NA
#> 10 5 10 NA
# Solution
data <- a
mutate(data, abundance = ifelse(rep(ncol(data) > 2, nrow(data)),
case_when(data[,1] == 1 ~ 1,
data[,2] == 1 ~ 2,
TRUE ~ NA_real_),
case_when(data[,2] == 1 ~ 1,
data[,2] == 2 ~ 2)))
#> V1 V2 abundance
#> 1 3 10 NA
#> 2 8 5 NA
#> 3 4 7 NA
#> 4 9 6 NA
#> 5 9 1 1
#> 6 0 9 NA
#> 7 5 2 2
#> 8 9 0 NA
#> 9 6 3 NA
#> 10 5 10 NA
data <- b
mutate(data, abundance = ifelse(rep(ncol(data) > 2, nrow(data)),
case_when(data[,1] == 1 ~ 1,
data[,2] == 1 ~ 2,
TRUE ~ NA_real_),
case_when(data[,2] == 1 ~ 1,
data[,2] == 2 ~ 2)))
#> V1 V2 V3 V4 abundance
#> 1 3 10 3 10 NA
#> 2 8 5 8 5 NA
#> 3 4 7 4 7 NA
#> 4 9 6 9 6 NA
#> 5 9 1 9 1 2
#> 6 0 9 0 9 NA
#> 7 5 2 5 2 NA
#> 8 9 0 9 0 NA
#> 9 6 3 6 3 NA
#> 10 5 10 5 10 NA

Mutate within a for loop

I have a dataframe like this
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("Love_ABC", "Love_CNN", "Hate_ABC", "Hate_CNN", "Love_CNBC", "Hate_CNBC"), row.names = c(NA,
8L), class = "data.frame")
I have made the following for loop
channels = c("ABC", "CNN", "CNBC")
for (channel in channels) {
dataframe <- dataframe %>%
mutate(ALL_channel = Love_channel + Hate_channel)
}
But when i run the for loop R tells me " object Love_channel" not found. Have i done something wrong in the for loop?
Here's a way with rlang. Note, reshaping the data is likely more straightforward. Non-standard evaluation (NSE) is a complicated topic.
for (channel in channels) {
DF <- DF %>%
mutate(!!sym(paste0("ALL_", channel)) := !!sym(paste0("Love_", channel)) + !!sym(paste0("Hate_", channel)))
}
DF
## Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC ALL_ABC ALL_CNN ALL_CNBC
## 1 1 1 6 6 1 2 7 7 3
## 2 3 3 3 2 2 3 6 5 5
## 3 4 4 6 4 4 4 10 8 8
## 4 6 2 5 5 5 2 11 7 7
## 5 3 6 3 3 6 2 6 9 8
## 6 2 7 6 7 7 7 8 14 14
## 7 5 2 5 2 6 5 10 4 11
## 8 1 6 3 6 3 2 4 12 5
This is a solution with dplyr and tidyr:
library(tidyr)
library(dplyr)
dataframe <- dataframe %>%
tibble::rowid_to_column()
dataframe %>%
pivot_longer(-rowid, names_to = c(NA, "channel"), names_sep = "_") %>%
pivot_wider(names_from = channel, names_prefix = "ALL_", values_from = value, values_fn = sum) %>%
right_join(dataframe, by = "rowid") %>%
select(-rowid)
#> # A tibble: 8 x 9
#> ALL_ABC ALL_CNN ALL_CNBC Love_ABC Love_CNN Hate_ABC Hate_CNN Love_CNBC Hate_CNBC
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 7 7 3 1 1 6 6 1 2
#> 2 6 5 5 3 3 3 2 2 3
#> 3 10 8 8 4 4 6 4 4 4
#> 4 11 7 7 6 2 5 5 5 2
#> 5 6 9 8 3 6 3 3 6 2
#> 6 8 14 14 2 7 6 7 7 7
#> 7 10 4 11 5 2 5 2 6 5
#> 8 4 12 5 1 6 3 6 3 2
The idea is to reshape it to make the sums easier. Then you can join the final result back to the initial dataframe.
start by uniquely identifying each row with a rowid.
reshape with pivot_longer so to have all values neatly in one column. In this step you also separate the names Love/Hate_channel in two and you remove the Love/Hate part (you are interested only on the channel) [that is what the NA does!].
reshape again: this time you want to get one column for each channel. In this step you also sum up what previously was Love and Hate together for each rowid and channel (that's what values_fn=sum does!). Also you add a prefix (names_prefix = "ALL_") to each new column name to have names that respect your expected final result.
with right_join you add the values back to the original dataframe. You have no need for rowid now, so you can remove it.

Resources