Conditionally convert numbers in an R data frame

I am trying to convert data so that each column is represented by 0's, 1's, and 2's. I have a data frame with 5 populations and 6 variables (there are actually 100+ populations and 5,000+ variables in the real data frame):
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 11 11 22 44 11 22
2 Teton 14 44 12 34 33 22
3 Vipond Park 44 11 22 44 33 NA
4 Little Joe 11 44 NA 44 13 44
5 Rainier 14 11 11 NA 11 44
In each column, I have the following combinations of numbers:
1 and 3,
2 and 4,
2 and 3,
1 and 4,
3 and 4,
1 and 2
For each column, I need to convert one of the "doubled numbers" to a 0, the OTHER of the doubled numbers to a 2, and then those variables that are a combination of two numbers to a 1 (the intermediate value). (So, 13, 24, 23, 14, 34, and 12 should become 1.)
For example, for Var1 in the data frame above, 11 should be 0, 14 should be 1, and 44 should be 2. Some columns have only one of the doubled numbers, and then the combination of the numbers as well. There is also missing data. For example, I am trying to convert the above data frame to:
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 0 0 0 0 0 0
2 Teton 1 2 1 1 2 0
3 Vipond Park 2 0 0 0 2 NA
4 Little Joe 0 2 NA 0 1 2
5 Rainier 1 0 2 NA 0 2

Let u be the unique non-NA elements of x. is.twice is a logical vector that is TRUE for the doubled numbers in u (such as 11 or 44) and FALSE for the others. uu holds the doubled numbers and other holds the remaining mixed number, which may be zero-length if the column has none. Finally, compute the labels associated with c(uu, other) and translate x:
f <- function(x) {
u <- unique(na.omit(x))
# separate u into uu (double digits) and other
is.twice <- u %% 10 == u %/% 10 # true if double digit
uu <- u[is.twice]
other <- u[!is.twice]
# compute labels associated with c(uu, other)
labels <- c(0, 2)[seq_along(uu)]
if (length(other) > 0) labels <- c(labels, 1)
# translate x to appropriate labels
labels[match(x, c(uu, other))]
}
replace(DF, -1, lapply(DF[-1], f))
which for the sample data gives:
pop Var1 Var2 Var3 Var4 Var5 Var6
1 Crater 0 0 0 0 0 0
2 Teton 1 2 1 1 2 0
3 Vipond Park 2 0 0 0 2 NA
4 Little Joe 0 2 NA 0 1 2
5 Rainier 1 0 2 NA 0 2
Note: The above used this input:
DF <-
structure(list(pop = structure(c(1L, 4L, 5L, 2L, 3L), .Label = c("Crater",
"Little Joe", "Rainier", "Teton", "Vipond Park"), class = "factor"),
Var1 = c(11L, 14L, 44L, 11L, 14L), Var2 = c(11L, 44L, 11L,
44L, 11L), Var3 = c(22L, 12L, 22L, NA, 11L), Var4 = c(44L,
34L, 44L, 44L, NA), Var5 = c(11L, 33L, 33L, 13L, 11L), Var6 = c(22L,
22L, NA, 44L, 44L)), .Names = c("pop", "Var1", "Var2", "Var3",
"Var4", "Var5", "Var6"), class = "data.frame", row.names = c(NA,
-5L))
Update: Fixed.
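To see what f does on a single column, here is a small trace of the same steps applied to Var5 from the sample data. It is only an illustrative walk-through; the variable names mirror those inside f.

```r
# Var5 from the sample data
x <- c(11L, 33L, 33L, 13L, 11L)

u <- unique(na.omit(x))              # 11 33 13
is.twice <- u %% 10 == u %/% 10      # TRUE TRUE FALSE
uu <- u[is.twice]                    # 11 33
other <- u[!is.twice]                # 13

labels <- c(0, 2)[seq_along(uu)]     # 0 2
if (length(other) > 0) labels <- c(labels, 1)   # 0 2 1

# look up each value of x in c(uu, other) and take the matching label
result <- labels[match(x, c(uu, other))]
result
# [1] 0 2 2 1 0
```

Note that the doubled number that appears first in a column is the one that gets 0, since unique() keeps values in order of appearance.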

Related

Rowsums on two vectors of paired columns but conditional on specific values

I have a dataset that looks like the one below where there are three "pairs" of columns pertaining to the type (datA, datB, datC), and the total for each type (datA_total, datB_total, datC_total):
structure(list(datA = c(1L, NA, 5L, 3L, 8L, NA), datA_total = c(20L,
30L, 40L, 15L, 10L, NA), datB = c(5L, 5L, NA, 6L, 1L, NA), datB_total = c(80L,
10L, 10L, 5L, 4L, NA), datC = c(NA, 4L, 1L, NA, 3L, NA), datC_total = c(NA,
10L, 15L, NA, 20L, NA)), class = "data.frame", row.names = c(NA,
-6L))
# datA datA_total datB datB_total datC datC_total
#1 1 20 5 80 NA NA
#2 NA 30 5 10 4 10
#3 5 40 NA 10 1 15
#4 3 15 6 5 NA NA
#5 8 10 1 4 3 20
#6 NA NA NA NA NA NA
I'm trying to compute a row sum for each row that gives the total visits across the data types, counting a type only if it meets the criterion of having a score in the range 1-5.
Here is my thought process:
Select only the variables that are the data types (i.e. datA, datB, datC)
Across each row based on EACH data type, determine if that data type meets a criteria (i.e. datA -> does it contain (1,2,3,4,5))
If that data type column does contain one of the 5 values above ^, then look to its paired total variable and ready that value to be rowSummed (i.e. datA -> does it contain (1,2,3,4,5)? -> if yes, then grab datA_total value = 20).
The goal is to end up with a total column like below:
# datA datA_total datB datB_total datC datC_total overall_total
#1 1 20 5 80 NA NA 100
#2 NA 30 5 10 4 10 20
#3 5 40 NA 10 1 15 55
#4 3 15 6 5 NA NA 15
#5 8 10 1 4 3 20 24
#6 NA NA NA NA NA NA 0
You'll notice that row #2 only contained a total of 20 even though there is 30 in datA_total. This is a result of the conditional selection in that datA for row#2 contains "NA" rather than one of the five scores (1,2,3,4,5). Hence, the datA_total of 30 was not included in the rowSums calculation.
My code below shows the vectors I created and my attempt at a conditional rowSums but I end up getting an error regarding mutate... I'm not sure how to integrate the "conditional pairing" portion of this problem:
type_vars <- c("datA", "datB", "datC")
type_scores <- c("1", "2", "3", "4", "5")
type_visits <- c("datA_total", "datB_total", "datC_total")
df <- df %>%
mutate(overall_total = rowSums(all_of(type_visits[type_vars %in% type_scores])))
Any help/tips would be appreciated
dplyr's across should do the job.
library(dplyr)
# copying your tibble
data <-
tibble(
datA = c(1, NA, 5, 3, 8, NA),
datA_total = c(20, 30, 40, 15, 10, NA),
datB = c(5, 5, NA, 6, 1, NA),
datB_total = c(80, 10, 10, 5, 4, NA),
datC = c(NA, 4, 1, NA, 3, NA),
datC_total = c(NA, 10, 15, NA, 20, NA)
)
data %>%
mutate(across(c('A', 'B', 'C') %>% paste0('dat', .), \(x) (x %in% 1:5) * get(cur_column() %>% paste0(., '_total')), .names = "{col}_aux")) %>%
rowwise() %>%
mutate(overall_total = sum(across(ends_with('aux')), na.rm = TRUE)) %>%
select(any_of(c(names(data), 'overall_total')))
# A tibble: 6 × 7
datA datA_total datB datB_total datC datC_total overall_total
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 20 5 80 NA NA 100
2 NA 30 5 10 4 10 20
3 5 40 NA 10 1 15 55
4 3 15 6 5 NA NA 15
5 8 10 1 4 3 20 24
6 NA NA NA NA NA NA 0
First, we create an 'aux' column for each dat. It is 0 if dat is not within 1:5, and dat_total otherwise. Then we sum ignoring NA.
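The same "zero out, then sum" idea can also be sketched in base R without auxiliary columns. This is a minimal sketch rather than a drop-in replacement; it assumes every type column has a partner named with the "_total" suffix, as in the example data.

```r
df <- data.frame(
  datA = c(1, NA, 5, 3, 8, NA),
  datA_total = c(20, 30, 40, 15, 10, NA),
  datB = c(5, 5, NA, 6, 1, NA),
  datB_total = c(80, 10, 10, 5, 4, NA),
  datC = c(NA, 4, 1, NA, 3, NA),
  datC_total = c(NA, 10, 15, NA, 20, NA)
)

type_vars  <- c("datA", "datB", "datC")
total_vars <- paste0(type_vars, "_total")   # assumes the "<type>_total" naming

# TRUE where the score is one of 1..5 (NA counts as FALSE)
qualifies <- sapply(df[type_vars], function(x) x %in% 1:5)

# zero out totals whose paired score does not qualify, then sum per row
totals <- as.matrix(df[total_vars])
totals[!qualifies] <- 0
df$overall_total <- rowSums(totals, na.rm = TRUE)

df$overall_total
# [1] 100  20  55  15  24   0
```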

R: subset dataframe for all rows after a condition is met

So I'm having a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the data frame into a new one containing, for each group, only the rows from the point where Var1 first reaches its group maximum (including that row) up to the row where Var2 becomes 1 for the first time (also including that row). So what I'd like to have should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs and the function should simply ignore those. Also, if Var2 never reaches 1 for a group, it should just add all rows to the new data frame (only the ones after Var1 reaches its group maximum, of course).
However, I cannot wrap my head around the programming. Can anyone help?
A dplyr solution with cumsum based filter will do what the question asks for.
library(dplyr)
df1 %>%
group_by(ID) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
Edit
Here is a solution that addresses the OP's comment and question edit by replacing NA values with 0 before filtering.
library(tidyr) # for replace_na()
df1 %>%
group_by(ID) %>%
mutate_at(vars(starts_with('Var')), ~replace_na(., 0L)) %>%
filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
Here is a data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]
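For readers without dplyr or data.table, the same cumulative-sum filter can be sketched in base R with ave(). This version tests the first condition with >= 1, on the assumption that rows should still be kept if the group maximum occurs more than once; that is a guess at the intended behaviour, not something the question states.

```r
df1 <- data.frame(
  ID   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
  Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L),
  Var2 = c(0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)
)

# 1 from the first row where Var1 hits its group maximum onwards, else 0
reached_max <- ave(df1$Var1, df1$ID,
                   FUN = function(v) cumsum(v == max(v)) >= 1)

# 1 up to and including the first row where Var2 is 1, else 0
# (assumes Var2 only takes the values 0 and 1, with no NAs)
before_stop <- ave(df1$Var2, df1$ID,
                   FUN = function(v) cumsum(v) <= 1)

result <- df1[reached_max == 1 & before_stop == 1, ]
result
```

On the sample data this keeps original rows 3, 4, 7, 8 and 9, the same rows as the answers above.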

Summarise a group value into single row

I have a large dataset with longitudinal readings from single individuals.
I want to summarise information over time into a binary variable, i.e. if diff in the input table below is > 5 for any row of an individual, I want to reduce the observations for that individual (here A) to a single row with a new column saying TRUE.
#Input
individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
#Output
individual newval
A TRUE
B FALSE
Using dplyr you can:
library(dplyr)
df %>%
group_by(individual) %>% # first group data
summarize(newval = any(diff > 5)) # then evaluate test for each group
#> # A tibble: 2 x 2
#> individual newval
#> <fct> <lgl>
#> 1 A TRUE
#> 2 B FALSE
data
df <- read.table(text = "individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
", header = TRUE)
Multiple ways to do this :
In base R we can use aggregate
aggregate(diff~individual, df,function(x) any(x>5))
# individual diff
#1 A TRUE
#2 B FALSE
Or tapply
tapply(df$diff > 5, df$individual, any)
We can also use data.table
library(data.table)
setDT(df)[, .(newval = any(diff > 5)), by = individual]
An option in base R with rowsum
rowsum(+(df1$diff > 5), df1$individual) != 0
or with by
by(df1$diff > 5, df1$individual, any)
data
df1 <- structure(list(individual = c("A", "A", "A", "A", "B", "B", "B",
"B"), val1 = c(32L, 36L, 28L, 26L, 65L, 58L, 57L, 54L), val2 = c(36L,
28L, 26L, 26L, 64L, 59L, 54L, 51L), diff = c(-4L, 8L, 2L, 0L,
1L, -1L, 3L, 3L)), class = "data.frame", row.names = c(NA, -8L
))

Create a new data frame column that is a combination of other columns

I have 3 columns a, b, c and I want to combine them into a new column based on the column mode, as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use the row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index ('mode') are cbinded to create a matrix to extract the corresponding values from the subset of dataset
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
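To make the matrix-indexing step more concrete, here is a short sketch that builds the index matrix explicitly. It selects the value columns by name rather than by position 2:4, which is my own choice for the illustration; it assumes mode 1, 2, 3 map to a, b, c in that order.

```r
df1 <- data.frame(
  mode = c(1L, 1L, 3L, 2L, 1L),
  a = c(2L, 5L, 2L, 12L, 20L),
  b = c(3L, 53L, 31L, 13L, 30L),
  c = c(4L, 14L, 24L, 44L, 40L)
)

# one (row, column) pair per row: column 1 = a, 2 = b, 3 = c
idx <- cbind(row = seq_len(nrow(df1)), col = df1$mode)

# indexing a matrix with a two-column matrix picks one cell per pair
vals <- as.matrix(df1[c("a", "b", "c")])
combine <- vals[idx]
combine
# [1]  2  5 24 13 20
```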
Another solution in base R that works by converting "mode" to letters then extracting those values in the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else:
library(dplyr)
df1 %>%
mutate(combine =
if_else(mode == 1, a,
if_else(mode == 2, b, c)
)
)
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)

tidyverse: row wise calculations by group

I am trying to do an inventory calculation in R which requires a row wise calculation for each Mat-Plant combination. Here's a test data set -
df <- structure(list(Mat = c("A", "A", "A", "A", "A", "A", "B", "B"
), Plant = c("P1", "P1", "P1", "P2", "P2", "P2", "P1", "P1"),
Day = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L), UU = c(0L, 10L,
0L, 0L, 0L, 120L, 10L, 0L), CumDailyFcst = c(11L, 22L, 33L,
0L, 5L, 10L, 20L, 50L)), .Names = c("Mat", "Plant", "Day",
"UU", "CumDailyFcst"), class = "data.frame", row.names = c(NA,
-8L))
Mat Plant Day UU CumDailyFcst
1 A P1 1 0 11
2 A P1 2 10 22
3 A P1 3 0 33
4 A P2 1 0 0
5 A P2 2 0 5
6 A P2 3 120 10
7 B P1 1 10 20
8 B P1 2 0 50
I need a new field "EffectiveFcst" such that when Day = 1 then EffectiveFcst = CumDailyFcst and for following days -
Here's the desired output -
Mat Plant Day UU CumDailyFcst EffectiveFcst
1 A P1 1 0 11 11
2 A P1 2 10 22 22
3 A P1 3 0 33 23
4 A P2 1 0 0 0
5 A P2 2 0 5 5
6 A P2 3 120 10 10
7 B P1 1 10 20 20
8 B P1 2 0 50 40
I am currently using a for loop but the actual table is >300K rows so hoping to do this with tidyverse for more elegant and faster approach. Tried the following but didn't work out -
group_by(df, Mat, Plant) %>%
mutate(EffectiveFcst = ifelse(row_number()==1, CumDailyFcst, 0)) %>%
mutate(EffectiveFcst = ifelse(row_number() > 1, CumDailyFcst - lag(CumDailyFcst, default = 0) + max(lag(EffectiveFcst, default = 0) - lag(UU, default = 0), 0), EffectiveFcst)) %>%
print(n = nrow(.))
We can use accumulate from purrr
library(tidyverse)
df %>%
group_by(Mat, Plant) %>%
mutate(EffectiveFcst = accumulate(CumDailyFcst - lag(UU, default = 0), ~
.y , .init = first(CumDailyFcst))[-1] )
# A tibble: 8 x 6
# Groups: Mat, Plant [3]
# Mat Plant Day UU CumDailyFcst EffectiveFcst
# <chr> <chr> <int> <int> <int> <dbl>
#1 A P1 1 0 11 11
#2 A P1 2 10 22 22
#3 A P1 3 0 33 23
#4 A P2 1 0 0 0
#5 A P2 2 0 5 5
#6 A P2 3 120 10 10
#7 B P1 1 10 20 20
#8 B P1 2 0 50 40
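The formula for the following days does not appear in the question text here, so the sketch below takes the recurrence from the asker's own attempt: EffectiveFcst = CumDailyFcst - previous CumDailyFcst + max(previous EffectiveFcst - previous UU, 0). Treat that as an assumption. The helper name effective_fcst is hypothetical, and the rows are assumed to be sorted by Day within each Mat-Plant group.

```r
df <- data.frame(
  Mat = c("A", "A", "A", "A", "A", "A", "B", "B"),
  Plant = c("P1", "P1", "P1", "P2", "P2", "P2", "P1", "P1"),
  Day = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L),
  UU = c(0L, 10L, 0L, 0L, 0L, 120L, 10L, 0L),
  CumDailyFcst = c(11L, 22L, 33L, 0L, 5L, 10L, 20L, 50L)
)

# hypothetical helper: run the assumed recurrence over one group
effective_fcst <- function(cum, uu) {
  Reduce(
    function(prev, i) cum[i] - cum[i - 1] + max(prev - uu[i - 1], 0),
    seq_along(cum)[-1],   # days 2..n
    accumulate = TRUE,
    cum[1]                # day 1 starts at CumDailyFcst
  )
}

# apply the helper to the row indices of each Mat-Plant group
groups <- split(seq_len(nrow(df)), list(df$Mat, df$Plant), drop = TRUE)
df$EffectiveFcst <- NA_real_
for (rows in groups) {
  df$EffectiveFcst[rows] <- effective_fcst(df$CumDailyFcst[rows], df$UU[rows])
}

df$EffectiveFcst
# [1] 11 22 23  0  5 10 20 40
```

On the sample data this gives the same values as the desired output and as the accumulate answer above.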
