I have a very wide dataset with multiple psychometric scales and I would like to remove rows if any of a handful of columns contains zero (i.e., a missing response).
I know how to do it when the data frame is small, but my method is not scalable. For example,
dftry <- data.frame(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37), z=c(12, 0, 33, 22, 23))
x y z
1 1 0 12
2 2 10 0
3 5 5 33
4 3 3 22
5 0 37 23
# Remove row if it has 0 in y or z columns
# is there a difference between & and , ?
dftry %>% filter(dftry$y > 0 & dftry$z > 0)
x y z
1 5 5 33
2 3 3 22
3 0 37 23
In my actual data, I want to remove rows if there are zeroes in any of these columns:
# this is the most succinct way of selecting the columns in question
select(c(1:42, contains("BMIS"), "hamD", "GAD"))
You can use rowSums :
cols <- c('y', 'z')
dftry[rowSums(dftry[cols] == 0, na.rm = TRUE) == 0, ]
# x y z
#1 5 5 33
#2 3 3 22
#3 0 37 23
We can integrate this into dplyr for your real use-case.
library(dplyr)
dftry %>%
filter(rowSums(select(.,
c(1:42, contains("BMIS"), "hamD", "GAD")) == 0, na.rm = TRUE) == 0)
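On the `&` versus `,` question in the original post: inside `filter()`, comma-separated conditions are combined with `&`, so the two forms keep the same rows. The sketch below checks that on the toy data and also shows `if_all()` as another way to express "no zero in any of these columns". It assumes dplyr 1.0.4 or later; `cols` is just an illustrative name. One difference to be aware of: a row with `NA` in a checked column is dropped by the `if_all()` version, whereas the `rowSums(..., na.rm = TRUE)` version keeps it.

```r
library(dplyr)

dftry <- data.frame(x = c(1, 2, 5, 3, 0),
                    y = c(0, 10, 5, 3, 37),
                    z = c(12, 0, 33, 22, 23))

# Comma-separated conditions in filter() are AND-ed together
res_comma <- dftry %>% filter(y > 0, z > 0)
res_amp   <- dftry %>% filter(y > 0 & z > 0)

# Keep rows where every column named in `cols` is non-zero
cols <- c("y", "z")  # illustrative; any tidyselect expression also works here
res_if_all <- dftry %>% filter(if_all(all_of(cols), ~ .x != 0))
```

For the real data, `all_of(cols)` could be swapped for the selection `c(1:42, contains("BMIS"), "hamD", "GAD")` from the question.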
Does this work using dplyr:
> library(dplyr)
> dftry
x y z a b c BMIS_1 BMIS_3 hamD GAD m n
1 1 0 12 1 0 12 1 0 12 12 12 12
2 2 10 0 2 10 0 2 10 0 0 0 0
3 5 5 33 5 5 33 5 5 33 33 33 33
4 3 3 22 3 3 22 3 3 22 22 22 22
5 0 37 23 0 37 23 0 37 23 23 23 23
> dftry %>% select(c(1:3,contains('BMIS'), hamD, GAD)) %>% filter_all(all_vars(. != 0))
x y z BMIS_1 BMIS_3 hamD GAD
1 5 5 33 5 5 33 33
2 3 3 22 3 3 22 22
Data used:
> dput(dftry)
structure(list(x = c(1, 2, 5, 3, 0), y = c(0, 10, 5, 3, 37),
z = c(12, 0, 33, 22, 23), a = c(1, 2, 5, 3, 0), b = c(0,
10, 5, 3, 37), c = c(12, 0, 33, 22, 23), BMIS_1 = c(1, 2,
5, 3, 0), BMIS_3 = c(0, 10, 5, 3, 37), hamD = c(12, 0, 33,
22, 23), GAD = c(12, 0, 33, 22, 23), m = c(12, 0, 33, 22,
23), n = c(12, 0, 33, 22, 23)), class = "data.frame", row.names = c(NA,
-5L))
So I have this dataframe and I aim to add a new variable based on others:
Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8
I want to create a variable called c_sep such that:
if Qi==1 or Qi==2, c_sep takes a random number between (c_gen + 6) and Age;
if Qi==3 or Qi==4, c_sep takes a random number between (Age-15) and Age;
and 0 otherwise,
so my data would look something like this:
Qi Age c_gen c_sep
1 56 13 24
2 43 15 13
5 31 6 0
3 67 8 40
Any ideas please
In base R, you can do something along the lines of:
dat <- read.table(text = "Qi Age c_gen
1 56 13
2 43 15
5 31 6
3 67 8", header = TRUE)
set.seed(100)
dat$c_sep <- 0
dat$c_sep[dat$Qi %in% c(1,2)] <- apply(dat[dat$Qi %in% c(1,2),], 1, \(row) sample(
(row["c_gen"]+6):row["Age"], 1
)
)
dat$c_sep[dat$Qi %in% c(3,4)] <- apply(dat[dat$Qi %in% c(3,4),], 1, \(row) sample(
(row["Age"]-15):row["Age"], 1
)
)
dat
# Qi Age c_gen c_sep
# 1 1 56 13 28
# 2 2 43 15 43
# 3 5 31 6 0
# 4 3 67 8 57
If you are doing it more than twice you might want to put this in a function - depending on your requirements.
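One way that function might look is sketched below. The names `draw_between` and `add_c_sep` are hypothetical, and the sketch assumes every lower bound is at most its upper bound and that the bounds are whole numbers. It draws via `sample.int()` rather than `sample(lo:hi, 1)`, because `sample(x, 1)` with a single number `x` samples from `1:x`, which would give surprising results whenever the range collapses to one value.

```r
dat <- data.frame(Qi    = c(1, 2, 5, 3),
                  Age   = c(56, 43, 31, 67),
                  c_gen = c(13, 15, 6, 8))

# Hypothetical helper: for each pair (lo[i], hi[i]) draw one integer
# uniformly from lo[i]..hi[i] inclusive. Assumes lo[i] <= hi[i].
draw_between <- function(lo, hi) {
  vapply(seq_along(lo),
         function(i) lo[i] + sample.int(hi[i] - lo[i] + 1, 1) - 1,
         numeric(1))
}

# Hypothetical wrapper applying the rules from the question
add_c_sep <- function(dat) {
  c_sep <- numeric(nrow(dat))          # default of 0 for all other Qi
  g1 <- dat$Qi %in% c(1, 2)
  g2 <- dat$Qi %in% c(3, 4)
  c_sep[g1] <- draw_between(dat$c_gen[g1] + 6, dat$Age[g1])
  c_sep[g2] <- draw_between(dat$Age[g2] - 15, dat$Age[g2])
  dat$c_sep <- c_sep
  dat
}

set.seed(100)
add_c_sep(dat)
```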
Try this
df$c_sep <- ifelse(df$Qi == 1 | df$Qi == 2 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$c_gen[x] + 6, df$Age[x]) ,1)) ,
ifelse(df$Qi == 3 | df$Qi == 4 ,
sapply(1:nrow(df) ,
\(x) sample(seq(df$Age[x] - 15, df$Age[x]) ,1)) , 0))
output
Qi Age c_gen c_sep
1 1 56 13 41
2 2 43 15 42
3 5 31 6 0
4 3 67 8 58
A tidyverse option:
library(tidyverse)
df <- tribble(
~Qi, ~Age, ~c_gen,
1, 56, 13,
2, 43, 15,
5, 31, 6,
3, 67, 8
)
df |>
rowwise() |>
mutate(c_sep = case_when(
Qi <= 2 ~ sample(seq(c_gen + 6, Age, 1), 1),
between(Qi, 3, 4) ~ sample(seq(Age - 15, Age, 1), 1),
TRUE ~ 0
)) |>
ungroup()
#> # A tibble: 4 × 4
#> Qi Age c_gen c_sep
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 56 13 39
#> 2 2 43 15 41
#> 3 5 31 6 0
#> 4 3 67 8 54
Created on 2022-06-29 by the reprex package (v2.0.1)
I want to remove all rows with value 9999 in my dataset.
Here is an example dataset:
df <- data.frame(a=c(1, 3, 4, 6, 9999, 9),
b=c(7, 8, 8, 7, 13, 16),
c=c(11, 13, 9999, 18, 19, 22),
d=c(12, 16, 18, 22, 29, 38))
So the final dataset should contain only rows 1, 2, 4 and 6. My real dataset is large, so how can I do this without specifying the names of all columns? Thank you!
You could do:
df[which(apply(df, 1, \(i) !any(i == 9999))),]
#> a b c d
#> 1 1 7 11 12
#> 2 3 8 13 16
#> 4 6 7 18 22
#> 6 9 16 22 38
df[-which(df == 9999, TRUE)[,1], ]
a b c d
1 1 7 11 12
2 3 8 13 16
4 6 7 18 22
6 9 16 22 38
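A caveat on the `-which(...)` form: when no cell matches, `which()` returns an empty index and negating it still selects nothing, so every row is dropped. A logical mask built with `rowSums()` avoids that. The sketch below illustrates both cases; the value `12345` is only a stand-in for "a sentinel that does not occur in the data".

```r
df <- data.frame(a = c(1, 3, 4, 6, 9999, 9),
                 b = c(7, 8, 8, 7, 13, 16),
                 c = c(11, 13, 9999, 18, 19, 22),
                 d = c(12, 16, 18, 22, 29, 38))

# Logical mask: TRUE for rows with no 9999 in any column
keep  <- rowSums(df == 9999, na.rm = TRUE) == 0
clean <- df[keep, ]

# Pitfall: with a value that never occurs, -which() yields zero rows,
# while the mask keeps all six
no_hits_which <- df[-which(df == 12345, arr.ind = TRUE)[, 1], ]
no_hits_mask  <- df[rowSums(df == 12345, na.rm = TRUE) == 0, ]
```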
An option with dplyr
library(dplyr)
df %>%
filter(!if_any(everything(), ~ . == 9999))
-output
a b c d
1 1 7 11 12
2 3 8 13 16
3 6 7 18 22
4 9 16 22 38
Or with if_all (using across() inside filter() is deprecated in recent dplyr versions)
df %>%
filter(if_all(everything(), ~ . != 9999))
I have a large data set, 150k rows, ~11 MB in size. Each row contains an hourly measure of profit, which can be positive, negative, or zero. I am trying to calculate a new variable equal to the profit of each positive "block." Hopefully this is self-explanatory in the data set below.
"Profit" is the input variable. I can get the next two columns but can't solve for "profit_block". Any help would be much appreciated!
dat <- data.frame(profit = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20),
indic_pos = c( 1, 1, 1, 1, 0, 0, 0, 1, 1, 0),
cum_profit = c(20, 30, 35, 45, 0, 0, 0, 500, 527, 0),
profit_block = c(45, 45, 45, 45, 0, 0, 0, 527, 527, 0))
profit indic_pos cum_profit profit_block
1 20 1 20 45
2 10 1 30 45
3 5 1 35 45
4 10 1 45 45
5 -20 0 0 0
6 -100 0 0 0
7 -40 0 0 0
8 500 1 500 527
9 27 1 527 527
10 -20 0 0 0
I've found the following post below very helpful, but I can't quite conform it to my need here. Thanks again.
Related URL: Assigning a value to each range of consecutive numbers with same sign in R
We can use rleid (from data.table) to create a group based on the sign of the column, i.e. adjacent elements with the same sign form a single group, and then take the max of 'cum_profit' within each group
library(dplyr)
library(data.table) # provides rleid()
dat %>%
group_by(grp = rleid(sign(profit))) %>%
mutate(profit_block2 = max(cum_profit)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 10 x 5
# profit indic_pos cum_profit profit_block profit_block2
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20 1 20 45 45
# 2 10 1 30 45 45
# 3 5 1 35 45 45
# 4 10 1 45 45 45
# 5 -20 0 0 0 0
# 6 -100 0 0 0 0
# 7 -40 0 0 0 0
# 8 500 1 500 527 527
# 9 27 1 527 527 527
#10 -20 0 0 0 0
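If only `profit` is available, the block totals can also be derived without the helper columns. The following base R sketch assumes there are no `NA` values in `profit`; `runs` and `block_id` are illustrative names. It labels each run of consecutive positive (or non-positive) values with `rle()` and sums within runs using `ave()`.

```r
dat <- data.frame(profit = c(20, 10, 5, 10, -20, -100, -40, 500, 27, -20))

# Label each run of consecutive positive / non-positive values
runs     <- rle(dat$profit > 0)
block_id <- rep(seq_along(runs$lengths), runs$lengths)

# Sum profit within each run, then zero out the non-positive runs
block_sum        <- ave(dat$profit, block_id, FUN = sum)
dat$profit_block <- ifelse(dat$profit > 0, block_sum, 0)

dat$profit_block
# 45 45 45 45 0 0 0 527 527 0
```

Because `rle()` and `ave()` are vectorised, this should be comfortable at 150k rows.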
Description of Data: Dataset contains information regarding users about their age, gender and membership they are holding.
Goal: Create a new column to identify the group/label for each user based on pre-defined conditions.
Age conditions: multiple age brackets :
18 <= age <= 24, 25 <= age <= 30, 31 <= age <= 41, 41 <= age <= 60, age >= 61
Gender: M/F
Membership: A,B,C,I
I created sample data frame to try out creation of new column to identify the group/label
df = data.frame(userid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12),
age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))
userid age gender membership
1 1 18 F A
2 2 61 M B
3 3 23 F A
4 4 35 F C
5 5 30 M C
6 6 25 M B
7 7 55 M A
8 8 53 M A
9 9 45 M I
10 10 41 F I
11 11 21 <NA> A
12 12 NA M <NA>
Based on above data there exist 4 * 2 * 5 options (combinations)
Final outcome:
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
userid age gender membership GroupID
1 1 18 F A 1
2 2 61 M B 40
3 3 23 F A 1
4 4 35 F C 4
5 5 30 M C 5
6 6 25 M B 3
7 7 55 M A 32
8 8 53 M A 32
9 9 45 M I 34
10 10 41 F I 35
11 11 21 <NA> A 43 (assuming it will auto-detect combo)
12 12 NA M <NA> 46
I believe my count of the combinations is correct; if so, how can I use dplyr or any other approach to get the data frame above?
Should I use multiple if conditions to cover all the options?
In dplyr is there a way to actually provide conditions for each column to set the grouping conditions:
df %>% group_by(age, gender, membership)
Two options,
One, more automated;
# install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
df %>% mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
mutate(GroupID = group_indices(., ageCat, gender, membership)) %>% select(-ageCat)
#> userid age gender membership GroupID
#> 1 1 18 F A 2
#> 2 2 61 M B 9
#> 3 3 23 F A 2
#> 4 4 35 F C 5
#> 5 5 30 M C 4
#> 6 6 25 M B 3
#> 7 7 55 M A 7
#> 8 8 53 M A 7
#> 9 9 45 M I 8
#> 10 10 41 F I 6
#> 11 11 21 <NA> A 1
#> 12 12 NA M <NA> 10
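In dplyr 1.0.0 and later, calling `group_indices()` with extra grouping arguments is deprecated in favour of `cur_group_id()`. A sketch of the same idea in that style follows, assuming a recent dplyr; the exact ID numbers depend on how the groups sort, so treat them as arbitrary labels rather than as the 1 to 40 scheme from the question.

```r
library(dplyr)

df <- data.frame(userid = 1:12,
                 age = c(18, 61, 23, 35, 30, 25, 55, 53, 45, 41, 21, NA),
                 gender = c('F', 'M', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'F', '<NA>', 'M'),
                 membership = c('A', 'B', 'A', 'C', 'C', 'B', 'A', 'A', 'I', 'I', 'A', '<NA>'))

out <- df %>%
  # Bin ages into the five brackets (NA ages form their own group)
  mutate(ageCat = cut(age, breaks = c(-Inf, 24, 30, 41, 60, Inf))) %>%
  group_by(ageCat, gender, membership) %>%
  # One integer per distinct (ageCat, gender, membership) combination
  mutate(GroupID = cur_group_id()) %>%
  ungroup() %>%
  select(-ageCat)
```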
Two, more manual;
Here is an illustration for categories 1 and 4 only; the remaining categories follow the same pattern.
df %>% mutate(GroupID =
ifelse(age >= 18 & age <= 24 & gender == 'F' & membership == "A", 1,
ifelse(age >= 31 & age <= 41 & gender == 'F' & membership == "C", 4, NA)
))
#> userid age gender membership GroupID
#> 1 1 18 F A 1
#> 2 2 61 M B NA
#> 3 3 23 F A 1
#> 4 4 35 F C 4
#> 5 5 30 M C NA
#> 6 6 25 M B NA
#> 7 7 55 M A NA
#> 8 8 53 M A NA
#> 9 9 45 M I NA
#> 10 10 41 F I NA
#> 11 11 21 <NA> A NA
#> 12 12 NA M <NA> NA
the data structure in case others feel like giving it a go,
You can try this:
library(data.table)
setDT(df)
# Bracket ages into groups 1-5; anything outside 18-60 falls into group 5
df[, agegrp := ifelse(age >= 18 & age <= 24, 1,
ifelse(age >= 25 & age <= 30, 2,
ifelse(age >= 31 & age <= 41, 3,
ifelse(age >= 42 & age <= 60, 4, 5))))]
# Number each distinct (agegrp, gender, membership) combination
df[, group := .GRP, by = .(agegrp, gender, membership)]
If you want to use base R only, you could do something like this:
# 1
allcombos <- expand.grid(c("M", "F"), c("A", "B", "C", "I"), 1:5)
allgroups <- do.call(paste0, allcombos) # 40 unique combinations
# 2
agegroups <- cut(df$age,
breaks = c(17, 24, 30, 41, 60, Inf),
labels = c(1, 2, 3, 4, 5))
# 3
df$groupid <- paste0(df$gender, df$membership, agegroups)
df$groupid <- factor(df$groupid, levels=allgroups, labels=1:length(allgroups))
expand.grid gives you a data.frame with three columns where every row represents a unique combination of the three arguments provided. As you said, these are 40 combinations. The second line combines every row of the data frame in a single string, like "MA1", "FA1", "MB1", etc.
Then we use cut to assign each age to its age group, labelled 1 to 5.
We create a column in df that contains the three character combination of the gender, membership and age group which is then converted to a factor, according to all possible combinations we found in allgroups.
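Since `cut()` uses right-closed intervals by default, the placement of the breaks decides which bracket the boundary ages land in. A small check along these lines, using boundary ages chosen purely for illustration and breaks that match the brackets stated in the question (18-24, 25-30, 31-41, 42-60, 61+), can confirm the binning before building the group IDs:

```r
# Boundary ages for each bracket (illustrative values)
ages <- c(18, 24, 25, 30, 31, 41, 42, 60, 61)

# Intervals are (17,24], (24,30], (30,41], (41,60], (60,Inf]
agegroups <- cut(ages,
                 breaks = c(17, 24, 30, 41, 60, Inf),
                 labels = 1:5)

as.integer(agegroups)
# 1 1 2 2 3 3 4 4 5
```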
I'm trying to rename my columns in dplyr. I found that this can be done with the select function. However, when I try to rename a sequence of selected columns, I cannot get the names in the format that I want.
test = data.frame(x = rep(1:3, each = 2),
group =rep(c("Group 1","Group 2"),3),
y1=c(22,8,11,4,7,5),
y2=c(22,18,21,14,17,15),
y3=c(23,18,51,44,27,35),
y4=c(21,28,311,24,227,225))
CC <- paste("CC",seq(0,3,1),sep="")
aa<-test%>%
select(AC=x,AR=group,CC=y1:y4)
head(aa)
AC AR CC1 CC2 CC3 CC4
1 1 Group 1 22 22 23 21
2 1 Group 2 8 18 18 28
3 2 Group 1 11 21 51 311
4 2 Group 2 4 14 44 24
5 3 Group 1 7 17 27 227
6 3 Group 2 5 15 35 225
The problem is that even though I set CC to CC0, CC1, CC2, CC3, the output column names automatically start from CC1.
How can I solve this issue?
I think you'll have an easier time creating such an expression with the select_ function (note that select_() has since been deprecated in dplyr):
library(dplyr)
test <- data.frame(x=rep(1:3, each=2),
group=rep(c("Group 1", "Group 2"), 3),
y1=c(22, 8, 11, 4, 7, 5),
y2=c(22, 18, 21, 14, 17, 15),
y3=c(23, 18, 51, 44, 27, 35),
y4=c(21, 28, 311,24, 227, 225))
# build out our select "translation" named vector
DQ <- paste0("y", 1:4)
names(DQ) <- paste0("DQ", seq(0, 3, 1))
# take a look
DQ
## DQ0 DQ1 DQ2 DQ3
## "y1" "y2" "y3" "y4"
test %>%
select_("AC"="x", "AR"="group", .dots=DQ)
## AC AR DQ0 DQ1 DQ2 DQ3
## 1 1 Group 1 22 22 23 21
## 2 1 Group 2 8 18 18 28
## 3 2 Group 1 11 21 51 311
## 4 2 Group 2 4 14 44 24
## 5 3 Group 1 7 17 27 227
## 6 3 Group 2 5 15 35 225
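With current dplyr, where the underscore verbs are deprecated, the same named-vector idea can be passed through `all_of()`, which renames columns when the vector has names. This is a sketch assuming dplyr 1.0 or later; `lookup` is an illustrative name, and the `CC0` to `CC3` labels follow the question rather than the `DQ` labels used above.

```r
library(dplyr)

test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   y1 = c(22, 8, 11, 4, 7, 5),
                   y2 = c(22, 18, 21, 14, 17, 15),
                   y3 = c(23, 18, 51, 44, 27, 35),
                   y4 = c(21, 28, 311, 24, 227, 225))

# Named character vector: names are the new column names, values the old ones
lookup <- setNames(paste0("y", 1:4), paste0("CC", 0:3))

aa <- test %>%
  select(AC = x, AR = group, all_of(lookup))

names(aa)
# "AC" "AR" "CC0" "CC1" "CC2" "CC3"
```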