create a new summary variable if condition across many columns - r

I have a dataframe with an ID variable and a bunch of similarly named columns with information
+-------------------------------------------------
| ID | C1 | C2 | C3 | ...
+-------------------------------------------------
| 1 | 99 | 101 | 102 | ...
+-------------------------------------------------
I need to count the number of columns that fulfil a certain condition (e.g. < 100). If the number of columns were small, I would do something like
df %>% mutate(counter = case_when(C1 < 100 & C2 < 100 & C3 < 100 ~ 3,
                                  C1 < 100 & C2 < 100 ~ 2, ...))
But that is obviously not an option with 100+ columns. I could also pivot, summarise and pivot back, but that does not seem like the cleanest solution either. Any ideas on how to do this properly?

We may use rowSums from base R on a logical matrix (df[-1] < 100) to get the count of elements in each row that are less than 100.
df$counter <- rowSums(df[-1] < 100, na.rm = TRUE)
TRUE is treated as 1 and FALSE as 0, so the row-wise sum of the logical matrix is the number of TRUE values in each row.
Or in a dplyr pipe
library(dplyr)
df %>%
mutate(counter = rowSums(across(-1) < 100, na.rm = TRUE))
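A minimal base R sketch of the same idea on toy data. The values below and the "^C[0-9]+$" name pattern are assumptions for illustration, not part of the original question; selecting by name rather than by position may be safer if the data frame has other non-ID columns.

```r
# Toy data (hypothetical values), including one NA
df <- data.frame(ID = 1:3,
                 C1 = c(99, 100, 5),
                 C2 = c(101, 98, 7),
                 C3 = c(102, NA, 9))

# Pick the similarly named columns by pattern instead of by position
cols <- grep("^C[0-9]+$", names(df))

# df[cols] < 100 is a logical matrix; rowSums counts the TRUEs per row,
# and na.rm = TRUE leaves NA cells out of the count
df$counter <- rowSums(df[cols] < 100, na.rm = TRUE)
df$counter
# 1 1 3
```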


filter one column by two conditions (one value and empty cell) in R

I'd like to filter a column wherein value is 0 or value has no data. I've tried the following lines but neither seems to be doing what I want.
# demo2122 = dataframe
# grp_mlk = column
demo2122 %>% filter(grp_mlk == 0 | grp_mlk == "")
demo2122 %>% filter(grp_mlk == 0 | grp_mlk == NA)
Thanks in advance for your help!
Instead of grp_mlk == NA, use is.na(grp_mlk). Any comparison with NA returns NA rather than TRUE, and by default filter drops the rows where the condition evaluates to NA.
library(dplyr)
demo2122 %>%
filter(grp_mlk == 0| is.na(grp_mlk))
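To see why the == NA attempt keeps nothing extra, here is a small base R sketch. The vector x and the data frame demo are made-up stand-ins for grp_mlk and demo2122.

```r
# Hypothetical stand-in for the grp_mlk column
x <- c(0, NA, 2, 0)

# Comparing with NA never yields TRUE; every element is NA
cmp_na <- x == NA

# is.na() is the test that actually detects missing values
keep <- x == 0 | is.na(x)
keep
# TRUE TRUE FALSE TRUE

demo <- data.frame(grp_mlk = x)
res <- demo[keep, , drop = FALSE]  # rows where grp_mlk is 0 or missing
```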

Equivalent of an "except" command in R when subsetting a dataframe

I have a dataset like this:
|ID     | Amount | MemberCard |
-------------------------------
|345890 | 251000 | NO         |
|341862 | 400238 | YES        |
|345791 | 678921 | YES        |
|341750 | 87023  | NO         |
|345716 | 12987  | YES        |
I need to delete all the observations with an amount > 250000, but I have to keep the IDs 341862 & 345791. So I was wondering whether some kind of "except" command exists in R when subsetting, instead of creating a data frame with only these 2 observations and using rbind afterwards.
Select a row if ID is one of c(341862, 345791) OR if Amount is less than or equal to 250000.
We can use subset in base R -
res <- subset(df, ID %in% c(341862, 345791) | Amount <= 250000)
res
#      ID Amount MemberCard
#2 341862 400238        YES
#3 345791 678921        YES
#4 341750  87023         NO
#5 345716  12987        YES
Or with dplyr::filter -
library(dplyr)
df %>% filter(ID %in% c(341862, 345791) | Amount <= 250000)
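For reference, a self-contained base R sketch that rebuilds the data from the question and applies the "keep these IDs, otherwise apply the limit" rule, using the 250000 limit stated in the question. The names keep_ids and limit are introduced here only for illustration.

```r
# Data as given in the question
df <- data.frame(ID = c(345890, 341862, 345791, 341750, 345716),
                 Amount = c(251000, 400238, 678921, 87023, 12987),
                 MemberCard = c("NO", "YES", "YES", "NO", "YES"))

keep_ids <- c(341862, 345791)  # IDs exempt from the amount rule
limit <- 250000                # threshold taken from the question

# A row survives if it is exempt OR its amount is within the limit
res <- df[df$ID %in% keep_ids | df$Amount <= limit, ]
res$ID
# 341862 345791 341750 345716
```

Only ID 345890 (Amount 251000, not exempt) is dropped.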
If all you want is to have empty values for observations with Amount > 250000, you can use replace():
library(tidyverse)
df_new <- df %>%
mutate(Amount = replace(Amount, Amount >250000, NA))
If you want the result applied to both columns, add MemberCard to mutate(). It has to come before Amount, because mutate() evaluates its arguments in order and the condition would otherwise see the already-replaced Amount:
df_new <- df %>%
  mutate(MemberCard = replace(MemberCard, Amount > 250000, NA),
         Amount = replace(Amount, Amount > 250000, NA))
This preserves the ID but blanks out all other values where the condition is met. Hope this helps. 😉
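One way to sidestep any ordering concern is to compute the condition once and reuse it. This is a base R sketch under that assumption; the name over is introduced here, and the data is rebuilt from the question.

```r
# Data as given in the question
df <- data.frame(ID = c(345890, 341862, 345791, 341750, 345716),
                 Amount = c(251000, 400238, 678921, 87023, 12987),
                 MemberCard = c("NO", "YES", "YES", "NO", "YES"))

# Evaluate the condition once, before any column is modified
over <- df$Amount > 250000

df_new <- df
df_new$Amount[over] <- NA      # blank the amount
df_new$MemberCard[over] <- NA  # blank the card flag using the same mask
```

Because the mask is stored before Amount is overwritten, both columns are blanked for the same rows.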
We may also use
subset(df, ID == 341862 | ID == 345791 | Amount <= 250000)

Combining rows using fuzzy matching of the keys in R

I have a data set that might contain some very similar keys - something like a row of data for each of the email address john.doe#foo.com and john.m.doe#foo.com. How can I combine similarly named keys and do an aggregate in R?
Sample input
|Email | Subscriptions |
-------------------------------------
|john.doe#foo.com | 10 |
|john.m.doe#foo.com | 11 |
|jane.doe#foo.com | 20 |
Expected result
|Email | Subscriptions |
-------------------------------------
|john.doe#foo.com | 21 |
|jane.doe#foo.com | 20 |
I know agrep and few other libraries can do fuzzy matching, but how do I employ it in combining rows in a data set?
Here is one way to use agrep in combination with dplyr:
library(dplyr)
df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20))
df %>%
  rowwise() %>%
  # indices of all addresses within 2 edits of this one, e.g. "1,2"
  mutate(new = paste(agrep(mail, df$mail, max.distance = 2, ignore.case = TRUE), collapse = ",")) %>%
  group_by(new) %>%
  mutate(sub = sum(sub)) %>%
  slice(1)
mail sub new
<fct> <dbl> <chr>
1 john.doe#foo.com 21 1,2
2 jane.doe#foo.com 20 3
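The same grouping can be sketched in base R. This is an illustration rather than a general clustering method: it assigns each address to the first address that agrep matches within 2 edits, which works for this sample but is not guaranteed to group consistently when matches chain (A close to B, B close to C, A not close to C). The column name group is introduced here.

```r
df <- data.frame(mail = c("john.doe#foo.com", "john.m.doe#foo.com", "jane.doe#foo.com"),
                 sub = c(10, 11, 20),
                 stringsAsFactors = FALSE)

# For each address, take the index of the first address within 2 edits of it
df$group <- vapply(df$mail, function(m) {
  agrep(m, df$mail, max.distance = 2, ignore.case = TRUE)[1]
}, integer(1), USE.NAMES = FALSE)

# Sum subscriptions per group, then label each group by its first address
res <- aggregate(sub ~ group, data = df, FUN = sum)
res$mail <- df$mail[res$group]
res[c("mail", "sub")]
```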

How to give custom matching parameters to compare code list pattern using R?

I want to compare data frame columns for match and mismatch. Here I have codes in column a and their values in column b. Row-wise, I want a match if column a has 1 and column b has Male (1 = Male and 2 = Female), and a mismatch if the value does not agree with the code.
if 1=male or 2=female then match else mismatch
Below is the code I tried, which works fine for simple pattern matching or exact value matching, but I want it to work with a code list
ABData <- data.frame(a = c(1,2,1,1,2),
                     b = c("Male","Female","Male","Male","Male"))
match<- ABData %>% rowwise() %>% filter(grepl(a,b))
mismatch<- ABData %>% rowwise() %>% filter(!grepl(a,b))
expected output:
Match
a expected actual
1 Male Male
2 Female Female
1 Male Male
1 Male Male
Mismatch
a expected actual
2 Female Male
You can create an index to subset :
inds <- with(ABData, a == 1 & b == 'Male' | a == 2 & b == 'Female')
match_df <- subset(ABData, inds)
mismatch_df <- subset(ABData, !inds)
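We then add the expected and actual columns. The actual value is b itself; for a mismatch, the expected value is the opposite of b.
match_df <- transform(match_df, expected = b, actual = b)
mismatch_df <- transform(mismatch_df, expected = ifelse(b == 'Male', 'Female', 'Male'), actual = b)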
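If the code list grows beyond two entries, a lookup vector may scale better than spelling out each pair. A minimal base R sketch, assuming the code list can be written as a named character vector (the name codes is introduced here):

```r
ABData <- data.frame(a = c(1, 2, 1, 1, 2),
                     b = c("Male", "Female", "Male", "Male", "Male"),
                     stringsAsFactors = FALSE)

# Code list: names are the codes, values are the expected labels
codes <- c("1" = "Male", "2" = "Female")

# Look up the expected label for each code; b is the actual label
ABData$expected <- unname(codes[as.character(ABData$a)])
ABData$actual <- ABData$b

is_match <- ABData$expected == ABData$actual
match_df <- ABData[is_match, c("a", "expected", "actual")]
mismatch_df <- ABData[!is_match, c("a", "expected", "actual")]
```

Adding a new code then only means adding one entry to codes. A code that is missing from the list gives NA for expected, so such rows would need separate handling.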

How to remove specific rows when a condition is fulfilled?

I am trying to remove some specific rows when they meet a condition on two columns; otherwise, the column EP should be flagged as 1. What is the specific code for this?
For example: in the dataframe df_NC, when the column "Population_Type" (binary type) is equal to 1 and the column NC (binary type) is equal to 0, remove the row; else flag EP as 1
df_ep <- df_NC %>% mutate(EP= case_when(
df_NC$Population_Type == 1 & df_NC$NC == 0 ~ 1,
TRUE ~ 0
))
From your code I'm assuming you are using the dplyr package. There are a couple of mistakes there.
You don't need base notation like df_NC$NC inside dplyr functions; just use the name of the variable.
I don't see a reason to create the column EP if you are filtering out one of its values (0/FALSE).
df_NC %>%
mutate(EC = if_else(Population_Type == 1 & NC == 0, 1, 0)) %>%
filter(EC == 1)
# Or shorter, considering my second point
df_NC %>%
filter(Population_Type == 1, NC == 0) # Equivalent to EC == 1
Also, try to use logical values (TRUE/FALSE) instead of integer 1/0 when working with "binary" data.
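The question can also be read as asking to drop the rows that meet the condition and flag the remaining ones. A base R sketch under that reading, with made-up toy values for df_NC and a logical EP flag (the name drop is introduced here):

```r
# Toy data (hypothetical values) with the two binary columns
df_NC <- data.frame(Population_Type = c(1, 1, 0, 0),
                    NC = c(0, 1, 0, 1))

# Rows that meet the condition described in the question
drop <- df_NC$Population_Type == 1 & df_NC$NC == 0

# Keep everything else and flag the surviving rows
df_ep <- df_NC[!drop, ]
df_ep$EP <- TRUE
```

Negating the condition with ! is what turns "keep the matching rows" into "remove the matching rows".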
