Assume the data frame looks like this:
Name Score
1 John 10
2 John 2
3 James 5
I would like to compute the mean of all the Score values for rows with the name John.
You can easily perform a mean of every person's score with aggregate:
> aggregate(Score ~ Name, data=df, FUN=mean)
Name Score
1 James 5
2 John 6
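If you only need John's mean rather than a per-name summary, plain base R subsetting also works (reconstructing the example data from the question):

```r
# Example data from the question
df <- data.frame(Name = c("John", "John", "James"), Score = c(10, 2, 5))

# Subset Score to the rows where Name is "John", then take the mean
mean(df$Score[df$Name == "John"])
# [1] 6
```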
Using dplyr:
For each name:
library(dplyr)
df %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 James 5
2 John 6
Filtering:
filter(df, Name=="John") %>%
group_by(Name) %>%
summarise(Value=mean(Score))
Name Value
1 John 6
Using sqldf:
library(sqldf)
sqldf("SELECT Name, avg(Score) AS Score
FROM df GROUP BY Name")
Name Score
1 James 5
2 John 6
Filtering:
sqldf("SELECT Name, avg(Score) AS Score
FROM df
WHERE Name LIKE 'John'")
Name Score
1 John 6
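For completeness, base R's tapply gives the same per-name summary as the aggregate and dplyr versions above, returned as a named vector (again assuming the example data from the question):

```r
df <- data.frame(Name = c("John", "John", "James"), Score = c(10, 2, 5))

# Mean of Score for each Name, returned as a named vector
tapply(df$Score, df$Name, mean)
# James  John
#     5     6
```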
Related
I have a data frame with a column of names and another column that holds a factor/char (either works). I want to filter out duplicate names, and IF one of the duplicates has a certain value for the factor, I want to keep that one; if the duplicates share the same factor value, I don't care which one is kept.
Name Status
1. John A
2. John B
3. Sally A
4. Alex A
5. Sarah B
6. Joe A
7. Joe A
8. Sue B
9. Sue B
I want to keep the duplicate if the factor/char is set to B. If there are two As or two Bs I don't care which one it keeps.
This is the result I want:
1. John B
2. Sally A
3. Alex A
4. Sarah B
5. Joe A
6. Sue B
I have tried the following but it's still keeping both A and B for John:
Name <- c("John","John","Sally","Alex", "Sarah", "Joe", "Joe", "Sue", "Sue")
Status <- c('A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B')
df_reddit <- data.frame(Name,Status)
df_reddit[, 'Status'] <- as.factor(df_reddit[, 'Status'])
df_reddit$Status <- factor(df_reddit$Status, levels = c("A", "B"))
df_reddit <- df_reddit[order(df_reddit$Status),]
df_reddit[!duplicated(df_reddit[,c('Name')]),]
Any help is appreciated! Would a loop or something be better for this?
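Incidentally, the base R attempt above is very close: ordering descending by Status (so the "B" rows come first) lets the !duplicated() step keep the preferred row for each name:

```r
Name <- c("John", "John", "Sally", "Alex", "Sarah", "Joe", "Joe", "Sue", "Sue")
Status <- c("A", "B", "A", "A", "B", "A", "A", "B", "B")
df_reddit <- data.frame(Name, Status)

# Sort so "B" rows come first, then keep the first row seen for each Name
df_reddit <- df_reddit[order(df_reddit$Status, decreasing = TRUE), ]
df_reddit[!duplicated(df_reddit$Name), ]
```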
A pure dplyr approach may look like so:
library(dplyr)
df_reddit |>
add_count(Name, wt = Status == "B") |>
filter(Status == "B" | n == 0) |>
distinct(Name, Status)
#> Name Status
#> 1 John B
#> 2 Sally A
#> 3 Alex A
#> 4 Sarah B
#> 5 Joe A
#> 6 Sue B
Here's one option using dplyr and data.table:
library(dplyr)
library(data.table)
df_reddit %>%
group_by(Name) %>%
mutate(rle = data.table::rleid(Status)) %>%
filter(rle == max(rle)) %>%
select(-rle) %>%
unique() %>%
ungroup()
# A tibble: 6 × 2
Name Status
<chr> <chr>
1 John B
2 Sally A
3 Alex A
4 Sarah B
5 Joe A
6 Sue B
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
name = c("Adam", "Jane", "TJ", "Joyce"))
> mydat
id name
1 372303 Adam
2 KN5232 Jane
3 231244 TJ
4 283472-3822 Joyce
In my dataset, I want to keep the rows where id is a 6-digit number. For ids that contain a 6-digit number followed by - and a 4-digit number, I just want to keep the first 6 digits.
My final data should look like this:
> mydat2
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
I am using the following grep("^[0-9]{6}$", c("372303", "KN5232", "231244", "283472-3822")) but this does not account for the case where I want to only keep the first 6 digits before the -.
One method would be to split at - and then extract with filter or subset
library(dplyr)
library(tidyr)
library(stringr)
mydat %>%
separate_rows(id, sep = "-") %>%
filter(str_detect(id, '^\\d{6}$'))
Output:
# A tibble: 3 × 2
id name
<chr> <chr>
1 372303 Adam
2 231244 TJ
3 283472 Joyce
You can extract the first standalone 6-digit number from each ID and then keep only the rows that still hold a 6-digit code:
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),name = c("Adam", "Jane", "TJ", "Joyce"))
library(stringr)
mydat$id <- str_extract(mydat$id, "\\b\\d{6}\\b")
mydat[grepl("^\\d{6}$",mydat$id),]
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
The \b\d{6}\b matches 6-digit codes as standalone numbers since \b are word boundaries.
You could also extract the first 6-digit number with a very simple regex (\\d{6}), convert to numeric (as I would expect you would anyway), and remove NAs. Note str_extract() rather than str_extract_all(): the latter returns a list, which as.numeric() cannot coerce, while str_extract() returns NA where there is no match, so na.omit() can drop those rows.
E.g.
library(dplyr)
library(stringr)
mydat |>
mutate(id = as.numeric(str_extract(id, "\\d{6}"))) |>
na.omit()
Output:
id name
1 372303 Adam
3 231244 TJ
4 283472 Joyce
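A dependency-free base R version of the same idea: strip everything from the - onward with sub(), then keep only the rows whose id matches the question's own ^[0-9]{6}$ pattern:

```r
mydat <- data.frame(id = c("372303", "KN5232", "231244", "283472-3822"),
                    name = c("Adam", "Jane", "TJ", "Joyce"))

# Drop "-" and anything after it, then keep exact 6-digit ids
mydat$id <- sub("-.*$", "", mydat$id)
mydat[grepl("^[0-9]{6}$", mydat$id), ]
```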
I have a data set that looks like this:
Name <- c("Tom (20)", "35.33L","45.44L", "Mary (18)", "45.00L", "1:25.96")
data.frame(Name)
Name
1 Tom (20)
2 35.33L
3 45.44L
4 Mary (18)
5 45.00L
6 1:25.96
And I want to create a new column called Time and put all the time values in it. Any idea how to split them up?
Want it to look like this:
Name Time
1 Tom (20) 35.33L
2 Tom (20) 45.44L
3 Mary (18) 45.00L
4 Mary (18) 1:25.96
Try:
library(dplyr)
library(tidyr)
df %>%
mutate(Time = replace(Name, grepl('\\(\\d+\\)', Name), NA),
Name = replace(Name, !is.na(Time), NA)) %>%
fill(Name) %>%
filter(!is.na(Time))
# Name Time
#1 Tom (20) 35.33L
#2 Tom (20) 45.44L
#3 Mary (18) 45.00L
#4 Mary (18) 1:25.96
Since every person has an age in parentheses, we separate out the values matching the (number) pattern and split the data into two columns, Name and Time.
We could do this in dplyr by creating a grouping column
library(dplyr)
library(stringr)
tibble(Name) %>%
group_by(grp = cumsum(str_detect(Name, '^[A-Za-z]'))) %>%
summarise(Time = Name[-1], Name = Name[1], .groups = 'drop') %>%
select(Name, Time)
# A tibble: 4 x 2
# Name Time
# <chr> <chr>
#1 Tom (20) 35.33L
#2 Tom (20) 45.44L
#3 Mary (18) 45.00L
#4 Mary (18) 1:25.96
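The same carry-the-name-down logic can be sketched in base R: flag the name rows, use cumsum() to number each person's block, and bind the two columns:

```r
Name <- c("Tom (20)", "35.33L", "45.44L", "Mary (18)", "45.00L", "1:25.96")

# Rows containing "(number)" are people; everything else is a time
is_name <- grepl("\\(\\d+\\)", Name)

# cumsum(is_name) numbers each person's block, so this repeats the
# most recent name for every row
person <- Name[is_name][cumsum(is_name)]
data.frame(Name = person[!is_name], Time = Name[!is_name])
```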
I'm attempting to create a variable in one long dataset (df1) where the value in each row needs to be based off of matching some conditions in another long dataset (df2). The conditions are:
- match on "name"
- the value for df1 should consider observations for that person that occurred before the observation in df1.
- Then I need the number of rows within that subset that meet a third condition (in the data below called "condition")
I've already tried running a for loop (I know, not preferred in R) over 1:nrow(df1), but I keep running into the issue that, in my actual data, df1 and df2 do not have the same number of rows (or a multiple of it).
I've also tried writing a function and applying it to df1. With apply I couldn't pass two data frames into the syntax, and when I gave lapply a list of data frames it returned null values.
Here is some generic data that fits the format of the data I'm working with.
df1 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_b = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4))
df2 <- data.frame(
name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
date_a = sample(seq(as.Date('2014/01/01'), as.Date('2019/10/01'), by="day"), 4),
condition = c("A", "B", "C", "A")
)
I know the way to get the number of rows could look something like this:
num_conditions <- nrow(df2[which(df1$name == df2$name & df2$date_a < df1$date_b & df2$condition == "A"), ])
What I would like to see in df1 is a column called "num_conditions" showing, for each row, the number of observations in df2 for that person that occurred before date_b and met condition "A".
df1 should look like this:
name date_b num_conditions
John Smith 10/1/15 1
John Smith 11/15/16 0
John Smith 9/19/19 0
I'm sure there are better ways to approach this, including data.table, but here is one using dplyr:
library(dplyr)
set.seed(12)
df2 %>%
filter(condition == "A") %>%
right_join(df1, by = "name") %>%
group_by(name, date_b) %>%
filter(date_a < date_b) %>%
mutate(num_conditions = n()) %>%
right_join(df1, by = c("name", "date_b")) %>%
mutate(num_conditions = coalesce(num_conditions, 0L)) %>%
select(-c(date_a, condition)) %>%
distinct()
# A tibble: 4 x 3
# Groups: name, date_b [4]
name date_b num_conditions
<fct> <date> <int>
1 John Smith 2016-10-13 2
2 John Smith 2015-11-10 2
3 Jane Smith 2016-07-18 1
4 Jane Smith 2018-03-13 1
R> df1
name date_b
1 John Smith 2016-10-13
2 John Smith 2015-11-10
3 Jane Smith 2016-07-18
4 Jane Smith 2018-03-13
R> df2
name date_a condition
1 John Smith 2015-04-16 A
2 John Smith 2014-09-27 A
3 Jane Smith 2017-04-25 C
4 Jane Smith 2015-08-20 A
Maybe the following is what the question is asking for.
library(tidyverse)
df1 %>%
left_join(df2 %>% filter(condition == 'A'), by = 'name') %>%
filter(date_a < date_b) %>%
group_by(name) %>%
mutate(num_conditions = n()) %>%
select(-date_a, -condition) %>%
full_join(df1) %>%
mutate(num_conditions = ifelse(is.na(num_conditions), 0, num_conditions))
#Joining, by = c("name", "date_b")
## A tibble: 4 x 3
## Groups: name [2]
# name date_b num_conditions
# <fct> <date> <dbl>
#1 John Smith 2019-05-07 2
#2 John Smith 2019-02-05 2
#3 Jane Smith 2016-05-03 0
#4 Jane Smith 2018-06-23 0
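If the joins feel heavy, the row-wise logic from the question can also be written directly with mapply (slower on big data, but easy to read). The dates below are fixed rather than drawn with sample(), so the result is reproducible; the names and the "A" condition come straight from the question:

```r
df1 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_b = as.Date(c("2016-10-13", "2015-11-10", "2016-07-18", "2018-03-13")))
df2 <- data.frame(
  name = c("John Smith", "John Smith", "Jane Smith", "Jane Smith"),
  date_a = as.Date(c("2015-04-16", "2014-09-27", "2017-04-25", "2015-08-20")),
  condition = c("A", "B", "C", "A"))

# For each df1 row: count df2 rows with the same name, an earlier date,
# and condition "A"
df1$num_conditions <- mapply(
  function(nm, d) sum(df2$name == nm & df2$date_a < d & df2$condition == "A"),
  df1$name, df1$date_b)
```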
I have data with a list of people's names and their ID numbers. Not all people with the same name will have the same ID number but everyone with different names should have a different ID number. Like this:
Name david david john john john john megan bill barbara chris chris
ID 1 1 2 2 2 3 4 5 6 7 8
I need to make sure that these IDs are correct. So I want to write code that says "subset only the rows where the ID numbers are the same but the names are different" (so I will be subsetting only the ID errors). I don't even know where to start; I tried
df1<-df(subset(duplicated(df$Name) & duplicated(df$ID)))
Error in subset.default(duplicated(df$officer) & duplicated(df$ID)) :
argument "subset" is missing, with no default
but it didn't work and I know it doesn't tell R to match and compare names and ID numbers.
Thank you so much in advance.
Updated with the information in the comments below
Here are some test data:
> DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"), id=c(1,1,2,3,4,4))
> DF
name id
1 A 1
2 A 1
3 A 2
4 B 3
5 B 4
6 C 4
So ... if I understand your problem correctly, you want to flag id 4 as a problem, since two different names (B and C) appear for that id.
library(dplyr)
DF %>% group_by(id) %>% distinct(name) %>% tally()
# A tibble: 4 x 2
id n
<dbl> <int>
1 1 1
2 2 1
3 3 1
4 4 2
Here we get a summary and see that there are two different names (n = 2) for id 4. You can combine that with filter to see only the ids with more than one name:
> DF %>% group_by(id) %>% distinct(name) %>% tally() %>% filter(n > 1)
# A tibble: 1 x 2
id n
<dbl> <int>
1 4 2
Did that help?
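The same check can be done in base R with tapply, counting distinct names per id and keeping the ids with more than one:

```r
DF <- data.frame(name = c("A", "A", "A", "B", "B", "C"), id = c(1, 1, 2, 3, 4, 4))

# Number of distinct names seen for each id
n_names <- tapply(DF$name, DF$id, function(x) length(unique(x)))

# Ids shared by more than one name
names(n_names[n_names > 1])
# [1] "4"
```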