Create new column in string partial match-based dataframe without repeats

I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
DF:
GL GLDESC
1 515100 Payroll-ISL
2 515900 Payroll-ICA
3 532300 Bulk Gas
4 551000 Supply AB
5 551000 Supply XPTO
6 551100 Supply AB
7 551300 Intern
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll.
If GLDESC contains the word Supply anywhere in the string then I want KIND to be Supply.
In all other cases I want KIND to be Other.
Then, I found this:
DF$KIND <- ifelse(grepl("supply", DF$GLDESC, ignore.case = TRUE), "Supply",
           ifelse(grepl("payroll", DF$GLDESC, ignore.case = TRUE), "Payroll", "Other"))
But with that, every row that matches Supply, for example, gets classified. However, as in DF rows 4 and 5, the same GL has two Supply entries, which is unnecessary for me. In fact, I need only one GLDESC of each type to be matched when the string is repeated for the same GL.
Edit: I cannot delete any rows. I want to have this as output:
GL GLDESC KIND
A Supply1 Supply
A Supply2 N/A
A Supply3 N/A
A Supply4 N/A
A Supply5 N/A
A Supply6 N/A
A Payroll1 Payroll
B Supply2 Supply
B Payroll Payroll

If the repeated elements should be NA, use duplicated on 'GLDESC' to get a logical vector, and assign NA to those elements of 'KIND' (the column created with ifelse):
DF$KIND[duplicated(DF$GLDESC)] <- NA_character_
If we need to change the values by a grouping variable
library(dplyr)
DF %>%
group_by(GL) %>%
mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
# A tibble: 9 x 3
# Groups: GL [2]
# GL GLDESC KIND
# <chr> <chr> <chr>
#1 A Supply1 Supply
#2 A Supply2 <NA>
#3 A Supply3 <NA>
#4 A Supply4 <NA>
#5 A Supply5 <NA>
#6 A Supply6 <NA>
#7 A Payroll1 Payroll
#8 B Supply2 Supply
#9 B Payroll Payroll
Or with the full changes
library(dplyr)
library(stringr)
DF1 %>%
  mutate(KIND = str_remove(GLDESC, "\\d+"),
         KIND = replace(KIND, !KIND %in% c("Supply", "Payroll"), "Other")) %>%
  group_by(GL) %>%
  mutate(KIND = replace(KIND, duplicated(KIND) & KIND == "Supply", NA_character_))
data
DF1 <- structure(list(GL = c("A", "A", "A", "A", "A", "A", "A", "B",
"B"), GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
"Supply5", "Supply6", "Payroll1", "Supply2", "Payroll")), row.names = c(NA,
-9L), class = "data.frame")
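For reference, the same idea can be sketched in base R without dplyr. This is only a minimal sketch: it assumes the DF1 data above, and it assumes that a repeated Payroll within a GL should become NA as well (the question only shows repeated Supply). The name first_hit is purely illustrative.

```r
DF1 <- data.frame(
  GL = c("A", "A", "A", "A", "A", "A", "A", "B", "B"),
  GLDESC = c("Supply1", "Supply2", "Supply3", "Supply4",
             "Supply5", "Supply6", "Payroll1", "Supply2", "Payroll"),
  stringsAsFactors = FALSE
)

# Classify each row by a partial, case-insensitive match on GLDESC
DF1$KIND <- ifelse(grepl("supply", DF1$GLDESC, ignore.case = TRUE), "Supply",
            ifelse(grepl("payroll", DF1$GLDESC, ignore.case = TRUE), "Payroll",
                   "Other"))

# TRUE for the first occurrence of each (GL, KIND) pair
first_hit <- !duplicated(DF1[c("GL", "KIND")])

# Blank out repeated Supply/Payroll within a GL; "Other" rows are left alone
# (assumption: repeats of Payroll are treated like repeats of Supply)
DF1$KIND[!first_hit & DF1$KIND != "Other"] <- NA_character_

print(DF1)
```

No rows are dropped; only the KIND value of the repeated matches is set to NA.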

Related

Change cell contents if it contains a certain letter

I have a column that lists the race/ethnicity of individuals. I am trying to make it so that if the cell contains an 'H' then I only want H. Similarly, if the cell contains an 'N' then I want an N. Finally, if the cell has multiple races, not including H or N, then I want it to be M. Below is how it is listed currently and the desired output.
Current output
People | Race/Ethnicity
PersonA| HAB
PersonB| NHB
PersonC| AB
PersonD| ABW
PersonE| A
Desired output
PersonA| H
PersonB| N
PersonC| M
PersonD| M
PersonE| A

You can try the following dplyr approach, which combines grepl with dplyr::case_when. It first searches for N values; then, among those without an N, it searches for H values; then, among those without an H or an N, it assigns M to those with more than one race and the original letter to those with only one race (assuming each race is represented by a single character).
A base R approach is below as well: no need for dependencies, but less elegant.
Data
df <- read.table(text = "person ethnicity
PersonA HAB
PersonB NHB
PersonC AB
PersonD ABW
PersonE A", header = TRUE)
dplyr (note order matters given your priority)
library(dplyr)
df %>% mutate(eth2 = case_when(
grepl("N", ethnicity) ~ "N",
grepl("H", ethnicity) ~ "H",
!grepl("H|N", ethnicity) & nchar(ethnicity) > 1 ~ "M",
TRUE ~ ethnicity
))
You could also do it "manually" in base r by indexing (note order matters given your priority):
df[grepl("H", df$ethnicity), "eth2"] <- "H"
df[grepl("N", df$ethnicity), "eth2"] <- "N"
df[!grepl("H|N", df$ethnicity) & nchar(df$ethnicity) > 1, "eth2"] <- "M"
df[nchar(df$ethnicity) %in% 1, "eth2"] <- df$ethnicity[nchar(df$ethnicity) %in% 1]
In both cases the output is:
# person ethnicity eth2
# 1 PersonA HAB H
# 2 PersonB NHB N
# 3 PersonC AB M
# 4 PersonD ABW M
# 5 PersonE A A
Note this is based on your comment about assigning priority (an N anywhere wins, even when both N and H are present, etc.).
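The priority rule can also be wrapped in a small helper. This is a sketch under the same assumption as above (each race is a single character); the function name recode_race is hypothetical and not part of any package.

```r
# Hypothetical helper: N wins over H, H wins over everything else,
# any other multi-letter value becomes M, single letters are kept as-is
recode_race <- function(x) {
  ifelse(grepl("N", x), "N",
  ifelse(grepl("H", x), "H",
  ifelse(nchar(x) > 1, "M", x)))
}

ethnicity <- c("HAB", "NHB", "AB", "ABW", "A")
eth2 <- recode_race(ethnicity)
print(data.frame(ethnicity, eth2))
```

Because the N test comes first in the nested ifelse, a value such as "NHB" is recoded to N even though it also contains an H.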

We could use str_extract. When the number of characters in the column is greater than 1, extract 'N' and 'H' separately, then coalesce the extracted elements along with 'M' (thus, if there is no match, we get 'M'; otherwise the result follows the order in which we placed the inputs in coalesce). In the other case, i.e. when the number of characters is 1, return the column values. Thus, 'N' supersedes 'H' no matter the position in the string.
library(dplyr)
library(stringr)
df1 %>%
mutate(output = case_when(nchar(`Race/Ethnicity`) > 1
~ coalesce(str_extract(`Race/Ethnicity`, 'N'),
str_extract(`Race/Ethnicity`, 'H'), "M"),
TRUE ~ `Race/Ethnicity`))
-output
People Race/Ethnicity output
1 PersonA HAB H
2 PersonB NHB N
3 PersonC AB M
4 PersonD ABW M
5 PersonE A A
data
df1 <- structure(list(People = c("PersonA", "PersonB", "PersonC", "PersonD",
"PersonE"), `Race/Ethnicity` = c("HAB", "NHB", "AB", "ABW", "A"
)), class = "data.frame", row.names = c(NA, -5L))

Filter dataframe based on one pattern that is conditional on another pattern

Can't seem to wrap my head around a seemingly simple task: how to filter a dataframe based on a pattern in one column, which, however, is to match only if a pattern in another column matches:
Data:
df <- data.frame(
Speaker = c("A", NA, "B", "C", "A", "B", "A", "B", "C"),
Utterance = c("uh-huh",
"(0.666)",
"WOW!",
"#yeah#",
"=right=",
"oka::y¿",
"okay",
"some stuff",
"!more! £TAlk£"),
Orthographic = c("uh-huh", "NA", "wow", "yeah", "right", "okay", "okay", "some stuff", "more talk")
)
I want to remove rows in df where the pattern ^(yeah|okay|right|mhm|mm|uh(-| )?huh)$ matches in column Orthographic but not if these rows contain any character from character class [A-Z:↑↓£#¿?!] in column Utterance.
Expected outcome:
df
Speaker Utterance Orthographic
3 B WOW! wow
4 C #yeah# yeah
6 B oka::y¿ okay
8 B some stuff some stuff
9 C !more! £TAlk£ more talk
Attempts so far: (filters too much!)
library(dplyr)
df %>%
filter(!is.na(Speaker)) %>%
filter(!grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", Orthographic)
& grepl("[A-Z:↑↓£#¿?!]", Utterance))
Speaker Utterance Orthographic
1 B WOW! wow
2 C !more! £TAlk£ more talk

I think you need | :
library(dplyr)
df %>%
filter(!is.na(Speaker)) %>%
filter(!grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", Orthographic)
| grepl("[A-Z:↑↓£#¿?!]", Utterance))
# Speaker Utterance Orthographic
#1 B WOW! wow
#2 C #yeah# yeah
#3 B oka::y¿ okay
#4 B some stuff some stuff
#5 C !more! £TAlk£ more talk
This keeps rows that do not match ^(yeah|okay|right|mhm|mm|uh(-| )?huh)$ in Orthographic, or that contain a character from [A-Z:↑↓£#¿?!] in Utterance.
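The reason | is needed follows from De Morgan's law: "drop the row if it is a filler word and has no special character" is the same as "keep the row if it is not a filler word or it has a special character". A minimal base R sketch of that equivalence, using a reduced ASCII-only character class and made-up rows (both are assumptions for illustration, not the original data):

```r
dat <- data.frame(
  Utterance    = c("uh-huh", "WOW!", "#yeah#", "=right=", "okay", "some stuff"),
  Orthographic = c("uh-huh", "wow", "yeah", "right", "okay", "some stuff"),
  stringsAsFactors = FALSE
)

# Rows whose Orthographic form is one of the filler words
filler <- grepl("^(yeah|okay|right|mhm|mm|uh(-| )?huh)$", dat$Orthographic)
# Rows whose Utterance contains one of the protecting characters
marked <- grepl("[A-Z:#?!]", dat$Utterance)

drop <- filler & !marked   # what we want to remove
keep <- !filler | marked   # the condition passed to filter()

# Both formulations select the same rows
print(identical(keep, !drop))
print(dat[keep, ])
```

Using & instead of | in the keep condition would additionally throw away every non-filler row without a special character, which is why the original attempt filtered too much.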

select last non-contemporaneous date in group

People are buying stuff and I have the dates when someone last purchased the item in their zip code. I want to grab the last noncontemporaneous date in that group.
ZCTA5 = c("b", "c", "a", "b", "b", "c", "a", "a", "a", "c")
App.Complete.Date = c("2005-01-23", "2005-01-23",
"2006-07-13", "2006-11-21",
"2006-11-21", "2006-11-21",
"2007-01-01", "2007-01-01",
"2007-01-01", "2007-01-01")
xxx <- data.frame(ZCTA5,App.Complete.Date) %>%
arrange(ZCTA5,App.Complete.Date); xxx
Last.Unique.Date.In.ZCTA5 =c(NA, "2006-07-13", "2006-07-13", "2006-07-13", NA, "2005-01-23",
"2005-01-23", NA, "2005-01-23", "2006-11-21")
Desired output
ZCTA5 App.Complete.Date Last.Unique.Date.In.ZCTA5
1 a 2006-07-13 <NA>
2 a 2007-01-01 2006-07-13
3 a 2007-01-01 2006-07-13
4 a 2007-01-01 2006-07-13
5 b 2005-01-23 <NA>
6 b 2006-11-21 2005-01-23
7 b 2006-11-21 2005-01-23
8 c 2005-01-23 <NA>
9 c 2006-11-21 2005-01-23
10 c 2007-01-01 2006-11-21
I don't want to drop any observations. Mutating in place would be ideal, but I understand joining by ZCTA5 and (not shown but I do have it) individual ID later would be fine.
I couldn't figure out a way to mutate a new variable by lagging the unique App.Complete.Date values so I am stuck. Additionally, slicing has been too cumbersome since I still need the last date without removing contemporaneous dates.
EDIT: If the NA is the same row's App.Complete.Date, that's acceptable.

Try the following:
xxx = xxx %>%
mutate(App.Complete.Date = as.Date(App.Complete.Date),
rn = row_number())
This initial setup ensures the date column is of type Date. The row numbers are added so that rows with duplicate dates in the original data are preserved.
yyy = xxx %>%
left_join(xxx, by = "ZCTA5") %>%
# discard all the out-of-scope dates
mutate(App.Complete.Date.y = ifelse(App.Complete.Date.y < App.Complete.Date.x,
App.Complete.Date.y, NA)) %>%
# we need to include row number here to preserve all rows in the original
group_by(ZCTA5, App.Complete.Date.x, rn.x) %>%
# na.rm = TRUE handles all the missing values removed in the previous mutate
summarise(App.Complete.Date.y = max(App.Complete.Date.y, na.rm = TRUE), .groups = 'drop') %>%
# summarise may return numeric type rather than date type - convert back
mutate(App.Complete.Date.y = as.Date(App.Complete.Date.y, origin = "1970-01-01")) %>%
# rename to output
select(ZCTA5,
App.Complete.Date = App.Complete.Date.x,
Last.Unique.Date.In.ZCTA5 = App.Complete.Date.y)
The origin argument in the last mutate should be "1970-01-01": R stores Date values as the number of days since that date. For example, my computer returned 13342 instead of '2006-07-13', and '2006-07-13' is 13342 days after '1970-01-01'.
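As an alternative without a self-join, the lookup can be sketched in base R with ave. This is a minimal sketch, not the method above: the helper name last_earlier is hypothetical, and it compares every date with every other date in its group, so it is only meant for moderately sized groups.

```r
xxx <- data.frame(
  ZCTA5 = c("b", "c", "a", "b", "b", "c", "a", "a", "a", "c"),
  App.Complete.Date = as.Date(c("2005-01-23", "2005-01-23", "2006-07-13",
                                "2006-11-21", "2006-11-21", "2006-11-21",
                                "2007-01-01", "2007-01-01", "2007-01-01",
                                "2007-01-01")),
  stringsAsFactors = FALSE
)

# For one group of dates (as day counts), return for each element the latest
# strictly earlier date in that group, or NA when there is none
last_earlier <- function(v) {
  vapply(v, function(x) {
    prev <- v[v < x]
    if (length(prev) > 0) max(prev) else NA_real_
  }, numeric(1))
}

# ave() applies the helper per ZCTA5 and keeps the original row order
days <- as.numeric(xxx$App.Complete.Date)
xxx$Last.Unique.Date.In.ZCTA5 <- as.Date(
  ave(days, xxx$ZCTA5, FUN = last_earlier),
  origin = "1970-01-01"
)

print(xxx[order(xxx$ZCTA5, xxx$App.Complete.Date), ])
```

No rows are dropped or reordered, so the new column can be used directly alongside an individual ID column.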

Extract value of a variable that appears at least twice on two factor levels of another variable of R dataframe

I have a data frame (df) like:
database minrna genesymbol
A mir-1 abc
A mir-2 bcc
B mir-1 abc
B mir-3 xyb
c mir-1 abc
I want to extract mirna that is predicted by at least two databases. For example, in the above df, mir-1 is predicted by databases A, B and c, and hence the result I want would be:
database minrna genesymbol
A mir-1 abc
B mir-1 abc
c mir-1 abc
I have searched for similar questions but couldn't find anything like this. Could you please help me solve this? Thank you.

We can count number of unique database for each minrna and filter based on that.
This can be done in base R :
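# ave() returns the type of its first argument, so pass integer codes rather than
# the character column; otherwise the counts are compared as strings
subset(df, ave(as.integer(factor(database)), minrna, FUN = function(x) length(unique(x))) >= 2)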
# database minrna genesymbol
#1 A mir-1 abc
#3 B mir-1 abc
#5 c mir-1 abc
In dplyr :
library(dplyr)
df %>% group_by(minrna) %>% filter(n_distinct(database) >= 2)
Or with data.table :
library(data.table)
setDT(df)[, .SD[uniqueN(database) >=2], minrna]
data
df <- structure(list(database = c("A", "A", "B", "B", "c"), minrna = c("mir-1",
"mir-2", "mir-1", "mir-3", "mir-1"), genesymbol = c("abc", "bcc",
"abc", "xyb", "abc")), row.names = c(NA, -5L), class = "data.frame")

Use group_by function from {dplyr} package, I will let you figure out the details as a form of exercise.
https://dplyr.tidyverse.org/
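The counting idea from the answers above can also be spelled out step by step in base R with tapply. This is a minimal sketch using the example data; the intermediate names n_db, keep and result are only illustrative.

```r
df <- data.frame(
  database   = c("A", "A", "B", "B", "c"),
  minrna     = c("mir-1", "mir-2", "mir-1", "mir-3", "mir-1"),
  genesymbol = c("abc", "bcc", "abc", "xyb", "abc"),
  stringsAsFactors = FALSE
)

# Number of distinct databases that predict each minrna
n_db <- tapply(df$database, df$minrna, function(x) length(unique(x)))

# Names of the minrna values seen in at least two databases
keep <- names(n_db)[n_db >= 2]

# Keep every row belonging to those minrna values
result <- df[df$minrna %in% keep, ]
print(result)
```

Splitting the work into a count and a lookup makes it easy to inspect n_db first and to change the threshold later.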

R studio - using grepl() to grab specific characters and populate a new column in the dataframe

I have a data set in R studio (Aud) that looks like the following. ID is of type Character and Function is of type character as well
ID Function
F04 FZ000TTY WB002FR088DR011
F05 FZ000AGH WZ004ABD
F06 FZ0005ABD
my goal is to attempt and extract only the "FZ", "TTY", "WB", "FR", "WZ", "ABD" from all the rows in the data set and place them in a new unique column in the data set so that i have something like the following as an example
ID Function SUBFUN1 SUBFUN2 SUBFUN3 SUBFUN4 SUBFUN5
F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
I want to individualize the functions since they represent a certain behavior and that way i can plot per ID the behavior or functions which occur the most over a course of time
I tried the following
Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))
Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))
I get the error message below in my attempts for subfun1 & subfun2:
Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) :
replacement has 0 rows, data has 343456
Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) :
replacement has 0 rows, data has 343456
I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string
Any help would be greatly appreciated with this

Using data.table:
library(data.table)
Aud <- data.frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
stringsAsFactors = FALSE
)
setDT(Aud)
cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
ID Function V1 V2 V3 V4 V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
2: F05 FZ000AGH WZ004ABD FZ AGH WZ ABD <NA>
3: F06 FZ0005ABD FZ ABD <NA> <NA> <NA>
Staying in base R one could do something like the following:
our_split <- strsplit(Aud$Function, "[0-9]+| ")
cbind(
Aud,
do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
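As a side note on the error in the question: "replacement has 0 rows" is what R reports when the right-hand side has length zero, which happens when the column name does not exist. The sketch below assumes the column is called Function, as in the sample data here, so that Functions.NO. is a non-existent name; that assumption may not hold for the real data set.

```r
Aud <- data.frame(
  ID = c("F04", "F05", "F06"),
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
  stringsAsFactors = FALSE
)

# A column name that does not exist returns NULL ...
missing_col <- Aud$Functions.NO.

# ... so grepl() and ifelse() give a zero-length result, which cannot be
# assigned as a column of a data frame that has rows
zero_len <- ifelse(grepl("FZ", missing_col), "FZ", "Other")
print(length(zero_len))

# With the column name used in this sketch, there is one value per row
Aud$Subfun1 <- ifelse(grepl("FZ", Aud$Function), "FZ", "Other")
print(Aud$Subfun1)
```

Checking names(Aud) before building the new columns is a quick way to confirm the exact spelling of the column.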

One can use tidyr::separate to split the Function column into multiple columns, using a regex as the separator.
library(tidyverse)
df %>%
separate(Function, into = paste("V",1:5, sep=""),
sep = "([^[:alpha:]]+)", fill="right", extra = "drop")
# ID V1 V2 V3 V4 V5
# 1 F04 FZ TTY WB FR DR
# 2 F05 FZ AGH WZ ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>
([^[:alpha:]]+) : separate on any run of non-alphabetic characters
Data:
df <- read.table(text=
"ID Function
F04 'FZ000TTY WB002FR088DR011'
F05 'FZ000AGH WZ004ABD'
F06 FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)

A tidyverse way that makes use of stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |.
library(tidyverse)
Aud <- tibble(  # tibble() replaces the deprecated data_frame()
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)
search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"
Aud %>%
mutate(code = str_extract_all(Function, search_terms)) %>%
select(-Function) %>%
unnest(code) %>%
group_by(ID) %>%
mutate(subfun = row_number()) %>%
spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID subfun1 subfun2 subfun3 subfun4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F04 FZ TTY WB FR
#> 2 F05 FZ WZ ABD <NA>
#> 3 F06 FZ ABD <NA> <NA>
Created on 2018-07-11 by the reprex package (v0.2.0).
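The suggestion to paste a vector of IDs together and collapse by | can be sketched as follows. This uses base R's regmatches and gregexpr instead of stringr, purely to keep the example dependency-free; the vector name codes is illustrative.

```r
codes <- c("FZ", "TTY", "WB", "FR", "WZ", "ABD")

# Build the alternation pattern from the vector of IDs
search_terms <- paste(codes, collapse = "|")

funs <- c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")

# One character vector of matches per input string, in order of appearance
matches <- regmatches(funs, gregexpr(search_terms, funs))
print(search_terms)
print(matches)
```

Only the listed IDs are returned, so codes such as DR or AGH that are not in the vector are ignored.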
