Creating a new column of consecutive tokens (like n-grams) in R

I have this dataset:
A B
URBAN 1
PLAN 2
I would like to add a new column, A', like this:
A A' B
URBAN URB 1
URBAN RBA 1
URBAN BAN 1
PLAN PLA 2
PLAN LAN 2
How do I make the A' column in R?

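library(zoo)
dat <- read.table(text = "A B
URBAN 1
PLAN 2", header = TRUE, stringsAsFactors = FALSE)
# For each word, slide a width-3 window over its character positions
d <- lapply(dat$A, function(y)
  rollapply(1:nchar(y), 3, function(x) substr(y, min(x), max(x))))
# Repeat each row once per trigram (index by row number, not by the value of B)
data.frame(dat[rep(seq_len(nrow(dat)), lengths(d)), ], A1 = unlist(d), row.names = NULL)
A B A1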
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN
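The same windowing can also be done without zoo. Below is a minimal base R sketch; the helper name char_ngrams is hypothetical (it does not come from the answer above), and the input is the two-row example from the question.

```r
# Hypothetical helper: all overlapping character n-grams of a single string
char_ngrams <- function(x, n = 3) {
  if (nchar(x) < n) return(character(0))      # too short: no n-grams
  starts <- seq_len(nchar(x) - n + 1)         # every valid start position
  substring(x, starts, starts + n - 1)
}

mydf <- data.frame(A = c("URBAN", "PLAN"), B = 1:2, stringsAsFactors = FALSE)
grams <- lapply(mydf$A, char_ngrams, n = 3)

# Repeat each original row once per n-gram, then attach the n-grams
out <- data.frame(mydf[rep(seq_len(nrow(mydf)), lengths(grams)), ],
                  A1 = unlist(grams),
                  row.names = NULL, stringsAsFactors = FALSE)
print(out)
```

Indexing rows with seq_len(nrow(mydf)) rather than with the values of B keeps the result correct when B is not simply 1, 2, 3, and so on.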

Here is one possible way. There are surely more concise ways to handle this, but the following works. For each row in mydf, I apply substr() to create the three-letter elements; the Map() call produces them. Because Map() also returns some unwanted elements shorter than three characters, I filter them out with another lapply(). Finally, unnest() expands each list element into rows, giving long-format data.
library(tidyverse)
mydf %>%
mutate(whatever = lapply(1:nrow(mydf), function(x) {
unlist(Map(function(j, k) substr(mydf$A[x], start = j, stop = k),
1:nchar(mydf$A[x]), 3:nchar(mydf$A[x])))
}) %>%
lapply(function(x) x[nchar(x) ==3])) %>%
unnest(whatever)
A B whatever
1 URBAN 1 URB
2 URBAN 1 RBA
3 URBAN 1 BAN
4 PLAN 2 PLA
5 PLAN 2 LAN
DATA
mydf <- structure(list(A = c("URBAN", "PLAN"), B = 1:2), .Names = c("A",
"B"), class = "data.frame", row.names = c(NA, -2L))

Here is an option with str_match_all
library(stringr)
merge(stack(lapply(setNames(str_match_all(mydf$A, "(?=(...))"),
mydf$A), `[`, , 2))[2:1], mydf, by.x = 'ind', by.y = 'A')
Or using a similar idea with tidyverse
library(purrr)
library(dplyr)
library(tidyr)  # provides unnest()
mydf %>%
mutate(Anew = str_match_all(A, "(?=(...))") %>%
map(~.x[,2])) %>%
unnest(Anew)
# A B Anew
#1 URBAN 1 URB
#2 URBAN 1 RBA
#3 URBAN 1 BAN
#4 PLAN 2 PLA
#5 PLAN 2 LAN
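The pattern (?=(...)) works because a lookahead consumes no characters: the regex engine can match again at the very next position, and the capture group inside the lookahead picks up overlapping three-character windows. A small base R sketch of the same idea, assuming gregexpr with perl = TRUE as a stand-in for str_match_all:

```r
x <- "URBAN"

# Zero-width lookahead: one match per start position with 3 characters ahead
m <- gregexpr("(?=(...))", x, perl = TRUE)[[1]]

# With perl = TRUE the start of each capture group is stored as an attribute
starts <- attr(m, "capture.start")[, 1]

trigrams <- substring(x, starts, starts + 2)
print(trigrams)
```

A plain "..." pattern without the lookahead would return only non-overlapping matches, which is why the capture group is wrapped in (?= ).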

Related

How to add rows to a table based on a condition?

I would like to create a table that grants each user access to certain pages of a report.
Imagine I have a table like this:
I have another table in which I have the name of every report page:
I want to get a table listing the users and all the pages they have access to, depending on their group. Group 1 can see all pages, but group 2 can only see the Team page:
The best option would be to do it in DAX, but I think it could be easier in R. Thanks in advance!
The tidyverse package gives you easy tools to manipulate dataframes. You can create 3 variables (Orders, Sales, Team) recording the access rights for each page (with 1 or 0 for example) using case_when with a condition on Group, and then pivot_longer on these variables, and finally only keep rows where there is an access right with filter.
library(tidyverse)
Group <- c(1,2,1,2,2)
User <- c("alex","pablo","carlos","pepe","paula") %>% paste0("#gmail.com")
df <- data.frame(Group, User)
df2 <- df %>%
mutate(Orders = case_when(Group==1 ~ 1,
Group==2 ~ 0),
Sales = Orders,
Team = 1) %>%
pivot_longer(cols = c(Orders, Sales, Team), names_to = "Page") %>%
filter(value == 1) %>%
select(-value)
Output
> df
Group User
1 1 alex#gmail.com
2 2 pablo#gmail.com
3 1 carlos#gmail.com
4 2 pepe#gmail.com
5 2 paula#gmail.com
> df2
# A tibble: 9 x 3
Group User Page
<dbl> <chr> <chr>
1 1 alex#gmail.com Orders
2 1 alex#gmail.com Sales
3 1 alex#gmail.com Team
4 2 pablo#gmail.com Team
5 1 carlos#gmail.com Orders
6 1 carlos#gmail.com Sales
7 1 carlos#gmail.com Team
8 2 pepe#gmail.com Team
9 2 paula#gmail.com Team
Another idea:
library(dplyr)
library(tidyr)
df %>%
mutate(page = toString(df1$page_name)) %>%
separate_rows(page, sep = ', ') %>%
mutate(page = replace(page, Group == 2 & page != 'team', NA)) %>%
na.omit()
# A tibble: 9 x 3
Group User page
<dbl> <chr> <chr>
1 1 A orders
2 1 A sales
3 1 A team
4 2 B team
5 1 C orders
6 1 C sales
7 1 C team
8 2 D team
9 2 E team
DATA
dput(df)
structure(list(Group = c(1, 2, 1, 2, 2), User = c("A", "B", "C",
"D", "E")), class = "data.frame", row.names = c(NA, -5L))
dput(df1)
structure(list(page_name = c("orders", "sales", "team")), class = "data.frame", row.names = c(NA,
-3L))
Another idea using fuzzyjoin package:
Data
users <- data.frame(
Group = c("1","2","1","2","2"),
User = c("alex","pablo","carlos","pepe","paula")
)
Group User
1 1 alex
2 2 pablo
3 1 carlos
4 2 pepe
5 2 paula
You can then add a column to the page data frame that lists the groups allowed to access each page:
pagename <- data.frame(
Page = c("Order","Sales","Team"),
Allowed = c("1","1","1|2")
)
Page Allowed
1 Order 1
2 Sales 1
3 Team 1|2
And finally using fuzzyjoin::regex_left_join:
users |>
fuzzyjoin::regex_left_join(pagename,
by = c(Group = "Allowed")) |>
dplyr::select(-Allowed)
Output
Group User Page
1 1 alex Order
2 1 alex Sales
3 1 alex Team
4 2 pablo Team
5 1 carlos Order
6 1 carlos Sales
7 1 carlos Team
8 2 pepe Team
9 2 paula Team
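If the access rules are small enough to write down explicitly, a plain join may be sufficient. The following is a sketch in base R only; the access table and its contents are an assumption based on the rule stated in the question (group 1 sees every page, group 2 sees only Team).

```r
users <- data.frame(
  Group = c(1, 2, 1, 2, 2),
  User  = c("alex", "pablo", "carlos", "pepe", "paula"),
  stringsAsFactors = FALSE
)

# Assumed permission table: one row per (group, page) pair that is allowed
access <- data.frame(
  Group = c(1, 1, 1, 2),
  Page  = c("Orders", "Sales", "Team", "Team"),
  stringsAsFactors = FALSE
)

# Inner join on Group: each user gets one row per page their group may see
result <- merge(users, access, by = "Group")
print(result)
```

Keeping the permissions in their own table means a new group or page only requires a new row, not a change to the code. Note that merge() sorts the result by the join key, so the row order differs from the outputs above.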

Recoding multiple variables on different scales using across()

Let's say I have the following data:
data <- data.frame("ID" = c(1:5),
"Var1" = c(1,0,1,0,0),
"Var2" = c(99,2,1,3,2))
Each variable beginning with "Var" has a different numeric scale. I want to recode these numeric values into text. To do this I can use something like:
Var1_recode <- c("1 = 'yes'; 0 = 'no'")
Var2_recode <- c("99 = 'unknown'; 1 = 'weak'; 2 = 'moderate'; 3 = 'strong'")
data_recoded <- data %>%
mutate(Var1 = car::recode(Var1, Var1_recode),
Var2 = car::recode(Var2, Var2_recode))
However, in a large dataset with lots of columns to be recoded, specifying each recoded variable in mutate would lead to lots of repetition. My question: is there a way to use across to recode all of my "Var" variables with the relevant recode variables? The output for this example would look like this:
ID Var1 Var2
1 1 yes unknown
2 2 no moderate
3 3 yes weak
4 4 no strong
5 5 no moderate
I've tried searching for a solution like the following, but I can't work out a way of specifying the relevant recode vector for each column in my data:
data_recoded <- data %>%
mutate(across(.cols = starts_with("Var"), ~ car::recode(.x, relevant_recode_vector_here)))
Any help would be much appreciated.
One option could be:
data %>%
mutate(across(Var1:Var2, ~ car::recode(., get(paste0(cur_column(), "_recode")))))
ID Var1 Var2
1 1 yes unknown
2 2 no moderate
3 3 yes weak
4 4 no strong
5 5 no moderate
Include the recode rules in a list and apply them using Map:
recode_rules <- list(c("1 = 'yes'; 0 = 'no'"),
c("99 = 'unknown'; 1 = 'weak'; 2 = 'moderate'; 3 = 'strong'"))
data[-1] <- Map(car::recode, data[-1], recode_rules)
data
# ID Var1 Var2
#1 1 yes unknown
#2 2 no moderate
#3 3 yes weak
#4 4 no strong
#5 5 no moderate
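When each rule is a simple value-to-label mapping, a named lookup vector per column is another possibility that avoids both car::recode and get(). A base R sketch, where recode_maps is a hypothetical name and the list is keyed by column name so that each column finds its own mapping:

```r
data <- data.frame(ID = 1:5,
                   Var1 = c(1, 0, 1, 0, 0),
                   Var2 = c(99, 2, 1, 3, 2))

# Hypothetical lookup tables: names are the old values, elements the new labels
recode_maps <- list(
  Var1 = c("1" = "yes", "0" = "no"),
  Var2 = c("99" = "unknown", "1" = "weak", "2" = "moderate", "3" = "strong")
)

# Recode every column that has an entry in recode_maps
for (col in names(recode_maps)) {
  keys <- as.character(data[[col]])
  data[[col]] <- unname(recode_maps[[col]][keys])
}
print(data)
```

Values with no entry in the lookup vector become NA, so unexpected codes show up in the result rather than passing through silently.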

Adapting string variables to specific characteristics in R

I have the following data:
id code
1 I560
2 K980
3 R30
4 F500
5 650
I would like to do the following two things with the column code:
i) keep the letter and the two digits that follow it, and
ii) remove the observations that do not start with a letter. So in the end, the data frame should look like this:
id code
1 I56
2 K98
3 R30
4 F50
In base R, you could do:
subset(transform(df, code = sub('([A-Z]\\d{2}).*', '\\1', code)),
grepl('^[A-Z]', code))
Or using tidyverse functions
library(dplyr)
library(stringr)
df %>%
mutate(code = str_extract(code, '[A-Z]\\d{2}')) %>%
filter(str_detect(code, '^[A-Z]'))
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
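The two steps can also be kept separate, which makes the filter explicit. A minimal base R sketch, assuming the data frame from the question is called df and that a valid code is one uppercase letter followed by at least two digits:

```r
df <- data.frame(id = 1:5,
                 code = c("I560", "K980", "R30", "F500", "650"),
                 stringsAsFactors = FALSE)

# Step ii) keep only codes that start with a letter followed by two digits
keep <- grepl("^[A-Z][0-9]{2}", df$code)
out <- df[keep, ]

# Step i) truncate to the letter plus its first two digits
out$code <- substr(out$code, 1, 3)
print(out)
```

Filtering before truncating means rows such as 650 are dropped on the basis of the original value, not on a value that has already been shortened.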
An option with substr from base R
df1$code <- substr(df1$code, 1, 3)
df1[grepl('^[A-Z]', df1$code),]
# id code
#1 1 I56
#2 2 K98
#3 3 R30
#4 4 F50
data
df1 <- structure(list(id = 1:5, code = c("I560", "K980", "R30", "F500",
"650")), row.names = c(NA, -5L), class = "data.frame")

Erasing value in one variable if value in another variable do not have a match in a list in dplyr

I have a table with two fields:
dd <- data.frame(measure = c("a", "a", "b", "b", "c", "c"), class = c(1,11,2,22,3,33), stringsAsFactors = F)
dd
measure class
1 a 1
2 a 11
3 b 2
4 b 22
5 c 3
6 c 33
Each measure has an associated class. However, not every class is valid for every measure. The only values allowed for each measure are given in a list:
ls <- list(a=c(1,10), b=c(2,20,200), c=c(3,30,90))
ls
$`a`
[1] 1 10
$b
[1] 2 20 200
$c
[1] 3 30 90
I need to erase (replace with NA) the measure wherever the class has no match in the list. I succeeded in base R:
good_match <- mapply(function(xx, yy) any(xx %in% yy), ls[dd$measure], dd$class)
dd$measure[!good_match] <- NA
dd
measure class
1 a 1
2 <NA> 11
3 b 2
4 <NA> 22
5 c 3
6 <NA> 33
However, I would like to do it in dplyr, probably with mutate, so I can pipe it and make it fit better in my script. I've tried:
library(dplyr)
dd %>% mutate(measure = ifelse(any(class %in% ls[[measure]]), measure, NA))
Error in ls[[measure]] : recursive indexing failed at level 2
I have a feeling it fails because of a vectorization problem of some sort, but I'm stuck. Do you know of another, more elegant way of achieving my goal?
We can use a join after converting the named list to a tibble/data.frame
library(tidyverse)
enframe(ls, value = 'class') %>%
unnest(class) %>%
right_join(dd, by = 'class') %>%
transmute(measure = name, class)
# A tibble: 6 x 2
# measure class
# <chr> <dbl>
#1 a 1
#2 <NA> 11
#3 b 2
#4 <NA> 22
#5 c 3
#6 <NA> 33
A base R option would be to use stack (instead of enframe) and merge.
NOTE: ls is the name of a function. It is better not to name objects after existing functions.
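A sketch of the stack idea in base R follows. It deviates from the answer above in two ways that should be read as assumptions: the list is renamed to allowed (a hypothetical name, to avoid masking ls()), and the match is made on the measure and class pair rather than on class alone, so a class that is valid for a different measure does not count as a match.

```r
dd <- data.frame(measure = c("a", "a", "b", "b", "c", "c"),
                 class = c(1, 11, 2, 22, 3, 33),
                 stringsAsFactors = FALSE)
allowed <- list(a = c(1, 10), b = c(2, 20, 200), c = c(3, 30, 90))

# stack() turns the named list into a long table with columns values and ind
lookup <- stack(allowed)
names(lookup) <- c("class", "measure")
lookup$measure <- as.character(lookup$measure)

# A row is valid only if its (measure, class) pair appears in the lookup
ok <- paste(dd$measure, dd$class) %in% paste(lookup$measure, lookup$class)
dd$measure[!ok] <- NA
print(dd)
```

Building the key with paste() keeps the original row order of dd, which merge() would not.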

Matching data from one data frame to another

Firstly, apologies if this question isn't phrased in the best way possible; I am new to this but have tried to make the question clear.
I have two data frames and am trying to take data from one of them and add it as a new column in the other. I have created an example below:
IDa <- c(1,2,3)
score1a <- c(5,10,1)
score2a <- c(NA,8,NA)
score3a <- c(NA,NA,13)
dfa <- data.frame(IDa,score1a,score2a,score3a)
IDb <- c(1,1,1,2,2,3)
timeb <- c(1,2,3,2,3,3)
dfb <- data.frame(IDb,timeb)
score1 corresponds to time 1, score2 to time 2, and score3 to time 3.
What I want to do is match each score to the appropriate time point, for the appropriate ID, and add it as an additional column in dfb.
Hence dfb will have an additional column with 5, NA, NA, 8, NA, 13.
Hope that makes sense, thanks for any help with this!
edit: I should add that, as you can see, the time points available in dfb don't necessarily line up with dfa. For example, data is recorded for ID=2 at time point 1 in dfa, but dfb has nowhere to put it (no row for ID=2, timeb=1), so I need to fill dfb as best as possible with the data in dfa.
You can melt the dfa to long form and then merge with dfb after converting the variable column to match the timeb.
library(reshape2)
merge(dfb,transform(melt(dfa, id.var='IDa', na.rm=TRUE),
variable=as.numeric(factor(variable))),
by.x=c('IDb', 'timeb'), by.y=c('IDa', 'variable'), all.x=TRUE)
# IDb timeb value
#1 1 1 5
#2 1 2 NA
#3 1 3 NA
#4 2 2 8
#5 2 3 NA
#6 3 3 13
Or change the column names to the time points and then do the merge
colnames(dfa)[-1] <- 1:3
merge(dfb, melt(dfa, id.var='IDa'),
by.x=c('IDb', 'timeb'), by.y=c('IDa', 'variable'))
Another option would be:
require(dplyr)
require(tidyr)
gather(dfa, Score, Val, -IDa) %>%
mutate(Score = as.numeric(gsub("[a-zA-Z]","", Score))) %>%
left_join(dfb, ., by = c("IDb" = "IDa", "timeb" = "Score"))
# IDb timeb Val
#1 1 1 5
#2 1 2 NA
#3 1 3 NA
#4 2 2 8
#5 2 3 NA
#6 3 3 13
The steps are similar to akrun's answer but using different functions.
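Because the score columns are in time order, the lookup can also be done without reshaping, by indexing a matrix with (row, column) pairs. This is a base R sketch under that assumption (column k of the score matrix holds time point k); the new column name score is hypothetical.

```r
dfa <- data.frame(IDa = c(1, 2, 3),
                  score1a = c(5, 10, 1),
                  score2a = c(NA, 8, NA),
                  score3a = c(NA, NA, 13))
dfb <- data.frame(IDb = c(1, 1, 1, 2, 2, 3),
                  timeb = c(1, 2, 3, 2, 3, 3))

# Score columns as a matrix: column k corresponds to time point k
scores <- as.matrix(dfa[, c("score1a", "score2a", "score3a")])

# For each row of dfb, find the matching row of dfa by ID
row_idx <- match(dfb$IDb, dfa$IDa)

# A two-column index matrix picks one (row, column) cell per row of dfb
dfb$score <- scores[cbind(row_idx, dfb$timeb)]
print(dfb)
```

Scores in dfa that have no corresponding row in dfb (such as ID 2 at time 1) are simply never looked up, which matches the behaviour described in the question's edit.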
