I have a data frame like the one below, with name and email columns.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
I am looking for a function with which I can check whether the first name or last name matches the email id, and if so mutate a new column to TRUE.
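For reference, a minimal sketch of the general shape being asked for, using stringr and tidyr (the column name match and the helper columns part1/part2 are chosen here purely for illustration):
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  # split the comma-separated name into its two parts, keeping the original column
  separate(name, into = c("part1", "part2"), sep = ",", remove = FALSE) %>%
  # TRUE if either name part appears literally in the email address
  mutate(match = str_detect(email, fixed(part1)) | str_detect(email, fixed(part2))) %>%
  select(name, email, match)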
In base R we can use Map() and sapply() to loop over the names and create a logical vector to append to your df:
Since this code includes a lot of nested apply statements, let me try to explain what's going on. The code is probably best understood when read from the inside out.
# split the names column into first and last name
strsplit(df[,1], ",")
# this next line checks whether one name part (t) occurs in the email address (y)
grepl(t, y, fixed = TRUE)
# wrapped in sapply(), this returns two TRUE/FALSE values per row, one for the first name and one for the last name
# the surrounding Map() applies that check to every row, pairing each split name with its email address
# lastly, the outer sapply() collapses each pair into a single TRUE/FALSE per df entry
Code:
a <- sapply(
  Map(function(x, y) {
    sapply(x, function(t) {
      grepl(t, y, fixed = TRUE)
    })
  }, strsplit(df[, 1], ","), df[, 2]),
  function(p) {
    if (any(p)) TRUE else FALSE
  }
)
# result
cbind(df, a)
name email a
1 maay,bhtr maay.bhtr#email.com TRUE
2 nsgu,nhuts thang nsgu.nhuts#gmail.com TRUE
3 affat,nurfs asfa.1234#gmail.com FALSE
4 nukhyu,biyts nukhyu.biyts#gmail.com TRUE
5 ngyst,muun ngyst.muun#gmail.com TRUE
6 nsgyu,noon nsgyu.noon#gmail.com TRUE
7 utrs guus,book utrs.book#hotmail.com TRUE
8 thum,cryant thum.cryant#live.com TRUE
9 mumt,cant mumt.cant#gmail.com TRUE
10 bhan,btan bhan.btan#gmail.com TRUE
11 khtri,ntuk khtri.ntuk#gmail.c.om TRUE
12 ghaan,rstu chang.lee#gmail.com FALSE
13 shaan,btqaan shaan.btqaan#gmail.com TRUE
14 nhue,bjtraan nhue.bjtraan#gmail.com TRUE
15 wutys,cyun wutys.cyun#gmailcom TRUE
16 hrtsh,jaan hrtsh.jaan#gmail.com TRUE
Maybe you can try
within(
  df,
  consistent <- mapply(
    function(x, y) 1 - any(mapply(grepl, x, y) | mapply(grepl, y, x)),
    strsplit(name, ","),
    strsplit(gsub("#.*", "", email), "\\.")
  )
)
which gives
name email consistent
1 maay,bhtr maay.bhtr#email.com 0
2 nsgu,nhuts thang nsgu.nhuts#gmail.com 0
3 affat,nurfs asfa.1234#gmail.com 1
4 nukhyu,biyts nukhyu.biyts#gmail.com 0
5 ngyst,muun ngyst.muun#gmail.com 0
6 nsgyu,noon nsgyu.noon#gmail.com 0
7 utrs guus,book utrs.book#hotmail.com 0
8 thum,cryant thum.cryant#live.com 0
9 mumt,cant mumt.cant#gmail.com 0
10 bhan,btan bhan.btan#gmail.com 0
11 khtri,ntuk khtri.ntuk#gmail.c.om 0
12 ghaan,rstu chang.lee#gmail.com 1
13 shaan,btqaan shaan.btqaan#gmail.com 0
14 nhue,bjtraan nhue.bjtraan#gmail.com 0
15 wutys,cyun wutys.cyun#gmailcom 0
16 hrtsh,jaan hrtsh.jaan#gmail.com 0
You could do this as follows - code commented below.
df <- data.frame(name=c("maay,bhtr","nsgu,nhuts thang","affat,nurfs","nukhyu,biyts","ngyst,muun","nsgyu,noon","utrs guus,book","thum,cryant","mumt,cant","bhan,btan","khtri,ntuk","ghaan,rstu","shaan,btqaan","nhue,bjtraan","wutys,cyun","hrtsh,jaan"),
email=c("maay.bhtr#email.com","nsgu.nhuts thang#gmail.com","asfa.1234#gmail.com","nukhyu.biyts#gmail.com","ngyst.muun#gmail.com","nsgyu.noon#gmail.com","utrs guus.book#hotmail.com","thum.cryant#live.com","mumt.cant#gmail.com","bhan.btan#gmail.com","khtri.ntuk#gmail.c.om","chang.lee#gmail.com","shaan.btqaan#gmail.com","nhue.bjtraan#gmail.com","wutys.cyun#gmailcom","hrtsh.jaan#gmail.com"))
library(stringr)
library(dplyr)
## extract all of the names: any string of letters unbroken by a space, punctuation, or a number
names <- str_extract_all(df$name, "[A-Za-z]*") %>%
## make a matrix out of the names
do.call(rbind, .) %>%
## turn the names into a data frame
as.data.frame()
## some of the columns have all "" in them, find which ones are all ""
w <- sapply(names, function(x)all(x == ""))
## if any of the columns are all "" then ...
if(any(w)){
## remove those columns from the dataset
names <- names[,-which(w)]
}
## add email into this dataset that has the individual names
names$email <- df$email
library(tidyr)
## pipe the names dataset (which has individual names and an e-mail address)
out <- names %>%
## switch from wide to long format
pivot_longer(-email, names_to="V", values_to="n") %>%
## create consistent = 1 if the name is not detected in the e-mail
mutate(consistent = !str_detect(email, n)) %>%
## group the data by e-mail
group_by(email) %>%
## take the maximum of consistent by group
## this will be 1 if any of the names are not detected in the e-mail
summarise(consistent = max(consistent)) %>%
## join back together with the original data
left_join(df) %>%
## change the variable ordering back
select(name, email, consistent)
out
# # A tibble: 16 x 3
# name email consistent
# <chr> <chr> <int>
# 1 affat,nurfs asfa.1234#gmail.com 1
# 2 bhan,btan bhan.btan#gmail.com 0
# 3 ghaan,rstu chang.lee#gmail.com 1
# 4 hrtsh,jaan hrtsh.jaan#gmail.com 0
# 5 khtri,ntuk khtri.ntuk#gmail.c.om 0
# 6 maay,bhtr maay.bhtr#email.com 0
# 7 mumt,cant mumt.cant#gmail.com 0
# 8 ngyst,muun ngyst.muun#gmail.com 0
# 9 nhue,bjtraan nhue.bjtraan#gmail.com 0
# 10 nsgu,nhuts thang nsgu.nhuts thang#gmail.com 0
# 11 nsgyu,noon nsgyu.noon#gmail.com 0
# 12 nukhyu,biyts nukhyu.biyts#gmail.com 0
# 13 shaan,btqaan shaan.btqaan#gmail.com 0
# 14 thum,cryant thum.cryant#live.com 0
# 15 utrs guus,book utrs guus.book#hotmail.com 0
# 16 wutys,cyun wutys.cyun#gmailcom 0
#
Note, I had to change two of the values of e-mail in your dataset to match the image you posted.
Related
How do I find the length of the highest repeated character in a string
col1       repeated letter   repeated number
apples333  2                 3
summer13   2                 0
talk77     0                 2
Aa6668     2                 3
I can use lengths(regmatches(str, gregexpr("a", str))) or str_count(str, "a"), but the idea is to automatically check which is the highest repeating character/number and return its count.
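A minimal sketch of what those two calls do for one hand-picked character (the goal, by contrast, is to find the most repeated letter/number without naming it first):
str <- "apples333"
lengths(regmatches(str, gregexpr("3", str)))  # 3: counts every occurrence of "3"
stringr::str_count(str, "3")                  # 3: the same count via stringr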
Using rle() together with the raw conversion functions charToRaw()/rawToChar():
d <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668"))
foo <- function(x, p){
r <- rle(charToRaw(tolower(x)))
res <- max(r$lengths[ grepl(p, rawToChar(r$values, multiple = TRUE)) ])
if(res == 1) res <- 0
res
}
d$repLetter <- sapply(d$col1, foo, p = "[a-z]")
d$repNumber <- sapply(d$col1, foo, p = "[0-9]")
d
# col1 repLetter repNumber
# 1 apples333 2 3
# 2 summer13 2 0
# 3 talk77 0 2
# 4 Aa6668 2 3
There is probably an elegant regex-based solution for this (obviously I am not a big regex-er). The following is based on determining the run length of a vector using the base rle() function, i.e. counting the repetition of elements.
As a strategy, we develop a function to work on a single string input providing the different portions and associated occurrences/counts. Then, to operate over several input strings, we apply (loop) a function to each element of the input vector.
single loop
Let's see how rle() works:
x <- "abba" # a test string - who does not know ABBA
x_split <- strsplit(x, "") %>% unlist # split the string, unlist to coerce vector
x_rle <- rle(x_split) # apply rle()
# now let's check what we have
x_rle
Run Length Encoding
lengths: int [1:3] 1 2 1
values : chr [1:3] "a" "b" "a"
rle() returns a list. As you want to filter, etc. on your results, it might be easier to turn this into a data frame. We also store the actual input.
With a view to apply this to other strings (e.g. loop over input vector), we wrap this into a function call:
library(dplyr)
check_rle_char_num <- function(x){
  # split the string into single characters and count runs
  x_split <- strsplit(x, "") %>% unlist()
  x_rle <- rle(x_split)
  # turn it into a tibble
  df <- with(x_rle, tibble(values, lengths)) %>%
    # ----------- store the input string and check for chars/numerics
    mutate( input = x
          , is_num = grepl(pattern = "[0-9]", values) # logical check for numbers
          ) %>%
    # ----------- order output tibble
    select(input, everything())
  df
}
Check that it works:
> ( check_rle_char_num("Appllles44777") )
# A tibble: 7 x 4
input values lengths is_num
<chr> <chr> <int> <lgl>
1 Appllles44777 A 1 FALSE
2 Appllles44777 p 2 FALSE
3 Appllles44777 l 3 FALSE
4 Appllles44777 e 1 FALSE
5 Appllles44777 s 1 FALSE
6 Appllles44777 4 2 TRUE
7 Appllles44777 7 3 TRUE
We have all the pieces on which you can filter, select, etc. your desired output.
loop over multiple input strings
We use tidyverse's {purrr} package for this.
# multiple input strings
my_strings <- c("apples333", "summer13","talk77","Aa6668","Appllles44777")
# loop over my_strings
library(purrr)
test <- my_strings %>%
map_dfr(.f = ~ check_rle_char_num(.x)) # map_dfr returns a data frame
test
# A tibble: 29 x 4
input values lengths is_num
<chr> <chr> <int> <lgl>
1 apples333 a 1 FALSE
2 apples333 p 2 FALSE
3 apples333 l 1 FALSE
4 apples333 e 1 FALSE
5 apples333 s 1 FALSE
6 apples333 3 3 TRUE
7 summer13 s 1 FALSE
8 summer13 u 1 FALSE
9 summer13 m 2 FALSE
10 summer13 e 1 FALSE
final push, filter, and reshape a nice output tibble
# per problem statement - filter for maximum and min 2 counts (i.e. > 1)
result <- test %>%
group_by(input, is_num) %>%
filter(lengths == max(lengths), lengths > 1)
> result
# A tibble: 7 x 4
# Groups: input, is_num [7]
input values lengths is_num
<chr> <chr> <int> <lgl>
1 apples333 p 2 FALSE
2 apples333 3 3 TRUE
3 summer13 m 2 FALSE
4 talk77 7 2 TRUE
5 Aa6668 6 3 TRUE
6 Appllles44777 l 3 FALSE
7 Appllles44777 7 3 TRUE
To emulate the results listed in the problem statement, one can reshuffle the columns and provide "nice" column names:
library(tidyr) # for reshuffling
result %>%
tidyr::pivot_wider( names_from = is_num
, values_from = c(values, lengths)
) %>%
#---------- after pivoting wider, the new column names combine the previous column names with TRUE/FALSE - mind that TRUE marks the numbers
rename( char = values_FALSE
, char_count = lengths_FALSE
, nums = values_TRUE
, nums_count = lengths_TRUE) %>%
#---------- changing order of columns for nice output
select(input, starts_with("char"), starts_with("num"))
# A tibble: 5 x 5
# Groups: input [5]
input char char_count nums nums_count
<chr> <chr> <int> <chr> <int>
1 apples333 p 2 3 3
2 summer13 m 2 NA NA
3 talk77 NA NA 7 2
4 Aa6668 NA NA 6 3
5 Appllles44777 l 3 7 3
final notes
The solution presented
does the filtering on the result data frame (after the loop). If there are no other operations on your data, you can lift this into the function.
does not clean the NAs in the final output. If you need zeros for no letter or no number, you can replace the NAs (see the sketch after this list).
keeps characters and numbers in a single data frame. Obviously, you can split them. One could combine both again based on a join() or bind_cols() on the input-variable. This saves the pivot-wider bit.
does not care for "ties", i.e. you have a sequence of multiple characters and/or numbers with the same count. You may have to handle this.
Last but not least: simplify the code if you do not need all of the columns/variables kept in the tibble.
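For the NA note above, a minimal sketch (assuming the wide tibble produced by the pivot_wider()/rename() step is stored as result_wide, a name not used in the original code):
library(dplyr)
library(tidyr)
result_wide %>%
  mutate( char_count = replace_na(char_count, 0L)   # no repeated letter -> 0
        , nums_count = replace_na(nums_count, 0L)   # no repeated number -> 0
        )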
Solution
You can go with this:
library(stringr)
max_freq <- Vectorize(function(x) max(tabulate(factor(x))))
df$repeated_letter <- max_freq(str_extract_all(str_to_lower(df$col1), "[:alpha:]"))
df$repeated_number <- max_freq(str_extract_all(str_to_lower(df$col1), "[:digit:]"))
df
#> col1 repeated_letter repeated_number
#> 1 apples333 2 3
#> 2 summer13 2 1
#> 3 talk77 1 2
#> 4 Aa6668 2 3
#> 5 Appllles44777 3 3
Explanation
Following is a step-by-step breakdown of the solution, with some explanations:
# take your column
df$col1 |>
# set to lower so A and a is the same character
str_to_lower() |>
# extract only letters or digits as list of vectors
str_extract_all("[:alpha:]") |>
# get frequency table for each vector
lapply(factor) |> lapply(tabulate) |>
# extract the count of most repeated letter for each table and return a vector
sapply(max)
#> [1] 2 2 1 2 3
Data
Where df is:
df <- data.frame(col1 = c("apples333", "summer13", "talk77", "Aa6668", "Appllles44777"))
Warnings
When there are no repeated characters, 1 will be returned, which is actually the more consistent answer, since the most repeated character occurs exactly once. If you prefer zero, you can replace all ones with zeros.
In case of no letters or no numbers at all, -Inf will be returned. If you want a different result (like zero), you can replace it. Your example does not include such a case.
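A minimal sketch of both replacements, assuming the df with repeated_letter/repeated_number built above:
# treat "no repetition" (1) and "no such character class" (-Inf) as zero
df$repeated_letter <- ifelse(is.finite(df$repeated_letter) & df$repeated_letter > 1, df$repeated_letter, 0)
df$repeated_number <- ifelse(is.finite(df$repeated_number) & df$repeated_number > 1, df$repeated_number, 0)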
Though late to the party, this method might still be of interest:
library(tidyr)
library(stringr)
library(dplyr)
d %>%
# count the number of character repetitions:
mutate(
# for letters:
dup_w = lapply(str_extract_all(col1, "(?i)([a-z])\\1+"), nchar),
# for numbers:
dup_n = lapply(str_extract_all(col1, "([0-9])\\1+"), nchar)) %>%
# throw all repetition counts into a single column:
pivot_longer(c(dup_w, dup_n)) %>%
# show items in list:
unnest(cols = value) %>%
# group:
group_by(col1, name) %>%
# reduce dataframe to maximum values per group:
filter(value == max(value)) %>%
# widen the dataframe back to original format:
pivot_wider(names_from = name, values_from = value)
# A tibble: 5 x 3
# Groups: col1 [5]
col1 dup_w dup_n
<chr> <int> <int>
1 11applesssss333 5 3
2 summer13 2 NA
3 talk77 NA 2
4 Aa6668 2 3
5 Appllles44777 3 3
Data (with lots more repetitions to make things clearer):
d <- data.frame(col1 = c("11applesssss333", "summer13", "talk77",
"Aa6668", "Appllles44777"))
In this example I would like to use just one column of a data.frame:
My selected column should be divided into partitions of 70 rows each.
For example: 1..70 / 71..140 / 141..210, up to N = 65,000.
Output: for each subset, specific functions should store different attributes.
In this special case I would like to store $MSE and $ME from the verify() function of the verification package. To outline this process again:
All I want is to go over my column in chunks of 70 rows;
use the verify() function;
and store some attributes in a new data.frame.
ID              MSE    ME
1 (1 to 70)     0.3    0.6
2 (71 to 140)   0.2    0.5
3 (141 to 210)  0.25   0.76
...             ...    ...
I have tried the following, but I can't work out how to store my attributes per partition as explained above.
set.seed(1) # reproducible data
df <- as.data.frame(runif(65000,0,1))
probabilities.to.check.against <- runif(70,0,1)
store.as.df <- df[1] %>%
mutate(ID = floor((row_number()-1)/70)) %>% # I'm trying to select partitions every 70 rows
group_by(ID) %>%
verify(probabilities.to.check.against, PARTIOTIONS_OF_DF, frcst.type = "cont", obs.type = "cont")
Give this a shot
# split your data
runs <- 70
sdf <- split(df, 0:(nrow(df)-1) %/% runs)
# Validate split
library(purrr)
head(map_dbl(sdf, ~nrow(.x)))
# 0 1 2 3 4 5
# 70 70 70 70 70 70
# Answer
ans <- map_df(sdf, ~as.data.frame(verify(probabilities.to.check.against, .x[,1], frcst.type = "cont", obs.type = "cont")[c("MSE","ME")]), .id="id")
# Output
# id MSE ME
# 1 0 0.1326722 5.145940e-03
# 2 1 0.1662103 -3.211852e-02
# 3 2 0.1522823 1.594105e-02
# 4 3 0.1485422 -1.069273e-01
# 5 4 0.1714966 1.595200e-03
# 6 5 0.2195108 1.866164e-03
# 7 6 0.1942890 -1.029523e-02
# 8 7 0.1730359 4.800538e-04
# 9 8 0.1432483 1.843559e-02
# 10 9 0.1554882 -6.644684e-03
# 11 10 0.1895140 -3.035421e-02
# # etc
Change id to the format you specified, if you want, with
ans$id <- c(paste0(head(as.numeric(ans$id),-1)*runs+1, "-", tail(as.numeric(ans$id),-1)*runs), tail(as.numeric(ans$id),1)*runs+1)
# [1] "1-70" "71-140" "141-210" "211-280" "281-350"
I am attempting to create a new data.frame object that is composed of the columns of the old data.frame with every row set at a given value (for this example, I will use 7). I am running into problems, however, naming the variables of the new data.frame object. Here is what I'm trying to do:
my.df <- data.frame(x = 1:3, y123=4:6)
my.df
x y123
1 1 4
2 2 5
3 3 6
Then when I attempt to assign these variable names to the new data.frame:
the.names <- names(my.df)
for(i in 1:length(the.names)){
data.frame(
the.names[i] = 7
)
}
But this throws errors with unexpected =, ), and }. This would usually make me suspect a typo, but I've reviewed this several times and can't find anything. Any ideas?
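As a side note, here is a hedged sketch of the direct construction the loop seems to be aiming for, building a one-row data frame straight from the names vector (the object name df.new is chosen purely for illustration):
the.names <- names(my.df)
df.new <- as.data.frame(setNames(as.list(rep(7, length(the.names))), the.names))
df.new
##   x y123
## 1 7    7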
The easiest way is probably to just copy the old data.frame in its entirety, and then assign every cell to your new value:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df;
df2[] <- 7;
df2;
## x y123
## 1 7 7
## 2 7 7
## 3 7 7
If you only want one row in the new data.frame, you can index only the top row when making the copy:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df[1,];
df2[] <- 7;
df2;
## x y123
## 1 7 7
Edit: Here's how you can set each column to a different value:
df <- data.frame(x=1:3,y123=4:6);
df2 <- df;
df2[] <- rep(c(7,31),each=nrow(df2));
df2;
## x y123
## 1 7 31
## 2 7 31
## 3 7 31
You can use within:
> within(my.df, {
+ assign(the.names[1], 0)
+ assign(the.names[2], 1)
+ })
x y123
1 0 1
2 0 1
3 0 1
I have a data set where a bunch of categorical variables were converted to dummy variables (all classes used, NOT n-1) and some were not. I'm trying to recode them in a single column.
For instance
Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1
Is there a simple way to convert this to:
Q1 Q2 Q3
1 3 2
2 4 1
3 2 2
Right now I'm just using strsplit() (as all the dummied variable names contain '.') with a couple loops but feel like there should be a better way. Any suggestions?
I wrote a function a while back that did this sort of thing.
MultChoiceCondense<-function(vars,indata){
tempvar<-matrix(NaN,ncol=1,nrow=length(indata[,1]))
dat<-indata[,vars]
for (i in 1:length(vars)){
for (j in 1:length(indata[,1])){
if (dat[j,i]==1) tempvar[j]=i
}
}
return(tempvar)
}
If your data is called Dat, then:
Dat$Q1<-MultChoiceCondense(c("Q1.1","Q1.2","Q1.3"),Dat)
Here's an approach that uses melt from "reshape2" and cSplit from my "splitstackshape" package along with some "data.table" fun. I've loaded dplyr so that we can pipe all the things.
library(splitstackshape)
library(reshape2)
library(dplyr)
mydf %>%
as.data.table(keep.rownames = TRUE) %>% # Convert to data.table. Keep rownames
melt(id.vars = "rn", variable.name = "V") %>% # Melt the dataset by rownames
.[value > 0] %>% # Subset for all non-zero values
cSplit("V", ".") %>% # Split the "V" column (names) by "."
.[is.na(V_2), V_2 := value] %>% # Replace NA values with actual values
dcast.data.table(rn ~ V_1, value.var = "V_2") # Go wide.
# rn Q1 Q2 Q3
# 1: 1 1 3 2
# 2: 2 2 4 1
# 3: 3 3 2 2
Here's a possible base R approach:
## Which columns are binary?
Bins <- sapply(mydf, function(x) {
all(x %in% c(0, 1))
})
## Two vectors -- part after the dot and before
X <- gsub(".*\\.(.*)$", "\\1", names(mydf)[Bins])
Y <- unique(gsub("(.*)\\..*$", "\\1", names(mydf)[Bins]))
## Use `apply` to subset the X value based on the
## logical version of the binary variable
cbind(mydf[!Bins],
`colnames<-`(t(apply(mydf[Bins], 1, function(z) {
X[as.logical(z)]
})), Y))
# Q2 Q1 Q3
# 1 3 1 2
# 2 4 2 1
# 3 2 3 2
At the end, you can just reorder the columns as required. You may also need to convert them to numeric since in this case, Q1 and Q3 will be factors.
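For example, assuming the combined data frame above is stored as out (a name not used in the original code), the reordering and conversion could look like this:
out <- out[c("Q1", "Q2", "Q3")]             # reorder the columns
out$Q1 <- as.numeric(as.character(out$Q1))  # factor -> numeric
out$Q3 <- as.numeric(as.character(out$Q3))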
Another base R approach:
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")
## this will take all the unique questions; Q1, Q2, Q3; test if
## they are dummies; and return the column if so or find which
## dummy column is a 1 otherwise
res <- lapply(unique(gsub('\\..*', '', names(dat))), function(x) {
tmp <- dat[, grep(x, names(dat)), drop = FALSE]
if (ncol(tmp) == 1) unlist(tmp, use.names = FALSE) else max.col(tmp)
})
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 3 4 2
#
# [[3]]
# [1] 2 1 2
do.call('cbind', res)
# [,1] [,2] [,3]
# [1,] 1 3 2
# [2,] 2 4 1
# [3,] 3 2 2
I'm assuming your data looks like this, where the categorical columns are encoded using a dot at the end. You may also have a case where all of the values in a row are zero, which indicates a base level (such as how dummyVars in caret works with fullRank=FALSE). If so, here is a vectorized solution.
library(dplyr)
dummyVars.undo = function(df, col_prefix) {
if (!endsWith(col_prefix, '.')) {
# If col_prefix doesn't end with a period, include one, but save the
# "pretty name" as the one without a period
pretty_col_prefix = col_prefix
col_prefix = paste0(col_prefix, '.')
} else {
# Otherwise, strip the period for the pretty column name
pretty_col_prefix = substr(col_prefix, 1, nchar(col_prefix)-1)
}
# Get all columns with that encoding prefix
cols = names(df)[names(df) %>% startsWith(col_prefix)]
# Find the rows where all values are zero. If this isn't the case
# with your data there's no worry, it won't hurt anything.
base_level.idx = rowSums(df[cols]) == 0
# Set the column value to a base value of zero
df[base_level.idx, pretty_col_prefix] = 0
# Go through the remaining columns and find where the maximum value (1) occurs
df[!base_level.idx, pretty_col_prefix] = cols[apply(df[!base_level.idx, cols], 1, which.max)] %>%
strsplit('\\.') %>%
sapply(tail, 1)
# Drop the encoded columns
df[cols] = NULL
return(df)
}
Usage:
# Collapse Q1
df = dummyVars.undo(df, 'Q1')
# Collapse Q3
df = dummyVars.undo(df, 'Q3')
This uses dplyr, but only for the pipe operator %>%. You could certainly remove that if you'd prefer to do base R instead.
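For instance, the two piped expressions inside the function could be rewritten without the pipe; a minimal sketch:
# instead of: names(df)[names(df) %>% startsWith(col_prefix)]
cols = names(df)[startsWith(names(df), col_prefix)]
# instead of the piped strsplit()/sapply() chain
hits = cols[apply(df[!base_level.idx, cols], 1, which.max)]
df[!base_level.idx, pretty_col_prefix] = sapply(strsplit(hits, '\\.'), tail, 1)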
I have a data frame with url strings and am using the stringr package in R to produce new columns with a boolean indicating whether the string contains an element or not.
library(stringr)
url = data.frame(u=c("http://www.subaru.com/vehicles/impreza/index.html",
"http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
"http://www.subaru.com/customer-support.html",
"http://www.subaru.com/",
"http://www.subaru.com/vehicles/forester/index.html"))
url
cs = c("customer-support")
f = c("forester")
one_match <- str_c(cs, collapse = "|")
two_match <- str_c(f, collapse = "|")
main <- function(df) {
df$customer_support <- as.numeric(str_detect(url$u, one_match))
df
}
d1 = main(url)
main <- function(df) {
df$forester <- as.numeric(str_detect(url$u, two_match))
df
}
d2 = main(url)
mydt = plyr::join(d1, d2)   # join() comes from the plyr package
mydt
The above code produces the following results.
mydt
u
1 http://www.subaru.com/vehicles/impreza/index.html
2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
3 http://www.subaru.com/customer-support.html
4 http://www.subaru.com/
5 http://www.subaru.com/vehicles/forester/index.html
customer_support forester
1 0 0
2 0 0
3 1 0
4 0 0
5 0 1
What I want to do is reshape the data frame so that columns 2 and 3 are combined into a single column and are no longer boolean values.
It should look like:
page
0
0
customer_support
0
forester
I've tried many different things, including variations of reshape, transform, dcast, etc., and nothing seems to get the job done. Can anyone help me get the desired output?
You don't need to write such complicated functions. You can simply use the grepl and ifelse functions, as below:
urldata = data.frame(u = c("http://www.subaru.com/vehicles/impreza/index.html", "http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602",
"http://www.subaru.com/customer-support.html", "http://www.subaru.com/", "http://www.subaru.com/vehicles/forester/index.html"))
cs = c("customer-support")
f = c("forester")
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
urldata$page <- ifelse(grepl(cs, urldata$u), cs, ifelse(grepl(f, urldata$u), f, 0))
urldata
## u
## 1 http://www.subaru.com/vehicles/impreza/index.html
## 2 http://www.subaru.com/index.html?s_kwcid=subaru&k_clickid=214495e6-dbe0-6668-9222-00003d7cd876&prid=87&k_affcode=76602
## 3 http://www.subaru.com/customer-support.html
## 4 http://www.subaru.com/
## 5 http://www.subaru.com/vehicles/forester/index.html
## page
## 1 0
## 2 0
## 3 customer-support
## 4 0
## 5 forester
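If more page types are added later, the nested ifelse calls get unwieldy; one hedged alternative is to keep the patterns in a vector and loop over it (the name patterns is chosen here for illustration):
patterns <- c(cs, f)
urldata$page <- "0"
for (p in patterns) {
  urldata$page[grepl(p, urldata$u, fixed = TRUE)] <- p
}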