Using paste to create logical expression for data frame subset - r

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender   = sample(c("Male", "Female"), size = 1500, replace = TRUE),
                  RateeAgeGroup = sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
                  Relationship  = sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
                  X = rnorm(n = 1500, mean = 0, sd = 1),
                  y = rnorm(n = 1500, mean = 0, sd = 1),
                  z = rnorm(n = 1500, mean = 0, sd = 1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender   = c("Male", "Female"),
                      RateeAgeGroup = c("18-39", "40-49", "50+"),
                      Relationship  = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36, 34, 72, 58, 47, 38, 18, 18, 15, 22, 17, 10, 24, 28, 11, 27, 15, 25, 72, 70, 52, 43, 21, 27)
# For each row of remove (combination of factor levels):
for (i in 1:nrow(remove)) {
  selection <- character()
  # For each column of remove (particular selection):
  for (j in 1:(ncol(remove) - 1)) {
    add <- paste0("dat$", names(remove)[j], ' == "', remove[i, j], '" & ')
    selection <- paste0(selection, add)
  }
  selection <- sub(' & $', '', selection)  # Remove trailing ampersand
  cat(selection, sep = "\n")               # What does the selection string look like?
  tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", ], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
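For reference, the formula version of that paste() trick, which does work, looks like this (a minimal sketch using dat's columns):
f <- as.formula(paste("y ~", paste(c("X", "z"), collapse = " + ")))
lm(f, data = dat)  # works because lm() parses the formula object itself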
Any advice would be really appreciated!

You cannot use character expressions as you describe to subset with either [ or subset: dat[selection, ] treats the whole string as a single row name, which is why every subset comes back as NAs. If you wanted to do that you would have to construct the entire expression and then use eval (a sketch of that route is at the end of this answer). That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
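For completeness, a minimal sketch of the eval route mentioned above (assuming selection holds the string built in the question's loop; it works, but it is fragile and rarely worth it):
# parse() turns the string into an unevaluated expression;
# eval() runs it, producing the logical vector that [ actually needs
tmp <- dat[eval(parse(text = selection)), ]
And a hedged data.table sketch of the same join, assuming the shared column names are the keys:
library(data.table)
DT  <- as.data.table(dat)
REM <- as.data.table(remove)
# rows of DT matching the first two combinations in REM
DT[REM[1:2], on = c("RateeGender", "RateeAgeGroup", "Relationship")]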

I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is an elegant solution to a different question, so I'm not reversing my vote. And you could use a similar split strategy, as illustrated below, with that data.frame as the input.)
sub <- lapply(split(dat, with(dat, paste(RateeGender,      # split vector
                                         RateeAgeGroup,
                                         Relationship, sep = "_"))),
              function(d) {
                n <- with(remove, cases[RateeGender   == d$RateeGender[1] &
                                        RateeAgeGroup == d$RateeAgeGroup[1] &
                                        Relationship  == d$Relationship[1]])
                cat(n)
                d[sample(nrow(d), n, replace = TRUE), ]  # sample n rows; sample(d, n) would sample columns
              })
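If a single data frame of the sampled rows is wanted afterwards, one way (a small sketch) is to stack the list back together:
sampled <- do.call(rbind, sub)  # one data frame, rows grouped by combination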

Related

How to run a function (many times) that changes a variable (tibble) in the global env

I'm a newbie in R, so please have some patience and... tips are most welcome.
My goal is to create a tibble that holds a "Full Name" (of a person, who may have 2 to 4 names) and his/her gender. I must start from a tibble that contains typical Male and Female names.
Below I present a minimum working example.
My problem: I can call get_name() multiple times (in a 10,000-iteration for loop!!) and get the right answer. But I was looking for a more 'elegant' way of doing it. replicate() unfortunately returns a vector... which makes it unusable.
My doubts: I know I have some (very few... right!!) issues, like the if statement that is evaluated every time (which is redundant), but I can't find another way to do it. Any suggestion?
Any other suggestions about code struct are also welcome.
Thank you very much in advance for your help.
# Dummy name list (tribble, add_row and %>% come from the tidyverse)
library(tidyverse)
unit_names <- tribble(
~Women, ~Man,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
set.seed(12345) # seed for test
# Create a tibble with the full names
full_name <- tibble("Full Name" = character(), "Gender" = character() )
get_name <- function() {
  # Get the number of 'unit names' to compose a 'full name'
  nbr_names <- sample(2:4, 1, replace = TRUE)
  # Randomize the gender
  gender <- sample(c("Women", "Man"), 1, replace = TRUE)
  if (gender == "Women") {
    lim_names <- sum(!is.na(unit_names$"Women"))
  } else {
    lim_names <- sum(!is.na(unit_names$"Man"))
  }
  # Sample the Fem/Man list names (may have duplicates)
  sample(unlist(unit_names[1:lim_names, gender]), nbr_names, replace = TRUE) %>%
    # Form a full name
    paste(., collapse = " ") %>%
    # Add it to the tibble (INCLUDE the gender)
    add_row(full_name, "Full Name" = ., "Gender" = gender)
}
# How can I make 10k of this?
full_name <- get_name()
If you pass a number larger than 1 to sample, this problem becomes easy to vectorise.
One thing that currently makes your problem much harder is the layout of your unit_names table: you are effectively treating male and female names as individually paired, but they clearly aren't, so they shouldn't be in columns of the same table. Use a list of two vectors, for instance:
unit_names = list(
Women = c("fem1", "fem2", "fem3", "fem4", "fem5", "fem6", "fem7"),
Men = c("male1", "male2", "male3", "male4", "male5")
)
Then you can generate random names to your heart’s delight:
generate_names = function (n, unit_names) {
    name_length = sample(2 : 4, n, replace = TRUE)
    genders = sample(c('Women', 'Men'), n, replace = TRUE)
    names = Map(sample, unit_names[genders], name_length, replace = TRUE) %>%
        lapply(paste, collapse = ' ') %>%
        unlist()
    tibble(`Full name` = names, Gender = genders)
}
A note on style: unlike your function, the above doesn't use any global variables. Furthermore, don't "quote" variable names (you do this in unit_names$"Women" and in the arguments of add_row). R allows this, but it is arguably a mistake in the language specification: these are not strings, they're variable names, and making them look like strings is misleading. You don't quote your other variable names, after all. You do need to backtick-quote the `Full name` column name, since it contains a space, but the use of backticks, rather than quotes, signals that this is a variable name.
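As an aside on the replicate() dead end mentioned in the question: replicate() only collapses to a vector because it simplifies by default. With simplify = FALSE it returns a list, so the original one-row get_name() could also be repeated and stacked (a hedged sketch; tidier than the loop, though not faster, since it still makes 10,000 calls):
full_name <- dplyr::bind_rows(replicate(10000, get_name(), simplify = FALSE))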
I am not 100% sure what you are trying to get, but if I got it right... did you try mutate from dplyr? For example:
result <- mutate(your_data_frame,
                 concatenated_column = paste(column1, column2, column3, column4, sep = '_'))
With a LITTLE help from Konrad Rudolph, here is the elegant (and vectorized... and fast) solution I was looking for. map2 does the necessary trick.
Here is the full working example if someone needs it:
(Just a side note: I kept the initial conversion from tibble to list because the data arrives to me as a tibble...)
Once again thanks to Konrad.
# Dummy name list
library(tibble)  # tribble()
library(purrr)   # map2()
unit_names <- tribble(
~Women, ~Men,
"fem1", "male1",
"fem2", "male2",
"fem3", "male3",
"fem4", "male4",
"fem5", "male5",
"fem6", NA,
"fem7", NA
)
name_list <- list(
Women = unit_names$Women[!is.na(unit_names$Women)],
Men = unit_names$Men[!is.na(unit_names$Men)]
)
generate_names = function (n, name_list) {
    name_length = sample(2 : 4, n, replace = TRUE)
    genders = sample(c('Women', 'Men'), n, replace = TRUE)
    # names = lapply(name_list[genders], sample, name_length) %>%  # earlier, non-vectorised attempt
    names = map2(name_list[genders], name_length, sample) %>%
        lapply(paste, collapse = ' ') %>%
        unlist()
    tibble(`Full name` = names, Gender = genders)
}
full_name <- generate_names(10000, name_list)

Difference between two columns with ;-separated values in R

I am a beginner in R and while trying to make some exercises I got stuck in one of them. My data.frame is as follow:
LanguageWorkedNow    LanguageNextYear
Java; PHP            Java; C++; SQL
C;C++;JavaScript;    JavaScript; C; SQL
I need to know which values are in LanguageNextYear but not in LanguageWorkedNow, so I can build a list of the differences.
Sorry if the question is duplicated; I'm quite new here and tried to find an existing answer, but with no success.
base R
Idea: strsplit both columns on ";", mapply setdiff over the resulting pairs, then paste each result back together using collapse=";":
df$New <- with(df, {
  a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
  sapply(a, paste, collapse = ";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
               sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
                      paste, collapse = ";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
                  df$LanguageNextYear,
                  ~{x1 = unlist(strsplit(.x, split = ";"))
                    x2 = unlist(strsplit(.y, split = ";"))
                    paste0(x2[!x2 %in% x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row you take the two columns and create vectors of values (split on ";"). Then you check which values of the NextYear vector don't exist in the WorkedNow vector and build a string combining those values.
The map function family will help you apply your logic/function to each row. In our case we use map2_chr, as we have two inputs (your two columns) and we expect a string/character output.

Find and replace with sample from a list

I have a dataframe called 'out' with the following strings in it.
out<-data.frame(c("Normal","Normal","Abnormal","Normal","Abnormal","Abnormal","Normal","Abnormal"))
I want to replace "Normal" with a string sampled from a list, as follows:
mychoices <- c("Really bad", "so so", "Actually OK")
I have tried:
str_replace_all(out[,1],"Normal", as.character(sample(mychoices,1,replace=F)))
but it replaces every match with the same single draw from the list. I tried wrapping it in a function as well:
out2 <- apply(out, 1, function(x) {
if (stringr::str_detect(x, "Normal")) {
return(str_replace_all(out[,1],"Normal", as.character(sample(mychoices,1,replace=F))))
}
})
But it returns lists within a dataframe.
This is one way of doing what I think you want. I changed your data structure a little to make it easier to work with (gave the column a name, and set stringsAsFactors = FALSE)
out <- data.frame(abornorm = c("Normal","Normal","Abnormal","Normal","Abnormal","Abnormal","Normal","Abnormal"), stringsAsFactors = FALSE)
out$abornorm[out$abornorm == "Normal"] <- sample(c("Really bad", "so so", "Actually OK"), sum(out$abornorm == "Normal", na.rm = TRUE), replace = TRUE)
This takes advantage of the ability to assign to a set of indices of a vector, provided your source and target are of the same length.
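If the match needs to be a pattern rather than exact equality (closer to the str_detect attempt in the question), the same indexed-assignment idea works with a grepl index; a sketch, with the pattern anchored so "Abnormal" is not caught:
idx <- grepl("^Normal$", out$abornorm)  # logical index of pattern matches
out$abornorm[idx] <- sample(c("Really bad", "so so", "Actually OK"), sum(idx), replace = TRUE)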

R Optimizing double loop that uses stri_extract

I have been working on some text scraping/analysis. One thing I did was pull out the top words from documents to compare and learn about different metrics. This was fast and easy. An issue arose, though, with defining which separators to use, and pulling out individual words rather than phrases removed information from the analysis. For example, .Net Developer becomes net and developer after the transformation. I already had a list of set phrases/words from an old project someone else gave up on. The next step was pulling out specific keywords from multiple rows for multiple documents.
I have been looking into several techniques, including vectorization, parallel processing, and using C++ code within R. Moving forward I will experiment with all of these techniques and try to speed up my process, as well as keep these tools for future projects. In the meantime (without experimentation), I'm wondering which adjustments are obvious and will significantly decrease the time taken, e.g. moving parts of the code outside the loop, using better packages, etc.
I also have a progress bar, but I can remove it if it's slowing down my loop significantly.
Here is my code:
library(stringi)  # stri_extract()
library(pbapply)  # pblapply()
words <- read.csv("keyphrases.csv")
df <- data.frame(x = list.files("sec/new/"))
total <- length(df$x)
pb <- txtProgressBar(title = "Progress Bar", min = 0, max = total, width = 300, style = 3)
for (i in df$x) {
  s <- read.csv(paste0("sec/new/", i))
  u <- do.call(rbind, pblapply(words$words, function(x) {
    t <- data.frame(ref = s[, 2], words = stri_extract(s[, 3], coll = x))
    na.omit(t)
  }))
  write.csv(u, paste0("sec/new_results/new/", i), row.names = FALSE)
  # the progress bar needs a numeric position, not the file name itself
  setTxtProgressBar(pb, which(df$x == i),
                    title = paste(round(which(df$x == i) / total * 100, 2), "% done"))
}
So words has 60,000 rows of words/short phrases, no more than 30 characters each. There are around 4,000 files i, each with between 100 and 5,000 rows, and each row has between 1 and 5,000 characters. Any random characters/strings can be used if my question needs to be reproducible.
I only used lapply because combining it with rbind and do.call worked really well; having a loop within a loop may be slowing down the process significantly too.
So, off the bat, are there some things I can do right away? Swapping data.frame for data.table, or using vectors instead? Can I do the reading and writing outside the loop somehow? Perhaps write it so that one of the loops isn't nested?
Thanks in advance
EDIT
The key element that needs speeding up is the extract. Whether I use lapply above or cut it down to:
for (x in words$words) { t <- data.table(words = stri_extract(s[, 3], coll = x)) }
This still takes the most time by a long way (s and t are data.tables in this case).
EDIT2
Attempting to create reproducible data:
set.seed(42)
words <- data.frame(words = rnorm(1:60000))
words$words <- as.character(words$words)  # one string per row (as.character, not NLP::as.String)
set.seed(42)
file1 <- data.frame(x = rnorm(1:5000))
file1$x <- as.character(file1$x)
pblapply(words$words, function(x) {
  data.frame(words = stri_extract(file1$x, coll = x))
})
First things first: yes, I would definitely switch from data.frame to data.table. Not only is it faster and easier to use; when you start merging data sets, data.table will do reasonable things where data.frame will give you unexpected and unintended results.
Secondly, is there an advantage to using R to take care of your separators? You mentioned a number of different techniques you are considering using. If separators are just noise for the purposes of your analysis, why not split the work into two tools and use a tool that is much better at handling separators and continuation lines and so on? For me, Python is a natural choice to do things like parsing a bunch of text into keywords--including stripping off separators and other "noise" words you do not care about in your analysis. Feed the results of the Python parsing into R, and use R for its strengths.
There are a few different ways to get the output of Python into R. I would suggest starting off with something simple: CSV files. They are what you are starting with, they are easy to read and write in Python and easy to read in R. Later you can deal with a direct pipe between Python and R, but it does not give you much advantage until you have a working prototype and it is a lot more work at first. Make Python read in your raw data and turn out a CSV file that R can drop straight into a data.table without further processing.
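On the R side of that hand-off, data.table's fread is the natural reader; a one-line sketch (the file name here is hypothetical):
library(data.table)
keywords <- fread("python_parsed_keywords.csv")  # hypothetical output of the Python pass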
As for stri_extract, it is really not the tool you need this time. You certainly can match on a bunch of different words, but it is not really what it is optimized for. I agree with @Chris that using merge() on data.tables is a much more efficient, and faster, way to search for a number of key words.
Single Word Version
When you have single words in each lookup, this is easily accomplished with merging:
library(data.table)
#Word List
set.seed(42)
WordList <- data.table(ID = 1:60000, words = sapply(1:60000, function(x) paste(sample(letters, 5), collapse = '')))
#A list of dictionaries
set.seed(42)
Dicts <- list(
  Dict1 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Dict2 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Dict3 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = ''))
)
#Create Dictionary Data.table and add Identifier
Dicts <- rbindlist(lapply(Dicts, function(x){data.table(ref = x)}), use.names = T, idcol = T)
# set key for joining
setkey(WordList, "words")
setkey(Dicts, "ref")
Now we have a data.table with all dictionary words, and a data.table with all words in our word list. Now we can just merge:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)
       words    ID   .id
    1: abcli 30174 Dict3
    2: abcrg 26210 Dict2
    3: abcsj  8487 Dict1
    4: abczg 24311 Dict2
    5: abdgl  1326 Dict1
   ---
60260: zyxeb 52194    NA
60261: zyxfg 57359    NA
60262: zyxjw 19337 Dict2
60263: zyxoq  5771 Dict1
60264: zyxqa 24544 Dict2
So we can see abcli appears in Dict3, while zyxeb does not appear in any of the dictionaries. There appear to be 264 duplicates (words that appear in more than one dictionary), as the resultant data.table is larger than our word list (60264 > 60000). This is shown as follows:
merge(WordList, Dicts, by.x = "words", by.y = "ref", all.x = T, allow.cartesian = T)[words == "ahlpk"]
   words    ID   .id
1: ahlpk  7344 Dict1
2: ahlpk  7344 Dict2
3: ahlpk 28487 Dict1
4: ahlpk 28487 Dict2
We also see here that duplicated words in our word list are going to create multiple resultant rows.
This is very very quick to run
Phrases + Sentences
In the case where you are searching for phrases within sentences, you will need to perform a string match instead. However, you will still need to make n(Phrases) * n(Sentences) comparisons, which will quickly hit memory limits in most R data structures. Fortunately, this is an embarrassingly parallel operation:
Same setup:
library(data.table)
library(foreach)
library(doParallel)
# Sentence List
set.seed(42)
Sentences <- data.table(ID = 1:60000, Sentence = sapply(1:60000, function(x) paste(sample(letters, 10), collapse = '')))
# A list of phrases
set.seed(42)
Phrases <- list(
  Phrases1 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Phrases2 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = '')),
  Phrases3 = sapply(1:15000, function(x) paste(sample(letters, 5), collapse = ''))
)
# Create Dictionary Data.table and add Identifier
Phrases <- rbindlist(lapply(Phrases, function(x){data.table(Phrase = x)}), use.names = T, idcol = T)
# Full Outer Join
Sentences[, JA := 1]
Phrases[, JA := 1]
# set key for joining
setkey(Sentences, "JA")
setkey(Phrases, "JA")
We now want to break up our Phrases table into manageable batches
cl <- makeCluster(4)
registerDoParallel(cl)
nPhrases <- as.numeric(nrow(Phrases))
nSentences <- as.numeric(nrow(Sentences))
batch_size <- ceiling(nPhrases * nSentences / 2^30) # Max data.table allocation is 2^31. Lower this if you are hitting memory allocation limits
seq_s <- seq(1, nrow(Phrases), by = floor(nrow(Phrases) / batch_size))
ln_s <- length(seq_s)
if (ln_s > 1) {
  str_seq <- paste0(seq_s, ":", c(seq_s[2:ln_s], nrow(Phrases) + 1) - 1)
} else {
  str_seq <- paste0(seq_s, ":", nrow(Phrases))
}
We are now ready to send our job out. The grepl line below is doing the work - testing which phrases match each sentence. We then filter out any non-matches.
ls <- foreach(i = 1:ln_s) %dopar% {
  library(data.table)
  # str_seq[i], not str_seq[1], so each worker processes its own batch of phrases
  TEMP_DT <- merge(Sentences, Phrases[eval(parse(text = str_seq[i]))],
                   by = "JA", allow.cartesian = TRUE)
  TEMP_DT <- TEMP_DT[, match_test := grepl(Phrase, Sentence), by = .(Phrase, Sentence)][match_test == 1]
  return(TEMP_DT)
}
stopCluster(cl)
DT_OUT <- unique(do.call(rbind,ls))
DT_OUT now summarizes the sentences that match, along with the Phrase + Phrase list that it is found in.
This will still take some time (there is a lot of processing that is necessary), but nowhere near a year.
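One more speed-up worth trying: since the phrases are literal strings rather than regular expressions, grepl with fixed = TRUE skips the regex engine entirely. A hedged tweak to the matching line above:
TEMP_DT[, match_test := grepl(Phrase, Sentence, fixed = TRUE), by = .(Phrase, Sentence)]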

Updating dataframe values with functions at scale

I've got a data frame that requires some manual overrides of certain values given different conditions. But these conditions tend to take a consistent form (I'm always going to be replacing one value with another, given a specific date range, etc.).
I'm wondering if there is a more elegant way to do the following; I suspect there is, given that I'm repeating function calls over and over. Here's an example:
set.seed(1234); library(dplyr)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas","asia"), 50, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),5)
)
Now, as described above, I want to change values in this dataframe subject to certain conditions about values of the other variables. My first thought is to use a function:
changeFunc <- function(df, daterange, string1, string2) {
  df$region <- ifelse(df$date %in% daterange & df$biz == string1,
                      string2, as.character(df$region))
  return(df)
}
And then to actually update the dataframe, call that function:
daterange <- as.Date('2010-02-05') + 0:02; string1 <- 'telco'; string2 <- 'southeast'
data <- changeFunc(data, daterange, string1, string2)
The problem is, I want to do this over and over again for different date ranges and values of string1 and string2, e.g.:
daterange <- as.Date('2010-02-07') + 0:03; string1 <- 'shipping'; string2 <- 'northeast'
data <- changeFunc(data, daterange, string1, string2)
data %>% arrange(biz, region, date)
Is there a more automated way to make many changes to dataframe values subject to consistent rules?
EDIT: Update with a bit more clarity:
What I ultimately want to do is not have to define 'daterange','string1','string2' and call the function repeatedly. Maybe there's some way to leverage an "apply" function to have a list of pre-defined values for the date ranges and strings and update the dataframe all at once? Something like:
valList <- list(daterange = as.Date('2010-02-05') + 0:02, string1 = "telco", string2 = "southeast",
daterange2 = as.Date('2010-02-07') + 0:03, string1_2 = "shipping", string2_2 = "northeast")
And apply this over the dataframe with the changeFunc
But as you can see, I'm a bit unclear about how the function would know that string1_2 is the condition that matches with daterange2 and not daterange
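One way out of that pairing problem (a hedged sketch, not the only design) is to keep each override as a self-contained rule, a list of lists, so the date range and both strings always travel together, then fold the rules over the data frame with Reduce:
# each rule carries its own daterange/string1/string2, so nothing is matched up by position
rules <- list(
  list(daterange = as.Date('2010-02-05') + 0:2, string1 = "telco",    string2 = "southeast"),
  list(daterange = as.Date('2010-02-07') + 0:3, string1 = "shipping", string2 = "northeast")
)
data <- Reduce(function(df, r) changeFunc(df, r$daterange, r$string1, r$string2),
               rules, init = data)
This relies on the corrected changeFunc above, which takes its three conditions as explicit arguments.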
