subset R Characters dataframe - r

I am trying to create a subset of a data frame that was read from a CSV file. The filter to be applied is a character value. Here is the code I wrote:
project_subset = subset (x = fed_stimulus, subset = 'Project Status' == "Completed 50% or more", select = 'Project Name')
The code does not return any error, but it also does not create a subset. Please help.

The reason no subset is being created is that with the line
'Project Status' == "Completed 50% or more"
you are just comparing two strings which are not equal. This will always be FALSE, and subset looks for TRUE cases on which to filter.
What you need to do instead is unquote your column name, or pass it through as a data reference.
# unquoted variable name (back-ticks are needed because the name contains a space)
project_subset = subset (x = fed_stimulus, subset = `Project Status` == "Completed 50% or more", select = 'Project Name')
or
# quoted variable name but used as a column reference from your original data
project_subset = subset (x = fed_stimulus, subset = fed_stimulus[ ,"Project Status"] == "Completed 50% or more", select = 'Project Name')

When the column name has a space you must surround it in back-ticks `:
project_subset = subset (x = fed_stimulus, subset = `Project Status` == "Completed 50% or more", select = 'Project Name')
However, as the documentation states, subset() is meant as a convenience function for use interactively, so if you intend to use this in a script it is better to use [ like this:
project_subset = fed_stimulus[fed_stimulus$`Project Status` == "Completed 50% or more", "Project Name"]
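To make the difference concrete, here is a minimal sketch using a small made-up stand-in for fed_stimulus (the rows are hypothetical, and check.names = FALSE is only there to keep the spaces in the column names). It compares the quoted filter from the question with the back-ticked one:

```r
# Hypothetical stand-in for the data read from the CSV file
fed_stimulus <- data.frame(
  `Project Name`   = c("Bridge repair", "School wiring", "Road resurfacing"),
  `Project Status` = c("Completed 50% or more", "Not started", "Completed 50% or more"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Comparing two string literals gives a single FALSE, unrelated to the data
'Project Status' == "Completed 50% or more"   # FALSE

# Quoted name: the filter is FALSE for every row, so nothing is kept
wrong <- subset(fed_stimulus, 'Project Status' == "Completed 50% or more",
                select = 'Project Name')

# Back-ticked name: the column itself is compared, row by row
right <- subset(fed_stimulus, `Project Status` == "Completed 50% or more",
                select = 'Project Name')

nrow(wrong)   # 0
nrow(right)   # 2
```

Both calls run without an error, which matches what the question describes: the quoted version silently returns an empty data frame.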

Related

Rename variables in a column

I have the following data and I want to rename the three variable-names inside the column Categories from:
Sim_long1 %>%
mutate( Categories = case_when(
Categories == "Ave...C "~ "Ave",
Categories == "Min...C" ~ "Min",
Categories == "Max...C"~ "Max"
)
)
Hope this is what you want
A simple answer is to just remove the ...C.. business from each of the items. If they are all the same, you can relabel like this:
Sim_long1$Categories <- gsub("...C..", "", Sim_long1$Categories)
Then plot as normal. If you want a more general format, you could use this type of syntax:
Sim_long1$Categories <- gsub('\\.*C*','', Sim_long1$Categories)
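One thing to keep in mind with the patterns above is that an unescaped dot in a regular expression matches any character. As a small sketch, assuming the labels look like "Ave...C" (the original data is not shown, so these values are hypothetical), the suffix can be removed either literally with fixed = TRUE or with escaped dots:

```r
# Hypothetical labels; the real values in Sim_long1$Categories may differ
Categories <- c("Ave...C", "Min...C", "Max...C")

# fixed = TRUE treats "...C" as a literal string, not as a regex
literal <- sub("...C", "", Categories, fixed = TRUE)

# Equivalent regex: one or more literal dots followed by C at the end
escaped <- sub("\\.+C$", "", Categories)

literal   # "Ave" "Min" "Max"
escaped   # "Ave" "Min" "Max"
```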

For loop for specific rows

How would I remove some job titles from the data frame (like the one below) for specific LOBs only? E.g. I want to keep Technology Manager in LOB4, but I don't need technology sales in LOB2. When I execute the code below, it removes titles from the entire data frame.
Is there any way to do this?
LOB   Title
LOB1  sales rep
LOB2  technology sales
LOB2  receptionist
LOB3  Web Designer
LOB4  Technology Manager
for (i in c("(?=.*technology)", "(?=.*designer)")) {
del <- grepl(i, data[data$LOB == "LOB1" | data$LOB == "LOB2",2], perl = T, ignore.case = T)
data <- data[!del, ]
}
This is likely not working because the grepl statement is returning a vector with length of three that is then used to subset a data.frame with five rows. A for loop is also probably not needed and any of the following will drop technology sales in LOB2:
data[!grepl("(?=.*(technology|designer))", data$Title, perl = TRUE), ]
data[!data$Title == "technology sales", ]
data[!data$Title %in% c("technology sales", "job2 to drop"), ]
data[-2, ]
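None of the one-liners above looks at the LOB column. If the removal should apply only within certain LOBs, one possible approach is to combine two full-length logical vectors, so the index always has one element per row. This is a sketch that rebuilds the example data from the question; the names in_lob and bad_title are made up for illustration:

```r
# Example data from the question
data <- data.frame(
  LOB   = c("LOB1", "LOB2", "LOB2", "LOB3", "LOB4"),
  Title = c("sales rep", "technology sales", "receptionist",
            "Web Designer", "Technology Manager"),
  stringsAsFactors = FALSE
)

# Both vectors have one element per row of data
in_lob    <- data$LOB %in% c("LOB1", "LOB2")
bad_title <- grepl("technology|designer", data$Title, ignore.case = TRUE)

# Drop a row only when it is in a targeted LOB AND has an unwanted title
result <- data[!(in_lob & bad_title), ]
result
```

Here only the LOB2 "technology sales" row is dropped; "Web Designer" (LOB3) and "Technology Manager" (LOB4) are kept because they are outside the targeted LOBs.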

Using paste to create logical expression for data frame subset

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
selection <- character()
# For each column of remove (particular selection):
for (j in 1:(ncol(remove)-1)){
add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
selection <- paste0(selection, add)
}
selection <- sub(' & $', '', selection) # Remove trailing ampersand
cat(selection, sep = "\n") # What does selection string look like?
tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" ,], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!
You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
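For completeness, the eval route mentioned above could look roughly like the sketch below. It uses a smaller made-up version of dat, and it is shown only to illustrate why a bare string does not work as an index; building the logical vector directly is usually simpler and safer than parsing text.

```r
set.seed(83)
# Smaller, hypothetical version of dat
dat <- data.frame(
  RateeGender   = sample(c("Male", "Female"), 20, replace = TRUE),
  RateeAgeGroup = sample(c("18-39", "40-49"), 20, replace = TRUE),
  stringsAsFactors = FALSE
)

selection <- 'dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39"'

# dat[selection, ] would look for a row literally named by that string.
# Parsing and evaluating the string yields the logical vector instead.
keep_eval <- eval(parse(text = selection))
sub_eval  <- dat[keep_eval, ]

# The same filter written without any string manipulation
keep_direct <- dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39"
```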
I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact, his method doesn't really do sampling at all, but I think it is an elegant solution to a different question, so I'm not reversing my vote. And you could use a similar split strategy, as illustrated below, with that data.frame as the input.)
sub <- lapply( split(dat, with(dat, paste(RateeGender,   # split vector
                                           RateeAgeGroup,
                                           Relationship, sep="_")) ),
               function (d) {
                 # number of cases requested for this combination
                 n <- with(remove, remove[
                        RateeGender==d$RateeGender[1]&
                        RateeAgeGroup==d$RateeAgeGroup[1]&
                        Relationship==d$Relationship[1],
                        "cases"])
                 cat(n, "")
                 # sample row indices; sample(d, n) would sample columns of a data frame
                 d[sample(nrow(d), n, replace=TRUE), ] } )

Updating dataframe values with functions at scale

I've got a data frame that requires some manual overrides of certain values given different conditions. But, these conditions tend to take a consistent form (I'm always going to be replacing one value with another, given a specific date range, etc.).
I'm wondering if there is a more elegant way to do the following; I suspect there is, given that I'm repeating function calls over and over. Here's an example:
set.seed(1234); library(dplyr)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas","asia"), 50, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),5)
)
Now, as described above, I want to change values in this dataframe subject to certain conditions about values of the other variables. My first thought is to use a function:
changeFunc <- function(df, daterange, string1, string2) {
df$region <- ifelse(df$date %in% daterange &
df$biz == string1, string2, as.character(df$region))
return(df)
}
And then to actually update the dataframe, call that function:
daterange <- as.Date('2010-02-05') + 0:02; string1 <- 'telco'; string2 <- 'southeast'
data <- changeFunc(data, daterange, string1, string2)
The problem is, I want to do this over and over again for different date ranges and values for "string 1" and "string 2", eg:
daterange <- as.Date('2010-02-07') + 0:03; string1 <- 'shipping'; string2 <- 'northeast'
data <- changeFunc(data, daterange, string1, string2)
data %>% arrange(biz, region, date)
Is there a more automated way to make many changes to dataframe values subject to consistent rules?
EDIT: Update with a bit more clarity:
What I ultimately want to do is not have to define 'daterange','string1','string2' and call the function repeatedly. Maybe there's some way to leverage an "apply" function to have a list of pre-defined values for the date ranges and strings and update the dataframe all at once? Something like:
valList <- list(daterange = as.Date('2010-02-05') + 0:02, string1 = "telco", string2 = "southeast",
daterange2 = as.Date('2010-02-07') + 0:03, string1_2 = "shipping", string2_2 = "northeast")
And apply this over the dataframe with the changeFunc
But as you can see, I'm a bit unclear about how the function would know that string1_2 is the condition that matches with daterange2 and not daterange
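One way to keep each date range paired with its own strings is to store every override as its own small list and then fold the list of rules over the data frame. The following is only a sketch under that assumption; rules, apply_rule, and the field names inside each rule are made up for illustration, and base R's Reduce() is used instead of an apply function because each step needs the result of the previous one.

```r
set.seed(1234)
data <- data.frame(
  biz    = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas", "asia"), 50, replace = TRUE),
  date   = rep(seq(as.Date("2010-02-01"), length.out = 10, by = "1 day"), 5),
  stringsAsFactors = FALSE
)

# One list per override, so the pieces of a rule cannot get mixed up
rules <- list(
  list(daterange = as.Date("2010-02-05") + 0:2, biz = "telco",    region = "southeast"),
  list(daterange = as.Date("2010-02-07") + 0:3, biz = "shipping", region = "northeast")
)

# Apply a single rule: overwrite region where date and biz both match
apply_rule <- function(df, rule) {
  hit <- df$date %in% rule$daterange & df$biz == rule$biz
  df$region[hit] <- rule$region
  df
}

# Start from data and apply the rules one after another
updated <- Reduce(apply_rule, rules, data)
```

Adding another override then means adding one more entry to rules rather than redefining variables and calling the function again.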

rename columns containing pattern in r using plyr rename function

I would like to rename all columns in a dataframe containing a pattern in R. I.e., I would like to substitute the column name "variable" for all columns containing "variable", such as "htn.variable". I thought I could use rename from plyr and grepl. I have created an example:
exp<-data.frame(htn.variable = c(1,2,3), id = c(5,6,7), visit = c(1,3,4))
require(plyr)
rename ( exp, c(
names(exp)[grepl ( 'variable',names(exp))] = "variable" ))
But I get the following error:
Error: unexpected '=' in:
" c(
names(exp)[grepl ( 'variable',names(exp))] ="
I think this has to do with calling up a name within a function, and I would like to ask if anyone might have a suggestion how to make this work please? Thanks.
Why bother with rename at all?
colnames(exp)[grepl('variable',colnames(exp))] <- 'variable'
If you only want to replace the part of the column name that is equal to 'variable', use:
colnames(exp) <- gsub('variable', 'replace string', colnames(exp))
rename ( exp, "variable" = names(exp)[grepl ( 'variable',names(exp))])
I am not 100% sure if this is what you need but it may be a start. I stayed away from plyr
for (i in 1:ncol(exp)){
if (substr(names(exp)[i],5,12) == "variable"){
names(exp)[i] <- "new.variable" #or any new var name
}
}
exp
You could also just remove the first four elements of the variable name.
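Dropping the first four characters could be sketched as follows. The data frame is the example from the question, renamed to exp_df here (a made-up name) to avoid masking the base function exp(); the approach assumes every matching column has a four-character prefix such as "htn.".

```r
exp_df <- data.frame(htn.variable = c(1, 2, 3), id = c(5, 6, 7), visit = c(1, 3, 4))

# Columns whose name contains the literal text "variable"
has_var <- grepl("variable", names(exp_df), fixed = TRUE)

# Keep everything from the 5th character onwards for those columns only
names(exp_df)[has_var] <- substring(names(exp_df)[has_var], 5)

names(exp_df)   # "variable" "id" "visit"
```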

Resources