Using two Not in select - julia

I want to exclude some columns in a dataframe:
neatDf = select(df, :Response_ID, Not([:IP_Address, :Progress, :Finished, :User_Language, :Distribution_Channel, :Response_Type]))
The above works. However, when I want to include a second Not to exclude columns using a Regex, it doesn't work:
neatDf = select(df, :Response_ID, Not([:IP_Address, :Progress, :Finished, :User_Language, :Distribution_Channel, :Response_Type]), Not(r"^Recipient"))
Can I use two Not in a row like above?

If you want to drop the union of both conditions, I think the simplest approach is to apply Not to a union of selectors, which you can build with Cols:
select(df, :Response_ID, Not(Cols([:IP_Address, :Progress, :Finished, :User_Language, :Distribution_Channel, :Response_Type], r"^Recipient")))

I think the InMemoryDatasets package allows this, though you first need to convert your DataFrame to a Dataset:
using InMemoryDatasets
ds=Dataset(df)
neatds=select(ds,Not([:IP_Address, :Progress, :Finished, :User_Language, :Distribution_Channel, :Response_Type]), Not(r"^Recipient"))

Related

Is it possible to use count with a function that matches an expression?

Recently, I needed a simple count function for an analysis workload. The code was something like this:
count(datasetName, variableName %in% c("X330", "X331", "X332", "X333", "X334", "X335", "X336", "X337", "X338", "X339"))
Looking at the code, I've been wondering whether it is possible to match the variable names using some sort of pattern instead. In my head, it would look like this:
count(datasetName, variableName %in% match("X33"))
From my research, dplyr contains matching functions, but those expect you to use select. I haven't been able to find how this would work with count.
We can use sum instead of count: summing a logical vector counts its TRUE values, because TRUE is treated as 1 and FALSE as 0.
with(datasetName, sum(variableName %in% c("X330", "X331", "X332", "X333", "X334", "X335", "X336", "X337", "X338", "X339")))
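The answer above still lists every value explicitly. For the pattern matching asked about in the question, base R's startsWith and grepl also return logical vectors, so sum can count them in the same way. This is only a sketch: the data frame below is made up to stand in for the asker's datasetName, and n_prefix and n_regex are illustrative names.

```r
# Hypothetical data standing in for the asker's datasetName
datasetName <- data.frame(
  variableName = c("X330", "X331", "X339", "X340", "Y330", "X33"),
  stringsAsFactors = FALSE
)

# startsWith() is TRUE for every value that begins with the prefix "X33"
n_prefix <- sum(startsWith(datasetName$variableName, "X33"))

# grepl() does the same with a regular expression; this one requires
# exactly one digit after "X33", mirroring the list X330 ... X339
n_regex <- sum(grepl("^X33[0-9]$", datasetName$variableName))
```

The prefix test also counts a bare "X33", so the regular expression is the closer equivalent of the original ten-value list.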

dplyr select unexpectedly selects another column

The object 'customer_profiling_vars' is a data frame holding just the variables selected by a clustering algorithm (RSKC), as seen in the R output below the R code:
customer_profiling_vars
customer_profiling_vars$Variables
Now I want to select only the variables listed in that vector from my dataset sc_df_tr_dummified, using dplyr's 'select':
customer_df_interprete = sc_df_tr_dummified %>%
select(customer_profiling_vars$Variables)
glimpse(customer_df_interprete)
I expect to get the variable 'SalePrice' selected.
However, a different variable ('PoolArea.576') gets selected, which is very weird:
Just to be sure, I tried using SalePrice directly instead of customer_profiling_vars$Variables, and it gives what I intended:
What is wrong with dplyr's select? To me, it seems to have something to do with the factor nature of 'customer_profiling_vars$Variables':
Thanks in advance!
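The factor suspicion in the question can be checked in base R, where indexing by a factor uses its integer codes rather than its labels. The sketch below uses made-up stand-ins (sc_df, profiling_vars) for the asker's objects and says nothing about how any particular dplyr version treats a factor inside select; converting with as.character is the safe route either way.

```r
# Hypothetical stand-ins for the asker's objects
sc_df <- data.frame(
  PoolArea  = c(0, 576),
  LotArea   = c(8450, 9600),
  SalePrice = c(208500, 181500)
)
profiling_vars <- data.frame(Variables = factor("SalePrice"))

# The factor has a single level, so its integer code is 1
code <- as.integer(profiling_vars$Variables)

# Base R indexes by that integer code, so this picks the first column
by_factor <- names(sc_df[profiling_vars$Variables])

# Converting to character first selects the column by name
by_name <- names(sc_df[as.character(profiling_vars$Variables)])
```

Here by_factor is the first column of the data frame, while by_name is the intended SalePrice.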

dplyr's removed function? Calculating a mean for several columns in the data frame in R

I would like to calculate the mean of several columns in my data frame. I wanted to select them using the ‘:’ in the dplyr package. The variable names are: Mcheck5_1_1, Mcheck5_2_1, ..., Mcheck5_8_1 (so there are 8 in total). I learnt that I can select them by
select(df, Mcheck5_1_1:Mcheck5_8_1)
in an online course taught by Roger Peng (https://www.youtube.com/watch?v=aywFompr1F4&feature=youtu.be) at 4min33sec.
However, R complained:
Error in select(df, Mcheck5_1_1:Mcheck5_8_1) :
unused argument (Mcheck5_1_1:Mcheck5_8_1)
I also couldn't find other people using this ':' feature on Google. I suspect this feature no longer exists?
Right now, I use the following code to solve the problem:
idx = grep("Mcheck5_1_1", names(df))
df$avg = rowMeans(df[, idx:(idx + 7)], na.rm = TRUE)  # parentheses needed: ':' binds tighter than '+'
(I'm hesitant to index those columns by number (e.g., df[138]) for fear that their position might vary.)
However, I think this solution is not elegant enough. Could you advise me whether there is another way to do it? Is it still possible to use the colon (:) method to select my variables, and did I just make a mistake in my code? Thanks all.
Try dplyr::select(df, Mcheck5_1_1:Mcheck5_8_1). This is likely a package conflict: another attached package probably defines its own select function that masks dplyr's. See here for a related question.
To calculate the mean for each of those columns:
library(magrittr)
library(purrr)
df %>%
dplyr::select(Mcheck5_1_1:Mcheck5_8_1) %>%
map(mean)
Maybe contains can help, because it searches the column names for a string. In your case it would be: select(df, contains("Mcheck5_"))
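The answers above select the columns or average each column separately, whereas the question asks for one mean per row. A minimal base R sketch of that, assuming the columns share the "Mcheck5_" prefix; the data frame and its values are invented for illustration.

```r
# Hypothetical data frame with three of the Mcheck5 columns plus others
df <- data.frame(
  id          = 1:3,
  Mcheck5_1_1 = c(1, 2, NA),
  Mcheck5_2_1 = c(3, 4, 5),
  Mcheck5_3_1 = c(5, 6, 7),
  other       = c(9, 9, 9)
)

# Find the columns by name pattern rather than by position
cols <- grep("^Mcheck5_", names(df), value = TRUE)

# Row-wise mean across those columns, ignoring missing values
df$avg <- rowMeans(df[cols], na.rm = TRUE)
```

Selecting by name pattern avoids both the hard-coded column position and the dependence on which package's select is in scope.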

Conditional Lookup in R

I am trying to replace the blank (missing) zipcodes in the df table with the zipcodes in another table called zipless, based on names.
What would be the best approach? A for loop is probably very slow.
I was trying with something like this, but it does not work.
df$zip_new <- ifelse(df, is.na(zip_new),
left_join(df,zipless, by = c("contbr_nm" = "contbr_nm")),
zip_new)
I was able to make it work using this approach, but I am sure it is not the best one.
I first added a new column from the lookup table and in the next step selectively used it, where necessary.
library(dplyr)
#temporarily renaming the lookup column in the lookup table
zipless <- plyr::rename(zipless, c("zip_new"="zip_new_temp"))
#adding the lookup column to the main table
df <- left_join(df, zipless, by = c("contbr_nm" = "contbr_nm"))
#taking over the value from the lookup column zip_new_temp if the condition is met, else, do nothing.
df$zip_new <- ifelse((df$zip_new == "") &
(df$contbr_nm %in% zipless$contbr_nm),
df$zip_new_temp,
df$zip_new)
What would be a proper way to do this?
Thank you very much!
I'd suggest using match to grab just the zips you need. Something like:
# treat both NA and empty strings as missing, since the question tests for ""
miss_zips = is.na(df$zip_new) | df$zip_new == ""
df$zip_new[miss_zips] = zipless$zip_new[match(
df$contbr_nm[miss_zips],
zipless$contbr_nm
)]
Without sample data I'm not wholly sure of your column names, but something like that should work.
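Since the thread has no sample data, here is a small worked sketch of the match approach. The names and zip codes are invented, and the missing-value test covers both NA and empty strings, which is an assumption about how the blanks are stored.

```r
# Hypothetical main table: some zips are NA, some are empty strings
df <- data.frame(
  contbr_nm = c("SMITH, A", "JONES, B", "DOE, C", "ROE, D"),
  zip_new   = c("02139", NA, "", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table keyed by name
zipless <- data.frame(
  contbr_nm = c("JONES, B", "DOE, C"),
  zip_new   = c("10001", "94105"),
  stringsAsFactors = FALSE
)

# Rows whose zip is missing, whether stored as NA or as ""
miss_zips <- is.na(df$zip_new) | df$zip_new == ""

# Position of each name in the lookup table (NA when the name is absent)
pos <- match(df$contbr_nm[miss_zips], zipless$contbr_nm)

# Fill in the looked-up zips; names without a match end up as NA
df$zip_new[miss_zips] <- zipless$zip_new[pos]
```

Rows that already had a zip are untouched, and "ROE, D", who is not in the lookup table, stays NA.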
I can only recommend the data.table package for things like this. Your general approach is correct, but data.table has a much nicer syntax and is designed to handle large data sets.
In data.table it would probably look like this:
library(data.table)
# assumes dplyr is loaded and zipless$zip_new was renamed to zip_new_temp, as in the question
zipcodes <- data.table(left_join(df, zipless, by = "contbr_nm"))
zipcodes[, zip_new := ifelse(is.na(zip_new), zip_new_temp, zip_new)]

How to not use loops & IF-statements in R

I have two dataframes in R, one big but incomplete (import), and I want to create a smaller, complete subset of it (export). Every ID in the $unique_name column is unique and does not appear twice. Other columns might be, for example, body mass, but also other categories that correspond to the unique ID. I've written this code, a double loop with an if-statement, and it does work, but it is slow:
for (j in 1:length(export$unique_name)){
for (i in 1:length(import$unique_name)){
if (toString(export$unique_name[j]) == toString(import$unique_name[i])){
export$body_mass[j] <- import$body_mass[i]
}
}
}
I'm not very good with R but I know this is a bad way to do it. Any tips on how I can do it with functions like apply() or perhaps the plyr package?
Bjørn
There are many functions for this. Check out:
library(compare)
compare(DF1,DF2,allowAll=TRUE)
or, as mentioned by @A.Webb, merge is a pretty handy function:
merge(x = DF1, y = DF2, by.x = "Unique_ID", by.y = "Unique_ID", all.x = TRUE, sort = FALSE)
If you prefer SQL-style statements, then:
library(sqldf)
sqldf('SELECT * FROM DF1 INTERSECT SELECT * FROM DF2')
These are easy to implement and avoid for loops and if conditions.
As A.Webb suggested, you need a join:
# join data on unique_name (inner join: export rows without a match in import are dropped)
joined = merge(export, import[c("unique_name", "body_mass")], by = "unique_name")
joined$body_mass = joined$body_mass.y # update body_mass from import to export
joined$body_mass.x = NULL # remove not needed column
joined$body_mass.y = NULL # remove not needed column
export = joined
Note: as shown below, you can use the "which" function. This removes the inner loop:
for (j in 1:nrow(export)) {
index <- which(import$unique_name %in% export$unique_name[j])
if (length(index) == 1)
{
export$body_mass[j] <- import[index[1], "body_mass"]
}
}
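Both loops can also be dropped entirely with match, which looks up every export name in import in one vectorized call. This is a sketch on made-up data, assuming (as the question states) that unique_name has no duplicates in import; idx and found are illustrative names.

```r
# Hypothetical tables: import is the big source, export the subset to fill
import <- data.frame(
  unique_name = c("a", "b", "c", "d"),
  body_mass   = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)
export <- data.frame(
  unique_name = c("c", "a", "z"),
  body_mass   = NA_real_,
  stringsAsFactors = FALSE
)

# Row of import that holds each export name (NA when there is no match)
idx <- match(export$unique_name, import$unique_name)

# Copy body_mass only for the names that were found
found <- !is.na(idx)
export$body_mass[found] <- import$body_mass[idx[found]]
```

Like the original double loop, this leaves rows without a match (here "z") unchanged and keeps export in its original order.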
