R Data Wrangling for Emails

Need help! This is a work-related project. I need to clean 16,000 emails, and I'm expected to do it by hand :( I need to find a way to pull the domain name from each email and place it into a new column, and parse the name into a new column as well, while still keeping the original email. The data is partially complete.
library(tidyr)
library(magrittr)
Email.Address <- c('john.doe#abccorp.com','jdoe#cisco.com','johnd#widgetco.com')
First.Name <- c('John', 'JDoe','NA' )
Last.Name <- c('Doe','NA','NA')
Company <- c('NA','NA','NA')
data <- data.frame(Email.Address, First.Name, Last.Name, Company)
separate_DF <- data %>% separate(Email.Address, c("Company"), sep="#")

Try this:
df <- data.frame(Email.Address, First.Name, Last.Name, Company, stringsAsFactors = FALSE)
Corp <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",2),"[.]"),"[[",1)
F.Name <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",1), "[.]"),"[[",1)
L.Name <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",1),"[.]"),tail,n=1)
L.Name[L.Name == F.Name] <- NA
OUT <- data.frame(df$Email.Address, F.Name, L.Name, Corp)
df[df=="NA" |is.na(df)] <- OUT[df=="NA" |is.na(df)]
df
The separate function from tidyr looks good too.
http://blog.rstudio.org/2014/07/22/introducing-tidyr/
From the information you have given, this also works:
library(tidyr)
df <- data.frame(Email.Address, First.Name, Last.Name, Company)
df2 <- separate(df, Email.Address, into = c("Name", "Corp"), sep = "#")
df2 <- separate(df2, Name, into = c("F.Name", "L.Name"), sep = "[.]", extra = "drop")
df2 <- separate(df2, Corp, into = c("Corp", "com"), sep = "[.]", extra = "drop")
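Since the question asks to keep the original email, here is a minimal base R sketch that adds the parsed pieces as new columns and leaves Email.Address untouched. It assumes, as in the sample data, that '#' separates the name from the domain and that a dot separates first and last name. The column names Domain, Corp, F.Name and L.Name are illustrative choices, not part of the original data.

```r
# Sample addresses from the question ('#' stands in for the separator).
emails <- c('john.doe#abccorp.com', 'jdoe#cisco.com', 'johnd#widgetco.com')
df <- data.frame(Email.Address = emails, stringsAsFactors = FALSE)

# Split each address into the part before and after '#'.
local_part <- sub("#.*$", "", df$Email.Address)
df$Domain <- sub("^.*#", "", df$Email.Address)

# Company guess: everything in the domain before the first dot.
df$Corp <- sub("[.].*$", "", df$Domain)

# Name guess: first and last dot-separated token of the local part.
# A local part without a dot yields NA for the last name.
name_parts <- strsplit(local_part, ".", fixed = TRUE)
df$F.Name <- vapply(name_parts, function(p) p[1], character(1))
df$L.Name <- vapply(
  name_parts,
  function(p) if (length(p) > 1) p[length(p)] else NA_character_,
  character(1)
)

df
```

With tidyr, passing remove = FALSE to separate() should likewise keep the input column.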

Related

resource efficient join and filter method

I have the following condensed data set:
tbl1 <- data.frame(Name = c(rep("A",3), rep("B",2), rep("C",3)), Dat = c(1,1,2,1,1,3,4,4),
Var1 = sample(1:8,8), Var2 = sample(1:8,8))
tbl2 <- data.frame(Name = c("A","A","B","C","C"), Dat = c(1,2,1,3,4), x = c("a","b","b","b","a"))
I need to keep only the rows of tbl1 whose Name and Dat match a row of tbl2 with a given value of x. This is my current solution:
library(dplyr)
tbl11 <- tbl1 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl2 <- tbl2 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl3 <- left_join(tbl11, tbl2)
tbl4 <- tbl3 %>% filter(x == "a")
Unfortunately, I run into resource issues: it works for small tables only. I think there is a more efficient way that avoids storing the intermediate steps. Your help is much appreciated.
You can subset the data before joining:
tbl4 <- merge(tbl1, subset(tbl2, x == 'a'), by = c('Name', 'Dat'))
Thanks for sharing your ideas. Just for completeness: I tested the suggestions and came up with a correct and more efficient way:
tbl3 <- inner_join(filter(tbl2, x == 'a'), tbl1, by = c('Name', 'Dat'))
In my tests, inner_join was significantly faster than merge. The order of the inputs matters, of course.
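To check that filtering before the join returns the same rows as joining first and filtering afterwards, here is a small base R sketch on the data from the question. The set.seed() call and the helper names wanted, filtered_first and joined_first are assumptions added for illustration. It says nothing about speed, only that the two orders agree on this sample.

```r
set.seed(1)  # assumption: fixed seed so the sampled columns are reproducible
tbl1 <- data.frame(Name = c(rep("A", 3), rep("B", 2), rep("C", 3)),
                   Dat = c(1, 1, 2, 1, 1, 3, 4, 4),
                   Var1 = sample(1:8, 8), Var2 = sample(1:8, 8))
tbl2 <- data.frame(Name = c("A", "A", "B", "C", "C"),
                   Dat = c(1, 2, 1, 3, 4),
                   x = c("a", "b", "b", "b", "a"))

# Filter first: only the matching keys of tbl2 take part in the join.
wanted <- tbl2[tbl2$x == "a", ]
filtered_first <- merge(tbl1, wanted, by = c("Name", "Dat"))

# Join first: the full join result is built, then most of it is discarded.
joined_first <- merge(tbl1, tbl2, by = c("Name", "Dat"))
joined_first <- joined_first[joined_first$x == "a", ]

nrow(filtered_first)
nrow(joined_first)
```

Filtering first keeps the intermediate result small, which is the point of the answers above.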

How do I write a function (analogous to a SAS macro) in R to import and format a list of Excel files?

I'm looking for a more efficient way to write the following:
Read in all my Excel files
DF1 <- read_excel(DF1, sheet = "ABC", range = cell_cols(1:10) )
DF2 <- read_excel(DF2, sheet = "ABC", range = cell_cols(1:10) )
etc...
DF50 <- read_excel(DF50, sheet = "ABC", range = cell_cols(1:10) )
Add a column to each DF with a location
DF1$Location <- location1
DF2$Location <- location2
etc...
DF50$Location <- location50
Keep only columns with specified names, get rid of blank rows, and convert column CR_NUMBER to an integer
library(hablar)
DF1 <- DF1 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
DF2 <- DF2 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
etc...
DF50 <- DF50 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
You can try the following, which collects the data in a list:
library(readxl)
library(hablar)
library(dplyr)
# Get the complete paths of the files whose names contain "DF" followed by a number.
file_names <- list.files('/folder/path', pattern = 'DF\\d+', full.names = TRUE)
list_data <- lapply(seq_along(file_names), function(x) {
  data <- read_excel(file_names[x], sheet = "ABC", range = cell_cols(1:10))
  data %>%
    mutate(Location = paste0('location', x)) %>%
    select(all_of(colnames_r)) %>%
    filter(!is.na(NAME)) %>%
    convert(int(CR_NUMBER))
})
list_data is a list of data frames, which is usually easier to manage than 50 separate data frames in the global environment. If you still want the data frames separately, name the list and use list2env:
names(list_data) <- paste0('DF', seq_along(list_data))
list2env(list_data, .GlobalEnv)
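The same pattern can be tried end to end without any Excel files. The sketch below is an assumption-laden stand-in: it writes three small CSV files to a temporary folder and uses base read.csv() in place of read_excel(), and the function name import_one, the folder name df_demo and the sample values are made up for illustration. The per-file steps (add Location, drop rows with a blank NAME, convert CR_NUMBER to integer) follow the question.

```r
# Assumption: demo CSV files stand in for the real Excel workbooks.
tmp <- file.path(tempdir(), "df_demo")
dir.create(tmp, showWarnings = FALSE)
for (i in 1:3) {
  write.csv(data.frame(NAME = c(paste0("name", i), NA),
                       CR_NUMBER = c("10", "20")),
            file.path(tmp, paste0("DF", i, ".csv")), row.names = FALSE)
}

# One function holds the per-file steps, much like a macro body.
import_one <- function(path, location) {
  data <- read.csv(path, stringsAsFactors = FALSE)
  data$Location <- location
  data <- data[!is.na(data$NAME), , drop = FALSE]  # drop blank rows
  data$CR_NUMBER <- as.integer(data$CR_NUMBER)
  data
}

file_names <- list.files(tmp, pattern = "^DF\\d+\\.csv$", full.names = TRUE)
locations <- paste0("location", seq_along(file_names))

# Apply the function to every file and its location, keeping a named list.
list_data <- Map(import_one, file_names, locations)
names(list_data) <- sub("\\.csv$", "", basename(file_names))

# Optionally stack everything into a single data frame.
combined <- do.call(rbind, list_data)
combined
```

Keeping the steps in one function means a change to the cleaning logic is made once rather than 50 times.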

How to apply separate to several columns with the same actions in R?

I want to separate one column into 4 columns by "-", keep the 4 new columns, drop the original column, and rename the 4 new columns. Then I want to apply the same action to several columns.
I did it the stupid way... I tried to write a function with separate and paste, but failed.
R code as below:
tabled <- table_d %>%
separate("A3 CAB PA",into=c("A3 CAB PA_count","A3 CAB PA_sumkm","A3 CAB PA_drive","A3 CAB PA_drive2"),sep="-") %>%
separate("A4 Allroad B9",into=c("A4 Allroad B9_count","A4 Allroad B9_sumkm","A4 Allroad B9_drive","A4 Allroad B9_drive2"),sep="-") %>%
separate("A5 Cabriolet B9",into=c("A5 Cabriolet B9_count","A5 Cabriolet B9_sumkm","A5 Cabriolet B9_drive","A5 Cabriolet B9_drive2"),sep="-") %>%
and so on...
Is it possible to define a function(x) and use lapply(data[,-1], function(x)) to replace the long code above?
I made a data sample with your structure:
library(tidyr)
A <- c("7-4-5-9", NA)
B <- c("6-4-5-1", "7-8-6-3")
dat <- data.frame(A,B, stringsAsFactors = FALSE) #example data
for (col in names(dat)) { # you will probably need to restrict this to a subset of columns, with something like wanted_cols %in% names(dat)
  namecol1 <- paste(col, "count", sep = "_")
  namecol2 <- paste(col, "sumkm", sep = "_")
  namecol3 <- paste(col, "drive", sep = "_")
  namecol4 <- paste(col, "drive2", sep = "_") # set the wanted column structure
  cols <- rlang::sym(col)
  dat <- dat %>% separate(!!cols, into = c(namecol1, namecol2, namecol3, namecol4), sep = "-", remove = TRUE) # separate and remove the original column
}
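The question also asks whether a function plus lapply could do this. A minimal base R sketch of that idea follows, on the same example data. The helper name split_col and the suffixes vector are hypothetical, and the sketch assumes every value has at most four "-"-separated parts. Shorter or missing values are padded with NA.

```r
dat <- data.frame(A = c("7-4-5-9", NA),
                  B = c("6-4-5-1", "7-8-6-3"),
                  stringsAsFactors = FALSE)
suffixes <- c("count", "sumkm", "drive", "drive2")

# Split one character vector into length(suffixes) columns named <col>_<suffix>.
split_col <- function(values, col_name) {
  parts <- strsplit(values, "-", fixed = TRUE)
  mat <- t(vapply(parts, function(p) {
    length(p) <- length(suffixes)  # pad with NA (or truncate) to 4 parts
    p
  }, character(length(suffixes))))
  out <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(out) <- paste(col_name, suffixes, sep = "_")
  out
}

# Apply the helper to every column and bind the pieces side by side.
result <- do.call(cbind, lapply(names(dat), function(col) split_col(dat[[col]], col)))
result
```

To process only some columns, replace names(dat) with the vector of column names to split and cbind the untouched columns back on.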

Insert Column Name into its Value using R

I need to insert the column name, e.g. Department, into each of its values. I have code like this:
Department <- c("Store1","Store2","Store3","Store4","Store5")
Department2 <- c("IT1","IT2","IT3","IT4","IT5")
x <- c(100,200,300,400,500)
Result <- data.frame(Department,Department2,x)
Result
The expected result is like:
Department <- c("Department_Store1","Department_Store2","Department_Store3","Department_Store4","Department_Store5")
Department2 <- c("Department2_IT1","Department2_IT2","Department2_IT3","Department2_IT4","Department2_IT5")
x <- c(100,200,300,400,500)
Expected.Result <- data.frame(Department,Department2,x)
Expected.Result
Can somebody help? Thanks
Another way with dplyr and tidyr:
library(dplyr)
library(tidyr)
# Convert to character to avoid warning message, will convert all columns to character
Result[] <- lapply(Result, as.character)
Result %>%
  mutate_if(is.factor, as.character) %>% # optional, only convert factor to character, retain all other types
  gather(key, value, -x) %>%
  mutate(var = paste(key, value, sep = "_")) %>%
  select(-value) %>%
  spread(key,var)
    x        Department     Department2
1 100 Department_Store1 Department2_IT1
2 200 Department_Store2 Department2_IT2
3 300 Department_Store3 Department2_IT3
4 400 Department_Store4 Department2_IT4
5 500 Department_Store5 Department2_IT5
Data:
Result <- data.frame(
Department = c("Store1","Store2","Store3","Store4","Store5"),
Department2 = c("IT1","IT2","IT3","IT4","IT5"),
x = c(100,200,300,400,500)
)
If you gather the column names in question into a vector dep_col, this is a clean base R solution with a for loop:
df <- data.frame(x = 1:5,
Department = paste0("Store", 1:5),
Department2 = paste0("IT", 1:5))
dep_col <- names(df)[-1]
for (c in dep_col)
  df[[c]] <- paste(c, df[[c]], sep = "_")
If I understand correctly, the OP wants to prepend the values in all columns starting with "Department" by the respective column name.
Edit: At the request of the OP, the code that selects the columns has been generalized to pick up additional column names.
Here is a solution using data.table's fast set() function:
library(data.table)
setDT(Result)
cols <- stringr::str_subset(names(Result), "^(Department|Division|Team)")
for (j in cols) {
  set(Result, NULL, j, paste(j, Result[[j]], sep = "_"))
}
Result
          Department     Department2   x
1: Department_Store1 Department2_IT1 100
2: Department_Store2 Department2_IT2 200
3: Department_Store3 Department2_IT3 300
4: Department_Store4 Department2_IT4 400
5: Department_Store5 Department2_IT5 500
Note that set() updates by reference, i.e., without copying the whole object.
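For comparison, the same prefixing can be written in base R without an explicit loop. This is only a sketch on the question's data: the pattern "^Department" and the helper vector cols are assumptions about which columns should be touched.

```r
Result <- data.frame(
  Department = c("Store1", "Store2", "Store3", "Store4", "Store5"),
  Department2 = c("IT1", "IT2", "IT3", "IT4", "IT5"),
  x = c(100, 200, 300, 400, 500)
)

# Pick the columns whose names start with "Department".
cols <- grep("^Department", names(Result), value = TRUE)

# Pair every selected column with its own name and paste the two together.
# paste() uses the labels of factor columns, so this also works on older R
# versions where data.frame() creates factors by default.
Result[cols] <- Map(function(values, name) paste(name, values, sep = "_"),
                    Result[cols], cols)
Result
```

Unlike set(), this assignment copies the modified columns, which is usually fine for small data.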

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a workaround:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
Here is the dplyr equivalent:
library(dplyr)
catalog =
  data_frame(
    data = someListOfDataframes,
    cat = names(someListOfDataframes)) %>%
  rowwise %>%
  mutate(
    renamed =
      data %>%
      rename_(.dots =
        cat %>%
        as.name %>%
        list %>%
        setNames("cat")) %>%
      list)
catalog$renamed %>%
  bind_rows(.id = "number") %>%
  group_by(number, idx, cat) %>%
  summarize(checkSum = sum(value))
You could first store the names with names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
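The idea of walking over the names alongside the data frames can also be sketched in base R with Map(), using the reproducible example from the question. Here aggregate() stands in for ddply() so that no extra package is needed. The set.seed() call and the result name sums are assumptions added for illustration.

```r
set.seed(42)  # assumption: fixed seed so the random values are reproducible
df1 <- data.frame(value = rnorm(100), cat = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# Map() passes each data frame together with its name in the list,
# so the name can be used as a grouping column inside the function.
sums <- Map(function(df, nm) {
  out <- aggregate(df["value"], by = df[c("idx", nm)], FUN = sum)
  names(out)[names(out) == "value"] <- "checkSum"
  out
}, mylist, names(mylist))

sums$cat
```

The result keeps the names of mylist, so sums$cat2 holds the summary grouped by idx and cat2.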

Resources