resource efficient join and filter method


I have the following condensed data set:
tbl1 <- data.frame(Name = c(rep("A",3), rep("B",2), rep("C",3)), Dat = c(1,1,2,1,1,3,4,4),
Var1 = sample(1:8,8), Var2 = sample(1:8,8))
tbl2 <- data.frame(Name = c("A","A","B","C","C"), Dat = c(1,2,1,3,4), x = c("a","b","b","b","a"))
I need to keep only the rows of tbl1 whose Name and Dat combination has a given value of x in tbl2 (here x == "a"). This is my current solution:
library(dplyr)
tbl11 <- tbl1 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl2 <- tbl2 %>% mutate(key = paste(Name, Dat, sep = "_"))
tbl3 <- left_join(tbl11, tbl2)
tbl4 <- tbl3 %>% filter(x == "a")
Unfortunately, I run into resource issues. It works for small tables, but I think there must be a more efficient way that does not require storing the intermediate steps. Your help is much appreciated.

You can subset the data before joining:
tbl4 <- merge(tbl1, subset(tbl2, x == 'a'), by = c('Name', 'Dat'))

Thanks for sharing your ideas. Just for completeness, I tested the suggestions and came up with a correct and more efficient way:
tbl3 <- inner_join(filter(tbl2, x == 'a'), tbl1, by = c('Name', 'Dat'))
inner_join is significantly faster than merge. The order of the inputs matters, of course.
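If the column x itself is not needed in the result, a filtering join might avoid widening tbl1 at all. The following is only a sketch, assuming dplyr is available; semi_join is not what the answers above use, and the set.seed call is added here only to make the sample data reproducible. It keeps the rows of tbl1 whose (Name, Dat) pair appears among the keys with x == "a", and compares the result with the original join-then-filter approach.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the sampled columns are reproducible
tbl1 <- data.frame(Name = c(rep("A", 3), rep("B", 2), rep("C", 3)),
                   Dat = c(1, 1, 2, 1, 1, 3, 4, 4),
                   Var1 = sample(1:8, 8), Var2 = sample(1:8, 8))
tbl2 <- data.frame(Name = c("A", "A", "B", "C", "C"),
                   Dat = c(1, 2, 1, 3, 4),
                   x = c("a", "b", "b", "b", "a"))

# Reduce tbl2 to the matching keys first, so only a small table takes part in the join
keys_a <- tbl2 %>% filter(x == "a") %>% select(Name, Dat)

# semi_join keeps the rows of tbl1 that have a match in keys_a and adds no columns
filtered <- semi_join(tbl1, keys_a, by = c("Name", "Dat"))

# Original approach, for comparison: join everything, then filter
reference <- tbl1 %>%
  left_join(tbl2, by = c("Name", "Dat")) %>%
  filter(x == "a")
```

Both approaches select the same rows of tbl1; the semi_join result simply lacks the x column.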

Related

tidy way to perform joins iteratively with map / apply functions

I would like to join/merge multiple tibbles/data frames with the use of map/lapply. How would it be possible to perform that?
Reproducible example:
library(dplyr)
library(purrr)
library(stringr)
set.seed(42)
df <- tibble::tibble(rank = rep(stringr::str_c("rank",1:10),10),
char_1 = sample(c("a","b","c"), size = 100, replace = TRUE),
points = sample(1:10000, size = 100)
)
my_top <- seq(10,90, by= 10) %>%
as.list() %>%
set_names(c(stringr::str_c("sample_",1:9)))
my_list_1 <- map(my_top , ~ df %>%
sample_n(.x) %>%
mutate(!!str_c(.x, "_score") := sample(1:10000, size = .x)))
I would like to perform this:
df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[1]] ) %>%
left_join(my_list_1[[2]] ) %>%
left_join(my_list_1[[3]] )
and so on ... with map function.
I tried this:
map(as.list(names(my_top)), ~ df %>% group_by(rank, char_1, points) %>%
left_join(my_list_1[[.x]] ))
But of course, this does not keep the joined tibble anywhere, so the next join cannot build on it!

One option would be reduce:
library(dplyr)
library(purrr)
df %>%
group_by(rank, char_1, points) %>%
list(.) %>%
c(., my_list_1[1:3]) %>%
reduce(left_join)
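To show how reduce carries the accumulated result from one join into the next, here is a minimal sketch on small made-up tibbles. The names base, extras and the key column id are hypothetical and not part of the question's data. It passes the starting table through .init and names the join key explicitly rather than relying on all common columns.

```r
library(dplyr)
library(purrr)

# Hypothetical starting table and a list of tables to join onto it
base <- tibble(id = 1:4, value = c(10, 20, 30, 40))
extras <- list(
  tibble(id = c(1L, 2L), score_a = c(0.1, 0.2)),
  tibble(id = c(2L, 3L), score_b = c(0.5, 0.6))
)

# reduce() feeds the result of each join (acc) into the next call,
# starting from base; each step joins one more table (nxt) by "id"
joined <- reduce(
  extras,
  function(acc, nxt) left_join(acc, nxt, by = "id"),
  .init = base
)
```

The result has one row per row of base and one additional column per joined table, with NA where a table has no matching id.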

This is my first answer; I'm new here. I had a similar problem recently, and join_all was the best solution I found.
library(plyr)
library(readr)
# list the files saved on your computer, for example, in txt format
files <- list.files("path", pattern = "\\.txt$", full.names = TRUE)
# read the files and store them as a list
list_of_data_frames <- lapply(files, read_delim, delim = "\t")
# merge the data frames
merged_file <- join_all(list_of_data_frames, by = NULL)

dplyr group_by() %>% summarize_all() throws error "Error: cannot modify grouping variable" with back-ticked variables?

I was having an issue with a group_by() %>% summarize_all() operation, where it was throwing an error
"Error: cannot modify grouping variable"
even though summarize_all() is not supposed to modify the grouping variable. The grouping variable's name contained a space and was back-ticked. I tried many, many things, and it finally behaved when I changed the variable name so that it no longer contained a space.
Is this normal behavior? I've grown to like back-ticked variable names since they are more human-readable.
Here's a self-contained example:
x <- sample(c("Dog", "Banana", "Happy"), 2000, replace = TRUE)
y <- runif(2000, min = 0, max = 20)
df <- data.frame(`Grouping Variable` = x,
value = y)
df2 <- as.tbl(df)
df3 <- df2 %>% group_by(Grouping.Variable) %>% summarize_all(sum)
df3
# all good!
df3 <- df2 %>% group_by(`Grouping.Variable`) %>% summarize_all(sum)
df3
# all good!
df2.2 <- df2 %>% rename(`Grouping Variable` = Grouping.Variable)
df3.2 <- df2.2 %>% group_by(`Grouping Variable`) %>% summarize_all(sum)
# Noooo!!!

Make a list element of each group with dplyr's group_by function

I would like to be able to use more automation when creating SpatialLines objects from otherwise tidy data frames.
library(sp)
#create sample data
sample_data <- data.frame(group_id = rep(c("a", "b","c"), 10),
x = rnorm(10),
y = rnorm(10))
#How can I recreate this using dplyr?
a_list <- Lines(list(Line(sample_data %>% filter(group_id == "a") %>% select(x, y))), ID = 1)
b_list <- Lines(list(Line(sample_data %>% filter(group_id == "b") %>% select(x, y))), ID = 2)
c_list <- Lines(list(Line(sample_data %>% filter(group_id == "c") %>% select(x, y))), ID = 3)
SpatialLines(list(a_list, b_list, c_list))
You can see how using something like group_by would make the process pretty easy if you could understand how the data could be piped into a list.

Using your sample data, a wrapper function, and dplyr::do will give you what you want :)
wrapper <- function(df) {
df %>% select(x,y) %>% as.data.frame %>% Line %>% list %>% return
}
y <- sample_data %>% group_by(group_id) %>%
do(res = wrapper(.))
# and now assign IDs (since we can't do that inside dplyr easily)
ids = 1:dim(y)[1]
SpatialLines(
mapply(x = y$res, ids = ids, FUN = function(x,ids) {Lines(x,ID=ids)})
)
I don't use sp so there might be a better way to assign IDs.
For reference, consider reading Hadley's comments on returning non-data-frame results from dplyr do calls.
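As an alternative to do, the grouping step could also be done with base R's split, which directly yields one list element per group. This is only a sketch, assuming the sp package is installed; it uses the group names as IDs instead of the numeric IDs in the question, and it draws 30 random points rather than recycling 10.

```r
library(sp)

set.seed(1)  # assumption: fixed seed for reproducible sample data
sample_data <- data.frame(group_id = rep(c("a", "b", "c"), 10),
                          x = rnorm(30),
                          y = rnorm(30))

# One data frame of coordinates per group, named by group_id
coords_by_group <- split(sample_data[, c("x", "y")], sample_data$group_id)

# Build one Lines object per group, using the group name as its ID
lines_list <- Map(
  function(coords, id) Lines(list(Line(as.matrix(coords))), ID = id),
  coords_by_group,
  names(coords_by_group)
)

sl <- SpatialLines(unname(lines_list))
```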

R Data Wrangling for Emails

Need help! This is a work-related project. I need to clean 16,000 emails and am expected to do it by hand :( I need to find a way to pull the domain name from each email and place it into a new column, and parse the name into a new column as well, while still keeping the original email. The data is partially complete.
library(tidyr)
library(magrittr)
Email.Address <- c('john.doe#abccorp.com','jdoe#cisco.com','johnd#widgetco.com')
First.Name <- c('John', 'JDoe','NA' )
Last.Name <- c('Doe','NA','NA')
Company <- c('NA','NA','NA')
data <- data.frame(Email.Address, First.Name, Last.Name, Company)
separate_DF <- data %>% separate(Email.Address, c("Company"), sep="#")

Try this:
df <- data.frame(Email.Address, First.Name, Last.Name, Company, stringsAsFactors = FALSE)
Corp <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",2),"[.]"),"[[",1)
F.Name <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",1), "[.]"),"[[",1)
L.Name <- sapply(strsplit(sapply(strsplit(df$Email.Address,"#"),"[[",1),"[.]"),tail,n=1)
L.Name[L.Name == F.Name] <- NA
OUT <- data.frame(df$Email.Address, F.Name, L.Name, Corp)
df[df=="NA" |is.na(df)] <- OUT[df=="NA" |is.na(df)]
df
The separate function from tidyr looks good too.
http://blog.rstudio.org/2014/07/22/introducing-tidyr/
From the information you have given, this also works:
library(tidyr)
df <- data.frame(Email.Address, First.Name, Last.Name, Company)
df2 <- separate(df, Email.Address, into = c("Name", "Corp"), sep = "#")
df2 <- separate(df2, Name, into = c("F.Name", "L.Name"), sep = "[.]", extra = "drop")
df2 <- separate(df2, Corp, into = c("Corp", "com"), sep = "[.]", extra = "drop")
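For comparison, the same pieces can be pulled out with base R's sub while leaving the original address column untouched. This is a sketch on the question's three sample addresses, which use "#" as the separator exactly as posted; the column names local_part, domain and company are illustrative choices, not from the thread.

```r
Email.Address <- c("john.doe#abccorp.com", "jdoe#cisco.com", "johnd#widgetco.com")

# Everything before the separator: the name part of the address
local_part <- sub("#.*$", "", Email.Address)

# Everything after the separator: the full domain
domain <- sub("^.*#", "", Email.Address)

# The domain up to its first dot, used here as a stand-in for the company name
company <- sub("\\..*$", "", domain)

# Original address is kept alongside the parsed columns
parsed <- data.frame(Email.Address, local_part, domain, company,
                     stringsAsFactors = FALSE)
```

Splitting local_part further into first and last name only works when the address actually contains a dot, as in the first sample row.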

How to get the name of a data.frame within a list?

How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a workaround:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)

I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
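The same idea, iterating over the data frames and their names together, can be sketched without plyr. This assumes base R only; Map and aggregate are swapped in for lapply and ddply, the summed column keeps the name value rather than checkSum, and the set.seed call is added for reproducibility.

```r
set.seed(1)  # assumption: fixed seed for reproducible sample data
df1 <- data.frame(value = rnorm(100), cat = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = c(rep(1, 50), rep(2, 50)),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# Map() walks the data frames and their names in parallel, so each call
# receives the data (d) and the name of its group column (nm)
result <- Map(
  function(d, nm) aggregate(d["value"], by = d[c("idx", nm)], FUN = sum),
  mylist,
  names(mylist)
)
```

Because mylist is named, the result keeps the names cat and cat2, so no separate names(l) <- step is needed.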

Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = someListOfDataframes,
cat = names(someListOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))

You could first use names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each list name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)
