I have the following large dataframes:
Jan_Feb2019
Mar_Apr2019
May_Jun2019
Jul_Aug2019
Sep_Oct2019
Nov_Dec2019
Jan_Feb2020
Mar_2020
I use the following code to generate new data frames and fill their columns with the data I want.
#Jan_Feb2019
Jan_Feb2019_df <- as.data.frame(Jan_Feb2019$reactions$summary$total_count)
colnames(Jan_Feb2019_df)[1] <- "Reactions"
Jan_Feb2019_df$Shares <- Jan_Feb2019$shares$count
Jan_Feb2019_df$Comments <- Jan_Feb2019$comments$summary$total_count
Jan_Feb2019_df$Message <- Jan_Feb2019$message
Jan_Feb2019_df$Likes <- Jan_Feb2019$likes$summary$total_count
Jan_Feb2019_df$CreatedDate <- Jan_Feb2019$created_time
Jan_Feb2019_df$PostID <- Jan_Feb2019$id
Jan_Feb2019_df$Love <- Jan_Feb2019$reacts_love$summary$total_count
Jan_Feb2019_df$Angry <- Jan_Feb2019$reacts_angry$summary$total_count
Jan_Feb2019_df$Sad <- Jan_Feb2019$reacts_sad$summary$total_count
Jan_Feb2019_df$HAHA <- Jan_Feb2019$reacts_haha$summary$total_count
Jan_Feb2019_df$WOW <- Jan_Feb2019$reacts_wow$summary$total_count
Jan_Feb2019_df$CreatedDate <- anytime(Jan_Feb2019_df[,6])
Jan_Feb2019_df$insights.data <- Jan_Feb2019$insights$data
Jan_Feb2019_df <- Jan_Feb2019_df %>%
unnest(insights.data) %>%
unnest(values) %>%
select(Message,Shares,Comments,Reactions,Likes,CreatedDate,PostID,Love,Angry,Sad,HAHA,WOW,name,value) %>%
pivot_wider(names_from = name, values_from = value)
Is there a way to iterate over all of the above data frames, so I won't have to repeat the process 8 times?
Thanks
The code below is untested. I have tried to follow the code in the question while making it general. There are 2 functions.
fillNewDf takes the old object as its only argument, then creates and fills the new data frame.
makeNewDf takes the old object's name as an argument, fetches the object with get(), and calls fillNewDf, returning its value.
If the objects are in the global environment, the default value of makeNewDf's envir argument can be used.
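library(dplyr)    # %>% and select()
library(tidyr)    # unnest() and pivot_wider()
library(anytime)  # anytime()
fillNewDf <- function(X){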
vec <- X[['reactions']][['summary']][['total_count']]
Y <- data.frame(Reactions = vec)
Y[['Shares']] <- X[['shares']][['count']]
Y[['Comments']] <- X[['comments']][['summary']][['total_count']]
Y[['Message']] <- X[['message']]
Y[['Likes']] <- X[['likes']][['summary']][['total_count']]
Y[['CreatedDate']] <- X[['created_time']]
Y[['PostID']] <- X[['id']]
Y[['Love']] <- X[['reacts_love']][['summary']][['total_count']]
Y[['Angry']] <- X[['reacts_angry']][['summary']][['total_count']]
Y[['Sad']] <- X[['reacts_sad']][['summary']][['total_count']]
Y[['HAHA']] <- X[['reacts_haha']][['summary']][['total_count']]
Y[['WOW']] <- X[['reacts_wow']][['summary']][['total_count']]
Y[['CreatedDate']] <- anytime(Y[, 6])
Y[['insights.data']] <- X[['insights']][['data']]
Y %>%
unnest(insights.data) %>%
unnest(values) %>%
select(Message, Shares, Comments, Reactions, Likes, CreatedDate, PostID, Love, Angry, Sad, HAHA, WOW, name, value) %>%
pivot_wider(names_from = name, values_from = value)
}
makeNewDf <- function(X, envir = .GlobalEnv){
DF <- get(X, envir = envir)
fillNewDf(DF)
}
Now get the names of the objects to be processed with ls() and create a list with the new data frames.
old_names <- ls(pattern = '\\d{4}$')
new_list <- lapply(old_names, makeNewDf)
names(new_list) <- paste(old_names, "df", sep = "_")
If these new data frames are to become objects in the global environment, list2env(new_list) will create them with the same names as the names attribute of new_list.
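The get-by-name, transform, then list2env pattern described above can be sketched with toy data in base R. Everything here is an assumption made for illustration: the two small data frames stand in for the large objects, summarise_shares is a hypothetical stand-in for fillNewDf, and the results go into a separate environment (target) rather than the global one so nothing is overwritten.

```r
# Toy stand-ins for the large objects (hypothetical data)
Jan_Feb2019 <- data.frame(id = 1:2, shares = c(3, 5))
Mar_Apr2019 <- data.frame(id = 3:4, shares = c(7, 1))

# Hypothetical per-object transformation, standing in for fillNewDf
summarise_shares <- function(df) {
  data.frame(PostID = df$id, Shares = df$shares)
}

old_names <- c("Jan_Feb2019", "Mar_Apr2019")

# mget() fetches several objects by name and returns a named list
new_list <- lapply(mget(old_names, envir = environment()), summarise_shares)
names(new_list) <- paste(old_names, "df", sep = "_")

# Create one object per list element in a dedicated environment
target <- new.env()
list2env(new_list, envir = target)
ls(target)
```

Passing envir = .GlobalEnv to list2env instead of target would create the objects in the global environment, as in the answer.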
I'm looking for a more efficient way to write the following:
Read in all my Excel files
DF1 <- read_excel(DF1, sheet = "ABC", range = cell_cols(1:10) )
DF2 <- read_excel(DF2, sheet = "ABC", range = cell_cols(1:10) )
etc...
DF50 <- read_excel(DF50, sheet = "ABC", range = cell_cols(1:10) )
Add a column to each DF with a location
DF1$Location <- location1
DF2$Location <- location2
etc...
DF50$Location <- location50
Keep only columns with specified names, get rid of blank rows, and convert column CR_NUMBER to an integer
library(hablar)
DF1 <- DF1 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
DF2 <- DF2 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
etc...
DF50 <- DF50 %>% select(all_of(colnames_r)) %>% filter(!is.na(NAME)) %>% convert(int(CR_NUMBER))
You can try the following, which collects the data in a list:
library(readxl)
library(hablar)
library(dplyr)
#Get the complete path of file which has name "DF" followed by a number.
file_names <- list.files('/folder/path', pattern = 'DF\\d+', full.names = TRUE)
list_data <- lapply(seq_along(file_names), function(x) {
data <- read_excel(file_names[x], sheet = "ABC", range = cell_cols(1:10))
data %>%
mutate(Location = paste0('location', x)) %>%
select(all_of(colnames_r)) %>%
filter(!is.na(NAME)) %>%
convert(int(CR_NUMBER))
})
list_data is a list of data frames, which is usually easier to manage than 50 separate data frames in the global environment. If you still want the data frames as separate objects, name the list and use list2env.
names(list_data) <- paste0('DF', seq_along(list_data))
list2env(list_data, .GlobalEnv)
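As a self-contained illustration of the same read, tag, clean loop, here is a minimal sketch that can be run without any Excel files. It is an assumption-laden stand-in: it writes three small CSV files to a temporary folder and uses base R (read.csv, as.integer) in place of read_excel and hablar::convert; the file contents are made up.

```r
# Create a temporary folder with three small demo files (hypothetical data)
tmp <- tempfile("df_demo_")
dir.create(tmp)
for (i in 1:3) {
  write.csv(data.frame(NAME = c("a", NA), CR_NUMBER = c("1", "2")),
            file.path(tmp, sprintf("DF%d.csv", i)), row.names = FALSE)
}

file_names <- list.files(tmp, pattern = "^DF\\d+\\.csv$", full.names = TRUE)

list_data <- lapply(seq_along(file_names), function(i) {
  d <- read.csv(file_names[i], stringsAsFactors = FALSE)
  d$Location <- paste0("location", i)       # tag each table with its location
  d <- d[!is.na(d$NAME), ]                  # drop blank rows
  d$CR_NUMBER <- as.integer(d$CR_NUMBER)    # convert the column to integer
  d
})
names(list_data) <- paste0("DF", seq_along(list_data))

# Optionally stack everything into one data frame
combined <- do.call(rbind, list_data)
```

One thing to watch for: list.files() returns names in alphabetical order, so with 50 files DF10 sorts before DF2. If the position in the list is used to pick the location, it may be safer to derive the number from the file name itself.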
The gist of this question is that I have some R code which works fine on a local data frame but fails on a Spark data frame, even though the two tables are otherwise identical.
In R, given a data frame whose columns are all character, one can dynamically cast to numeric every column that can safely be converted, using the following code:
require(dplyr)
require(varhandle)
require(sparklyr)
checkNumeric <- function(column)
{
column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
typeCast <- function(df)
{
columns <- colnames(df)
numericIdx <- df %>% mutate(across(columns, checkNumeric)) %>% .[1,]
doThese <- columns[which(numericIdx==T)]
df <- df %>% mutate_at(all_of(vars(doThese)), as.numeric)
return(df)
}
For a trivial example, one could run:
df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
newDF <- df %>% typeCast
class(df$Sepal.Length)
class(newDF$Sepal.Length)
Now, this code will not work on a dataset like starwars, which has composite columns. But for other dataframes, I would expect this code to work just fine on a Spark data frame. It doesn't. That is:
sc <- spark_connect('yarn', config=config) # define your Spark configuration somewhere, that's outside the scope of this question
df <- copy_to(sc, iris, "iris")
newDF <- df %>% typeCast
Will fail with the following error.
Error in .[1, ] : incorrect number of dimensions
When debugging, if we try to run this code:
columns <- colnames(df)
df %>% mutate(across(columns, checkNumeric))
This error is returned:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
What gives? Why would the code work fine on a local data frame, but not a Spark data frame?
I didn't find an exact solution per se, but I did find a workaround.
typeCheckPartition <- function(df)
{
require(dplyr)
require(varhandle)
checkNumeric <- function(column)
{
column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
# this works on non-spark data frames
columns <- colnames(df)
numericIdx <- df %>% mutate(across(all_of(columns), checkNumeric)) %>% .[1,]
return(numericIdx)
}
typeCastSpark <- function(df, max_partitions = 1000, undo_coalesce = T)
{
# numericIdxDf will have these dimensions: num_partition rows x num_columns
# so long as num_columns is not absurd, this coalesce should make collect a safe operation
num_partitions <- sdf_num_partitions(df)
if (num_partitions > max_partitions)
{
undo_coalesce <- T && undo_coalesce
df <- df %>% sdf_coalesce(max_partitions)
} else
{
undo_coalesce <- F
}
columns <- colnames(df)
numericIdxDf <- df %>% spark_apply(typeCheckPartition, packages=T) %>% collect
numericIdx <- numericIdxDf %>% as.data.frame %>% apply(2, all)
doThese <- columns[which(numericIdx==T)]
df <- df %>% mutate_at(all_of(vars(doThese)), as.numeric)
if (undo_coalesce)
df <- df %>% sdf_repartition(num_partitions)
return(df)
}
Just run typeCastSpark on your Spark data frame and it will cast to numeric every column that can be safely converted.
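The logic of the workaround (check each partition separately, then keep only the columns that pass in every partition) can be tried locally without a Spark cluster. The sketch below is only an approximation under stated assumptions: partitions are simulated with split(), and is_numeric_like is a hypothetical base-R check used here in place of varhandle::check.numeric.

```r
df <- data.frame(a = c("1", "2.5", "3", "4"),
                 b = c("x", "1", "2", "3"),
                 stringsAsFactors = FALSE)

# Hypothetical check: TRUE when every non-missing value parses as a number
is_numeric_like <- function(x) {
  all(is.na(x) | !is.na(suppressWarnings(as.numeric(x))))
}

# Simulate two partitions of two rows each
partitions <- split(df, rep(1:2, each = 2))

# One row of TRUE/FALSE per partition, one column per variable
per_partition <- do.call(rbind, lapply(partitions, function(p) {
  as.data.frame(lapply(p, is_numeric_like))
}))

# A column is cast only if it passed the check in every partition
numeric_idx <- apply(per_partition, 2, all)
do_these <- names(numeric_idx)[numeric_idx]
df[do_these] <- lapply(df[do_these], as.numeric)
```

Here column b passes in the second partition but not in the first, which is why the per-partition results have to be combined with all() before casting.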
I have a question about the map function from the purrr package.
I can successfully pass a list of data frames to a function using map.
The output remains a list, and that's my issue: I need to have the modified data frames as R objects.
As an example with the mtcars dataset:
#I create a second df
mtcars2 <- mtcars
#change one variable just to distinguish them
mtcars2$mpg <- mtcars2$mpg / 2
#create the list
dflist <- list(mtcars,mtcars2)
#then, a simple function example
my_fun <- function(x)
{x <- x %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
}
#then, using map, this works and prints the desired results
list_results <- map(dflist,my_fun)
But I need to have the modified mtcars and mtcars2 saved as R objects (data frames).
Should I add a "save" option of some kind to my function ?
Should I use map_df or dmap ? (My trials were unsuccessful)
In advance, thanks a lot to all of you !
Here is an attempt:
library(purrr)
library(tidyverse)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
dflist <- list(mtcars,mtcars2)
To save the objects, one needs to give them specific names and use:
assign("name", object, envir = .GlobalEnv)
Here is one way to achieve that:
my_fun <- function(x, list) {
listi <- list[[x]]
assign(paste0("object_from_function_", x), listi, envir = .GlobalEnv)
x <- listi %>%
summarise(`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
return(x)
}
my_fun has two arguments: x, an index taken from seq_along(list) and used to generate a specific name, and list, the list to be processed.
The call below saves two objects, object_from_function_1 and object_from_function_2:
list_results <- map(seq_along(dflist), my_fun, dflist)
Another approach would be to use list2env outside of the map function, as akrun suggested:
dflist <- list(mtcars,mtcars2)
names(dflist) <- c("mtcars","mtcars2")
list2env(dflist, envir = .GlobalEnv) #this will create two objects `mtcars` and `mtcars2`
Then run map after you have created the objects, as you have already done.
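For comparison, here is a minimal base-R sketch of the named-list route applied to the summaries themselves rather than to the inputs. The names sum_mpg_cyl, the "_stats" suffix, and the stats_env environment are assumptions made for this example; lapply is used in place of purrr::map, and a separate environment is used so the original mtcars is left untouched.

```r
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2

# Naming the list elements lets the results carry the same names
dflist <- list(mtcars = mtcars, mtcars2 = mtcars2)

# Hypothetical summary function, written in base R
sum_mpg_cyl <- function(d) {
  data.frame(sum_mpg = sum(d$mpg), sum_cyl = sum(d$cyl))
}

results <- lapply(dflist, sum_mpg_cyl)

# Turn each summary into its own object, e.g. mtcars_stats, mtcars2_stats
stats_env <- new.env()
list2env(setNames(results, paste0(names(results), "_stats")), envir = stats_env)
ls(stats_env)
```

Using envir = .GlobalEnv instead of stats_env would create the objects in the global environment.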
Here is a solution using purrr::walk() with get() and assign(). It is similar to those above, but not identical.
library(dplyr)
library(purrr)
data(mtcars)
Create the second data frame.
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
Create the function to apply to each data frame.
sum_mpg_cyl <- function(.data) {
.data %>%
summarise(
`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
}
Apply sum_mpg_cyl() to mtcars and mtcars2, saving two data frames of summary stats by the same names to the global environment. A potential advantage of this method is that you do not need to create a separate list of data frames.
walk(
.x = c("mtcars", "mtcars2"),
.f = function(df_name) {
# Get the data frame from the global environment
df <- get(df_name, envir = .GlobalEnv)
# Calculate the summary statistics
df <- sum_mpg_cyl(df)
# Save the data frames containing summary statistics back to the global
# environment
assign(df_name, df, envir = .GlobalEnv)
}
)
I would probably also use an anonymous function and save the two data frames of summary stats with different names like this:
# Reset the data
data(mtcars)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$mpg / 2
walk(
.x = c("mtcars", "mtcars2"),
.f = function(df_name) {
# Get the data frame from the global environment
df <- get(df_name, envir = .GlobalEnv)
# Calculate the summary statistics
df <- df %>%
summarise(
`sum of mpg` = sum(mpg),
`sum of cyl` = sum(cyl)
)
# Rename the data frames containing summary statistics to distinguish
# them from the input data frames
new_df_name <- paste(df_name, "stats", sep = "_")
# Save the data frames containing summary statistics back to the global
# environment
assign(new_df_name, df, envir = .GlobalEnv)
}
)
I have a data frame uuu_df whose records are website links:
dim(uuu_df)
output
1950 1
uuu_df
1) http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs
2) http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=2&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs
3) http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=3&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs
.
.
.
1950) http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=>5&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=20-Crores&BudgetMax=20-Crores
Here I'm trying to scrape data using those multiple links from the data frame, with one condition: if the text of the HTML element is equal to "No Results Found!", skip that record and move on to the next one.
This is the snippet for that check:
UrlPage <- html("http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=2&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs")
ImgNode <- UrlPage %>% html_node("div.noResultHead")
u=ImgNode
u=as(u,"character")
u=paste("No",word(string = u, start = 4, end = 5),sep = " ")
Here is what I have tried
wines=data.frame()
url_test=c()
UrlPage_test=c()
u=c()
ImgNode=c()
for(i in 1:dim(uuu_df)[1]){
url_test[i]=as.character(uuu_df[i,])
UrlPage_test[i] <- html(url_test[i])
ImgNode[i] <- UrlPage_test[i] %>% html_node("div.noResultHead")
u[i]=ImgNode[i]
u[i]=as(u[i],"character")
u[i]=paste("No",word(string = u, start = 4, end = 5),sep = " ")
if(u[i]=="No Results Found!") next
{
map_df(1:5, function(i) # here 1:5 is number of webpages of a website
{
# simple but effective progress indicator
cat(".")
pg <- read_html(sprintf(url_test, i))
data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
excerpt=html_text(html_nodes(pg, ".postedOn")),
locality=html_text(html_nodes(pg,".localityFirst")),
society=html_text(html_nodes(pg,'.labValu .stop-propagation:nth-child(1)')),
stringsAsFactors=FALSE)
}) -> wines
}
But the wines data frame comes back empty, with no rows and no columns.
Why are rows not being appended to it?
Any suggestion will be helpful. Thanks in advance.
P.S.: dput() of reproducible data
text1="http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom="
text2="1"
text3="&proptype="
text4="Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment"
text5="&cityName=Thane&BudgetMin="
text6="&BudgetMax="
bhk=c("1","2","3","4","5",">5")
budg_min=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
budg_max=c("5-Lacs","10-Lacs","20-Lacs","30-Lacs","40-Lacs","50-Lacs","60-Lacs","70-Lacs","80-Lacs","90-Lacs","1-Crores","1.2-Crores","1.4-Crores","1.6-Crores","1.8-Crores","2-Crores","2.3-Crores","2.6-Crores","3-Crores","3.5-Crores","4-Crores","4.5-Crores","5-Crores","10-Crores","20-Crores")
eg <- expand.grid(bhk = bhk, budg_min = budg_min, budg_max = budg_max)
eg <- eg[as.integer(eg$budg_min) <= as.integer(eg$budg_max),]
uuu <- sprintf("%s%s%s%s%s%s%s%s", text1,eg[,1],text3,text4,text5,eg[,2],text6,eg[,3])
uuu_df=data.frame(Links=uuu)
dput(uuu_df)
You should take advantage of the document tree to consistently find the elements you need and control the flow of the loop or vectorized function. In the example below I check the result count to determine if there are results, then parse each node individually to ensure it's consistent. Finally, you can bind them if needed.
Side Note: llply has the .progress argument which more elegantly handles the progress indicator you were trying to devise using cat().
options(stringsAsFactors = FALSE)
library(plyr)
library(dplyr)
library(xml2)
uuu_df <- data.frame(x = c('http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=1&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=2&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=5-Lacs',
'http://www.magicbricks.com/property-for-sale/residential-real-estate?bedroom=3&proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment&cityName=Thane&BudgetMin=5-Lacs&BudgetMax=90-Lacs'))
urlList <- llply(uuu_df[,1], function(url){
this_pg <- read_html(url)
results_count <- this_pg %>%
xml_find_first(".//span[@id='resultCount']") %>%
xml_text() %>%
as.integer()
if(results_count > 0){
cards <- this_pg %>%
xml_find_all('//div[@class="SRCard"]')
df <- ldply(cards, .fun=function(x){
y <- data.frame(wine = x %>% xml_find_first('.//span[@class="agentNameh"]') %>% xml_text(),
excerpt = x %>% xml_find_first('.//div[@class="postedOn"]') %>% xml_text(),
locality = x %>% xml_find_first('.//span[@class="localityFirst"]') %>% xml_text(),
society = x %>% xml_find_first('.//div[@class="labValu"]') %>% xml_text() %>% gsub('\\n', '', .))
return(y)
})
} else {
df <- NULL
}
return(df)
}, .progress = 'text')
names(urlList) <- uuu_df[,1]
bind_rows(urlList)
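The node-by-node parsing idea can be tried offline on a small inline HTML string. This is a sketch under assumptions: the markup below is invented for the example (it only mimics the class names used above, not the real site), and the columns are reduced to two. It shows why parsing each card individually keeps rows aligned: a card with a missing field yields NA instead of shifting the values of other cards.

```r
library(xml2)

# Invented markup that mimics the structure assumed in the answer
html <- '<html><body>
<span id="resultCount">2</span>
<div class="SRCard"><span class="agentNameh">Agent A</span><span class="localityFirst">Thane West</span></div>
<div class="SRCard"><span class="agentNameh">Agent B</span></div>
</body></html>'

pg <- read_html(html)

# Attribute tests in XPath use @, e.g. [@id='resultCount']
n <- as.integer(xml_text(xml_find_first(pg, ".//span[@id='resultCount']")))

cards <- xml_find_all(pg, "//div[@class='SRCard']")

# One row per card; a missing node gives NA rather than being dropped
rows <- lapply(cards, function(card) {
  data.frame(
    agent    = xml_text(xml_find_first(card, ".//span[@class='agentNameh']")),
    locality = xml_text(xml_find_first(card, ".//span[@class='localityFirst']")),
    stringsAsFactors = FALSE
  )
})

listings <- if (n > 0) do.call(rbind, rows) else NULL
```

By contrast, extracting each field across the whole page with html_nodes() returns vectors of different lengths whenever a card lacks a field, which is one way a data.frame() call over those vectors can fail.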
Consider working with one large list, built using lapply to iterate through the URL column of the data frame, instead of managing many smaller vectors:
urlList <- lapply(uuu_df[, 1], function(url){
UrlPage <- html(as.character(url))
ImgNode <- UrlPage %>% html_node("div.noResultHead")
u <- paste("No", word(string = as(ImgNode, "character"), start=4, end=5), sep=" ")
cat(".")
pg <- read_html(as.character(url))
if(u!="No Results Found!") {
df <- data.frame(wine=html_text(html_nodes(pg, ".agentNameh")),
excerpt=html_text(html_nodes(pg, ".postedOn")),
locality=html_text(html_nodes(pg,".localityFirst")),
society=html_text(html_nodes(pg,'.labValu .stop-propagation:nth-child(1)')),
stringsAsFactors=FALSE)
} else {
# ASSIGN EMPTY DATAFRAME (FOR CONSISTENT STRUCTURE)
df <- data.frame(wine=c(), excerpt=c(), locality=c(), society=c())
}
# RETURN NAMED LIST
return(list(UrlPage=UrlPage, ImgNode=ImgNode, u=u, df=df))
})
# ROW BIND ONLY DATAFRAME ELEMENT FROM LIST
wines <- map_df(urlList, function(u) u$df)
How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want to have its name for use within another function. Here's the use case, in case you would rather suggest a work around:
lapply(somelistOfDataframes, function(X) {
ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
EDIT: Here's a reproducible example:
df1 <- data.frame(value = rnorm(100), cat = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
df2 <- data.frame(value = rnorm(100,8), cat2 = c(rep(1,50),
rep(2,50)), idx = rep(letters[1:4],25))
mylist <- list(cat = df1, cat2 = df2)
lapply(mylist, head, 5)
I'd use the names of the list in this fashion:
dat1 = data.frame()
dat2 = data.frame()
l = list(dat1 = dat1, dat2 = dat2)
> str(l)
List of 2
$ dat1:'data.frame': 0 obs. of 0 variables
$ dat2:'data.frame': 0 obs. of 0 variables
and then use lapply + ddply like:
lapply(names(l), function(x) {
ddply(l[[x]], c("idx", x), summarise,checkSum = sum(value))
})
This remains untested without a reproducible example, but it should point you in the right direction.
EDIT (ran2): Here's the code using the reproducible example.
l <- lapply(names(mylist), function(x) {
ddply(mylist[[x]], c("idx", x), summarise,checkSum = sum(value))
})
names(l) <- names(mylist); l
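An alternative that avoids indexing back into the list is to iterate over the data frames and their names in parallel with Map(). The sketch below is an assumption-based variant: it reuses the reproducible example from the question (with a fixed seed added) and swaps ddply for base R's aggregate so that no extra package is needed.

```r
set.seed(1)  # added here only to make the toy data repeatable
df1 <- data.frame(value = rnorm(100), cat = rep(1:2, each = 50),
                  idx = rep(letters[1:4], 25))
df2 <- data.frame(value = rnorm(100, 8), cat2 = rep(1:2, each = 50),
                  idx = rep(letters[1:4], 25))
mylist <- list(cat = df1, cat2 = df2)

# Map() passes each data frame together with its name in the list
sums <- Map(function(df, nm) {
  # group by idx and by the column that shares the list element's name
  aggregate(df["value"], by = df[c("idx", nm)], FUN = sum)
}, mylist, names(mylist))

names(sums)
```

Because the first argument to Map() is a named list, the result keeps those names, so no separate names(l) <- names(mylist) step is needed.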
Here is the dplyr equivalent
library(dplyr)
catalog =
data_frame(
data = someListOfDataframes,
cat = names(someListOfDataframes)) %>%
rowwise %>%
mutate(
renamed =
data %>%
rename_(.dots =
cat %>%
as.name %>%
list %>%
setNames("cat")) %>%
list)
catalog$renamed %>%
bind_rows(.id = "number") %>%
group_by(number, idx, cat) %>%
summarize(checkSum = sum(value))
You could first store the names with names(list) -> list_name and then use list_name[1], list_name[2], etc. to get each name. (You may also need as.numeric(list_name[x]) if your list names are numbers.)