Referencing and manipulating data frame in foreach

for(artist in 1:nrow(epop8)){
  message(paste("Artist:", artist))
  id = epop8$spotify_id[artist]
  cur_showDate = epop8$ShowDate[artist]
  dma_show = epop8$DMA_Region_Code[artist]
  spotifySubset = na.omit(spotify[spotify$spotify_id == id,])
  if(nrow(spotifySubset) == 0){
    epop8$last6month[artist] = NA
    next
  }
  monthsOfInterest_6 = epop8$ShowDate[artist] - 180:1
  monthsOfInterest_12 = epop8$ShowDate[artist] - 365:1
  epop8Subset6MO = unique(epop8[epop8$ShowDate %in% monthsOfInterest_6,
                                c("spotify_id", "DMA_Region_Code", "ShowDate")]) %>%
    subset(., DMA_Region_Code == dma_show) %>% arrange(desc(ShowDate))
  epop8Subset1Yr = unique(epop8[epop8$ShowDate %in% monthsOfInterest_12,
                                c("spotify_id", "DMA_Region_Code", "ShowDate")]) %>%
    subset(., DMA_Region_Code == dma_show) %>% arrange(desc(ShowDate))
  last6month = epop8Subset6MO[spotifySubset$relatedID %in% epop8Subset6MO$spotify_id,]
  last12month = epop8Subset1Yr[spotifySubset$relatedID %in% epop8Subset1Yr$spotify_id,]
  epop8$since_related_artist[artist] = ifelse(!nrow(last6month) && !nrow(last12month),
                                              365,
                                              as.double(cur_showDate - last6month$ShowDate[1]))
  epop8$related_artist_count_6MO[artist] = nrow(last6month)
  epop8$related_artist_count_12MO[artist] = nrow(last12month)
}
I am simply trying to "convert" this for loop into a parallelized foreach loop. I've tried to comprehend how to even approach the foreach loop. I've tried
ntasks <- nrow(epop8)
pb <- tkProgressBar(max=nrow(epop8))
progress <- function(n) setTkProgressBar(pb, n)
opts <- list(progress=progress)
foreach(artist=1:nrow(epop8), .combine = combine,
.packages = "dplyr", .options.snow=opts) %dopar%{
...
}
Where "..." is the exact same code that is contained within my for loop above. I know this isn't necessarily the way to do it, but I'm not sure how to update or subset my dataframe at all within a parallelized foreach loop. I've scoured the internet, but I'm beginning to think that foreach is a little over my head.
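A minimal sketch of one common pattern, under assumptions that are not from the original post: with %dopar% the loop body runs on a worker, so an assignment like epop8$col[artist] = ... only changes the worker's copy. Instead, each iteration can return a one-row data frame, .combine = rbind can stack the rows, and the result can be attached to the original data frame after the loop. The toy data frame shows and the column n_same_dma are hypothetical. %do% is used so the sketch runs without a backend; after registering one (for example with doParallel::registerDoParallel()), %do% can be swapped for %dopar%.

```r
library(foreach)

# Hypothetical toy data standing in for epop8
shows <- data.frame(
  spotify_id = c("abcd", "asdf", "robd", "1234"),
  DMA_Region_Code = c("654", "332", "489", "654"),
  stringsAsFactors = FALSE
)

# Each iteration returns a one-row data frame instead of assigning into `shows`
res <- foreach(i = seq_len(nrow(shows)), .combine = rbind) %do% {
  # number of other shows in the same DMA (a made-up metric for illustration)
  same_dma <- sum(shows$DMA_Region_Code == shows$DMA_Region_Code[i]) - 1
  data.frame(row = i, n_same_dma = same_dma)
}

# Attach the results to the original data frame after the loop
shows$n_same_dma <- res$n_same_dma[order(res$row)]
```

Returning the row index alongside the metric keeps the results easy to realign if a different .combine function or an unordered backend is used.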
We have two dataframes, epop8 and spotify. epop8 is a dataframe full
of receipts from shows around the US and spotify is spotify data for
each artist in epop8.
This code establishes a unique id, a show date, a DMA (region), and a
list of related musical artist IDs to the unique id (spotifySubset).
It then establishes an array of dates in the last 6 months and then
an array of dates in the last 12 months to compare to.
The full dataset of shows is then subsetted down to all shows in the
last 6 months and all shows in the last 12 months within the
specified region.
These two subsets are then subsetted once again to just include
related artists (as determined by spotify) and produces metrics based
on the last6month and last12month data frames.
Using those two dataframes I write results to the original epop8
dataframe for the current artist.
Here is what the datasets look like:
spotify = data.frame(artistName = "Bob", spotify_id = "abcd",
related = c("Al", "Bill", "Charles","Daniel"),
relatedID = c("1234", "efrd", "bcde", "fghi"))
epop8 = data.frame(id = c("abcd", "asdf", "robd", "1234"),
dma = c("654", "332", "489", "654"),
ShowDate = c("2017-10-08", "2011-10-04", "2012-10-01", "2017-08-01"))
So here, when we subset epop8 for the last 6 months and 12 months in the DMA for Bob (the first artist in epop8), we get one match for a related artist's show in that DMA in the last 12 months: relatedID "1234" (Al in the spotify data above) on 2017-08-01.
Hopefully this all makes sense, I'd really love your help on this one!
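The steps described above can be sketched as a function that computes the count for one row, which is also the shape a foreach body would need. This is a simplified illustration, not the original loop: the toy column names are renamed to match the loop (an assumption, since the example data uses id and dma), and related_count is a hypothetical helper.

```r
# Toy data from the question, with column names matched to the loop (an assumption)
spotify <- data.frame(
  spotify_id = "abcd",
  relatedID = c("1234", "efrd", "bcde", "fghi"),
  stringsAsFactors = FALSE
)
epop8 <- data.frame(
  spotify_id = c("abcd", "asdf", "robd", "1234"),
  DMA_Region_Code = c("654", "332", "489", "654"),
  ShowDate = as.Date(c("2017-10-08", "2011-10-04", "2012-10-01", "2017-08-01")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count related-artist shows in the same DMA that fall
# 1 to `window` days before show i
related_count <- function(i, shows, related, window) {
  ids <- related$relatedID[related$spotify_id == shows$spotify_id[i]]
  days_before <- as.numeric(shows$ShowDate[i] - shows$ShowDate)
  sum(shows$spotify_id %in% ids &
        shows$DMA_Region_Code == shows$DMA_Region_Code[i] &
        days_before >= 1 & days_before <= window)
}

counts_6mo <- vapply(seq_len(nrow(epop8)), related_count, integer(1),
                     shows = epop8, related = spotify, window = 180)
counts_12mo <- vapply(seq_len(nrow(epop8)), related_count, integer(1),
                      shows = epop8, related = spotify, window = 365)
```

Because the helper only reads from its inputs and returns a value, the vapply calls could be replaced by a foreach loop without any shared state.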

Related

Problems with binding columns from two data frames using a for loop in R

I have seven pairs of asc files loaded into R, asc[i] and wasc[i], where [i] runs from 1 to 7. I need to combine each wasc[i] with the first column of the matching asc[i] (that is, asc[i][[1]] together with the whole wasc[i] file).
This should be repeated for every pair of asc and wasc files.
The code keeps giving me blank data frames, so I don't know why this doesn't work. The naming is correct, yet the code is not recognizing that the asc[i] and wasc[i] correlate with previously loaded files.
Any help will be greatly appreciated.
# These data frames will reproduce my issue
asc1 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc1 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc2 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc2 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
asc3 <- data.frame(x= c(rep("A.tif", 20)), y = 1:20)
wasc3 <- data.frame(x= c(rep("B.tif", 20)), y = c(rep("Imager",20)))
for (i in 1:3) {
d <- paste("asc", i, sep ="")
f <- paste("wasc", i, sep ="")
full_wing <- as.character(paste("full_wing", i, sep = ""))
assign(full_wing,cbind(d[[1]], f))
}
# Output of full_wing1 data frame
dput(full_wing1)
structure(c("asc1", "wasc1"), .Dim = 1:2, .Dimnames = list(NULL,
c("", "f")))
Additional Information:
asc files are 19 columns long
wasc files are 13 columns long
I only want to combine column 1 from the asc file with the entire wasc file, thus cutting out the remaining 18 columns of the asc file.

# put data in a list
asc = mget(ls(pattern = "^asc"))
wasc = mget(ls(pattern = "^wasc"))
full_wing = Map(f = function(w, a) cbind(w, a[[1]]), w = wasc, a = asc)
Map is a nice shortcut for iterating in parallel over multiple arguments. It returns a list, and you can access the individual elements with, e.g., full_wing[[1]], full_wing[[3]], etc. Map is just a shortcut; the above code is basically equivalent to the for loop below:
results = list()
for (i in seq_along(asc)) {
results[[i]] = cbind(wasc[[i]], asc[[i]][[1]])
}
I use mget to put the data in a list because in your example you already have objects like asc1, asc2, etc. A much better way to go is to never create those variables in the first place, instead read the files directly into a list, something like this:
asc_paths = list.files(pattern = "^asc")
asc = lapply(asc_paths, read.table)
You can see a lot more explanation of this at How to make a list of data frames?
If you only ever need one column of the asc files, another way to simplify this would be to only read in the needed column, see Only read limited number of columns for some recommendations there.
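As a rough illustration of reading only the needed column, read.table accepts "NULL" entries in colClasses to skip columns. The file below is a hypothetical three-column stand-in written to a temporary path; for a real 19-column asc file the vector would be c(NA, rep("NULL", 18)).

```r
# Hypothetical file: three columns, written to a temporary path for the demo
path <- tempfile(fileext = ".txt")
write.table(data.frame(a = c("A.tif", "B.tif"), b = 1:2, c = 3:4),
            path, row.names = FALSE, quote = FALSE)

# NA lets read.table guess the column type; "NULL" skips the column entirely
first_col <- read.table(path, header = TRUE,
                        colClasses = c(NA, "NULL", "NULL"))
```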

how to use dataframe name inside for loop to save different ggplot2 plots in R

I have a data frame (all.table) that I have subset into 3 different data frames (A1.table, B25.table, and C48.table)
all.table = read.table(file.path(input_file_name), header=T, sep = "\t")
A1.table = subset(all.table, ID == "A1")
B25.table = subset(all.table, ID == "B25")
C48.table = subset(all.table, ID == "C48")
For each graph type I want, I want to generate it based on all 4 tables
for (i in list(all.table, A1.table, B25.table, C48.table)){
ggplot(i, aes(x=Position, fill=Frequency)) + #other plot options
ggsave(file.path(full_output_path, "uniqueFileName.pdf"))
#additional plots
#additional saves
}
My problem is in the ggsave command: how do I generate the 'uniqueFileName.pdf'? I would like to name the files in some form like all.table.graph1.pdf, all.table.graph2.pdf, A1.table.graph1.pdf, A1.table.graph2.pdf, etc.
My question is how do I turn the name of the iterator i into a string, and add that string to a '.graph1.pdf' string?
Coming from a python background this seems like it should be rather simple. I am not very versed in R (as is likely obvious from this question) and anything resembling an answer I have found seems incredibly over complicated.

This is a workflow that uses the tidyverse suite of functions. iwalk is similar to lapply in base, but it requires a function that takes 2 arguments, and it automatically inputs the names of the list as the 2nd argument.
The short answer for what you want is paste0, which lets you combine strings.
library(tidyverse)
all.table %>%
filter(ID %in% c("A1", "B25", "C48")) %>% # only needed if there are more IDs than the 3 explicitly listed
split(., .$ID) %>% # creates the list of data frames
c(list(all.table = all.table), .) %>% # adds "all.table" as a list element
iwalk(function(df, label) {
ggplot(df, aes(x = Position, fill = Frequency)) +
...
ggsave(file.path(full_output_path, paste0(label, ".graph1.pdf")))
})

Figured out a solution by looking for a python dictionary equivalent:
all.table = read.table(file.path(input_file_name), header=T, sep = "\t")
A1.table = subset(all.table, ID == "A1")
B25.table = subset(all.table, ID == "B25")
C48.table = subset(all.table, ID == "C48")
#Generate a named list of tables
list_of_tables = list(all = all.table, A1 = A1.table, B25 = B25.table, C48 = C48.table)
for (i in seq_along(list_of_tables)){
  # build the plot first, then pass it to ggsave explicitly
  p = ggplot(list_of_tables[[i]], aes(x=Frequency, fill=Category)) #+ more options
  ggsave(file.path(full_output_path, paste0(names(list_of_tables)[i], ".graph1.pdf")), plot = p)
}
I'm not sure if there is a downside to not using other libraries (ie tidyverse), but this seems like the simplest answer?

How to select a particular dataframe from a list of dataframes in Python equivalent to R?

I have a list of dataframes in R, with which I'm trying to select a particular dataframe as follows:
x = listOfdf$df1$df2$df3
Now, trying hard to find an equivalent way to do so in Python. Like, the syntax on how a particular DataFrame be selected from a list of DataFrames in Pandas Python.

Found a solution to select a particular dataframe/dataframe_column from a list of dataframes.
In R : x = listOfdf$df1$df2$df3
In Python : x = listOfdf['df1']['df2']['df3']
Thank you :)
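For comparison on the R side, the bracket form of the same lookup looks much like the Python one. The nested list below is a hypothetical stand-in, since the original listOfdf is not shown.

```r
# Hypothetical nested list; the element names mirror the question
listOfdf <- list(df1 = list(df2 = list(df3 = data.frame(x = 1:3))))

via_dollar <- listOfdf$df1$df2$df3
via_brackets <- listOfdf[["df1"]][["df2"]][["df3"]]

# Bracket access also works when the name is held in a variable
key <- "df1"
via_variable <- listOfdf[[key]][["df2"]][["df3"]]
```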

I see you've already answered your own question, and that's cool. However, as jezrael hints in his comment, you should really consider using a dictionary. That might sound a bit scary coming from R (been there myself, now I prefer Python in most ways), but it will be worth your effort.
First of all, a dictionary is a way of mapping a value or variable to a key (like a name). You use curly brackets { } to build the dictionary, and use square brackets [ ] to index it.
Let's say that you have two dataframes like this:
import numpy as np
import pandas as pd
np.random.seed(123)
# Reproducible input - Dataframe 1
rows = 10
df_1 = pd.DataFrame(np.random.randint(90,110,size=(rows, 2)), columns=list('AB'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_1['dates'] = datelist
df_1 = df_1.set_index(['dates'])
df_1.index = pd.to_datetime(df_1.index)
##%%
# Reproducible input - Dataframe 2
rows = 10
df_2 = pd.DataFrame(np.random.randint(10,20,size=(rows, 2)), columns=list('CD'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_2['dates'] = datelist
df_2 = df_2.set_index(['dates'])
df_2.index = pd.to_datetime(df_2.index)
With a limited number of dataframes you can easily organize them in a dictionary this way:
myFrames = {'df_1': df_1,
'df_2': df_2}
Now you have a reference to your dataframes, as well as your own defined names or keys. You'll find a more elaborate explanation here.
Here's how you use it:
print(myFrames['df_1'])
You can also use that reference to make changes to one of your dataframes, and add that to your dictionary:
df_3 = myFrames['df_1']
df_3 = df_3*10
myFrames.update({'df_3': df_3})
print(myFrames)
Now let's say that you have a whole bunch of dataframes that you'd like to organize the same way. You can make a list of the names of all available dataframes as described below. However, you should be aware that using eval() is often discouraged, for many reasons.
Anyway, here we go: First you get a list of strings of all dataframe names like this:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
It's more than likely that you won't be interested in ALL of them if you've got a lot going on at the same time. So let's say that the names of all your dataframes of particular interest start with 'df_'. You can isolate them like this:
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
Now you can use that list in combination with eval() to make a dictionary:
myFrames2 = {}
for dfName in dfNames:
    myFrames2[dfName] = eval(dfName)
Now you can loop through that dictionary and do something with each of them.
You could, as an example, take the last column of each dataframe, multiply by 10, and make a new dataframe with those values:
j = 1
for key in myFrames2.keys():
    # Build new column names for your brand new df
    colName = []
    colName.append('column_' + str(j))
    if j == 1:
        # First, make a new df by referencing the dictionary
        df_new = myFrames2[key]
        # Subset the last column and make sure it doesn't
        # turn into a pandas series instead of a dataframe in the process
        df_new = df_new.iloc[:,-1].to_frame()
        # Set new column names
        df_new.columns = colName[:]
    else:
        # df_new already exists, so you can add
        # new columns and names for the rest of the columns
        df_new[colName] = myFrames2[key].iloc[:,-1].to_frame()
    j = j + 1
print(df_new)
Hope you'll find this useful!
And by the way... For your next question, please provide some reproducible code as well as a few words about what solutions you have tried yourself. You can read more about how to ask an excellent question here.
And here is the whole thing for an easy copy&paste:
#%%
# Imports
import pandas as pd
import numpy as np
np.random.seed(123)
# Reproducible input - Dataframe 1
rows = 10
df_1 = pd.DataFrame(np.random.randint(90,110,size=(rows, 2)), columns=list('AB'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_1['dates'] = datelist
df_1 = df_1.set_index(['dates'])
df_1.index = pd.to_datetime(df_1.index)
##%%
# Reproducible input - Dataframe 2
rows = 10
df_2 = pd.DataFrame(np.random.randint(10,20,size=(rows, 2)), columns=list('CD'))
datelist = pd.date_range('2017-01-01', periods=rows).tolist()
df_2['dates'] = datelist
df_2 = df_2.set_index(['dates'])
df_2.index = pd.to_datetime(df_2.index)
print(df_1)
print(df_2)
##%%
# If you don't have that many dataframes, you can organize them in a dictionary like this:
myFrames = {'df_1': df_1,
            'df_2': df_2}
# Now you can reference df_1 in that collection by using:
print(myFrames['df_1'])
# You can also use that reference to make changes to one of your dataframes,
# and add that to your dictionary
df_3 = myFrames['df_1']
df_3 = df_3*10
myFrames.update({'df_3': df_3})
# And now you have a happy little family of dataframes:
print(myFrames)
##%%
# Now let's say that you have a whole bunch of dataframes that you'd like to organize the same way.
# You can make a list of the names of all available dataframes like this:
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
##%%
# It's likely that you won't be interested in all of them if you've got a lot going on.
# Let's say that all your dataframes of interest start with 'df_'
# You get them like this:
dfNames = []
for elem in alldfs:
    if str(elem)[:3] == 'df_':
        dfNames.append(elem)
##%%
# Now you can use that list in combination with eval() to make a dictionary:
myFrames2 = {}
for dfName in dfNames:
    myFrames2[dfName] = eval(dfName)
##%%
# And now you can reference each dataframe by name in that new dictionary:
myFrames2['df_1']
##%%
# Loop through that dictionary and do something with each of them.
j = 1
for key in myFrames2.keys():
    # Build new column names for your brand new df
    colName = []
    colName.append('column_' + str(j))
    if j == 1:
        # First, make a new df by referencing the dictionary
        df_new = myFrames2[key]
        # Subset the last column and make sure it doesn't
        # turn into a pandas series instead of a dataframe in the process
        df_new = df_new.iloc[:,-1].to_frame()
        # Set new column names
        df_new.columns = colName[:]
    else:
        # df_new already exists, so you can add
        # new columns and names for the rest of the columns
        df_new[colName] = myFrames2[key].iloc[:,-1].to_frame()
    j = j + 1
print(df_new)

Copying data from one dataframe to another

I want to rearrange the data in the dataframe which is originally in following format
3 rows for one project
I have extracted this using "rvest" package
library(rvest)
library(plyr)
library(dplyr)
projects<-NULL
thepage = read_html("https://www.99acres.com/search/project/buy/residential/pune?search_type=QS&search_location=SH&lstAcn=NPSEARCH&lstAcnId=9753976212484323&src=CLUSTER&preference=S&city=19&res_com=R&selected_tab=3&isvoicesearch=N&keyword=pune&strEntityMap=IiI%3D&refine_results=Y&Refine_Localities=Refine%20Localities&action=%2Fdo%2Fquicksearch%2Fsearch&searchform=1&price_min=null&price_max=null")
table = data.frame(html_table(x = thepage, fill = TRUE))
table = as.data.frame(t(table))
ResidentialProjects<-rbind(projects,setNames(table, names(table)))
I want all the details about 1 project (Real Estate Project Name) in one row.
I tried writing code for it as follows:
newdf<-data.frame(matrix(ncol = 10),stringsAsFactors = FALSE)
df=ResidentialProjects
projectName=""
count<-0
for(n in 1:nrow(df)){
if(df[n,]$V1!=projectName){
count = count+1
projectName=df[n,]$V1
newdf[count,c(1,2,3,4)]=df[n,c(1,2,3,4)]
newdf[count,c(5,6,7)]=df[n+1,c(2,3,4)]
newdf[count,c(8,9,10)]=df[n+2,c(2,3,4)]
}else{
print(n)
next
}
}
But it's giving me a table of numbers like
Output newdf
What is the problem? Or is there a better option?
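One likely cause, assuming the scraped columns ended up as factors (which as.data.frame produces by default in older R versions): copying factor cells one at a time can store their integer codes rather than their labels. The sketch below converts the columns to character first and then builds one row per project; the data frame df and the helper one_row are hypothetical stand-ins, not the real scraped table.

```r
# Hypothetical stand-in for ResidentialProjects: three rows per project
df <- data.frame(
  V1 = rep(c("Project A", "Project B"), each = 3),
  V2 = c("a1", "a2", "a3", "b1", "b2", "b3"),
  V3 = c("a4", "a5", "a6", "b4", "b5", "b6"),
  V4 = c("a7", "a8", "a9", "b7", "b8", "b9"),
  stringsAsFactors = TRUE  # mimic factor columns
)

# Convert factors to character so values, not integer codes, are copied
df[] <- lapply(df, as.character)

# One row per project: all four columns of row 1, then columns 2-4 of rows 2 and 3
one_row <- function(g) {
  unlist(c(g[1, 1:4], g[2, 2:4], g[3, 2:4]), use.names = FALSE)
}
newdf <- as.data.frame(do.call(rbind, lapply(split(df, df$V1), one_row)),
                       stringsAsFactors = FALSE)
```

Splitting by project name avoids the manual n+1 and n+2 indexing, though it still assumes exactly three rows per project.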

How to import large dataset in r splitting and filtering by 3 different criteria when found

I'm dealing with a couple of txt files of climatological data in which 3 parameters differentiate each chunk of data (parameter measured, station of measurement, and year). Each file has more than a million lines. In the past I manually selected each parameter one at a time, for a given station and year, and read it into R using read.fwf; with files this size that is absurd and inefficient. Is there any way to automate this process, given that the file has an "FF" indicator every time a new parameter for a station and year starts, and that I want to generate separate files or datasets named according to station, year and parameter so I can use them afterwards?
File to read Format
Circled in red is the FF, I guess intended to mark the start of a new set of records.
Circled in black is the name of the parameter measured (there are 8 different parameter classes in total).
Circled in blue is the year of measurement.
Circled in green is the number or identifier of the station of measurement.
In the past, I read just what I needed with read.fwf, given the fixed width of the data, but that column layout does not apply to the header of each table.
PRUEBA3 <- read.fwf("SanIgnacio_Pmax24h.txt", header = FALSE, widths = c(5,4,4,6,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,7,2,10,2),skip=1)
Thanks; any help will be appreciated.

You will need to make a function that loops through the txt files. (The output that you linked to was produced by a database; I assume you don't have access to it).
Here is what the function could look like, using the fast fread from data.table and a foreach loop (you can make the loop parallel by registering a parallel backend and changing %do% into %dopar%):
library(data.table)
library(foreach)
myfiles = dir(pattern = "\\.txt$")
res = foreach(i = seq_along(myfiles)) %do% {
  x = fread(myfiles[i], na.strings = c("", " "))
  # get row indices for start and end dates
  # the "V" variables are column indices, I assume these don't change per file
  start.dia = x[, grep("DIA", V2)] + 2
  end.dia = x[, grep("MEDIA", V2)] - 2
  # get name of station
  estacion.detect = x[, grep("ESTACION", V9)]
  estacion.name = x[estacion.detect, V10]
  # keep only the data rows, then add the station name as a column
  mydf = x[start.dia:end.dia][, estacion := estacion.name]
  # remove empty rows and columns
  junkcol = which(colSums(is.na(mydf)) == nrow(mydf))
  junkrow = which(rowSums(is.na(mydf)) == ncol(mydf))
  if (length(junkcol) > 0) {
    mydf = mydf[, !junkcol, with = FALSE]
  }
  if (length(junkrow) > 0) {
    mydf = mydf[!junkrow, ]
  }
  # further data cleaning
  mydf  # return the cleaned table so foreach collects it
}
# bind all files
all = rbindlist(res)
