Unable to update data in dataframe - r

i tried updating data in dataframe but its unable to get updating
//Initialize data and dataframe here
user_data=read.csv("train_5.csv")
baskets.df=data.frame(Sequence=character(),
Challenge=character(),
countno=integer(),
stringsAsFactors=FALSE)
/Updating data in dataframe here
for(i in 1:length((user_data)))
{
for(j in i:length(user_data))
{
if(user_data$challenge_sequence[i]==user_data$challenge_sequence[j]&&user_data$challenge[i]==user_data$challenge[j])
{
writedata(user_data$challenge_sequence[i],user_data$challenge[i])
}
}
}
writedata=function( seqnn,challng)
{
#print(seqnn)
#print(challng)
newRow <- data.frame(Sequence=seqnn,Challenge=challng,countno=1)
baskets.df=rbind(baskets.df,newRow)
}
//view data here
View(baskets.df)

I've modified your code to what I believe will work. You haven't provided sample data, so I can't verify that it works the way you want. I'm basing my attempt here on a couple of common novice mistakes that I'll do my best to explain.
Your writedata function was written to be a little loose with it's scope. When you create a new function, what happens in the function technically happens in its own environment. That is, it tries to look for things defined within the function, and then any new objects it creates are created only within that environment. R also has this neat (and sometimes tricky) feature where, if it can't find an object in an environment, it will try to look up to the parent environment.
The impact this has on your writedata function is that when R looks for baskets.df in the function and can't find it, R then turns to the Global Environment, finds baskets.df there, and then uses it in rbind. However, the result of rbind gets saved to a baskets.df in the function environment, and does not update the object of the same name in the global environment.
To address this, I added an argument to writedata that is simply named data. We can then use this argument to pass a data frame to the function's environment and do everything locally. By not making any assignment at the end, we implicitly tell the function to return it's result.
Then, in your loop, instead of simply calling writedata, we assign it's result back to baskets.df to replace the previous result.
for(i in 1:length((user_data)))
{
for(j in i:length(user_data))
{
if(user_data$challenge_sequence[i] == user_data$challenge_sequence[j] &&
user_data$challenge[i] == user_data$challenge[j])
{
baskets.df <- writedata(baskets.df,
user_data$challenge_sequence[i],
user_data$challenge[i])
}
}
}
writedata=function(data, seqnn,challng)
{
#print(seqnn)
#print(challng)
newRow <- data.frame(Sequence = seqnn,
Challenge = challng,
countno = 1)
rbind(data, newRow)
}

I'm not sure what you're programming background is, but your loops will be very slow in R because it's an interpreted language. To get around this, many functions are vectorized (which simply means that you give them more than one data point, and they do the looping inside compiled code where the loops are fast).
With that in mind, here's what I believe will be a much faster implementation of your code
user_data=read.csv("train_5.csv")
# challenge_indices will be a matrix with TRUE at every place "challenge" and "challenge_sequence" is the same
challenge_indices <- outer(user_data$challenge_sequence, user_data$challenge_sequence, "==") &
outer(user_data$challenge, user_data$challenge, "==")
# since you don't want duplicates, get rid of them
challenge_indices[upper.tri(challenge_indices, diag = TRUE)] <- FALSE
# now let's get the indices of interest
index_list <- which(challenge_indices,arr.ind = TRUE)
# now we make the resulting data set all at once
# this is much faster, because it does not require copying the data frame many times - which would be required if you created a new row every time.
baskets.df <- with(user_data, data.frame(
Sequence = challenge_sequence[index_list[,"row"]],
challenge = challenge[index_list[,"row"]]
)

Related

Vectorized Operation in R causing problems with custom function

I'm writing out some functions for Inventory management. I've recently wanted to add a "photo url column" to my spreadsheet by using an API I've used successfully while initially building my inventory. My Spreadsheet header looks like the following:
SKU | NAME | OTHER STUFF
I have a getProductInfo function that returns a list of product info from an API I'm calling.
getProductInfo<- function(barcode) {
#Input UPC
#Output List of product info
CallAPI(barcode)
Process API return, remove garbage
return(info)
}
I made a new function that takes my inventory csv as input, and attempts to add a new column with product photo url.
get_photo_url_from_product_info_output <- function(in_list){
#Input GetProductInfo Output. Returns Photo URL, or nothing if
#it doesn't exist
if(in_list$DisplayStockPhotos == TRUE){
return(in_list$StockPhotoURL)
} else {
return("")
}
}
add_Photo_URL <- function(in_csv){
#Input CSV data frame, appends photourl column
#Requires SKU (UPC) assumes no photourl column
out_csv <- mutate(in_csv, photo =
get_photo_url_from_product_info_output(
getProductInfo(SKU)
)
)
}
return (out_csv)
}
#Call it
new <- add_Photo_URL(old)
My thinking was that R would simply input the SKU of the from the row, and put it through the double function call "as is", and the vectorized DPLYR function mutate would just vectorize it. Unfortunately I was running into all sorts of problems I couldn't understand. Eventually I figured out that API call was crashing because the SKU field was all messed up as it was being passed in. I put in a breakpoint and found out that it wasn't just passing in the SKU, but instead an entire list (I think?) of SKUs. Every Row all at once. Something like this:
#Variable 'barcode' inside getProductInfo function contains:
[1] 7.869368e+11 1.438175e+10 1.256983e+10 2.454357e+10 3.139814e+10 1.256983e+10 1.313260e+10 4.339643e+10 2.454328e+10
[10] 1.313243e+10 6.839046e+11 2.454367e+10 2.454363e+10 2.454367e+10 2.454348e+10 8.418870e+11 2.519211e+10 2.454375e+10
[19] 2.454381e+10 2.454381e+10 2.454383e+10 2.454384e+10 7.869368e+11 2.454370e+10 2.454390e+10 1.913290e+11 2.454397e+10
[28] 2.454399e+10 2.519202e+10 2.519205e+10 7.742121e+11 8.839291e+11 8.539116e+10 2.519211e+10 2.519211e+10 2.519211e+10
Obviously my initial getProductInfo function can't handle that, so it'll crash.
How should I modify my code, whether it be in the input or API call to avoid this vectorized operation issue?
Well, it's not totally elegant but it works.
I figured out I need to use lapply, which is usually not my strong suit. Initally I tried to nest them like so:
lapply(SKU, get_photo_url_from_product_info_output(getProductInfo())
But that didn't work. So I just came up with bright idea of making another function
get_photo_url_from_sku <- function(barcode){
return(get_photo_url_from_product_info_output(getProductInfo(barcode)))
}
Call that in the lapply:
out_csv<- mutate(in_csv, photocolumn = lapply(SKU, get_photo_url_from_sku))
And it works great. My speed is only limited by my API calls.

Using a function to change a variable in R

I am trying to change a variable in a function but even tho the function is producing the right values, when I go to use them in the next sections, R is still using the initial values.
I created a function to update my variables NetN and NetC:
Reproduction=function(NetN,NetC,cnrep=20){
if(NetC/NetN<=cnrep) {
DeltaC=NetC*p;
DeltaN=DeltaC/cnrep;
Crep=Crep+DeltaC;
Nrep=Nrep+DeltaN;
Brep=(Nrep*14+Crep*12)*2/1e6;
NetN=NetN-DeltaN; #/* Update N, C values */
NetC=NetC*(1-p)
print ("'Using C to allocate'")
}
else {
print("Using N to allocate");
DeltaN=NetN*p;
DeltaC=DeltaN*cnrep;
Nrep=Nrep+DeltaN;
Crep=Crep+DeltaC;
Brep=(Nrep*14+Crep*12)*2/1e6;
NetN=NetN*(1-p);
NetC=NetC-DeltaC;
} } return(c(NetC=NetC,NetN=NetN,NewB=NewB,Crep=Crep,Nrep=Nrep,Brep=Brep))}
When I use my function by say doing:
Reproduction(NetN=1.07149,NetC=0.0922349,cnrep=20)
I get the desired result printed out which includes:
NetC=7.378792e-02
However, when I go to use NetC in the next section of my code, R is still using NetC=0.0922349.
Can I make R update NetC without having to define a new variable?
In R, in general, functions shouldn't change things outside of the function. It's possible to do so using <<- or assign(), but this generally makes your function inflexible and very surprising.
Instead, functions should return values (which yours does nicely), and if you want to keep those values, you explicitly use <- or = to assign them to objects outside of the function. They way your function is built now, you can do that like this:
updates = Reproduction(NetN = 1.07149, NetC = 0.0922349, cnrep = 20)
NetC = updates["NetC"]
This way, you (a) still have all the other results of the function stored in updates, (b) if you wanted to run Reproduction() with a different set of inputs and compare the results, you can do that. (If NetC updated automatically, you could never see two different values), (c) You can potentially change variable names and still use the same function, (d) You can run the function to experiment/see what happens without saving/updating the values.
If you generally want to keep NetN, NetC, and cnrep in sync, I would recommend keeping them together in a named vector or list, and rewriting your function to take that list as input and return that list as output. Something like this:
params = list(NetN = 1.07149, NetC = 0.0922349, cnrep = 20)
Reproduction=function(param_list){
NetN = param_list$NetN
NetC = param_list$NetC
cnrep = param_list$cnrep
if(NetC/NetN <= cnrep) {
DeltaC=NetC*p;
DeltaN=DeltaC/cnrep;
Crep=Crep+DeltaC;
Nrep=Nrep+DeltaN;
Brep=(Nrep*14+Crep*12)*2/1e6;
NetN=NetN-DeltaN; #/* Update N, C values */
NetC=NetC*(1-p)
print ("'Using C to allocate'")
}
else {
print("Using N to allocate");
DeltaN=NetN*p;
DeltaC=DeltaN*cnrep;
Nrep=Nrep+DeltaN;
Crep=Crep+DeltaC;
Brep=(Nrep*14+Crep*12)*2/1e6;
NetN=NetN*(1-p);
NetC=NetC-DeltaC;
}
## Removed extra } and ) ??
return(list(NetC=NetC, NetN=NetN, NewB=NewB, Crep=Crep, Nrep=Nrep, Brep=Brep))
}
This way, you can use the single line params <- Reproduction(params) to update everything in your list. You can access individual items in the list with either params$Netc or params[["NetC"]].

Getting Error "could not find function "assign<-" inside of colnames()

I'm using assign() to assign some new data frames from some other data frame. I then want to name some of the columns in the new data frame. When I use assign() to create the new data frames it works fine. But when I use the assign() inside of colnames() is gives the error 'Error "could not find function "assign<-".'
Here's my snippet of code(abbreviated of course):
for(i in 1:value) {
assign(Name[i], Old.Data.Frame[Old.Data.Frame$1 == Index[i]]) #I'm going to call this line of code 'New Data Frame' for brevity
for(j in 1:ncol(New Data Frame)) {
colnames(New Data Frame)[j] = as.character(Old.Data.Frame[3,j])
I do all this assign() stuff because the names of the Old Data Frame constantly change and I can create any concrete variables in my code, only the dimentions of the frame stay the same.
The only error in this code is that R cannot "find function assign<- in colnames(...". I'm flustered because assign() had just worked in the line before, any help is appreciated, thanks!
You have a list of variable names in Name, which you assign a value (your code block).
for(i in 1:value) { assign(Name[i], Old.Data.Frame[Old.Data.Frame$1 == Index[i]]) }
Could you then try (note I'm separating this code block for debugging purposes):
for(i in 1:value) { colnames(get(Names[i])) <- as.character(Old.Data.Frame[3,] }
get will retrieve the data (data.frame) assigned to the variable name Names[i] (character)

Loop works outside function but in functions it doesn't.

Been going around for hours with this. My 1st question online on R. Trying to creat a function that contains a loop. The function takes a vector that the user submits like in pollutantmean(4:6) and then it loads a bunch of csv files (in the directory mentioned) and binds them. What is strange (to me) is that if I assign the variable id and then run the loop without using a function, it works! When I put it inside a function so that the user can supply the id vector then it does nothing. Can someone help ? thank you!!!
pollutantmean<-function(id=1:332)
{
#read files
allfiles<-data.frame()
id<-str_pad(id,3,pad = "0")
direct<-"/Users/ped/Documents/LearningR/"
for (i in id) {
path<-paste(direct,"/",i,".csv",sep="")
file<-read.csv(path)
allfiles<-rbind(allfiles,file)
}
}
Your function is missing a return value. (#Roland)
pollutantmean<-function(id=1:332) {
#read files
allfiles<-data.frame()
id<-str_pad(id,3,pad = "0")
direct<-"/Users/ped/Documents/LearningR/"
for (i in id) {
path<-paste(direct,"/",i,".csv",sep="")
file<-read.csv(path)
allfiles<-rbind(allfiles,file)
}
return(allfiles)
}
Edit:
Your mistake was that you did not specify in your function what you want to get out from the function. In R, you create objects inside of function (you could imagine it as different environment) and then specify which object you want it to return.
With my comment about accepting my answer, I meant this: (...To mark an answer as accepted, click on the check mark beside the answer to toggle it from greyed out to filled in...).
Consider even an lapply and do.call which would not need return being last line of function:
pollutantmean <- function(id=1:332) {
id <- str_pad(id,3,pad = "0")
direct_files <- paste0("/Users/ped/Documents/LearningR/", id, ".csv")
# READ FILES INTO LIST AND ROW BIND
allfiles <- do.call(rbind, lapply(direct_files, read.csv))
}
ok, I got it. I was expecting the files that are built to be actually created and show up in the environment of R. But for some reason they don't. But R still does all the calculations. Thanks lot for the replies!!!!
pollutantmean<-function(directory,pollutant,id)
{
#read files
allfiles<-data.frame()
id2<-str_pad(id,3,pad = "0")
direct<-paste("/Users/pedroalbuquerque/Documents/Learning R/",directory,sep="")
for (i in id2) {
path<-paste(direct,"/",i,".csv",sep="")
file<-read.csv(path)
allfiles<-rbind(allfiles,file)
}
#averaging polutants
mean(allfiles[,pollutant],na.rm = TRUE)
}
pollutantmean("specdata","nitrate",23:35)

data.table and R's ellipsis (i.e. '...'): pass by reference does not seem to work

Im trying to manipulate a large data table (~37 MB) but in a special way: for other (unrelated) reasons I have implemented a 'hook' like structure meaning that the overall process is like
1) load the data.table from disk
2) fire a certain hook
3) the hook structure looks for this name ans checks whether the user (=me :)) has bound a function to this hook and if so, it is called
4) the data is processed further
The functions look like this:
data = readRDS(pathToFile)
data = data.table(data)
fireHook("After_data_read", data, [some other parameters])
some_more_processing(data)
and the region around fireHook looks like
hooksRegistered = list(
"After_data_read" = function(data, ...) {
# do some stuff
}
)
fireHook = function(hookName, ...) {
for (hookNameRegistered in names(hooksRegistered)) {
if (hookName == hookNameRegistered) {
func = .global.hooksRegistered[[hookName]]
func(hookName, ...)
}
}
}
Observe that one needs to cast an object that already is a data.table into it again (otherwise the pass-by-reference does not work), see Adding new columns to a data.table by-reference within a function not always working and Pass by reference bug?
Problem: this line: func(hookName, ...) takes like forever (> 5 minutes).
The debugger never really gets into the function (so its not the code in the function that takes a long time) and I've tested it with small data.tables and it worked. Also, I noted that the following seems to work:
fireHook = function(hookName, ...) {
args = list(...)
for (hookNameRegistered in names(.global.hooksRegistered)) {
if (hookName == hookNameRegistered) {
func = .global.hooksRegistered[[hookName]]
func(hookName, args)
}
}
}
(notice that I substituted ... by list(...)). To me, it seems as if R is trying to copy the whole table when using .... Is this right/desired? Or am I using it wrong?
regards,
FW

Resources