Using plyr to obtain all strings in dataframe beginning with a string - r

I have a data frame and I'm trying to obtain all strings in there that begin with "RLF" and put them in a list. I tried using the dlply function in plyr, but I couldn't get the syntax quite right.
dlply(.data = unformatted_table,.variables = 1:ncol(unformatted_table), .fun = strsplit("RLF") ,.inform = TRUE)
I looked around a lot and couldn't apply the solutions I found to my problem. Also, please let me know if there's a better way than using dlply.
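One possible approach that avoids plyr entirely is sketched below in base R. The data frame contents here are made up for illustration (only the name unformatted_table comes from the question), and the sketch assumes that "begin with RLF" means the cell value itself starts with those three characters.

```r
# Hypothetical stand-in for the asker's unformatted_table
unformatted_table <- data.frame(
  a = c("RLF123", "abc", "RLF9"),
  b = c("xRLF", "RLF_b", NA),
  stringsAsFactors = FALSE
)

# Flatten every cell into a single character vector (column by column)
all_values <- unlist(lapply(unformatted_table, as.character), use.names = FALSE)

# Keep only the values that start with "RLF"; grepl() returns FALSE for NA
rlf_strings <- all_values[grepl("^RLF", all_values)]

# The question asks for a list, so convert the character vector
rlf_list <- as.list(rlf_strings)
```

dlply is meant for splitting a data frame into groups and applying a function to each group, so a plain pattern match over the flattened values is probably a better fit for this task.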

Related

Convert nested list to dataframe in R

This question sounds like it has already been asked on SO, but I ask it anyway because the existing answers did not work for me and I'm not sure how better to phrase it, as I am new to R and don't entirely grasp the intricacies of its data types.
Time for a minimal example. I am looking for a transformation of target such that targetObject is exactly equal to referenceObject.
reference = '{"airport":[{"name":"brussels","loc":{"lat":"1","lon":"2"}}],"parking":[{"name":"P1"}]}'
target = '{"airport":{"name":"brussels","loc":{"lat":"1","lon":"2"}},"parking":{"name":"P1"}}'
referenceObject = jsonlite::fromJSON(reference)$airport
x = jsonlite::fromJSON(target)$airport
# Transformation
targetObject = do.call(rbind.data.frame, x)
# Currently prints FALSE, should become TRUE
results_same = identical(referenceObject, targetObject)
print(results_same)
I would expect this to be very simple in any language, but R seems to handle the nested loc lists very differently depending on the shape of the outer object airport.
I managed to find a solution by serializing back to JSON. It's not elegant but at least it works.
# Transformation
targetObject = jsonlite::fromJSON(jsonlite::toJSON(list(x), auto_unbox = TRUE))
For now I will not mark this answer as correct because it's more of a workaround than an idiomatic solution.

Parallelize user-defined function using apply family in R

I have a script that takes too long to compute and I'm trying to parallelize its execution.
The script basically loops through each row of a data frame and performs some calculations, as shown below:
my.df = data.frame(id=1:9,value=11:19)
sumPrevious <- function(df,df.id){
sum(df[df$id<=df.id,"value"])
}
for(i in 1:nrow(my.df)){
print(sumPrevious(my.df,my.df[i,"id"]))
}
I'm just starting to learn how to parallelize code in R, which is why I first want to understand how I could do this with an apply-like function (e.g. sapply, lapply, mapply).
I've tried multiple things but nothing worked so far:
mapply(sumPrevious,my.df,my.df$id) # Error in df$id : $ operator is invalid for atomic vectors
Using the parallel package in R, you can use the mclapply() function. You will need to adjust your code a little to make it run in parallel.
library(parallel)
my.df = data.frame(id=1:9,value=11:19)
no.of.cores = 2 # number of worker processes; mclapply forks, so on Windows this must be 1
sumPrevious <- function(i,df){
df.id = df$id[i]
sum(df[df$id<=df.id,"value"])
}
mclapply(X = 1:nrow(my.df),FUN = sumPrevious,my.df,mc.preschedule = TRUE,mc.cores = no.of.cores)
This code runs sumPrevious in parallel across no.of.cores cores of your machine and returns the results as a list.
Well, this is fun to play with. You need something like the following:
mapply(sumPrevious,list(my.df),my.df$id)
For sapply, since the first argument of sumPrevious is the data frame, you have to wrap it in an anonymous function that swaps the argument order:
sapply(my.df$id,function(x,y) sumPrevious(y,x),my.df)
I prefer mapply here since the data frame can be passed directly as the first argument. It has to be passed as a whole rather than column by column, though, which is why you have to wrap it in list().
Map is a wrapper around mapply that returns the result as a list; try it. Similarly, lapply always returns a list, while sapply simplifies the result to a vector or array where possible.
That said, it seems that what you are trying to do can simply be done with cumsum, provided the ids are sorted in increasing order as in your example:
cumsum(my.df$value)
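The approaches discussed in this thread can be compared side by side. The following is a minimal sketch using the question's my.df and sumPrevious; the MoreArgs form of mapply is an alternative to wrapping the data frame in list(), and the variable names via_sapply, via_mapply and via_cumsum are introduced here only for illustration.

```r
my.df <- data.frame(id = 1:9, value = 11:19)

sumPrevious <- function(df, df.id) {
  sum(df[df$id <= df.id, "value"])
}

# sapply: iterate over the ids, closing over the whole data frame
via_sapply <- sapply(my.df$id, function(id) sumPrevious(my.df, id))

# mapply: vectorise over df.id only; MoreArgs passes the data frame unchanged
via_mapply <- mapply(sumPrevious, df.id = my.df$id,
                     MoreArgs = list(df = my.df))

# cumsum: equivalent here because the ids are unique and sorted ascending
via_cumsum <- cumsum(my.df$value)
```

The original mapply call failed because a data frame is itself a list of columns, so mapply handed sumPrevious one column (an atomic vector) at a time instead of the whole data frame.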

Looping in R to create transformed variables

I have a dataset of 80 variables, and I want to loop through a subset of 50 of them and construct returns. I have a list of the names of the variables for which I want to construct returns, and am attempting to use the dplyr command mutate to construct the variables in a loop. Specifically, my code is:
for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") = (i - lag(i,1))/lag(i,1))}
where returnvars is my list, and alldta is my dataset. When I run this code outside the loop with just one of the `i' values, it works fine. The code for that looks like this:
alldta <- mutate(alldta,rVar = (Var- lag(Var,1))/lag(Var,1))
However, when I run it in the loop (e.g., attempting to do the previous line of code 50 times for 50 different variables), I get the following error:
Error: unexpected '=' in:
"for (i in returnvars) {
alldta <- mutate(alldta,paste("r",i,sep="") ="
I am unsure why this issue is coming up. I have looked into a number of ways to try and do this, and have attempted solutions that use lapply as well, without success.
Any help would be much appreciated! If there is an easy way to do this with one of the apply commands as well, that would be great. I did not provide a dataset because my question is not data specific, I'm simply trying to understand, as a relative R beginner, how to construct many transformed variables at once and add them to my data frame.
EDIT: As per Frank's comment, I updated the code to the following:
for (i in returnvars) {
varname <- paste("r",i,sep="")
alldta <- mutate(alldta,varname = (i - lag(i,1))/lag(i,1))}
This fixes the previous error, but I am still not referencing the variable correctly, so I get the error
Error in "Var" - lag("Var", 1) :
non-numeric argument to binary operator
Which I assume is because R sees my variable name Var as a string, rather than as a variable. How would I correctly reference the variable in my dataset alldta? I tried get(i) and alldta$get(i), both without success.
I'm also still open to (and actively curious about), more R-style ways to do this entire process, as opposed to using a loop.
Using mutate inside a loop might not be a good idea either. I am not sure whether mutate makes a copy of the data frame, but it's generally not good practice to grow a data frame inside a loop. Instead, create a separate data frame with the output and then name the columns based on your logic.
result = as.data.frame(do.call(cbind,lapply(returnvars,function(i) {...})))
names(result) = paste("r",returnvars,sep="")
After playing around with this more, I discovered (thanks to Frank's suggestion), that the following works:
extended <- alldta # Make a copy of my dataset
for (i in returnvars) {
varname <- paste("r",i,sep="")
extended[[varname]] = (extended[[i]] - lag(extended[[i]],1))/lag(extended[[i]],1)}
This is still not very R-styled in that I am using a loop, but for a task that is only repeating about 50 times, this shouldn't be a large issue.
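A loop-free variant of the same idea might look like the sketch below. It uses only base R so that it does not depend on which lag function is in scope (stats::lag does not shift a plain vector the way dplyr::lag does). The data, the column names Var1 and Var2, and the helper simple_return are hypothetical stand-ins, not the asker's actual dataset.

```r
# Hypothetical stand-in for alldta and returnvars
alldta <- data.frame(
  Var1  = c(100, 110, 121),
  Var2  = c(50, 40, 60),
  other = c("a", "b", "c")
)
returnvars <- c("Var1", "Var2")

# Simple return: (x_t - x_{t-1}) / x_{t-1}, with NA for the first observation
simple_return <- function(x) {
  prev <- c(NA, head(x, -1))
  (x - prev) / prev
}

# Apply the helper to each selected column and prefix the names with "r"
returns <- lapply(alldta[returnvars], simple_return)
names(returns) <- paste0("r", returnvars)

# Bind the new columns onto a copy of the original data
extended <- cbind(alldta, as.data.frame(returns))
```

Because the columns are selected by name with alldta[returnvars], the string-versus-variable problem from the mutate attempts does not arise.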

R: using ddply to apply functions to subsets of data

I'm trying to use the ddply method to take a dataframe with various info about 3000 movies and then calculate the mean gross of each genre. I'm new to R, and I've read all the questions on here relating to ddply, but I still can't seem to get it right. Here's what I have now:
> attach(movies)
> ddply(movies, Genre, mean(Gross))
Error in llply(.data = .data, .fun = .fun, ..., .progress = .progress, :
.fun is not a function.
How am I supposed to write a function that takes the mean of the values in the "Gross" column for each set of movies, grouped by genre? I know this seems like a simple question, but the documentation is really confusing to me, and I'm not too familiar with R syntax yet.
Is there a method other than ddply that would make this easier?
Thanks!!
Here is an example using the tips dataset, which ships with the reshape2 package (older versions of ggplot2 made it available as well):
library(plyr)
library(reshape2)
mean_tip_by_day = ddply(tips, .(day), summarize, mean_tip = mean(tip/total_bill))
Hope this is useful
You probably don't need plyr for a simple operation like that. tapply() does the job easily and you won't need to load additional packages. The syntax also seems simpler than Ramnath's:
tapply(tips$tip, tips$day, mean)
Note that plyr is a fantastic tool for many tasks. To me, it just seems like overkill here.
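Applied to the question itself, the base R route could look roughly like this. The movies data frame below is invented for illustration; only the column names Genre and Gross come from the question.

```r
# Hypothetical stand-in for the asker's movies data frame
movies <- data.frame(
  Genre = c("Action", "Comedy", "Action", "Drama", "Comedy"),
  Gross = c(100, 40, 300, 20, 60)
)

# Named vector of mean gross per genre
mean_by_genre <- tapply(movies$Gross, movies$Genre, mean)

# Same result as a data frame, using the formula interface of aggregate()
mean_by_genre_df <- aggregate(Gross ~ Genre, data = movies, FUN = mean)

# With plyr loaded, the ddply form from the answer above would be:
# ddply(movies, .(Genre), summarize, mean_gross = mean(Gross))
```

The original call failed because mean(Gross) is evaluated immediately and yields a number, whereas ddply expects a function as its third argument. Neither version here needs attach(movies).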

Multiple Recodes in R

I am looking to recode a large number of variables, and figure I can probably use some sort of loop to do so. What throws me is how to programmatically name each variable (I just want to keep the var name and append ".rc").
Here is an example. Let's say I have a set of variables, var.1 to var.5. I am looking to create a new variable in my dataframe that is var.1.rc <- var.1 / sum(var.1 to var.5). I'll do the same for the next variable, and so on.
I am new to R but this would be a HUGE step forward for me.
Is it possible? What are the best ways to do it? Any help will be much appreciated!
Regards,
Brock
If I understand you correctly, there is actually a pretty easy way to do this. Assuming your original data frame is called dat, you can do this:
dat.rc <- dat/rowSums(dat)
names(dat.rc) <- paste(names(dat), ".rc", sep="")
dat <- data.frame(dat,dat.rc)
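To see what these three lines produce, here is the same code run on a small made-up data frame (three columns instead of five, values chosen so the shares are easy to check by hand). It assumes every column of dat is numeric and takes part in the row total.

```r
# Hypothetical data: each row sums to a round number
dat <- data.frame(var.1 = c(1, 2), var.2 = c(3, 2), var.3 = c(6, 4))

# Divide every column by the row totals (the vector is recycled down each column)
dat.rc <- dat / rowSums(dat)

# Keep the original names and append ".rc"
names(dat.rc) <- paste0(names(dat), ".rc")

# Original and recoded columns side by side
dat <- data.frame(dat, dat.rc)
```

Each row of the new .rc columns sums to 1, since every value is its share of the row total.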
You could try the following loop.
Here, eval(parse(text="")) allows you to evaluate a pasted-together string containing the static and dynamic portions of the expression that creates each new variable.
for (i in 1:5) {
X<-paste("var.",i,".rc<-var.",i,"/(var.1+var.2+var.3+var.4+var.5)",sep="")
eval(parse(text=X))
}
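If a loop is preferred, the same result can usually be had without eval(parse()) by indexing columns by name with [[ ]]. This is a sketch under the assumption that the variables live in a data frame called dat (a name borrowed from the other answer) and uses made-up values with three columns instead of five.

```r
# Hypothetical data frame holding the variables to recode
dat <- data.frame(var.1 = c(1, 2), var.2 = c(3, 2), var.3 = c(6, 4))

vars  <- paste0("var.", 1:3)   # names of the columns to recode
total <- rowSums(dat[vars])    # row totals, computed once before the loop

for (v in vars) {
  # Build the new column name programmatically and assign by name
  dat[[paste0(v, ".rc")]] <- dat[[v]] / total
}
```

Computing total before the loop matters: it keeps the newly added .rc columns out of the denominator.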
