I am writing a function that subsets a dataframe and then feeds the subset to a second action. The output of the function should be the result of that second action. However, since I still need the cleaned dataframe for another purpose, I was wondering if I can store it in the environment so that it can be used later?
For instance,
Let's say I have this dataframe.
ID Var1
1 5 3
2 6 1
And my function is like this:
mu_fuc <- function(df, condition) {
#clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3) # I am trying to store this working dataframe for later use.
#second action
result = sum(workingdf[condition])
#output of the function
return(result)
}
Since the result of the function would be used later as well, I can't add workingdf to return. Otherwise, the output of the function would contain workingdf when I try to feed the output to another function, which is something I don't want.
For example, if I want to do the following, I need the output of the function to be a number only:
mu_fuc(data, Var1) - 5
I hope I am making myself clear.
Any help is greatly appreciated!!
You can return a list from the function with the result that you want.
mu_fuc <- function(df, condition) {
#clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3)
#second action
result = sum(workingdf)
#output of the function
return(list(result = result, workingdf = workingdf))
}
Call it as :
output <- mu_fuc(df, Var1)
You can separate out the result using $ operator and process them separately.
output$result
output$workingdf
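A minimal sketch of the list approach on the sample data, to show that the numeric part can still be used in arithmetic. The helper name subset_and_sum, its column argument (passed as a string instead of the match.call() trick) and the threshold default are assumptions made for illustration, not part of the original function.

```r
# Hypothetical helper: returns the scalar result and the cleaned data frame together.
subset_and_sum <- function(df, column, threshold = 3) {
  keep <- df[[column]] < threshold          # rows that pass the cleaning step
  workingdf <- df[keep, , drop = FALSE]     # cleaned data frame
  list(result = sum(workingdf[[column]]),   # second action
       workingdf = workingdf)
}

d <- data.frame(ID = 5:6, Var1 = c(3, 1))
output <- subset_and_sum(d, "Var1")

adjusted <- output$result - 5   # plain numeric arithmetic on the result
cleaned <- output$workingdf     # cleaned data frame, kept for later use
```

Only output$result takes part in the subtraction, so the data frame never leaks into downstream calculations.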
You may store workingdf in an attribute.
mu_fuc <- function(df, condition) {
## clean dataset
condition <- eval(as.list(match.call())$condition, df)
workingdf <- subset(df, condition < 3)
## second action
result <- sum(condition)
attr(result, "workingdf") <- workingdf
return(result)
}
Calculation with the result as usual.
r <- mu_fuc(d, Var1)
r - 5
# [1] -1
# attr(,"workingdf")
# ID Var1
# 2 6 1
To keep the attribute from being displayed (a purely cosmetic issue), use as.numeric:
as.numeric(r) - 5
# [1] -1
or
r2 <- as.numeric(mu_fuc(d, Var1))
r2 - 5
# [1] -1
To get workingdf, fetch it from the attribute.
wdf <- attr(mu_fuc(d, Var1), "workingdf")
wdf
# ID Var1
# 2 6 1
Data:
d <- data.frame(ID=5:6, Var1=c(3, 1))
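Since the question literally asks about storing the data frame in an environment, here is a hedged sketch of that option as well. The environment store and the helper sum_below are hypothetical names, and the column is passed as a string to keep the example short.

```r
# Dedicated environment that holds side results (hypothetical name).
store <- new.env()

# Hypothetical helper: returns only the number, and stashes the cleaned
# data frame in `env` as a side effect.
sum_below <- function(df, column, threshold = 3, env = store) {
  workingdf <- df[df[[column]] < threshold, , drop = FALSE]
  assign("workingdf", workingdf, envir = env)
  sum(workingdf[[column]])
}

d <- data.frame(ID = 5:6, Var1 = c(3, 1))
value <- sum_below(d, "Var1") - 5            # the function output is a bare number
cleaned <- get("workingdf", envir = store)   # fetch the cleaned data frame later
```

Using a dedicated environment rather than the global one keeps the side effect contained, though returning a list or using an attribute is usually easier to reason about.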
Hi, I'd like to group by two dataframe columns and apply a function to another two dataframe columns.
For e.g.,
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,2,4,6,9,5)
vol <- c(3,5,1,6,2,3)
dat <- data.frame(ticker,date,ret,vol)
For each ticker and each date, I'd like to calculate its PIN.
Now, to avoid further confusion, perhaps it helps to just speak out the actual function. YZ is a function in the InfoTrad package, and YZ only accepts a dataframe with two columns. It uses some optimisation tool and returns an estimated PIN.
install.packages("InfoTrad")
library(InfoTrad)
get_pin_yz <- function(data) {
return(YZ(data[ ,c('volume_krw_buy', 'volume_krw_sell')])[['PIN']])
}
I know how to do this in R using a for loop. But a for loop is computationally very costly, and it might take weeks to finish running on my large dataset. Thus, I would like to ask how to do this using groupby.
# output format is wide wrt long format as "dat"
dat_w <- data.frame(ticker = NA, date = NA, PIN = NA)
for (j in c("A", "B")){
for (k in c(1:2)){
subset <- dat %>% subset((ticker == j & date == k), select = c('ret', "vol"))
new_row <- data.frame(ticker = j, date = k, PIN = YZ(subset)$PIN)
dat_w <- rbind(dat_w, new_row)
}
}
dat_w <- dat_w[-1, ]
dat_w
Don't know if this can help you help me -- I know how to do this in python: I just write a function and run df.groupby(['ticker','date']).apply(function).
Finally, the wanted dataframe is:
ticker <- c('A','A','B','B')
date <- c(1,2,1,2)
PIN <- c(1.05e-17,2.81e-09,1.12e-08,5.39e-09)
data.frame(ticker,date,PIN)
Could somebody help out, please?
Thank you!
Best,
Darcy
Previous stuff (Feel free to ignore)
Previously, I wrote this:
My function is:
get_rv <- function(data) {
return(data[['vol']] + data[['ret']])
}
What I want is:
ticker_wanted <- c('A','A', 'B', 'B')
date_wanted <- c(1,2,1,2)
rv_wanted <- c(7,5,10,11)
df_wanted <-data.frame(ticker_wanted,date_wanted,rv_wanted)
But this is not literally what my actual function is. The vol+ret is just an example. I'm more interested in the more general case: how to groupby and apply a general function to two or more dataframes. I use the vol + ret just because I didn't want to bother others by asking them to install some potentially irrelevant package on their PC.
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run your own function like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that this YZ function has some more-than-linear runtime, so passing larger amounts of data will still take some time. You can try to start with a small set of data and then repeat it by increasing the size of your data with a factor of maybe 10 and then check how fast it runs.
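The scaling check suggested above could be sketched roughly as follows. slow_stand_in is a made-up placeholder for the expensive per-group function (it is not YZ), and the sizes are arbitrary.

```r
# Hypothetical stand-in for an expensive function; replace with the real call.
slow_stand_in <- function(x) sum(sort(x))

sizes <- c(1e3, 1e4, 1e5)   # grow the input by a factor of 10 each time

# Time one call per input size and keep the elapsed seconds.
timings <- sapply(sizes, function(n) {
  x <- runif(n)
  system.time(slow_stand_in(x))[["elapsed"]]
})

scaling <- data.frame(rows = sizes, seconds = timings)
scaling
```

If the seconds column grows by much more than a factor of 10 per step, the runtime is worse than linear and the full dataset will be slow regardless of how the grouping is written.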
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.
you could just do this? (with summarise as an example of your function):
ticker <- c("A", "A", 'A', "B", "B", "B")
date <- c(1,1,2,1,2,1)
ret <- c(1,-2,4,6,9,-5)
vol <- c(3,5,1,6,2,3)
df <- data.frame(ticker,date,ret,vol)
library(dplyr)
# define the function first, then call it
get_rv <- function(data){
result <- data %>%
group_by(ticker,date) %>%
summarise(rv =sum(ret) + sum(vol)) %>%
as.data.frame()
names(result) <- c('ticker_wanted', 'date_wanted', 'rv_wanted')
return(result)
}
df_wanted <- get_rv(df)
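For comparison, the same grouped sum can be sketched in base R with aggregate(), which avoids the package dependency. The data are the ones defined in this answer, and the column name rv follows the answer's naming.

```r
ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, -2, 4, 6, 9, -5)
vol <- c(3, 5, 1, 6, 2, 3)
df <- data.frame(ticker, date, ret, vol)

# Sum ret and vol within each ticker/date combination.
totals <- aggregate(cbind(ret, vol) ~ ticker + date, data = df, FUN = sum)

# Combine the two grouped sums into one value per group.
totals$rv <- totals$ret + totals$vol
totals
```

This only works for functions that can be applied column by column; a function that needs the whole sub-data-frame at once calls for split() or group-wise dplyr instead.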
Assuming that your dataframe is as follows:
data <- data.frame(ticker,date,ret,vol)
Use split to split your dataframe into a list of dataframes based on the values of ticker and date.
dflist = split(data, f = list(data$ticker, data$date), drop = TRUE)
Now use lapply or sapply to run the function YZ() on each dataframe member of dflist.
# YZ() only accepts a two-column data frame, so pass just ret and vol
pins <- lapply(dflist, function(x) YZ(x[, c("ret", "vol")])$PIN)
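To get from the list back to one row per ticker and date, something like the following sketch may help. per_group is a hypothetical stand-in that simply sums ret and vol; in the real case its body would call YZ() on the two relevant columns.

```r
ticker <- c("A", "A", "A", "B", "B", "B")
date <- c(1, 1, 2, 1, 2, 1)
ret <- c(1, 2, 4, 6, 9, 5)
vol <- c(3, 5, 1, 6, 2, 3)
dat <- data.frame(ticker, date, ret, vol)

# Hypothetical per-group function: returns a one-row data frame that
# carries the group keys along with the computed value.
per_group <- function(x) {
  data.frame(ticker = x$ticker[1],
             date = x$date[1],
             value = sum(x$ret) + sum(x$vol))
}

groups <- split(dat, list(dat$ticker, dat$date), drop = TRUE)
result <- do.call(rbind, lapply(groups, per_group))
rownames(result) <- NULL   # drop the "A.1"-style row names
result
```

Returning the keys from the per-group function means the group labels do not have to be parsed back out of the list names.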
I'm trying to extract the name of the i-th column used in a loop:
for (i in df){
print(name(i))
}
Python code solution example:
for i in df:
print(i)
PS: R gives me the column values if I use the same code as in Python (but Python gives just the name).
EDIT: It has to be in a loop. As I will do more elaborate things with this.
for (i in names(df)){
print(i)
}
Just do
names(df)
to print all the column names in df. There's no need for a loop, unless you want to do something more elaborate with each column.
If you want the i'th column name:
names(df)[i]
Instead of looping, you can use the imap function from the purrr package. When writing the code, .x is the object and .y is the name.
df <- data.frame(a = 1:10, b = 21:30, c = 31:40)
library(purrr)
imap(df, ~paste0("The name is ", .y, " and the sum is ", sum(.x)))
# $a
# [1] "The name is a and the sum is 55"
#
# $b
# [1] "The name is b and the sum is 255"
#
# $c
# [1] "The name is c and the sum is 355"
This is just a more convenient way of writing the following Base R code, which gives the same output:
Map(function(x, y) paste0("The name is ", y, " and the sum is ", sum(x))
, df, names(df))
You can try the following code:
# Simulating your data
a <- c(1,2,3)
b <- c(4,5,6)
df <- data.frame(a, b)
# Answer 1
for (i in 1:ncol(df)){
print(names(df)[i]) # accessing the name of the column
print(df[,i]) # accessing the column content
print('----')
}
Or this alternative:
# Answer 2
columns <- names(df)
for(i in columns) {
print(i) # accessing the name of the column
print(df[, i]) # accessing the column content
print('----')
}
Hope it helps!
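A small variation on the index loop, sketched with seq_along() instead of 1:ncol(df) so that the loop body is skipped for a data frame with no columns (1:0 would iterate over 1 and 0). The col_sums vector is just an illustrative way of doing "something more elaborate" with each column.

```r
df <- data.frame(a = 1:3, b = 4:6)

col_sums <- numeric(0)   # named vector filled inside the loop

for (i in seq_along(df)) {
  col_name <- names(df)[i]              # name of the i-th column
  col_sums[col_name] <- sum(df[[i]])    # content of the i-th column
}

col_sums
```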
Setting the scene:
So I have a directory with 50 .csv files in it.
All files have unique names e.g. 1.csv 2.csv ...
The contents of each may vary in the number of rows but always have 4 columns
The column headers are:
Date
Result 1
Result 2
ID
I want them all to be merged together into one dataframe (mydf), and then I'd like to ignore any rows where there is an NA value, so that I can count how many complete instances of an "ID" there were, by calling, for example:
myfunc("my_files", 1)
myfunc("my_files", c(2,4,6))
My code so far:
myfunc <- function(directory, id = 1:50) {
files_list <- list.files(directory, full.names=T)
mydf <- data.frame()
for (i in 1:50) {
mydf <- rbind(mydf, read.csv(files_list[i]))
}
mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
mydf_subna <- na.omit(mydf_subset)
table(mydf_subna$ID)
}
My issues and where I need help:
My results come out this way
2 4 6
200 400 600
and I'd like to transpose them to be like this. I'm not sure if calling a table is right or should I call it as.matrix perhaps?
2 100
4 400
8 600
I'd also like to have either the headers from the original files or assign new ones
ID Count
2 100
4 400
8 600
Any and all advice is welcome
Matt
Additional update
I tried amending my code to incorporate some of the helpful comments below, so I also have a version that looks like this:
myfunc <- function(directory, id = 1:50) {
files_list <- list.files(directory, full.names=T)
mydf <- data.frame()
for (i in 1:50) {
mydf <- rbind(mydf, read.csv(files_list[i]))
}
mydf_subset <- mydf[which(mydf[, "ID"] %in% id),]
mydf_subna <- na.omit(mydf_subset)
result <- data.frame(mydf_subna$ID)
transposed_result <- t(result)
colnames(transposed_result) <- c("ID","Count")
}
which I try to call with this:
myfunc("myfiles", 1)
myfunc("myfiles", c(2, 4, 6))
but I get this error
> myfunc("myfiles", c(2, 4, 6))
Error in `colnames<-`(`*tmp*`, value = c("ID", "Count")) :
length of 'dimnames' [2] not equal to array extent
I wonder if perhaps I'm not creating this data.frame correctly and should be using a cbind or not summing the rows by ID maybe?
You need to change your function to create a data frame rather than a table and then transpose that data frame. Change the line
table(mydf_subna$ID)
to be instead
result <- data.frame(mydf_subna$ID)
then use the t() function which transposes your data frame
transposed_result <- t(result)
colnames(transposed_result) <- c("ID","Count")
Welcome to Stack Overflow.
I am assuming that the function that you have written returns the table which is saved in variable ans.
You may give a try to this code:
ans <- myfunc("my_files", c(2,4,6))
ans2 <- data.frame(ans)
colnames(ans2) <- c('ID' ,'Count')
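Putting the pieces together, a self-contained sketch of the whole task might look like this. count_complete is a hypothetical rewrite, not the asker's function, and the two CSV files are made-up data written to a temporary directory purely so the example can run.

```r
# Hypothetical helper: read every CSV in `directory`, keep the requested IDs,
# drop incomplete rows, and return an ID/Count data frame.
count_complete <- function(directory, id) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  all_rows <- do.call(rbind, lapply(files, read.csv))
  kept <- na.omit(all_rows[all_rows$ID %in% id, ])
  counts <- as.data.frame(table(kept$ID), stringsAsFactors = FALSE)
  colnames(counts) <- c("ID", "Count")
  counts
}

# Two small made-up files to exercise the helper.
demo_dir <- file.path(tempdir(), "my_files_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = "2020-01-01", Result.1 = c(1, NA),
                     Result.2 = c(3, 4), ID = 1),
          file.path(demo_dir, "1.csv"), row.names = FALSE)
write.csv(data.frame(Date = "2020-01-02", Result.1 = c(5, 6),
                     Result.2 = c(7, 8), ID = 2),
          file.path(demo_dir, "2.csv"), row.names = FALSE)

counts <- count_complete(demo_dir, c(1, 2))
counts
```

Converting the table with as.data.frame() already gives one row per ID, so no transpose is needed, and list.files() removes the hard-coded 1:50.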
I have a single column data frame - example data:
1 >PROKKA_00002 Alpha-ketoglutarate permease
2 MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT
3 QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG
4 >PROKKA_00003 lipoprotein
5 MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG
Each sequence of letters is associated with the ">" line above it. I need a two-column data frame with lines starting in ">" in the first column, and the respective lines of letters concatenated as one sequence in the second column. This is what I've tried so far:
y <- matrix(0,5836,2) #empty matrix with 5836 rows and two columns
z <- 0
for(i in 1:nrow(df)){
if((grepl(pattern = "^>", x = df)) == TRUE){ #tried to set the conditional "if a line starts with ">", execute code"
z <- z + 1
y[z,1] <- paste(df[i])
} else{
y[z,2] <- paste(df[i], collapse = "")
}
}
I would eventually convert the matrix y back to a data.frame using as.data.frame, but my loop keeps getting Error: unexpected '}' in "}". I'm also not sure if my conditional is right. Can anyone help? It would be greatly appreciated!
Although I would stick with packages for this, here is a solution.
initialize data
mydf <- data.frame(x=c(">PROKKA_00002 Alpha-ketoglutarate","MTESSITERGAPEL", "MTESSITERGAPEL",">PROKKA_00003 lipoprotein", "MTESSITERGAPEL" ,"MRTIIVIASLLLT"), stringsAsFactors = F)
process
ind <- grep(">", mydf$x)
temp<-data.frame(ind=ind, from=ind+1, to=c((ind-1)[-1], nrow(mydf)))
seqs<-rep(NA, length(ind))
for(i in 1:length(ind)) {
seqs[i]<-paste(mydf$x[temp$from[i]:temp$to[i]], collapse="")
}
fastatable<-data.frame(name=gsub(">", "", mydf[ind,1]), sequence=seqs)
> fastatable
name sequence
1 PROKKA_00002 Alpha-ketoglutarate MTESSITERGAPELMTESSITERGAPEL
2 PROKKA_00003 lipoprotein MTESSITERGAPELMRTIIVIASLLLT
Try creating a logical index of the rows that contain the target symbol (the header rows), then split the data on that index. cumsum(ind1) coerces the logical vector to numeric and gives every row a running group id; indexing with !ind1 then drops the header rows themselves.
ind1 <- grepl(">", mydf$x)
#split data on the index created
newdf <- data.frame(mydf$x[ind1][cumsum(ind1)], mydf$x)[!ind1,]
#Add names
names(newdf) <- c("Name", "Value")
newdf
# Name Value
# 2 >PROKKA_00002 Alpha-ketoglutarate
# 3 >PROKKA_00002 MTESSITERGAPEL
# 5 >PROKKA_00003 lipoprotein
# 6 >PROKKA_00003 MRTIIVIASLLLT
Data
mydf <- data.frame(x=c(">PROKKA_00002","Alpha-ketoglutarate","MTESSITERGAPEL", ">PROKKA_00003", "lipoprotein" ,"MRTIIVIASLLLT"))
You can use plyr to accomplish this if you are able to assign a section number to your rows appropriately:
library(plyr)
df <- data.frame(v1=c(">PROKKA_00002 Alpha-ketoglutarate permease",
"MTESSITERGAPELADTRRRIWAIVGASSGNLVEWFDFYVYSFCSLYFAHIFFPSGNTTT",
"QLLQTAGVFAAGFLMRPIGGWLFGRIADRRGRKTSMLISVCMMCFGSLVIACLPGYAVIG",
">PROKKA_00003 lipoprotein",
"MRTIIVIASLLLTGCSHMANDAWSGQDKAQHFLASAMLSAAGNEYAQHQGYSRDRSAAIG"))
df$hasMark <- ifelse(grepl(">",df$v1,fixed=TRUE),1, 0)
df$section <- cumsum(df$hasMark)
t <- ddply(df, "section", function(x){
data.frame(v2=head(x,1),v3=paste(x$v1[2:nrow(x)], collapse=''))
})
t <- subset(t, select=-c(section,v2.hasMark,v2.section)) #drop the extra columns
If you then view 't', I believe this is what you were looking for in your original post.
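A compact base R sketch of the same idea, using cumsum() for the group id and tapply() to concatenate the sequence lines. The shortened sequences are made up, and the sketch assumes every header is followed by at least one sequence line.

```r
fasta_lines <- c(">PROKKA_00002 Alpha-ketoglutarate permease",
                 "MTESSITERGAPEL",
                 "QLLQTAGVFAAGFL",
                 ">PROKKA_00003 lipoprotein",
                 "MRTIIVIASLLLT")

is_header <- startsWith(fasta_lines, ">")   # TRUE for the ">" lines
group <- cumsum(is_header)                  # running record number per line

# Concatenate the sequence lines that belong to each record.
sequences <- tapply(fasta_lines[!is_header], group[!is_header],
                    paste, collapse = "")

fasta <- data.frame(name = fasta_lines[is_header],
                    sequence = as.vector(sequences),
                    stringsAsFactors = FALSE)
fasta
```

For real FASTA files a dedicated reader from a bioinformatics package is usually the safer choice, since it handles empty records and other edge cases.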
I have a for loop that generates a dataframe every time it loops through. I am trying to create a list of data frames but I cannot seem to figure out a good way to do this.
For example, with vectors I usually do something like this:
my_numbers <- c()
for (i in 1:4){
my_numbers <- c(my_numbers,i)
}
This will result in a vector c(1,2,3,4). I want to do something similar with dataframes, but accessing the list of data frames is quite difficult when I use:
my_dataframes <- list(my_dataframes,DATAFRAME).
Help please. The main goal is just to create a list of dataframes that I can later on access dataframe by dataframe. Thank you.
I'm sure you've noticed that list does not do what you want it to do, nor should it. c also doesn't work in this case because it flattens data frames, even when recursive=FALSE.
You can use append. As in,
data_frame_list = list()
for( i in 1:5 ){
d = create_data_frame(i)
data_frame_list = append(data_frame_list, list(d)) # wrap d in list() so it is added as one element
}
Better still, you can assign directly to indexed elements, even if those elements don't exist yet:
data_frame_list = list()
for( i in 1:5 ){
data_frame_list[[i]] = create_data_frame(i)
}
This applies to vectors, too. But if you want to create a vector c(1,2,3,4) just use 1:4, or its underlying function seq.
Of course, lapply or the *lply functions from plyr are often better than looping depending on your application.
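As a rough illustration of the lapply route, assuming a hypothetical generator make_df standing in for whatever builds each data frame:

```r
# Hypothetical generator: builds one data frame per iteration.
make_df <- function(i) data.frame(id = i, x = seq_len(i))

# lapply returns the list of data frames directly, no manual appending.
data_frame_list <- lapply(1:4, make_df)

second <- data_frame_list[[2]]                # access one data frame by position
combined <- do.call(rbind, data_frame_list)   # or stack them all into one
```

The list comes back already sized and ordered, which avoids growing an object inside a loop.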
Continuing with your for loop method, here's a little example of creating and accessing.
> my_numbers <- vector('list', 4)
> for (i in 1:4) my_numbers[[i]] <- data.frame(x = seq(i))
And we can access the first column of each data frame with,
> sapply(my_numbers, "[", 1)
# $x
# [1] 1
#
# $x
# [1] 1 2
#
# $x
# [1] 1 2 3
#
# $x
# [1] 1 2 3 4
Other ways of accessing the data are my_numbers[[1]] for the first data set,
lapply(my_numbers, "[", 1,) to access the first row of each data frame, etc.
You can use operator [[ ]] for this purpose.
l <- list()
df1 <- data.frame(name = 'df1', a = 1:5 , b = letters[1:5])
df2 <- data.frame(name = 'df2', a = 6:10 , b = letters[6:10])
df3 <- data.frame(name = 'df3', a = 11:20 , b = letters[11:20])
df <- rbind(df1,df2,df3)
for(df_name in unique(df$name)){
l[[df_name]] <- df[df$name == df_name,]
}
In this example, the three separate data frames are first combined into one with rbind so that the for loop can split them back out by name.
With the [[ operator we can give each data frame whatever name we want as it is stored in the list.
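When the data frames are distinguished by a column like name, the loop above can also be sketched as a single split() call, which returns a named list directly. The data are the same as in the example.

```r
df1 <- data.frame(name = "df1", a = 1:5, b = letters[1:5])
df2 <- data.frame(name = "df2", a = 6:10, b = letters[6:10])
df3 <- data.frame(name = "df3", a = 11:20, b = letters[11:20])
df <- rbind(df1, df2, df3)

# One list element per distinct value of df$name, named after that value.
l <- split(df, df$name)

names(l)
l[["df2"]]
```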