How to make a data frame with a function in R?

For some basic publications I have to write almost the same code for many tables, so I want a quick way to build data frames from files and run the same set of operations on them using a single function.
Example:
# Creating the function
basic_sum <- function(place, DF, factor_col, sum) {
  # Uploading the data frame
  DF <- read.csv(place, sep = ";")
  # Converting the chosen columns to factor
  for (i in factor_col) {
    DF[, i] <- as.factor(DF[, i])
  }
  # Summary
  sum <- summary(DF)
  View(sum)
}
Then I run that code and get a function basic_sum.
When I want to work with my data, I call this function with arguments:
basic_sum(place = "~/DataFrame.csv", DF = DataFrame,
          factor_col = c(1, 6:11), sum = DF_sum)
After running it, nothing happens: there is nothing new in the Environment, no new data, no new variables, nothing.
My expectation is that, in the end, I should get:
1) a data.frame "DataFrame", loaded from DataFrame.csv;
2) the 1st, 6th, 7th and all other columns through the 11th converted to factors;
3) a data.frame "DF_sum" with the summary of all columns of "DataFrame";
4) a view of the data.frame "DF_sum".
Well, I see all of it in the console, but I need it in the Environment and I need to save it somewhere.
It seems I'm doing something wrong, but I don't know what.
P.S.: If I run the code outside the function (replacing DF with DataFrame, factor_col with c(1, 6:11), and so on), everything works. But then I have to rewrite the code every time, or at least replace all the DF placeholders, which is what bothers me.
With great regards,
Dmitrii
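For reference, a minimal sketch of one common fix: have the function return its results and assign them at the top level instead of expecting objects to appear as a side effect (the DF and sum arguments are no longer needed; res is an illustrative name):
# Return the converted data frame and its summary, then assign them in the caller
basic_sum <- function(place, factor_col) {
  DF <- read.csv(place, sep = ";")
  for (i in factor_col) {
    DF[, i] <- as.factor(DF[, i])
  }
  list(data = DF, summary = summary(DF))
}

res <- basic_sum("~/DataFrame.csv", factor_col = c(1, 6:11))
DataFrame <- res$data     # now visible in the Environment
DF_sum    <- res$summary
View(DF_sum)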

Related

Using a for loop to get a list of data frames in R

splitted is a list of data frames coming from a split() on the main data frame.
After splitting, I'm applying a function to every data frame in the splitted list.
Here is the function:
# Collapse one customer's transactions into a single summary row
getCustomer <- function(df, numberOfProducts = 3){
  Gender <- unique(df$gender)
  Segment <- unique(df$Segment)
  Net_Discount <- sum(df$Discount * df$Sales)
  Number_of_Discounts <- sum(df$Discount > 0)
  Customer.ID <- unique(df$Customer.ID)
  Sales <- sum(df$Sales)
  Profit <- sum(df$Profit)
  lat <- mean(df$lat)
  lon <- mean(df$lon)
  productsData <- df %>% arrange(Order.Date) %>% top_n(n = numberOfProducts)
  Products <- 0
  Products_Category <- 0
  Products_Order_Date <- 0
  for (j in 1:numberOfProducts){
    Products[j] <- productsData %>% select(Product.ID) %>% filter(row_number() == j)
    Products_Category[j] <- productsData %>% select(Category) %>% filter(row_number() == j)
    Products_Order_Date[j] <- productsData %>% select(Order.Date) %>% filter(row_number() == j)
    names(Products)[j] <- paste("Product", j)
    names(Products_Category)[j] <- paste("Category Product", j)
    names(Products_Order_Date)[j] <- paste("Order Date Product", j)
  }
  output <- data.frame(Customer.ID, Gender, Segment, Net_Discount, Number_of_Discounts, Sales, Profit,
                       Products, Products_Category, Products_Order_Date, lon, lat)
  return(output[1, ])
}
I get the right answer for any element of splitted
getCustomer(splitted[[687]],2)
I can even do well with
customer <- list()
customer[[1]]<- getCustomer(splitted[[1]],2)
customer[[2]]<- getCustomer(splitted[[2]],2)
.
.
.
customer[[1576]]<- getCustomer(splitted[[1576]],2)
That is, I can effectively build the whole customer list by assigning element by element.
However, I certainly don't have time for that (1576 single line data frames to assign to the customer list), so I'm trying:
customer <- list()
for (i in 1:length(splitted)){
customer[[i]]<-getCustomer(splitted[[i]],2)
}
After running this last chunk of code, I get:
Error in data.frame(Customer.ID, Gender, Segment, Net_Discount, Number_of_Discounts, : arguments imply differing number of rows: 0, 1
I can't understand this error, since I can build the customer list one element at a time.
I would appreciate your help.
Solution
Editing this question to let you know that the problem was indeed that some data frames in splitted had no rows, so I removed them (only 3):
l <- numeric(length(splitted))   # row count per data frame (must exist before the loop)
for (i in 1:length(splitted)){
  l[i] <- nrow(splitted[[i]])
}
indices <- which(l == 0)
splitted <- splitted[-indices]
Just had to delete 3 samples.
Got no error this time. Thank you all for your time.
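The same filtering can also be written without an explicit loop; a short sketch, assuming splitted is the list of data frames produced by split():
# Drop the list elements that have zero rows (equivalent to the loop above)
splitted <- splitted[sapply(splitted, nrow) > 0]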
Just use lapply, which can apply a function to every element of a list, returning a list in the process:
numberOfProducts <- 2
result <- lapply(splitted, function(x) getCustomer(x, numberOfProducts))
Edit:
It looks like your function has logic which sometimes can result in a data frame with no rows. In this case, you may check for an empty data frame and return NA:
output <- data.frame(Customer.ID, Gender, Segment, Net_Discount, Number_of_Discounts, Sales,
                     Profit, Products, Products_Category, Products_Order_Date, lon, lat)
# use if/else rather than the vectorized ifelse(), so a full data frame row can be returned
if (nrow(output) > 0) return(output[1, ]) else return(NA)
My usual strategy for troubleshooting something like this is to run it in chunks. If you use the for loop, check what the value of i is when the error occurs. With lapply, run it in chunks of around 20 elements and keep going until you find which data frame in your list is causing the error.
Then, run through your function manually with that data frame and look at what output you get. For example:
df <- splitted[[30]] # assuming #30 is the problem
numberOfProducts <- 3
Now walk through the function with those arguments and check the output at each step until you find what causes the error. Keep in mind that if there are multiple places where problems can occur, it may take more than one round of this process to solve them all.
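One way to automate that search is to wrap each call in tryCatch() so failing list elements are reported by index instead of stopping the run; a sketch, assuming splitted and getCustomer as defined above:
# Collect results and report the index of any element where getCustomer() errors out
results <- lapply(seq_along(splitted), function(i) {
  tryCatch(getCustomer(splitted[[i]], 2),
           error = function(e) {
             message("Element ", i, " failed: ", conditionMessage(e))
             NULL
           })
})
failed <- which(sapply(results, is.null))   # indices of the problematic data frames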

Repeating the same action with a function and the apply family

I'm new to R (and I use RStudio) and I have to analyze a big data frame (60 variables, 10,000 observations). My data frame has a column, espece, with many different animal species in it. The goal of my work is to produce results for 8 different species, so I have to work on them separately.
I started by building different subsets (as I learned in school) with some awesome packages (special thanks to dplyr & tidyr). But now I have to repeat many identical (or nearly identical) actions on each of the 8 species, so I spend a lot of time copy/pasting, and when I make a mistake I have to find and fix it across thousands of lines.
So I tried to learn about loops and the apply family of functions, but I can't get anything to work.
Here is an example of an action I perform on one species the traditional way (organizing the data):
espece_td_a <- subset(BDD, BDD$espece == "espece A" & BDD$placette =="TOTAL")%>%
select(code_site,passage,adulte)%>%
spread(passage, adulte)
espece_td_a <- full_join(B.irene_td_a, BDD_P3_TOT_site)
espece_td_a <- replace(espece_td_a, is.na(espece_td_a),0)
espece_td_a$P1[B.irene_td_a$P1>0]<-1
espece_td_a$P2[B.irene_td_a$P2>0]<-1
espece_td_a$P3[B.irene_td_a$P3>0]<-1
write.csv(espece_td_a, file = "espece_td_a.csv")
BDD is my data frame.
BDD_P3_TOT_site is a vector (or a data frame with 1 column and many rows?) built from BDD.
This "traditional way" works for me, but I have to do something like that so many times, and it takes a lot of time...
Then I tried to "apply" this with a function:
f <- function(x)
{
  select(code_site, passage, adulte) %>%
    spread(x, x$passage, x$adulte) %>%
    full_join(x, BDD_P3_TOT_site) -> x
  x <- replace(x, is.na(x), 0)
  x$P1[x$P1 > 0] <- 1
  x$P2[x$P2 > 0] <- 1
  x$P3[x$P3 > 0] <- 1
}
I want to apply this function to my data set with lapply (with my 8 species in a list):
l <- c("espece_a","espece_b","espece_c")
lapply(l,f(x))
Problems:
I know that this is the wrong way to write the lapply call if I want to take my species from BDD.
And the function itself doesn't work:
I had already made 8 subsets (one for each species of interest), which are in my global environment: espece_a, espece_b, ...
So I tried passing my subsets one by one into my function:
> f(espece_a)
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) :
  object 'code_site' not found
I would like each resulting table to appear in my global environment with a name that lets me recognize it (e.g. "espece_td_a").
You have 3 issues relating to your use of lapply:
1) You need to return the object x at the end of the f function.
2) l should be a list of data frames, not just a vector of data frame names, i.e. l <- list(espece_a, espece_b, espece_c).
3) When using lapply with an existing function, you only need to pass the name of the function, i.e. lapply(l, f).
Hopefully this should solve your problem.
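Putting those three points together, a sketch of how the call could look once f() returns its result; the list and object names below are illustrative:
# Apply the corrected f() to a named list of per-species subsets
especes <- list(espece_td_a = espece_a,
                espece_td_b = espece_b,
                espece_td_c = espece_c)
resultats <- lapply(especes, f)

# Optionally, put each result into the global environment under its name
list2env(resultats, envir = .GlobalEnv)

# ...and/or write each one to its own CSV, as in the "traditional way" above
for (nm in names(resultats)) write.csv(resultats[[nm]], file = paste0(nm, ".csv"))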
I solved the function problem:
f <- function(X){
  X <- select(X, code_site, passage, adulte) %>%
    spread(passage, adulte)
  X <- full_join(X, BDD_P3_TOT_site)
  X <- replace(X, is.na(X), 0)
  X$P1[X$P1 > 0] <- 1
  X$P2[X$P2 > 0] <- 1
  X$P3[X$P3 > 0] <- 1
  return(X)
}
test <- f(espece_a)

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
  ## Create an empty data frame
  new_frame <- data.frame()
  ## Append every data frame in the directory whose number is in number_seq to new_frame;
  ## the file variable specifies the path to the file
  for (i in number_seq){
    file <- paste("~/", directory, "/", sprintf("%03d", i), ".csv", sep = "")
    x <- read.csv(file)
    new_frame <- rbind.data.frame(new_frame, x)
  }
  ## calculate and return the mean
  mean(new_frame[, variable], na.rm = TRUE)
}
While calculating the mean, I first tried to subset using the $ sign (new_frame$variable) and with the subset function (subset(new_frame, select = variable)), but both would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subsetting approaches didn't work? It took me a really long time to figure this out, and even though I managed to make it work, I still don't know why the other ways failed. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the data frame literally named variable.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion in cases like this:
dd <- data.frame(
  id = 1:4,
  var = rnorm(4),
  value = runif(4)
)
var <- "value"
dd$var
If $ accepted both variables and column names, which would you expect here: the dd$var column, or the dd$value column (because var == "value")? That's why dd[, var] is different: it only takes character vectors, not expressions referring to column names, so dd[, var] gives you dd$value.
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
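To make the contrast concrete with the dd example above (dd[[var]] is base R's other string-based way to pull a single column):
dd$var      # the column literally named "var" (the rnorm values), not dd$value
dd[, var]   # the "value" column, because the string stored in var is "value"
dd[[var]]   # same as dd[, var] here: selects the "value" column by the string in var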

Applying a set of operations across several data frames in R

I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~100 CSV files and need to perform the exact same set of operations on each of them. I've read them in as separate objects (which I assume is probably improper R style), but I've been unable to write a function that can loop through them. Each CSV is a data frame containing, among other things, a column with dates in decimal-year form. I need to create 2 new columns containing the year and the day of the year. I've figured out how to do it manually, but I would like to automate the process. Here's what I've been doing:
# setup
library(lubridate)  # used to check for leap years
df.00 <- data.frame(site = seq(1:10), date = runif(10, 1980, 2000))
# what I need done
df.00$doy <- NA                        # make an empty column in which I'm going to place the day of the year
df.00$year <- floor(df.00$date)        # grab the year from the date column
df.00$dday <- df.00$date - df.00$year  # get the year fraction; intermediate step
# multiply the fractional year by 365, or 366 if it's a leap year, to give the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing, which is what I would expect from my understanding of how functions work in R. But if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times which tells me that the loop isn't treat the dataframe as a single unit but perhaps looking at each column in the frame.
I would really appreciate some insight on what I should be doing. I feel that I could have solved this easily using bash and awk but I would like to be less incompetent using r
The most efficient and direct way is to use a list:
1) Put all of your CSVs into one folder.
2) Grab a list of the files in that folder, e.g.:
files <- dir('path/to/folder', full.names = TRUE)
3) Iteratively read all of those files into a list of data frames, e.g.:
df.list <- lapply(files, read.csv, <additional args>)
4) Apply your function iteratively over each data frame, e.g.:
lapply(df.list, myFunc, <additional args>)
Since your df's are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
Then take everything you have in the # what I need done portion and put it inside a function, e.g.:
myFunc <- function(DF) {
  # what you want done to a single DF
  return(DF)
}
And then lapply accordingly
df.list <- lapply(df.list, myFunc)
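For concreteness, a sketch of what myFunc might look like here, lifted from the # what I need done block in the question (the name add_doy is illustrative):
library(lubridate)  # for leap_year()

add_doy <- function(DF) {
  DF$year <- floor(DF$date)               # year part of the decimal date
  dday    <- DF$date - DF$year            # fractional part of the year
  days_in_year <- ifelse(leap_year(DF$year), 366, 365)
  DF$doy  <- round(dday * days_in_year)   # day of the year
  DF
}

df.list <- lapply(df.list, add_doy)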
On a separate note, regarding functions:
The reason your funtest "does nothing" is that you are not having it return anything. That is to say, it is doing something, but when it finishes, it does "nothing" with the result.
You need to include a return(.) statement in the function. Alternatively, the value of the last line of the function, if not assigned to an object, will be used as the return value -- but this is only loosely true, so be cautious. The cleanest option (in my opinion) is to use return(.).
Regarding the for loop over the data frame: as you observed, using for (i in someDataFrame) {...} iterates over the columns of the data frame.
You can iterate over the rows using apply:
apply(myDF, MARGIN = 1, function(x) { x$doy <- ...; return(x) }) # don't forget to return

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see the example below). How can I do this in R?
foreach i in A B C D {
    forval n = 1990/2000 {
        local m = `n' - 1
        // create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names; they have no structure, so do not try to impose one on them. Decent programming languages have data structures for this; rubbishy programming languages have tacked-on 'macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know whether you have a complete sequence from pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case, probably a list.
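For instance, the yearly population vectors could live in a single list keyed by year; a sketch, assuming a starting vector pop1989 and the per-year trends stored in a named vector trend (both names are illustrative):
# One list instead of pop1990, pop1991, ... scattered through the global environment
pop <- list("1989" = pop1989)
for (year in 1990:2000) {
  prev <- as.character(year - 1)
  pop[[as.character(year)]] <- pop[[prev]] * (1 + trend[[as.character(year)]])
}

length(pop)   # how many years do we have?
names(pop)    # exactly which years?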
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a data frame (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way is to keep your factors as factors instead of encoding them in variable names.
I'll make some data as I believe it looks in your R version now (at least, I hope so...):
Data <- data.frame(
  popA1989 = 1:10,
  popB1989 = 10:1,
  popC1989 = 11:20,
  popD1989 = 20:11
)
Trend <- replicate(11, runif(10, -0.1, 0.1))
You can then use the stack() function to obtain a data frame in which you have a factor pop and a numeric variable year:
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the data frame is then quite easy:
for(i in 1:11){
  tmp <- newData[newData$year == (1988 + i), ]
  newData <- rbind(newData,
                   data.frame(values = tmp$values * (1 + Trend[, i]),  # grow by the trend, as in the Stata code: pop * (1 + trend)
                              pop  = tmp$pop,
                              year = tmp$year + 1
                   )
  )
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
An adaptation of Joshua's answer that adds the columns to the data frame:
for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop", L, i, sep = "")                  # create name for new variable
    old <- get(paste("pop", L, i - 1, sep = ""), Data)   # get old variable from Data
    trend <- Trend[, i - 1989]                           # get trend variable
    Data <- within(Data, assign(new, old * (1 + trend)))
  }
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop", L, i, sep = "")           # create name for new variable
    old <- get(paste("pop", L, i - 1, sep = ""))  # get old variable
    trend <- get(paste("trend", i, sep = ""))     # get trend variable
    assign(new, old * (1 + trend))
  }
}
Assuming you have the population data in the vector pop1989 and the trend data in trend:
require(stringr)  # because str_c has a better default for the sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
