I'm new to R (I use RStudio) and I have to analyze a big data frame (60 variables, 10,000 observations). My data frame has a column named specie that contains many different animal species. The goal of my work is to get results for 8 different species, so I have to work on each of them separately.
I started by building separate subsets (as I learned in school), with the help of some great packages (special thanks to dplyr and tidyr). But now I have to repeat many identical (or nearly identical) actions on each of the 8 species, so I spend a lot of time copying and pasting, and when I make a mistake I have to check and correct it across thousands of lines.
So I tried to learn about loops and the apply family of functions, but I can't get anything to work.
Here is an example of what I do for one species the traditional way (organizing the data):
library(dplyr)  # %>%, select(), full_join()
library(tidyr)  # spread()
espece_td_a <- subset(BDD, BDD$espece == "espece A" & BDD$placette == "TOTAL") %>%
  select(code_site, passage, adulte) %>%
  spread(passage, adulte)
espece_td_a <- full_join(espece_td_a, BDD_P3_TOT_site)
# Missing counts become 0, then any positive count becomes 1 (presence)
espece_td_a <- replace(espece_td_a, is.na(espece_td_a), 0)
espece_td_a$P1[espece_td_a$P1 > 0] <- 1
espece_td_a$P2[espece_td_a$P2 > 0] <- 1
espece_td_a$P3[espece_td_a$P3 > 0] <- 1
write.csv(espece_td_a, file = "espece_td_a.csv")
BDD is my data frame.
BDD_P3_TOT_site is a vector (or a data frame with one column and many rows?) built from BDD.
This "traditional way" works for me, but I have to do something like this so many times, and it takes a lot of time.
So I tried to wrap it in a function:
f <- function(x)
{
select(code_site, passage, adulte)%>%
spread(x, x$passage, x$adulte)%>%
full_join(x, BDD_P3_TOT_site) -> x
x <- replace(x, is.na(x),0)
x$P1[x$P1>0]<-1
x$P2[x$P2>0]<-1
x$P3[x$P3>0]<-1
}
I would like to apply this function to my dataset with lapply (with my 8 species in a list):
l <- c("espece_a","espece_b","espece_c")
lapply(l,f(x))
Problems:
I know this is the wrong way to call lapply if I want to pick my species out of BDD.
The function doesn't work:
I had already made 8 subsets (one for each species of interest).
They are in my global environment: espece_a, espece_b, ...
Then I tried to pass my subsets one by one to my function:
> f(espece_a)
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'code_site' not found
I would like the resulting table to appear in my global environment with a name that lets me recognize it (e.g. "espece_td_a").
You have 3 issues relating to your use of lapply:
You need to return the object x at the end of the f function, e.g. with return(x).
l should be a list of dataframes not just a vector of dataframe names, i.e. l <- list(espece_a,espece_b,espece_c)
When using lapply with an existing function, you only need to pass the name of the function, i.e. lapply(l,f)
Hopefully this should solve your problem.
I solved the function problem:
f <- function(X){
X <- select(X, code_site, passage, adulte)%>%
spread(passage, adulte)
X <- full_join(X, BDD_P3_TOT_site)
X <- replace(X, is.na(X),0)
X$P1[X$P1>0]<-1
X$P2[X$P2>0]<-1
X$P3[X$P3>0]<-1
return(X)
}
test <- f(espece_a)
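Putting the answer's three points together, a minimal base-R sketch might look like the following. The toy BDD and the helper name to_presence are made up for illustration, and tapply() stands in for the select/spread steps, so treat it as one possible shape and not the exact pipeline from the question. Splitting by species gives a named list, so each result can be recognized by its name without creating eight separate objects.

```r
# Toy stand-in for BDD (column names follow the question; values are invented)
BDD <- data.frame(
  espece    = rep(c("espece A", "espece B"), each = 6),
  placette  = "TOTAL",
  code_site = rep(rep(c("s1", "s2"), each = 3), times = 2),
  passage   = rep(c("P1", "P2", "P3"), times = 4),
  adulte    = c(0, 2, NA, 1, 0, 3, 5, 0, 0, NA, 1, 0)
)

# Hypothetical helper: all per-species steps in one place, result returned
to_presence <- function(x) {
  # Sites in rows, passages in columns
  counts <- tapply(x$adulte, list(x$code_site, x$passage), sum)
  counts[is.na(counts)] <- 0          # missing counts become 0
  presence <- (counts > 0) * 1        # any positive count becomes 1
  data.frame(code_site = rownames(presence), presence, row.names = NULL)
}

totals     <- BDD[BDD$placette == "TOTAL", ]
by_species <- split(totals, totals$espece)   # named list, one data frame per species
results    <- lapply(by_species, to_presence) # pass the function itself, not f(x)

# One CSV per species, named after the list element (written to a temp folder here)
for (nm in names(results)) {
  write.csv(results[[nm]], file.path(tempdir(), paste0(nm, ".csv")),
            row.names = FALSE)
}
```

Individual tables are then available as results[["espece A"]], results[["espece B"]], and so on.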
For some basic publications I have to write almost the same code for many tables. So I need reasonably quick code that builds data frames from files and runs the same operations on the data using a single function.
Example:
# Creating function
basic_sum <- function (place, DF, factor_col, sum) {
# Uploading data.frame
DF <- read.csv (place, sep = ";")
# Converting to factor
for (i in factor_col) {
DF [, i] <- as.factor (DF [, i])
}
# Summary
sum <- summary (DF)
View (sum)
}
Then I run that code and get a function basic_sum.
When I want to work with my data, I call this function with arguments:
basic_sum (place = "~/DataFrame.csv", DF = DataFrame,
factor_col = c (1, 6 : 11), sum = DF_sum)
After running it, nothing happens. I mean, nothing new appears in the Environment: no new data, no new variables, nothing else.
As I understand it, I should end up with:
1) a data.frame "DataFrame", loaded from DataFrame.csv;
2) the 1st, 6th, 7th and all other columns up to the 11th converted to factors;
3) a data.frame "DF_sum" with the summary of all the columns of "DataFrame";
4) the data.frame "DF_sum" displayed.
Well, I see all of it in the console, but I need it in the Environment so that I can save it somewhere.
It seems that I'm doing something wrong, but I don't know what.
P.S.: If I run it without a function (replacing DF with DataFrame, factor_col with c (1, 6 : 11) and so on), everything works. But then I have to rewrite the code every time, or at least replace every DF and the other names, which is tedious.
With great regards,
Dmitrii
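Objects created inside a function live only in that function's own environment, which is why nothing shows up in the Environment pane. One common pattern, sketched below under assumptions, is to have the function return what it built and to assign that return value when calling it. The demo file, its columns, and the returned element names (data, summary) are invented for the example.

```r
# Sketch: return the results instead of relying on side effects
basic_sum <- function(place, factor_col, sep = ";") {
  DF <- read.csv(place, sep = sep)
  # Convert the requested columns to factors
  DF[factor_col] <- lapply(DF[factor_col], as.factor)
  # Hand back both the data and its summary
  list(data = DF, summary = summary(DF))
}

# Hypothetical demo file so the sketch can run on its own
demo_file <- tempfile(fileext = ".csv")
write.table(data.frame(group = c(1, 2, 1), value = c(3.5, 4.1, 2.2)),
            demo_file, sep = ";", row.names = FALSE)

res       <- basic_sum(demo_file, factor_col = 1)
DataFrame <- res$data      # now visible in the Environment
DF_sum    <- res$summary
```

With the real file, the call would take the path and factor_col = c(1, 6:11); View(DF_sum) can then be run on the assigned object.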
I'm trying to replicate the solution for applying multiple functions in sapply posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values not in creating new columns and I would like to avoid specifying any column names. While working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
fun.clean.columns <- function(x, str_width = 15) {
# Make character
x <- as.character(x)
# Replace various phrases
x <- gsub("perc85","something else", x)
x <- gsub("again", x)
x <- gsub("more","even more", x)
x <- gsub("abc","ohmg", x)
# Clean spaces
x <- trimws(x)
# Wrap strings (str_wrap() comes from the stringr package)
x <- str_wrap(x, width = str_width)
# Return object
return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global environment, so I can run rm after this, but an even nicer solution would squeeze everything into the apply call.
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact.
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is the start of a solution for you; I think you're capable of extending it yourself. There are probably more elegant approaches available, but I don't see them at the moment.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the already-modified values.
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
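If the list of replacements grows, a base-R variant that keeps everything inside the lapply call could look like this. It is only a sketch: the swaps lookup vector is a name made up here, and as.data.frame(as.table(...)) is used as a stand-in for the melted matrix so that no extra package is needed.

```r
# Stand-in for the melted correlation matrix (columns Var1, Var2 and a value column)
crs_mat <- as.data.frame(as.table(cor(mtcars)))

# Hypothetical lookup: names are the patterns, values are the replacements
swaps <- c(mpg = "MPG", gear = "GeArr")

crs_mat[, 1:2] <- lapply(crs_mat[, 1:2], function(x) {
  # Reduce() passes the text through one gsub() call per pattern, in order,
  # starting from the column converted to character
  Reduce(function(txt, p) gsub(p, swaps[[p]], txt, fixed = TRUE),
         names(swaps), as.character(x))
})
```

Adding another replacement then only means adding an element to swaps; columns are still addressed by number.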
Overall situation:
The interface of my measuring devices can't save any information other than the name of the csv file it generates while measuring. So I used a systematic set of abbreviations to encode the changing parameters (concentrations, enzymes, feedstocks, buffers, etc.). Combined, they form the titles of my csv files, which in turn become the names of the data.frames. I am now trying to read out those names and combine them with the rest of the data to build tables that I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list.
I could rename them after each call of lapply, but this doesn't seem to be a proper solution.
I found a suggestion to use llply, but I can't get it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
# uses the data frame's name to pass "o" to a column; this part works fine,
# but after running it the names are lost
data <- lapply(X = seq_along(data),
FUN = function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)},
USE.NAMES = TRUE)
Same thing with llply: it operates as expected but doesn't keep the names either, although I thought it would solve that particular problem (quote: "llply is equivalent to lapply except that it will preserve labels and can display a progress bar.").
data <- llply(seq_along(data), function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
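As for why the names disappear: lapply and llply take their labels from the object they iterate over, and seq_along(data) is a plain integer vector without names. A sketch of one way around this is to iterate over the list itself and pass the names alongside with Map, which keeps the names of its first argument. The column names a and b are invented, and substr() is used here as a simpler stand-in for the regular expression in the question.

```r
# Showcase list of data frames with names, as in the question
data <- list(
  one  = data.frame(a = c(1, 2), b = c(3, 3)),
  two  = data.frame(a = c(1, 2), b = c(3, 3)),
  tree = data.frame(a = c(1, 2), b = c(3, 3)),
  four = data.frame(a = c(1, 2), b = c(3, 3))
)

# Map() walks over the data frames and their names in parallel
data <- Map(function(x, nm) {
  # Add the enzyme column when the name starts with "o"
  if (substr(nm, 1, 1) == "o") x$enz <- "o"
  x
}, data, names(data))

names(data)
```

Because the first argument passed to Map is the named list, the result carries the same names and no renaming step is needed afterwards.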
I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine, but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other people's examples of how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you create temp, temp2, and temp3 (overwriting the old ones), so you'll be left with only the output of the last pass through the loop.
However, this isn't something that needs a loop. Instead you can make use of the vectorized nature of R:
> x <- 1:10
> c(x[-1], NA)
[1] 2 3 4 5 6 7 8 9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
c(x[-1], NA)
}
# Group by the id column (RID in the question's data)
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
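For the base R route with split, a small sketch on invented data could look like this. Note that c(x[-1], NA) pulls in the value from the next row; since the question asks for the value one row before, the hypothetical helper lag_one below shifts the other way. It indexes instead of using c(), so the Date class is kept.

```r
# Toy data in the shape described in the question (values are made up)
part2test <- data.frame(
  RID      = c(1, 1, 1, 2, 2),
  EXAMDATE = as.Date(c("2010-01-05", "2010-07-01", "2011-01-10",
                       "2012-03-01", "2012-09-15"))
)

# Previous-row value; the first element of each group has nothing before it
lag_one <- function(x) x[c(NA_integer_, head(seq_along(x), -1))]

# Split by person, lag within each piece, then stack the pieces again
pieces <- lapply(split(part2test, part2test$RID), function(d) {
  d$EXAMDATE_LAG <- lag_one(d$EXAMDATE)
  d
})
result <- do.call(rbind, pieces)
```

Each person's first exam gets NA in EXAMDATE_LAG, and no date leaks from one RID into the next.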
I am trying to put together a function that will loop thru a given data frame in blocks and return a new data frame containing stuff calculated from the original. The length of x will be different each time and the actual problem will have more loops in the function. New-ish to R and have not been able to find anything helpful (I don't think using a list will help)
func<-function(x){
tmp # need to declare this here?
for (i in 1:dim(x)[1]){
tmp[i]<-ave(x[i,]) # add things to it
}
return(tmp)
}
df<-cbind(rnorm(10),rnorm(10))
means<-func(df)
This code does not work but I hope it gets across what I want to do. thanks!
Do you mean you want to loop through each row of df and return a data frame with the calculated values?
You may want to look into the apply function:
df <- cbind(rnorm(10),rnorm(10))
# apply(df,1,FUN) does FUN(df[i,])
# e.g. mean of each row:
apply(df,1,mean)
For more complicated looping like performing some operation on a per-factor basis, I strongly recommend package plyr, and function ddply within. Quick example:
df <- data.frame( gender=c('M','M','F','F'), height=c(183,176,157,168) )
# find mean height *per gender*
ddply(df,.(gender), function(x) c(height=mean(x$height)))
# returns:
#   gender height
# 1      F  162.5
# 2      M  179.5
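On the "need to declare this here?" part of the question: if you do keep the loop, the result has to exist before you assign into it, and the usual approach is to preallocate it at its final length. A minimal sketch follows (the function name row_means_loop is made up; the data mimic the question's cbind of two rnorm columns):

```r
set.seed(1)
m <- cbind(rnorm(10), rnorm(10))

# Loop version: create the result vector first, then fill it row by row
row_means_loop <- function(x) {
  out <- numeric(nrow(x))            # preallocated, one slot per row
  for (i in seq_len(nrow(x))) {
    out[i] <- mean(x[i, ])
  }
  out
}

means <- row_means_loop(m)

# The same values without an explicit loop
all.equal(means, apply(m, 1, mean))
all.equal(means, rowMeans(m))
```

seq_len(nrow(x)) is used instead of 1:dim(x)[1] so that an input with zero rows simply skips the loop.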