Overall situation:
The interface of my measuring devices cannot save any information beyond the name of the csv file it generates while recording values. So I used a systematic set of abbreviations to encode the changing parameters (concentrations, enzymes, feed stocks, buffers, etc.). Combined, these abbreviations form the title of each csv file, which in turn becomes the name of the corresponding data.frame. I am now trying to read out those names and combine them with the rest of the data, to build tables I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list.
I could rename them after each call of lapply, but this doesn't seem to be a proper solution.
I found a suggestion to use llply, but I can't teach it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to dataframe
names(data) <- list("one","two", "tree", "four")
This uses the data frames' names to decide whether to pass "o" to a column; this part works fine,
but after running it the names are lost:
data <- lapply(X = seq_along(data),
               FUN = function(i) {
                 x <- data[[i]]
                 if (gsub("([(a-z)]).*", "\\1", names(data)[i]) == "o") {x$enz <- "o"}
                 return(x)
               },
               USE.NAMES = TRUE)
Same thing with llply: it operates as expected but doesn't keep the names either, although I thought it would solve that particular problem (quote: "llply is equivalent to lapply except that it will preserve labels and can display a progress bar.")
data <- llply(seq_along(data), function(i) {
  x <- data[[i]]
  if (gsub("([(a-z)]).*", "\\1", names(data)[i]) == "o") {x$enz <- "o"}
  return(x)
})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
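If you still want to modify the list elements in place and keep the names, one option (a minimal sketch, not part of the answer above) is to iterate over the elements and their names together with Map, which returns a list that keeps its names:
# Sketch: Map() pairs each data frame with its name and returns a named list,
# so nothing has to be renamed afterwards
data <- Map(function(x, nm) {
  if (substr(nm, 1, 1) == "o") x$enz <- "o"
  x
}, data, names(data))
names(data)
# [1] "one"  "two"  "tree" "four"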
Related
I have a working function which is around 250 lines; here is a simplified version:
myfunction <- function(x){
  WithoutNA <<- x[!(is.na(x$Height)),]
  Heavy <- WithoutNA[WithoutNA$Weight >= 150,]
  Light <- WithoutNA[WithoutNA$Weight < 150,]
  HL <<- Heavy[Heavy$FurColor=="light_Brown",]
  HD <<- Heavy[Heavy$FurColor=="Dark_Brown",]
  LL <<- Light[Light$FurColor=="light_Brown",]
  LD <<- Light[Light$FurColor=="Dark_Brown",]
}
So this function gives 4 different data frames, excluding rows where no Height is present and separating by weight and fur color.
The problem I encounter is that if I use this function on two different data frames, the second call will of course overwrite the 4 data frames created by the first call.
If I type in:
myfunction(Horse)
myfunction(Pony)
I would like 8 dataframes called: HL_Horse, HD_Horse, LL_Horse, LD_Horse, HL_Pony, HD_Pony, LL_Pony and LD_Pony
But I can't seem to figure out how to get the original data frame's name into the names of the newly produced data frames. Is it even possible to make a 'variable' data frame name?
This entire concept is flawed. R is a (largely) functional programming language, and users don't expect side effects, particularly (over)writing objects in the calling environment. A far better idea is to have your function return a list of data frames.
Lists are better than directly writing to the calling environment for a number of reasons. They avoid cluttering the global workspace, they can be iterated over, their elements can be named or unnamed, they can be nested, they can be converted into environments, and they can act as a container to allow a function to return multiple objects - just as in your example.
The standard R way to use a function like yours would be something like this:
myfunction <- function(x){
  WithoutNA <- x[!(is.na(x$Height)),]
  Heavy <- WithoutNA[WithoutNA$Weight >= 150,]
  Light <- WithoutNA[WithoutNA$Weight < 150,]
  HL <- Heavy[Heavy$FurColor=="light_Brown",]
  HD <- Heavy[Heavy$FurColor=="Dark_Brown",]
  LL <- Light[Light$FurColor=="light_Brown",]
  LD <- Light[Light$FurColor=="Dark_Brown",]
  return(list(HL = HL, HD = HD, LL = LL, LD = LD))
}
Now if we give it some toy data:
df <- data.frame(Height = c(2, 2, 2, 2),
Weight = c(100, 100, 200, 200),
FurColor = rep(c("light_Brown", "Dark_Brown"), 2))
horse <- myfunction(df)
pony <- myfunction(df)
We can access each of the 8 data frames easily by doing, for example:
horse$HL
#> Height Weight FurColor
#> 3 2 200 light_Brown
pony$LD
#> Height Weight FurColor
#> 2 2 100 Dark_Brown
Note that getting to each data frame involves the same number of characters as each of your named data frames, except you now have all the other benefits of having your data frames safely and logically stored as lists.
If you want to make your global workspace even less cluttered, you can even nest the lists so that all your data frames are in one master list. So for example, you could have:
equines <- list()
equines$horse <- myfunction(df)
equines$pony <- myfunction(df)
And now you have only a single object in your global workspace but you can access each data frame in a consistent and easy to remember way, e.g.
equines$pony$HL
#> Height Weight FurColor
#> 3 2 200 light_Brown
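And if your real inputs are data frames such as Horse and Pony (hypothetical names here, matching the earlier example), you can build that master list in one step by naming the input list; lapply then carries the names through:
# Sketch, assuming Horse and Pony are the actual input data frames
equines <- lapply(list(horse = Horse, pony = Pony), myfunction)
equines$pony$HL   # same access pattern as above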
I'm new to R (and I use RStudio) and I have to analyze a big data frame (60 variables for 10,000 observations). My data frame has a species column (espece) with a lot of different animal species in it. The goal of my work is to produce results for 8 different species, so I have to work on them separately.
I started by building different subsets (as I learned in school) and with some awesome packages (special thanks to dplyr & tidyr). But now I have to repeat many identical (or nearly identical) actions on each of the 8 species, so I spend a lot of time copying/pasting, and when I make a mistake I have to find and fix it across thousands of lines.
Then I tried to learn about loops and the apply family of functions, but I can't get anything to work.
Here is an example of what I do for one species the traditional way (organizing the data):
espece_td_a <- subset(BDD, BDD$espece == "espece A" & BDD$placette == "TOTAL") %>%
  select(code_site, passage, adulte) %>%
  spread(passage, adulte)
espece_td_a <- full_join(espece_td_a, BDD_P3_TOT_site)
espece_td_a <- replace(espece_td_a, is.na(espece_td_a), 0)
espece_td_a$P1[espece_td_a$P1 > 0] <- 1
espece_td_a$P2[espece_td_a$P2 > 0] <- 1
espece_td_a$P3[espece_td_a$P3 > 0] <- 1
write.csv(espece_td_a, file = "espece_td_a.csv")
BDD is my data frame.
BDD_P3_TOT_site is a vector (or a data frame with 1 column and many rows?) built from BDD.
This "traditional way" works for me, but I have to do something like that so many times, and it takes a lot of time...
Then I tried to "apply" this with a function:
f <- function(x)
{
  select(code_site, passage, adulte) %>%
    spread(x, x$passage, x$adulte) %>%
    full_join(x, BDD_P3_TOT_site) -> x
  x <- replace(x, is.na(x), 0)
  x$P1[x$P1 > 0] <- 1
  x$P2[x$P2 > 0] <- 1
  x$P3[x$P3 > 0] <- 1
}
I want to apply this function to my dataset with lapply (with my 8 species in a list):
l <- c("espece_a","espece_b","espece_c")
lapply(l,f(x))
Problems:
I know this is the wrong formulation for lapply if I want to take my species from BDD.
The function doesn't work:
I already made 8 subsets (one for each species of interest) in my global environment: espece_a, espece_b, ...
Then I wanted to put my subsets one by one into my function:
> f(espece_a)
Error in select_(.data, .dots = lazyeval::lazy_dots(...)) :
  object 'code_site' not found
I would like the resulting table to appear in my global environment with a name that lets me recognize it (e.g. "espece_td_a").
You have 3 issues relating to your use of lapply:
1) You need to return the object x at the end of the f function.
2) l should be a list of data frames, not just a vector of data frame names, i.e. l <- list(espece_a, espece_b, espece_c)
3) When using lapply with an existing function, you only need to pass the name of the function, i.e. lapply(l, f)
Hopefully this should solve your problem.
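Putting the three fixes together, the call pattern would look roughly like this (a sketch; it assumes the corrected f and your espece_* subsets exist; naming the list elements is optional but keeps the results labelled):
l <- list(espece_a = espece_a, espece_b = espece_b, espece_c = espece_c)
res <- lapply(l, f)   # res$espece_a, res$espece_b, ... hold the processed tables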
I solved the function problem:
f <- function(X){
  X <- select(X, code_site, passage, adulte) %>%
    spread(passage, adulte)
  X <- full_join(X, BDD_P3_TOT_site)
  X <- replace(X, is.na(X), 0)
  X$P1[X$P1 > 0] <- 1
  X$P2[X$P2 > 0] <- 1
  X$P3[X$P3 > 0] <- 1
  return(X)
}
test <- f(espece_a)
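To run the fixed function over all 8 species and still end up with recognizable names for the results and the csv files, one option is a named list plus Map - just a sketch, and the file-name pattern is only an example:
especes <- list(espece_a = espece_a, espece_b = espece_b)   # ... add the remaining species
resultats <- lapply(especes, f)
# write one csv per species, named after the corresponding list element
invisible(Map(function(d, nm) write.csv(d, file = paste0(nm, "_td.csv")), resultats, names(resultats)))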
I'm trying to replicate a solution for applying multiple functions in sapply, posted on R-Bloggers, but I can't get it to work in the desired manner. I'm working with a simple data set, similar to the one generated below:
require(datasets)
crs_mat <- cor(mtcars)
# Triangle function
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)] <- NA
return(cormat)
}
require(reshape2)
crs_mat <- melt(get_upper_tri(crs_mat))
I would like to replace some text values across columns Var1 and Var2. The erroneous syntax below illustrates what I am trying to achieve:
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
gsub("mpg","MPG",x),
# Replace second phrase
gsub("gear", "GeArr",x)
# Ideally, perform other changes
})
Naturally, the code is not syntactically correct and fails. To summarise, I would like to do the following:
Go through all the values in first two columns (Var1 and Var2) and perform simple replacements via gsub.
Ideally, I would like to avoid defining a separate function, as discussed in the linked post and keep everything within the sapply syntax
I don't want a nested loop
I had a look at the broadly similar subject discussed here and here but, if possible, I would like to avoid making use of plyr. I'm also interested in replacing the column values, not in creating new columns, and I would like to avoid specifying any column names; while working with my existing data frame it is more convenient for me to use column numbers.
Edit
Following very useful comments, what I'm trying to achieve can be summarised in the solution below:
library(stringr)   # for str_wrap()

fun.clean.columns <- function(x, str_width = 15) {
  # Make character
  x <- as.character(x)
  # Replace various phrases
  x <- gsub("perc85", "something else", x)
  x <- gsub("again", "placeholder", x)   # hypothetical replacement value
  x <- gsub("more", "even more", x)
  x <- gsub("abc", "ohmg", x)
  # Clean spaces
  x <- trimws(x)
  # Wrap strings
  x <- str_wrap(x, width = str_width)
  # Return object
  return(x)
}
mean_data[,1:2] <- sapply(mean_data[,1:2], fun.clean.columns)
I don't need this function in my global environment, so I can run rm() afterwards, but an even nicer solution would squeeze this within the apply syntax.
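One way to do that (just a sketch reusing part of the body of fun.clean.columns above, so nothing named is left behind to rm()) is to pass the function anonymously inside the *apply call:
# Sketch: the cleaning logic lives only inside the call
mean_data[, 1:2] <- lapply(mean_data[, 1:2], function(x) {
  x <- as.character(x)
  x <- gsub("perc85", "something else", x)
  str_wrap(trimws(x), width = 15)   # str_wrap() from stringr, as above
})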
We can use mgsub from library(qdap) to replace multiple patterns. Here, I am looping over the first and second columns using lapply and assigning the results back to crs_mat[,1:2]. Note that I am using lapply instead of sapply, as lapply keeps the structure intact.
library(qdap)
crs_mat[,1:2] <- lapply(crs_mat[,1:2], mgsub,
pattern=c('mpg', 'gear'), replacement=c('MPG', 'GeArr'))
Here is a start of a solution for you; I think you're capable of extending it yourself. There are probably more elegant approaches available, but I don't see them at the moment.
crs_mat[,1:2] <- sapply(crs_mat[,1:2], function(x) {
# Replace first phrase
step1 <- gsub("mpg","MPG",x)
# Replace second phrase. Note that this operates on the result of the first replacement (step1).
step2 <- gsub("gear", "GeArr",step1)
# Ideally, perform other changes
return(step2)
#or one nested line, not practical if more needs to be done
#return(gsub("gear", "GeArr",gsub("mpg","MPG",x)))
})
I am trying to get to grips with R and as an experiment I thought that I would try to play around with some cricket data. In its rawest format it is a yaml file, which I used the yaml R package to turn into an R object.
However, I now have a number of nested lists of uneven length that I want to try and turn into a data frame in R. I have tried a few methods such as writing some loops to parse the data and some of the functions in the tidyr package. However, I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure would be difficult here, because the complexity comes from the multiple nested lists and the unevenness of their lengths (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr - separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This basically uses the tidyr package to return a single-column data frame that I then try to separate. However, the problem here is that there are sometimes extras variables in the middle that appear in some rows and not others, thereby throwing off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It also suffers from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: this is not a complete answer; it is an attempt to arrange the innings data.
plyr::rbind.fill may offer a solution for binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this through all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
  team = X[[1]][["team"]]
  # function to process each list sub-element that represents a single delivery
  fn <- function(...) {
    tmp = unlist(...)
    tmp = data.frame(ball = gsub("[^0-9]", "", names(tmp))[1], t(tmp))
    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
    tmp
  }
  # loop over all deliveries
  lst = lapply(X[[1]][["deliveries"]], fn)
  cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements eg [[`data_version`]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works
(Note: the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]], so this is the outer loop; the lapply with fn loops through the deliveries within an innings, which is the inner loop.)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers / deliveries from the names, as otherwise we will end up with lots of uniquely named columns.
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if i was better at regex I would have also removed the leading dots)
gsub("[0-9]", "", names(tmp))
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I were going to parse every yaml file, I would use a loop.
Using the first three files as an example, and also adding the match date:
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length=length(tmp))
for(i in seq_along(tmp)) {
  d = yaml.load_file(tmp[i])
  all_raw_dat[[i]] <- cbind(date = d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
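For example, a minimal sketch of that final step:
# bind the per-file data frames produced by the loop above into one data.frame
final_dat <- plyr::rbind.fill(all_raw_dat)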
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a,b) # error as names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add/update rows with the extra columns where needed (a still doesn't have column z). Think of it as creating an empty data frame with the number of columns equal to the number of unique column names found across the list of data frames - unique(c(names(a), names(b))). The values are then filled in for each row where possible, and left missing (NA) otherwise.
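For reference, the filled result of the small example above looks like this:
plyr::rbind.fill(a, b)
#   x  y  z
# 1 1  2 NA
# 2 2 NA  1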
An R novice is once again seeking help.
General situation: I am currently creating a script, and I have several data frames per experiment.
The experiments vary in the time-steps of the measurements and in the number of reactors, so my script
needs to be flexible in two dimensions in order to "massage" the data into the right shape for the desired tests, and to draw the necessary data from multiple data frames.
Unfortunately I chose to use for loops to account for this, which I now see is bad practice in R,
but I have gotten too far to change direction now.
The Problem: I am trying to get one-column matrices named after the objects that hold them, inside a for loop. I need them to be in matrix format because of further functions I want to apply.
# Simple but non- flexible examples of what I want to do:
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# name the column headers after the objects' names
colnames(a1) <- "a1"
colnames(a2) <- "a2"
This works, but I need it to work within a for loop...
# here are my two flexible but non-working approaches
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(colnames(paste("a",i,sep="",collapse="")),do.call("c",list(paste("a",i,sep=""))))
}
This isn't the proper use of assign and creates an error.
The second attempt doesn't create an error, but it doesn't work either; it creates empty objects:
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(paste("a",i,sep="", colapse=""),do.call("colnames",list(paste("a",i,sep="", colapse=""))))
}
My conclusion: I do not understand the proper way of combining assign and colnames.
If anyone has a suggestion for how I could get this up and running, that would be awesome.
So far I have searched for: R combining assign and colnames inside for loop, R using assign and colnames, R naming data with for loops, ...
but unfortunately didn't manage to extrapolate a solution to my problem.
The following is a function that, by default, will take any objects in the parent environment whose names start with a followed by numbers, check that they are one-column matrices, and if they are, name the columns with the name of the object.
a1 <- matrix(c(1:5))
a2 <- matrix(c(1:5))
name_cols()
a1
# a1
# [1,] 1
# [2,] 2
# ...
a2
# a2
# [1,] 1
# [2,] 2
# ...
And here is the code:
name_cols <- function(pattern = "^a[0-9]+", env = parent.frame()) {
  lapply(
    ls(pattern = pattern, envir = env),
    function(x) {
      var <- get(x, envir = env)
      if (is.matrix(var) && identical(ncol(var), 1L)) {
        colnames(var) <- x
        assign(x, var, envir = env)
      }
    }
  )
  invisible(NULL)
}
Note that I chose specification by pattern, but you can easily change this to specify the names of the variables (instead of using ls, just pass the names and lapply over those), or potentially even the objects themselves (though you have to use substitute for that, and the wisdom of doing so becomes questionable).
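For instance, here is a sketch of that by-name variant (name_cols_by_name is just a made-up name for illustration):
name_cols_by_name <- function(nms, env = parent.frame()) {
  lapply(nms, function(x) {
    var <- get(x, envir = env)
    if (is.matrix(var) && identical(ncol(var), 1L)) {
      colnames(var) <- x
      assign(x, var, envir = env)
    }
  })
  invisible(NULL)
}
name_cols_by_name(c("a1", "a2"))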
More generally, if you have several related objects on which you will be performing related analysis (e.g. in this case modifying columns), you should really consider storing them in lists rather than at the top level. If you do this, then you can easily use the built in *pply functions to operate on all your objects at once. For example:
a.lst <- list(a1 = matrix(1:5), a2 = matrix(1:5))
# loop over the names and assign back with `a.lst[] <-` so the list keeps its names
a.lst[] <- lapply(names(a.lst), function(x) {colnames(a.lst[[x]]) <- x; a.lst[[x]]})
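Each element then carries its own column name, e.g.:
a.lst$a1
#      a1
# [1,]  1
# [2,]  2
# ...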