Evaluating a single code block in two data frames in R

I have two data frames
d1 = data.frame(a=1:4,b=2:5)
d2 = data.frame(a=0:3,b=3:6)
and I would like to evaluate the same block of code, for example
c<-exp(a)
d<-b^2
within each data frame. At the moment I have to duplicate the code block as follows:
d1t = within(d1, {
  c<-exp(a)
  d<-b^2
})
d2t = within(d2, {
  c<-exp(a)
  d<-b^2
})
which makes my code prone to errors if I make changes to one of the code blocks (they should be the same).
I am not so familiar with environments in R, but I think it should be possible to use them to solve this problem nicely. How can I do it?

This is the perfect situation to write the repeated code blocks into a function:
MyFun <- function(df) {
  out = within(df, {
    c<-exp(a)
    d<-b^2
  })
  return(out)
}
This will work as long as the variable names are the same across datasets.
To run the code just do:
d1t <- MyFun(d1)
d2t <- MyFun(d2)
Should work.

We could place the data frame objects in a list. We search the global environment for the names of objects matching the pattern ^d\\d+, i.e. 'd' followed by numbers. If there are multiple matching objects (in this case two, 'd1' and 'd2'), we can get their values as a list with mget.
lst <- mget(ls(pattern='^d\\d+'))
Now, we loop through the list with lapply and create new variables 'c' and 'd' using transform.
lst1 <- lapply(lst, transform, c=exp(a), d= b^2)
It is better to keep the data.frames within the list. But if we need to update the original datasets or create new objects, i.e. 'd1t' and 'd2t' (not recommended), we can change the names of the list elements with setNames and use list2env to create the objects in the global environment.
list2env(setNames(lst1, paste0(names(lst1), 't')), envir=.GlobalEnv)
d1t
# a b c d
#1 1 2 2.718282 4
#2 2 3 7.389056 9
#3 3 4 20.085537 16
#4 4 5 54.598150 25
d2t
# a b c d
#1 0 3 1.000000 9
#2 1 4 2.718282 16
#3 2 5 7.389056 25
#4 3 6 20.085537 36
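Since the original question also mentions environments, here is a minimal sketch (not one of the posted answers above) that keeps the block as a single quoted expression and re-evaluates it against each data frame with eval(); the name 'blk' is introduced here purely for illustration:
blk <- quote(data.frame(a, b, c = exp(a), d = b^2))
d1t <- eval(blk, d1)  # 'a' and 'b' are looked up inside d1
d2t <- eval(blk, d2)  # the same expression, re-evaluated against d2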

Related

Insert row/s of NAs on the Nth row to list of data.frames with N from list

After numerous hours I find myself unable to solve the following issue:
I have a list of data frames. I want to insert (not replace) one or more rows of NAs (always at least one row) into every data frame respectively. The row indexes at which to insert them are stored in a separate list.
To illustrate, I have the following two lists:
#list of dataframes
listDF <- list(data.frame(1:10),data.frame(1:9))
#list of row-indexes
listRI<- list(1,c(3,5))
My task, hence, is to insert one row of NAs at the first row of the first data frame of listDF, and two rows of NAs (rows 3 and 5) into the second data frame of listDF.
From Add new row to dataframe, at specific row-index, not appended?, answer 156, I have made the following function:
insertRow <- function(df, rowindex) {
  df[seq(rowindex+1,nrow(df)+1),] <- df[seq(rowindex,nrow(df)),]
  df[rowindex,] <- rep(NA,ncol(df))
  df
}
After this, I'm not sure how to proceed. Looking around SO and other pages, I figure that the Map function might help me. The following works as long as there is only one row to add to each df. For instance, this works fine:
#Example with insert of single row in both dataframes
Map(function(x,y){insertRow(x,y)},x=listDF,y=list(1,5))
This inserts one row of NA at the first row of the first df and a row of NA at the fifth row of the second df. However, if I use:
#Example with insert of single row in both dataframes
Map(function(x,y){insertRow(x,y)},x=listDF,y=listRI)
the function does not work (since the second element of listRI has length > 1). What I am missing, if I have understood it correctly, is a for-loop that updates those data frames of listDF where I want to insert several rows of NA. Can I get some input on how to solve my issue?
As always, please let me know if I need to be clearer. Best/John
Edit:
I edited the example code so that the row indexes are not only the first row(s).
Edit (again):
In case someone runs into this code and intends to use it: I found a problem with the insertRow function when the new row has to be added after the last row of the data frame. I solved this by editing the function as follows:
insertRow <- function(df, rowindex) {
  if(rowindex<=nrow(df)) {
    df[seq(rowindex+1,nrow(df)+1),] <- df[seq(rowindex,nrow(df)),]
    df[rowindex,] <- rep(NA,ncol(df))
    return(df)
  }
  if(rowindex>=nrow(df)+1) {
    df[nrow(df)+1,] <- rep(NA,ncol(df))
    return(df)
  }
}
You can add a for loop to go over listRI.
Map(function(x,y){for(i in y) {x <- insertRow(x, i)}; x},x=listDF,y=listRI)
#[[1]]
# X1.10
#1 NA
#2 1
#3 2
#4 3
#5 4
#6 5
#7 6
#8 7
#9 8
#10 9
#11 10
#
#[[2]]
# X1.9
#1 1
#2 2
#3 NA
#4 3
#5 NA
#6 4
#7 5
#8 6
#9 7
#10 8
#11 9
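The same accumulation can also be written with Reduce() instead of an explicit for loop; a minimal sketch, assuming the insertRow() defined above:
Map(function(x, y) Reduce(insertRow, y, init = x), listDF, listRI)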

between list calculations per row in R

Let's say I have the following list of df's (in reality I have many more dfs).
seq <- c("12345","67890")
li <- list()
for (i in 1:length(seq)){
  li[[i]] <- list()
  names(li)[i] <- seq[i]
  li[[i]] <- data.frame(A = c(1,2,3),
                        B = c(2,4,6))
}
What I would like to do is calculate the mean for the same cell position across the data frames in the list, keeping the same number of rows and columns as the original data frames. How could I do this? I believe I can use the apply() function, but I am unsure how.
The expected output (not surprising):
A B
1 1 2
2 2 4
3 3 6
In reality, the values within each data frame are not necessarily the same.
If there are no NAs, then we can use Reduce to get the sum of observations for each element and divide by the length of the list:
Reduce(`+`, li)/length(li)
# A B
#1 1 2
#2 2 4
#3 3 6
If there are NA values, then it may be better to use mean (which has an na.rm argument). For this, we can convert the list to an array and then use apply:
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean)
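Since the point of switching to mean() is its na.rm argument, that argument can be passed straight through apply()'s dots, for example:
apply(array(unlist(li), dim = c(dim(li[[1]]), length(li))), c(1, 2), mean, na.rm = TRUE)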
An equivalent option in the tidyverse would be:
library(tidyverse)
reduce(li, `+`)/length(li)

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Expt" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works; however, it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
repeat {
  if(!anyDuplicated(data[,c("part","site","index")])) { break }
  data$index<-ifelse(duplicated(data[,1:2]),data$index+1,data$index)
}
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
  if(!anyDuplicated(df[,c(1,3)])) { break }
  df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. In this method, we are not creating a column of 1s, i.e. 'index', and we are assuming that the sequence in 'Run' is ordered as shown in the OP's example.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]
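For completeness, a dplyr sketch of the same grouped counter applied to the new example (not among the original answers; assumes dplyr is installed):
library(dplyr)
data %>% group_by(part, site) %>% mutate(index = row_number())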

How to recode a set of variables in a dataframe in R

I have a dataframe with different variables containing values from 1 to 5. I want to recode some variables so that 5 becomes 1 and vice versa (x=6-x).
I want to define a list of variables that will be recoded like this in my dataframe.
Here is my approach using lapply. I haven't really understood it yet.
#generate example-dataset
var1<-sample(1:5,100,rep=TRUE)
var2<-sample(1:5,100,rep=TRUE)
var3<-sample(1:5,100,rep=TRUE)
dat<-as.data.frame(cbind(var1,var2,var3))
recode.list<-c("var1","var3")
recode.function<- function(x){
  x=6-x
}
lapply(recode.list,recode.function,data=dat)
There's no need for an external function or for a package for this. Just use an anonymous function in lapply, like this:
df[recode.list] <- lapply(df[recode.list], function(x) 6-x)
Using [] lets us replace just those columns directly in the original dataset. This is needed because lapply on its own would return the result as a named list rather than updating the columns in place.
As noted in the comments, you can actually even skip lapply:
df[recode.list] <- 6 - df[recode.list]
You can use mapvalues from plyr.
require(plyr)
# if you just want to replace 5 with 1 and vice versa
df[, recode.list] <- sapply(df[, recode.list], mapvalues, c(1, 5), c(5,1))
# if you want to apply x=6-x to all values (in this case you don't need mapvalues)
df[, recode.list] <- sapply(df[, recode.list], mapvalues, 1:5, 5:1)
Here's an option to do this with dplyr:
recode.function<- function(x){
  x <- 6-x
}
recode.list <- c("var1","var3")
require(dplyr)
df %>% mutate_each_(funs(recode.function), recode.list)
# var1 var2 var3
#1 2 2 4
#2 3 3 3
#3 3 5 2
#4 3 3 2
#5 4 3 3
#6 5 4 1
#...
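mutate_each_() has since been deprecated in dplyr; a sketch of the same recode with current dplyr (>= 1.0) would be:
library(dplyr)
df %>% mutate(across(all_of(recode.list), ~ 6 - .x))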

Elements within lists.

I'm relatively new to R (~3 months), and so I'm just getting the hang of all the different data types. While lists are a super useful way of holding dissimilar data all in one place, they are also extremely inflexible for function calls, and riddle me with angst.
For the work I'm doing, I often use lists because I need to hold a bunch of vectors of different lengths. For example, I'm tracking performance statistics of about 10,000 different vehicles, and there are certain vehicles which are so similar they can essentially be treated as the same vehicles for certain analyses.
So let's say we have this list of vehicle ID's:
List <- list(a=1, b=c(2,3,4), c=5)
For simplicity's sake.
I want to do two things:
1. Tell me which element of a list a particular vehicle is in. So when I tell R I'm working with vehicle 2, it should tell me b or [2]. I feel like it should be something simple like how you can do
match(3, b)
# [1] 2
2. Convert it into a data frame or something similar so that it can be saved as a CSV. Unused rows could be blank or NA. What I've had to do so far is:
for(i in seq_along(List)) {
  length(List[[i]]) <- max(as.numeric(as.matrix(summary(List)[,1])))
}
DF <- as.data.frame(List)
Which seems dumb.
For your first question:
which(sapply(List, `%in%`, x = 3))
# b
# 2
For your second question, you could use a function like this one:
list.to.df <- function(arg.list) {
  max.len <- max(sapply(arg.list, length))
  arg.list <- lapply(arg.list, `length<-`, max.len)
  as.data.frame(arg.list)
}
list.to.df(List)
# a b c
# 1 1 2 5
# 2 NA 3 NA
# 3 NA 4 NA
Both of those tasks (and many others) would become much easier if you were to "flatten" your data into a data.frame. Here's one way to do that:
fun <- function(X)
  data.frame(element = X, vehicle = List[[X]], stringsAsFactors = FALSE)
df <- do.call(rbind, lapply(names(List), fun))
# element vehicle
# 1 a 1
# 2 b 2
# 3 b 3
# 4 b 4
# 5 c 5
With a data.frame in hand, here's how you could perform your two tasks:
## Task #1
with(df, element[match(3, vehicle)])
# [1] "b"
## Task #2
write.csv(df, file = "outfile.csv")
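As a side note (not from the original answers), utils::stack() builds much the same long data frame in one call, provided the list elements are atomic vectors; only the column names and order differ:
stack(List)
#   values ind
# 1      1   a
# 2      2   b
# 3      3   b
# 4      4   b
# 5      5   c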
