Apply R function to multiple objects and rewrite object - r

I'm trying to do the following:
define a function which creates an additional column based on existing columns in a data frame
apply said function to multiple objects (data frames), rewriting the original data frame
For example, say the function is to divide the Petal.Length by Petal.Width in iris.
divvy <- function(mydataframe){mydataframe$divvy <- mydataframe$Petal.Length/mydataframe$Petal.Width; mydataframe}
This part is easy.
Now imagine I have three (or three thousand) iris dataframes:
iris2 <- iris
iris4 <- iris
iris5 <- iris
What I am trying to avoid is this:
iris <- divvy(iris)
iris2 <- divvy(iris2)
iris4 <- divvy(iris4)
iris5 <- divvy(iris5)
times infinity for the number of iris data frames that I have
... with something along the lines of
lapply(list(iris,iris2,iris4,iris5), function(x) divvy(x))
And end up with iris, iris2, iris4, and iris5 having the new divvy column. How do I do this?
Please note: I do NOT want to create a meta-object that has all of the irises within it.

We could use data.table to do this:
library(data.table)
divvy <- function(x){x[,divvy := Petal.Length/Petal.Width]}
iris2 <- data.table(iris)
iris4 <- data.table(iris)
iris5 <- data.table(iris)
test <- lapply(list(iris2,iris4,iris5), function(x) divvy(x))
Where test looks like this (just showing the first 2 elements of the list):
> test
[[1]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species divvy
1: 5.1 3.5 1.4 0.2 setosa 7.000000
2: 4.9 3.0 1.4 0.2 setosa 7.000000
3: 4.7 3.2 1.3 0.2 setosa 6.500000
4: 4.6 3.1 1.5 0.2 setosa 7.500000
5: 5.0 3.6 1.4 0.2 setosa 7.000000
---
146: 6.7 3.0 5.2 2.3 virginica 2.260870
147: 6.3 2.5 5.0 1.9 virginica 2.631579
148: 6.5 3.0 5.2 2.0 virginica 2.600000
149: 6.2 3.4 5.4 2.3 virginica 2.347826
150: 5.9 3.0 5.1 1.8 virginica 2.833333
[[2]]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species divvy
1: 5.1 3.5 1.4 0.2 setosa 7.000000
2: 4.9 3.0 1.4 0.2 setosa 7.000000
3: 4.7 3.2 1.3 0.2 setosa 6.500000
4: 4.6 3.1 1.5 0.2 setosa 7.500000
5: 5.0 3.6 1.4 0.2 setosa 7.000000
---
146: 6.7 3.0 5.2 2.3 virginica 2.260870
147: 6.3 2.5 5.0 1.9 virginica 2.631579
148: 6.5 3.0 5.2 2.0 virginica 2.600000
149: 6.2 3.4 5.4 2.3 virginica 2.347826
150: 5.9 3.0 5.1 1.8 virginica 2.833333
EDIT: In response to the OP's updated question specs:
You could try this:
for(i in c("iris2", "iris4", "iris5")){
x <- divvy(get(i))
assign(paste0(i,"divvied"), x)
}
That said, I'd recommend against assign, especially for a large number of objects. You could instead extract the elements from the test list I made in the first half of the answer; you'd get the same result with cleaner code and less clutter in the workspace.
The loop takes the name of each iris data.table as a string and fetches the object with get. The object is passed to your divvy function, creating a data.table x. assign then creates a new data.table whose name carries the suffix divvied.
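If the goal is to overwrite the original objects rather than create copies with a new suffix, the same get/assign idea can be sketched in base R with mget and list2env. This is only a minimal sketch and not part of the original answer: it assumes divvy returns the whole data frame, and the names nms and env are introduced here purely for illustration. The list it builds is temporary and is never stored.

```r
# Assumption: divvy() returns the modified data frame, not just the new column.
divvy <- function(df) {
  df$divvy <- df$Petal.Length / df$Petal.Width
  df
}

iris2 <- iris
iris4 <- iris
iris5 <- iris

nms <- c("iris2", "iris4", "iris5")  # names of the objects to rewrite
env <- environment()                 # the environment that holds them

# mget() collects the objects into a named list, lapply() applies divvy(),
# and list2env() writes each result back under its original name.
invisible(list2env(lapply(mget(nms, envir = env), divvy), envir = env))

names(iris2)
```

For thousands of objects, nms could be built with ls(pattern = ...) instead of being typed out, although keeping the data frames in a list from the start is usually easier to manage.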

Related

How can I subset a dataframe by nrow and groups in R?

I have a dataframe that contains 240,000 obs. of 7 variables. In the dataframe there are 100 groups of 2400 records each, by Symbol. Example:
Complete DataFrame
I want to reduce this dataframe to a new dataframe that keeps one observation out of every 240. The new dataframe will have 1000 obs. of 7 variables:
New DataFrame
I tried df[seq(1, nrow(df), 240), ], but that picks every 240th row of the whole dataframe without regard to group (Symbol). What I want is a new dataframe that contains rows 240, 480, 720, 960, and so on, within each symbol. In the original data frame every symbol has 2400 obs., so the new dataframe will have 10 obs. per group.
Since we don't have your data, we can use a built-in R dataset: iris. In this example we split iris by Species and select the first n rows of each group using head; here n = 5 extracts the first 5 rows per Species.
> split_data <- lapply(split(iris, iris$Species), head, n=5)
> do.call(rbind, split_data)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
setosa.1 5.1 3.5 1.4 0.2 setosa
setosa.2 4.9 3.0 1.4 0.2 setosa
setosa.3 4.7 3.2 1.3 0.2 setosa
setosa.4 4.6 3.1 1.5 0.2 setosa
setosa.5 5.0 3.6 1.4 0.2 setosa
versicolor.51 7.0 3.2 4.7 1.4 versicolor
versicolor.52 6.4 3.2 4.5 1.5 versicolor
versicolor.53 6.9 3.1 4.9 1.5 versicolor
versicolor.54 5.5 2.3 4.0 1.3 versicolor
versicolor.55 6.5 2.8 4.6 1.5 versicolor
virginica.101 6.3 3.3 6.0 2.5 virginica
virginica.102 5.8 2.7 5.1 1.9 virginica
virginica.103 7.1 3.0 5.9 2.1 virginica
virginica.104 6.3 2.9 5.6 1.8 virginica
virginica.105 6.5 3.0 5.8 2.2 virginica
>
Update
Given your comment, try this using your data.frame:
ind <- seq(from=240, to=2400, by=240) # row index within each group of 2400 rows, length = 10
split_data <- lapply(split(yourData, yourData$Symbol), function(x) x[ind,] )
do.call(rbind, split_data)
Here is one way using base R.
Just like in the answer by @Jilber Urbina, I will give an example using the built-in dataset iris.
fun <- function(DF, n = 240, start = n){
DF[seq(start, NROW(DF), by = n), ]
}
res <- lapply(split(iris, iris$Species), fun, n = 24)
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.3 1.7 0.5 setosa
#2 4.6 3.2 1.4 0.2 setosa
#3 6.1 2.8 4.7 1.2 versicolor
#4 6.2 2.9 4.3 1.3 versicolor
#5 6.3 2.7 4.9 1.8 virginica
#6 6.5 3.0 5.2 2.0 virginica
This can be wrapped into a function, which I named selectStepN.
#
# x - dataset to subset
# f - a factor, split criterion
# n - the step
#
selectStepN <- function(x, f, n = 240, start = n){
fun <- function(DF, n){
DF[seq(start, NROW(DF), by = n), ]
}
res <- lapply(split(x, f), fun, n = n)
res <- do.call(rbind, res)
row.names(res) <- NULL
res
}
selectStepN(iris, iris$Species, 24)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.3 1.7 0.5 setosa
#2 4.6 3.2 1.4 0.2 setosa
#3 6.1 2.8 4.7 1.2 versicolor
#4 6.2 2.9 4.3 1.3 versicolor
#5 6.3 2.7 4.9 1.8 virginica
#6 6.5 3.0 5.2 2.0 virginica
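The same selection can also be sketched without split and rbind, by computing each row's position inside its group and keeping the positions that are multiples of the step. This is an alternative illustration rather than either answerer's method; pos and res are names chosen here.

```r
# Position of each row within its Species group (1, 2, 3, ... per group).
pos <- ave(seq_len(nrow(iris)), iris$Species, FUN = seq_along)

# Keep every 24th row of each group; iris has 50 rows per species.
res <- iris[pos %% 24 == 0, ]
rownames(res) <- NULL
res
```

Because the filter works on the original row order, the data does not need to be sorted by group first.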

Rbind-ing data.tables with NA values

I have a big data.table with about 40 columns, and I need to add a record for which I only have 3 of the 40 columns (the rest will be just NA). To make a reproducible example:
require(data.table)
data(iris)
setDT(iris)
# this works (and is the expected result):
rbind(iris, list(6, NA, NA, NA, "test"))
The problem is I have 37+ empty columns (the data I want to input is in the 1st, 2nd and 37th columns of the variable). So, I need to rep some of the NAs. But if I try:
rbind(iris, list(6, rep(NA, 3), "test"))
It won't work (sizes are different). I could do
rbind(iris, list(c(6, rep(NA, 3), "test")))
But it will (obviously) coerce the whole first column to char. I've tried unlisting the list, inverting the list(c( sequence (it only accepts lists), and haven't found anything yet.
Please note that this is not a duplicate of the (several) posts about rbind-ing data.tables, as I'm able to do that. What I haven't been able to do is maintain proper column classes while doing it with rep(NA, x).
You can do...
rbind(data.table(iris), c(list(6), rep(NA, 3), list("test")))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica
151: 6.0 NA NA NA test
Because the first argument to c() is a list, c() returns a list: each NA from rep(NA, n) becomes its own list element and nothing is coerced to character. I wrapped iris in data.table() so rbindlist is used instead of rbind.data.frame and "test" is treated as a new factor level instead of an invalid level.
I think there are better ways to go, though, like...
newrow = setDT(iris[NA_integer_, ])
newrow[, `:=`(Sepal.Length = 6, Species = factor("test")) ]
rbind(data.table(iris), newrow)
# or
rbind(data.table(iris), list(Sepal.Length = 6, Species = "test"), fill=TRUE)
These approaches are clearer and don't require fiddling with column counting.
I prefer the newrow way, since it leaves a table I can inspect to review the data transformation.
We can use replicate
rbind(iris, c(6, replicate(3, NA, simplify = FALSE), "test"))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1: 5.1 3.5 1.4 0.2 setosa
# 2: 4.9 3.0 1.4 0.2 setosa
# 3: 4.7 3.2 1.3 0.2 setosa
# 4: 4.6 3.1 1.5 0.2 setosa
# 5: 5.0 3.6 1.4 0.2 setosa
# ---
#147: 6.3 2.5 5.0 1.9 virginica
#148: 6.5 3.0 5.2 2.0 virginica
#149: 6.2 3.4 5.4 2.3 virginica
#150: 5.9 3.0 5.1 1.8 virginica
#151: 6.0 NA NA NA test
Or, as @Frank commented:
rbind(iris, c(6, as.list(rep(NA, 3)), "test"))
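The reason these variants keep the column classes can be checked in base R alone. The sketch below, with illustrative names flat, row1 and row2, compares what c() returns for atomic arguments with what it returns when at least one argument is a list.

```r
# c() on atomic values coerces everything to one type: here, character.
flat <- c(6, rep(NA, 3), "test")
class(flat)

# If any argument is a list, c() returns a list and each value keeps its type.
row1 <- c(list(6), rep(NA, 3), list("test"))
row2 <- c(6, as.list(rep(NA, 3)), "test")

sapply(row1, class)
identical(row1, row2)
```

A list with one element per column is what rbind needs in order to fill each column with a value of a compatible type.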

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted like so:
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse header rows and dispense with sum rows so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack together a tailored solution to this, but I'm wondering whether there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo csv dataset (can't assume an equal number of lines per group..), although I'm hoping the solution can transpose to *.xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There is a variety of ways to present spreadsheets so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded in R. Here's an example with your data. It uses the function na.locf from package zoo.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a function which splits at an index.
split_at <- function(x, index) {
N <- NROW(x)
s <- cumsum(seq_len(N) %in% index)
unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
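library(magrittr) # provides %>%
split_at(iris, index) %>%
.[2:length(.)] %>% # drop the rows before the first group header
purrr::map_df(function(x) {
Species <- x[1,1] # first row of each piece is the group header
x <- x[grepl("^[0-9]+$", x[,1]),] # keep numbered data rows; drops header, Mean and blank rows
data.frame(x, Species = Species)
})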
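Both answers come down to the same idea: mark the sub-header rows, carry each header down to the rows beneath it, then keep only the numbered data rows. A dependency-free sketch of that idea is shown below. It is not taken from either answer, it assumes that every data row has an integer in the first column and that sum rows are labelled "Mean", and the names id, is_data, is_header, labels and out are introduced here.

```r
txt <- paste(
  ",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width",
  ",,,,",
  "Setosa,,,,",
  "1,5.1,3.5,1.4,0.2",
  "2,4.9,3,1.4,0.2",
  "3,4.7,3.2,1.3,0.2",
  "Mean,4.9,3.23,1.37,0.2",
  ",,,,",
  "Versicolor,,,,",
  "1,7,3.2,4.7,1.4",
  "2,6.4,3.2,4.5,1.5",
  "3,6.9,3.1,4.9,1.5",
  "Mean,6.77,3.17,4.7,1.47",
  ",,,,",
  "Virginica,,,,",
  "1,6.3,3.3,6,2.5",
  "2,5.8,2.7,5.1,1.9",
  "3,7.1,3,5.9,2.1",
  "Mean,6.4,3,5.67,2.17",
  sep = "\n"
)
x <- read.csv(text = txt, header = TRUE, stringsAsFactors = FALSE)

id <- trimws(x[[1]])
is_data   <- grepl("^[0-9]+$", id)                 # numbered observation rows
is_header <- nzchar(id) & !is_data & id != "Mean"  # group sub-header rows

# Label every row with the most recent sub-header seen above it.
# The leading NA covers rows that appear before the first sub-header.
labels  <- c(NA, id[is_header])
species <- labels[cumsum(is_header) + 1]

out <- data.frame(x[is_data, -1], Species = species[is_data])
rownames(out) <- NULL
out
```

Since groups are detected from the rows themselves, this does not rely on an equal number of lines per group.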

In R, why does is.na cause data.table to display the data.table as output? Version 1.9.4

The data.table package (which is amazingly useful) still prints the data.table output in the following scenario. Is this a known issue? It seems to occur when is.na is used.
Earlier Posting for Reference
di <- data.table(iris)
di[is.na(Sepal.Length),Color := "Blue"]
packageVersion("data.table")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1: 5.1 3.5 1.4 0.2 setosa
2: 4.9 3.0 1.4 0.2 setosa
3: 4.7 3.2 1.3 0.2 setosa
4: 4.6 3.1 1.5 0.2 setosa
5: 5.0 3.6 1.4 0.2 setosa
---
146: 6.7 3.0 5.2 2.3 virginica
147: 6.3 2.5 5.0 1.9 virginica
148: 6.5 3.0 5.2 2.0 virginica
149: 6.2 3.4 5.4 2.3 virginica
150: 5.9 3.0 5.1 1.8 virginica
> packageVersion("data.table")
[1] ‘1.9.4’
6/14/2015 Edit:
Thanks for the responses. Indeed it seems that the issue is that no records meet the criteria, whereas my is.na example is just an example of the general issue. To confirm, this line also causes the data.table to display:
di[Sepal.Length > 100,Color := "Blue"]
By the way, even if the column already exists, the data.table still gets displayed when no records are found. Like so:
d2 <- data.table(iris)
d2[,Clr := NA]
d2[Sepal.Length > 100, Clr := "Blue"]
Sounds like the authorities are already aware of this and have it tackled. I can work around the issue in the meantime.

splitting a data.table, then modifying by reference

I have a use-case where I need to split a data.table, then apply different modify-by-reference operations to each partition. However, splitting forces copying of each table.
Here's a toy example on the iris dataset:
#split the data
DT <- data.table(iris)
out <- split(DT, DT$Species)
#assign partitions to global environment
NAMES <- as.character(unique(DT$Species))
lapply(seq_along(out), function(x) {
assign(NAMES[x], out[[x]], envir=.GlobalEnv)})
#modify by reference, same function applied to different columns for different partitions
#would do this programmatically in real use case
virginica[ ,summ:=sum(Petal.Length)]
setosa[ ,summ:=sum(Petal.Width)]
#rbind all (again, programmatic)
do.call(rbind, list(virginica, setosa))
Then I get the following warning:
Warning message:
In `[.data.table`(out$virginica, , `:=`(cumPedal, cumsum(Petal.Width))) :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference.
I know this is related to putting data.tables in lists. Is there any workaround for this use case, or a way to avoid using split? Note that in the real case, I want to modify by reference programmatically, so hardcoding a solution won't work.
Here's an example of using .EACHI to achieve what it sounds like you're trying to do:
## Create a data.table that indicates the pairs of keys to columns
New <- data.table(
Species = c("virginica", "setosa", "versicolor"),
FunCol = c("Petal.Length", "Petal.Width", "Sepal.Length"))
## Set the key of your original data.table
setkey(DT, Species)
## Now use .EACHI
DT[New, temp := cumsum(get(FunCol)), by = .EACHI][]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species temp
# 1: 5.1 3.5 1.4 0.2 setosa 0.2
# 2: 4.9 3.0 1.4 0.2 setosa 0.4
# 3: 4.7 3.2 1.3 0.2 setosa 0.6
# 4: 4.6 3.1 1.5 0.2 setosa 0.8
# 5: 5.0 3.6 1.4 0.2 setosa 1.0
# ---
# 146: 6.7 3.0 5.2 2.3 virginica 256.9
# 147: 6.3 2.5 5.0 1.9 virginica 261.9
# 148: 6.5 3.0 5.2 2.0 virginica 267.1
# 149: 6.2 3.4 5.4 2.3 virginica 272.5
# 150: 5.9 3.0 5.1 1.8 virginica 277.6
## Basic verification
head(cumsum(DT["setosa", ]$Petal.Width), 5)
# [1] 0.2 0.4 0.6 0.8 1.0
tail(cumsum(DT["virginica", ]$Petal.Length), 5)
# [1] 256.9 261.9 267.1 272.5 277.6
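To make the logic of the lookup-table approach easier to follow, here is a plain data.frame sketch of the same per-group computation. Unlike the data.table version it does not modify by reference (base R copies on modification), so it only illustrates the mapping from group to column; lookup, df and rows are names chosen here.

```r
# Which column to accumulate for each species (mirrors the New table above).
lookup <- c(virginica  = "Petal.Length",
            setosa     = "Petal.Width",
            versicolor = "Sepal.Length")

df <- iris
df$temp <- NA_real_

for (sp in names(lookup)) {
  rows <- df$Species == sp
  # Cumulative sum of the column assigned to this species, within the group.
  df$temp[rows] <- cumsum(df[[lookup[[sp]]]][rows])
}

head(df$temp[df$Species == "setosa"], 5)
tail(df$temp[df$Species == "virginica"], 5)
```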
