Most efficient way of avoiding loops to create data.frame - r

I have a data.frame which includes the runs scored in each innings of baseball games as a character vector.
I want to create a new data.frame which lists the number of runs in each innings for each game. I can do this with a loop but appreciate that this is too slow for any reasonable number of observations and that the rbind method shown is also not ideal.
The number of innings may vary, and an x indicates that the team did not need to bat in the 9th inning because the game was already won.
library(stringr)
data <- data.frame(gameID=c("a","b","c"),innings=c("002100000","30000000x","10101010101"))
for(i in 1:nrow(data)) {
box <- as.integer(str_split(data$innings[i], "")[[1]])
tempdf <- data.frame(box,id=data$gameID[i])
if(i!=1) {
df <- rbind(df,tempdf)
} else {
df <- tempdf
}
}

This helps a bit (30%):
res <- vector("list", nrow(data))
for(i in seq_along(res))
res[[i]] <- data.frame(box=as.integer(str_split(data$innings[i], "")[[1]]),
id=data$gameID[i])
do.call(rbind, res)

Not sure if this is faster:
library(splitstackshape)
data$innings <- gsub('', ' ', data$innings)
cSplit(data, 'innings', ' ', 'long')

Here's a way using lists with lapply:
library(dplyr) # for bind_rows -- you can also use do.call(rbind, list)
innings <- str_split(data$innings, "")
names(innings) <- data$gameID
innings <- lapply(innings, function(x) data.frame(box = x))
bind_rows(innings, .id = "id")

This should be pretty fast:
## Defined these separately just for readability
innings <- as.character(data$innings) # or use 'stringsAsFactors=FALSE' when defining the data frame
box <- unlist(strsplit(innings, ""))
id <- rep(data$gameID, nchar(innings))
## To get a character matrix back
cbind(box, id)
## To get a data frame back
data.frame(box=box, id=id, stringsAsFactors=FALSE)
Using a matrix is faster, but if you want mixed classes, use a data frame. Also, for a data frame, it's faster to use characters than factors (hence the stringsAsFactors=FALSE argument). If you want box to be numeric, you can wrap it in as.integer (but then the matrix option won't work, of course, and any "x" entry becomes NA with a warning).
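For what it's worth, here is a minimal sketch of the same vectorised idea with box converted to integer. The inning column and the names games and long are my own additions for illustration, not part of the question.

```r
# Toy data from the question; stringsAsFactors = FALSE keeps innings as character
games <- data.frame(gameID = c("a", "b", "c"),
                    innings = c("002100000", "30000000x", "10101010101"),
                    stringsAsFactors = FALSE)

n_innings <- nchar(games$innings)                 # innings recorded per game
runs_chr  <- unlist(strsplit(games$innings, ""))  # one character per inning

long <- data.frame(
  id     = rep(games$gameID, n_innings),  # repeat each game ID once per inning
  inning = sequence(n_innings),           # 1, 2, ... within each game (assumed extra column)
  # "x" (inning not batted) cannot be parsed as a number, so it becomes NA
  box    = suppressWarnings(as.integer(runs_chr)),
  stringsAsFactors = FALSE
)
```

Keeping the unplayed inning as NA rather than 0 lets sum(long$box, na.rm = TRUE) and similar summaries tell "no runs" apart from "did not bat".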

Related

create multiple smaller dataframes from a larger one by IDs in R

I have a big data frame of almost 2 million rows and 11 columns. I want to split it into multiple smaller data frames by filtering on the first two columns. Here is an example of the data.
investor  asset  price  col4  col5  etc.
44KL      TSLA
451L      F
4639L     AAPL
44KL      UBI
44KL      F
I want to create a new single dataframe for each investor paired with a single asset.
This means I want the investor '44KL' to be divided into three different dataframes called TSLA, UBI and F. And this must apply for all the investors I have in my dataset.
I've tried with a parallel approach by doing this:
I first used unique() on the database to create the 'investor_ids' and the 'asset_list'
then I tried with:
file_names <- investors %>%
dplyr::filter(investor %in% investor_ids) %>%
dplyr::filter(asset %in% asset_list) %>%
dplyr::arrange(investor) %>%
dplyr::mutate(name = stringr::str_c("INV", investor, asset, num_trx, stat, sep = "_")) %>%
purrr::pluck("name")
for_asset <- function(df) {
for(inv in investor){
for (ass in assets) {
df <- subset(df, subset = asset == ass)
}
}
}
# Parallel --------------------------------------------------------------
cl <- parallel::makeCluster(parallel::detectCores())
doParallel::registerDoParallel(cl)
tictoc::tic()
foreach::foreach(i = seq_along(file_names), .errorhandling = "pass") %dopar% {
df <- for_asset(db_test)
nm <- paste0("dev/test-data/investors-rdata-assetbased/", file_names[i], ".RData")
save(df, file = nm)
}
time <- tictoc::toc()
parallel::stopCluster(cl)
I end up with the correct number of data frames, but they all contain just NULL.
Can you help me?
I then want to move on to applying computations to the newly formed data frames, so I need something easy to use.
I tried split, but I get a list of lists that I don't know how to work with.
You can do this:
dfs = split(df, df[c(1,2)], drop=TRUE)
purrr::walk(names(dfs), function(d) {
readr::write_csv(dfs[[d]],paste0("dev/test-data/investors-rdata-assetbased/",d,".csv"))
})
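Since the question mentions not knowing how to work with what split returns, here is a small sketch of that. The trades data frame and its price values are made up for illustration.

```r
# Hypothetical toy data; the price values are invented for this example
trades <- data.frame(
  investor = c("44KL", "451L", "4639L", "44KL", "44KL"),
  asset    = c("TSLA", "F", "AAPL", "UBI", "F"),
  price    = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# split() returns a named list of data frames, one per investor/asset pair;
# drop = TRUE discards combinations that never occur in the data
pieces <- split(trades, trades[c("investor", "asset")], drop = TRUE)
names(pieces)  # names have the form "investor.asset"

# Access one piece by name
ubi_44kl <- pieces[["44KL.UBI"]]

# Apply the same computation to every piece
mean_price <- vapply(pieces, function(d) mean(d$price), numeric(1))
```

Each element of the list is an ordinary data frame, so anything you would do to df can go inside the function passed to lapply or vapply.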
A better option, by far, is to set your df to data.table i.e.
library(data.table)
setDT(df)
and then work with each investor/asset subgroup, using df[i,j,by=.(investor,asset)]
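A rough sketch of what that grouped form can look like, assuming the data.table package is installed. The trades table, its price values and the summary columns are hypothetical.

```r
library(data.table)

# Hypothetical toy data; the price values are invented for this example
trades <- data.table(
  investor = c("44KL", "451L", "4639L", "44KL", "44KL", "44KL"),
  asset    = c("TSLA", "F", "AAPL", "UBI", "F", "TSLA"),
  price    = c(10, 20, 30, 40, 50, 30)
)

# One summary row per investor/asset pair, computed without splitting the table
summary_by_pair <- trades[, .(n_trades = .N, mean_price = mean(price)),
                          by = .(investor, asset)]

# A single subgroup can still be pulled out on demand
tsla_44kl <- trades[investor == "44KL" & asset == "TSLA"]
```

The point of this design is that the per-group data frames never have to exist as separate objects or files; the grouping is expressed in by and the computation in j.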

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
Thank you!
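You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE (the default before R 4.0.0), the first iteration of the loop converts the string variables into factors whose only levels are the values of that first row, so later rows introduce values that are not valid levels and become NA.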
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
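If you prefer to keep the "one row per name" structure of the loop, a possible sketch is to build each row as a one-row data frame and bind once at the end. The names rows and stats are mine.

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# One single-row data frame per list element; the name stays character
# and the statistics stay numeric because they are never mixed in one vector
rows <- lapply(names(x), function(name) {
  data.frame(name = name, m = mean(x[[name]]), s = sum(x[[name]]),
             stringsAsFactors = FALSE)
})

# Bind once at the end instead of growing the data frame inside the loop
stats <- do.call(rbind, rows)
```

Unlike c(name, m, s), which coerces everything to character, this keeps m and s numeric, and the name column ties every row back to its list element.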

Building forvalues loops in R

[Working with R 3.2.2]
I have three data frames with the same variables. I need to modify the value of some variables and change the name of the variables (rename the columns). Instead of doing this data frame by data frame, I would like to use a loop.
This is the code I want to run:
#Change the values of the variables
vlist <- c("var1", "var2", "var3")
dataframe0[,vlist] <- dataframe0[,vlist]/10
dataframe1[,vlist] <- dataframe1[,vlist]/10
dataframe2[,vlist] <- dataframe2[,vlist]/10
#Change the name of the variables
colnames(dataframe0)[colnames(dataframe0)=="var1"] <- "temp_min"
colnames(dataframe0)[colnames(dataframe0)=="var2"] <- "temp_max"
colnames(dataframe0)[colnames(dataframe0)=="var3"] <- "prep"
colnames(dataframe1)[colnames(dataframe1)=="var1"] <- "temp_min"
colnames(dataframe1)[colnames(dataframe1)=="var2"] <- "temp_max"
colnames(dataframe1)[colnames(dataframe1)=="var3"] <- "prep"
colnames(dataframe2)[colnames(dataframe2)=="var1"] <- "temp_min"
colnames(dataframe2)[colnames(dataframe2)=="var2"] <- "temp_max"
colnames(dataframe2)[colnames(dataframe2)=="var3"] <- "prep"
I know the logic to do it with programs like Stata, with a forvalues loop:
#Change the values of the variables
forvalues i=0/2 {
dataframe`i'[,vlist] <- dataframe`i'[,vlist]/10
#Change the name of the variables
colnames(dataframe`i')[colnames(dataframe`i')=="var1"] <- "temp_min"
colnames(dataframe`i')[colnames(dataframe`i')=="var2"] <- "temp_max"
colnames(dataframe`i')[colnames(dataframe`i')=="var3"] <- "prep"
}
But, I am not able to reproduce it in R. How should I proceed? Thanks in advance!
I would work with a list of data frames; you can still 'split' it afterwards if really needed:
df1 <- data.frame("id"=1:10,"var1"=11:20,"var2"=11:20,"var3"=11:20,"test"=1:10)
df2 <- df1
df3 <- df1
dflist <- list(df1,df2,df3)
for (i in seq_along(dflist)) {
dflist[[i]]['test'] <- dflist[[i]]['test']/10
colnames( dflist[[i]] )[ colnames(dflist[[i]]) %in% c('var1','var2','var3') ] <- c('temp_min','temp_max','prep')
# optionally reassign df1-3 from their list values:
# assign(paste0("df",i),dflist[[i]])
}
The advantage of using a list is that you can access the data frames more easily in a programmatic way.
I reduced your three colnames calls per data frame to one: colnames returns a vector, so you can subset it and replace all three names in one pass. This assumes var1 to var3 always appear in the same order.
Addendum: if you want a single dataset at end you can use do.call(rbind,dflist) or with data.table package rbindlist(dflist).
More details on working with list of data.frames in Gregor's answer here
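As a variation, here is a sketch of the same idea with lapply, applied to the var1 to var3 columns from the question and renaming by name rather than by position. The helper rescale_and_rename and the lookup vector new_names are hypothetical names of mine.

```r
df1 <- data.frame(id = 1:10, var1 = 11:20, var2 = 11:20, var3 = 11:20)
dflist <- list(dataframe0 = df1, dataframe1 = df1, dataframe2 = df1)

vlist     <- c("var1", "var2", "var3")
new_names <- c(var1 = "temp_min", var2 = "temp_max", var3 = "prep")

# Hypothetical helper: rescale the measurement columns, then rename them
rescale_and_rename <- function(d) {
  d[vlist] <- d[vlist] / 10                      # divide var1..var3 by 10
  hit <- match(vlist, colnames(d))               # positions of var1..var3, whatever their order
  colnames(d)[hit] <- unname(new_names[vlist])   # rename by name, not by position
  d
}

dflist <- lapply(dflist, rescale_and_rename)
```

Because the renaming goes through match, it does not depend on var1 to var3 being stored in any particular column order.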

How can I make a tibble/tbl_df/data_frame from a vector or vectors

I have a name and a vector
my.name <- 'data.values'
my.vec <- 1:5
and I'd like to make a tibble/tbl_df/data_frame with one column that has my.name as the name of that column and my.vec as the values. What I have is
df <- data_frame(placeholder = rep(NA, length(my.vec)))
df[[my.name]] <- my.vec
df[['placeholder']] <- NULL
Which just feels silly. Is there an easier way to do this?
I am also interested in the case where I have multiple vectors and multiple names, e.g.
my.name1 <- 'data.values.day1'
my.name2 <- 'data.values.day2'
my.vec1 <- 1:5
my.vec2 <- 2:6
...
I think the best answer came in a comment.
DirtySockSniffer recommended:
as_data_frame(setNames(list(my.vec), my.name))
which generalizes nicely to the multiple column situation
as_data_frame(setNames(list(my.vec1, my.vec2),
c(my.name1, my.name2)))
You can create a data_frame first and then set its column names:
my.data <- data_frame(my.vec.1, my.vec.2, ...)
names(my.data) <- c(my.name.1, my.name.2, ...) # Order is important here
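For any number of vectors, the setNames idea can be sketched with the names and vectors held in two parallel objects. This version uses base R, with the tibble conversion guarded in case the package is not installed; my.names and my.vecs are my own names.

```r
my.names <- c("data.values.day1", "data.values.day2")
my.vecs  <- list(1:5, 2:6)

# setNames() attaches the column names to the list before conversion
df <- as.data.frame(setNames(my.vecs, my.names), stringsAsFactors = FALSE)

# If the tibble package is installed, the same named list converts directly
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(setNames(my.vecs, my.names))
}
```

Note that in current versions of tibble, data_frame() and as_data_frame() are deprecated in favour of tibble() and as_tibble().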

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR","REF.ALT"," FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (a dat2$VAR value falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- range[1]
end <- range[2]
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
Please try this out and let us know how it works. Without a larger data set it is a bit hard to troubleshoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match).
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# both ends of the range are inclusive, so a single-position range can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
MERGE
# create a column in dat to merge-by
dat <- cbind(dat, VAR=NA)
# use the VAR in dat2 as the merge id
# note: a row of dat that matches several VARs keeps only the last one
for (i in seq(ncol(matches)))
dat$VAR[matches[, i]] <- dat2[i, "VAR"]
merge(dat, dat2)
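An alternative sketch in base R that also checks the chromosome and keeps every range a position falls into. The toy data below is hypothetical (I changed one VAR so that something actually matches), and the names ranges, vars, pairs and merged are mine.

```r
# Hypothetical toy data in the question's format, chosen so one position
# falls inside a range
ranges <- data.frame(Locus = c("ACTC1-001_1", "ACTC1-001_2"),
                     Pos   = c("chr15:35087734..35087734",
                               "chr15:35086890..35086919"),
                     stringsAsFactors = FALSE)
vars <- data.frame(VAR     = c("chr15:35086900", "chr1:116242855"),
                   REF.ALT = c("T/A", "A/G"),
                   stringsAsFactors = FALSE)

# Parse "chr:start..end" into chromosome, start and end
ranges$chr   <- sub(":.*$", "", ranges$Pos)
ranges$start <- as.numeric(sub("^.*:([0-9]+)\\.\\..*$", "\\1", ranges$Pos))
ranges$end   <- as.numeric(sub("^.*\\.\\.", "", ranges$Pos))

# Parse "chr:position"
vars$chr <- sub(":.*$", "", vars$VAR)
vars$pos <- as.numeric(sub("^.*:", "", vars$VAR))

# Pair every variant with every range on the same chromosome, then keep
# only the pairs where the position lies inside the range (inclusive)
pairs  <- merge(ranges, vars, by = "chr")
merged <- pairs[pairs$pos >= pairs$start & pairs$pos <= pairs$end, ]
```

The intermediate pairs table grows with ranges times variants per chromosome, so like the other answers this is only reasonable for modest data sizes.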
