Speeding up an R for loop to paste multiple variables together

I'm new here but could use some help. I have a list of data frames, and for each element within my list (i.e., data.frame) I want to quickly paste one column in a data set to multiple other columns in the same data set, separated only by a period (".").
So if I have one set of data in a list of data frames:
list1[[1]]
A B C
2 1 5
4 2 2
Then I want the following result:
list1[[1]]
A B C
2.5 1.5 5
4.2 2.2 2
Where C is pasted to A and B individually. I then want this operation to take place for each data frame in my list.
I have tried the following:
pasteX<-function(df) {for (i in 1:dim(df)[2]-1) {
df[,i]<-as.numeric(sprintf("%s.%s", df[,i], df$C))
}
return(df)}
list2<-lapply(list1, pasteX)
But this approach is very slow for larger matrices and lists. Any recommendations for making this code faster? Thanks!

Assuming every value is a non-negative integer less than 10, you can do this arithmetically instead of pasting strings:
lapply(list1, function(x){
x[,-3] <- x[,-3] + x[,3]/10
x})
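If C can have more than one digit, the same arithmetic idea might be extended by dividing by a power of ten that matches the number of digits in C. This is only a sketch: paste_decimal is a hypothetical helper name, the second data frame is made-up example data, and it assumes all values are non-negative whole numbers.

```r
# Hypothetical helper: put the digits of `frac` after the decimal point of `whole`.
# Assumes both arguments hold non-negative whole numbers.
paste_decimal <- function(whole, frac) {
  # Count digits via "%d" so that large values are not printed in scientific notation
  digits <- nchar(sprintf("%d", as.integer(frac)))
  whole + frac / 10^digits
}

# Example data: the first data frame is from the question, the second is made up
list1 <- list(
  data.frame(A = c(2, 4), B = c(1, 2), C = c(5, 2)),
  data.frame(A = c(7, 3), B = c(0, 9), C = c(12, 4))
)

list2 <- lapply(list1, function(x) {
  # Apply the helper to every column except the third, always using column 3 as the fraction
  x[-3] <- lapply(x[-3], paste_decimal, frac = x[[3]])
  x
})
```

Because this is floating-point arithmetic, compare the results against the pasted version with all.equal() rather than ==.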

We can use Map
list1[[1]][-3] <- Map(function(x, y) as.numeric(sprintf('%s.%s', x, y)),
list1[[1]][-3], list1[[1]][3])
If there are many datasets, loop over them with lapply: convert the first two columns to a matrix, paste them with the third column, assign the result back, and return the dataset
lapply(list1, function(x) {
x[1:2] <- as.numeric(sprintf('%s.%s', as.matrix(x[1:2]), x[,3]));
x })
#[[1]]
# A B C
#1 2.5 1.5 5
#2 4.2 2.2 2
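The single sprintf call lines up correctly because the matrix is read column by column and the shorter third column is recycled along it, so each value meets the C value from its own row. A minimal sketch of that alignment, using the question's example data (left, right and pasted are just illustrative names):

```r
x <- data.frame(A = c(2, 4), B = c(1, 2), C = c(5, 2))

# The matrix is flattened column by column: A[1], A[2], B[1], B[2]
left <- as.vector(as.matrix(x[1:2]))

# Spelling out the recycling that sprintf does implicitly: C[1], C[2], C[1], C[2]
right <- rep(x[, 3], times = 2)

pasted <- sprintf('%s.%s', left, right)
pasted
```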
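Or using tidyverse (funs() is deprecated and mutate_at() is superseded in current dplyr, so this uses across())
library(tidyverse)
map(list1, function(df) mutate(df,
across(1:2, ~ as.numeric(sprintf('%s.%s', .x, C)))))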
Or with data.table
library(data.table)
lapply(list1, function(x) setDT(x)[, (1:2) :=
lapply(.SD, function(x) as.numeric(sprintf('%s.%s', x, C))) ,
.SDcols = 1:2][])

Try this:
df <- data.frame(a = c(1,2,3), b = c(3,2,1), c = c(2,1,1))
pastex <- function(x){
# use the argument x, not the global df, so each list element is processed
m <- sapply(x[,1:2], function(col) as.numeric(paste(col, x$c, sep = '.')))
m <- as.data.frame(m)
m <- cbind(m, x["c"])
return(m)
}
mylist <- list(df1 = df, df2 = df)
lapply(mylist, pastex)

Related

Function to transform lists of different lengths into data frame utilizing recycling

I'm trying to create a data frame from multiple arrays, where the number of rows equals the length of the longest array in the list. The other arrays in the list then recycle their elements until they reach that length. I have to do this in a very specific way, using a separate function for each step.
DF <- function(x) {
maxLength <- listMax(x)
newList <- listExtend(x,maxLength)
finalList <- data.frame(newList)
print(finalList)
}
The data frame in my function doesn't work because of the uneven numbers, which I assume stems from newList in the DF function. I can use a single loop, sapply(), cbind() or rbind() to transform the vectors and put them in data frame but each attempt has either resulted in all 1s or other egregious issues.
The issue is you need to apply listExtend to each vector of fullList in the DF function.
x <- c(1:7)
y <- c("a","b","c","d","e")
z <- c(TRUE,TRUE,FALSE,FALSE)
fullList <- list(x,y,z)
DF <- function(x) {
maxLength <- max(lengths(x))
newList <- lapply(x, function(l) rep(l, length.out = maxLength))
finalDF <- data.frame(newList)
return(finalDF)
}
outDF <- DF(fullList)
colnames(outDF) <- c('x', 'y', 'z')
outDF
#------
x y z
1 1 a TRUE
2 2 b TRUE
3 3 c FALSE
4 4 d FALSE
5 5 e TRUE
6 6 a TRUE
7 7 b FALSE
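If the input list is named, the column names carry through and the separate colnames() step is not needed. A small variant of the function above, as a sketch (recycle_to_df and full_list are hypothetical names used only here):

```r
# Recycle every vector in a list to the length of the longest one,
# then combine the results into a data frame
recycle_to_df <- function(lst) {
  max_len <- max(lengths(lst))
  # rep(..., length.out = ) repeats the elements until max_len is reached
  extended <- lapply(lst, rep, length.out = max_len)
  data.frame(extended, stringsAsFactors = FALSE)
}

full_list <- list(
  x = 1:7,
  y = c("a", "b", "c", "d", "e"),
  z = c(TRUE, TRUE, FALSE, FALSE)
)

out_df <- recycle_to_df(full_list)
out_df
```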

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...): replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for(i in dfs){
for(y in tax.dfs){
i[,1] <- y[,2]
}}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows. I actually used dfs to create tax.dfs by applying the tnrs_match_names function from the rotl package.
Any suggestions on how to fix this error, or on another way to do what I need, will be greatly appreciated.
Thank You!
For what it is worth, to iterate over two objects simultaneously, the following works:
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"
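As for why the original nested loop fails: it pairs every element of dfs with every element of tax.dfs, so the first data frame is also matched against the taxonomy tables of the other groups, which is most likely where the row-count mismatch comes from. On top of that, the loop variable i is a copy, so assigning into it never changes the list. If you prefer a plain loop, iterating over the index avoids both problems. A sketch with made-up placeholder data (the column names and species labels are hypothetical):

```r
# Two toy "data" frames and their matching "taxonomy" frames (placeholder values)
dfs <- list(
  data.frame(name = c("sp1", "sp2"), mass = c(10, 20), stringsAsFactors = FALSE),
  data.frame(name = c("sp3", "sp4", "sp5"), mass = c(1, 2, 3), stringsAsFactors = FALSE)
)
tax.dfs <- list(
  data.frame(id = 1:2, unique_name = c("Species A", "Species B"),
             stringsAsFactors = FALSE),
  data.frame(id = 1:3, unique_name = c("Species C", "Species D", "Species E"),
             stringsAsFactors = FALSE)
)

for (k in seq_along(dfs)) {
  # Pair the k-th data frame only with the k-th taxonomy frame
  stopifnot(nrow(dfs[[k]]) == nrow(tax.dfs[[k]]))
  # Assign into the list element itself, not into a copy
  dfs[[k]][, 1] <- tax.dfs[[k]][, 2]
}
dfs
```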

Counting function in R

I have a dataset like this
id <- 1:12
b <- c(0,0,1,2,0,1,1,2,2,0,2,2)
c <- rep(NA,3)
d <- rep(NA,3)
df <-data.frame(id,b)
newdf <- data.frame(c,d)
I want to do some simple counting: how many values in this dataset are equal to 1 and how many are equal to 2. But I don't want to count over the whole dataset; I want my function to count them in consecutive blocks of four rows.
I want a result like this:
> newdf
one two
1 1 1
2 2 1
3 0 3
I tried this with lots of variations but couldn't get it to work.
afonk <- function(x) {
ifelse(x==1 | x==2, x, newdf <- (x[1]+x[2]))
}
afonk(newdf$one)
lapply(newdf, afonk)
Thanks in advance!
ismail
Fun with base R:
# counting function
countnum <- function(x,num){
sum(x == num)
}
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
dfl <- split(df$b,f = df$group)
# make data frame of counts
newdf <- data.frame(one = sapply(dfl,countnum,1),
two = sapply(dfl,countnum,2))
Edit based on comment:
# make list of groups of 4
df$group <- rep(1:ceiling(nrow(df)/4),each = 4)[1:nrow(df)]
table(subset(df, b != 0L)[c("group", "b")])
Which you prefer depends on what type of result you need. A table will work for a small visual count, and you can likely pull the data out of the table, but if it is as simple as your example, you might opt for the data.frame.
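The two ideas can also be combined: if b is turned into a factor with only the levels 1 and 2, table() ignores the zeros and the counts can be pulled straight into a data frame. A minimal sketch on the question's data (grp and counts are illustrative names):

```r
b <- c(0, 0, 1, 2, 0, 1, 1, 2, 2, 0, 2, 2)

# Block number for every position: 1,1,1,1,2,2,2,2,3,3,3,3
grp <- (seq_along(b) - 1) %/% 4 + 1

# Values other than 1 and 2 become NA in the factor and are dropped by table()
counts <- table(grp, factor(b, levels = 1:2))

newdf <- data.frame(one = counts[, "1"], two = counts[, "2"])
newdf
```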
We could use dcast from data.table. Create a grouping variable using %/% and then dcast from 'long' to 'wide' format.
library(data.table)
dcast(setDT(df)[,.N ,.(grp=(id-1)%/%4+1L, b)],
grp~b, value.var='N', fill =0)[, c("1", "2"), with=FALSE]
Or a slightly more compact version would be using fun.aggregate as length.
res <- dcast(setDT(df)[,list((id-1)%/%4+1L, b)][b!=0],
V1~b, length)[,V1:=NULL][]
res
# 1 2
#1: 1 1
#2: 2 1
#3: 0 3
If we need the column names to be 'one', 'two'
library(english)
names(res) <- as.character(english(as.numeric(names(res))))

Expand a list in an R dataframe into additional rows of the dataframe?

In a separate question earlier today I asked how to flatten nested lists into a row in a dataframe. I wish to further understand how to manipulate my lists within a dataframe, this time by expanding the list vertically within the dataframe, adding new rows to accommodate the data.
In this case I wish to go from this structure:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5 Treat-c,Treat-d,Treat-e
To this structure:
CAT COUNT TREAT
A 1 Treat-a
A 2 Treat-b
A 3 NA
B 4 Treat-c
B 5 Treat-d
B NA Treat-e
Code to generate the test (source) data:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5) # as numbers
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
I tried using cbind in an approach similar to the answer supplied to my earlier question, but it failed because the lists contain different numbers of values. Thank you for your patience as I attempt to grasp these basic manipulation tasks.
Here's what I came up with, using a helper function cbind.fill (from the question "cbind a df with an empty df (cbind.fill?)"):
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
#Split df by CAT
df.split <- split(df, df$CAT)
#Apply cbind.fill to make a matrix filled with NA where needed
rawlist <- lapply(df.split, function(x) cbind(as.character(x$CAT), cbind.fill(unlist(x$COUNT), unlist(x$TREAT) ) ))
#Bind rows and convert matrix to data.frame
df.new <- as.data.frame(do.call(rbind, rawlist))
#Column names
colnames(df.new) <- names(df)
df.new
CAT COUNT TREAT
1 A 1 Treat-a
2 A 2 Treat-b
3 A 3 <NA>
4 B 4 Treat-c
5 B 5 Treat-d
6 B <NA> Treat-e
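Note that going through a matrix turns every column into character. If you want COUNT to stay numeric, one alternative is to pad each list element with NA and build a small data frame per row. This is only a sketch under the question's structure; pad, rows and df.long are hypothetical names:

```r
# Test data from the question
df <- data.frame(CAT = c("A", "B"))
df$COUNT <- list(1:3, 4:5)
df$TREAT <- list(paste0("Treat-", letters[1:2]), paste0("Treat-", letters[3:5]))

# Extend a vector to length n; the new positions are filled with NA
pad <- function(v, n) {
  length(v) <- n
  v
}

rows <- lapply(seq_len(nrow(df)), function(i) {
  # The longer of the two list elements decides how many rows this CAT needs
  n <- max(length(df$COUNT[[i]]), length(df$TREAT[[i]]))
  data.frame(CAT = as.character(df$CAT[i]),
             COUNT = pad(df$COUNT[[i]], n),
             TREAT = pad(df$TREAT[[i]], n),
             stringsAsFactors = FALSE)
})

df.long <- do.call(rbind, rows)
df.long
```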
Extending my answer from your previous question, you can create another function, let's call this one flattenLong that combines flatten and melt from "data.table" to get your desired output.
The function looks like this:
flattenLong <- function(indt, cols) {
ob <- setdiff(names(indt), cols)
x <- flatten(indt, cols, TRUE)
mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
Usage is simply:
flattenLong(df, c("COUNT", "TREAT"))
## CAT variable COUNT TREAT
## 1: A 1 1 Treat-a
## 2: A 2 2 Treat-b
## 3: A 3 3 NA
## 4: B 1 4 Treat-c
## 5: B 2 5 Treat-d
## 6: B 3 NA Treat-e
For convenience, here's the flatten function again:
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
if (isTRUE(drop)) {
# the new columns were already created above, so only the original list columns need dropping
indt[, (cols) := NULL]
}
indt[]
}

Creating new aggregated R dataframes with a function in lapply

I want to create new aggregated dataframes from existing ones that are collected in a list. Ideally they would appear as their own dataframes indicated by a prefix. This is where I've got:
dfList <- list(A = data.frame(y = sample(1:100), x = c("grp1","grp2")),
B = data.frame(y = sample(1:100), x = c("grp1","grp2")))
agrFun <- function(df){
prefix <- deparse(substitute(df))
assign(paste0(prefix,"_AGR"),
aggregate(y ~ x, data = df, sum))
}
lapply(seq_along(dfList), function(x) agrFun(dfList[[x]]))
Aggregation happens as intended but I'm not sure what I need to do otherwise in order to create dataframes A_AGR and B_AGR.
EDIT:
A bit of clarification. I want to have the aggregated dataframes appear in the environment.
So instead of this
ls()
[1] "agrFun" "dfList"
my goal is to have
ls()
[1] "A_AGR" "B_AGR" "agrFun" "dfList"
EDIT2:
Or more ideal would be to have dfList include dataframes A, A_AGR, B and B_AGR after the process.
EDIT3:
And I also want to preserve the names of the dataframes.
Your approach seems more complicated than it needs to be. You can create the named data.frames you want in a list with this one-liner:
setNames(lapply(dfList, function(u) aggregate(y~x, u, sum)), paste0(names(dfList),"_AGR"))
#$A_AGR
# x y
#1 grp1 2340
#2 grp2 2710
#$B_AGR
# x y
#1 grp1 2573
#2 grp2 2477
With your function agrFun:
lst = setNames(lapply(dfList, function(x) agrFun(x)), paste0(names(dfList),"_AGR"))
If you want to append the two lists:
dfList = append(lst, dfList)
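If you really do want A_AGR and B_AGR as separate objects rather than list elements, list2env() can copy a named list into an environment. A sketch, assuming the same dfList structure as in the question; target is a hypothetical scratch environment used here so the example does not touch your workspace (pass globalenv() instead to create the objects there):

```r
set.seed(1)  # only so the example is reproducible
dfList <- list(A = data.frame(y = sample(1:100), x = c("grp1", "grp2")),
               B = data.frame(y = sample(1:100), x = c("grp1", "grp2")))

# Aggregate every data frame and name the results with the _AGR suffix
agr <- lapply(dfList, function(d) aggregate(y ~ x, data = d, FUN = sum))
names(agr) <- paste0(names(dfList), "_AGR")

# Copy each list element into the environment as its own object
target <- new.env()
list2env(agr, envir = target)
ls(target)

# Or keep everything together in one list, originals first
combined <- c(dfList, agr)
names(combined)
```

Keeping the results in a list is usually easier to work with afterwards, since you can keep using lapply on them.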
