R: Subsetting over all data frames inside a list

I'm new to R and Stack Overflow. I'm trying to work with a list of data frames and have the following problem (I hope this is a good reproducible example). Assume I have a list of 3 data frames with 4 columns each (my real code contains 10 data frames with 20 columns):
df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)
Each data frame has a different subsetting condition. For example:
# If I subset them individually, outside of the list:
# take only the rows with k = 12 down to k = 1, and only the column h_1
df1_s <- df1[which(df1$k <= 12 & df1$k > 0), "h_1"]
df2_s <- df2[which(df2$k <= 4 & df2$k > 0), "h_3"]
df3_s <- df3[which(df3$k <= 12 & df3$k > 0), "h_2"]
How can I subset the three data frames in the list in the most efficient way?
I think something with lapply, with the subsetting numbers stored in a vector, would be a good approach, but I have no idea how to do it or how to subset within lists.
I hope you can help me. Before posting, I looked for a solution in other posts that deal with subsetting data frames in lists, but those didn't work for my code.

Here's an mapply approach (same idea as the other answer):
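# function with arguments: a data frame and a list(column name, upper, lower)
rook <- function(df, par) {
  # keep rows with lower < k <= upper, then return the requested column
  df[df$k <= par$upper & df$k > par$lower, par$col]
}
# list of parameters (one list per data frame, so the limits stay numeric)
par_list <- list(
  list(col = 'h_1', upper = 12, lower = 0),
  list(col = 'h_3', upper = 4,  lower = 0),
  list(col = 'h_2', upper = 12, lower = 0)
)
# call mapply; SIMPLIFY = FALSE always returns a list
mapply(rook, df_list, par_list, SIMPLIFY = FALSE)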

Here's a solution using base R. As @www mentioned, the idea is to use an apply-type function (mapply, or pmap from purrr) to pass multiple arguments to a function element by element. This solution also uses the eval(parse()) construct for flexible subsetting. See, e.g., the discussion here: http://r.789695.n4.nabble.com/using-a-condition-given-as-string-in-subset-function-how-td1676426.html.
subset_fun <- function(data, criteria, columns) {
subset(data, eval(parse(text = criteria)), columns)
}
criterion <- list("k <= 12 & k > 0", "k <= 4 & k > 0", "k <= 12 & k > 0")
cols <- list("h_1", "h_3", "h_2")
out <- mapply(subset_fun, df_list, criterion, cols)
str(out)
# List of 3
# $ df1.h_1: num [1:12] -0.0589 1.0677 0.2122 1.4109 -0.6367 ...
# $ df2.h_3: num [1:4] -0.826 -1.506 -1.551 0.862
# $ df3.h_2: num [1:12] 0.8948 0.0305 0.9131 -0.0219 0.2252 ...
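If you would rather not build conditions from strings, the same idea can be sketched with one predicate function per data frame. This is only an illustrative variant, not part of the answer above: the names conditions, subset_fun2 and out2 are made up for the example, and the data are regenerated with set.seed(123).

```r
# Hypothetical variant: pass each condition as a function instead of a string.
set.seed(123)
df1 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df2 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df3 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df_list <- list(df1 = df1, df2 = df2, df3 = df3)

# One predicate per data frame; each returns a logical vector marking rows to keep.
conditions <- list(
  function(d) d$k <= 12 & d$k > 0,
  function(d) d$k <= 4  & d$k > 0,
  function(d) d$k <= 12 & d$k > 0
)
cols <- c("h_1", "h_3", "h_2")

# Apply the predicate to the data frame, then pick the requested column.
subset_fun2 <- function(data, keep, column) data[keep(data), column]

out2 <- Map(subset_fun2, df_list, conditions, cols)
lengths(out2)
```

The trade-off is that functions cannot be read from a text file as easily as strings, but they avoid parsing code at run time.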

We can use the pmap function from the purrr package. The key is to define a function that takes the k limits and the column name as arguments, organize those arguments in a list, and then call pmap.
library(tidyverse)
# Define a function
subset_fun <- function(dat, k1, k2, col){
dat2 <- dat %>%
filter(k <= k1, k > k2) %>%
pull(col)
return(dat2)
}
# Define lists for the function arguments
par <- list(dat = df_list, # List of data frames
k1 = list(12, 4, 12), # The first number
k2 = list(0, 0, 0), # The second number
col = list("h_1", "h_3", "h_2")) # The column name
# Apply the subset_fun
df_list2 <- pmap(par, subset_fun)
df_list2
# $df1
# [1] -0.6868529 -0.4456620 1.2240818 0.3598138 0.4007715 0.1106827 -0.5558411 1.7869131
# [9] 0.4978505 -1.9666172 0.7013559 -0.4727914
#
# $df2
# [1] -0.9474746 -0.4905574 -0.2560922 1.8438620
#
# $df3
# [1] -0.2803953 0.5629895 -0.3724388 0.9769734 -0.3745809 1.0527115 -1.0491770 -1.2601552
# [9] 3.2410399 -0.4168576 0.2982276 0.6365697
DATA
set.seed(123)
df1 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df2 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df3 <- data.frame(k=20:0, h_1=rnorm(21), h_2=rnorm(21), h_3= rnorm(21))
df_list <- list(df1=df1,df2=df2,df3=df3)

Consider Map, a wrapper around mapply that returns a list of data frames. Because you select a single column, the subset would otherwise come back as a vector; to avoid that, cast back with data.frame and use setNames to rename the column.
Here, mapply or Map, a sibling of lapply, is chosen because you want to iterate element-wise across several objects of equal length. mapply accepts any number of arguments to iterate over (here four), whose lengths should be equal or multiples of one another:
low_limits <- c(0, 0, 0)
high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")
subset_fct <- function(df, lo, hi, col)
setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)
new_df_list <- Map(subset_fct, df_list, low_limits, high_limits, h_cols)
# EQUIVALENT CALL
new_df_list <- mapply(subset_fct, df_list, low_limits,
high_limits, h_cols, SIMPLIFY = FALSE)
Output (uses set.seed(456) at top to reproduce random numbers)
new_df_list
# $df1
# h_1
# 1 1.0073523
# 2 0.5732347
# 3 -0.9158105
# 4 1.3110974
# 5 0.9887263
# 6 1.6539287
# 7 -1.4408052
# 8 1.9473564
# 9 1.7369362
# 10 0.3874833
# 11 2.2800340
# 12 1.5378833
# $df2
# h_2
# 1 0.11815133
# 2 0.86990262
# 3 -0.09193621
# 4 0.06889879
# $df3
# h_3
# 1 -1.4122604
# 2 -0.9997605
# 3 -2.3107388
# 4 0.9386188
# 5 -1.3881885
# 6 -0.6116866
# 7 0.3184948
# 8 -0.2354058
# 9 1.0750520
# 10 -0.1007956
# 11 1.0701526
# 12 1.0358389
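As a small sketch of the length rule mentioned above: because all three lower limits are 0, a single 0 can be passed and Map recycles it across the list. The object names recycled and explicit are made up for this illustration.

```r
set.seed(456)
df1 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df2 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df3 <- data.frame(k = 20:0, h_1 = rnorm(21), h_2 = rnorm(21), h_3 = rnorm(21))
df_list <- list(df1 = df1, df2 = df2, df3 = df3)

subset_fct <- function(df, lo, hi, col) {
  setNames(data.frame(df[which(df$k > lo & df$k <= hi), col]), col)
}

high_limits <- c(12, 4, 12)
h_cols <- c("h_1", "h_2", "h_3")

# The length-1 lower limit is recycled to match the three data frames.
recycled <- Map(subset_fct, df_list, 0, high_limits, h_cols)
# Same call with the lower limits spelled out.
explicit <- Map(subset_fct, df_list, c(0, 0, 0), high_limits, h_cols)

identical(recycled, explicit)
sapply(recycled, nrow)
```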

Related

lapply to add column to existing dataframes

I have a list of data frames, and want to perform a function on each column in the data frame.
I've been googling for a while, but the issue I have is this:
df.1 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
df.2 <- data.frame(data=cbind(rnorm(5, 0), rnorm(5, 2), rnorm(5, 5)))
names(df.1) <- c("a", "b", "c")
names(df.2) <- c("a", "b", "c")
ls.1<- list(df.1,df.2)
res <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
Returns a new list "res" with a group of unnamed dataframes in them (res[[1]], res[[2]] etc).
[[1]]
a b c d
1 2.2378686 3.640607 4.793172 8.433780
2 -0.4411046 3.690850 5.290814 8.981664
3 -1.1490879 3.081092 4.982820 8.063912
4 -0.3024211 1.929033 4.743569 6.672602
5 1.3658726 3.395564 2.800131 6.195695
[[2]]
a b c d
1 0.3452530 3.264709 7.384127 10.648836
2 -1.2031949 3.118633 4.840496 7.959129
3 0.6177369 1.119107 4.938917 6.058024
4 -1.0470713 1.942357 5.747748 7.690106
5 0.8732836 2.704501 5.805754 8.510254
I'm interested in adding columns to the original data frames (df.1, df.2). How would I do this?
You can name your list elements, or use tibble::lst which will do it for you:
ls.1<- list(df.1 = df.1,df.2 = df.2)
ls.2<- tibble::lst(df.1, df.2)
res1 <- lapply(ls.1, function(x){
x$d <- x$b + x$c
return(x)
})
res2 <- lapply(ls.2, function(x){
x$d <- x$b + x$c
return(x)
})
# $df.1
# a b c d
# 1 0.6782608 4.0774244 2.845351 6.922776
# 2 2.3620601 1.9395314 5.438832 7.378364
# 3 -0.5913838 2.0579972 4.312360 6.370357
# 4 0.5532147 0.8581389 5.867889 6.726027
# 5 -0.3251044 1.9838598 4.321008 6.304867
#
# $df.2
# a b c d
# 1 1.9918131 3.195105 5.715858 8.910963
# 2 0.2525537 2.507358 5.040691 7.548050
# 3 0.5038298 3.112855 5.265974 8.378830
# 4 0.4873384 3.377182 5.685714 9.062896
# 5 -0.6539881 0.157948 5.407508 5.565456
To overwrite the original data.frames you can use list2env on the output.
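A minimal sketch of the list2env step, using small made-up data frames so the result is easy to check. It relies on the list being named, since list2env assigns each element to a variable with the element's name.

```r
# Small illustrative data (not the random data from the question).
df.1 <- data.frame(a = 1:3, b = c(2, 4, 6), c = c(10, 20, 30))
df.2 <- data.frame(a = 4:6, b = c(1, 3, 5), c = c(5, 15, 25))

ls.1 <- list(df.1 = df.1, df.2 = df.2)
res <- lapply(ls.1, function(x) {
  x$d <- x$b + x$c
  x
})

# Assign every list element to a variable of the same name in the current
# environment (the global environment when run at top level), which
# overwrites the original df.1 and df.2.
invisible(list2env(res, envir = environment()))

names(df.1)
df.2$d
```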
In order to add columns, you will have to either overwrite your ls.1 with res or perhaps manually assign result to your original data.frames, e.g. df.1 <- res[[1]]. But there are a hundred ways to skin a cat (pun intended) and there may be other better approaches.

Applying function using multiple columns as argument , function returns a data.frame

I am trying to apply a function that uses multiple columns of a data frame as arguments, with the function returning a data frame for each row. I can use a for loop here, but wanted to check whether there is another way of doing this.
A simple example is provided here; my original problem is slightly more complicated.
DF1<-data.frame(start=seq(from=1, to=5, by=1),end=seq(from=10, to=14, by=1))
rep_fun <- function(x,y)
{
data.frame( A=seq(x, y)) #produces a sequence between x and y
}
DF2<-data.frame()
for (i in 1:nrow(DF1)){
temp<-data.frame(rep_fun(DF1$start[i],DF1$end[i]))
DF2<-rbind(temp,DF2) # this contains a dataframe that has a sequence between 'start' and 'end' for each row in DF1
}
The desired result, which I am able to obtain with a for loop, is shown below. Not all rows are shown here. Rows 1 to 10 show the sequence corresponding to row 5 in DF1.
> DF2
A
1 5
2 6
3 7
4 8
5 9
6 10
7 11
8 12
9 13
10 14
11 4
12 5
1) lapply. Split DF1 by nrow(DF1):1 so that it comes out in reverse order, then lapply over that list and rbind its components together. No packages are used.
DF3 <- do.call("rbind", lapply(split(DF1, nrow(DF1):1), with, rep_fun(start, end)))
rownames(DF3) <- NULL
identical(DF2, DF3)
## [1] TRUE
2) Map. Or this alternative:
fun <- function(x) with(x, rep_fun(start, end))
DF4 <- do.call("rbind", Map(fun, split(DF1, nrow(DF1):1), USE.NAMES = FALSE))
identical(DF4, DF2)
## [1] TRUE
3) Map/rev. Like (2), this uses Map, but this time with rep_fun directly. It also uses rev to order the output after the computation rather than split to order the input before the computation.
DF5 <- do.call("rbind", with(DF1, rev(Map(rep_fun, start, end))))
identical(DF5, DF2)
## [1] TRUE

Find top deciles from dataframe by group

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
df <- df[df$v1 == cat,]
df <- df[order(-df$v2),]
#Divide the customers into top percents
nr <- nrow(df)
p10 <- round(nr * .10, 0)
cat("Number of people in the Top 10% :", p10, "\n")
p20 <- round(nr * .20, 0)
p11_20 <- p20-p10
cat("Number of people in the 11-20% :", p11_20, "\n")
#Keep only those customers in the top groups
df <- df[1:p20,]
#Create a variable to identify the percent group the customer is in
top_pct <- integer(length = p10 + p11_20)
#Identify those in each group
top_pct[1:p10] <- 10
top_pct[(p10+1):p20] <- 20
#Add this variable to the data frame
df$top_pct <- top_pct
#Keep only custID and the new variable
df <- subset(df, select = c(custID, top_pct))
return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[,g:={
x = rep(0,.N)
x[r>.8] = 20
x[r>.9] = 10
x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combination:
library(dplyr)
library(tidyr)
data %>%
group_by(v1) %>%
arrange(desc(v2))%>%
mutate(n=n()) %>%
filter(row_number() <= round(n * .2)) %>%
mutate(top_pct= ifelse(row_number()<=round(n* .1), 10, 20)) %>%
select(custID, top_pct) %>%
spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
cutoff <- quantile(df$v2, c(0.8, 0.9))
top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
na.omit(data.frame(id=df$custID, top_pct))
})
Finding quantiles is done with quantile.
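To make the cutoffs concrete, here is a sketch that runs the same split/lapply code on the question's data and then stacks the per-group results into one data frame. The names top_by_group and combined are made up for this illustration, and the stacking step is an addition, not part of the answer above.

```r
data <- data.frame(custID = c(1:10, 1:20),
                   v1 = rep(c("A", "B"), c(10, 20)),
                   v2 = c(30:21, 20:19, 1:3, 20:6),
                   stringsAsFactors = TRUE)

# 80th and 90th percentiles of v2 within category A (default quantile type).
quantile(data$v2[data$v1 == "A"], c(0.8, 0.9))

top_by_group <- lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})

# Stack the groups and record which category each row came from.
combined <- do.call(rbind, top_by_group)
combined$v1 <- rep(names(top_by_group), sapply(top_by_group, nrow))
combined
```

Note that quantile interpolates between observations, so with ties or very small groups the selected rows may differ from the rank-based counts used in the question's pf function.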

Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same as the columns we are filling to, and the same applies to the merge columns.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)"
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
My preference would be to pull out the code from merge that does the matching and do it myself, so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it. I thought I could fill it in directly, but apparently you can't use matrix indexing to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names would be faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}
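The key-matching step that both versions share can be traced on a tiny made-up example. The data below and the names fill_key, na_key, m and fix are chosen only for illustration.

```r
# Lookup table: one row per (a, b) combination.
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400))
# Data with gaps, deliberately in a different row order.
naDf <- data.frame(a = c(2, 1, 2, 1), b = c(4, 3, 3, 4),
                   f = c(NA, 0, NA, 0))
mergeCols <- c("a", "b")

# Build one string key per row from the merge columns.
fill_key <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep = "\r"))
na_key   <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep = "\r"))

# For each row of naDf, the row of fillDf with the same key.
m <- match(na_key, fill_key)

# Fill only the missing entries; row order of naDf is untouched.
fix <- which(is.na(naDf$f))
naDf$f[fix] <- fillDf$f[m[fix]]
naDf
```

Because nothing is merged or sorted, the rows of naDf stay in their original order, which is the point of doing the matching by hand.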

How to find which elements of one set are in another set?

I have two sets: A with columns x,y, and B also with columns x, y.
I need to find the index of the rows of A which are inside of B (both x and y must match).
I have come up with a simple solution (see below), but this comparison runs inside a loop, and paste adds a lot of extra time.
B <- data.frame(x = sample(1:1000, 1000), y = sample(1:1000, 1000))
A <- B[sample(1:1000, 10),]
#change some elements
A$x[c(1,3,7,10)] <- A$x[c(1,3,7,10)] + 0.5
A$xy <- paste(A$x, A$y, sep='ZZZ')
B$xy <- paste(B$x, B$y, sep='ZZZ')
indx <- which(A$xy %in% B$xy)
indx
For example, for a single observation, an alternative to paste is almost 3 times faster:
ind <- sample(1:1000, 1)
xx <- B$x[ind]
yy <- B$y[ind]
ind <- which(with(B, x==xx & y==yy))
# [1] 0.0160000324249268 seconds
xy <- paste(xx,'ZZZ',yy, sep='')
ind <- which(B$xy == xy)
# [1] 0.0469999313354492 seconds
How about using merge() to do the matching for you?
A$id <- seq_len(nrow(A))
sort(merge(A, B)$id)
# [1] 2 4 5 6 8 9
Edit:
Or, to get rid of two unnecessary sorts, use the sort= option to merge()
merge(A, B, sort=FALSE)$id
# [1] 2 4 5 6 8 9
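A small deterministic sketch comparing the merge approach with the paste approach from the question. The data here are made up so the matching rows are obvious, and idx_merge and idx_paste are names chosen for the example.

```r
B <- data.frame(x = c(1, 2, 3, 4), y = c(10, 20, 30, 40))
# Rows 1 and 3 of A exist in B; row 2 has a modified x and should not match.
A <- data.frame(x = c(2, 3.5, 4), y = c(20, 30, 40))

# merge approach: tag each row of A, join on the shared columns x and y.
A$id <- seq_len(nrow(A))
idx_merge <- sort(merge(A, B)$id)

# paste approach from the question, for comparison.
idx_paste <- which(paste(A$x, A$y, sep = "ZZZ") %in% paste(B$x, B$y, sep = "ZZZ"))

idx_merge
idx_paste
```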
