I am trying to make a function in R which returns a list of dataframes which are subsetted by each factor level.
Here is an example to help explain what I am trying to do:
#Creating a dataset for my example
f1<-c("a","a","b","b","c","c")
f2<-c("x","y","x","y","x","y")
v1<-c(1:6)
v2<-c(7:12)
factors<-as.data.frame(cbind(f1,f2), stringsAsFactors=TRUE) #needed in R >= 4.0 so that f1 and f2 are factors
integers<-as.data.frame(cbind(v1,v2))
df<-cbind(factors,integers)
#The function
partition<-function(data){
factors<-Filter(is.factor,data) #Splitting data into factors
subsets<-list(NULL) #Creating an empty list where I will put the subsets
nm=0
for( i in 1:ncol(factors)){
nm=nm+nlevels(factors[,i])
}
nm
for( i in 1:ncol(factors)){
for(j in 1:nlevels(factors[,i])){
for(k in 1:nm){
subsets[[k]]<-df[which(factors[,i]==levels(factors[,i])[j]), ]
}
}
}
return(subsets)
}
partition(df)
This yields:
[[1]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[2]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[3]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[4]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[5]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
As you can see, these are all the same dataset. If I remove the loop over k, the datasets differ and are subsetted properly, but I only get three of them: the two levels of the last factor variable overwrite the first two entries, so of the f1 subsets only the one where f1 == "c" survives.
Removing the for loop over k, we get:
[[1]]
f1 f2 v1 v2
1 a x 1 7
3 b x 3 9
5 c x 5 11
[[2]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[3]]
f1 f2 v1 v2
5 c x 5 11
6 c y 6 12
Here we are missing the subsets where f1 == "a" and f1 == "b".
Note that I should get 5 dataframes, as we have 2 + 3 factor levels (this is calculated as nm in the first for loop, before the subsetting).
So my question is: how do I get the above to run without overwriting what has already been subsetted?
For some background, this is working towards a classification model that would yield nfactor(df) predictions; I will then run a GLM to weight each prediction.
Thank you for any insight into my problem.
Update
The first answer, from Glen, simplifies my code, which may make the problem I am having more apparent. Here is the updated code (note that it runs much more efficiently on large datasets with the split() function, so thank you, Glen):
for(k in 1:nm){
for( i in 1:ncol(factors)){
for( j in 1:nlevels(factors[,i])){
subsets[[k]]<-split(df,factors[,i])[j]
}
}
}
This returns the same as in my original question. The problem is that when I loop over k from 1 to nm, the loop overwrites what was already generated. How do I stop this from happening?
If I understand your question correctly, you can do this easily with the split function.
f1<-c("a","a","b","b","c","c")
f2<-c("x","y","x","y","x","y")
v1<-c(1:6)
v2<-c(7:12)
factors<-as.data.frame(cbind(f1,f2))
integers<-as.data.frame(cbind(v1,v2))
df<-cbind(factors,integers)
tmp1=split(df,f1)
tmp2=split(df,f2)
c(tmp1,tmp2)
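To avoid naming each factor column by hand, one possible sketch is to split on every factor column and flatten the result by one level. The helper name partition_split is made up for illustration, and the data frame is rebuilt with explicit factor() calls on the assumption that f1 and f2 are meant to be factors.

```r
# Same toy data as above, with the factor columns declared explicitly
df <- data.frame(f1 = factor(c("a", "a", "b", "b", "c", "c")),
                 f2 = factor(c("x", "y", "x", "y", "x", "y")),
                 v1 = 1:6, v2 = 7:12)

# Hypothetical helper: one subset per level of every factor column
partition_split <- function(data) {
  factors <- Filter(is.factor, data)                      # keep factor columns only
  pieces <- lapply(factors, function(f) split(data, f))   # list of lists of data frames
  unlist(pieces, recursive = FALSE)                       # flatten one level only
}

subsets <- partition_split(df)
names(subsets)   # element names combine column name and level
```

Flattening with recursive = FALSE matters here: a full unlist() would also take the data frames apart into their columns.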
library(plyr)
library(foreach)
x<-foreach(i= colnames(Filter(is.factor,df)), .combine='c') %do%
plyr::dlply(df, i)
This returns a list of 5 dataframes. c is used to combine the results of the foreach loop (each of which is a list in itself). Without it we get a list of lists; with c, all the lists are combined into one list.
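For comparison with the loop in the question, here is a minimal sketch that keeps the nested loops but drops the loop over k: a single counter is advanced once per factor level, so each subset gets its own slot instead of being overwritten. The function name partition_loop is hypothetical, and the data frame is rebuilt with explicit factors so the snippet runs on its own.

```r
df <- data.frame(f1 = factor(c("a", "a", "b", "b", "c", "c")),
                 f2 = factor(c("x", "y", "x", "y", "x", "y")),
                 v1 = 1:6, v2 = 7:12)

# Hypothetical rewrite of the original loop with one running index
partition_loop <- function(data) {
  factors <- Filter(is.factor, data)
  nm <- sum(sapply(factors, nlevels))      # total number of levels (5 here)
  subsets <- vector("list", nm)            # pre-allocate one slot per level
  k <- 0
  for (i in seq_along(factors)) {
    for (lev in levels(factors[[i]])) {
      k <- k + 1                           # advance once per (column, level) pair
      subsets[[k]] <- data[factors[[i]] == lev, ]
    }
  }
  subsets
}

res <- partition_loop(df)
length(res)
```

Note that the subsetting uses the function argument data rather than the global df, which the original loop relied on.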
I want to save residuals from a linear model to a dataframe. I was trying to do it with the line of code (note that this was supposed to go inside a loop):
resi <- NULL
resi <- cbind(resi, colnames(dados[1])=residuals(m))
Here I intended to save the residuals vector from my model m under the same column name as in the dados object (which is basically a date), but I get the error:
Error: unexpected '=' in "resi <- cbind(resi, colnames(dados[1])="
You want `colnames<-`().
cbind(d, `colnames<-`(d, letters[1:4]))
# X1 X2 X3 X4 a b c d
# 1 1 4 7 10 1 4 7 10
# 2 2 5 8 11 2 5 8 11
# 3 3 6 9 12 3 6 9 12
It's similar to setNames() but also compatible with matrices.
Toy data:
d <- data.frame(matrix(1:12, 3, 4))
It is possible to do this in tibble
library(tibble)
tibble(resi, !!colnames(dados)[1] :=residuals(m))
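The original error arises because the name on the left of = in a function call has to be a literal name, not an expression. Since the code was meant to run inside a loop, another option is to collect the residuals in a named list, where [[<- accepts a computed name, and convert once at the end. This is only a sketch: mtcars and the two response names stand in for the asker's dados and models, which are not shown.

```r
# Assumption: mtcars replaces the asker's data; the responses are hypothetical
responses <- c("mpg", "hp")

resi <- list()
for (resp in responses) {
  m <- lm(reformulate("wt", response = resp), data = mtcars)
  resi[[resp]] <- residuals(m)       # the column name is computed at run time
}

# check.names = FALSE keeps names such as dates from being altered
resi <- data.frame(resi, check.names = FALSE)
head(resi)
```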
I'd like to compare the values of all the different list objects in R and then output or record the matching values, together with the indices of the list objects that contain them.
Here is the code:
a <- c(10,23,50,7,3)
b <- c(1,1,2,2,3)
c <- c(33,24,4,10,1)
d <- c(1,4,7,8,3)
y <- c()
z <- c()
r <- list()
table <- data.frame(a,b,c,d)
for (i in 1:5){
  for (j in 3:4){
    x <- table$b[j]
    for (l in which(table$b == x)){
      if (abs((table$c[j] - table$a[i])-table$d[l])<10)
      {y <- rbind(y, table[j,])}
    }
    z <- y
  }
  y <- c()
  r[[i]] <- z
}
The output:
> r
[[1]]
a b c d
4 7 2 10 8
41 7 2 10 8
[[2]]
NULL
[[3]]
NULL
[[4]]
a b c d
4 7 2 10 8
41 7 2 10 8
[[5]]
a b c d
3 50 2 4 7
31 50 2 4 7
4 7 2 10 8
41 7 2 10 8
I'd like to record the matching values between list objects and also the indices of the list objects that have these matching values. For example, in this instance r[[1]] and r[[4]] share the row 7 2 10 8; I'd like to write both the vector and the indices of the list objects that contain it to a data frame. How can I do this?
EDIT.
I'd like the output to be a vector whose first element would be the matching vectors' 3rd value (c column) and the rest of the elements would be indices of the list objects.
In this case:
10
1
4
Using R 3.3.2.
My equation is:
Lt_1<-Lt +(Linf-Lt)*(1-exp(-K))
Values for equation parameters:
Lt<-40 #only value that changes
Linf<-139.086
K<-0.413
year=c(0:10)
I want to evaluate the equation once per element of year, with Lt starting at 40 and each subsequent iteration using the value of Lt_1 from the previous iteration.
What I have tried:
#dataframe to hold output
new<- as.data.frame(matrix( 0, nrow=length(year), ncol= 2))
predict_length<-for(i in seq_along(year)){
Lt_1<-Lt +(Linf-Lt)*(1-exp(-K))
new[1,2]<-Lt
new[i,2]<-Lt_1
new[i,1]<-year[i]
}
new
Output:
V1 V2
1 0 40.00000
2 1 73.52453
3 2 73.52453
4 3 73.52453
5 4 73.52453
6 5 73.52453
7 6 73.52453
8 7 73.52453
9 8 73.52453
10 9 73.52453
11 10 73.52453
The loop isn't working: the second Lt_1 value is repeated for the remainder of the data frame.
Consider using Reduce, the common higher order function found in other languages, since you are essentially nesting equation calls together:
eq <- function(Lt) Lt + (Linf-Lt)*(1-exp(-K))
Lt_1 <- eq(Lt)
Lt_2 <- eq(eq(Lt))
Lt_3 <- eq(eq(eq(Lt)))
Hence, wrap Reduce inside an sapply that, for each year i, uses rep to pass i+1 copies of the initial Lt, so that eq is applied i times:
new <- as.data.frame(matrix( 0, nrow=length(year), ncol= 2))
new$V1 <- year
new$V2 <- sapply(year, function(i)
Reduce(function(x, y) eq(x), rep(40, i+1)))
new
# V1 V2
# 1 0 40.00000
# 2 1 73.52453
# 3 2 95.70645
# 4 3 110.38339
# 5 4 120.09456
# 6 5 126.52008
# 7 6 130.77161
# 8 7 133.58468
# 9 8 135.44598
# 10 9 136.67754
# 11 10 137.49241
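A variation on the same idea, offered as a sketch: Reduce can also keep every intermediate value with accumulate = TRUE, so the whole series comes out of a single pass instead of one Reduce call per year. The name len is introduced here for illustration.

```r
Linf <- 139.086
K <- 0.413
year <- 0:10
eq <- function(Lt) Lt + (Linf - Lt) * (1 - exp(-K))

# init = 40 is the year-0 length; each remaining year applies eq once more.
# The second argument of the reducer (the year itself) is not needed.
len <- Reduce(function(prev, yr) eq(prev), year[-1], init = 40, accumulate = TRUE)

new <- data.frame(V1 = year, V2 = len)
new
```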
The OP's main question is why the loop is not working. The reason is that every iteration uses the unchanged Lt in the formula.
Two changes are needed in the for-loop:
1. Declare Lt_1 outside the for-loop.
2. Use Lt_1 in place of Lt in the formula.
The modified code will be:
Lt<-40 #only value that changes
Linf<-139.086
K<-0.413
year=c(0:10)
#dataframe to hold output
new<- as.data.frame(matrix( 0, nrow=length(year), ncol= 2))
new[1,1]<- 0
new[1,2]<- Lt
Lt_1 <- Lt;
for(i in 2:length(year)){
Lt_1<-Lt_1 +(Linf-Lt_1)*(1-exp(-K))
new[i,2]<-Lt_1
new[i,1]<- year[i]
}
new
# V1 V2
#1 0 40.00000
#2 1 73.52453
#3 2 95.70645
#4 3 110.38339
#5 4 120.09456
#6 5 126.52008
#7 6 130.77161
#8 7 133.58468
#9 8 135.44598
#10 9 136.67754
#11 10 137.49241
Here's a quick and dirty solution:
lt0 <- 40
cfunc <- function(lt){
linf <- 139.086
k <- 0.413
lt1 <- lt + (linf-lt)*(1-exp(-k))
}
year = 0:10
val <- c(lt0, rep(0, length(year)-1)) #year 0 holds the starting length
for(i in 2:length(year)){
val[i] <- cfunc(val[i-1])
}
output:
> val
[1] 40.00000 73.52453 95.70645 110.38339 120.09456 126.52008 130.77161 133.58468 135.44598 136.67754
[11] 137.49241
There ought to be a better way than using a for loop
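One loop-free option, sketched here rather than taken from the answers above: the recursion can be rewritten as Linf - Lt_1 = (Linf - Lt) * exp(-K), so the gap to Linf shrinks by a factor exp(-K) each year. That gives a closed form which R can evaluate for all years at once. The names L0 and closed are introduced for illustration.

```r
Linf <- 139.086
K <- 0.413
L0 <- 40          # starting length at year 0
year <- 0:10

# After n years the remaining gap to Linf is (Linf - L0) * exp(-K * n)
closed <- Linf - (Linf - L0) * exp(-K * year)

# Cross-check against the step-by-step recursion
stepwise <- numeric(length(year))
stepwise[1] <- L0
for (i in 2:length(year)) {
  stepwise[i] <- stepwise[i - 1] + (Linf - stepwise[i - 1]) * (1 - exp(-K))
}
all.equal(closed, stepwise)
```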
I'm trying to explore a large dataset, both with data frames and with charts. I'd like to analyze the distribution of each variable by different metrics (e.g., sum(x), sum(x*y)) and for different sub-populations. I have 4 sub-populations, 2 metrics, and many variables.
In order to accomplish that, I've made a list structure such as this:
$variable1
...$metric1 <--- that's a df.
...$metric2
$variable2
...$metric1
...$metric2
Inside one of the data_frames (e.g., list$variable1$metric1), I've calculated distributions of the unique values for variable1 and for each of the four population groups (represented in columns). It looks like this:
$variable1$metric1
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.278 0.317 0.278 0.317
3 (3) 26-34 Years Old 0.225 0.228 0.225 0.228
4 (4) 35 or Older 0.497 0.456 0.497 0.456
$variable1$metric2
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.544 0.406 0.544 0.406
3 (3) 26-34 Years Old 0.197 0.310 0.197 0.310
4 (4) 35 or Older 0.259 0.284 0.259 0.284
What I'm trying to figure out is a good way to loop through the list of lists (probably melting the DFs in the process) and then output a ton of bar charts. In this case, the natural plot format would be, for each dataframe, a stacked bar chart with one stacked bar for each sub-population, grouping by the variable's unique values.
But I'm not familiar with iterated plotting, so I've hit a dead end. How might I plot from that list structure? Alternatively, is there a better structure in which I should be storing this information?
I find nested lists to be pretty tricky to work with, so I would combine them all into a single data frame that labels the name of the variable and the name of the metric:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:9, 2))), beta = list(c = data.frame(matrix(11:14, 2))))
level1 <- lapply(lst, function(x) do.call(rbind, lapply(names(x), function(y) {x[[y]]$metric=y ; x[[y]]})))
dat <- do.call(rbind, lapply(names(level1), function(x) {level1[[x]]$variable=x ; level1[[x]]}))
dat
# X1 X2 metric variable
# 1 1 3 a alpha
# 2 2 4 a alpha
# 3 6 8 b alpha
# 4 7 9 b alpha
# 5 11 13 c beta
# 6 12 14 c beta
Now you can use standard tools for manipulating a single data frame to perform your data analysis.
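Applied to the structure in the question, the same flattening idea might look like the sketch below. It assumes each inner data frame has a unique_values column plus one column per sub-population. The list dist_list is a cut-down stand-in built from values shown in the question, and to_long is a hypothetical helper. Base graphics are used so that no extra packages are needed.

```r
# Stand-in for the nested list (variable -> metric -> data frame).
# Values are copied from the question; the all-NA row is left out.
ages <- c("18-25 Years Old", "26-34 Years Old", "35 or Older")
dist_list <- list(
  variable1 = list(
    metric1 = data.frame(unique_values = ages,
                         med_all = c(0.278, 0.225, 0.497),
                         med_none = c(0.317, 0.228, 0.456)),
    metric2 = data.frame(unique_values = ages,
                         med_all = c(0.544, 0.197, 0.259),
                         med_none = c(0.406, 0.310, 0.284))))

# Hypothetical helper: one row per unique value and sub-population
to_long <- function(d, variable, metric) {
  groups <- setdiff(names(d), "unique_values")
  data.frame(variable = variable, metric = metric,
             unique_values = rep(d$unique_values, times = length(groups)),
             group = rep(groups, each = nrow(d)),
             share = unlist(d[groups], use.names = FALSE))
}

long <- do.call(rbind, lapply(names(dist_list), function(v)
  do.call(rbind, lapply(names(dist_list[[v]]), function(m)
    to_long(dist_list[[v]][[m]], v, m)))))

# One stacked bar chart per variable/metric pair, one bar per sub-population
for (piece in split(long, list(long$variable, long$metric), drop = TRUE)) {
  heights <- xtabs(share ~ unique_values + group, data = piece)
  barplot(heights, legend.text = TRUE,
          main = paste(piece$variable[1], piece$metric[1]))
}
```

Keeping everything in one long data frame means the plotting loop only needs split(), and the same table can be handed to ggplot2 with facets if that is preferred.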
Here's a start:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
lst
$alpha
$alpha$a
X1 X2
1 1 3
2 2 4
$alpha$b
X1 X2 X3
1 6 8 10
2 7 9 11
$beta
$beta$c
X1 X2
1 11 13
2 12 14
#We can subset by number or by name
lst[['alpha']]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
lst[[1]]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
#The dollar sign naming convention reminds us that we are looking at a list.
#Let's sum the columns of both data frames in the alpha list
lapply(lst[['alpha']], colSums)
$a
X1 X2
3 7
$b
X1 X2 X3
13 17 21
Let's try to find the sum of each column of each data frame:
lapply(lst, colSums)
Error in FUN(X[[i]], ...) :
'x' must be an array of at least two dimensions
What happened? R is correctly refusing to run an array function on a list. The function colSums needs to be fed a data frame, matrix, or other array with at least two dimensions. We have to nest one lapply call inside another. The logic can get complicated:
lapply(lst, function(x) lapply(x, colSums))
$alpha
$alpha$a
X1 X2
3 7
$alpha$b
X1 X2 X3
13 17 21
$beta
$beta$c
X1 X2
23 27
We can use rbind to put data.frames together:
rbind(lst$alpha$a, lst$beta$c)
X1 X2
1 1 3
2 2 4
3 11 13
4 12 14
Be sure not to do it the way you might be thinking (I've done it many times):
do.call(rbind, lst)
a b
alpha List,2 List,3
beta List,2 List,2
That isn't the result you're looking for. And make sure that the dimensions and column names are the same:
do.call(rbind, lst[[1]])
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
R is refusing to combine data frames that have 2 columns in one (alpha$a) and three columns in the other (alpha$b).
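I changed lst so that alpha$b has two columns like the others and combined them:
#lst2 is lst with alpha$b reshaped to 3 rows x 2 columns, matching the output below
lst2 <- lst
lst2$alpha$b <- data.frame(matrix(6:11, 3))
bind1 <- lapply(lst2, function(x) do.call(rbind, x))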
bind1
$alpha
X1 X2
a.1 1 3
a.2 2 4
b.1 6 9
b.2 7 10
b.3 8 11
$beta
X1 X2
c.1 11 13
c.2 12 14
That combines the elements of each list. Now I can combine the outer list to make one big data frame.
do.call(rbind, bind1)
X1 X2
alpha.a.1 1 3
alpha.a.2 2 4
alpha.b.1 6 9
alpha.b.2 7 10
alpha.b.3 8 11
beta.c.1 11 13
beta.c.2 12 14
Here's a strategy based on melting a list (recursively):
lst = list(alpha= list(a= data.frame(matrix(1:4, 2)),
b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
library(reshape2)
m = melt(lst, id=1:2)
library(ggplot2)
ggplot(m, aes(X1,X2)) + geom_bar(stat="identity") + facet_grid(L1~L2)
I have three sets of identifiers: "x", "y", and "z". I also have two, 2-column data frames that each map one set of identifiers to another set of identifiers.
x2y = data.frame( x = c("A","A","B","B","C","D","E","F"),
y = c(1,2,1,2,3,4,4,5) )
y2z = data.frame( y = c(1,1,2,3,4,4,5,5,5),
z = c(1,2,3,3,6,7,6,7,8) )
This can be visualized in the figure below. Note that each arrow corresponds to one row in a data frame.
Question:
How do I use these two mappings (two data frames) to make a mapping from x to z (displayed on the right of the figure above)? I think of it as a "transitive mapping": x to y and y to z gives x to z.
The data frame that I would like is...
x2z = data.frame( x = c("A","A","A","B","B","B","C","D","D","E","E","F","F","F"),
z = c(1,2,3,1,2,3,3,6,7,6,7,6,7,8) )
Notes: My data frames are usually ~50,000 rows, so efficient code is very important. When I've solved this problem with loops, it took several minutes to run.
My only requirement is that the code be in R.
You want to merge:
merge(x2y, y2z)[c('x','z')]
## x z
## 1 A 1
## 2 A 2
## 3 B 1
## 4 B 2
## 5 A 3
## 6 B 3
## 7 C 3
## 8 D 6
## 9 D 7
## 10 E 6
## 11 E 7
## 12 F 6
## 13 F 7
## 14 F 8
It helps here that the names agree where necessary.
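One caveat worth sketching: if two different y values for the same x lead to the same z, the merge contains that (x, z) pair twice. The example data happens to have no such case, but wrapping the result in unique() guards against it, and sorting reproduces the row order of the desired x2z. The by = "y" argument is spelled out here only for clarity.

```r
x2y <- data.frame(x = c("A", "A", "B", "B", "C", "D", "E", "F"),
                  y = c(1, 2, 1, 2, 3, 4, 4, 5))
y2z <- data.frame(y = c(1, 1, 2, 3, 4, 4, 5, 5, 5),
                  z = c(1, 2, 3, 3, 6, 7, 6, 7, 8))

# Join on the shared identifier, keep the outer columns, drop repeated pairs
x2z <- unique(merge(x2y, y2z, by = "y")[c("x", "z")])

# Sort by x then z and reset the row names
x2z <- x2z[order(x2z$x, x2z$z), ]
rownames(x2z) <- NULL
x2z
```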