I'm trying to explore a large dataset, both with data frames and with charts. I'd like to analyze the distribution of each variable by different metrics (e.g., sum(x), sum(x*y)) and for different sub-populations. I have 4 sub-populations, 2 metrics, and many variables.
In order to accomplish that, I've made a list structure such as this:
$variable1
...$metric1 <--- that's a df.
...$metric2
$variable2
...$metric1
...$metric2
Inside one of the data frames (e.g., list$variable1$metric1), I've calculated the distribution of variable1's unique values for each of the four population groups (one column per group). It looks like this:
$variable1$metric1
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.278 0.317 0.278 0.317
3 (3) 26-34 Years Old 0.225 0.228 0.225 0.228
4 (4) 35 or Older 0.497 0.456 0.497 0.456
$variable1$metric2
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.544 0.406 0.544 0.406
3 (3) 26-34 Years Old 0.197 0.310 0.197 0.310
4 (4) 35 or Older 0.259 0.284 0.259 0.284
What I'm trying to figure out is a good way to loop through the list of lists (probably melting the DFs in the process) and then output a ton of bar charts. In this case, the natural plot format would be, for each dataframe, a stacked bar chart with one stacked bar for each sub-population, grouping by the variable's unique values.
But I'm not familiar with iterated plotting, so I've hit a dead end. How might I plot from that list structure? Alternatively, is there a better structure in which I should be storing this information?
I find nested lists to be pretty tricky to work with, so I would combine them all into a single data frame that labels the name of the variable and the name of the metric:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:9, 2))), beta = list(c = data.frame(matrix(11:14, 2))))
level1 <- lapply(lst, function(x) do.call(rbind, lapply(names(x), function(y) {x[[y]]$metric=y ; x[[y]]})))
dat <- do.call(rbind, lapply(names(level1), function(x) {level1[[x]]$variable=x ; level1[[x]]}))
dat
# X1 X2 metric variable
# 1 1 3 a alpha
# 2 2 4 a alpha
# 3 6 8 b alpha
# 4 7 9 b alpha
# 5 11 13 c beta
# 6 12 14 c beta
Now you can use standard tools for manipulating a single data frame to perform your data analysis.
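As a rough sketch of how the same flattening could feed the stacked bar charts asked about in the question: the block below rebuilds a cut-down version of the question's list (only two of the four sub-population columns and the three non-NA age groups), flattens it, and draws one chart per variable/metric pair. The names flat, pieces and shares are made up for illustration, and base barplot is only a stand-in for whichever plotting package you prefer.

```r
# Cut-down stand-in for the question's structure (values copied from the post)
ages <- c("18-25", "26-34", "35 or Older")
lst <- list(variable1 = list(
  metric1 = data.frame(unique_values = ages,
                       med_all  = c(0.278, 0.225, 0.497),
                       med_none = c(0.317, 0.228, 0.456)),
  metric2 = data.frame(unique_values = ages,
                       med_all  = c(0.544, 0.197, 0.259),
                       med_none = c(0.406, 0.310, 0.284))))

# Flatten: one data frame with 'variable' and 'metric' label columns
flat <- do.call(rbind, lapply(names(lst), function(v) {
  do.call(rbind, lapply(names(lst[[v]]), function(m) {
    cbind(variable = v, metric = m, lst[[v]][[m]])
  }))
}))

# One piece per variable/metric pair, then one stacked bar chart per piece
pieces <- split(flat, list(flat$variable, flat$metric), drop = TRUE)
for (nm in names(pieces)) {
  shares <- as.matrix(pieces[[nm]][, c("med_all", "med_none")])
  rownames(shares) <- pieces[[nm]]$unique_values
  # columns become bars (sub-populations); rows are stacked (unique values)
  barplot(shares, main = nm, legend.text = TRUE)
}
```

Because every chart is drawn from a slice of one data frame, adding more variables or metrics only means adding more elements to the list; the loop itself does not change.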
Here's a start:
lst <- list(alpha= list(a= data.frame(matrix(1:4, 2)), b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
lst
$alpha
$alpha$a
X1 X2
1 1 3
2 2 4
$alpha$b
X1 X2 X3
1 6 8 10
2 7 9 11
$beta
$beta$c
X1 X2
1 11 13
2 12 14
#We can subset by number or by name
lst[['alpha']]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
lst[[1]]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
#The dollar sign naming convention reminds us that we are looking at a list.
#Let's sum the columns of both data frames in the alpha list
lapply(lst[['alpha']], colSums)
$a
X1 X2
3 7
$b
X1 X2 X3
13 17 21
Let's try to find the sum of each column of each data frame:
lapply(lst, colSums)
Error in FUN(X[[i]], ...) :
'x' must be an array of at least two dimensions
What happened? R is correctly refusing to run an array function on a list. colSums needs a data frame, a matrix, or another array with at least two dimensions, but each element of lst is itself a list of data frames. We have to nest one lapply inside another so that colSums only ever sees the data frames. The logic can get complicated:
lapply(lst, function(x) lapply(x, colSums))
$alpha
$alpha$a
X1 X2
3 7
$alpha$b
X1 X2 X3
13 17 21
$beta
$beta$c
X1 X2
23 27
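To see why the first call failed and the nested one worked, it can help to check what each level of the list actually holds. This is only a small sketch using the same lst as above; sums is a name introduced here for illustration.

```r
lst <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                         b = data.frame(matrix(6:11, 2))),
            beta  = list(c = data.frame(matrix(11:14, 2))))

# The outer level holds plain lists, so colSums cannot be applied there
sapply(lst, class)

# The inner level holds data frames, which colSums accepts
sapply(lst$alpha, class)

# Outer lapply walks the lists; inner lapply walks the data frames
sums <- lapply(lst, function(inner) lapply(inner, colSums))
sums$beta$c
```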
We can use rbind to put data.frames together:
rbind(lst$alpha$a, lst$beta$c)
X1 X2
1 1 3
2 2 4
3 11 13
4 12 14
Be sure not to do it the way you might be thinking (I've done it many times):
do.call(rbind, lst)
a b
alpha List,2 List,3
beta List,2 List,2
That isn't the result you're looking for. And make sure that the dimensions and column names are the same:
do.call(rbind, lst[[1]])
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
R is refusing to combine data frames that have two columns in one (alpha$a) and three columns in the other (alpha$b).
I changed lst so that alpha$b has two columns like the others (calling the result lst2) and combined them:
lst2 <- lst; lst2$alpha$b <- data.frame(matrix(6:11, ncol = 2))
bind1 <- lapply(lst2, function(x) do.call(rbind, x))
bind1
$alpha
X1 X2
a.1 1 3
a.2 2 4
b.1 6 9
b.2 7 10
b.3 8 11
$beta
X1 X2
c.1 11 13
c.2 12 14
That combines the elements of each list. Now I can combine the outer list to make one big data frame.
do.call(rbind, bind1)
X1 X2
alpha.a.1 1 3
alpha.a.2 2 4
alpha.b.1 6 9
alpha.b.2 7 10
alpha.b.3 8 11
beta.c.1 11 13
beta.c.2 12 14
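The list names now survive only inside the row names. A possible follow-up, sketched below, is to split those row names back into proper columns so they can be used for grouping. It assumes none of the list names contain a dot; big, parts, outer and inner are illustrative names, and lst2 is rebuilt here so the block runs on its own.

```r
lst2 <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                          b = data.frame(matrix(6:11, ncol = 2))),
             beta  = list(c = data.frame(matrix(11:14, 2))))

# Same two-step rbind as above, in a single expression
big <- do.call(rbind, lapply(lst2, function(x) do.call(rbind, x)))

# Row names look like "alpha.a.1": outer name, inner name, row index
parts <- do.call(rbind, strsplit(rownames(big), ".", fixed = TRUE))
big$outer <- parts[, 1]
big$inner <- parts[, 2]
rownames(big) <- NULL
big
```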
Here's a strategy based on melting a list (recursively):
lst = list(alpha= list(a= data.frame(matrix(1:4, 2)),
b= data.frame(matrix(6:11, 2))),
beta = list(c = data.frame(matrix(11:14, 2))))
library(reshape2)
m = melt(lst, id=1:2)
library(ggplot2)
ggplot(m, aes(X1,X2)) + geom_bar(stat="identity") + facet_grid(L1~L2)
Here is a sample of my data frame:
Site Plot Plot_size Sp1 Sp2 Sp3 ... Sp108
C 1 N/A 25 4 3 12
G 1 N/A 30 35 5 22
M 1 S 10 15 7 37
M 1 M 2 3 1 7
M 1 L 2 3 1 7
created using this code:
plots_cond <- trees_only %>%
group_by(Site, Plot, Plot_size) %>%
count(Species) %>%
pivot_wider(names_from = Species, values_from= n)
Variables Sp1 ... Sp108 (columns 7:114) are species counts. I need to convert these count values to density per hectare (i.e. count value/hectares).
Hectare values differ according to Site, and Site + Plot_size for Site M:
C: 0.005
G: 7.5
MS: 1.6
MM: 8
ML: 16
I'm an extreme beginner and having trouble constructing an appropriate script to execute this. I believe I need to select columns 7:114 and divide by the associated hectare value if Site=='' or Site=='' + Plot_size == '', but I don't know how to put this all together into a cohesive piece of code.
Any assistance would be greatly appreciated!
You already used the tidyverse packages to prepare your dataset, therefore I will also use these packages.
First of all, it helps to create a column of hectare values and use it as the divisor later on. I combined the Site and Plot_size columns into one key with the paste0 function; to ignore the unwanted N/A in the Plot_size column, I used ifelse (N/A is replaced with an empty string). In a second step, case_when replaces the resulting codes (C, G, MS, MM, ML) with the hectare values you gave.
In a last step, all species counts are divided by the new Hectars column and saved in new columns with the suffix "_density" (across applies the same transformation to multiple columns).
Data
df <- read.table(text = "Site Plot Plot_size Sp1 Sp2 Sp3
C 1 N/A 25 4 3
G 1 N/A 30 35 5
M 1 S 10 15 7
M 1 M 2 3 1
M 1 L 2 3 1 ",
stringsAsFactors = FALSE,
header = TRUE)
Code
library(tidyverse)
df %>%
# create a hectare key by combining Site and Plot_size
mutate(Hectars = paste0(Site, ifelse(Plot_size == "N/A",
"", Plot_size)),
# replace the combination with numbers
Hectars = case_when(Hectars == "C" ~ 0.005,
Hectars == "G" ~ 7.5,
Hectars == "MS" ~ 1.6,
Hectars == "MM" ~ 8,
Hectars == "ML" ~ 16,
TRUE ~ NA_real_),
# divide each species count by the hectare value
across(starts_with("Sp"),
~ ./Hectars,
.names = "{.col}_density"))
Output
Site Plot Plot_size Sp1 Sp2 Sp3 Hectars Sp1_density Sp2_density Sp3_density
1 C 1 N/A 25 4 3 0.005 5000.000 800.000000 600.0000000
2 G 1 N/A 30 35 5 7.500 4.000 4.666667 0.6666667
3 M 1 S 10 15 7 1.600 6.250 9.375000 4.3750000
4 M 1 M 2 3 1 8.000 0.250 0.375000 0.1250000
5 M 1 L 2 3 1 16.000 0.125 0.187500 0.0625000
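For comparison, here is a rough base-R sketch of the same idea that uses a named lookup vector instead of case_when. It is an alternative, not the method used above; hectares, key, sp_cols and result are names made up for this example, and the sample rows are the ones from the question.

```r
df <- data.frame(Site = c("C", "G", "M", "M", "M"),
                 Plot = 1,
                 Plot_size = c("N/A", "N/A", "S", "M", "L"),
                 Sp1 = c(25, 30, 10, 2, 2),
                 Sp2 = c(4, 35, 15, 3, 3),
                 Sp3 = c(3, 5, 7, 1, 1))

# Hectare values from the question, keyed by Site (+ Plot_size for Site M)
hectares <- c(C = 0.005, G = 7.5, MS = 1.6, MM = 8, ML = 16)

# Build the lookup key per row, dropping the "N/A" plot sizes
key <- paste0(df$Site, ifelse(df$Plot_size == "N/A", "", df$Plot_size))

# Divide every species column by the matching hectare value, row by row
sp_cols <- grep("^Sp", names(df), value = TRUE)
dens <- df[sp_cols] / unname(hectares[key])
names(dens) <- paste0(sp_cols, "_density")
result <- cbind(df, dens)
result
```

A key that is missing from hectares yields NA for that row, which plays the same role as the TRUE ~ NA_real_ fallback in the case_when version.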
I want to save residuals from a linear model to a dataframe. I was trying to do it with the line of code (note that this was supposed to go inside a loop):
resi <- NULL
resi <- cbind(resi, colnames(dados[1])=residuals(m))
Here I intended to save the residuals vector from my model m under the same column name as in the dados object (which is basically a date), but I get the error:
Error: unexpected '=' in "resi <- cbind(resi, colnames(dados[1])="
You want `colnames<-`().
cbind(d, `colnames<-`(d, letters[1:4]))
# X1 X2 X3 X4 a b c d
# 1 1 4 7 10 1 4 7 10
# 2 2 5 8 11 2 5 8 11
# 3 3 6 9 12 3 6 9 12
It's similar to setNames() but also compatible with matrices.
Toy data
d <- data.frame(matrix(1:12, 3, 4))
It is also possible to do this with tibble, using !! and := to set the column name dynamically:
library(tibble)
tibble(resi, !!colnames(dados)[1] :=residuals(m))
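Since the question mentions doing this inside a loop, here is a minimal base-R sketch of that pattern using [[<- with a name held in a variable. The dados values, the predictor x and the model formula are all invented for illustration; only the naming technique is the point.

```r
# Hypothetical stand-in for 'dados': columns named by date
dados <- data.frame(`2020-01-01` = c(1.1, 1.9, 3.2, 3.8, 5.1),
                    `2020-02-01` = c(2.0, 2.5, 2.9, 3.8, 4.1),
                    check.names = FALSE)
x <- 1:5  # made-up predictor

# Start with a zero-column data frame that already has the right rows
resi <- data.frame(row.names = seq_len(nrow(dados)))

for (nm in colnames(dados)) {
  m <- lm(dados[[nm]] ~ x)
  # [[<- accepts the column name as a string, unlike cbind(name = value)
  resi[[nm]] <- residuals(m)
}
resi
```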
I am trying to make a function in R which returns a list of dataframes which are subsetted by each factor level.
An example to help explain what I am trying to do;
#Creating a dataset for my example
f1<-c("a","a","b","b","c","c")
f2<-c("x","y","x","y","x","y")
v1<-c(1:6)
v2<-c(7:12)
factors<-as.data.frame(cbind(f1,f2))
integers<-as.data.frame(cbind(v1,v2))
df<-cbind(factors,integers)
#The function
partition<-function(data){
factors<-Filter(is.factor,data) #Splitting data into factors
subsets<-list(NULL) #Creating an empty list where I will put the subsets
nm=0
for( i in 1:ncol(factors)){
nm=nm+nlevels(factors[,i])
}
nm
for( i in 1:ncol(factors)){
for(j in 1:nlevels(factors[,i])){
for(k in 1:nm){
subsets[[k]]<-df[which(factors[,i]==levels(factors[,i])[j]), ]
}
}
}
return(subsets)
}
partition(df)
This yields:
[[1]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[2]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[3]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[4]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[5]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
As you can see, these are all the same dataset. If I remove the loop over k, the datasets are all different and subset properly; however, it only gives me three datasets (as the last factor variable has only two levels, the third element keeps the subset where f1 == "c").
Removing the for loop over k we get;
[[1]]
f1 f2 v1 v2
1 a x 1 7
3 b x 3 9
5 c x 5 11
[[2]]
f1 f2 v1 v2
2 a y 2 8
4 b y 4 10
6 c y 6 12
[[3]]
f1 f2 v1 v2
5 c x 5 11
6 c y 6 12
Where we are missing the subsets where f1 == "a" and f1 == "b"
Note that I should get 5 data frames, as we have 3 + 2 factor levels (this is calculated as nm in the first for loop, before the subsetting).
So my question is: how do I get the above to run without overwriting what has already been subsetted?
For some background, this is working towards a classification model that would yield nfactor(df) predictions; I will then run a GLM to weight each prediction.
Thank you for any insight into my problem.
Update
The first answer, from Glen, simplifies my code, which may make the problem I am having more apparent. Here is the updated code (note that it runs much more efficiently on large datasets with the split() function, so thank you Glen).
for(k in 1:nm){
for( i in 1:ncol(factors)){
for( j in 1:nlevels(factors[,i])){
subsets[[k]]<-split(df,factors[,i])[j]
}
}
}
This returns the same result as in my original question. The problem is that when I loop over k from 1 to nm, the loop overwrites what was already generated. How do I stop this from happening?
If I understand your question correctly, you can do this really easily with the split function.
f1<-c("a","a","b","b","c","c")
f2<-c("x","y","x","y","x","y")
v1<-c(1:6)
v2<-c(7:12)
factors<-as.data.frame(cbind(f1,f2))
integers<-as.data.frame(cbind(v1,v2))
df<-cbind(factors,integers)
tmp1=split(df,f1)
tmp2=split(df,f2)
c(tmp1,tmp2)
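If you would rather keep an explicit loop, the overwriting goes away once k is a running counter rather than a third loop: in the original, every k from 1 to nm is assigned on every pass, so only the last subset survives. A minimal sketch follows, with the factor columns created explicitly (on R 4.0.0 or later, as.data.frame(cbind(f1, f2)) gives character columns, so Filter(is.factor, ...) would find nothing).

```r
df <- data.frame(f1 = factor(c("a", "a", "b", "b", "c", "c")),
                 f2 = factor(c("x", "y", "x", "y", "x", "y")),
                 v1 = 1:6,
                 v2 = 7:12)

factors <- Filter(is.factor, df)
subsets <- list()
k <- 0  # running position in the output list

for (i in seq_along(factors)) {
  for (lev in levels(factors[[i]])) {
    k <- k + 1  # advance once per factor level, never revisit a slot
    subsets[[k]] <- df[factors[[i]] == lev, ]
  }
}
length(subsets)
```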
library(plyr)
library(foreach)
x<-foreach(i= colnames(Filter(is.factor,df)), .combine='c') %do%
plyr::dlply(df, i)
This returns a list of 5 data frames. c is used to combine the results of the foreach iterations (each of which is itself a list). Without it we get a list of lists; with c, all the lists are combined into one list.
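The same result can probably be had without extra packages by splitting on each factor column in turn and flattening one level. This is a sketch rather than a drop-in replacement for the plyr version; factor_cols and subsets are illustrative names.

```r
df <- data.frame(f1 = factor(c("a", "a", "b", "b", "c", "c")),
                 f2 = factor(c("x", "y", "x", "y", "x", "y")),
                 v1 = 1:6,
                 v2 = 7:12)

factor_cols <- names(Filter(is.factor, df))

# split() gives one list of data frames per factor column;
# unlist(recursive = FALSE) flattens the list of lists by one level
subsets <- unlist(lapply(factor_cols, function(col) split(df, df[[col]])),
                  recursive = FALSE)
names(subsets)
```

Each element is named after its factor level, so subsets$y is the f2 == "y" subset. If two factors share a level name the names will collide, although the elements themselves are still kept separately.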
Guys I have a code that generates 2 columns of data (e.g Number, Median) which refers to a particular person...but I have taken samples of 7 people
so basically I get this output:
[[1]]
Number Median
1 5
2 3
.....
[[2]]
Number Median
1 6
2 4
....
[[3]]
Number Median
1 3
2 5
So I basically get this output....up til [[7]]
I tried transferring this output in excel using this code
write.csv(cbind(data),"data1.csv")
and I get this type of output:
list(c(Median =.......It lists all the median on the rows
But I want it to save the data referring to the 'median' and 'Number' in columns NOT ROWS
If I just type
write.csv(data,"data1.csv")
I get an error
arguments imply differing number of rows: 157, 179, 178, 180
As Marius said, you have a list of data.frames which can't be written to a .csv file. You need to do:
NewDataFrame <- do.call("rbind", YourList)
write.csv(NewDataFrame, "Data.csv")
do.call calls the function you give it (in this case rbind) once, passing the elements of the list as its arguments, so this is equivalent to rbind(YourList[[1]], YourList[[2]], ...).
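One thing rbind loses is which person each row came from. A small sketch of how you might tag the rows first is below; the three short data frames reuse the numbers shown in the question, the person column is an addition for illustration, and the file goes to a temporary path rather than "Data.csv".

```r
YourList <- list(data.frame(Number = 1:2, Median = c(5, 3)),
                 data.frame(Number = 1:2, Median = c(6, 4)),
                 data.frame(Number = 1:2, Median = c(3, 5)))

# do.call(rbind, list) is the same call as rbind(elem1, elem2, elem3)
same <- identical(do.call("rbind", YourList),
                  rbind(YourList[[1]], YourList[[2]], YourList[[3]]))

# Add a 'person' column to each data frame before stacking them
tagged <- Map(function(d, i) cbind(person = i, d),
              YourList, seq_along(YourList))
NewDataFrame <- do.call("rbind", tagged)

out <- tempfile(fileext = ".csv")
write.csv(NewDataFrame, out, row.names = FALSE)
```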
Here are two options. Both use the following sample data:
myList <- list(data.frame(matrix(1:4, ncol = 2)),
data.frame(matrix(3:10, ncol = 2)),
data.frame(matrix(11:14, ncol =2)))
myList
# [[1]]
# X1 X2
# 1 1 3
# 2 2 4
#
# [[2]]
# X1 X2
# 1 3 7
# 2 4 8
# 3 5 9
# 4 6 10
#
# [[3]]
# X1 X2
# 1 11 13
# 2 12 14
Option 1: Write a csv file where the data.frames are presented as they are in the list
sink("list_of_dataframes.csv", type="output")
invisible(lapply(myList, function(x) dput(write.csv(x))))
sink()
If you open the resulting "list_of_dataframes.csv" file in a text editor, you will get something that looks like this. When you read this into a spreadsheet program, the first column will include the rownames and NULL separating each data.frame:
"","X1","X2"
"1",1,3
"2",2,4
NULL
"","X1","X2"
"1",3,7
"2",4,8
"3",5,9
"4",6,10
NULL
"","X1","X2"
"1",11,13
"2",12,14
NULL
Option 2: Write or search around for a version of cbind that accommodates binding data.frames with differing number of rows.
Here is one such function that I've written.
cbind2 <- function(datalist) {
nrows <- max(sapply(datalist, nrow))
expandmyrows <- function(mydata, rowsneeded) {
temp1 = names(mydata)
rowsneeded = rowsneeded - nrow(mydata)
temp2 = setNames(data.frame(
matrix(rep(NA, length(temp1) * rowsneeded),
ncol = length(temp1))), temp1)
rbind(mydata, temp2)
}
do.call(cbind, lapply(datalist, expandmyrows, rowsneeded = nrows))
}
And here is that function applied to your list:
cbind2(myList)
# X1 X2 X1 X2 X1 X2
# 1 1 3 3 7 11 13
# 2 2 4 4 8 12 14
# 3 NA NA 5 9 NA NA
# 4 NA NA 6 10 NA NA
That output should be easy for you to use with write.csv and related functions.
With one more requirement - that the resulting vector is in the same order as the original.
I have a very basic function that percentiles a vector, and works just the way I want it to do:
ptile <- function(x) {
p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
p[p > 1] <- NA
p
}
data <- c(1, 2, 3, 100, 200, 300)
For example, ptile(data) generates:
[1] 0.0 0.2 0.4 0.6 0.8 1.0
What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:
0.0 0.5 0.0 1.0 0.5 1.0
As a shot in the dark, I tried:
tapply(data,f,ptile)
and see that it does, in fact, succeed at doing the ranking/percentiling, but does so in a way that I have no idea which observations match up to their indices in the original vector:
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0
$b
[1] 0.0 0.5 1.0
This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to be in the same order as the way it went in, in order for everything to line up, row by row in my matrix.
Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?
Thanks in advance!
The ave function is very useful. The main gotcha is to remember that you always need to name the function with FUN=:
dt <- data.frame(data, f)
dt$rank <- with(dt, ave(data, list(f), FUN=rank))
dt
#---
data f rank
1 1 a 1
2 2 a 2
3 3 b 1
4 100 a 3
5 200 b 2
6 300 b 3
Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:
> dt$ptile <- with(dt, ave(data, list(f), FUN=ptile))
> dt
data f rank ptile
1 1 a 1 0.0
2 2 a 2 0.5
3 3 b 1 0.0
4 100 a 3 1.0
5 200 b 2 0.5
6 300 b 3 1.0
For what you are trying to do, I would first put the stock, sector, value as columns in a data-frame. E.g with some made-up data:
> set.seed(1)
> df <- data.frame(stock = 1:10,
+ sector = sample(letters[1:2], 10, repl = TRUE),
+ val = sample(1:10))
> df
stock sector val
1 1 a 3
2 2 a 2
3 3 b 6
4 4 b 10
5 5 a 5
6 6 b 7
7 7 b 8
8 8 b 4
9 9 b 1
10 10 a 9
Then you can use the ddply function from the plyr package to do the "sectorwise" percentile (there are other ways, but I find the plyr to be very useful, and would recommend you take a look at it):
require(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))
Now of course in df.p the rows will be arranged by the factor (i.e. sector), and it's a simple matter to restore it to the original order, e.g.:
> df.p[ order(df.p$stock),]
stock sector val pct
1 1 a 3 0.3333333
2 2 a 2 0.0000000
5 3 b 6 0.4000000
6 4 b 10 1.0000000
3 5 a 5 0.6666667
7 6 b 7 0.6000000
8 7 b 8 0.8000000
9 8 b 4 0.2000000
10 9 b 1 0.0000000
4 10 a 9 1.0000000
In particular the pct column is the final vector you are seeking in your original question.
When you call tapply() with INDEX=f you get a result that is subsetted by f and broken into a list in order of the levels of f. To reverse that process, simply:
unlist(tapply(data, f, ptile))[order(order(f))]
Your example data vector happened to be in numeric order already, but this works even if the data is in random order...
ptile <- function(x) {
p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
p[p > 1] <- NA
# concatenated with the original data to make the match clear
paste(round(p * 100, 2), x, sep="% ")
}
data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]
data.frame(result, data, f)
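To make the order(order(f)) step less mysterious, here is the question's original example worked through piece by piece. order(f) is the permutation that tapply effectively applies when it groups by level, and order() of that permutation is its inverse. The names o, inv, grouped and restored are just labels for this sketch.

```r
ptile <- function(x) {
  p <- (rank(x) - 1) / (length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}
data <- c(1, 2, 3, 100, 200, 300)
f <- factor(c("a", "a", "b", "a", "b", "b"))

o   <- order(f)   # positions of the data once grouped by level
inv <- order(o)   # inverse permutation: sends grouped order back to original

# Percentiles in grouped order (all of "a", then all of "b")
grouped <- unlist(tapply(data, f, ptile), use.names = FALSE)

# Put them back in the order of the original vector
restored <- grouped[inv]
restored

# Cross-check against ave(), which keeps the original order by itself
stopifnot(isTRUE(all.equal(restored, ave(data, f, FUN = ptile))))
```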