Conditionally divide and replace values in certain columns in R

Here is a sample of my data frame:
Site Plot Plot_size Sp1 Sp2 Sp3 ... Sp108
C 1 N/A 25 4 3 12
G 1 N/A 30 35 5 22
M 1 S 10 15 7 37
M 1 M 2 3 1 7
M 1 L 2 3 1 7
created using this code:
plots_cond <- trees_only %>%
  group_by(Site, Plot, Plot_size) %>%
  count(Species) %>%
  pivot_wider(names_from = Species, values_from = n)
Variables Sp1 ... Sp108 (columns 7:114) are species counts. I need to convert these count values to density per hectare (i.e. count value/hectares).
Hectare values differ according to Site, and Site + Plot_size for Site M:
C: 0.005
G: 7.5
MS: 1.6
MM: 8
ML: 16
I'm an extreme beginner and am having trouble constructing an appropriate script to execute this. I believe I need to select columns 7:114 and divide by the associated hectare value depending on Site (or, for Site M, on Site plus Plot_size), but I don't know how to put this all together into a cohesive piece of code.
Any assistance would be greatly appreciated!

You already used the tidyverse packages to prepare your dataset, so I will use these packages as well.
First of all, it might be helpful to create a column for hectares and use this column as a divisor later on. So I combined the Site and Plot_size columns into one with the paste0 function. To ignore the unwanted N/A values inside the Plot_size column, I used the ifelse function (N/A is replaced with nothing). In a second step, the hectare codes (C, G, MS, MM, ML) are replaced with your given hectare values by the case_when function.
In a last step, all species counts are divided by the new hectare column and saved in new columns with the suffix "_density" (across applies the same transformation to multiple columns).
Data
df <- read.table(text = "Site Plot Plot_size Sp1 Sp2 Sp3
C 1 N/A 25 4 3
G 1 N/A 30 35 5
M 1 S 10 15 7
M 1 M 2 3 1
M 1 L 2 3 1",
                 stringsAsFactors = FALSE,
                 header = TRUE)
Code
library(tidyverse)
df %>%
  # create a hectare code by combining Site and Plot_size
  mutate(Hectars = paste0(Site, ifelse(Plot_size == "N/A",
                                       "", Plot_size)),
         # replace each code with its hectare value
         Hectars = case_when(Hectars == "C" ~ 0.005,
                             Hectars == "G" ~ 7.5,
                             Hectars == "MS" ~ 1.6,
                             Hectars == "MM" ~ 8,
                             Hectars == "ML" ~ 16,
                             TRUE ~ NA_real_),
         # divide each species count by hectares
         across(starts_with("Sp"),
                ~ . / Hectars,
                .names = "{.col}_density"))
Output
Site Plot Plot_size Sp1 Sp2 Sp3 Hectars Sp1_density Sp2_density Sp3_density
1 C 1 N/A 25 4 3 0.005 5000.000 800.000000 600.0000000
2 G 1 N/A 30 35 5 7.500 4.000 4.666667 0.6666667
3 M 1 S 10 15 7 1.600 6.250 9.375000 4.3750000
4 M 1 M 2 3 1 8.000 0.250 0.375000 0.1250000
5 M 1 L 2 3 1 16.000 0.125 0.187500 0.0625000
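As an aside, and not part of the original answer: the same Site/Plot_size-to-hectares mapping can be kept in a small lookup table and joined on, which may be easier to maintain if more combinations are added later. A sketch, assuming the same df and that library(tidyverse) is loaded:
lookup <- tibble::tribble(
  ~Key, ~Hectares,
  "C",  0.005,
  "G",  7.5,
  "MS", 1.6,
  "MM", 8,
  "ML", 16)
df %>%
  # same Site + Plot_size code as above, ignoring the N/A values
  mutate(Key = paste0(Site, ifelse(Plot_size == "N/A", "", Plot_size))) %>%
  left_join(lookup, by = "Key") %>%
  mutate(across(starts_with("Sp"), ~ . / Hectares, .names = "{.col}_density"))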

Related

How to make a function to get average (B) when (A) is a certain condition on data.frame in R

#this is for setting
A <- c(1,1,2,2,2,3,4,4,5,5,5)
B <- c(1:10)
C <- c(11:20)
ABC <- data.frame(A,B,C)
#so, I made up my own ABC like this
A B C
1 1 1 11
2 1 2 12
3 2 3 13
4 2 4 14
5 2 5 15
6 3 6 16
7 4 7 17
8 4 8 18
9 5 9 19
10 5 10 20
On this setting, I want to know how to get the average of (B) or (C) when (A) meets a specific condition.
For example, if condition (A) is 2:4, get mean(B) and mean(C):
new_ABC <- subset(ABC, ABC$A >= 2 & ABC$A <= 4)
mean(new_ABC$B)
mean(new_ABC$C)
and it works.
But I have tried for several days to make a function like this, and I have no idea...
getMeanB <- function(condition){
  for(i in min(condition):max(condition)){
    # I do not really know what to do..
  }
}
Any help would be much appreciated!!
If the argument 'condition' is a vector, then we can do it like this:
getMean <- function(data, condition, cName) {
  minC <- min(condition)
  maxC <- max(condition)
  i1 <- data[[cName]] >= minC & data[[cName]] <= maxC
  colMeans(data[i1, setdiff(names(data), cName)], na.rm = TRUE)
}
getMean(ABC, 2:4, "A")
# B C
# 5.5 15.5
NOTE: The 'data' and 'cName' arguments are included to make the function more dynamic, so it can be applied to other datasets with different column names.
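For instance, a usage sketch on the built-in iris data (restricted to its numeric columns so colMeans applies):
# mean of the other numeric columns where Sepal.Length is between 5 and 6
getMean(iris[1:4], c(5, 6), "Sepal.Length")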

Getting the mean of two rows in data frame based on the condition of one column

I have the following sample data frame:
df<-data.frame(A=c(rep(1:4,4)),B=runif(16,min=0,max=16),c=rnorm(16, mean=6,sd=2))
I want to collapse the data frame by creating a new value for column A as the average of two existing values. For instance, create a new data frame with values of A as 1,2.5,4 instead of 1:4. In doing so, the rows of the df containing 2 and 3 will be removed and the new df will contain only 12 rows instead of 16.
EDIT
Desired output is some thing like this.
A B c
1 1 8.248871 4.402726
2 2.5 11.694196 4.2878085
4 4 8.036312 2.014886
5 1 9.828333 3.240945
6 2.5 1.151633 2.918831
8 4 12.176981 4.874183
9 1 14.067821 5.480923
10 2.5 6.145208 6.139448
12 4 6.352356 2.586025
13 1 3.423057 5.114978
14 2.5 11.005555 3.265489
16 4 14.579750 3.783269
I would write a collapse function (the omit argument, defaulting to NULL, names any columns to exclude from FUN):
collapse_df <- function(mydf, column, pair, FUN, omit = NULL) {
  ind <- which(mydf[, column] %in% pair[1])
  ind2 <- mydf[, column] %in% pair[-1]
  m <- mapply(function(x, y) lapply(mydf[x:y, !names(mydf) %in% omit], FUN),
              ind, ind + 1L)
  mydf[ind, ] <- cbind.data.frame(mydf[ind, omit], t(m))
  mydf[!ind2, ]
}
collapse_df(df, "A", pair = 2:3, FUN = mean)
# A B c
# 1 1.0 1.060170 4.797753
# 2 2.5 9.577249 6.635214
# 4 4.0 5.612720 3.413631
# 5 1.0 1.734932 10.487560
# 6 2.5 9.577249 6.635214
# 8 4.0 6.529387 5.760596
# 9 1.0 2.517647 5.469165
# 10 2.5 9.577249 6.635214
# 12 4.0 4.243273 6.493916
# 13 1.0 10.118011 4.431953
# 14 2.5 9.577249 6.635214
# 16 4.0 1.563981 5.047428
We can also summarise with other functions like the median or sum:
collapse_df(df, column="A", pair=2:3, FUN=median)
collapse_df(df, "A", 2:3, sum)
collapse_df(iris[-5], column=1, seq(5,6,.1), median)
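If you prefer the tidyverse, here is a rough dplyr sketch of the same collapse (an alternative to the function above, assuming the df from the question, where each repetition of 1:4 forms one block of four rows):
library(dplyr)
df %>%
  mutate(block = cumsum(A == 1),              # each new 1 starts a block
         A = ifelse(A %in% 2:3, 2.5, A)) %>%  # recode 2 and 3 to their midpoint
  group_by(block, A) %>%
  summarise(across(c(B, c), mean), .groups = "drop")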

Creating balanced groups based on three categorical variables

I'm creating a group assignment for a college class (~180 students) I'm instructing. It's important that these groups be as heterogeneous as possible across three variables: field of study (FOS), sex, and division (i.e., newer/older students).
FOS has 5 levels, sex has 2, division has 2. Given the project, I'd like to create about 8-9 groups. In other words, I'd like groups of approximately 6 with a "good" balance of different fields of study, males/females, and new and older students. I'd then simply post the names with the automated assignments.
The instructor before did it all by hand, but I've tried playing around with R to see if there's a more systematic way of doing this, but only came up with repeated (and clunky) sorting. I expect the 5 FOS levels to vary in size, so I recognize that it will not be a perfect solution. Interested in people's clever solutions. Here's a reproducible sample:
dat <- data.frame(
  student = 1:180,
  gender = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.52, .48)),
                  labels = c("female", "male")),
  division = factor(sample(LETTERS[1:2], 180, replace = TRUE, prob = c(.6, .4)),
                    labels = c("lower", "upper")),
  field = factor(sample(LETTERS[1:5], 180, replace = TRUE,
                        prob = c(.26, .21, .35, .07, .11)),
                 labels = c("humanities", "natural science",
                            "social science", "engineer", "other")))
This is what I was playing with, but it really just increases the randomness of the assignment rather than the balance, as can be seen:
library(dplyr)
dat$rand <- sample(1:180,180)
dat1 <- arrange(dat, field, division, gender, rand)
dat1$grp <- 1:(nrow(dat1)/6) #issue if not divisible
Which does not result in adequate balance:
with(dat1, table(gender, grp)) #as a check
with(dat1, table(field, grp))
with(dat1, table(division, grp))
I know this is an old question, but I had a similar problem today and here's the solution I came up with. Basically, you assign groups randomly, then use either a chi-squared test (for categorical variables) or an ANOVA (for continuous variables) to test for group differences on each variable. You set a threshold below which you do not want the p-values to drop, and the code reshuffles the groups until all p-values are above that threshold. If it goes through 10,000 iterations without reaching a grouping solution, it stops and suggests that you lower the threshold.
set.seed(905)
# say you also have a continuous variable you would like to keep steady across groups
dat$age <- sample(18:35, nrow(dat), replace = TRUE)
dat$group <- rep_len(1:20, length.out = nrow(dat)) # if you wanted to make 20 groups
dat$group <- as.factor(dat$group)
a <- 0.1; b <- 0.1; c <- 0.1; d <- 0.1
thresh <- 0.85  # minimum threshold for p-values
z <- 1
while (a < thresh | b < thresh | c < thresh | d < thresh) {
  dat <- transform(dat, group = sample(group))  # shuffle the groups
  x <- summary(aov(age ~ group, dat))           # ANOVA for continuous variables
  a <- x[[1]]['group', 'Pr(>F)']
  x <- summary(table(dat$group, dat$gender))    # chi-squared for categorical variables
  b <- x[['p.value']]
  x <- summary(table(dat$group, dat$division))
  c <- x[['p.value']]
  x <- summary(table(dat$group, dat$field))
  d <- x[['p.value']]
  z <- z + 1
  if (z > 10000) {
    print('10,000 tries, no solution, reduce threshold')
    break
  }
}
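Once the loop exits, you can verify the balance with the same kind of checks used in the question:
with(dat, table(gender, group))
with(dat, table(field, group))
summary(aov(age ~ group, dat))  # age should no longer differ much by group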
With enough data points per combination of the variables, you should be able to do this:
dat <- groupdata2::fold(dat, k = 8,
                        cat_col = c("gender", "division", "field"))
with(dat, table(gender, .folds))
## .folds
## gender 1 2 3 4 5 6 7 8
## female 11 12 11 12 12 11 12 12
## male 10 11 11 11 11 11 11 11
with(dat, table(field, .folds))
## .folds
## field 1 2 3 4 5 6 7 8
## humanities 5 8 9 7 9 6 6 5
## natural science 2 3 4 6 3 9 2 4
## social science 9 7 6 8 5 6 9 6
## engineer 3 3 2 1 3 0 2 4
## other 2 2 1 1 3 1 4 4
with(dat, table(division, .folds))
## .folds
## division 1 2 3 4 5 6 7 8
## lower 11 15 13 14 10 13 11 15
## upper 10 8 9 9 13 9 12 8

Iterated plotting from list of list of dataframes

I'm trying to explore a large dataset, both with data frames and with charts. I'd like to analyze the distribution of each variable by different metrics (e.g., sum(x), sum(x*y)) and for different sub-populations. I have 4 sub-populations, 2 metrics, and many variables.
In order to accomplish that, I've made a list structure such as this:
$variable1
...$metric1 <--- that's a df.
...$metric2
$variable2
...$metric1
...$metric2
Inside one of the data_frames (e.g., list$variable1$metric1), I've calculated distributions of the unique values for variable1 and for each of the four population groups (represented in columns). It looks like this:
$variable1$metric1
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.278 0.317 0.278 0.317
3 (3) 26-34 Years Old 0.225 0.228 0.225 0.228
4 (4) 35 or Older 0.497 0.456 0.497 0.456
$variable1$metric2
unique_values med_all med_some_not_all med_at_least_some med_none
1 (1) 12-17 Years Old NA NA NA NA
2 (2) 18-25 Years Old 0.544 0.406 0.544 0.406
3 (3) 26-34 Years Old 0.197 0.310 0.197 0.310
4 (4) 35 or Older 0.259 0.284 0.259 0.284
What I'm trying to figure out is a good way to loop through the list of lists (probably melting the DFs in the process) and then output a ton of bar charts. In this case, the natural plot format would be, for each dataframe, a stacked bar chart with one stacked bar for each sub-population, grouping by the variable's unique values.
But I'm not familiar with iterated plotting, so I've hit a dead end. How might I plot from that list structure? Alternatively, is there a better structure in which I should be storing this information?
I find nested lists to be pretty tricky to work with, so I would combine them all into a single data frame that labels the name of the variable and the name of the metric:
lst <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                         b = data.frame(matrix(6:9, 2))),
            beta  = list(c = data.frame(matrix(11:14, 2))))
level1 <- lapply(lst, function(x)
  do.call(rbind, lapply(names(x), function(y) { x[[y]]$metric <- y; x[[y]] })))
dat <- do.call(rbind, lapply(names(level1), function(x) {
  level1[[x]]$variable <- x
  level1[[x]]
}))
dat
# X1 X2 metric variable
# 1 1 3 a alpha
# 2 2 4 a alpha
# 3 6 8 b alpha
# 4 7 9 b alpha
# 5 11 13 c beta
# 6 12 14 c beta
Now you can use standard tools for manipulating a single data frame to perform your data analysis.
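For example, with everything in one labelled data frame, the plotting the question asks about reduces to a single faceted ggplot call. A sketch using the toy columns X1/X2 from above (your real data would map the sub-population columns instead):
library(ggplot2)
ggplot(dat, aes(x = factor(X1), y = X2)) +
  geom_col() +
  facet_grid(variable ~ metric)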
Here's a start:
lst <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                         b = data.frame(matrix(6:11, 2))),
            beta  = list(c = data.frame(matrix(11:14, 2))))
lst
$alpha
$alpha$a
X1 X2
1 1 3
2 2 4
$alpha$b
X1 X2 X3
1 6 8 10
2 7 9 11
$beta
$beta$c
X1 X2
1 11 13
2 12 14
#We can subset by number or by name
lst[['alpha']]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
lst[[1]]
$a
X1 X2
1 1 3
2 2 4
$b
X1 X2 X3
1 6 8 10
2 7 9 11
#The dollar sign naming convention reminds us that we are looking at a list.
#Let's sum the columns of both data frames in the alpha list
lapply(lst[['alpha']], colSums)
$a
X1 X2
3 7
$b
X1 X2 X3
13 17 21
Let's try to find the sum of each column of each data frame:
lapply(lst, colSums)
Error in FUN(X[[i]], ...) :
'x' must be an array of at least two dimensions
What happened? R is correctly refusing to run an array function on a list. The function colSums needs to be fed data frames, matrices, or other arrays with at least two dimensions. We have to nest one lapply inside another; the logic can get complicated:
lapply(lst, function(x) lapply(x, colSums))
$alpha
$alpha$a
X1 X2
3 7
$alpha$b
X1 X2 X3
13 17 21
$beta
$beta$c
X1 X2
23 27
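As an aside, not part of the original walkthrough: the purrr package expresses the same nested traversal a little more compactly.
library(purrr)
map(lst, ~ map(.x, colSums))  # same result as the nested lapply above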
We can use rbind to put data.frames together:
rbind(lst$alpha$a, lst$beta$c)
X1 X2
1 1 3
2 2 4
3 11 13
4 12 14
Be sure not to do it the way you might be thinking (I've done it many times):
do.call(rbind, lst)
a b
alpha List,2 List,3
beta List,2 List,2
That isn't the result you're looking for. And make sure that the dimensions and column names are the same:
do.call(rbind, lst[[1]])
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
R is refusing to combine data frames that have 2 columns in one (alpha$a) and three columns in the other (alpha$b).
I changed the list (call it lst2) so that alpha$b has two columns like the others, and combined them:
# lst2 is lst with alpha$b rebuilt as a two-column data frame
lst2 <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                          b = data.frame(matrix(6:11, 3))),
             beta  = list(c = data.frame(matrix(11:14, 2))))
bind1 <- lapply(lst2, function(x) do.call(rbind, x))
bind1
$alpha
X1 X2
a.1 1 3
a.2 2 4
b.1 6 9
b.2 7 10
b.3 8 11
$beta
X1 X2
c.1 11 13
c.2 12 14
That combines the elements of each list. Now I can combine the outer list to make one big data frame.
do.call(rbind, bind1)
X1 X2
alpha.a.1 1 3
alpha.a.2 2 4
alpha.b.1 6 9
alpha.b.2 7 10
alpha.b.3 8 11
beta.c.1 11 13
beta.c.2 12 14
Here's a strategy based on melting the list (recursively):
lst <- list(alpha = list(a = data.frame(matrix(1:4, 2)),
                         b = data.frame(matrix(6:11, 2))),
            beta  = list(c = data.frame(matrix(11:14, 2))))
library(reshape2)
m <- melt(lst, id = 1:2)
library(ggplot2)
ggplot(m, aes(X1,X2)) + geom_bar(stat="identity") + facet_grid(L1~L2)
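(For reference: melt() on a nested list stores the list-level names in columns named L1 for the outer level, here the variable, and L2 for the inner level, the metric, which is what the facet_grid(L1 ~ L2) call keys on.)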

What is the most efficient way to return ranks of a vector within levels of a factor, as a vector having the same order/length as the original vector?

With one more requirement - that the resulting vector is in the same order as the original.
I have a very basic function that percentiles a vector, and works just the way I want it to do:
ptile <- function(x) {
  p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}
data <- c(1, 2, 3, 100, 200, 300)
For example, ptile(data) generates:
[1] 0.0 0.2 0.4 0.6 0.8 1.0
What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:
0.0 0.5 0.0 1.0 0.5 1.0
As a shot in the dark, I tried:
tapply(data,f,ptile)
and see that it does, in fact, succeed at doing the ranking/percentiling, but does so in a way that I have no idea which observations match up to their indices in the original vector:
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0
$b
[1] 0.0 0.5 1.0
This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to be in the same order as the way it went in, in order for everything to line up, row by row in my matrix.
Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?
Thanks in advance!
The ave function is very useful. The main gotcha is to remember that you always need to name the function with FUN=:
dt <- data.frame(data, f)
dt$rank <- with(dt, ave(data, list(f), FUN=rank))
dt
#---
data f rank
1 1 a 1
2 2 a 2
3 3 b 1
4 100 a 3
5 200 b 2
6 300 b 3
Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:
> dt$ptile <- with(dt, ave(data, list(f), FUN=ptile))
> dt
data f rank ptile
1 1 a 1 0.0
2 2 a 2 0.5
3 3 b 1 0.0
4 100 a 3 1.0
5 200 b 2 0.5
6 300 b 3 1.0
For what you are trying to do, I would first put the stock, sector, and value as columns in a data frame, e.g. with some made-up data:
> set.seed(1)
> df <- data.frame(stock = 1:10,
+ sector = sample(letters[1:2], 10, repl = TRUE),
+ val = sample(1:10))
> df
stock sector val
1 1 a 3
2 2 a 2
3 3 b 6
4 4 b 10
5 5 a 5
6 6 b 7
7 7 b 8
8 8 b 4
9 9 b 1
10 10 a 9
Then you can use the ddply function from the plyr package to do the "sectorwise" percentile (there are other ways, but I find plyr very useful and would recommend you take a look at it):
require(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))
Now of course in df.p the rows will be arranged by the factor (i.e. sector), and it's a simple matter to restore it to the original order, e.g.:
> df.p[ order(df.p$stock),]
stock sector val pct
1 1 a 3 0.3333333
2 2 a 2 0.0000000
5 3 b 6 0.4000000
6 4 b 10 1.0000000
3 5 a 5 0.6666667
7 6 b 7 0.6000000
8 7 b 8 0.8000000
9 8 b 4 0.2000000
10 9 b 1 0.0000000
4 10 a 9 1.0000000
In particular the pct column is the final vector you are seeking in your original question.
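These days the same per-group transform can be written with dplyr, which keeps the input row order so no re-sorting is needed. A sketch, assuming the same df and ptile as above:
library(dplyr)
df %>%
  group_by(sector) %>%
  mutate(pct = ptile(val)) %>%
  ungroup()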
When you call tapply() with INDEX=f you get a result that is subsetted by f and broken into a list in order of the levels of f. To reverse that process, simply:
unlist(tapply(data, f, ptile))[order(order(f))]
Your example data vector happened to be in numeric order already, but this works even if the data is in random order...
ptile <- function(x) {
  p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  # concatenated with the original data to make the match clear
  paste(round(p * 100, 2), x, sep = "% ")
}
data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]
data.frame(result, data, f)
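If the double order() looks like magic: order(f) is the permutation that sorts by f, and taking order() of a permutation yields its inverse, which is exactly what sends each regrouped element back to its original slot. A tiny standalone demo:
f2 <- c("b", "a", "b", "a")
order(f2)                    # 2 4 1 3 : positions pulled when sorting by f2
order(order(f2))             # 3 1 4 2 : the inverse permutation
sort(f2)[order(order(f2))]   # "b" "a" "b" "a" : original order restored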
