Best way to find weighted averages from two dataframes in R?

New to R, so sorry if this is a bit broad, but I'm not really sure where to start with an approach to this problem.
I have two dataframes. df1 contains demographic data from certain Census tracts.
AfricanAmerican AsianAmerican Hispanic White
Tract1 0.25 0.25 0.25 0.25
Tract2 0.50 0.10 0.20 0.10
Tract3 0.05 0.10 0.35 0.50
And df2 contains observation polygons, with the percentage of each polygon's area that each census tract makes up.
Poly1 Poly2 Poly3
Tract1 0.33 0.25 0.00
Tract2 0.33 0.25 0.10
Tract3 0.34 0.50 0.90
What I want to do is get the weighted averages of the demographic data in each observation polygon:
AfricanAmerican AsianAmerican Hispanic White
Poly1 0.26 0.15 0.27 0.29
Poly2 0.21 0.14 0.29 0.34
Poly3 0.10 0.10 0.34 0.46
So far I'm thinking I could do something like
sum(df1$AfricanAmerican * df2$Poly1)
then use a for loop to iterate over all demographic variables for one polygon, and nest that in another for loop to iterate over all polygons (a rough sketch of that loop is below). But given that I have hundreds of Census tracts and polygons in my working dataset, is there a better approach?
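The loop version I have in mind (a rough sketch, using df1 and df2 as shown above):
out <- matrix(NA, nrow = ncol(df2), ncol = ncol(df1),
              dimnames = list(names(df2), names(df1)))
for (p in names(df2)) {       # each observation polygon
  for (dem in names(df1)) {   # each demographic variable
    out[p, dem] <- sum(df1[[dem]] * df2[[p]])
  }
}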

Use colSums of the products in mapply.
t(mapply(function(...) colSums(`*`(...)), list(df1), df2))
# AfricanAmerican AsianAmerican Hispanic White
# [1,] 0.2645 0.1495 0.2675 0.2855
# [2,] 0.2125 0.1375 0.2875 0.3375
# [3,] 0.0950 0.1000 0.3350 0.4600
If you want to round to two digits, just wrap round(..., 2) around it.
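For example:
round(t(mapply(function(...) colSums(`*`(...)), list(df1), df2)), 2)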
Data:
df1 <- read.table(header=T, text='
AfricanAmerican AsianAmerican Hispanic White
Tract1 0.25 0.25 0.25 0.25
Tract2 0.50 0.10 0.20 0.10
Tract3 0.05 0.10 0.35 0.50
')
df2 <- read.table(header=T, text='
Poly1 Poly2 Poly3
Tract1 0.33 0.25 0.00
Tract2 0.33 0.25 0.10
Tract3 0.34 0.50 0.90
')
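Equivalently, since each column of df2 is just a vector of tract weights, the whole computation is one matrix product; a base-R sketch with the same data:
t(as.matrix(df2)) %*% as.matrix(df1)
# or: crossprod(as.matrix(df2), as.matrix(df1))
#       AfricanAmerican AsianAmerican Hispanic  White
# Poly1          0.2645        0.1495   0.2675 0.2855
# Poly2          0.2125        0.1375   0.2875 0.3375
# Poly3          0.0950        0.1000   0.3350 0.4600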

Libraries
library(tidyverse)
Sample Data
df1 <-
  tibble(
    Tract = paste0("Tract", 1:3),
    AfricanAmerican = c(.25, .5, .05),
    AsianAmerican = c(.25, .1, .1),
    Hispanic = c(.25, .2, .35)
  )
df2 <-
  tibble(
    Tract = paste0("Tract", 1:3),
    Poly1 = c(.33, .33, .34),
    Poly2 = c(.25, .25, .5),
    Poly3 = c(0, .1, .9)
  ) %>%
  # Pivot df2, making a single column for all Poly values
  pivot_longer(cols = -Tract, names_to = "Poly")
Code
df1 %>%
  # Join df1 and df2 by Tract
  left_join(df2) %>%
  # Group by Poly
  group_by(Poly) %>%
  # Weighted sum across variables AfricanAmerican to Hispanic
  summarise(across(AfricanAmerican:Hispanic, function(x) sum(x * value)))
Output
Joining, by = "Tract"
# A tibble: 3 x 4
Poly AfricanAmerican AsianAmerican Hispanic
<chr> <dbl> <dbl> <dbl>
1 Poly1 0.264 0.150 0.268
2 Poly2 0.212 0.138 0.288
3 Poly3 0.095 0.1 0.335

Related

Column Mean for rows with unique values

How can I compute the mean R, R1, R2, R3 values from the rows sharing the same lon,lat fields? I'm sure this question exists multiple times, but I could not easily find it.
lon lat length depth R R1 R2 R3
1 147.5348 -35.32395 13709 1 0.67 0.80 0.84 0.83
2 147.5348 -35.32395 13709 2 0.47 0.48 0.56 0.54
3 147.5348 -35.32395 13709 3 0.43 0.29 0.36 0.34
4 147.4290 -35.27202 12652 1 0.46 0.61 0.60 0.58
5 147.4290 -35.27202 12652 2 0.73 0.96 0.95 0.95
6 147.4290 -35.27202 12652 3 0.77 0.92 0.92 0.91
I'd recommend the split-apply-combine strategy: split by BOTH lon and lat, apply mean to each group, then recombine into a single data frame. dplyr makes this straightforward:
library(dplyr)
mydata %>%
  group_by(lon, lat) %>%
  summarize(
    mean_r = mean(R)
    , mean_r1 = mean(R1)
    , mean_r2 = mean(R2)
    , mean_r3 = mean(R3)
  )
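If you'd rather stay in base R, the same split-apply-combine can be done with aggregate (a sketch, assuming the data frame is named mydata as above):
aggregate(cbind(R, R1, R2, R3) ~ lon + lat, data = mydata, FUN = mean)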

apply a function on columns with specific names

I am new to R.
I have hundreds of data frames like this
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or the full names.
I know how to do it df by df; for the one in the example:
A <- function(x) log(x)
df_log <- data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns differs in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))  # columns whose names start with Ratio_
  x[i1] <- lapply(x[i1], A)        # apply the function to just those columns
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
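If you need the transformed data frames back as individual objects in the global environment (an assumption about the workflow, not part of the original answer), list2env can write the list back out:
res <- lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
list2env(res, envir = .GlobalEnv)  # recreates df1, ..., df100, now transformed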
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
df_transformed <- mutate_each(df, funs(log(.)), starts_with("Ratio_"))
df_transformed
# > df_transformed
# ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004
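Note: mutate_each has since been deprecated. With current dplyr, the equivalent (a sketch using across, which supersedes mutate_each) is:
library(dplyr)
df_transformed <- df %>%
  mutate(across(starts_with("Ratio_"), log))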

Indexing certain values in a function

I have a data frame that looks like this:
df <-
ID TIME AMT k10 k12 k21
1.00 0.00 50.00 0.10 0.40 0.01
1.00 1.00 0.00 0.10 0.40 0.01
1.00 2.00 0.00 0.10 0.40 0.01
1.00 3.00 50.00 0.10 0.40 0.01
1.00 4.00 0.00 0.10 0.40 0.01
2.00 0.00 100.00 0.25 0.50 0.06
2.00 1.00 0.00 0.25 0.50 0.06
2.00 2.00 0.00 0.25 0.50 0.06
I am using the values of k10, k12, k21 for certain calculations in the function below. Each of these values is specific to a subject ID and doesn't change with time. My question is: how can I write the function so that it uses the first value for each subject ID? As you may notice in the function below, this is what I am currently using:
k10 <- d$k10
k12 <- d$k12
k21 <- d$k21
Each of these gives a vector repeating the same value at all time points, which is unnecessary; I just need one value for each subject. I think that is one reason why I am getting warnings saying number of items to replace is not a multiple of replacement length.
#This is the function that I am using:
TwoCompIVbolus <- function(d){
  # set initial values in the compartments
  d$A1[d$TIME == 0] <- d$AMT[d$TIME == 0] # drug amount in the central compartment at time zero
  d$A2[d$TIME == 0] <- 0                  # drug amount in the peripheral compartment at time zero
  k10 <- d$k10
  k12 <- d$k12
  k21 <- d$k21
  k20 <- 0
  E1 <- k10 + k12
  E2 <- k21 + k20
  # calculate hybrid rate constants
  lambda1 <- 0.5 * (k12 + k21 + k10 + sqrt((k12 + k21 + k10)^2 - 4 * k21 * k10))
  lambda2 <- 0.5 * (k12 + k21 + k10 - sqrt((k12 + k21 + k10)^2 - 4 * k21 * k10))
  for (i in 2:nrow(d)) {
    t <- d$TIME[i] - d$TIME[i - 1]
    A1last <- d$A1[i - 1]
    A2last <- d$A2[i - 1]
    A1term <- (((A1last * E2 + A2last * k21) - A1last * lambda1) * exp(-t * lambda1) -
               ((A1last * E2 + A2last * k21) - A1last * lambda2) * exp(-t * lambda2)) / (lambda2 - lambda1)
    d$A1[i] <- A1term + d$AMT[i] # amount in the central compartment
    A2term <- (((A2last * E1 + A1last * k12) - A2last * lambda1) * exp(-t * lambda1) -
               ((A2last * E1 + A1last * k12) - A2last * lambda2) * exp(-t * lambda2)) / (lambda2 - lambda1)
    d$A2[i] <- A2term # amount in the peripheral compartment
  }
  d
}
# to apply it for each subject
library(plyr)
simdf <- ddply(df, .(ID), TwoCompIVbolus)
You can just use k10 <- d$k10[1]
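Since ddply calls TwoCompIVbolus once per ID, every row of d within a single call carries the same rate constants, so indexing the first element turns each vector into the single scalar you need:
k10 <- d$k10[1]
k12 <- d$k12[1]
k21 <- d$k21[1]
This also removes the length-mismatch warnings, because the values assigned inside the loop become scalars.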

Round multiple vectors in dataframe with plyr

The numbers in this data.frame are rounded to 3 decimal places:
habitats_df <- data.frame(habitat = c("beach", "grassland", "freshwater"), v1 = c(0.000, 0.670, 0.032), v2 = c(0.005, 0.824, 0.012))
habitat v1 v2
1 beach 0.000 0.005
2 grassland 0.670 0.824
3 freshwater 0.032 0.012
I need them rounded to 2 decimal places. I tried to use plyr::l_ply like this:
library(plyr)
l_ply(habitats_df[,2:3], function(x) round(x, 2))
But it didn't work. How can I use plyr::l_ply to round the numbers in habitats_df?
You don't really need plyr for this, since a simple lapply combined with round does the trick (l_ply is called for its side effects and returns NULL, which is why your attempt appeared to do nothing). I provide a solution in base R as well as plyr.
Try this in base R:
roundIfNumeric <- function(x, n = 1) if (is.numeric(x)) round(x, n) else x
as.data.frame(
  lapply(habitats_df, roundIfNumeric, 2)
)
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
And the same with plyr:
library(plyr)
quickdf(llply(habitats_df, roundIfNumeric, 2))
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
# plyr alternative
library(plyr)
data.frame(habitat = habitats_df$habitat,
           numcolwise(.fun = function(x) round(x, 2))(habitats_df))
# habitat v1 v2
# 1 beach 0.00 0.00
# 2 grassland 0.67 0.82
# 3 freshwater 0.03 0.01
# base alternative
data.frame(habitat = habitats_df$habitat,
           lapply(habitats_df[ , -1], function(x) round(x, 2)))
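For completeness, a modern-dplyr alternative (not part of the original plyr-era answers; uses across, which supersedes these idioms):
library(dplyr)
habitats_df %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))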

How to select groups of rows and store to variables?

I currently use the following code to read in a csv file, plot the data points based on one column, and store a CpK number in a variable. This code works to calculate the CpK for the entire data set, and the graph works as well. I am now looking to calculate the CpK number for each month in the dataset (graphing is not necessary). I looked through the data.table documentation as well as other R documentation, but I'm having a tough time selecting only the data for each month.
Current code (I could have calculated the CpK in one formula, but I have it broken up purposely):
mydf <- read.csv('ID35.csv', header = TRUE, sep=",")
date <- strptime(mydf$DATETIME, "%Y/%m/%d %H:%M:%S")
plot(date,mydf$AVG,xlab='Date',ylab='AVG',main='Data')
abline(h=mydf$MIN,col=3,lty=1)
abline(h=mydf$MAX,col=3,lty=1)
grid(NULL,NULL,col="black")
legend("topright", legend = c(" ", " "), text.width = strwidth("1,000,000"), lty = 1:2, xjust = 1, yjust = 1, title = "Points")
myavg <-mean(mydf$AVG, na.rm=TRUE)
newds <- (mydf$AVG - myavg)^2
newsum <- sum(newds, na.rm=TRUE)
N <- length(mydf$AVG) - 1
newN <- 1/N
total <- newN*newsum
sigma <- total^(1/2)
USL <- mean(mydf$MAX, na.rm=TRUE)
LSL <- mean(mydf$MIN, na.rm=TRUE)
cpk <- min(((USL-myavg)/(3*sigma)),((myavg-LSL)/(3*sigma)))
cpk
Here is what the dataset looks like (date formatting is already done):
mydf (only 24/1000 rows):
Code DATETIME AVG MIN TARG_AVG MAX
N9 2012/04/10 14:03:37 0.2647 0.22 0.25 0.27
NA 2012/03/30 07:48:17 0.2589 0.22 0.25 0.27
NB 2012/03/24 19:23:08 0.2912 0.22 0.25 0.27
NB 2012/03/25 16:10:17 0.2659 0.22 0.25 0.27
NC 2012/04/10 00:58:29 0.2622 0.22 0.25 0.27
ND 2012/04/14 18:32:52 0.2600 0.22 0.25 0.27
NG 2012/04/21 14:47:47 0.2671 0.22 0.25 0.27
NH 2012/04/09 20:31:17 0.2648 0.22 0.25 0.27
NL 2012/04/24 07:28:17 0.2527 0.22 0.25 0.27
NP 2012/04/23 13:26:50 0.2640 0.22 0.25 0.27
NQ 2012/04/14 20:30:42 0.2590 0.22 0.25 0.27
NS 2012/05/02 09:09:52 0.2651 0.22 0.25 0.27
NU 2012/05/04 13:07:49 0.2688 0.22 0.25 0.27
NV 2012/05/19 23:07:08 0.2716 0.22 0.25 0.27
NX 2012/05/03 02:00:13 0.2670 0.22 0.25 0.27
NY 2012/05/04 12:56:52 0.2680 0.22 0.25 0.27
NZ 2012/05/06 10:05:38 0.2697 0.22 0.25 0.27
O0 2012/05/07 22:01:11 0.2675 0.22 0.25 0.27
O3 2012/06/21 18:09:47 0.2606 0.22 0.25 0.27
04 2012/06/21 18:47:36 0.2545 0.22 0.25 0.27
51 2012/07/24 21:13:08 0.2541 0.22 0.25 0.27
O5 2012/07/26 16:54:09 0.2575 0.22 0.25 0.27
O6 2012/07/20 02:42:29 0.2603 0.22 0.25 0.27
OD 2012/08/25 20:56:55 0.2559 0.22 0.25 0.27
OH 2012/08/28 10:30:11 0.2372 0.22 0.25 0.27
From the table above the only two columns I care about are DATETIME and AVG. Once I have the new "myavg" variable for each month I can use the same formula to calculate the CpK number. I am thinking the variable name could be something like '2012/08'. I think the code should go something like:
for (each month in mydf$DATETIME)  # dates like 2012/04*, 2012/05*
  monthavg <- mean(mydf$AVG, na.rm = TRUE)
Is there a way to store the CpK number for each month in variables I can access?
aggregate(mydf$AVG, list(month=months(as.Date(mydf$DATETIME))), mean)
# month x
# 1 April 0.2618125
# 2 August 0.2465500
# 3 July 0.2573000
# 4 June 0.2575500
# 5 March 0.2720000
# 6 May 0.2682429
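Note that months() pools the same calendar month across years. If you want each year/month separately (e.g. "2012/08"), and the full CpK rather than just the mean, here is a sketch building on the question's own formula (sd uses the same N-1 denominator as the manual sigma above):
mydf$month <- format(as.Date(mydf$DATETIME), "%Y/%m")
cpk_by_month <- sapply(split(mydf, mydf$month), function(d) {
  m   <- mean(d$AVG, na.rm = TRUE)
  s   <- sd(d$AVG, na.rm = TRUE)
  USL <- mean(d$MAX, na.rm = TRUE)
  LSL <- mean(d$MIN, na.rm = TRUE)
  min((USL - m) / (3 * s), (m - LSL) / (3 * s))
})
cpk_by_month["2012/08"]  # e.g. the CpK for August 2012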
