Subset and loop calculation over each column in dplyr - r

I have a dataframe of values for 50 IDs repeated over 10 iterations. I would like to subset by ID and then perform calculations, and repeat that for each column from x1 to x5. I used a for-loop but it is very inefficient (my actual dataset has a lot more IDs).
Here are the calculations I would like to perform. I've had varying success with the conversion to dplyr:
The first calculation gives me the correct value for x1, but I need to repeat it for each column from x1 to x5:
V1.x1 <- preds.df %>%
  split(.$ID) %>%
  sapply(function(ID) {
    (ID$x1 - mean(ID$x1))^2 # for x1 only
  }) %>%
  mean()
The second calculation involves subtracting a corresponding value from another data frame, data.pop. My dplyr attempt is wrong even for just x1:
## This is what I want to achieve, implemented with a for-loop:
# df for for-loop
Bsq.perID <- data.frame(matrix(NA,
                               nrow = nrow(data.pop), # 50 observations
                               ncol = 5))             # 5 models
# For-loop:
for (ids in 1:nrow(data.pop)) {
  current.ID <- preds.df[preds.df$ID == ids, ] # get current ID over all 10 iterations
  for (i in 1:5) {
    Bsq.perID[ids, i] <- (mean(current.ID[, i]) - data.pop[ids, "real.val"])^2
  }
}
Bsq.values <- colMeans(Bsq.perID)
## My wrong dplyr attempt of the above:
B1.x1 <- preds.df %>%
  split(.$ID) %>%
  sapply(function(ID) {
    (mean(ID$x1) - data.pop$real.val)^2
  }) %>%
  mean()
The structure of preds.df looks like this:
head(preds.df)
x1 x2 x3 x4 x5 iteration ID
1 20.005984 6.78242996 3.526411 21.463892 8.792720 1 1
2 2.890490 7.28232755 18.670470 6.717213 19.830930 1 2
3 4.868658 24.88117301 1.883913 3.897779 14.371414 1 3
4 6.495532 5.79591685 7.745554 20.153269 7.935672 1 4
5 19.297779 0.05068784 21.744816 14.957751 14.232126 1 5
6 7.090456 22.06322779 8.388263 10.672151 9.921884 1 6
tail(preds.df)
x1 x2 x3 x4 x5 iteration ID
495 16.306927 2.8873609 9.7764755 23.798867 10.246443 10 45
496 4.767296 23.2086303 8.8394391 7.806442 24.898483 10 46
497 19.966301 13.7151699 10.2483011 15.199162 9.658736 10 47
498 18.134534 22.1658901 5.6481757 18.501411 23.787457 10 48
499 7.877636 7.2356274 8.2862336 3.790823 11.610848 10 49
500 8.554774 0.9199501 0.9650191 17.155611 1.158619 10 50

I would approach it like this:
library(dplyr)
library(rio)
preds.df <- import("~/Downloads/preds.df.csv")
data.pop <- import("~/Downloads/data.pop.csv")
## added a row because data.pop is only 49 rows in the data you sent
data.pop <- bind_rows(data.pop, data.pop[1,])
You could use dplyr with mutate() to do this:
dat1 <- preds.df %>%
  group_by(ID) %>%
  mutate(across(x1:x5, function(x) (x - mean(x))^2))
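If you also want the single value per column, matching V1.x1 above (a sketch I've added, not part of the original answer), average the squared deviations over all rows:
V1.values <- dat1 %>%
  ungroup() %>%
  summarise(across(x1:x5, mean)) # mean of (x - mean(x))^2 over all IDs and iterations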
Then, for the second part, merge in data.pop after giving it an ID column. Since the for-loop uses the raw column means, start the merge from preds.df rather than dat1:
data.pop <- data.pop %>%
  mutate(ID = 1:n())
dat2 <- preds.df %>% left_join(data.pop, by = "ID")
Next, summarise on ID to get the mean of x1 to x5 within each ID, then subtract real.val from each mean and square:
dat2 <- dat2 %>%
  select(ID, x1:x5, real.val) %>%
  group_by(ID) %>%
  summarise(across(c(x1:x5, real.val), mean)) %>%
  mutate(across(x1:x5, function(x) (x - real.val)^2)) %>%
  select(-real.val)
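To reproduce Bsq.values from the end of the for-loop (again a sketch I've added, not part of the original answer), average the per-ID values for each column:
Bsq.values <- dat2 %>%
  summarise(across(x1:x5, mean)) # same as colMeans(Bsq.perID)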

Related

Make simple table from data frame using formula

I've got a large dataframe with 72 rows and 72 columns, all numbers. I want to make a table with the sum of the first row multiplied by the sum of the first column, and so on for each row/column pair. So for example, if this is my df
x0 <- c(0,0,11,0)
x0.1 <- c(0,251,0,0)
x0.2 <- c(0,495,0,0)
x0.4 <- c(0,0,0,6)
df <- data.frame(x0,x0.1,x0.2,x0.4)
I want my table to look something like this
1 0
2 124911
3 5445
4 36
Would you be looking for:
rowSums(df)*colSums(df)
Solution using sapply:
sapply(1:nrow(df), function(i){sum(df[i,]*sum(df[,i]))})
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
df %>%
  unite(x, c(x0, x0.1, x0.2, x0.4), sep = "") %>%
  mutate(x = sub("^0+(?!$)", "", x, perl = TRUE))
x
1 0
2 2514950
3 11000
4 6

Problem with sample_n(2, replace=F) in a list of data with some data's size less than 2

I need help with sample_n() in ‘dplyr’ in R:
I have a list of data frames riskset[[1]], riskset[[2]], ..., riskset[[1000]]; each element riskset[[i]] of the list is a data frame of observations, and I divided the observations in each riskset into groups 1:4 based on the distribution of a variable. So the data in riskset[[i]] looks like this:
id sex grp ...
1 F 1 ...
2 M 3 ...
3 F 1 ...
4 M 4 ...
5 F 2 ...
6 F 3 ...
......................
I want to sample 2 observations from each grp within each riskset and save them as a list of samples. I used:
sample <- list()
for (i in 1:1000) {
  sample[[i]] <- riskset[[i]] %>% group_by(grp) %>% sample_n(2, replace = F)
}
It gave me this error:
size must be less or equal than 1 (size of data), set ‘replace = TRUE’ to use sampling with replacement.
I tried the code on risksets that have more than 2 obs in each grp, and it worked. But it fails on risksets where some group has fewer than 2 obs. For a group with fewer than 2 obs, I want all the obs it has; for a group with more than 2 obs, I want to sample 2 obs without replacement. How can I achieve my sampling goal using R functions? Thanks in advance!
We can use map to loop over the list ('riskset'), then, grouped by 'grp', apply sample_n:
library(tidyverse)
out <- map(riskset, ~ .x %>%
  group_by(grp) %>%
  sample_n(pmin(n(), 2), replace = FALSE))
Or another option is slice:
map(riskset, ~ .x %>%
  group_by(grp) %>%
  slice(if (n() < 2) 1 else sample(row_number(), 2)))
Or without using if/else:
map(riskset, ~ .x %>%
  group_by(grp) %>%
  slice(sample(n(), pmin(n(), 2))))
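As a side note (my addition, not from the original answer): newer dplyr (>= 1.0) has slice_sample(), which silently truncates to the group size when replace = FALSE, so no pmin() workaround is needed:
map(riskset, ~ .x %>%
  group_by(grp) %>%
  slice_sample(n = 2)) # groups with fewer than 2 rows keep all their rows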
data
iris1 <- iris %>%
  select(grp = Species, everything()) %>%
  slice(c(1:5, 51))
riskset <- list(iris1, iris1)

Vectorising linear interpolation function for use with mutate

I have a data frame that looks like this:
# Set RNG
set.seed(33550336)
# Create toy data frame
df <- expand.grid(day = 1:10, dist = seq(0, 100, by = 10))
df1 <- df %>% mutate(region = "Here")
df2 <- df %>% mutate(region = "There")
df3 <- df %>% mutate(region = "Everywhere")
df_ref <- do.call(rbind, list(df1, df2, df3))
df_ref$value <- runif(nrow(df_ref))
# > head(df_ref)
# day dist region value
# 1 1 0 Here 0.39413117
# 2 2 0 Here 0.44224203
# 3 3 0 Here 0.44207487
# 4 4 0 Here 0.08007335
# 5 5 0 Here 0.02836093
# 6 6 0 Here 0.94475814
This represents a reference data frame and I'd like to compare observations against it. My observations are taken on a specific day that is found in this reference data frame (i.e., day is an integer from 1 to 10) in a region that is also found in this data frame (i.e., Here, There, or Everywhere), but the distance (dist) is not necessarily an integer between 0 and 100. For example, my observation data frame (df_obs) might look like this:
# Observations
df_obs <- data.frame(day = sample(1:10, 3, replace = TRUE),
                     region = sample(c("Here", "There", "Everywhere")),
                     dist = runif(3, 0, 100))
# day region dist
# 1 6 Everywhere 68.77991
# 2 7 There 57.78280
# 3 10 Here 85.71628
Since dist is not an integer, I can't just lookup the value corresponding to my observations in df_ref like this:
df_ref %>% filter(day == 6, region == "Everywhere", dist == 68.77991)
So, I created a lookup function that uses the linear interpolation function approx:
lookup <- function(re, di, da){
  # Filter to day and region
  df_tmp <- df_ref %>% filter(region == re, day == da)
  # Approximate answer from distance
  approx(unlist(df_tmp$dist), unlist(df_tmp$value), xout = di)$y
}
Applying this to my first observation gives,
lookup("Everywhere", 68.77991, 6)
#[1] 0.8037013
Nevertheless, when I apply the function using mutate I get a different answer.
df_obs %>% mutate(ref = lookup(region, dist, day))
# day region dist ref
# 1 6 Everywhere 68.77991 0.1881132
# 2 7 There 57.78280 0.1755198
# 3 10 Here 85.71628 0.1730285
I suspect that this is because lookup is not vectorised correctly. Why am I getting different answers and how do I fix my lookup function to avoid this?
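A likely explanation (my note, not from the original thread): inside mutate(), lookup() receives the whole region, dist, and day columns at once, so filter(region == re, day == da) recycles those vectors against df_ref and returns the wrong subset. Calling the function once per row gives the expected values:
# Hedged sketch: evaluate lookup() row by row
df_obs %>%
  rowwise() %>%
  mutate(ref = lookup(region, dist, day)) %>%
  ungroup()
# Base R alternative:
df_obs$ref <- mapply(lookup, df_obs$region, df_obs$dist, df_obs$day)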

Compute variable according to factor levels

I am kind of new to R and programming in general. I am currently struggling with a piece of code for data transformation and hope someone can take a little bit of time to help me.
Below is a reproducible example:
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
Goal: compute all values (a, b) relative to a reference value. The calculation should be a/a_ref, where a_ref is the value of a at f2 = 0 within the same family (f1 can be X, Y, or Z).
I tried to solve this by using this code :
test <- filter(dt, f2 != 0) %>% group_by(f1) %>%
  mutate("a/a_ref" = a/(filter(dt, f2 == 0) %>% group_by(f1) %>% distinct(a) %>% pull))
The output shows that a is divided by a_ref, but my script seems to recycle the reference values (a_ref) regardless of the family f1.
Do you have any suggestions so that a is computed with regard to the family (f1)?
Thank you for reading!
EDIT
I found a way to do it 'manually':
filter(dt, f1 == "X") %>% mutate("a/a_ref" = a/(filter(dt, f1 == "X" & f2 == 0) %>% distinct(a) %>% pull()))
f1 f2 a b a/a_ref
1 X 0 21.77605 24.53115 1.0000000
2 X 1 20.17327 24.02512 0.9263973
3 X 50 19.81482 25.58103 0.9099366
4 X 100 19.90205 24.66322 0.9139422
The problem is that I'd have to update the code for each variable and family, so it is not a clean way to do it.
# use this to reproduce the same dataset and results
set.seed(5)
# Data
a <- c(rnorm(12, 20))
b <- c(rnorm(12, 25))
f1 <- rep(c("X","Y","Z"), each=4) #family
f2 <- rep(x = c(0,1,50,100), 3) #reference and test levels
dt <- data.frame(f1=factor(f1), f2=factor(f2), a,b)
#library loading
library(tidyverse)
dt %>%
  group_by(f1) %>%                 # for each f1 value
  mutate(a_ref = a[f2 == 0],       # get the a_ref and add it to each row
         "a/a_ref" = a/a_ref) %>%  # divide a by a_ref
  ungroup() %>%                    # forget the grouping
  filter(f2 != 0)                  # remove rows where f2 == 0
# # A tibble: 9 x 6
# f1 f2 a b a_ref `a/a_ref`
# <fctr> <fctr> <dbl> <dbl> <dbl> <dbl>
# 1 X 1 21.38436 24.84247 19.15914 1.1161437
# 2 X 50 18.74451 23.92824 19.15914 0.9783583
# 3 X 100 20.07014 24.86101 19.15914 1.0475490
# 4 Y 1 19.39709 22.81603 21.71144 0.8934042
# 5 Y 50 19.52783 25.24082 21.71144 0.8994260
# 6 Y 100 19.36463 24.74064 21.71144 0.8919090
# 7 Z 1 20.13811 25.94187 19.71423 1.0215013
# 8 Z 50 21.22763 26.46796 19.71423 1.0767671
# 9 Z 100 19.19822 25.70676 19.71423 0.9738257
You can do this for more than one variable using:
dt %>%
  group_by(f1) %>%
  mutate_at(vars(a:b), funs(./.[f2 == 0])) %>%
  ungroup()
Or generally use vars(a:z) to use all variables between a and z as long as they are one after the other in your dataset.
Another solution could be using mutate_if like:
dt %>%
  group_by(f1) %>%
  mutate_if(is.numeric, funs(./.[f2 == 0])) %>%
  ungroup()
Here the function is applied to all numeric variables; f1 and f2 are factor variables, so they are excluded automatically.
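As an aside (my addition, not part of the original answer): funs() and the mutate_at()/mutate_if() scoped verbs have since been superseded, and with dplyr >= 1.0 the same idea can be written with across():
dt %>%
  group_by(f1) %>%
  mutate(across(where(is.numeric), ~ .x / .x[f2 == 0])) %>%
  ungroup()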

R Looping aggregate by group count

I want to write a loop that aggregates the number of instances of certain values, grouped by year. More specifically, say the variable is x1. I want two groups: one where x1 = 1, and the other where x1 takes one of several values (2, 3, and 5 in the example below):
year x1
2000 1
2000 1
2000 2
2000 3
2000 5
The end result should look like this:
year x2 x3
2000 2 3
where x2 and x3 are the counts when x1 = 1 and x1 = c(2,3,5), respectively. How can one accomplish this?
Edit: I probably should have mentioned this earlier. I work with two datasets: df1 is yearly (spanning approx. 200 years) and df2 is incident-based (around 50k observations; this is where x1 currently lives). So the idea of the loop is to look at each year[i] in df2 and aggregate the counts as x2 and x3 in df1.
Edit 2: Ah, I worked out why the submitted answers were not working for me. Apparently I ran into the dplyr-before-plyr problem discussed in this answer; I followed ManneR's answer and detached plyr. Now the group_by command works again.
I am not sure what was wrong with user3349904's answer, as it seems to do what you are asking. It's not easy to know exactly what you are asking for without knowing what your data looks like. Is your issue with the other solution that df1 needs to hold the x2 and x3 values? The last part below solves that.
I tried to replicate your problem from scratch so here's my shot at a solution.
library(dplyr)
# create df1 (years)
df1 <- as.data.frame(matrix(ncol = 3, nrow = 200))
df1$V1 <- c(1800:1999)
colnames(df1) <- c("year", "x2", "x3")
# create df2 (transactions)
df2 <- as.data.frame(matrix(ncol = 2, nrow = 50000))
# add random sample data
df2$V1 <- sample(1800:1999, 50000, replace = T)
df2$V2 <- sample(1:5, 50000, replace = T)
colnames(df2) <- c("year", "x1")
# group by year in df2 and aggregate counts based on categories
df2 %>% group_by(year) %>%
  summarise(x2 = sum(x1 == 1), x3 = sum(x1 %in% c(2, 3, 5))) -> df3
# match years in df3 and df1 and bring lookup values to df1
df1$x2 <- df3$x2[match(df1$year, df3$year)]
df1$x3 <- df3$x3[match(df1$year, df3$year)]
Here is another option using dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
  group_by(year, grp = paste0("x", (x1 != 1) + 2)) %>%
  summarise(x1 = n()) %>%
  spread(grp, x1)
# year x2 x3
#* <int> <int> <int>
#1 2000 2 3
Or using base R
xtabs(Freq ~ year + x1, transform(df1, x1 = paste0("x", (x1 != 1) + 2), Freq = 1))
Assuming you are starting from a data frame called df, this will count the cases as you describe them by year:
library(dplyr)
df %>% group_by(year) %>% summarise(x2 = sum(x1==1), x3 = sum(x1 %in% c(2,3,5)))
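Applied to the five-row example at the top of the question, this returns the requested result (a quick check I've added):
df <- data.frame(year = rep(2000, 5), x1 = c(1, 1, 2, 3, 5))
df %>% group_by(year) %>%
  summarise(x2 = sum(x1 == 1), x3 = sum(x1 %in% c(2, 3, 5)))
#   year    x2    x3
# 1 2000     2     3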
