Storing output from R multiple loops into a list - r

I'm trying to carry out the following action on the columns of a dataframe (df1):
term1+term2+term3*req_no
req_no is a range of numbers: 20:24
df1:
ID term1 term2 term3
X299 1.2 2.3 0.12
X300 1.4 0.6 2.4
X301 0.3 1.6 1.2
X302 0.9 0.6 0.4
X303 0.3 1.8 0.3
X304 1.3 0.3 2.1
I need help getting the output shown below.
Required output:
ID 20 21 22 23 24
X299 5.9 6.02 6.14 6.26 6.38
X300 50 52.4 54.8 57.2 59.6
X301 25.9 27.1 28.3 29.5 30.7
X302 9.5 9.9 10.3 10.7 11.1
X303 8.1 8.4 8.7 9 9.3
X304 43.6 45.7 47.8 49.9 52
Here's my attempt:
results <- list()
req_no <- 20:25
for(i in 1:nrow(df1){
for(j in rq_no){
res <- term1+term2+term3*j
results[j] <- res
}
results[[i]]
}
results2 <- do.call("rbind",result)
Help will be appreciated.

Here are a couple of different approaches, though neither is as succinct as Parfait's. Sample data:
df <- data.frame(ID=c("X299", "X300"),
term1=c(1.2, 1.4),
term2=c(2.3, 0.6),
term3=c(0.12, 2.4))
req_no <- 20:25
Loop approach
Your initial approach is headed in the right direction, but in the future, it would help to specify exactly what your error or problem is. For an iterated and perhaps easier-to-read approach, here's one answer:
results <- matrix(data=NA, nrow=nrow(df), ncol=length(req_no)) # Empty matrix to store our results
colnames(results) <- req_no # Optional; name columns based off of req_no values
for(i in 1:nrow(df)) {
# Do the calculation we want; returns a vector of length 6 (one value per req_no)
res <- df[i,]$term1 + df[i,]$term2 + (df[i,]$term3 * req_no)
# Save results for row i of df into row i of results matrix
results[i,] <- res
}
# Now bind the columns (named 20 through 25) to the respective rows of df
output <- cbind(df, results)
output
Compared with your initial attempt, note:
We only need one loop, since R multiplies a scalar by a whole vector (req_no) in a single step.
There are a few ways to subset data from a data frame in R. In this case, df[i,] gets everything in the i-th row, while $termX gets the value in the column named termX.
Using a results matrix instead of a list makes it very easy to copy the temporary computations (for each row) into rows of the matrix.
Rather than rbind() (row bind), we want cbind() (column bind) to attach those results as new columns alongside the original rows.
Output:
ID term1 term2 term3 20 21 22 23 24 25
1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60 62.0
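The point about multiplying by a vector can be taken one step further: the loop over rows can also be dropped. The following is only a sketch using base R's outer(), reusing the sample data above; the names prod_mat and results_vec are made up for illustration.

```r
# Sample data as above
df <- data.frame(ID = c("X299", "X300"),
                 term1 = c(1.2, 1.4),
                 term2 = c(2.3, 0.6),
                 term3 = c(0.12, 2.4))
req_no <- 20:25

# outer() builds a matrix whose [i, j] entry is term3[i] * req_no[j]
prod_mat <- outer(df$term3, req_no)

# term1 + term2 has one value per row; R recycles it down each column,
# so every entry in row i gets term1[i] + term2[i] added
results_vec <- prod_mat + df$term1 + df$term2
colnames(results_vec) <- req_no

output <- cbind(df, results_vec)
output
```

This should give the same table as the loop, with the row-wise bookkeeping handled by recycling.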
Dplyr/purrr functions
This could also be solved using tidy functions. In essence it's a pretty similar approach to Parfait's answer, but I've made the steps a bit more verbose to see what's going on.
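# Load the packages that provide %>%, mutate(), map()/pmap() and unnest_wider()
library(dplyr)
library(purrr)
library(tidyr)
# Use purrr's map functions to do the computation we want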
nested_df <- df %>%
# Make new column holding term3 * req_no (stores a vector in each new cell)
mutate(term3r = map(term3, ~ .x * req_no)) %>%
# Make new column which sums the three columns of interest (stores a vector in each new cell)
mutate(sum = pmap(list(term1, term2, term3r), ~ ..1 + ..2 + ..3))
# "Unnest" those vectors which store our sums, and keep only those and ID
output <- nested_df %>%
# Creates six new columns (named ...1 to ...6) with the elements of each sum
unnest_wider(sum) %>%
# Keeps only the output data and IDs
select(ID, ...1:...6)
output
Output:
# A tibble: 2 x 7
ID ...1 ...2 ...3 ...4 ...5 ...6
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 X299 5.9 6.02 6.14 6.26 6.38 6.5
2 X300 50 52.4 54.8 57.2 59.6 62

Consider directly assigning new columns with sapply using your formula:
df[paste0(req_no)] <- sapply(req_no, function(r) with(df, term1 + term2 + term3 * r))
df
# ID term1 term2 term3 20 21 22 23 24
# 1 X299 1.2 2.3 0.12 5.9 6.02 6.14 6.26 6.38
# 2 X300 1.4 0.6 2.40 50.0 52.40 54.80 57.20 59.60
# 3 X301 0.3 1.6 1.20 25.9 27.10 28.30 29.50 30.70
# 4 X302 0.9 0.6 0.40 9.5 9.90 10.30 10.70 11.10
# 5 X303 0.3 1.8 0.30 8.1 8.40 8.70 9.00 9.30
# 6 X304 1.3 0.3 2.10 43.6 45.70 47.80 49.90 52.00

Related

Looking for function or formula to create table with means and standard deviations for many groups and many variables using tidyverse

I need to prepare a table that includes the means and standard deviations for each level of several demographic variables and for many variables.
Consider the following data:
df <- tibble(place=c("London","Paris","London","Rome","Rome","Madrid","Madrid"),gender=c("m","f","f","f","m","m","f"), education = c(1,1,2,3,5,5,3), var1 = c(2.2,3.1,4.5,1,5,1.4,2.3),var2 = c(4.2,2.1,2.5,4,5,4.4,1.3),var3 = c(0.2,0.1,3.5,3,5,2.4,4.3))
I would like to get a dataframe that contains the grouping variables (place, gender, education) and their levels (e.g., London, Paris, etc.) in the first column and their means and standard deviations for each variable starting with var (var1, var2, var3) in additional columns.
I know how to do this for one group and several variables at a time. However, since I need to repeat this dozens of times I am looking for a way to automate this process. It would be great to have a function to which I simply need to pass (a) the names of the grouping variables (e.g., gender, education) and (b) the variables from which to get the M / SD (e.g. var1, var2).
The result I am looking for should look like this (the stats are not correct in the example below):
my_results <- tibble(grouping_vars = c("place_London","place_Paris","place_Rome","place_Madrid","gender_m","gender_f","last_element"),mean_var1=c(1.3,2.5,4.5,1.7,2.5,3.6,4.0),sd_var1=c(0.01,0.41,0.21,0.12,0.02,0.38,0.28),mean_var2=c(4.3,4.5,4.0,1.2,2.5,1.6,2.3),sd_var2=c(0.21,0.1,0.1,0.32,0.22,0.18,0.08),mean_var3=c(2.3,2.5,2.0,3.2,3.5,0.6,5),sd_var3=c(0.51,0.15,0.51,0.52,0.52,0.15,0.48))
grouping_vars mean_var1 sd_var1 mean_var2 sd_var2 mean_var3 sd_var3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 place_London 1.3 0.01 4.3 0.21 2.3 0.51
2 place_Paris 2.5 0.41 4.5 0.1 2.5 0.15
3 place_Rome 4.5 0.21 4 0.1 2 0.51
4 place_Madrid 1.7 0.12 1.2 0.32 3.2 0.52
5 gender_m 2.5 0.02 2.5 0.22 3.5 0.52
6 gender_f 3.6 0.38 1.6 0.18 0.6 0.15
7 last_element 4 0.28 2.3 0.08 5 0.48
Since I typically work with tidyverse, I would particularly appreciate solutions that use these packages (probably dplyr or purrr?).
EDIT:
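I thought there would be an elegant way to do this using map(). Maybe there is, but I haven't found it yet. In the meantime, I figured out a way that simply restructures the data into an appropriate long format and then computes the statistics.
# names of the grouping variables and of the variables to summarise
grouping_vars <- c("place", "gender", "education")
test_vars <- c("var1", "var2", "var3")
df %>%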
# all grouping vars need to be of the same type, here "factor" is most appropriate
mutate_at(grouping_vars, list(factor)) %>%
# pivot longer, so that each row is a unique combination of grouping variable and grouping level
pivot_longer(
cols = one_of(grouping_vars),
names_to = "group_var",
values_to = "group_level"
) %>%
# merge grouping variable and group level into a single column
unite(var_level,group_var,group_level, sep="_") %>%
# group by group level
group_by(var_level) %>%
# compute means and sd for each test variable
summarise_at(test_vars, list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))
The result seems fine, e.g., the mean of var1 for the two people who live in London (2.2 and 4.5) is 3.35.
# A tibble: 10 x 7
var_level var1_mean var2_mean var3_mean var1_sd var2_sd var3_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 education_1 2.65 3.15 0.15 0.636 1.48 0.0707
2 education_2 4.5 2.5 3.5 NA NA NA
3 education_3 1.65 2.65 3.65 0.919 1.91 0.919
4 education_5 3.2 4.7 3.7 2.55 0.424 1.84
5 gender_f 2.72 2.48 2.72 1.47 1.13 1.83
6 gender_m 2.87 4.53 2.53 1.89 0.416 2.40
7 place_London 3.35 3.35 1.85 1.63 1.20 2.33
8 place_Madrid 1.85 2.85 3.35 0.636 2.19 1.34
9 place_Paris 3.1 2.1 0.1 NA NA NA
10 place_Rome 3 4.5 4 2.83 0.707 1.41
Any thoughts on possible risks of this approach or how this could be improved?
One option is the describeBy function from psych:
library(psych)
describeBy(df,group = c("gender","education"), mat= TRUE)
Then subset what you want from there.
Another, surprisingly simple option with dplyr:
library(dplyr)
group.vars <- c("gender","education")
measure.vars <- c("var1","var2")
df %>%
group_by_at(group.vars) %>%
summarize_at(measure.vars,
list(mean =~ mean(.),sd =~ sd(.)))
# A tibble: 5 x 6
# Groups: gender [2]
gender education var1_mean var2_mean var1_sd var2_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 1 3.1 2.1 NA NA
2 f 2 4.5 2.5 NA NA
3 f 3 1.65 2.65 0.919 1.91
4 m 1 2.2 4.2 NA NA
5 m 5 3.2 4.7 2.55 0.424
You can continue adding additional functions to that list. For every element, the name will be appended to the variable name and the result will become the column values. Recall that ~ is purrr-style shorthand for an anonymous function, with . standing for its argument.
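The question asks for a reusable function that takes the grouping variables and the measured variables. One way this could look is sketched below; it wraps the long-format approach from the question's edit and uses across(), which dplyr 1.0 and later offer in place of the _at verbs. The function name summarise_by_levels and its argument names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: one row per "variable_level", with mean and sd
# of every measured variable
summarise_by_levels <- function(data, group_vars, measure_vars) {
  data %>%
    # grouping columns must share a type before they are stacked
    mutate(across(all_of(group_vars), as.character)) %>%
    pivot_longer(all_of(group_vars),
                 names_to = "group_var", values_to = "group_level") %>%
    unite("var_level", group_var, group_level, sep = "_") %>%
    group_by(var_level) %>%
    summarise(across(all_of(measure_vars),
                     list(mean = ~ mean(.x, na.rm = TRUE),
                          sd   = ~ sd(.x, na.rm = TRUE))),
              .groups = "drop")
}

df <- data.frame(
  place = c("London", "Paris", "London", "Rome", "Rome", "Madrid", "Madrid"),
  gender = c("m", "f", "f", "f", "m", "m", "f"),
  education = c(1, 1, 2, 3, 5, 5, 3),
  var1 = c(2.2, 3.1, 4.5, 1, 5, 1.4, 2.3),
  var2 = c(4.2, 2.1, 2.5, 4, 5, 4.4, 1.3)
)

res <- summarise_by_levels(df, c("place", "gender"), c("var1", "var2"))
res
```

Levels with a single observation (Paris here) get NA for the standard deviation, as in the outputs above.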

Unlisting multiple columns in R data frame

I have a dataframe with values for multiple macro variables. When I compute the log of the values and then the log differences, the variables turn into lists, causing problems with my script later on.
Example code:
#Compute log of relevant macrovariables
macro[,c("hp", "unem", "m1", "inc")] <- log(macro[,c("hp", "unem", "m1", "inc")])
colnames(macro)[2:5] <- paste(colnames(macro)[2:5], "log", sep = "_")
#Computing log differences
macro$ldiff_hp <- c(-diff(macro$hp_log), na.omit)
I'm trying to unlist the columns and convert them to numeric with either of the following:
#Alternative 1
macro[,15:19]<- unlist(as.numeric(macro[,15:19]))
#Alternative 2
macro[,15:19] <- sapply(macro[,15:19],as.numeric)
It gives me the following error output:
> macro[,15:19]<- unlist(as.numeric(macro[,15:19]))
Error in unlist(as.numeric(macro[, 15:19])) :
(list) object cannot be coerced to type 'double'
Using the economics dataset from ggplot2 as example data, and making use of dplyr's lag function, the log-differenced variables can be computed like so:
library(ggplot2)
library(dplyr)
macro <- ggplot2::economics
vars <- c("uempmed", "psavert")
vars_log <- paste(vars, "log", sep = "_")
vars_ldiff <- paste(vars, "ldiff", sep = "_")
#Compute log of relevant macrovariables
macro[, vars_log] <- sapply(macro[, vars], log)
# Lag values
macro[, vars_ldiff] <- sapply(macro[, vars_log], dplyr::lag)
# First Difference of logs
macro[, vars_ldiff] <- macro[, vars_log] - macro[, vars_ldiff]
macro
#> # A tibble: 574 x 10
#> date pce pop psavert uempmed unemploy uempmed_log psavert_log
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1967-07-01 507. 198712 12.6 4.5 2944 1.50 2.53
#> 2 1967-08-01 510. 198911 12.6 4.7 2945 1.55 2.53
#> 3 1967-09-01 516. 199113 11.9 4.6 2958 1.53 2.48
#> 4 1967-10-01 512. 199311 12.9 4.9 3143 1.59 2.56
#> 5 1967-11-01 517. 199498 12.8 4.7 3066 1.55 2.55
#> 6 1967-12-01 525. 199657 11.8 4.8 3018 1.57 2.47
#> 7 1968-01-01 531. 199808 11.7 5.1 2878 1.63 2.46
#> 8 1968-02-01 534. 199920 12.3 4.5 3001 1.50 2.51
#> 9 1968-03-01 544. 200056 11.7 4.1 2877 1.41 2.46
#> 10 1968-04-01 544 200208 12.3 4.6 2709 1.53 2.51
#> # ... with 564 more rows, and 2 more variables: uempmed_ldiff <dbl>,
#> # psavert_ldiff <dbl>
Created on 2020-03-23 by the reprex package (v0.3.0)
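As for why the columns in the question became lists in the first place: c() given a numeric vector together with a function object (na.omit is passed there without being called) returns a list, and assigning that list to a data frame column produces a list column. A small sketch with made-up numbers follows; the padded version assumes the intended result is the usual log(x[t]) - log(x[t-1]).

```r
x <- c(100, 110, 121)   # made-up example values

# Combining numbers with a function object silently yields a list
bad <- c(-diff(log(x)), na.omit)
is.list(bad)

# Padding with NA keeps a plain numeric vector of the same length as x
good <- c(NA, diff(log(x)))
is.numeric(good)
good
```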

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the preceding values based on the "rate". For instance, in row 1, the first NA would be replaced by "27-(0.5*3)", the second by "27-(0.5*2)" and the third by "27-(0.5*1)". This is the loop I came up with. I know the first part (the outer loop) works, but the inner one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
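One caveat with fillIn, offered as a side note: for a row with no NAs at all, seq(n, 1) with n = 0 counts 0, 1 rather than being empty, so two extra values would be generated. A possible variant that tolerates such rows is sketched below; it still assumes all NAs are leading, and the name fill_leading is made up.

```r
# Variant of fillIn that leaves rows without NAs untouched.
# Assumes, like the original, that all NAs are leading and that the
# last element of each row is the rate.
fill_leading <- function(x) {
  rate <- tail(x, 1)
  n <- sum(is.na(x))                      # number of leading NAs
  steps <- rev(seq_len(n))                # n, ..., 1 (empty when n == 0)
  x[seq_len(n)] <- x[n + 1] - rate * steps
  x
}

m <- rbind(l1   = c(NA, NA, NA, 27, 31, 0.5),
           full = c(20, 22, 25, 27, 31, 0.5))
filled <- t(apply(m, 1, fill_leading))
filled
```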
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need multiple for loops for this. Here is some simplified code that does what you want. Working explicitly with your data, we first need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix, with the NA counts in the first column and 0's in the rest
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-1))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that, we can use this code to back-fill the NA values in the way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill,i] <- for_estimate[to_fill] - (data$rate[to_fill]*n_na[to_fill,i])
}
}
Output:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9

referencing an xts object with a matrix

I have a 207x7 xts object (called temp). I have a 207x3 matrix (called ac.topn), each row of which contains the columns I'd like from the corresponding row in the xts object.
For example, given the following top two rows of temp and ac.topn,
temp
v1 v2 v3 v4 v5 v6 v7
1997-09-30 14.5 8.7 -5.8 2.6 4.7 1.9 17.2
1997-10-31 6.0 -2.0 -25.7 2.9 4.9 9.6 8.4
head(ac.topn)
Rank1 Rank2 Rank3
1997-09-30 7 4 2
1997-10-31 6 5 7
I would like to get the result:
1997-09-30 17.2 2.6 8.7 (elements 7, 4, and 2 from the first row of temp)
1997-10-31 9.6 4.9 8.4 (elements 6, 5, 7 from the second row of temp)
My first attempt was temp[,ac.topn]. I've browsed for help, but am struggling to word my request effectively.
Thank you.
Well, this works, but I've got to think there's a better way...
result <- do.call(rbind,lapply(index(temp),function(i)temp[i,ac.topn[i]]))
colnames(result) <- colnames(ac.topn)
result
# Rank1 Rank2 Rank3
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
You may subset a matrix version of the xts object, using indexing via a numeric matrix:
m <- as.matrix(temp)
cols <- as.vector(ac.topn)
rows <- rep(1:nrow(ac.topn), ncol(ac.topn))
vals <- m[cbind(rows, cols)]
xts(x = matrix(vals, nrow = nrow(temp)), order.by = index(temp))
# [,1] [,2] [,3]
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
However, I say the same as @jlhoward: I've got to think there's a better way...
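The matrix-indexing idea can be tried without xts at all. The sketch below rebuilds the two example rows as plain base R matrices (temp_m and topn are stand-in names) and uses row() to generate the row half of each (row, column) index pair.

```r
# The two example rows from the question, as plain matrices
temp_m <- matrix(c(14.5,  8.7,  -5.8, 2.6, 4.7, 1.9, 17.2,
                    6.0, -2.0, -25.7, 2.9, 4.9, 9.6,  8.4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("1997-09-30", "1997-10-31"),
                                 paste0("v", 1:7)))
topn <- matrix(c(7, 4, 2,
                 6, 5, 7),
               nrow = 2, byrow = TRUE,
               dimnames = list(rownames(temp_m), paste0("Rank", 1:3)))

# A two-column matrix of (row, column) pairs picks one element per pair
idx <- cbind(as.vector(row(topn)), as.vector(topn))
picked <- matrix(temp_m[idx], nrow = nrow(topn), dimnames = dimnames(topn))
picked
```

Both vectors in idx are in column-major order, so reshaping with matrix() puts each value back in the Rank column it came from.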

Aggregate a RANGE of values using R language

I have a CSV file with more than 2000 rows and 8 columns. The schema of the CSV is as follows.
col0 col1 col2 col3......
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44........
.
.
.
2000rows with data as above
I am trying to aggregate specific sets of rows (set 1: rows 1-76, set 2: rows 96-121, ...) from the above CSV, e.g. the rows between 1.77 and 9.34, across all columns; the aggregate of each set would be one row in my output file. I have tried various methods, but I could only do it for a single set in the CSV file.
The output would be a csv file having aggregate values of the specified intervals like follows.
col0 col1 col2 col3
3.25 8.2 4.4 3.3 //(aggregate of rows 1-3)
2.2 3.3 9.9 1.2 //(aggregate of rows 6-10)
and so on..
Considering what Manetheran points out, you should, if not already done, add a column showing which row belongs to which set.
The data.table-way:
require(data.table)
set.seed(123)
dt <- data.table(col1=rnorm(100),col2=rnorm(100),new=rep(c(1,2),each=50))
dt[,lapply(.SD,mean),by="new"]
new col1 col2
1: 1 0.03440355 -0.25390043
2: 2 0.14640827 0.03880684
You can replace mean with any other aggregate function (sum, median, and so on).
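Building the set column from row ranges could look like the sketch below. It uses base R only; the ranges, the column name set and the choice of sum are illustrative assumptions. Rows that fall in no range keep NA and are dropped by aggregate()'s default NA handling.

```r
df <- data.frame(col0 = c(1.77, 2.34, 5.34, 9.34, 7.32, 3.77, 2.51, 5.44, 1.12),
                 col1 = c(9.1, 6.3, 6.3, 6.3, 4.5, 2.3, 1.4, 5.7, 6.1))

# Illustrative row ranges: c(first_row, last_row) for each set
ranges <- list(c(1, 3), c(4, 7), c(8, 9))

# Label each row with the index of the set it belongs to (NA if none)
df$set <- NA_integer_
for (k in seq_along(ranges)) {
  df$set[ranges[[k]][1]:ranges[[k]][2]] <- k
}

# One output row per set; swap sum for mean or another aggregate as needed
agg <- aggregate(cbind(col0, col1) ~ set, data = df, FUN = sum)
agg
```

The result can then be written out with write.csv(agg, ...) to get one row per set.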
Here's a possible approach using the base packages:
# Arguments:
# - a data.frame
# - a list of row ranges passed as a list
# of vectors=[startRowIndex,endRowIndex]
# used to split the data.frame into sub-data.frames
# - a function that takes a sub-data.frame and returns
# the aggregated result
aggregateRanges <- function(DF,ranges,FUN){
l <- lapply(ranges,function(x){
return(FUN(DF[x[1]:x[2],]))
}
)
return(do.call(rbind.data.frame,l))
}
# example data
data <- read.table(
header=TRUE,
text=
"col0 col1 col2 col3
1.77 9.1 9.2 8.8
2.34 6.3 0.9 0.44
5.34 6.3 0.9 0.44
9.34 6.3 0.9 0.44
7.32 4.5 0.3 0.42
3.77 2.3 0.8 0.13
2.51 1.4 0.7 0.21
5.44 5.7 0.7 0.18
1.12 6.1 0.6 0.34")
# e.g. aggregate by summing sub-data.frames rows
result <-
aggregateRanges(
data,
ranges=list(c(1,3),c(4,7),c(8,9)),
FUN=function(dfSubset) {
rowsum.data.frame(dfSubset,group=rep.int(1,nrow(dfSubset)))
}
)
> result
col0 col1 col2 col3
1 9.45 21.7 11.0 9.68
11 22.94 14.5 2.7 1.20
12 6.56 11.8 1.3 0.52
