How to apply dplyr arrange function within a loop? - r

Hello, I want to apply the dplyr arrange function to a column within a for loop, but for some reason it does not work. Here is a minimal example:
for (j in colnames(df1)[3:ncol(df1)]){
  # create data frame for each column
  t <- select(df1, all_of(j))
  t %>% arrange(j)
  var_list[[j]] <- t
  for (i in var_list[[j]]$Timestep )
    ### arrange each timestep df by each column once
    scenario[[i+1]] <- var_list[[j]][min:max,]
  # subset data to the scenarios of interest
}
I guess the problem is that j delivers a character string ("variable"), but dplyr arrange requires the column name without quotes. I have tried as.name(), paste(), and eval(parse()), but none of them worked. Any ideas? Thank you!

This seems to work:
library(dplyr)
df1 <- mtcars
var_list <- list()
for (j in colnames(df1)[3:ncol(df1)]){
  # create data frame for each column
  t <- select(df1, all_of(j))
  # sort by the first (and only) column and keep the result
  var_list[[j]] <- t %>% arrange_at(1)
  for (i in var_list[[j]]$Timestep )
    ### arrange each timestep df by each column once
    scenario[[i+1]] <- var_list[[j]][min:max,]
  # subset data to the scenarios of interest
}
var_list
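Two things stand out in the question's loop: `arrange(j)` sorts by the constant string held in `j` rather than by the column it names, and the sorted result is never assigned back. A minimal sketch of one way around both, assuming dplyr 1.0 or later and using mtcars as stand-in data (as in the answer above), is to look the column up through the `.data` pronoun:

```r
library(dplyr)

df1 <- mtcars          # stand-in data, not the asker's data frame
var_list <- list()

for (j in colnames(df1)[3:ncol(df1)]) {
  # .data[[j]] refers to the column whose name is stored in j
  var_list[[j]] <- df1 %>%
    select(all_of(j)) %>%
    arrange(.data[[j]])
}

head(var_list[["disp"]])
```

Each list element is a one-column data frame sorted in ascending order.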

Unfortunately, that does not work. It should sort column j in ascending order. Maybe arrange is not the best function to use; however, order and sort do not work either. I can't see what I am doing wrong. However, I found a workaround using the long data format:
for (j in colnames(df1)[3:ncol(df1)]){
  # create data frame for each variable, keeping the id columns
  t <- select(df1, Trial, Timestep, all_of(j))
  # melt() comes from reshape2 (data.table also provides one)
  var_list[[j]] <- melt(t, id.vars=c("Trial", "Timestep"))
  var_list[[j]] <- var_list[[j]] %>% arrange(value)
}

Related

How to insert new column names to a tibble in r?

I have the following tibbles.
tmp <- tibble()
tmp2 <- tibble()
tmp <- tmp %>% rbind( colSums( y_matrix) )
tmp2 <- tmp2 %>% rbind( proportions( colSums( y_matrix )))
data <- bind_cols(tmp,tmp2)
I want to add column names to "data" accordingly. The number of columns in tmp and tmp2 changes from time to time, so how can I add column names without defining them one by one?
The expected column names in the output are:
c1 c2 c1_prop c2_prop
Is there any method to create this?
I don't have enough reputation to comment, so here is a data.table solution, whose result you could always pass to as_tibble(). If this isn't what you were after, could you post an explicit example of the data and the expected output?
library(data.table)
setDT(data)
setnames(data, ncol(tmp)+(1:ncol(tmp2)), paste0(names(tmp),"_prop"))
However, wouldn't it just be better to name the columns correctly before merging?
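As a rough sketch of naming the columns before combining: the example below assumes a small made-up `y_matrix` with column names c1 and c2 (the asker's real matrix is not shown), and relies on `colSums()` and `proportions()` both returning named vectors.

```r
library(dplyr)

# Hypothetical stand-in for the asker's y_matrix
y_matrix <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("c1", "c2")))

totals <- colSums(y_matrix)        # named vector: c1, c2
props  <- proportions(totals)      # keeps the names of totals
names(props) <- paste0(names(totals), "_prop")

# One-row tibble whose column names come from the vector names
data <- as_tibble(as.list(c(totals, props)))
names(data)
```

Because the suffix is pasted onto whatever names `totals` carries, nothing needs to change when the number of columns does. If the matrix has no column names, they would have to be assigned first.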

For loop in R for creating new data frames with respect to rows of a particular column

Hello, I have created a for loop to split my data by the values of a certain column, like so:
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
R doesn't like this: it says team_ not found. The code does run if I first create a list, like so:
team_ <- list()
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
This works. However, I get a list with thousands of empty items and just a few that contain my filtered data sets.
Is there a simpler way to create the data sets without this list?
Thank you.
A simpler option is split from base R, which would be faster than using == to subset in a loop:
team_ <- split(df, df$team)
If we want to do some operations for each row in the tidyverse, it can be done with rowwise:
library(dplyr)
df %>%
rowwise %>%
... step of operations ...
or with group_by
df %>%
group_by(team) %>%
...
The methods akrun suggests are much better than a loop, but you should understand why this isn't working. Remember that for(i in 1:nrow(df)) will give you one list item for i = 1, i = 2, and so on, right up to i = nrow(df), which is several thousand by the sound of things. If you don't have any rows where team is 1, you will get an empty data frame as the first item, and the same will be true for every other value of i that isn't represented.
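A loop like this would work (the list is indexed by name, so that numeric team values do not create empty slots):
team_ <- list()
for(i in unique(df$team)) team_[[as.character(i)]] <- df %>% filter(team == i)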
But I would stick to a non-looping method as described by akrun.
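To make the difference concrete, here is a small sketch with made-up data (the team numbers and scores are invented for illustration). `split()` returns one element per team that actually occurs, named after the team, so no empty slots appear:

```r
# Hypothetical data: only teams 3, 7 and 12 exist
df <- data.frame(team  = c(3, 3, 7, 12),
                 score = c(10, 20, 30, 40))

team_ <- split(df, df$team)

names(team_)    # one name per team present in the data
team_[["3"]]    # the rows belonging to team 3
```

Elements are looked up by name (`team_[["12"]]`), not by position, which is why sparse team numbers cause no padding.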

Lagged values multiple columns with function in R

I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
x[pos]
}
Second, I would like to apply this function to several columns. firm.characteristics is the list of columns for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]
for(i in 1:length(firm.characteristics)){
df <- df %>%
group_by(company) %>%
mutate(!!paste0("lag_", i) := mleadlag(df[[i]] ,-1, fye)) %>%
ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row of year t-1. It didn't group by company and compute the lagged values.
Can anyone tell me what is wrong in the loop, or what I should do to get the correct lagged values?
Thank you so much for your help.
A reproducible sample could look like this:
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(company=1:n,
fye=2009,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat2 <- data.frame(company=1:n,
fye=2010,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat3 <- data.frame(company=1:n,
fye=2011,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
df <- rbind(dat,dat2,dat3)
I would try to stay away from loops in the tidyverse. Many operations that would traditionally require a loop already have a vectorised tidyverse equivalent that is fast and, in my opinion, more intuitive. This is a great use case for dplyr's across() functionality. I first changed the df to a tibble.
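df %>%
  as_tibble() %>%
  group_by(company) %>%
  mutate(
    # all_of() selects the columns named in the character vector;
    # .names keeps the originals and adds lag_<column> alongside them
    across(all_of(firm.characteristics), ~lag(., 1L), .names = "lag_{.col}")
  ) %>%
  ungroup()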
This generates the required lagged values. For more information see dplyr's across documentation.
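As for why the original loop ignored the grouping: inside mutate(), `df[[i]]` pulls the whole column from the ungrouped data frame instead of the current group's slice, and `i` runs over 1 to 3, which indexes the first three columns of df, not the ones named in firm.characteristics. A small sketch with invented numbers (two companies, three years) shows the grouped lag; sorting by `fye` first is an added assumption so that "previous row" means "previous year":

```r
library(dplyr)

# Hypothetical toy panel: 2 companies x 3 fiscal years
df <- data.frame(company = rep(1:2, times = 3),
                 fye     = rep(2009:2011, each = 2),
                 x       = c(1, 2, 3, 4, 5, 6))

lagged <- df %>%
  arrange(company, fye) %>%          # previous row = previous year
  group_by(company) %>%
  mutate(across(all_of("x"), ~ lag(.x, 1L), .names = "lag_{.col}")) %>%
  ungroup()

lagged
```

The first year of each company gets NA, and every later year carries that company's own previous value.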

apply a function that scores values of adjacent columns over a range of columns

I'd like to apply a function to a range of columns of a data frame. The function calculates the difference between two values from adjacent columns and scores the difference based on one of the input values. The score should appear as a new column next to one of the columns that was used for the calculation. I wrote a function which does the job for single vectors/columns, but I got stuck when I tried to use it with mutate_at over a range of columns.
Here is what I tried so far:
# data
set.seed(123)
df <-data.frame(d1= 20,
d2= seq(20,15,-0.1)[1:50],
d3= seq(20,15,-0.1)[1:50]+ rnorm(50,0,3))
# scoring function
f_score <- function(a,b){
ifelse(a-b>=a*0.2,"high",
ifelse(a*0.2>a-b & a-b>=a*0.15,"mid",
ifelse(a*0.15>a-b & a-b>=a*0.1,"low","ok")))
}
# scoring function works for single columns
f_score(df$d1,df$d2) %>% setNames(round(df$d1-df$d2,2))
# and scoring function works this way,too
f_score(df[,1:2],df[,2:3])
# I can easily do this
df1 <- mutate(df,score=f_score(d1,d2))
df1
# this comes close to what I want to achieve
df2 <- df %>% mutate_at(vars(names(.)[2:3]), .funs= funs(score= f_score(d1,.)))
df2
#but the second calculation should use the values from d2 instead of d1
#I would like to do something like this
df3 <- df %>% mutate_at(vars(names(.)[2:3]), .funs= funs(score=f_score(c(1:2),.)))
#but this is not working
# or
df3 <- df %>% mutate_at(vars(names(.)[2:3]), .funs= funs(score=f_score(df[1:2],.)))
# I would like to end up with something like this
df4 <- mutate_at(df, vars(c(d2)), .funs= funs(score_d2= f_score(d1,.)))
df4 <- mutate_at(df4, vars(c(d3)), .funs= funs(score_d3= f_score(d2,.)))
df4 <- select(df4,d1,d2, score_d2, d3, score_d3)
I am quite new to R and SO, and I hope I have made my problem clear. Any help, and any explanation of the problem with my code, is highly appreciated.
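One possible approach, offered only as a sketch and not as an answer from the thread, is to loop over adjacent column pairs in base R and score each column against its left neighbour. `f_score` is copied from the question; the three-row data frame is invented so the result is easy to check by hand.

```r
# Scoring function from the question, unchanged
f_score <- function(a,b){
  ifelse(a-b>=a*0.2,"high",
         ifelse(a*0.2>a-b & a-b>=a*0.15,"mid",
                ifelse(a*0.15>a-b & a-b>=a*0.1,"low","ok")))
}

# Hypothetical small data set
df <- data.frame(d1 = 20,
                 d2 = c(20, 16.5, 15),
                 d3 = c(19, 13, 13.2))

cols <- names(df)                    # captured before score columns are added
for (k in 2:length(cols)) {
  # score column k against its left-hand neighbour k - 1
  df[[paste0("score_", cols[k])]] <- f_score(df[[cols[k - 1]]], df[[cols[k]]])
}

# Put each score next to the column it belongs to
df <- df[, c("d1", "d2", "score_d2", "d3", "score_d3")]
df
```

Here score_d2 compares d1 with d2 and score_d3 compares d2 with d3, which is the pairing the question asks for.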

A better way to split apply and combine in R using sp::merge() as function

I have a data frame with 500 observations for each of 3106 US counties. I would like to merge that data frame with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets. I presume that I can then rbind them back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the data frame with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID,~iter_id,~neat_variable,
01001,1,"value_1",
01003,1,"value_2",
...
01001,2,"value_3",
01003,2,"value_4",
...
01001,500,"value_5",
01003,500,"value_6")
filtered <- gm_y_corr %>%
filter(iter_id ==1)
us.gm <- sp::merge(us, filtered ,by='GEOID')
for (j in 2:500) {
tmp2 <- gm_y_corr %>%
filter(iter_id == j)
tmp3 <- sp::merge(us, tmp2,by='GEOID')
us.gm <- rbind(us.gm,tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R's split or the more recent dplyr::group_split. This separates your data frame according to your splitting variable; you can then lapply or purrr::map a function such as merge over the pieces and use dplyr::bind_rows to collapse the returned list back into a data frame. Since I can't manage to get the us data, I have just written what I suspect would work.
gm_y_corr %>%
  group_by(iter_id) %>% # group
  group_split() %>% # split
  lapply(., function(x){ # apply merge(us, x, by = "GEOID") to each list element
    merge(us, x, by = "GEOID")
  }) %>%
  bind_rows() # collapse to data frame
Equivalently, the same can be done with base R functionality. The newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to use only group_by, you would have to follow up with dplyr::do, which I believe does something similar to the above without you having to split the data yourself.
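The split, apply, combine pattern itself can be tried out on plain data frames. In the sketch below, `regions` is a made-up stand-in for the spatial object and `draws` for gm_y_corr; whether the final combining step behaves the same way for a SpatialPolygonsDataFrame is not verified here.

```r
# Hypothetical stand-ins: 2 regions, 3 iterations
regions <- data.frame(GEOID = c("01001", "01003"),
                      name  = c("A", "B"))
draws <- data.frame(GEOID   = rep(c("01001", "01003"), times = 3),
                    iter_id = rep(1:3, each = 2),
                    value   = 1:6)

pieces   <- split(draws, draws$iter_id)            # split by iteration
merged   <- lapply(pieces, function(x) {
  merge(regions, x, by = "GEOID")                  # one-to-one within a piece
})
combined <- do.call(rbind, merged)                 # stack the pieces again

combined
```

Within each piece every GEOID occurs once, so the merge no longer sees non-unique matches.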

Resources