How to subset and apply a function across the dataset


I have a list of prices for different items in the same dataset.
abc1 <- c("2005-09-18", "ABC", 99.00)
abc2 <- c("2005-09-19", "ABC", 98.00)
abc3 <- c("2005-09-20", "ABC", 98.50)
abc4 <- c("2005-09-21", "ABC", 97.75)
def1 <- c("2005-09-14", "DEF", 79.00)
def2 <- c("2005-09-15", "DEF", 78.00)
def3 <- c("2005-09-16", "DEF", 78.50)
def4 <- c("2005-09-20", "DEF", 77.75)
df <- data.frame(rbind(abc1, abc2, abc3, abc4, def1, def2, def3, def4))
The code above produces the following table:
X1 X2 X3
abc1 2005-09-18 ABC 99
abc2 2005-09-19 ABC 98
abc3 2005-09-20 ABC 98.5
abc4 2005-09-21 ABC 97.75
def1 2005-09-14 DEF 79
def2 2005-09-15 DEF 78
def3 2005-09-16 DEF 78.5
def4 2005-09-20 DEF 77.75
I would like to add a column, say X4, holding the percentage change of each day's price versus the previous day's price, within each item in X2. So X4 would have the following values:
X4
0,0%
-1,0%
0,5%
-0,8%
0,0%
-1,3%
0,6%
-1,0%
The goal is to do this for every item in X2, ideally without splitting the table. I think the dates will always be in the right order, but I would like to be safe in case they are not.

We can group by 'X2' and take the difference of adjacent elements with diff
library(dplyr)
df %>%
group_by(X2) %>%
mutate(X4 = c(0, diff(X3)))
Or, after grouping by 'X2', take the difference between 'X3' and the lag of 'X3':
df %>%
group_by(X2) %>%
mutate(X4 = X3 - lag(X3, default = first(X3)))

Just a little hint: You wanted to calculate the difference in percent, not the absolute difference.
You have to adjust the formula to do so, otherwise your results are wrong :-).
df %>%
dplyr::group_by(X2) %>%
dplyr::mutate(X4 = (X3/lag(X3, default = first(X3)) - 1) * 100)
X1 X2 X3 X4
<fct> <fct> <dbl> <dbl>
1 2005-09-18 ABC 99 0
2 2005-09-19 ABC 98 -1.01
3 2005-09-20 ABC 98.5 0.510
4 2005-09-21 ABC 97.8 -0.761
5 2005-09-14 DEF 79 0
6 2005-09-15 DEF 78 -1.27
7 2005-09-16 DEF 78.5 0.641
8 2005-09-20 DEF 77.8 -0.955
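
One caveat, as a sketch rather than a definitive fix: the data frame in the question is built from c("2005-09-18", "ABC", 99.00), which coerces every value to character, so X3 has to be numeric before diff or lag can be used. The base-R version below assumes numeric prices and Date-typed dates, sorts by item and date first (the "just in case" from the question), and uses a helper named pct_change that is purely illustrative.

```r
# Assumption: prices are stored as numbers and dates as Date objects.
prices <- data.frame(
  X1 = as.Date(c("2005-09-18", "2005-09-19", "2005-09-20", "2005-09-21",
                 "2005-09-14", "2005-09-15", "2005-09-16", "2005-09-20")),
  X2 = rep(c("ABC", "DEF"), each = 4),
  X3 = c(99.00, 98.00, 98.50, 97.75, 79.00, 78.00, 78.50, 77.75)
)

# Sort by item and date so that "previous row" means "previous day".
prices <- prices[order(prices$X2, prices$X1), ]

# Hypothetical helper: percent change versus the previous element,
# with 0 for the first element of each group.
pct_change <- function(p) c(0, diff(p) / head(p, -1) * 100)

# ave() applies the helper within each item without splitting the table.
prices$X4 <- ave(prices$X3, prices$X2, FUN = pct_change)
print(round(prices$X4, 2))
```

Rounded to two decimals this gives 0, -1.01, 0.51, -0.76 for ABC and 0, -1.27, 0.64, -0.96 for DEF, in line with the dplyr output above.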

Related

Create new variables based on the names of other variables

I have a dataset that looks something like this:
"","region","region_a_price_raw","region_b_price_raw","region_c_price_raw","region_a_adjusted","region_b_adjusted","region_c_adjusted","region_a_pct_chng","region_b_pct_chng","region_c_pct_chng"
"1","C",0.691277900885566,-1.12168419402904,-1.80708124084338,-0.823054962637259,-1.56205680347623,2.39150423647063,94,43,100
"2","B",-0.917718229751991,0.35628937645658,0.587525813366388,0.839040270582852,0.240455566072964,-0.281641015285604,27,48,21
"3","B",1.2846493277039,0.13190349180679,1.26024317859471,-0.971360861843787,0.257888869705433,-0.979961536031851,92,64,82
What I need to do is create a new variable that has the price variable for each region, for the raw, adjusted and pct_chng variables.
I know how to do this manually. However, there are a lot of regions (far more than the three in the example), as well as multiple percent change variables (I only included one here for sake of brevity).
So what I'm hoping is that, since each relevant price variable includes the region name in its own variable name, I can write a function that automatically detects the region from the variable name. Unfortunately, I don't know how to do this elegantly at present.
library(dplyr)
#creating sample data
df1 <- data.frame(region = sample(LETTERS[1:3],15,replace = TRUE), region_a_price_raw = rnorm(15),region_b_price_raw=rnorm(15),region_c_price_raw=rnorm(15))
df2 <- data.frame(region_a_adjusted=rnorm(15),region_b_adjusted=rnorm(15),region_c_adjusted=rnorm(15))
df3 <- data.frame(region_a_pct_chng=sample(1:100,15,replace = TRUE),region_b_pct_chng=sample(1:100,15,replace = TRUE),region_c_pct_chng=sample(1:100,15,replace = TRUE))
sample <- cbind(df1,df2,df3)
#here's how it would work manually. this would take forever in the actual dataset though
sample <- sample %>%
mutate(price_raw=case_when(region=="A"~region_a_price_raw,
region=="B"~region_b_price_raw,
region=="C"~region_c_price_raw)) %>%
mutate(price_adjusted=case_when(region=="A"~region_a_adjusted,
region=="B"~region_b_adjusted,
region=="C"~region_c_adjusted)) %>%
mutate(pct_chng=case_when(region=="A"~region_a_pct_chng,
region=="B"~region_b_pct_chng,
region=="C"~region_c_pct_chng))
I'm hoping someone has a way to do this that won't have me manually doing this across each region and price variable.

(I think there's a more direct way than this for combining the last three lines into one using a little regex...)
library(dplyr); library(tidyr)
sample %>%
mutate(row = row_number()) %>%
pivot_longer(-c(row, region)) %>%
separate(name, c("drop", "region", "type"), sep = "_", extra = "merge") %>%
pivot_wider(names_from = type, values_from = value)
Result
# A tibble: 45 × 6
row drop region price_raw adjusted pct_chng
<int> <chr> <chr> <dbl> <dbl> <dbl>
1 1 region a 0.222 -0.869 92
2 1 region b 0.149 -0.972 19
3 1 region c 1.04 0.116 94
4 2 region a -0.844 -0.755 13
5 2 region b -0.963 -0.547 81
6 2 region c 0.198 1.38 61
7 3 region a 0.444 -0.130 48
8 3 region b -0.0665 -1.69 13
9 3 region c -1.63 0.574 56
10 4 region a 0.0558 -1.00 7
# … with 35 more rows

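You did not set a seed for your sample data, so I will use the three rows shown above (see Data below). The pattern uses ([^_]+) so that region names longer than one character also match:
library(dplyr); library(tidyr)
sample %>%
pivot_longer(-region, names_to = c('grp', '.value'),
names_pattern = 'region_([^_]+)_(.+)$') %>%
filter(tolower(region) == grp)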
region grp price_raw adjusted pct_chng
<chr> <chr> <dbl> <dbl> <int>
1 C c -1.81 2.39 100
2 B b 0.356 0.240 48
3 B b 0.132 0.258 64
Data
sample <- structure(list(region = c("C", "B", "B"), region_a_price_raw = c(0.691277900885566,
-0.917718229751991, 1.2846493277039), region_b_price_raw = c(-1.12168419402904,
0.35628937645658, 0.13190349180679), region_c_price_raw = c(-1.80708124084338,
0.587525813366388, 1.26024317859471), region_a_adjusted = c(-0.823054962637259,
0.839040270582852, -0.971360861843787), region_b_adjusted = c(-1.56205680347623,
0.240455566072964, 0.257888869705433), region_c_adjusted = c(2.39150423647063,
-0.281641015285604, -0.979961536031851), region_a_pct_chng = c(94L,
27L, 92L), region_b_pct_chng = c(43L, 48L, 64L), region_c_pct_chng = c(100L,
21L, 82L)), class = "data.frame", row.names = c(NA, 3L))
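
For comparison, here is a base-R sketch of the same idea that builds the column name from the region and looks it up with matrix indexing. It assumes the naming scheme region_<lowercase region>_<suffix> from the question; pick_by_region is a hypothetical helper, and the values are the question's three rows rounded to two decimals.

```r
# Question's three rows, rounded to two decimals (adjusted columns omitted).
regions <- data.frame(
  region             = c("C", "B", "B"),
  region_a_price_raw = c(0.69, -0.92, 1.28),
  region_b_price_raw = c(-1.12, 0.36, 0.13),
  region_c_price_raw = c(-1.81, 0.59, 1.26),
  region_a_pct_chng  = c(94L, 27L, 92L),
  region_b_pct_chng  = c(43L, 48L, 64L),
  region_c_pct_chng  = c(100L, 21L, 82L)
)

# Hypothetical helper: for each row, pick the value from the column whose
# name contains that row's region.
pick_by_region <- function(data, suffix) {
  wanted <- paste0("region_", tolower(data$region), "_", suffix)
  stopifnot(all(wanted %in% names(data)))
  # Keep only the needed columns so as.matrix() stays numeric.
  values <- as.matrix(data[unique(wanted)])
  # A two-column (row, column) index matrix selects one cell per row.
  values[cbind(seq_len(nrow(data)), match(wanted, colnames(values)))]
}

regions$price_raw <- pick_by_region(regions, "price_raw")
regions$pct_chng  <- pick_by_region(regions, "pct_chng")
print(regions[c("region", "price_raw", "pct_chng")])
```

With many suffixes, the two assignment lines could be replaced by a loop over a character vector of suffixes.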

How to prevent R from rounding in frequency function?

I used the freq function of frequency package to get frequency percent on my dataset$MoriskyAdherence, then R gives me percent values with rounding. I need more decimal places.
MoriskyAdherence=dataset$MoriskyAdherence
freq(MoriskyAdherence)
The result shows Percent values of 35.0, 41.3 and 23.8, which sum to 100.1.
The exact values should be 35.0, 41.25 and 23.75.
What should I do?
I tried sprintf, as.data.frame, formatC, and some other functions to deal with it, but without success.

The function freq returns a character data frame and has no option to adjust the number of decimal places. However, it is easy to rebuild the table in whatever format you want. For example, I have written this function, which gives the same result but with two decimal places instead of one:
freq2 <- function(data_frame)
{
df <- frequency::freq(data_frame)
lapply(df, function(x)
{
n <- suppressWarnings(as.numeric(x$Freq))
sum_all <- as.numeric(x$Freq[nrow(x)])
raw_percent <- suppressWarnings(100 * n / sum_all)
t_row <- grep("Total", x[,2])[1]
valid_percent <- suppressWarnings(100*n / as.numeric(x$Freq[t_row]))
x$Percent <- format(round(raw_percent, 2), nsmall = 2)
x$'Valid Percent' <- format(round(valid_percent, 2), nsmall = 2)
x$'Cumulative Percent' <- format(round(cumsum(valid_percent), 2), nsmall = 2)
x$'Cumulative Percent'[t_row:nrow(x)] <- ""
x$'Valid Percent'[(t_row + 1):nrow(x)] <- ""
return(x)
})
}
Now instead of
freq(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.0 35.0 35.0
#> 3 Low Adherence 66 41.3 41.3 76.3
#> 4 Medium Adherence 38 23.8 23.8 100.0
#> 41 Total 160 100.0 100.0
#> 1 Missing <blank> 0 0.0
#> 5 <NA> 0 0.0
#> 7 Total 160 100.0
you can do
freq2(MoriskyAdherence)
#> Building tables
#> |===========================================================================| 100%
#> $`x:`
#> x label Freq Percent Valid Percent Cumulative Percent
#> 2 Valid High Adherence 56 35.00 35.00 35.00
#> 3 Low Adherence 66 41.25 41.25 76.25
#> 4 Medium Adherence 38 23.75 23.75 100.00
#> 41 Total 160 100.00 100.00
#> 1 Missing <blank> 0 0.00
#> 5 <NA> 0 0.00
#> 7 Total 160 100.00
which is exactly what you were looking for.

Two (potential) solutions:
Solution #1:
Make changes inside the function freq. This can be done by retrieving the function's code with the command freq (without round brackets), or by retrieving the code, with comments, from https://rdrr.io/github/wilcoxa/frequencies/src/R/freq.R.
My hunch is that to obtain more decimals, changes must be implemented at this point in the code:
# create a list of frequencies
message("Building tables")
all_freqs <- lapply_pb(names(x), function(y, x1 = as.data.frame(x), maxrow1 = maxrow, trim1 = trim){
makefreqs(x1, y, maxrow1, trim1)
})
Solution #2:
If you're only after percentages with more decimals, you can use aggregate. Let's suppose your data has this structure: a dataframe with two variables, one numeric, one a factor by which you want to group:
set.seed(123)
Var1 <- sample(LETTERS[1:4], 10, replace = T)
Var2 <- sample(10:100, 10, replace = T)
df <- data.frame(Var1, Var2)
Var1 Var2
1 B 97
2 D 51
3 B 71
4 D 62
5 D 19
6 A 91
7 C 32
8 D 13
9 C 39
10 B 96
Then, to obtain your percentages by factor, you would use aggregate thus:
aggregate(Var2 ~ Var1, data = df, function(x) sum(x)/sum(df$Var2)*100)
Var1 Var2
1 A 15.93695
2 B 46.23468
3 C 12.43433
4 D 25.39405
You can control the number of decimals by using round:
aggregate(Var2 ~ Var1, data = df, function(x) round(sum(x)/sum(df$Var2)*100,3))
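
If only the frequency percentages from the original question are needed, a base-R sketch with table and prop.table avoids the frequency package altogether. The vector below is reconstructed from the counts in the freq output (56, 66 and 38); the real dataset$MoriskyAdherence is not available here, so treat it as a stand-in.

```r
# Stand-in for dataset$MoriskyAdherence, rebuilt from the reported counts.
adherence <- rep(c("High Adherence", "Low Adherence", "Medium Adherence"),
                 times = c(56, 66, 38))

counts <- table(adherence)                 # absolute frequencies
pct    <- round(100 * prop.table(counts), 2)  # percentages, two decimals

freq_table <- data.frame(
  label   = names(counts),
  Freq    = as.vector(counts),
  Percent = as.vector(pct)
)
print(freq_table)
```

The second argument of round controls how many decimals are kept, and the percentages here sum to exactly 100.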

Why is R not ordering my tibble in the correct order?

I've got a dataframe with two columns that I need to arrange in chronological order and then combine. R is strangely placing the integer 100 just after 10. I can't figure out how to stop this behavior.
Here is a reprex example.
library(tidyverse)
library(glue)
set.seed(123)
df <- tibble(x = 0:100,
y = sample(0:100, 101, T))
df_i <- df %>%
mutate(id = row_number(),
z = glue('{x}.{y}'))
df_i %>%
arrange(z)
# A tibble: 101 x 4
x y id z
<int> <int> <int> <glue>
1 0 30 1 0.30
2 1 78 2 1.78
3 10 24 11 10.24
4 100 22 101 100.22
5 11 89 12 11.89
6 12 90 13 12.90
7 13 68 14 13.68
8 14 90 15 14.90
9 15 56 16 15.56
10 16 91 17 16.91
# … with 91 more rows
You can see that the fourth row is not in the correct order. It looks like the x and y columns are not in order either. I feel like this is something trivial but it's causing some sneaky problems.

'z' is a glue object. According to ?glue ("Format and interpolate a string"), its output is a string, so it needs to be converted to numeric before sorting:
df_i %>%
arrange(as.numeric(z))
If we check the glue source code, glue calls glue_data, which in turn calls as_glue, and the default as_glue method converts its input to character:
methods('as_glue')
getAnywhere('as_glue.default')
#function (x, ...)
#{
# as_glue(as.character(x))
#}
The behaviour is similar to sorting a character vector of numbers
sort(c('1', '2', '10', '20', '100'))
#[1] "1" "10" "100" "2" "20"
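
A small base-R sketch of the three possible orderings may help here. It uses paste instead of glue (an assumption made so the example has no package dependencies) and made-up values. Note that converting the combined string to numeric treats "2.5" as larger than "2.30", which may or may not be the intended order; sorting on the original numeric columns avoids that ambiguity.

```r
# Made-up values; z mimics the glue('{x}.{y}') column from the question.
x <- c(0, 1, 10, 100, 11, 2, 2)
y <- c(30, 78, 24, 22, 89, 5, 30)
z <- paste(x, y, sep = ".")

# Character sort (method = "radix" compares in the C locale): "100.22"
# lands right after "10.24", as in the question.
by_text <- sort(z, method = "radix")

# Numeric sort of the combined string: 2.30 comes before 2.5.
by_number <- z[order(as.numeric(z))]

# Sort on the source columns: x first, then y, so y = 5 comes before y = 30.
by_parts <- z[order(x, y)]

print(by_text)
print(by_number)
print(by_parts)
```

In dplyr terms the last variant would correspond to arranging by x and then y rather than by z.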

Elegant way of adding columns on a specific position in a data frame

I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
'rate' = c(7.50, 6.50, 5.54),
'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol=2, dimnames=list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from tibble package (Adding a column between two columns in a data.frame) does not work because it only accepts an atomic vector (hence if I have 10 lags I will have to call add_column 10 times). I could use apply in my rate_Lags matrix. Then, however, I lose the dimnames from my rate_Lags matrix.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?

You may be overlooking the following
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
cbind(df, rate_Lags)
} else {
cbind(select(df, 1:I), rate_Lags, select(df, (I+1):ncol(df)))
}
# date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50 NA NA 92
# 2 02/01/2000 6.50 7.50 NA 94
# 3 03/01/2000 5.54 5.54 7.5 96

Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M){
nms_DF <- colnames(DF)
nms_M <- colnames(M)
inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
cbind(DF[seq_len(inx)], M, DF[ seq_along(nms_DF)[-seq_len(inx)] ])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96

We could unclass the dataset to a list, use append to insert 'rate_Lags' at a specific position, and then convert the list back to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with tidyverse
library(tidyverse)
rate_Lags %>%
as_tibble %>%
append(unclass(df), ., after = i1) %>%
bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
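
The same idea can also be wrapped in a small reusable function that works purely by column name. This is a sketch under the assumption that the new columns arrive as a matrix with column names; insert_after is a hypothetical helper, and a plain data.frame stands in for the tibble.

```r
rates <- data.frame(
  date  = c("01/01/2000", "02/01/2000", "03/01/2000"),
  rate  = c(7.50, 6.50, 5.54),
  price = c(92, 94, 96)
)
rate_lags <- matrix(c(NA, 7.50, 5.54, NA, NA, 7.50), ncol = 2,
                    dimnames = list(NULL, c("rate_tMinus1", "rate_tMinus2")))

# Hypothetical helper: bind the new columns at the end, then reorder the
# columns by name so the new ones sit right after `after_name`.
insert_after <- function(data, new_cols, after_name) {
  pos <- match(after_name, names(data))
  stopifnot(!is.na(pos), !is.null(colnames(new_cols)))
  combined <- cbind(data, new_cols)
  combined[append(names(data), colnames(new_cols), after = pos)]
}

result <- insert_after(rates, rate_lags, "rate")
print(names(result))
```

Because the reordering is done by name, the helper also works when the anchor column is the last one.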

how to extract two columns of data using R by loop

I have a dataframe with 1000 columns of data
str(MT)
'data.frame': 1356 obs. of 1000 variables:
$ Date : Factor w/ 1356 levels "Apr-1900","Apr-1901",..: 453 340 792 1 905 679 566 114 1244 1131 ...
$ Year : int 1900 1900 1900 1900 1900 1900 1900 1900 1900 1900 ...
$ X1 : num -27.4 -27.8 -17 1.7 7.9 ...
$ X2 : num -27.21 -27.99 -17.05 1.69 7.75 ...
$ X3 : num -26.67 -27.84 -16.75 2.24 7.82 ...
$ X4 : num -26.64 -27.98 -16.83 2.46 7.97 ...
.....
$ X1000 : num -29.13 -30.61 -20.47 -0.46 6.5
I would like to split this dataframe into three-column subsets (Date, Year and Xn) using a loop, so that at the end of it all I will have 1000 separate csv files with 3 columns of data each. My code thus far is:
for (i in ncol(MT)) {
x[[i]]<-data.frame(MT$Date, Year, MT$[[i]]) }
However, it is giving me errors. Your guidance would be appreciated, as I am new to R.

Your code has some syntax and algorithm errors:
Your for loop is not looping through a range of values, it's "looping" once for i = ncol(MT), it should be (i in 1:ncol(MT)) ;
Actually, you shouldn't loop through all columns, since two of them aren't Xn, so (i in 1:(ncol(MT)-2));
It's not clear if you did, but you should create x before trying to allocate data to it, preferably with its final size;
You didn't use MT$ to select the Year column;
You used both $ and [[ to subset the Xn column. You should use just [ instead, because this way you get to use i and keep the column name.
Fixing all these, with some example data, you get:
MT <- data.frame(Date = rnorm(5), Year = rnorm(5), X1 = rnorm(5), X2 = rnorm(5), X3 = rnorm(5))
nX <- ncol(MT)-2
listofdf <- lapply(1:nX, function(x) NULL)
for (i in 1:nX) {
listofdf[[i]] <- data.frame(MT$Date, MT$Year, MT[i+2])
}
listofdf
# [[1]]
# MT.Date MT.Year X1
# 1 -0.94184053 1.0241134 -0.4329728
# 2 0.59637577 -0.6195477 -1.3011527
# 3 0.33474278 1.0628674 -0.8957239
# 4 -0.04328685 0.4275993 -0.7840214
# 5 0.78799652 0.5707058 -0.4243622
#
# [[2]]
# MT.Date MT.Year X2
# 1 -0.94184053 1.0241134 2.2380838
# 2 0.59637577 -0.6195477 -0.9995170
# 3 0.33474278 1.0628674 0.3452450
# 4 -0.04328685 0.4275993 -1.0453718
# 5 0.78799652 0.5707058 -0.6292885
#
# [[3]]
# MT.Date MT.Year X3
# 1 -0.94184053 1.0241134 -0.05293727
# 2 0.59637577 -0.6195477 0.84947635
# 3 0.33474278 1.0628674 1.17748809
# 4 -0.04328685 0.4275993 1.73233398
# 5 0.78799652 0.5707058 -0.61874653
If you're just going to save them as .csv files, it's not necessary to store in a list though. Instead, you can use:
for (i in 1:nX) {
tempdf <- data.frame(MT$Date, MT$Year, MT[i+2])
write.csv(tempdf, paste0("MT_subset_X", i, ".csv"))
}
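
A variant of the same loop that selects columns by name rather than by position might look like the sketch below. The Date values, the output directory under tempdir() and the file-name prefix are assumptions for illustration; row.names = FALSE is added so each file really has only three columns.

```r
# Small stand-in for MT; the Date values are made up.
MT <- data.frame(
  Date = c("Jan-1900", "Feb-1900", "Mar-1900"),
  Year = c(1900L, 1900L, 1900L),
  X1   = c(-27.4, -27.8, -17.0),
  X2   = c(-27.21, -27.99, -17.05)
)

id_cols    <- c("Date", "Year")
value_cols <- setdiff(names(MT), id_cols)   # every column except the ids

# Assumed output location; replace with a real directory as needed.
out_dir <- file.path(tempdir(), "MT_subsets")
dir.create(out_dir, showWarnings = FALSE)

for (col in value_cols) {
  out_file <- file.path(out_dir, paste0("MT_subset_", col, ".csv"))
  # Selecting by name keeps the original column names in the csv header.
  write.csv(MT[c(id_cols, col)], out_file, row.names = FALSE)
}
print(list.files(out_dir))
```

Selecting by name means the loop keeps working if Date and Year are not the first two columns.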

Reusing the sample data created by @Molx, and doing some reshaping with tidyr, as @Neal Fultz suggested in the comments:
# generate sample data
MT <- data.frame(Date = rnorm(5), Year = rnorm(5), X1 = rnorm(5), X2 = rnorm(5), X3 = rnorm(5))
Then gather all variables and values, excluding Date and Year, into key-value column pairs:
require(tidyr)
MTg <- gather(MT, var, value, -c(Date, Year))
MTg
Date Year var value
1 -1.5356474 -1.0963886 X1 -0.74075807
2 -1.1346928 0.2925819 X1 1.42787059
3 0.7031032 0.3361561 X1 -0.27112156
4 1.0140557 1.2587298 X1 0.85693377
5 0.2529787 -3.0113663 X1 0.12686607
6 -1.5356474 -1.0963886 X2 0.21406288
7 -1.1346928 0.2925819 X2 -1.11363330
8 0.7031032 0.3361561 X2 -0.30324978
9 1.0140557 1.2587298 X2 0.48954893
10 0.2529787 -3.0113663 X2 0.85898166
11 -1.5356474 -1.0963886 X3 -0.44394680
12 -1.1346928 0.2925819 X3 -0.86942530
13 0.7031032 0.3361561 X3 -1.62344294
14 1.0140557 1.2587298 X3 0.09880026
15 0.2529787 -3.0113663 X3 -0.76091871
Then run through all variable names, exporting each one into an individual csv file with the same name as var.
varnames <- unique(as.character(MTg$var)) # get variable names (var may be character, in which case levels() returns NULL)
dummy <- lapply(varnames, function(x)
write.csv(MTg[MTg$var==x,], file=paste0(x, ".csv")))
