I have the following example:
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012
1 1 2 5 6 0.4 0.6 0.7
2 2 5 1 1 0.5 0.5 0.5
3 3 7 7 5 0.7 0.8 0.9
I want to create a new column for each date (e.g. "01.2012"):
res_date = fix_date * fox_date
As I have many dates / pairs of dates, I guess this needs to be done by looping over the column names. The table below is what I want to get:
id fix_01.2012 fix_02.2012 fix_03.2012 fox_01.2012 fox_02.2012 fox_03.2012 res_01.2012 res_02.2012 res_03.2012
1 1 2 5 6 0.4 0.6 0.7 0.8 3.0 4.2
2 2 5 1 1 0.5 0.5 0.5 2.5 0.5 0.5
3 3 7 7 5 0.7 0.8 0.9 4.9 5.6 4.5
Can anyone help? Thanks very much in advance!
Here is an idea that uses split.default to split the data frame into groups of columns that share the same date suffix. We then loop over that list and multiply the columns. Here we use Reduce (rather than i[1]*i[2]) to multiply, so that more than two columns per date are also handled.
do.call(cbind,
        lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))),
               function(i) Reduce(`*`, i)))
# 01.2012 02.2012 03.2012
#[1,] 0.8 3.0 4.2
#[2,] 2.5 0.5 0.5
#[3,] 4.9 5.6 4.5
Bind them back to the original with cbind.data.frame()
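For example (a sketch; the res object name and the "res_" prefix are just choices for labelling the new columns):
res <- do.call(cbind,
               lapply(split.default(df[-1], gsub('.*_', '', names(df[-1]))),
                      function(i) Reduce(`*`, i)))
colnames(res) <- paste0("res_", colnames(res))  # e.g. "res_01.2012"
cbind.data.frame(df, res)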
If you want a tidyverse approach, it will take using a bit of tidy evaluation to get what you want.
library(tidyverse)
df <- data.frame(
id = c(1,2,3),
fix_01.2012 = c(2,5,7),
fix_02.2012 = c(5,1,7),
fix_03.2012 = c(6,1,5),
fox_01.2012 = c(0.4, 0.5, 0.7),
fox_02.2012 = c(0.6, 0.5, 0.8),
fox_03.2012 = c(0.7, 0.5, 0.9)
)
# colnames with "fix"
fix <- names(df)[grepl("fix",names(df))]
# colnames with "fox"
fox <- names(df)[grepl("fox",names(df))]
# Iterate over the two vectors of names and column bind the results (map2_dfc).
# Since these are strings, we need to have them evaluated as symbols
# Creating the column name just requires the string to be evaluated.
map2_dfc(fix, fox,
         ~transmute(df, !!paste0("res", str_extract(.x, "_(0\\d)")) := !!sym(.x) * !!sym(.y)))
#> res_01 res_02 res_03
#> 1 0.8 3.0 4.2
#> 2 2.5 0.5 0.5
#> 3 4.9 5.6 4.5
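To attach these new columns to the original data, one option (a sketch) is to bind the result back with bind_cols():
res <- map2_dfc(fix, fox,
                ~transmute(df, !!paste0("res", str_extract(.x, "_(0\\d)")) := !!sym(.x) * !!sym(.y)))
bind_cols(df, res)  # original columns plus res_01, res_02, res_03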
Much more verbose than the other answers, but to my eye easier to read/edit/adapt, is a heavy gather-spread approach (the way I'd reason about the problem if I were solving it step by step):
library(tidyr)
library(dplyr)
df %>%
  gather(-id, key = colname, value = value) %>%
  separate(colname, c('fixfox', 'date'), sep = '_') %>%
  spread(key = fixfox, value = value) %>%
  mutate(res = fix * fox) %>%
  gather(-id, -date, key = colname, value = value) %>%
  unite(new_colname, colname, date, sep = '_') %>%
  spread(key = new_colname, value = value)
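If you are on tidyr 1.0 or later, roughly the same pipeline can be written with pivot_longer()/pivot_wider() (a sketch that follows the column-naming pattern of the example data):
df %>%
  pivot_longer(-id, names_to = c("fixfox", "date"), names_sep = "_") %>%
  pivot_wider(names_from = fixfox, values_from = value) %>%
  mutate(res = fix * fox) %>%
  pivot_longer(c(fix, fox, res), names_to = "colname") %>%
  pivot_wider(names_from = c(colname, date), values_from = value, names_sep = "_")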
I have two data frames, both with 220 observations and 80 variables. The first data frame, df1, contains only the values 1, 2, and 3. The second data frame, df2, contains other numeric values with decimals, such as 0.12, -0.03, 0.01, etc. (meant to represent market-cap-weighted stock returns for a given month).
For example
df1 = data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
a b c d
1 2 3 1 3
2 2 2 1 3
3 1 3 2 1
df2 = data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6), c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))
a b c d
1 0.1 0.3 0.2 0.1
2 0.1 0.4 0.3 0.5
3 0.2 0.6 0.5 0.6
Note that df1 and df2 have the same column names in the same order. How can I create a third data frame, df3 (220 observations and 3 variables), by summing the rows of df2 grouped by the indicator values in df1? That is, I want to sum across each row of df2, separately for each value in df1, to create df3:
df3 =
X1 X2 X3
1 0.2 0.1 0.4
2 0.3 0.5 0.5
3 0.8 0.5 0.6
Let's first look at (X1, 1). Row 1 of df1 contains only one entry with value 1, in column c, so the sum is just the corresponding value of df2, 0.2. Now look at (X1, 3), the last value of column X1. Row 3 of df1 has two entries with value 1; the corresponding values in df2 are 0.2 (column a) and 0.6 (column d), which sum to 0.8.
Here is how each entry of df3 is calculated:
calculation = data.frame("1" = c("0+0+0.2+0", "0+0+0.3+0", "0.2+0+0+0.6"), "2" = c("0.1+0+0+0", "0.1+0.4+0+0", "0+0+0.5+0"), "3" = c("0+0.3+0+0.1", "0+0+0+0.5", "0+0.6+0+0"))
X1 X2 X3
1 0 + 0 + 0.2 + 0 0.1 + 0 + 0 + 0 0 + 0.3 + 0 + 0.1
2 0 + 0 + 0.3 + 0 0.1 + 0.4 + 0 + 0 0 + 0 + 0 + 0.5
3 0.2 + 0 + 0 + 0.6 0 + 0 + 0.5 + 0 0 + 0.6 + 0 + 0
A more practical explanation, based on stocks: assume df1 describes buy, hold, and sell recommendations and df2 describes market-weighted stock returns, where each variable/column is a different stock. df3 then holds three different portfolios: if a stock is rated "buy", I want its return in the "buy" portfolio; if it is rated "hold", in the "hold" portfolio, and so on. This is easily done in Excel with nested IF, AND, and OR functions, but I do not know how to do it in R.
We could use tapply after converting the datasets to matrices, using the row index of the data and the values of df1 as the grouping variables:
tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
# 1 2 3
#[1,] 0.2 0.1 0.4
#[2,] 0.3 0.5 0.5
#[3,] 0.8 0.5 0.6
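If you want the result as a data frame named like df3 in the question, a small follow-up step converts the tapply output (a sketch; the "X" prefix is just a labelling choice):
res <- tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
colnames(res) <- paste0("X", colnames(res))  # label the indicator groups 1/2/3
df3 <- as.data.frame(res)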
Or, with the tidyverse, bind the datasets after gathering both into 'long' format, and then do a grouped sum:
library(tidyverse)
gather(df1) %>%
  bind_cols(gather(df2)) %>%
  group_by(key) %>%
  group_by(rn = row_number(), value) %>%
  summarise(value1 = sum(value1)) %>%
  spread(value, value1) %>%
  ungroup %>%
  select(-rn)
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 0.2 0.1 0.4
#2 0.3 0.5 0.5
#3 0.8 0.5 0.6
Here is another base R method that uses rowsum to perform group sums and loops through the rows with mapply.
t(mapply(rowsum, as.data.frame(t(df2)), as.data.frame(t(df1))))
[,1] [,2] [,3]
V1 0.2 0.1 0.4
V2 0.3 0.5 0.5
V3 0.8 0.5 0.6
Note that I am using R 3.4.4. The as.data.frame calls are needed so that mapply loops over the columns of the transposed data (that is, over the rows of df1 and df2) rather than over individual matrix elements, since t() returns a matrix even when given a data frame.
I am trying to simulate the forecast (rather than just taking the mean) using this code:
set.seed(100)
df <- data.frame("lower" = c(-1, 0.5, 0), "upper" = c(0, 2, 4), "mean" = c(-0.5, 1.2, 2.5))
df$simulation <- rnorm(1, df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 -0.6281103
#3 0.0 4 2.5 -0.6281103
I get the same value in every row of the simulation column. However, if I set the n parameter to the number of rows, I get a result that looks better:
df$simulation <- rnorm(nrow(df), df$mean, (df$upper - df$lower) / 2 / 1.96)
df
# lower upper mean simulation
#1 -1.0 0 -0.5 -0.6281103
#2 0.5 2 1.2 1.2503308
#3 0.0 4 2.5 2.4194724
Is the latter solution the right way of doing this?
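For what it's worth, rnorm's first argument n is the number of draws, while mean and sd are recycled element-wise, so n = nrow(df) gives each row its own draw. A quick check (a sketch; n_sims is an arbitrary name) repeats the draws many times and compares the row means back to df$mean:
n_sims <- 1000
# each column of sims is one full simulation of the three rows;
# mean and sd recycle with period nrow(df), so row i always uses its own mean/sd
sims <- matrix(
  rnorm(n_sims * nrow(df), mean = df$mean, sd = (df$upper - df$lower) / 2 / 1.96),
  nrow = nrow(df)
)
rowMeans(sims)  # should be close to df$mean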
I'm new to R and am having a problem plotting the min and max values of a matrix.
The matrix is something like this:
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3
Now I would like to use the first column (X) as the x-axis, pick out the min and max values of each row (e.g. X=1, Ymin=0.2, Ymax=0.6 for the first row), and plot them on the y-axis.
Could someone help me figure it out?
Here is one possibility, considering you want a scatterplot.
#reading your data
table = read.table(header=TRUE, text="
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3", sep= " ")
#using a for loop to extract only the data needed for the plot (X, min_Y, max_Y)
df = data.frame(X = NA, min_Y = NA, max_Y = NA)
for (i in 1:nrow(table)) {            #loop over the rows of the data (not length(df))
  X = table[i, 1]                     #X value from table
  min_Y = min(table[i, 2:6])          #minimum value across table columns 2 to 6
  max_Y = max(table[i, 2:6])          #maximum value across table columns 2 to 6
  df = rbind(df, c(X, min_Y, max_Y))  #append the row X, min_Y, max_Y
}
df = df[-1, ]                         #drop the initial NA row
df #df results
X min_Y max_Y
2 1 0.2 0.6
3 2 0.1 0.7
4 3 0.3 0.6
#produce scatterplot with R package ggplot2
library(ggplot2)
ggplot(df) +
  geom_point(aes(x = X, y = min_Y), colour = "red") +
  geom_point(aes(x = X, y = max_Y), colour = "blue") +
  ylab("Y") +
  theme_bw()
A solution with rbind and two apply calls (one for the min, one for the max; surely not the best, though):
mat <- as.matrix(read.table(header = T, text = "X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3"))
mat2 <- t(rbind(X = mat[ ,1], Ymin = apply(mat[ ,-1], 1, min), Ymax = apply(mat[ ,-1], 1, max)))
matplot(mat2[ ,1], mat2[ ,-1], pch = 20, cex = 1.5)
For example, using pmin and pmax:
mn = Reduce(pmin,as.list(dat[,-1]))
mx = Reduce(pmax,as.list(dat[,-1]))
library(lattice)
xyplot(mn+mx~x,data.frame(x= dat[,1],mn=mn,mx=mx),
type='l',auto.key=T,
ylab=list(label="max and min"))
Where dat is:
dat <-
read.table(text='
X Y1 Y2 Y3 Y4 Y5
1 0.5 0.6 0.3 0.3 0.2
2 0.3 0.4 0.1 0.7 0.4
3 0.4 0.3 0.5 0.6 0.3',header=TRUE)
So here is (another...) way to get the row-wise min and max (using m as your matrix).
z <- t(apply(m, 1, function(x) {
  c(x[1], min = min(x[2:length(x)]), max = max(x[2:length(x)]))
}))
z <- data.frame(z)
z
# X min max
# [1,] 1 0.2 0.6
# [2,] 2 0.1 0.7
# [3,] 3 0.3 0.6
From here, plotting is straightforward.
plot(z$X, z$max, ylim=c(min(z$min),max(z$max)),col="blue")
points(z$X, z$min, col="red")
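If you want the two point sets labelled, a legend can be added (a sketch; position and labels are just one choice):
legend("topleft", legend = c("max", "min"), col = c("blue", "red"), pch = 1)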
Suppose I have the following vector.
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
Is there a way to get a frequency table over custom (unequal) intervals, such as the following?
Frequency between 0 and 0.1
Frequency between 0.2 and 0.4
Frequency between 0.5 and 0.8
Frequency between 0.9 and 1
Thanks
There are a few extra, unnecessary groups in here, but you can ignore those or subset them out:
table(cut(test, breaks = c(0,0.1,0.2,0.4,0.5,0.8,0.9,1)))
I'm not aware of a dedicated function, but you could write your own:
test <- c(0.3,1.0,0.8,0.3,0.6,0.4,0.3,0.5,0.6,0.4,0.5,0.6,0.1,0.6,0.2,0.7,0.0,0.7,0.3,0.3,0.4,0.9,0.9,0.9,0.3,0.6,0.3,0.1)
mapply(function (start, end) { sum(test >= start & test <= end) },
c(0, 0.2, 0.5, 0.9), # starts
c(0.1, 0.4, 0.8, 1)) # ends
# [1] 3 11 10 4
The use of mapply is purely to vectorise over the starts and ends which you supply. Note test is hard-coded into this function and the endpoints are inclusive, so adjust as necessary, etc.
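To avoid hard-coding the data, the same idea can be wrapped in a small helper (a sketch; count_between is a made-up name):
count_between <- function(x, starts, ends) {
  mapply(function(start, end) sum(x >= start & x <= end), starts, ends)
}
count_between(test, c(0, 0.2, 0.5, 0.9), c(0.1, 0.4, 0.8, 1))  # same counts as above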
Something like this maybe:
labs <- c("0 and 0.1", "0.2 and 0.4", "0.5 and 0.8", "0.9 and 1")
table(cut(test, c(0, .2, .5, .9, 1.1), right = FALSE, labels = labs))
## 0 and 0.1 0.2 and 0.4 0.5 and 0.8 0.9 and 1
## 3 11 10 4
Assuming that you really want to bin these as tenths, and there are no missing intervals, findInterval is made for the task.
Here, 1.0 is in a group by itself:
table(findInterval(test, c(0,.2, .5, .9, 1)))
## 1 2 3 4 5
## 3 11 10 3 1
With this statement, 1.0 is in the last interval, with .9:
table(findInterval(test, c(0,.2, .5, .9, 1), rightmost.closed=T))
## 1 2 3 4
## 3 11 10 4
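If you want readable interval names on the findInterval result, you can map the group indices to labels (a sketch; the label strings are just one choice):
labs <- c("0-0.1", "0.2-0.4", "0.5-0.8", "0.9-1")
grp <- findInterval(test, c(0, .2, .5, .9, 1), rightmost.closed = TRUE)
table(factor(grp, levels = 1:4, labels = labs))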