This is my first time asking a question on Stack Overflow and also my first time coding in R, so please bear with me if my explanation is unclear :(
I have a data frame (data2000) that is 1092 x 6.
The columns are year, month, predictive horizon, company name, GDP Price Index (gdppi), and Consumer Price Index (cpi).
I want to create vectors of gdppi and cpi for each month.
My ultimate goal is to get the mean, median, interquartile range, and 90th-10th percentile range for each month, and I thought this would be the first step.
This is the code I have written so far:
library(tidyverse)
data2000 <- read.csv("")
for (i in 1:12) {
  i_gdppi <- c()
  i_cpi <- c()
}
for (i in 1:12) {
  if (data2000$month == i) {
    append(i_gdppi,data2000[,gdppi])
    append(i_cpi, data2000[,cpi])
  }
}
Unfortunately, I got this error message:
Error in if (data2000$month == 1) { : the condition has length > 1
I googled it, and it seems that an if statement cannot take a vector as its condition.
How can I solve this problem?
Thank you so much and have a nice day!
If you use the group_by() function then it takes care of sub-setting your data:
library(dplyr)
data2000 <- data.frame(month = rep(c(1:12), times = 2), gdppi = runif(24)*100) # Dummy data
data2000 |>
group_by(month) |>
summarise(mean = mean(gdppi), q10 = quantile(gdppi, probs = .10), q25 = quantile(gdppi, probs = .25)) # Add the other percentiles, as needed
This gives:
# A tibble: 12 x 4
month mean q10 q25
<int> <dbl> <dbl> <dbl>
1 1 12.5 3.44 6.83
2 2 34.7 7.15 17.5
3 3 37.8 22.1 28.0
4 4 30.3 19.0 23.2
5 5 65.7 62.2 63.5
6 6 60.7 38.7 47.0
7 7 43.0 38.2 40.0
8 8 77.9 60.7 67.1
9 9 56.3 44.0 48.6
10 10 53.1 19.6 32.2
11 11 63.8 40.6 49.3
12 12 59.0 49.2 52.9
If you have both years and months, use group_by(year, month) instead.
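The question also asks for the median, interquartile range, and 90th-10th percentile range. A minimal sketch of how those could be added, assuming dplyr 1.0 or later and assuming the columns are named year, month, gdppi, and cpi (the dummy data below is made up for illustration):

```r
library(dplyr)

set.seed(1)
# Dummy data standing in for the real data2000: 2 years x 12 months x 5 rows
data2000 <- data.frame(
  year  = rep(2000:2001, each = 12 * 5),
  month = rep(rep(1:12, each = 5), times = 2),
  gdppi = runif(120) * 100,
  cpi   = runif(120) * 100
)

monthly_stats <- data2000 %>%
  group_by(year, month) %>%
  summarise(
    # Apply the same four statistics to both columns;
    # result columns are named like gdppi_mean, cpi_iqr, ...
    across(
      c(gdppi, cpi),
      list(
        mean    = ~ mean(.x, na.rm = TRUE),
        median  = ~ median(.x, na.rm = TRUE),
        iqr     = ~ IQR(.x, na.rm = TRUE),
        # 90th percentile minus 10th percentile
        p90_p10 = ~ unname(diff(quantile(.x, probs = c(0.10, 0.90), na.rm = TRUE)))
      )
    ),
    .groups = "drop"
  )

monthly_stats
```

Because group_by() does the sub-setting, there is no need for an explicit loop or an if statement, which is what triggered the "condition has length > 1" error.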
I'm trying to create a function that summarizes several vectors, and the prompt is:
Write a function data_summary which takes three inputs:
`dataset`: A data frame
`vars`: A character vector whose elements are names of columns from dataset which the user wants summaries for
`group.name`: A length one character vector which gives the name of the column from dataset which contains the factor which will be used as a grouping variable
`var.names`: A character vector of the same length as vars which gives the names that the user would like used as the entries under “Variable” in the resulting output. This should be set equal to vars by default, so the default behavior is to use the column names from dataset.
The output of the function should be a data frame with the following structure:
Column names of the data frame will be:
`Variable`
`Missing`
The `first` level of the factor group.name
The `second` level of the factor group.name
…
The `kth` level of the factor group.name
`p-value`
I've set up the code already,
data_summary <- function(dataset,vars,group.name,var.names) {
}
but I'm unsure how to proceed because I do not understand what this is trying to accomplish and what the output should look like. There is an example that shows
#data_summary<-function(dataset, vars,group.name, var.name){}
#example
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass")
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass", c("Survival rate", "% Female", "Age", "# siblings/spouses aboard", "# children/parents aboard", "Fare ($)", "Cabin"))
But it really did not help me outside of inputting the arguments for the function.
You can use the dplyr package for this. I don't know which summary statistics you want, so I used all the statistics that base R's summary() returns.
My data:
> NewSKUMatrix
# A tibble: 268,918 x 4
LagerID FilialID CSBID Price
<int> <int> <int> <dbl>
1 233 2578 1005 38.3
2 333 2543 NA 61.0
3 334 2543 NA 15.0
4 335 2543 NA 11.0
5 337 2301 NA 71.0
6 338 2031 NA 37.0
7 338 2044 NA 35.0
8 338 2054 NA 36.0
9 338 2060 NA 37.0
10 338 2063 NA 36.0
# ... with 268,908 more rows
Function:
data_summary <- function(data,
variables,
values,
names = NULL) {
if (is.null(x = names)) {
names <- variables
}
data %>%
group_by_at(.vars = variables) %>%
summarise_at(
.vars = values,
.funs = list(
Min. = min,
`1st Qu.` = ~ quantile(x = ., probs = 0.25),
Median = median,
Mean = mean,
`3rd Qu.` = ~ quantile(x = ., probs = 0.75),
Max. = max
)
) %>%
rename_at(.vars = variables,
.funs = ~ names)
}
Output:
data_summary(NewSKUMatrix,
c('LagerID'),
c('Price'),
c('SKU'))
# A tibble: 32,454 x 7
SKU Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 39.0 39.0 39.0 39.0 39.0 39.0
2 18 120. 120. 120. 121. 120. 140.
3 21 289. 289. 289. 289. 289. 289.
4 24 37.0 37.0 37.0 45.2 45.2 70.0
5 25 14.0 14.0 14.0 14.0 14.0 14.0
6 55 30.9 30.9 30.9 30.9 30.9 30.9
7 117 26.9 26.9 26.9 26.9 26.9 26.9
8 118 24.8 24.9 24.9 25.1 25.1 25.7
9 119 24.8 24.8 24.9 25.1 25.3 25.7
10 158 104. 108. 108. 107. 108. 108.
# ... with 32,444 more rows
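The scoped verbs group_by_at(), summarise_at(), and rename_at() used above still work but have been superseded in dplyr by across(). A sketch of the same idea with across(), assuming dplyr 1.0 or later; the function name data_summary_across, the argument new_names, and the toy data are made up for illustration:

```r
library(dplyr)

data_summary_across <- function(data, variables, values, new_names = variables) {
  data %>%
    # Group by the columns named in the character vector `variables`
    group_by(across(all_of(variables))) %>%
    summarise(
      across(
        all_of(values),
        list(
          Min    = min,
          Q1     = ~ unname(quantile(.x, probs = 0.25)),
          Median = median,
          Mean   = mean,
          Q3     = ~ unname(quantile(.x, probs = 0.75)),
          Max    = max
        ),
        # One output column per value column and statistic, e.g. Price_Mean
        .names = "{.col}_{.fn}"
      ),
      .groups = "drop"
    ) %>%
    # Rename the grouping columns to the user-supplied labels
    rename_with(~ new_names, .cols = all_of(variables))
}

# Toy data (hypothetical values)
toy <- data.frame(LagerID = c(1, 1, 2, 2), Price = c(10, 20, 30, 50))
out <- data_summary_across(toy, "LagerID", "Price", "SKU")
out
```

Prefixing each statistic with the column name keeps the output unambiguous when more than one value column is passed.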
I have a df tracking movement of points each hour. I want to find the total distance traveled by that group/trial by adding the distance between the hourly coordinates, but I'm confusing myself with apply functions.
I want to say: in each group/trial, sum [distance(hour1-hour2), distance(hour2-hour3), distance(hour3-hour4), ...] up to the current hour, so that each row has a cumulative distance traveled value.
I've created a fake df below.
paths <- data.frame(matrix(nrow=80,ncol=5))
colnames(paths) <- c("trt","trial","hour","X","Y")
paths$trt <- rep(c("A","B","C","D"),each=20)
paths$trial <- rep(c(rep(1,times=10),rep(2,times=10)),times=4)
paths$hour <- rep(1:10,times=8)
paths[,4:5] <- runif(160,0,50)
#this shows the paths that I want to measure.
ggplot(data=paths,aes(x=X,y=Y,group=interaction(trt,trial),color=trt))+
geom_path()
I probably want to add a column paths$dist.traveled to keep track each hour.
I think I could use apply or maybe even aggregate but I've been using PointDistance to find the distances, so I'm a bit confused. I also would rather not do a loop inside a loop, because the real dataset is large.
Here's an answer that uses {dplyr}:
library(dplyr)
paths %>%
arrange(trt, trial, hour) %>%
group_by(trt, trial) %>%
mutate(dist_travelled = sqrt((X - lag(X))^2 + (Y - lag(Y))^2)) %>%
mutate(total_dist = sum(dist_travelled, na.rm = TRUE)) %>%
ungroup()
If you wanted the total distance grouped only by trt and not by trial, you would just remove trial from the call to group_by().
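The question asks for a running total on each row rather than one total per group. A minimal sketch of that variant, using cumsum() inside the grouped mutate(); the data frame paths_demo and its coordinates are made up so the result is easy to check by hand:

```r
library(dplyr)

# Small made-up example: two trials with easy-to-check coordinates
paths_demo <- data.frame(
  trt   = rep("A", 6),
  trial = rep(1:2, each = 3),
  hour  = rep(1:3, times = 2),
  X     = c(0, 3, 3, 0, 6, 6),
  Y     = c(0, 4, 8, 0, 8, 8)
)

paths_cum <- paths_demo %>%
  arrange(trt, trial, hour) %>%
  group_by(trt, trial) %>%
  mutate(
    # Straight-line distance from the previous hour; 0 for the first hour,
    # which has no previous point
    step_dist = coalesce(sqrt((X - lag(X))^2 + (Y - lag(Y))^2), 0),
    # Running total within each trt/trial group
    dist_travelled = cumsum(step_dist)
  ) %>%
  ungroup()

paths_cum
```

Because lag() and cumsum() are evaluated within each group, the running total restarts at zero for every trt/trial combination.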
Is this what you are trying to achieve?
# Store the result so the new column is available in the later steps
paths <- paths %>%
mutate(dist.traveled = sqrt((X-lag(X))^2 + (Y-lag(Y))^2))
paths
trt trial hour X Y dist.traveled
<chr> <dbl> <int> <dbl> <dbl> <dbl>
1 A 1 1 11.2 26.9 NA
2 A 1 2 20.1 1.48 27.0
3 A 1 3 30.4 0.601 10.4
4 A 1 4 31.1 26.6 26.0
5 A 1 5 38.1 30.4 7.88
6 A 1 6 27.9 47.9 20.2
7 A 1 7 16.5 35.3 16.9
8 A 1 8 0.328 13.0 27.6
9 A 1 9 14.0 41.7 31.8
10 A 1 10 29.7 7.27 37.8
# ... with 70 more rows
# The first hour of each trial has no previous point in the same path, so reset it to NA
paths$dist.traveled[which(paths$hour==1)] <- NA
paths %>%
group_by(trt)%>%
summarise(total_distance = sum(dist.traveled, na.rm = TRUE))
trt total_distance
<chr> <dbl>
1 A 492.
2 B 508.
3 C 479.
4 D 462.
I am adding the new column to calculate the distances for each group, and then summing them up.
I'm trying to calculate numeric ranges based on the moving average of a column of data. I have found a way to use caTools::runmean to produce a column of moving averages, and I know how to work with this in Excel to produce the columns I want, but I would love to know a way to do all of this in one R script.
Here is my simplified reproducible example for R.
library(tidyverse)
library(caTools)
data <- as_tibble(data.frame(
Index = as.integer(c(18,19,21,22,23,25,26,29)),
mydbl = c(8.905,13.31,15.739,17.544,19.054,20.393,21.623,22.764)))
data <- data %>%
mutate(avg = runmean(mydbl,
k = 2,
alg = "exact",
endrule = "NA"))
This tibble will look like this:
> data
# A tibble: 8 x 3
Index mydbl avg
<int> <dbl> <dbl>
1 18 8.90 NA
2 19 13.3 11.1
3 21 15.7 14.5
4 22 17.5 16.6
5 23 19.1 18.3
6 25 20.4 19.7
7 26 21.6 21.0
8 29 22.8 22.2
To produce the remaining data I want, I exported this to Excel with write_csv(data,...) and the final table is shown below. The first value in dbl_i is the formula =B2-ABS(C3-B2) (the difference between mydbl and the next avg subtracted from mydbl to create an equidistant lower limit). The last value in dbl_f is the formula =B9+ABS(C9-B9) (the difference between mydbl and the avg added to mydbl to create an equidistant upper limit). The other values in the two columns are just direct references to the avg column.
Index mydbl avg dbl_i dbl_f
18 8.905 NA 6.7025 11.1075
19 13.31 11.1075 11.1075 14.5245
21 15.739 14.5245 14.5245 16.6415
22 17.544 16.6415 16.6415 18.299
23 19.054 18.299 18.299 19.7235
25 20.393 19.7235 19.7235 21.008
26 21.623 21.008 21.008 22.1935
29 22.764 22.1935 22.1935 23.3345
Yes, dbl_i is just the avg column but with the first value being =B2-ABS(C3-B2). The dbl_f column is the same as the avg column except that it is moved up one row, and the final value is =B9+ABS(C9-B9). Ultimately, it seems the real problem lies in finding a way to reproduce the Excel calculations D2=B2-ABS(C3-B2) and E9=B9+ABS(C9-B9).
Does anyone know how they would reproduce these calculations in R? I was looking for a way to create a formula in R that could be the equivalent of B2-ABS(C3-B2), but could not find one, unless I create a matrix instead. Do I have to create a matrix?
Thanks for your time.
data %>%
mutate(
avg = zoo::rollmean(mydbl, 2, align="right", fill=NA),
dbl_i = if_else(row_number() == 1L, mydbl - abs(lead(avg) - mydbl), avg),
dbl_f = if_else(row_number() == n(), mydbl + abs(avg - mydbl), lead(avg))
)
# # A tibble: 8 x 5
# Index mydbl avg dbl_i dbl_f
# <int> <dbl> <dbl> <dbl> <dbl>
# 1 18 8.90 NA 6.70 11.1
# 2 19 13.3 11.1 11.1 14.5
# 3 21 15.7 14.5 14.5 16.6
# 4 22 17.5 16.6 16.6 18.3
# 5 23 19.1 18.3 18.3 19.7
# 6 25 20.4 19.7 19.7 21.0
# 7 26 21.6 21.0 21.0 22.2
# 8 29 22.8 22.2 22.2 23.3
Honestly it's not the most elegant, but it gets the job done.
(BTW: I'm using zoo::rollmean because I don't have caTools installed, but it's the same effect I believe.)
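No matrix is needed for this: the Excel cell references map onto plain vector indexing. A base R sketch of the same calculation, assuming a trailing two-point moving average as in the question (the variable names below mirror the question's columns):

```r
mydbl <- c(8.905, 13.31, 15.739, 17.544, 19.054, 20.393, 21.623, 22.764)
n <- length(mydbl)

# Trailing 2-point moving average: mean of each value and the one before it;
# the first element has no predecessor, so it is NA
avg <- c(NA, (mydbl[-1] + mydbl[-n]) / 2)

# dbl_i: same as avg, except the first value, which is B2 - ABS(C3 - B2)
dbl_i <- avg
dbl_i[1] <- mydbl[1] - abs(avg[2] - mydbl[1])

# dbl_f: avg shifted up by one row, with the last value B9 + ABS(C9 - B9)
dbl_f <- c(avg[-1], mydbl[n] + abs(avg[n] - mydbl[n]))

result <- data.frame(mydbl, avg, dbl_i, dbl_f)
result
```

Here mydbl[1] plays the role of B2, avg[2] of C3, and mydbl[n] and avg[n] of B9 and C9.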
How can I fit a function to different groups in a data set (Soil) using R? The first column is the group, i.e. Plot, and the second column is the observed variable, i.e. Depth.
Plot Depth
1 12.5
1 14.5
1 15.8
1 16.1
1 18.9
1 21.2
1 23.4
1 25.7
2 13.1
2 15.0
2 15.8
2 16.3
2 17.4
2 18.6
2 22.6
2 24.1
2 25.6
3 11.5
3 12.2
3 13.9
3 14.7
3 18.9
3 20.5
3 21.6
3 22.6
3 24.1
3 25.8
4 10.2
4 21.5
4 15.1
4 12.3
4 10.0
4 13.5
4 16.5
4 19.2
4 17.6
4 14.1
4 19.7
I used the 'for' statement but only saw output for Plot 1.
This was how I applied the 'for' statement:
After importing my data in R, I saved it as: SNq,
for (i in 1:SNq$Plot[i]) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
fit1
}
I think this should work. Just make one change in your code: loop over unique(SNq$Plot) instead of 1:SNq$Plot[i].
Why would it work?
Because unique() returns the distinct values in the Plot column (1, 2, 3, 4), which are exactly the groups. With each unique value i, we can subset the data using SNq$Depth[SNq$Plot==i] and get the depth values for that group.
for (i in unique(SNq$Plot)) { # <- here
dp <- SNq$Depth[SNq$Plot==i]
fit1 = fitdist(dp, "gamma") ## this is the function I'm fitting. The function is not the issue. My challenge is the 'for' statement.
plot(fit1)
}
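The loop above overwrites fit1 on every pass, so only the last fit survives. One way to keep all of them is to split the data by group and store the results in a named list. The sketch below shows that pattern in base R; gamma_moments() is a hypothetical stand-in (method-of-moments gamma estimates) used only so the example runs without extra packages, and the toy data frame reuses the first four Depth values of Plots 1 and 2 from the question:

```r
# Toy data with the same columns as SNq
SNq <- data.frame(
  Plot  = rep(1:2, each = 4),
  Depth = c(12.5, 14.5, 15.8, 16.1, 13.1, 15.0, 15.8, 16.3)
)

# Hypothetical stand-in for fitdist(): method-of-moments gamma estimates
# (shape = mean^2 / variance, rate = mean / variance)
gamma_moments <- function(x) {
  c(shape = mean(x)^2 / var(x), rate = mean(x) / var(x))
}

# split() cuts Depth into one vector per Plot; lapply() fits each vector
fits <- lapply(split(SNq$Depth, SNq$Plot), gamma_moments)

fits[["1"]]  # estimates for Plot 1
```

With fitdistrplus loaded, the stand-in could be swapped for function(d) fitdist(d, "gamma"), giving one fitted object per Plot.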
A tidyverse suggestion:
library("tidyverse")
library("fitdistrplus")
fits <- SNq %>%
group_by(Plot) %>%
nest() %>%
mutate(fits = map(data, ~ fitdist(data = .$Depth, distr = "gamma")),
summaries = map(fits, summary)) # summarise the fits column created on the line above
You could continue with print(fits$fits) and print(fits$summaries) to access the different fits and their summary. Alternatively you can use a syntax like fits$fits[[1]] and fits$summaries[[1]] to access them.
Try:
for (i in 1:nrow(SNq)) {
dp <- SNq$Depth[SNq$Plot==SNq$Plot[i]]
fit1 = fitdist(dp, "gamma")
print(fit1) # values are not auto-printed inside a for loop, so print explicitly
}
I'm trying to reshape my data from a long format into a wide format based on multiple groupings, without success. with this data:
id <- 1:20
month <- rep(4:7, 50)
name <- rep(c("sam", "mike", "tim", "jill", "max"), 40)
cost <- sample(1:100, 200, replace=TRUE)
df <- data.frame(id, month, name, cost)
df.mo.mean <- aggregate(df$cost ~ df$name + df$month, FUN="mean")
df.mo.sd <- aggregate(df$cost ~ df$name + df$month, FUN="sd")
df.mo <- data.frame(df.mo.mean, df.mo.sd)
df.mo <- df.mo[,-c(4,5)]
df.mo[3:4] <- round(df.mo[3:4],2)
head(df)
id month name cost
1 1 4 sam 29
2 2 5 mike 93
3 3 6 tim 27
4 4 7 jill 67
5 5 4 max 28
6 6 5 sam 69
I'm trying to get my data to look like something below, and try to generalize it for an unknown number of names (but <15 max)
month name1.cost.mean name1.cost.sd name2.cost.mean name2.cost.sd
1 45 4 40 6
2 ...
I've tried reshape and do.call with rbind without success. The only other way I can think of doing it is with a loop, which means I'm doing something wrong. I don't have any experience with plyr and would prefer to solve this problem with base packages (for learning purposes), but if that's not possible, any other suggestions would be very helpful.
set.seed(1)
library(plyr)
kk<-ddply(df,.(month,name),summarize,mean=mean(cost),sd=sd(cost))
reshape(kk,timevar="name",idvar="month",direction="wide")
month mean.jill sd.jill mean.max sd.max mean.mike sd.mike mean.sam sd.sam mean.tim sd.tim
1 4 55.3 34.62834 63.3 23.35261 57.6 22.91627 63.4 28.89906 43.3 25.42112
6 5 49.3 25.00689 51.1 27.85059 48.4 23.16223 43.0 24.33562 47.6 32.13928
11 6 60.4 23.61826 52.1 29.74503 38.6 34.39703 53.0 23.28567 52.4 20.88700
16 7 50.0 30.76073 62.7 23.98634 51.7 32.10763 52.8 32.27589 49.5 23.00845
> means <- with( df, tapply(cost, list(month, name), FUN=mean) )
> sds <- with( df, tapply(cost, list(month, name), FUN=sd) )
> colnames(means) <- paste0(colnames(means), ".mean")
> colnames(sds) <- paste0(colnames(sds), ".sd")
> comb.df <- as.data.frame( cbind(means, sds) )
> comb.df <- comb.df[order(names(comb.df))]
> comb.df
jill.mean jill.mean.sd max.mean max.mean.sd mike.mean mike.mean.sd
4 62.1 22.29823 39.7 25.53016 39.6 30.11164
5 40.7 30.72838 44.4 29.12502 54.2 23.91095
6 47.3 31.54556 46.9 32.30910 65.3 30.05569
7 55.5 33.16038 45.9 28.13637 59.7 31.79815
sam.mean sam.mean.sd tim.mean tim.mean.sd
4 40.9 23.54877 58.5 21.69613
5 51.5 30.76163 34.2 32.16900
6 69.1 18.26016 55.2 32.99764
7 46.9 29.90150 55.8 27.17352
I'm not sure what you are asking for, but maybe something like this could be useful
> set.seed(1)
> df <- data.frame(id=1:20, month=rep(4:7, 50),
+ name=rep(c("sam", "mike", "tim", "jill", "max"), 40),
+ cost= sample(1:100, 200, replace=TRUE))
>
> DF.mean <- aggregate(cost ~ name + month, FUN=mean, data=df) ## mean
> DF.sd <- aggregate(cost ~ name + month, FUN=sd, data=df) ## sd
>
> x1 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.mean)) # reshaping mean
> colnames(x1) <- paste0(colnames(x1), ".mean")
> x2 <- as.data.frame.matrix(xtabs(cost~month+name, data=DF.sd)) # reshaping sd
> colnames(x2) <- paste0(colnames(x2), ".sd")
>
> cbind(x1, x2)
jill.mean max.mean mike.mean sam.mean tim.mean jill.sd max.sd mike.sd sam.sd tim.sd
4 55.3 63.3 57.6 63.4 43.3 34.62834 23.35261 22.91627 28.89906 25.42112
5 49.3 51.1 48.4 43.0 47.6 25.00689 27.85059 23.16223 24.33562 32.13928
6 60.4 52.1 38.6 53.0 52.4 23.61826 29.74503 34.39703 23.28567 20.88700
7 50.0 62.7 51.7 52.8 49.5 30.76073 23.98634 32.10763 32.27589 23.00845
Also, note that @Metrics's approach can be done using base R functions without any extra packages:
> kk <- aggregate(cost ~ name + month, FUN=function(x) c(mean=mean(x), sd=sd(x)), data=df)
> reshape(kk,timevar="name",idvar="month",direction="wide")
month cost.jill.mean cost.jill.sd cost.max.mean cost.max.sd cost.mike.mean cost.mike.sd cost.sam.mean cost.sam.sd cost.tim.mean cost.tim.sd
1 4 55.30000 34.62834 63.30000 23.35261 57.60000 22.91627 63.40000 28.89906 43.30000 25.42112
6 5 49.30000 25.00689 51.10000 27.85059 48.40000 23.16223 43.00000 24.33562 47.60000 32.13928
11 6 60.40000 23.61826 52.10000 29.74503 38.60000 34.39703 53.00000 23.28567 52.40000 20.88700
16 7 50.00000 30.76073 62.70000 23.98634 51.70000 32.10763 52.80000 32.27589 49.50000 23.00845
You can call dcast twice, once for the mean and once for the sd, and then merge the results:
library(reshape2)
> dcast(df, month ~ name, mean, value.var="cost")
month jill max mike sam tim
1 4 39.5 54.6 45.6 48.4 57.4
2 5 45.1 61.7 45.4 54.5 50.8
3 6 41.9 45.7 56.4 43.1 52.1
4 7 51.6 38.6 43.6 65.1 51.5
> dcast(df, month ~ name, sd, value.var="cost")
month jill max mike sam tim
1 4 29.31154 25.25954 28.96051 31.32695 29.82989
2 5 31.02848 27.96049 34.32589 30.08599 23.95273
3 6 32.09517 32.50316 37.16988 27.03681 30.42094
4 7 19.56300 31.50026 28.65969 36.53750 26.73429
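The merge step itself is not shown above. A sketch of one way to do it, assuming reshape2 is installed; the suffixes .cost.mean and .cost.sd are chosen here to resemble the layout requested in the question, and the data is regenerated with an arbitrary seed, so the numbers will differ from the output above:

```r
library(reshape2)

set.seed(1)
df <- data.frame(
  id    = 1:20,
  month = rep(4:7, 50),
  name  = rep(c("sam", "mike", "tim", "jill", "max"), 40),
  cost  = sample(1:100, 200, replace = TRUE)
)

# One wide table per statistic: rows are months, columns are names
wide_mean <- dcast(df, month ~ name, mean, value.var = "cost")
wide_sd   <- dcast(df, month ~ name, sd,   value.var = "cost")

# Tag every name column with the statistic it holds
names(wide_mean)[-1] <- paste0(names(wide_mean)[-1], ".cost.mean")
names(wide_sd)[-1]   <- paste0(names(wide_sd)[-1], ".cost.sd")

# Join on month, then sort the columns so each name's mean and sd sit together
wide <- merge(wide_mean, wide_sd, by = "month")
wide <- wide[, c("month", sort(names(wide)[-1]))]
wide
```

This works for any number of names because the column list is built from whatever dcast() returns.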