This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 2 years ago.
I am new to R and have the following problem:
I have data in two columns: one column represents the earnings on a certain day and the other contains the weekday (encoded modulo 7, so for example Saturday is 0, Monday is 1, etc.). An example would be:
1. 12,000|6
2. 56,000|0
3. 25,000|1
4. 35,000|2
5. 18,000|3
6. 60,000|4
7. 90,000|5
8. 45,000|6
9. 70,000|0
Are there any R commands that I can use to find out how much is earned on all Mondays, on all Tuesdays, and so on?
The general sequence here is grouping and then summarizing. In your case, you want to group by weekday and then sum all the earnings for each group. Here is an implementation of that using dplyr.
library(dplyr)
sample_data <- tibble(
  earnings = sample(1:10000, 100, replace = TRUE),
  weekday = sample(0:6, 100, replace = TRUE)
)
sample_data %>%
group_by(weekday) %>%
summarise(total_earnings = sum(earnings))
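If you prefer base R, aggregate can produce the same per-weekday totals without any packages; a minimal sketch, run on the same sample_data:
# base R: sum earnings within each weekday group
aggregate(earnings ~ weekday, data = sample_data, FUN = sum)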
This question already has answers here:
Calculate the mean by group
(9 answers)
Mean per group in a data.frame [duplicate]
(8 answers)
How to calculate mean of all columns, by group?
(6 answers)
Closed 1 year ago.
I have some fish catch data. Each row contains a species name, a catch value (cpue), and some other unrelated identifying fields (year, location, depth, etc). This code will produce a dataset with the correct structure:
# a sample dataset
set.seed(1337)
fish <- rbind(
  data.frame(spp = "Flounder", cpue = rnorm(5, 5, 2)),
  data.frame(spp = "Bass", cpue = rnorm(5, 15, 1)),
  data.frame(spp = "Cod", cpue = rnorm(5, 2, 4))
)
I'm trying to create a normalized cpue column cpue_norm. To do this, I apply the following function to each cpue value:
cpue_norm = (cpue - cpue_mean)/cpue_std
Where cpue_mean and cpue_std are, respectively, the mean and standard deviation of cpue. The caveat is that I need to do this per species, i.e., when I calculate cpue_norm for a particular row, I need to calculate cpue_mean and cpue_std using the cpue values from that species only.
The trouble is that all of the species are in the same dataset. So for each row, I need to calculate the mean and standard deviation of cpue for that species and then use those values to calculate cpue_norm.
I've been able to make some headway with tapply:
calc_cpue_norm = function(l) {
  return((l - mean(l)) / sd(l))
}
tapply(fish$cpue, fish$spp, calc_cpue_norm)
but I end up with lists when I need to be adding these values to the dataframe rows instead.
Does anyone who knows R better than me have some wisdom to share?
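One common pattern here (a sketch, not from the original thread) is base R's ave(), which applies a function within each group and returns a vector aligned with the original rows, so it can be assigned straight back to the data frame:
# ave() applies calc_cpue_norm within each species, preserving row order
fish$cpue_norm <- ave(fish$cpue, fish$spp, FUN = calc_cpue_norm)
The dplyr equivalent groups by species and computes the normalization inside mutate:
library(dplyr)
fish %>%
group_by(spp) %>%
mutate(cpue_norm = (cpue - mean(cpue)) / sd(cpue))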
This question already has answers here:
Convert integer as "20160119" to different columns of "day" "year" "month"
(5 answers)
Closed 2 years ago.
I have a dataframe:
df <- data.frame(year = c(200501:200512))
and I want to split the column into two at the 4th number, so that my data frame look like this:
df_split <- data.frame(year = rep(2005, 12), month = 1:12)
This is not so much a question about data frames, but about vectors in R in general. If your actual problem contains more nuances, then update your question or post a new question.
If your vector year is numerical (as asked), you can do simple maths:
year0 <- 200501:200512
year <- as.integer(year0 / 100) # examine the result of `year0 / 100` to see why I use `as.integer` at all and not `round`.
month <- year0 - year * 100
Edit: As nicola pointed out, we can calculate it in other ways, with exactly the same result:
year <- year0 %/% 100
month <- year0 %% 100
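Either way, the two vectors can then be reassembled into the data frame shape you asked for:
df_split <- data.frame(year = year, month = month)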
Alternatively, the tidyr version may be more compact:
library(tidyr)
df %>% separate(year, into = c("yr", "mth"), sep = 4, convert = TRUE)
This question already has answers here:
Calculate the mean of every 13 rows in data frame
(4 answers)
Closed 1 year ago.
This seems to me a very simple question, but I haven't managed to come up with an efficient approach.
I have a data frame in R composed as follows:
column position generated as seq(from = 1, to = nrow(df), by = 1)
column value, with some values associated with the position
I want to group the data frame every k rows (k being an integer input) and then calculate the mean of each group.
The dplyr function group_by does not allow me to group by a fixed number of rows.
How can I do that? Is there a way to avoid creating the column position at all?
Here is one option with gl from base R. Specify the n and k values; n is the total number of rows in the dataset.
library(dplyr)
k1 <- 5
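# df1 below is a made-up stand-in for your data (sketch only; any
# data frame with a value column works)
df1 <- data.frame(position = 1:10, value = rnorm(10))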
df1 %>%
group_by(grp = as.integer(gl(n(), k = k1, n()))) %>%
summarise(value = mean(value))
This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 4 years ago.
I am looking to filter out observations in my data based on certain values by group, where the values come from a separate reference table. I am also trying to work exclusively with dplyr; I've performed tasks like this with data.table before, but I'm not sure how to accomplish it here at all.
Here is some sample data to illustrate:
#Primary dataset
dat <- data.frame(
  account = c(1, 3, 3, 3, 5, 5, 7),
  ip = c("255.255.255", "255.255.255", "199.199.99", "255.255.255",
         "75.75.75", "120.120.120", "50.50.50"),
  value = c(50, 1000, 800, 2500, 3000, 500, 75)
)
From the dataset, I would like to filter based on a list of IPs per account, which is another table:
#Filtering reference table
exclude <- data.frame(
  account = c(3, 5),
  ip = c("255.255.255", "120.120.120")
)
The desired output of dat after filtering would be:
account ip value
1 1 255.255.255 50
2 3 199.199.99 800
3 5 75.75.75 3000
4 7 50.50.50 75
I am specifically unsure how to include the reference table in a group_by within a piped (%>%) series of dplyr verbs on dat. I may also be approaching the task incorrectly, given that I am still familiarizing myself with the dplyr style of programming, so I am open to a different way than the reference approach I am considering, as long as it is within dplyr.
How about:
dat %>%
mutate(accountip = paste0(account, ip)) %>%
filter(!(accountip %in% paste0(exclude$account,exclude$ip))) %>%
select(account, ip, value)
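The paste0() trick works, but dplyr also has a dedicated verb for exactly this: anti_join() keeps only the rows of dat that have no match in exclude, which gives the same result here:
library(dplyr)
dat %>%
anti_join(exclude, by = c("account", "ip"))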
This question already has answers here:
Filling missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 5 years ago.
I have a data frame like the following and would like to pad the dates.
Notice that four days are missing for id 3.
df = data.frame(
  id = c(1, 1, 1, 2, 2, 3, 3, 3),
  date = lubridate::ymd("2017-01-01", "2017-01-02", "2017-01-03",
                        "2017-05-10", "2017-05-11", "2017-01-03",
                        "2017-01-08", "2017-01-09"),
  type = c("A", "A", "A", "B", "B", "C", "C", "C"),
  val1 = rnorm(8),
  val2 = rnorm(8)
)
df
I tried the padr package as I wanted a quick solution, but this doesn't seem to work.
?pad
padr::pad(df)
library(dplyr)
df %>% padr::pad(group = c('id'))
df %>% padr::pad(group = c('id','date'))
Any ideas on tools or other packages to pad a dataset across multiple columns and based on groupings?
EDIT:
For id 3, my df contains only the dates
"2017-01-03","2017-01-08","2017-01-09"
Thus, I want the final data to include four extra rows for the missing dates
"2017-01-04","2017-01-05","2017-01-06","2017-01-07"