I'm looking to create a function.
I would like to count how many times each observation occurs within a given group (e.g. 5, 5 would be the value 5 occurring 2 times). Identical values of Days within a Week for a given Business are to be counted together. The counts will go in a new column, 'Total-Occurrences'.
I suspect tapply or plyr works its way into this, but I'm stuck on a few nuances.
Thanks!
14x3 matrix
Business Week Days
A 1 3
A 1 3
A 1 1
A 2 4
A 2 1
A 2 1
A 2 6
A 2 1
B 1 1
B 1 2
B 1 7
B 2 2
B 2 2
B 2 na
**AND BECOME**
10x4 matrix
Business Week Days Total-Occurrences
A 1 3 2
A 1 1 1
A 2 1 3
A 2 4 1
A 2 6 1
B 1 1 1
B 1 2 1
B 1 7 1
B 3 2 2
B 2 na 0
If I understand your question correctly, you want to group your data frame by Business, Week and Days, and put the number of occurrences of each group in a new column, Total-Occurrences.
df <- structure(list(Business = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"),
Week = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Days = structure(c(3L, 3L, 1L, 4L, 1L, 1L, 5L, 1L, 1L, 2L,
6L, 2L, 2L, 7L), .Label = c("1", "2", "3", "4", "6", "7",
"na"), class = "factor")), .Names = c("Business", "Week",
"Days"), class = "data.frame", row.names = c(NA, -14L))
There are certainly different ways of doing this. One way would be to use dplyr:
library(dplyr)
# %>% replaces the old %.% operator, which has been removed from dplyr
result <- df %>%
  group_by(Business, Week, Days) %>%
  summarize(Total.Occurences = n())
#>result
# Business Week Days Total.Occurences
#1 A 1 1 1
#2 A 1 3 2
#3 A 2 1 3
#4 A 2 4 1
#5 A 2 6 1
#6 B 1 1 1
#7 B 1 2 1
#8 B 1 7 1
#9 B 2 2 2
#10 B 2 na 1
You could also use plyr:
require(plyr)
ddply(df, .(Business, Week, Days), nrow)
Note that with these functions the output is slightly different from what you posted in your question. I assume that is a typo, because your original data has no Week 3 but your desired output does. Also, the "na" row gets a count of 1 rather than 0, because "na" is treated as just another value of Days.
Between the two solutions, the dplyr approach is probably faster.
I guess there are also other ways of doing this (but I'm not sure about tapply).
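For completeness, here is a minimal base R sketch of the same grouped count, without dplyr or plyr. It is only one possible approach: the data frame is retyped by hand from the question's table, the helper column `n` is an assumption introduced for illustration, and "na" is kept as a literal string, so it is counted like any other value.

```r
# Toy data retyped from the question's table ("na" kept as a plain string)
df <- data.frame(
  Business = rep(c("A", "B"), times = c(8, 6)),
  Week     = c(1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2),
  Days     = c("3", "3", "1", "4", "1", "1", "6", "1",
               "1", "2", "7", "2", "2", "na"),
  stringsAsFactors = FALSE
)

# Hypothetical helper column: one per row, so summing it counts rows
df$n <- 1L

# Sum the helper column within each Business/Week/Days combination
counts <- aggregate(n ~ Business + Week + Days, data = df, FUN = sum)
names(counts)[names(counts) == "n"] <- "Total.Occurences"

# Sort for readability
counts <- counts[order(counts$Business, counts$Week, counts$Days), ]
print(counts)
```

Only combinations that actually occur in the data appear in `counts`, which matches the dplyr and plyr results above.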
I have some sequence event data for which I want to plot the trend of missingness on value across time. Example below:
id time value
1 aa122 1 1
2 aa2142 1 1
3 aa4341 1 1
4 bb132 1 2
5 bb2181 2 1
6 bb3242 2 3
7 bb3321 2 NA
8 cc122 2 1
9 cc2151 2 2
10 cc3241 3 1
11 dd161 3 3
12 dd2152 3 NA
13 dd3282 3 NA
14 ee162 3 1
15 ee2201 4 2
16 ee3331 4 NA
17 ff1102 4 NA
18 ff2141 4 NA
19 ff3232 5 1
20 gg142 5 3
21 gg2192 5 NA
22 gg3311 5 NA
23 gg4362 5 NA
24 ii111 5 NA
The NAs are supposed to increase over time (the behaviors are fading). How do I plot the number of NAs across time?
I think this is what you're looking for: you want to see how many NAs appear over time. Assuming that is correct, and treating each time as a group, you can count the number of NAs in each group.
data:
df <- structure(list(id = structure(1:24, .Label = c("aa122", "aa2142",
"aa4341", "bb132", "bb2181", "bb3242", "bb3321", "cc122", "cc2151",
"cc3241", "dd161", "dd2152", "dd3282", "ee162", "ee2201", "ee3331",
"ff1102", "ff2141", "ff3232", "gg142", "gg2192", "gg3311", "gg4362",
"ii111"), class = "factor"), time = c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L,
5L, 5L), value = c(1L, 1L, 1L, 2L, 1L, 3L, NA, 1L, 2L, 1L, 3L,
NA, NA, 1L, 2L, NA, NA, NA, 1L, 3L, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-24L))
library(tidyverse)
library(ggplot2)
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value)))
# A tibble: 5 × 2
time sumNA
<int> <int>
1 1 0
2 2 1
3 3 2
4 4 3
5 5 4
You can then plot this using ggplot2
df %>%
group_by(time) %>%
summarise(sumNA = sum(is.na(value))) %>%
ggplot(aes(x=time)) +
geom_line(aes(y=sumNA))
As you can see, as time increases, the number of NAs also increases.
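Since the time groups in the example do not all have the same number of rows, it may also be worth looking at the share of NAs per time point, not just the raw count. A minimal base R sketch, assuming the same `time` and `value` columns as above (the names `na_count` and `na_share` are made up for illustration):

```r
# Same time/value columns as in the example data
df <- data.frame(
  time  = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3,
            4, 4, 4, 4, 5, 5, 5, 5, 5, 5),
  value = c(1, 1, 1, 2, 1, 3, NA, 1, 2, 1, 3, NA, NA, 1,
            2, NA, NA, NA, 1, 3, NA, NA, NA, NA)
)

# Number of NAs per time point (same numbers as the summarise() call)
na_count <- tapply(is.na(df$value), df$time, sum)

# Share of NAs per time point: the mean of a logical vector is a proportion
na_share <- tapply(is.na(df$value), df$time, mean)

print(na_count)
print(na_share)
```

Either vector can then be passed to a plotting function, for example with the time points on the x axis.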
I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
An option in base R would be to create a logical vector based on the elements of 'Day'. Use that index to subset the 'x', 'y' columns and assign them to NA
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
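A self-contained version of this logical-index approach, as a sketch: the data frame `df1` is rebuilt here from the table in the question, which is an assumption about how the real data is stored.

```r
# Rebuild the example table from the question
df1 <- data.frame(
  ID  = rep(1:4, each = 2),
  Day = rep(1:2, times = 4),
  x   = rep(1:4, each = 2),
  y   = rep(c(3L, 5L, 4L, 6L), each = 2)
)

# TRUE for the rows where Day is 1
i1 <- df1$Day == 1

# Overwrite x and y with NA on those rows only
df1[i1, c("x", "y")] <- NA

print(df1)
```

Rows with Day equal to 2 are left untouched, so the x and y values survive there.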
Here's a data.table solution. Since you may be new to R: you need to install the data.table package first. If you have a large data set, data.table may work faster than a plain data frame. Also, I find the syntax easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
Using dplyr, we can use mutate_at and replace:
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
I have data with a status column. I want to subset my data to the rows with status 'f', plus the row immediately before each 'f' row within the same id.
to simplify:
df
id status time
1 n 1
1 n 2
1 f 3
1 n 4
2 f 1
2 n 2
3 n 1
3 n 2
3 f 3
3 f 4
my result should be:
id status time
1 n 2
1 f 3
2 f 1
3 n 2
3 f 3
3 f 4
How can I do this in R?
Here's a solution using dplyr -
library(dplyr)
df %>%
  group_by(id) %>%
  # keep a row if it is 'f' or if the next row in the same id is 'f'
  filter(status == "f" | lead(status) == "f") %>%
  ungroup()
# A tibble: 6 x 3
id status time
<int> <fct> <int>
1 1 n 2
2 1 f 3
3 2 f 1
4 3 n 2
5 3 f 3
6 3 f 4
Data -
df <- structure(list(id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
status = structure(c(2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L,
1L), .Label = c("f", "n"), class = "factor"), time = c(1L,
2L, 3L, 4L, 1L, 2L, 1L, 2L, 3L, 4L)), .Names = c("id", "status",
"time"), class = "data.frame", row.names = c(NA, -10L))
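The same "this row or the next row is 'f'" idea can be sketched in base R with `ave()`, which applies a function within each id. This is only an illustration, not the answer's method: the names `is_f` and `next_is_f` are made up here, and the data frame is retyped from the question with status as a character column.

```r
# Example data retyped from the question
df <- data.frame(
  id     = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  status = c("n", "n", "f", "n", "f", "n", "n", "n", "f", "f"),
  time   = c(1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# Flag the rows that are 'f' themselves
is_f <- df$status == "f"

# Within each id, shift the flag up by one position so every row sees
# the flag of the row after it; the last row of an id sees 0 (FALSE)
next_is_f <- as.logical(
  ave(as.integer(is_f), df$id, FUN = function(v) c(v[-1], 0L))
)

# Keep 'f' rows and the rows directly before them
result <- df[is_f | next_is_f, ]
print(result)
```

Because the shift happens per id, an 'f' in the first row of one id never pulls in the last row of the previous id.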
I have the following data frame:
Event Scenario Year Cost
1 1 1 10
2 1 1 5
3 1 2 6
4 1 2 6
5 2 1 15
6 2 1 12
7 2 2 10
8 2 2 5
9 3 1 4
10 3 1 5
11 3 2 6
12 3 2 5
I need to produce a pivot table/data frame that sums the total cost per year for each scenario. So the result will be:
Scenario Year Cost
1 1 15
1 2 12
2 1 27
2 2 15
3 1 9
3 2 11
I need to produce a ggplot line graph that plot the cost of each scenario per year. I know how to do that, I just can't get the right data frame.
Try
library(dplyr)
df %>% group_by(Scenario, Year) %>% summarise(Cost=sum(Cost))
Or
library(data.table)
setDT(df)[, list(Cost=sum(Cost)), by=list(Scenario, Year)]
Or
aggregate(Cost~Scenario+Year, df,sum)
data
df <- structure(list(Event = 1:12, Scenario = c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), Year = c(1L, 1L, 2L, 2L, 1L, 1L,
2L, 2L, 1L, 1L, 2L, 2L), Cost = c(10L, 5L, 6L, 6L, 15L, 12L,
10L, 5L, 4L, 5L, 6L, 5L)), .Names = c("Event", "Scenario", "Year",
"Cost"), class = "data.frame", row.names = c(NA, -12L))
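If a pivot-table-style layout is wanted rather than the long data frame, base R's `xtabs()` is one option. The sketch below retypes the question's data; the name `tab` is just an illustration.

```r
# Example data retyped from the question
df <- data.frame(
  Event    = 1:12,
  Scenario = rep(1:3, each = 4),
  Year     = rep(rep(1:2, each = 2), times = 3),
  Cost     = c(10, 5, 6, 6, 15, 12, 10, 5, 4, 5, 6, 5)
)

# Long format: one row per Scenario/Year with the summed Cost
long_sums <- aggregate(Cost ~ Scenario + Year, data = df, FUN = sum)

# Wide, pivot-table-like format: Scenario in rows, Year in columns
tab <- xtabs(Cost ~ Scenario + Year, data = df)

print(long_sums)
print(tab)
```

For a ggplot line graph the long format is usually the more convenient of the two.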
The following does it:
library(plyr)
ddply(df, .(Scenario, Year), summarize, Cost = sum(Cost))
#Scenario Year Cost
#1 1 1 15
#2 1 2 12
#3 2 1 27
#4 2 2 15
#5 3 1 9
#6 3 2 11
I have a dataframe in long form for which I need to aggregate several observations taken on a particular day.
Example data:
long <- structure(list(Day = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("1", "2"), class = "factor"),
Genotype = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), View = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1",
"2", "3"), class = "factor"), variable = c(1496L, 1704L,
1738L, 1553L, 1834L, 1421L, 1208L, 1845L, 1325L, 1264L, 1920L,
1735L)), .Names = c("Day", "Genotype", "View", "variable"), row.names = c(NA, -12L),
class = "data.frame")
> long
Day Genotype View variable
1 1 A 1 1496
2 1 A 2 1704
3 1 A 3 1738
4 1 B 1 1553
5 1 B 2 1834
6 1 B 3 1421
7 2 A 1 1208
8 2 A 2 1845
9 2 A 3 1325
10 2 B 1 1264
11 2 B 2 1920
12 2 B 3 1735
I need to aggregate each genotype for each day by taking the cube root of the product of each view. So for genotype A on day 1, (1496 * 1704 * 1738)^(1/3). Final dataframe would look like:
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Have been going round and round with reshape2 for the last couple of days, but not getting anywhere. Help appreciated!
I'd probably use plyr and ddply for this task:
library(plyr)
ddply(long, .(Day, Genotype), summarize,
summary = prod(variable) ^ (1/3))
#-----
Day Genotype summary
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Or this with dcast:
library(reshape2)
dcast(data = long, Day + Genotype ~ .,
      value.var = "variable", fun.aggregate = function(x) prod(x) ^ (1/3))
#-----
Day Genotype NA
1 1 A 1642.418
2 1 B 1593.633
3 2 A 1434.695
4 2 B 1614.790
Another solution, without additional packages:
aggregate(list(Summary = long$variable),
          by = list(Day = long$Day, Genotype = long$Genotype),
          FUN = function(x) prod(x)^(1/length(x)))
Day Genotype Summary
1 1 A 1642.418
2 2 A 1434.695
3 1 B 1593.633
4 2 B 1614.790
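The cube root of a product of three values is their geometric mean, so an equivalent way to write it is `exp(mean(log(x)))`. This form avoids building a very large intermediate product, which may matter with many or much larger values. A minimal base R sketch, assuming all values are positive; `geo_mean` is a helper name introduced here, not a built-in function.

```r
# Example data retyped from the question
long <- data.frame(
  Day      = rep(c("1", "2"), each = 6),
  Genotype = rep(rep(c("A", "B"), each = 3), times = 2),
  View     = rep(1:3, times = 4),
  variable = c(1496, 1704, 1738, 1553, 1834, 1421,
               1208, 1845, 1325, 1264, 1920, 1735)
)

# Hypothetical helper: geometric mean via logs (requires positive values)
geo_mean <- function(x) exp(mean(log(x)))

# One geometric mean per Day/Genotype combination
res <- aggregate(variable ~ Day + Genotype, data = long, FUN = geo_mean)
names(res)[names(res) == "variable"] <- "summary"
print(res)
```

With three views per group this gives the same numbers as `prod(x)^(1/3)`, up to floating-point rounding.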