I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize it by grouping df by name and year, then computing the average of val and the number of entries in each group:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now I would like to replace every entry of asd$average where asd$ave_number < 2 with the value for the matching year from the following data frame:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") #To avoid duplicate of variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You never have to use the pipe operator. It is a convenience: it lets you chain (or "pipe") functions one after another, so the code reads in the same order as the steps you are thinking through.
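To illustrate the point about the pipe being optional, here is a minimal sketch comparing a nested call with its piped equivalent. The small data frame `toy` is made up for the illustration and is not part of the question.

```r
library(dplyr)

# Hypothetical toy data, only for this illustration
toy <- data.frame(name = c("a", "a", "b"), val = c(1, 3, 5))

# Nested form: read from the inside out
nested <- summarise(group_by(toy, name), average = mean(val))

# Piped form: read top to bottom, same operations in the same order
piped <- toy %>%
  group_by(name) %>%
  summarise(average = mean(val))

# Both forms should produce the same result
isTRUE(all.equal(nested, piped))
```

Which form to use is a matter of readability; the pipe does not change what is computed.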
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
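# Look the replacement up by name (the year as a string), not by position,
# so this also works when the years are not simply 1, 2, 3
mutate(average = if_else(ave_number < 2, unname(replacer[as.character(year)]), average))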
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
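If you would rather keep replacer as a data frame but avoid a join, a lookup with base R's match() is another option. This is only a sketch under the assumption that every year in the summarised data appears exactly once in replacer; the `.groups = "drop"` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

df <- data.frame(
  name = rep(c("a", "b"), each = 6),
  year = c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
  val  = c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60)
)
replacer <- data.frame(year = c(1, 2, 3), average = c(100, 200, 300))

asd <- df %>%
  group_by(name, year) %>%
  summarise(average = mean(val), ave_number = n(), .groups = "drop") %>%
  # match() returns, for each year, its row position in replacer$year
  mutate(average = if_else(ave_number < 2,
                           replacer$average[match(year, replacer$year)],
                           average))
asd
```

Because match() looks years up by value, this does not depend on the years being consecutive integers starting at 1.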
I have the following data:
library(dplyr)
my_data = data.frame(patient_id = c(1,1,1,1, 2,2,2),
age = c(43, 43, 44, 44, 21, 21, 21),
gender = c("M", "M", "M", "M", "F", "F", "F"),
appointment_number = c(1,2,3,4,1,2,3),
missed = c(0, 0, 1, 1, 1, 1, 1))
My question: grouped by patient_id, I want to create two variables:
The first variable takes the value of missed from the previous appointment.
The second variable takes the "n-1" cumulative average of missed over the previous appointments (e.g. if patient_id = 1 has 8 rows, the value at the 8th row would be the average of the first 7 rows).
Here is my attempt to do this:
my_data_final <- my_data %>%
group_by(patient_id) %>%
mutate(cummean = cumsum(missed)/(row_number() - 1)) %>%
mutate(previous_apt = lag(missed))
This results in the cummean variable being greater than 1, even though the variable in question can only be 1 or 0:
# A tibble: 7 x 7
# Groups: patient_id [2]
patient_id age gender appointment_number missed cummean previous_apt
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 43 M 1 0 NaN NA
2 1 43 M 2 0 0 0
3 1 44 M 3 1 0.5 0
4 1 44 M 4 1 0.667 1
5 2 21 F 1 1 Inf NA
6 2 21 F 2 1 2 1
7 2 21 F 3 1 1.5 1
Can someone please show me how to fix this?
Thanks!
Note: I tried to resolve this - is this correct?
my_data %>%
group_by(patient_id) %>%
mutate(previous_apt = lag(missed)) %>%
mutate(cummean = (cumsum(missed) - missed) / (row_number() - 1)) %>%
mutate(previous_apt_2 = lag(missed, 2))
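One way to check the revised attempt is to compute the same quantity a second way: take the ordinary running mean and shift it down one row with lag(). The sketch below does that on a reduced copy of the data; the column names `cummean_prev` and `cummean_alt` are made up for the comparison.

```r
library(dplyr)

my_data <- data.frame(
  patient_id = c(1, 1, 1, 1, 2, 2, 2),
  appointment_number = c(1, 2, 3, 4, 1, 2, 3),
  missed = c(0, 0, 1, 1, 1, 1, 1)
)

res <- my_data %>%
  group_by(patient_id) %>%
  mutate(
    previous_apt = lag(missed),
    # running mean up to and including the current row, shifted down one row
    cummean_prev = lag(cumsum(missed) / row_number()),
    # the formula from the revised attempt: drop the current value, divide by n - 1
    cummean_alt = (cumsum(missed) - missed) / (row_number() - 1)
  ) %>%
  ungroup()
res
```

The two columns should agree on every row except the first row of each patient, where the lag() version gives NA and the division version gives NaN (0/0). Both stay between 0 and 1, unlike the original attempt, which divided a sum over n rows by n - 1.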
Let's say I have the following data frame:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))
Output:
id col1 start_day end_day
1 1 a NA NA
2 1 a 1 2
3 1 b 15 15
4 2 c NA NA
5 2 d 4 6
6 2 e 22 22
7 3 f 5 6
8 3 g 11 12
9 3 h 14 16
10 3 g 18 21
I want to create a data frame such that for each unique id I get the minimum of start_day column and the maximum of the end_day column. Also I want to keep the other columns. One solution could be using group_by:
df %>% group_by(id) %>% summarise(start_day = min(start_day, na.rm = T),
end_day = max(end_day, na.rm = T))
Output:
id start_day end_day
1 1 1 15
2 2 4 22
3 3 5 21
But I lose the other columns (in this example, col1). How can I keep them? The desired outcome would look as follows:
id start_day end_day col1_start col1_end
1 1 1 15 a b
2 2 4 22 d e
3 3 5 21 f g
Is there any way to get the data frame I need?
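Extract the col1 values first, using the row index of the minimum start_day and of the maximum end_day, and only then summarise start_day and end_day. The order matters because summarise() overwrites the original columns with their summarised values as it goes.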
library(dplyr)
df %>%
group_by(id) %>%
summarise(col1_start = col1[which.min(start_day)],
col1_end = col1[which.max(end_day)],
start_day = min(start_day, na.rm = TRUE),
end_day = max(end_day, na.rm = TRUE))
-output
# A tibble: 3 × 5
id col1_start col1_end start_day end_day
<dbl> <chr> <chr> <dbl> <dbl>
1 1 a b 1 15
2 2 d e 4 22
3 3 f g 5 21
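To see why the order of the expressions inside summarise() matters here, the following sketch runs both orders on the same data. The object names `right` and `wrong` are made up for the comparison.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                 col1 = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
                 start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
                 end_day = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21))

# Index first: which.min()/which.max() still see the full per-group columns
right <- df %>%
  group_by(id) %>%
  summarise(col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)],
            start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE))

# Summarise first: start_day and end_day are already single values, so
# which.min()/which.max() always return 1 and pick the first col1 of each group
wrong <- df %>%
  group_by(id) %>%
  summarise(start_day = min(start_day, na.rm = TRUE),
            end_day = max(end_day, na.rm = TRUE),
            col1_start = col1[which.min(start_day)],
            col1_end = col1[which.max(end_day)])

right$col1_end
wrong$col1_end
```

Note that which.min() and which.max() skip NA values on their own, which is why no na.rm argument is needed for the index lookups.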
A similar question was asked here... however, I can't get it to work in my case and I'm not sure why.
I am trying to arrange a tibble based on 2 columns. For example, in my data, I am trying to arrange by the value and count columns. To begin, I show a working example:
library(dplyr)
dat <- tibble(
value = c("B", "D", "D", "E", "A", "A", "B", "C", "B", "E"),
ids = c(1:10),
count = c(3, 2, 1, 2, 2, 1, 2, 1, 1, 1)
)
dat %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
looking at the output:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 B 1 3 1
2 B 7 2 1
3 B 9 1 1
4 D 2 2 2
5 D 3 1 2
6 E 4 2 4
7 E 10 1 4
8 A 5 2 5
9 A 6 1 5
10 C 8 1 8
We can see that the code worked: the rows are grouped by the value column, and the groups are ordered by how many times each element appears in the tibble (i.e., the count).
However, when I try the following example, the same code doesn't work:
dat_1 <- tibble(
value = c("x2....", "x5...." , "x5....", "x3...." , "x3....", "x4....", "x3....", "x3....", "x4....", "x2...." ),
ids = c(1:10),
count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)
dat_1 %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
Looking at this output, we get:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 x2.... 1 2 1
2 x2.... 10 1 1
3 x5.... 2 2 2
4 x5.... 3 1 2
5 x3.... 4 4 4
6 x3.... 5 3 4
7 x3.... 7 2 4
8 x3.... 8 1 4
9 x4.... 6 2 6
10 x4.... 9 1 6
So we can see that this has failed to reorder the tibble based on the count. In the second example, x3 appears the most (i.e., has the highest count), so it should appear at the top of the tibble.
I'm not sure what I'm doing wrong here.
UPDATE:
I think I may have solved this problem with:
dat_1 %>%
group_by(value) %>%
mutate(valrank = max(count)) %>%
ungroup() %>%
arrange(-valrank, value, -count)
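For reference, here is a self-contained sketch of the update above, written with desc() instead of unary minus. It rests on the observation that valrank = min(ids) ranks each group by where its value first appears, which only coincidentally matched the frequency order in the first example, whereas max(count) ranks each group by its largest count.

```r
library(dplyr)

dat_1 <- tibble(
  value = c("x2....", "x5....", "x5....", "x3....", "x3....",
            "x4....", "x3....", "x3....", "x4....", "x2...."),
  ids = 1:10,
  count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)

sorted <- dat_1 %>%
  group_by(value) %>%
  mutate(valrank = max(count)) %>%   # largest count within each value
  ungroup() %>%
  # biggest groups first; ties broken by value, then by count within a value
  arrange(desc(valrank), value, desc(count))
sorted
```

With this data the x3 rows come first, followed by x2, x4 and x5, which tie on valrank and are therefore ordered alphabetically by value.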
I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
The test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(a,resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
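If "overall percentage" is meant as the total number of responses divided by the number of distinct ids across the whole table, the same summarise() call can simply be run without group_by(). This is a sketch under that reading of the question; the data is rebuilt with tibble() so the example is self-contained.

```r
library(dplyr)

# Same values as the question's test data
test <- tibble(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
         "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by(): one row summarising the whole table
overall <- test %>%
  summarise(no_resp = sum(resp, na.rm = TRUE),
            all = n_distinct(id),
            resp_rate = round(no_resp / all * 100))
overall
```

Note that the overall distinct count of id is not the sum of the per-group counts, because an id that occurs in two groups (here "hfg") is counted once overall.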
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
I would like to transform a messy dataset in R, but I am having trouble figuring out how to do so. Below are an example dataset and the result that I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated; however, using dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can extract the components individually, evaluating the room expression (e.g. 3+1) to get the total number of rooms
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # provides map_dbl()
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "(?<=\\s)\\d+(?=Â)"),
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Another option is to use extract() and then str_detect()
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
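A variation that avoids both eval() and tidyr is to pull all three numbers out in one pass with stringr's str_match() and add the room counts numerically. This is only a sketch: the `desc` vector below is a hypothetical, cleaned-up ASCII version of the DESC column (plain spaces, no stray characters), so the pattern may need adjusting for the raw data.

```r
library(dplyr)
library(stringr)

# Hypothetical cleaned-up descriptions, assumed for this sketch
desc <- c("3+1 81 m", "2+1 90 m", "3+KK 28 m", "3+1 120 m (Mezone)")

# Capture groups: rooms before "+", optional rooms after "+", then the meters
parts <- str_match(desc, "^(\\d+)\\+(\\d*)\\D+(\\d+)")

result <- tibble(
  ID = seq_along(desc),
  # an empty second group (the "KK" case) becomes NA and is treated as 0
  Rooms = as.integer(parts[, 2]) + coalesce(as.integer(parts[, 3]), 0L),
  Meters = as.integer(parts[, 4]),
  Mezone = as.integer(str_detect(desc, "Mezone")),
  KK = as.integer(str_detect(desc, "KK"))
)
result
```

Unlike the first approach above, this keeps Meters as an integer column rather than a character one.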