How can I select other columns from the sf_MX data frame to include in sumbyweek? I am stuck.
sumbyweek <- sf_MX %>%
filter(CVE_ENT %in% c("09","15","17")) %>%
group_by(CVE_ENT) %>%
summarise(across(starts_with('cumul')[13:32],
sum,na.rm = TRUE,.names = '{col}_total'))%>%
select(Col1,col2) # unable to get the desired result
sf_MX Data Table:
Col 1 | Col 2 | Col 3| Cumul1 |Cumul2 | Cumul3 …
Expected result:
Col 1 | Col 2 | Cumul1_total |Cumul2_total |Cumul3_total
We could do
library(dplyr)
sumbyweek <- sf_MX %>%
filter(CVE_ENT %in% c("09","15","17")) %>%
group_by(CVE_ENT) %>%
summarise(across(starts_with('cumul'),
sum, na.rm = TRUE, .names = '{col}_total'))
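To also keep other columns in the result, one option is to add them to group_by(), since summarise() keeps only the grouping columns and the summaries. This is a minimal sketch on a toy stand-in for sf_MX: the NOM_ENT column and all values are assumptions made up for illustration, and it only makes sense when the extra columns are constant within each CVE_ENT.

```r
library(dplyr)

# Toy stand-in for sf_MX; NOM_ENT and all values are hypothetical
sf_MX <- data.frame(
  CVE_ENT = c("09", "09", "15", "17", "20"),
  NOM_ENT = c("CDMX", "CDMX", "Mexico", "Morelos", "Oaxaca"),
  cumul1  = c(1, 2, 3, 4, 5),
  cumul2  = c(10, NA, 30, 40, 50),
  stringsAsFactors = FALSE
)

sumbyweek <- sf_MX %>%
  filter(CVE_ENT %in% c("09", "15", "17")) %>%
  # Every column listed here survives summarise()
  group_by(CVE_ENT, NOM_ENT) %>%
  summarise(across(starts_with("cumul"),
                   ~ sum(.x, na.rm = TRUE),
                   .names = "{.col}_total"),
            .groups = "drop")

sumbyweek
```

If the extra columns vary within a group, they would need their own summary (for example first()) instead of being grouping variables.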
I have two existing columns: column A and column B. I would like to subtract column B from column A to get column C. I have tried many approaches, but none of them work. I would prefer to do this in R. Has anyone done something like this before?
|A | B | C |
|rs17158930-G | rs17158930 | G |
|snp-120820-?xrs65832-?;rs10405-A| snp-120820xrs65832;rs10405 |?x?;A |
|rs11829119-C;rs17790731-A |rs11829119;rs17790731 | C;A |
I've changed the data a little, the data became more complicated. Still want to get column C. I've tried the following code, but an error arose.
gwas1 %>%
mutate(row1 = row_number()) %>%
separate_rows(A, B, sep = ';') %>%
mutate(row2 = row_number()) %>%
separate_rows(A, B, sep = 'x') %>%
transform(b= sub("(.*)-.*", "\\1", A), C= sub(".*-", "", A))%>%
group_by(row2) %>%
summarise(across(c(A, B, b), paste0, collapse = 'x'),
C= paste0(C[b %in% B], collapse = 'x')) %>%
group_by(row1) %>%
summarise(across(c(A, B, b), paste0, collapse = ';'),
C= paste0(C[b %in% B], collapse= ';'))
Error: Must group by variables found in .data.
Column row1 is not found.
When I generate the row1 column beforehand using gwas1 <- gwas1 %>% mutate(row1 = row_number()), the error remains.
How to solve it?
One way to do this would be to:
1. Create a column with the row number.
2. Split the A and B columns on ';' to get them in different rows.
3. Split the A column on '-'.
4. For each row, create the C column to include only those values that match in the A and B columns.
5. Combine the A and B columns into one row again.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
separate_rows(A, B, sep = ';') %>%
separate(A, c('A', 'res'), sep = '-') %>%
group_by(row) %>%
summarise(across(c(A, B), paste0, collapse = ';'),
C = paste0(res[A %in% B], collapse = ';')) %>%
select(-row) -> result
result
# A B C
# <chr> <chr> <chr>
#1 rs17158930 rs17158930 G
#2 rs16935279;rs10405744 rs16935279;rs10405744 C;A
#3 rs11829119 rs11829119 C
#4 rs17066873 rs17066873 C
#5 rs17790731 rs17790731 A
data
df <- structure(list(A = c("rs17158930-G", "rs16935279-C;rs10405744-A",
"rs11829119-C", "rs17066873-C", "rs17790731-A"), B = c("rs17158930",
"rs16935279;rs10405744", "rs11829119", "rs17066873", "rs17790731"
)), row.names = c(NA, -5L), class = "data.frame")
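On the error in the question: after the first summarise() (grouped by row2), only row2 and the summarised columns remain, so row1 no longer exists when group_by(row1) runs. For the more complicated data in the question's edit, a regex-only alternative can be sketched as follows. It is a different technique from the split-and-recombine approach above: it does not check A against B at all, and it assumes that the value to keep always follows the last hyphen of an entry and that the IDs never contain ';' or 'x'.

```r
# Data from the question's edit: entries are joined by ';' and by 'x'
gwas1 <- data.frame(
  A = c("rs17158930-G",
        "snp-120820-?xrs65832-?;rs10405-A",
        "rs11829119-C;rs17790731-A"),
  B = c("rs17158930",
        "snp-120820xrs65832;rs10405",
        "rs11829119;rs17790731"),
  stringsAsFactors = FALSE
)

# Within each entry (text between ';' or 'x' separators), drop everything
# up to and including the last hyphen, leaving only the trailing value
gwas1$C <- gsub("[^;x]*-", "", gwas1$A)

gwas1$C
```

Because the separators themselves are never matched, the ';' and 'x' structure of A carries over to C unchanged.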
I'm sure there's a way to do this but I can't figure it out. I'd like to be able to pass a list of arguments to mutate_at() within a function without having to specify each argument.
library(tidyverse)
fake_data <-
tibble(
id = letters[1:6],
ind_group_a = rep(0:1, times = 3),
ind_group_b = rep(1:0, each = 3)
)
# id ind_group_a ind_group_b
# a 0 1
# b 1 1
# c 0 1
# d 1 0
# e 0 0
# f 1 0
This function then converts all 1's to "yes" and 0's to "no":
recode_indicator <- function(x, if_1 = "yes", if_0 = "no") {
ifelse(x == 1, if_1, if_0)
}
And I can use it fine like so:
fake_data %>%
mutate_at(
vars(starts_with("ind_")),
recode_indicator,
if_1 = "Has",
if_0 = "Missing"
)
# id ind_group_a ind_group_b
# <chr> <chr> <chr>
# a Missing Has
# b Has Has
# c Missing Has
# d Has Missing
# e Missing Missing
# f Has Missing
This is a simplified example, but what I'd like to do is wrap it in a function without having to write out all of the arguments. Ideally something short like binary_values = list(...), but I can't figure out how to pass these items as the additional arguments of recode_indicator().
roll_up_indicators <- function(x,
#binary_values = list(if_1 = "yes", if_0 = "no"),
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
recode_indicator # ,
# binary_values # <- here's the problem area
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
fake_data %>% roll_up_indicators()
# |group_a |group_b | n|
# |:-------|:-------|--:|
# |no      |no      |  1|
# |no      |yes     |  2|
# |yes     |no      |  2|
# |yes     |yes     |  1|
Update
In terms of not rewriting all of the arguments, the formals() function can be used:
roll_up_indicators <- function(x,
binary_values = formals(recode_indicator), # <--- formals
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
partial(recode_indicator, !!!binary_values) # <--- the winning answer
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
One solution is to use purrr::partial to specify that if_1 and if_0 arguments should come from binary_values:
roll_up_indicators <- function(x,
binary_values = list(if_1 = "yes", if_0 = "no"),
...) {
ind_cols <- grep("^ind_", names(x))
df <-
x %>%
rename_at(ind_cols, str_remove, "^ind_") %>%
mutate_at(
ind_cols,
partial(recode_indicator, !!!binary_values) ## <--- partial() here
) %>%
group_by_at(ind_cols) %>%
count() %>%
ungroup()
knitr::kable(df, ...)
}
fake_data %>% roll_up_indicators()
# |group_a |group_b | n|
# |:-------|:-------|--:|
# |no      |no      |  1|
# |no      |yes     |  2|
# |yes     |no      |  2|
# |yes     |yes     |  1|
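The same idea of handing a whole list of arguments to recode_indicator() can also be sketched without purrr, using base R's do.call() inside across(). This is only an illustration, assuming dplyr 1.0.0 or later (for across()); it reuses fake_data and recode_indicator() from the question.

```r
library(dplyr)

recode_indicator <- function(x, if_1 = "yes", if_0 = "no") {
  ifelse(x == 1, if_1, if_0)
}

fake_data <- tibble(
  id = letters[1:6],
  ind_group_a = rep(0:1, times = 3),
  ind_group_b = rep(1:0, each = 3)
)

binary_values <- list(if_1 = "Has", if_0 = "Missing")

recoded <- fake_data %>%
  mutate(across(
    starts_with("ind_"),
    # Prepend the column to the argument list, then call recode_indicator()
    function(col) do.call(recode_indicator, c(list(col), binary_values))
  ))

recoded
```

do.call() builds the call recode_indicator(col, if_1 = "Has", if_0 = "Missing"), so any named element of binary_values is matched to the argument of the same name.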
It's probably best to go with the pre-made functions, like recode, but I've also adapted your function if you wanted to add additional functionality. For that, I'm assuming that binary_values is appropriately named and will only ever include two values.
Option 1: Use recode
This requires you to put the starting and ending values within a list. You'll need to quote strings, and either quote numbers or wrap them in backticks.
binary_values = list("1" = "yes", "0" = "no")
fake_data %>%
mutate_at(vars(starts_with("ind_")),
list(~recode(.,!!!binary_values)))
Option 2: Specify location or name in list within function
recode_value <- function(x,
binary_values = list(if_1 = "yes", if_0 = "no")
## You'll need to decide whether you'll name them as expected or always put them in this order; it's up to you
) {
if_1 = binary_values$if_1 # or binary_values[[1]]
if_0 = binary_values$if_0 # or binary_values[[2]]
ifelse(x == 1, if_1, if_0)
}
binary_values = list(if_1 = "yes", if_0 = "no")
fake_data %>%
mutate_at(
vars(starts_with("ind_")),
recode_value, ## fixed typo
binary_values
)
Consider the data frame below:
Sample DataFrame
| Name | Age | Type |
---------------------
| EF | 50 | A |
| GH | 60 | B |
| VB | 70 | C |
Code to perform Filter
df2 <- df1 %>% filter(Type == 'C') %>% select(Name)
The above code gives me a data frame with a single column and a single row.
I would like to perform a conditional filter where if a certain type is not present it should consider the name to be NULL/NA.
Example
df2 <- df1 %>% filter(Type == 'D') %>% select(Name)
Must give an output of;
| Name |
--------
| NA |
Instead of throwing an error. Any input would be really helpful; a dplyr solution or any other method is welcome.
Here is a base R approach:
name <- df[df$Type == "D", "Name"]
ifelse(identical(name, character(0)), NA, name)
[1] NA
If no row has Type equal to D, the subset operation returns character(0). We can compare the output against this, and then return NA as appropriate.
Data:
df <- data.frame(Name=c("EF", "GH", "VB"),
Age=c(50, 60, 70),
Type=c("A", "B", "C"),
stringsAsFactors=FALSE)
An approach with complete from tidyr would be:
library(dplyr)
library(tidyr)
df1 %>%
complete(Type = LETTERS) %>% # Specify which Types you'd expect, other values are filled with NA
filter(Type == 'D') %>%
select(Name)
# A tibble: 1 x 1
# Name
# <fct>
# 1 NA
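Another way to get the same effect, sketched here with a small wrapper, is to run the ordinary filter and substitute a one-row NA data frame when nothing matches. The helper name names_for_type is hypothetical, not part of dplyr.

```r
library(dplyr)

df1 <- data.frame(Name = c("EF", "GH", "VB"),
                  Age  = c(50, 60, 70),
                  Type = c("A", "B", "C"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: names for a given Type, or a single NA row if none match
names_for_type <- function(data, type) {
  out <- data %>%
    filter(Type == type) %>%
    select(Name)
  if (nrow(out) == 0) {
    out <- data.frame(Name = NA_character_, stringsAsFactors = FALSE)
  }
  out
}

names_for_type(df1, "C")  # one row containing "VB"
names_for_type(df1, "D")  # one row containing NA
```

Unlike the complete() approach, this does not require listing the Type values you expect in advance.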
This question is essentially a duplicate of this question, except I am working in R. The pyspark solution looks solid, but I haven't been able to figure out how to apply collect_list over a window function in the same way in sparklyr.
I have a Spark DataFrame with the following structure:
------------------------------
userid | date | city
------------------------------
1 | 2018-08-02 | A
1 | 2018-08-03 | B
1 | 2018-08-04 | C
2 | 2018-08-17 | G
2 | 2018-08-20 | E
2 | 2018-08-23 | F
I am trying to group the DataFrame by userid, order each group by date, and collapse the city column into a concatenation of its values. Desired output:
------------------
userid | cities
------------------
1 | A, B, C
2 | G, E, F
The trouble is that each method I've tried to do this with has resulted in some users (appx. 3% on a test of 5000 users) not having their "cities" column in the correct order.
Attempt 1: using dplyr and collect_list.
my_sdf %>%
dplyr::group_by(userid) %>%
dplyr::arrange(date) %>%
dplyr::summarise(cities = paste(collect_list(city), sep = ", "))
Attempt 2: using replyr::gapply since the operation fits the description of "Grouped-Order-Apply".
get_cities <- . %>%
summarise(cities = paste(collect_list(city), sep = ", "))
my_sdf %>%
replyr::gapply(gcolumn = "userid",
f = get_cities,
ocolumn = "date",
partitionMethod = "group_by")
Attempt 3: write as a SQL window function.
my_sdf %>%
spark_session(sc) %>%
sparklyr::invoke("sql",
"SELECT userid, CONCAT_WS(', ', collect_list(city)) AS cities
OVER (PARTITION BY userid
ORDER BY date)
FROM my_sdf") %>%
sparklyr::sdf_register() %>%
sparklyr::sdf_copy_to(sc, ., "my_sdf", overwrite = T)
^ throws the following error:
Error: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'OVER' expecting <EOF>(line 2, pos 19)
== SQL ==
SELECT userid, conversion_location, CONCAT_WS(' > ', collect_list(channel)) AS path
OVER (PARTITION BY userid, conversion_location
-------------------^^^
ORDER BY occurred_at)
FROM paths_model
Solved! I misunderstood how collect_list() and Spark SQL could work together. I didn't realize a list could be returned, I thought that the concatenation had to take place within the query. The following produces the desired result:
spark_output <- spark_session(sc) %>%
sparklyr::invoke("sql",
"SELECT userid, collect_list(city)
OVER (PARTITION BY userid
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
AS cities
FROM my_sdf") %>%
sdf_register() %>%
group_by(userid) %>%
filter(row_number(userid) == 1) %>%
ungroup() %>%
mutate(cities = paste(cities, sep = " > ")) %>%
sdf_register()
Ok: so I admit that the following solution is not at all efficient (it uses a for loop and is actually a lot of code for what seems like it could be a simple task), but I believe this should work:
#install.packages("tidyverse") # if needed
library(tidyverse)
df <- tribble(
~userid, ~date, ~city,
1 , "2018-08-02" , "A",
1 , "2018-08-03" , "B",
1 , "2018-08-04" , "C",
2 , "2018-08-17" , "G",
2 , "2018-08-20" , "E",
2 , "2018-08-23" , "F"
)
cityPerId <- df %>%
spread(key = date, value = city)
toMutate <- NA
for (i in 1:nrow(cityPerId)) {
cities <- cityPerId[i,][2:ncol(cityPerId)] %>% t() %>%
as.vector() %>%
na.omit()
collapsedCities <- paste(cities, collapse = ",")
toMutate <- c(toMutate, collapsedCities)
}
toMutate <- toMutate[2:length(toMutate)]
final <- cityPerId %>%
mutate(cities = toMutate) %>%
select(userid, cities)
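For a local data frame (not a Spark DataFrame, so this does not address the sparklyr ordering problem itself), the loop above can be replaced by a short dplyr pipeline. A minimal sketch, with the rows deliberately shuffled to show that the ordering comes from arrange():

```r
library(dplyr)

# Same data as above, with rows shuffled and date stored as Date
df <- data.frame(
  userid = c(2, 1, 2, 1, 1, 2),
  date   = as.Date(c("2018-08-23", "2018-08-03", "2018-08-17",
                     "2018-08-02", "2018-08-04", "2018-08-20")),
  city   = c("F", "B", "G", "A", "C", "E"),
  stringsAsFactors = FALSE
)

cities_by_user <- df %>%
  arrange(userid, date) %>%   # order rows by date within each user
  group_by(userid) %>%
  summarise(cities = paste(city, collapse = ", "))

cities_by_user
# userid 1: "A, B, C"; userid 2: "G, E, F"
```

Note that collapse (not sep) is the paste() argument that joins the elements of a single vector into one string.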
I have table like this (but number of columns can be different, I have a number of pairs ref_* + alt_*):
+--------+-------+-------+-------+-------+
| GeneID | ref_a | alt_a | ref_b | alt_b |
+--------+-------+-------+-------+-------+
| a1 | 0 | 1 | 1 | 3 |
| a2 | 1 | 1 | 7 | 8 |
| a3 | 0 | 1 | 1 | 3 |
| a4 | 0 | 1 | 1 | 3 |
+--------+-------+-------+-------+-------+
and need to filter out rows that have ref_a + alt_a < 10 and ref_b + alt_b < 10. It's easy to do it with apply, creating additional columns and filtering, but I'm learning to keep my data tidy, so trying to do it with dplyr.
I would use mutate first to create columns with sums and then filter by these sums. But can't figure out how to use mutate in this case.
Edited:
Number of columns is not fixed!
You do not need to mutate here. Just do the following:
require(tidyverse)
df %>%
filter(ref_a + alt_a < 10 & ref_b + alt_b < 10)
If you want to use mutate first you could go with:
df %>%
mutate(sum1 = ref_a + alt_a, sum2 = ref_b + alt_b) %>%
filter(sum1 < 10 & sum2 < 10)
Edit: The fact that we don't know the number of variables in advance makes it a bit more complicated. However, I think you could use the following code to perform this task (assuming that the variable names are all formatted with "_a", "_b" and so on). I hope there is a shorter way to perform this task :)
df$GeneID <- as.character(df$GeneID)
df %>%
gather(variable, value, -GeneID) %>%
rowwise() %>%
mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
ungroup() %>%
group_by(GeneID, variable) %>%
summarise(sum = sum(value)) %>%
filter(sum < 10) %>%
summarise(keepGeneID = ifelse(n() == (ncol(df) - 1)/2, TRUE, FALSE)) %>%
filter(keepGeneID == TRUE) %>%
select(GeneID) -> ids
df %>%
filter(GeneID %in% ids$GeneID)
Edit 2: After some rework I was able to improve the code a bit:
df$GeneID <- as.character(df$GeneID)
df %>%
gather(variable, value, -GeneID) %>%
rowwise() %>%
mutate(variable = unlist(strsplit(variable, "_"))[2]) %>%
ungroup() %>%
group_by(GeneID, variable) %>%
summarise(sum = sum(value)) %>%
group_by(GeneID) %>%
summarise(max = max(sum)) %>%
filter(max < 10) -> ids
df %>%
filter(GeneID %in% ids$GeneID)
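With more recent tidyr and dplyr versions (assuming tidyr 1.0.0 or later for pivot_longer() and dplyr 1.0.0 or later for .groups), the same logic can be sketched more compactly. pivot_longer() splits names such as ref_a into two parts directly, so rowwise() and strsplit() are not needed. The column names allele and sample are labels chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# The table from the question
df <- data.frame(
  GeneID = c("a1", "a2", "a3", "a4"),
  ref_a  = c(0, 1, 0, 0),
  alt_a  = c(1, 1, 1, 1),
  ref_b  = c(1, 7, 1, 1),
  alt_b  = c(3, 8, 3, 3),
  stringsAsFactors = FALSE
)

ids <- df %>%
  # "ref_a" becomes allele = "ref", sample = "a"
  pivot_longer(-GeneID, names_to = c("allele", "sample"), names_sep = "_") %>%
  group_by(GeneID, sample) %>%
  summarise(total = sum(value), .groups = "drop") %>%   # ref + alt per pair
  group_by(GeneID) %>%
  summarise(max_total = max(total)) %>%
  filter(max_total < 10)    # every pair must sum to less than 10

result <- df %>%
  filter(GeneID %in% ids$GeneID)

result
```

As in the code above, this keeps the rows where every ref/alt pair sums to less than 10; flip the final condition if those are the rows to be dropped instead.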