How to keep ordering after spread in R

I would like to know how to keep ordering after spread.
data <- tibble(var = c("A", "C", "D", "B"), score = c(1, 2, 4, 3))
data_spread <- data %>% spread(key = var, value = score)
I would like to keep the order of c("A","C","D","B").

One option is to convert 'var' to a factor with its levels set to the unique elements of 'var'; this makes the column order follow the order of occurrence in the data.
library(dplyr)
library(tidyr)
data %>%
  mutate(var = factor(var, levels = unique(var))) %>%
  spread(var, score)
# A tibble: 1 x 4
#      A     C     D     B
#  <dbl> <dbl> <dbl> <dbl>
#1     1     2     4     3
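As a side note, tidyr's newer pivot_wider() keeps columns in order of first appearance by default, so the factor step can be skipped. A minimal sketch, assuming tidyr >= 1.1 (where names_sort = FALSE is the default):
library(dplyr)
library(tidyr)
# pivot_wider() is the successor to spread(); with the default names_sort = FALSE
# the new columns appear in the order the 'var' values first occur: A, C, D, B
data %>%
  pivot_wider(names_from = var, values_from = score)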

Related

Summarise dataframe with correlation of variables based on multiple groups

I'm working with a dataset that has two-level grouping. Here is an example:
set.seed(123)
example = data.frame(
  id = c(rep(1, 20), rep(2, 20)),              # grouping
  Grp = rep(c(rep('A', 10), rep('B', 10)), 2), # grouping
  target = c(rep(1:10, 2), rep(20:11, 2)),
  var1 = sample(1:100, 40, replace = TRUE),
  var2 = sample(1:100, 40, replace = TRUE)
)
In this case, the data is grouped by id and by Grp. I want to calculate the correlation of target with var1 and var2, but I don't know the most efficient way to do this with a tidy approach while respecting the groups.
I tried a dplyr approach, like:
example %>%
  group_by(id, Grp) %>%
  summarise(cor(target, c(var1, var2))) # length error
I also tried creating a custom function and applying it, but that only summarises all the data at once, without grouping:
corr_analisis_e = function(df) {
  return(cor(df[, 'target'], df[, c('var1', 'var2')]))
}
example %>% group_by(id, Grp) %>% corr_analisis_e() # get all the data at once
As output, I would expect something like a matrix or data frame of 4 rows and 2 columns, where each row is a group (id and Grp) and the columns are var1 and var2, with each value being the result of the cor() call.
example %>%
  group_by(id, Grp) %>%
  summarise(across(c(var1, var2), ~ cor(.x, target)))
# A tibble: 4 x 4
# Groups:   id [2]
     id Grp     var1   var2
  <dbl> <chr>  <dbl>  <dbl>
1     1 A     -0.400  0.532
2     1 B     -0.133 -0.187
3     2 A     -0.655 -0.103
4     2 B     -0.580  0.445
The grouping columns can then be removed with %>% ungroup %>% select(var1:var2).
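If you want to keep the custom-function idea from the question, group_modify() is one way to apply it per group, since group_by() alone does not split the data for an ordinary function call. A minimal sketch, assuming dplyr >= 1.0; the function passed to group_modify() must return a data frame:
library(dplyr)
example %>%
  group_by(id, Grp) %>%
  # .x is each group's data without the grouping columns; cor() returns a
  # 1 x 2 matrix (var1, var2), which as.data.frame() turns into a one-row data frame
  group_modify(~ as.data.frame(cor(.x$target, .x[, c("var1", "var2")]))) %>%
  ungroup()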

Remove group if any member contains NA in R

How can I remove an entire group if one of its values is NA? For example, remove category B because it contains an NA.
library(dplyr)
tbl = tibble(category = c("A", "A", "B", "B"),
             values = c(2, 3, 1, NA))
We can use filter after grouping by 'category'
library(dplyr)
tbl %>%
  group_by(category) %>%
  filter(!any(is.na(values))) %>%
  ungroup
Output
# A tibble: 2 x 2
  category values
  <chr>     <dbl>
1 A             2
2 A             3
tbl %>%
  filter(!category %in% category[is.na(values)])
Output
  category values
  <chr>     <dbl>
1 A             2
2 A             3
tbl %>%
  group_by(category) %>%
  filter(all(!is.na(values)))
  category values
  <chr>     <dbl>
1 A             2
2 A             3
You can get the categories that have at least one NA value and exclude them.
subset(tbl, !category %in% unique(category[is.na(values)]))
#  category values
#  <chr>     <dbl>
#1 A             2
#2 A             3
If you prefer dplyr::filter:
library(dplyr)
tbl %>% filter(!category %in% unique(category[is.na(values)]))
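With more recent dplyr versions the same filter can be written without an explicit group_by()/ungroup() pair, using per-operation grouping. A sketch, assuming dplyr >= 1.1 (where the .by argument was added):
library(dplyr)
# drop every category that contains at least one NA
tbl %>%
  filter(!any(is.na(values)), .by = category)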

Organizing a data frame with multiple entries per sample

I have the following data frame with several entries per individual:
record_id <- c(21, 21, 21, 15, 15, 15, 2, 2, 2, 2, 3, 3, 3)
var <- c(0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0)
data <- data.frame(cbind(record_id, var))
I want to create a new data frame with just one row per record_id, with the condition that if an individual (record_id) has any data$var == 1, the outcome data frame must show 1 for that individual.
So, the outcome would be like this:
record_id <- c(21, 15, 2, 3)
var <- c(0, 1, 1, 1)
data_sol <- data.frame(cbind(record_id, var))
I have tried this:
DF1 <- data %>%
  group_by(record_id) %>%
  mutate(class = ifelse(var == 1, 1, 0)) %>%
  ungroup
I know it's not the best way; I was planning to take the unique values afterwards, but it did not do the trick.
If your 'var' is all zeroes or ones, you can also use max():
data %>%
  group_by(record_id) %>%
  summarise(new_var = max(var))
# A tibble: 4 x 2
  record_id new_var
      <dbl>   <dbl>
1         2       1
2         3       1
3        15       1
4        21       0
You can use mean() with mutate() to detect whether there exists any non-zero value inside a group, like:
data %>%
  group_by(record_id) %>%
  mutate(var = ifelse(mean(var) != 0, 1, 0)) %>%
  distinct(record_id, var)
which gives:
# A tibble: 4 x 2
# Groups:   record_id [4]
#   record_id   var
#       <dbl> <dbl>
# 1        21     0
# 2        15     1
# 3         2     1
# 4         3     1
We can do
library(dplyr)
data %>%
  group_by(record_id) %>%
  summarise(var = +(mean(var) != 0))
Or using slice_max()
data %>%
  group_by(record_id) %>%
  # with_ties = FALSE keeps a single row per group even when var is tied
  slice_max(order_by = var, n = 1, with_ties = FALSE)
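Since the requirement is literally "1 if the individual has any var == 1", any() states that intent directly. A minimal sketch along the same lines as the summarise() answers above:
library(dplyr)
data %>%
  group_by(record_id) %>%
  # any(var == 1) gives TRUE/FALSE; as.integer() turns it into 1/0
  summarise(var = as.integer(any(var == 1)))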

Using mutate with row and column indexing and group_by

I want to create a variable using dplyr that takes in a value conditional on another variable.
See example below.
data.frame(list(group = c('a', 'a', 'b', 'b'),
                time = c(1, 2, 1, 2),
                value = seq(1, 4, 1)))
I want to create a variable 'baseline' that takes the content of 'value' where time == 1, by group. The desired output would be:
data.frame(list(group = c('a', 'a', 'b', 'b'),
                time = c(1, 2, 1, 2),
                value = seq(1, 4, 1),
                baseline = c(1, 1, 3, 3)))
I tried to run the following code with indexing, but I am clearly going wrong somewhere:
x <- data.frame(list(group = c('a', 'a', 'b', 'b'),
                     time = c(1, 2, 1, 2),
                     value = seq(1, 4, 1)))
x %>%
  group_by(group) %>%
  mutate(baseline = .[[.$time==1,.$value]])
Thanks
We can use which.min
library(dplyr)
df1 %>%
  group_by(group) %>%
  mutate(baseline = value[which.min(time)])
# A tibble: 4 x 4
# Groups:   group [2]
#  group  time value baseline
#  <chr> <dbl> <dbl>    <dbl>
#1 a         1     1        1
#2 a         2     2        1
#3 b         1     3        3
#4 b         2     4        3
and if it is already ordered by 'time', then simply use first
df1 %>%
  group_by(group) %>%
  mutate(baseline = first(value))
data
df1 <- data.frame(group = c('a', 'a', 'b', 'b'),
                  time = c(1, 2, 1, 2),
                  value = seq(1, 4, 1))
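Another way to express "the value where time == 1", closer to the indexing attempted in the question, is to subset value with a logical condition inside mutate(). A sketch, assuming each group has exactly one row with time == 1:
library(dplyr)
df1 %>%
  group_by(group) %>%
  # value[time == 1] is length 1 per group, so it is recycled across the group
  mutate(baseline = value[time == 1])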

How to summarize_each with mixed column class

Consider the situation where I want to summarize_each a data.frame with mixed column types.
> (temp = data.frame(ID = c(1, 1, 2, 2), gender = c("M", "M", "F", "F"), val1 = rnorm(4), val2 = rnorm(4)))
  ID gender       val1       val2
1  1      M -1.7944804  0.5232313
2  1      M  0.3938437 -0.8424086
3  2      F -0.3190777  0.3220580
4  2      F  1.3667340 -0.6031376
> temp %>% group_by(ID) %>% summarize_each(funs(mean))
Source: local data frame [2 x 4]
     ID gender       val1       val2
  (dbl)  (lgl)      (dbl)      (dbl)
1     1     NA -0.7003184 -0.1595886
2     2     NA  0.5238282 -0.1405398
This doesn't work because mean(gender) doesn't make sense.
Question:
If all my non-numeric columns are characteristic of ID, and thus identical within each ID, can I somehow get summarize_each to return that 'unique' value?
> temp %>% group_by(ID, gender) %>% summarize_each(funs(mean))
Source: local data frame [2 x 4]
Groups: ID [?]
     ID gender       val1       val2
  (dbl) (fctr)      (dbl)      (dbl)
1     1      M -0.7003184 -0.1595886
2     2      F  0.5238282 -0.1405398
is the output that I want, but I somehow feel like this is doing unnecessary nested group_by because there really is nothing to group within ID.
One option would be gather/spread from tidyr: reshape to 'long' format with gather, then, grouped by 'ID' and 'var', take the first element of 'gender' and the mean of 'val', and spread back to 'wide' format.
library(tidyr)
library(dplyr)
gather(temp, var, val, val1:val2) %>%
  group_by(ID, var) %>%
  summarise(gender = first(gender), val = mean(val)) %>%
  spread(var, val)
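Roughly the same pipeline can be written with the newer pivot_longer()/pivot_wider() API, since gather()/spread() are retired. A sketch, assuming tidyr >= 1.0 and dplyr >= 1.0:
library(tidyr)
library(dplyr)
temp %>%
  pivot_longer(val1:val2, names_to = "var", values_to = "val") %>%
  group_by(ID, var) %>%
  summarise(gender = first(gender), val = mean(val), .groups = "drop") %>%
  pivot_wider(names_from = var, values_from = val)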
Another option is using mutate_if and unique. After grouping by 'ID', we get the mean of the numeric columns with mutate_if. As the other columns (i.e. 'gender') also remain in the output, we can just use unique to keep the unique rows.
temp %>%
  group_by(ID) %>%
  mutate_if(is.numeric, mean) %>%
  unique()
#     ID gender       val1       val2
#  <int>  <chr>      <dbl>      <dbl>
#1     1      M -0.7003184 -0.1595886
#2     2      F  0.5238281 -0.1405398
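For what it's worth, summarize_each() has since been superseded; with current dplyr the same result can be written with across(). A sketch, assuming dplyr >= 1.0:
library(dplyr)
temp %>%
  group_by(ID) %>%
  # gender is constant within ID, so first() recovers it; across() takes the
  # mean of every numeric column (the grouping column ID is excluded)
  summarise(gender = first(gender), across(where(is.numeric), mean))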
