This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
mutate(newdescription = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
group_by(id, newdescription) %>%
summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement. Try group_by(id, init_cont, family, newdescription)
Because your id seems to be mapped to init_cont and family already, adding these columns to the grouping won't change your summarization result. Then you have all the columns you want with no extra work.
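A minimal sketch of that idea on the question's toy table. It assumes the labels are "impacteach" and "impactover" as in the sample table (the question's code uses "impactoverall"), and it overwrites description directly instead of creating newdescription:

```r
library(dplyr)

# Toy data copied from the question; the real data has many more columns.
df <- data.frame(
  id = rep(1:3, each = 3),
  init_cont = rep(c("K", "I", "K"), each = 3),
  family = rep(c("S", "S", "D"), each = 3),
  description = rep(c("impacteach", "impactover", "read"), times = 3),
  value = c(1, 3, 2, 2, 4, 1, 3, 5, 3),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Collapse the two impact labels into one before grouping.
  mutate(description = if_else(description %in% c("impacteach", "impactover"),
                               "impact", description)) %>%
  # Grouping by the extra id-level columns keeps them in the output.
  group_by(id, init_cont, family, description) %>%
  summarise(value = mean(value), .groups = "drop")

print(result)
```

Any column that is constant within an id can be added to group_by in the same way; columns that vary within an id would split the groups and change the averages.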
If you have a lot of columns you could try something like the code below. Essentially, do a left_join of your summarised data onto your original data, using the . placeholder so that you don't have to store a new dataframe. Once joined (by id and description, which we modified in place) you'll have two value columns, suffixed with .x and .y. Drop the original one and then use distinct to get rid of the duplicate 'impact' rows.
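df %>%
mutate(description = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
# the braces make . refer to the mutated data in both arguments of left_join
{left_join(.,
group_by(., id, description) %>%
summarise(value = mean(as.numeric(value))),
by = c("id", "description"))} %>%
select(-value.x) %>% # keep only the averaged value, which is now named value.y
distinct()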
gsub can be used to replace any description starting with impact by impact, and then group_by from the dplyr package will help in summarising the value.
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
Related
I have 1 dataframe of data and multiple "reference" dataframes. I'm trying to automate checking if values of the dataframe match the values of the reference dataframes. Importantly, the values must also be in the same order as the values in the reference dataframes. The columns shown are the columns of importance, but my real dataset contains many more columns.
Below is a toy dataset.
Dataframe
group type value
1 A Teddy
1 A William
1 A Lars
2 B Dolores
2 B Elsie
2 C Maeve
2 C Charlotte
2 C Bernard
Reference_A
type value
A Teddy
A William
A Lars
Reference_B
type value
B Elsie
B Dolores
Reference_C
type value
C Maeve
C Hale
C Bernard
For example, in the toy dataset, group 1 would score 1.0 (100% correct) because all its values in A match the values, and the order of values, of A in Reference_A. Group 2, however, would score 0.0 for type B because the values in B are out of order compared to Reference_B, and 0.66 for type C because 2/3 of the values in C match the values and order of values in Reference_C.
Desired output
group type score
1 A 1.0
2 B 0.0
2 C 0.66
This was helpful, but does not take order into account:
Check whether values in one data frame column exist in a second data frame
Update: Thank you to everyone who has provided solutions! These solutions work well for the toy dataset, but I have not yet been able to adapt them to datasets with more columns. As I wrote in my post, the columns listed above are the ones of importance; I'd prefer not to drop the other columns if possible.
We may also do this with mget to return a list of data.frames, bind them together, and do a group by mean of logical vector
library(dplyr)
mget(ls(pattern = '^Reference_[A-Z]$')) %>%
bind_rows() %>%
bind_cols(df1) %>%
group_by(group, type = type...1) %>%
summarise(score = mean(value...2 == value...5))
# Groups: group [2]
# group type score
# <int> <chr> <dbl>
#1 1 A 1
#2 2 B 0
#3 2 C 0.667
This is another tidyverse solution. Here, I am adding a counter (i.e. rowname) to both reference and data. Then I join them together on type and rowname. At the end, I summarize them on type to get the desired output.
library(dplyr)
library(purrr)
library(tibble)
list(Reference_A, Reference_B, Reference_C) %>%
map(., rownames_to_column) %>%
bind_rows %>%
left_join({Dataframe %>%
group_split(type) %>%
map(., rownames_to_column) %>%
bind_rows},
. , by=c("type", "rowname")) %>%
group_by(type) %>%
dplyr::summarise(group = head(group,1),
score = sum(value.x == value.y)/n())
#> # A tibble: 3 x 3
#> type group score
#> <chr> <int> <dbl>
#> 1 A 1 1
#> 2 B 2 0
#> 3 C 2 0.667
Here's a "tidy" method:
library(dplyr)
# library(purrr) # map2_dbl
Reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
nest_by(type, .key = "ref") %>%
ungroup()
Reference
# # A tibble: 3 x 2
# type ref
# <chr> <list<tbl_df[,1]>>
# 1 A [3 x 1]
# 2 B [2 x 1]
# 3 C [3 x 1]
Dataframe %>%
nest_by(group, type, .key = "data") %>%
left_join(Reference, by = "type") %>%
mutate(
score = purrr::map2_dbl(data, ref, ~ {
if (length(.x) == 0 || length(.y) == 0) return(numeric(0))
if (length(.x) != length(.y)) return(0)
sum((is.na(.x) & is.na(.y)) | .x == .y) / length(.x)
})
) %>%
select(-data, -ref) %>%
ungroup()
# # A tibble: 3 x 3
# group type score
# <int> <chr> <dbl>
# 1 1 A 1
# 2 2 B 0
# 3 2 C 0.667
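Regarding the update in the question about keeping the other columns, here is a hedged sketch of the same position-based idea that leaves every row and column of the data in place. The extra column is hypothetical and only stands in for the additional columns of the real data; pos, ref_value and matched are names introduced for this illustration.

```r
library(dplyr)

# Toy data from the question, plus one hypothetical extra column.
Dataframe <- data.frame(
  group = c(1, 1, 1, 2, 2, 2, 2, 2),
  type  = c("A", "A", "A", "B", "B", "C", "C", "C"),
  value = c("Teddy", "William", "Lars", "Dolores", "Elsie",
            "Maeve", "Charlotte", "Bernard"),
  extra = 1:8,
  stringsAsFactors = FALSE
)
Reference_A <- data.frame(type = "A", value = c("Teddy", "William", "Lars"),
                          stringsAsFactors = FALSE)
Reference_B <- data.frame(type = "B", value = c("Elsie", "Dolores"),
                          stringsAsFactors = FALSE)
Reference_C <- data.frame(type = "C", value = c("Maeve", "Hale", "Bernard"),
                          stringsAsFactors = FALSE)

# Number the reference values within each type so order can be compared.
reference <- bind_rows(Reference_A, Reference_B, Reference_C) %>%
  group_by(type) %>%
  mutate(pos = row_number()) %>%
  ungroup() %>%
  rename(ref_value = value)

# Number the data the same way and join by type and position.
checked <- Dataframe %>%
  group_by(group, type) %>%
  mutate(pos = row_number()) %>%
  ungroup() %>%
  left_join(reference, by = c("type", "pos")) %>%
  mutate(matched = !is.na(ref_value) & value == ref_value)

# Score = share of positions whose value equals the reference value.
scores <- checked %>%
  group_by(group, type) %>%
  summarise(score = mean(matched), .groups = "drop")

print(scores)
```

Here checked still has all eight rows and the extra column, so the per-row match flag can be inspected before summarising. A data value with no reference value at the same position counts as a mismatch, which is an assumption about the intended scoring.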
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use left_join because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. I also don't want it to link data like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers all get the first match. I need it to return the first match, then the second, then the third, and so on (the dates are sorted so that the newest date is always first for every ID number), but without adding rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column to create subid's that follow the order of the rownumbers. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
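library(dplyr)
# number the rows within each ID in both data sets
df1 <- df1 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(sub_id = row_number())
# match the n-th row of an ID in df1 with the n-th row of that ID in df2
outcome <- df1 %>% inner_join(df2, by = c("ID", "sub_id"))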
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-increment id within each group, then join as usual:
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
My ultimate goal is to do a series of chisq.test's on this data, comparing the values of 'dealer','store' and 'transport' by 'gender'. I'm using spread and gather to create a column of 'female' and one for 'males' then planned to use group_by & map to run the chisq.test by group of 'key', which is created in my gather argument. I'm doing something wrong because I'm getting grouped NA's back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
df_ %>%
gather(key,value,-gender) %>%
mutate(id = 1:nrow(.)) %>%
spread(gender,value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'),50,T),
male = sample(1:5,50,T),
female = sample(1:5,50,T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)
df_ %>%
gather(key, value, -gender) %>%
group_by(gender) %>%
mutate(id = row_number()) %>%
spread(gender, value)
NOTE: Replacing row_number() with 1:nrow(.) will fail because of the grouping. 1:nrow(.) is the sequence for the whole data frame (rather than one sequence per group), and mutate tries to assign all of it to each group. Hence the length error you get:
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead use ... %>% mutate(id = 1:length(key)), it will be fine.
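To make the difference between a whole-table sequence and a per-group sequence concrete, here is a small base R sketch on a made-up vector. It uses ave() in place of dplyr's grouped row_number(), purely for illustration.

```r
# Hypothetical grouping vector standing in for the gender column.
gender <- c("male", "female", "female", "male", "female")

# One sequence over the whole vector: what 1:nrow(.) gives.
id_whole <- seq_along(gender)

# A separate sequence that restarts within each group:
# the per-group numbering that row_number() gives after group_by(gender).
id_group <- ave(seq_along(gender), gender, FUN = seq_along)

print(data.frame(gender, id_whole, id_group))
```

The per-group ids are what let spread line up the n-th male value with the n-th female value on the same row.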
The result in both (row_number and 1:length(key)) is,
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
@elliot, while @Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused about why you're going through all that extra effort. Your ultimate goal as stated was to run chisq.test for gender against each of the others (dealer, store & transport). Your original dataset doesn't need any modification to do that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'),100,T),
dealer = sample(1:5,100,T),
store = sample(1:5,100,T),
transport = sample(1:5,100,T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
Can be used exactly as it stands! You may have other reasons to want to change the data but it is tidy as it is representing one case or person per row.
Edited (January 16th) To achieve your stated ultimate goal you just have to:
require(dplyr)
require(broom)
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y )))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package which will do Chi-square independence (association tests) and provide a much more informative output. Also note that from a statistical perspective you are running very many tests and should correct your confidence appropriately... see for example http://rpubs.com/ibecav/290361
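As a rough sketch of what such a correction could look like, base R's p.adjust can adjust the three p-values together. The Holm method is an assumption chosen for illustration, not something the answer prescribes, and the data is regenerated with data.frame rather than data_frame.

```r
# Toy data generated as in the question.
set.seed(123)
yourdata <- data.frame(
  gender = sample(c("male", "female"), 100, TRUE),
  dealer = sample(1:5, 100, TRUE),
  store = sample(1:5, 100, TRUE),
  transport = sample(1:5, 100, TRUE)
)

# One chi-square test of gender against each remaining column.
p_raw <- sapply(yourdata[-1],
                function(y) chisq.test(yourdata$gender, y)$p.value)

# Adjust for running several tests; see ?p.adjust for other methods.
p_holm <- p.adjust(p_raw, method = "holm")

print(data.frame(variable = names(p_raw), p_raw = p_raw, p_holm = p_holm))
```

The adjusted p-values are never smaller than the raw ones, so a result that is only borderline significant on its own may no longer be once all three tests are taken into account.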
I have a data frame that contains two variables, like this:
df <- data.frame(group=c(1,1,1,2,2,3,3,4),
type=c("a","b","a", "b", "c", "c","b","a"))
> df
group type
1 1 a
2 1 b
3 1 a
4 2 b
5 2 c
6 3 c
7 3 b
8 4 a
I want to produce a table showing for each group the combination of types it has in the data frame as one variable e.g.
group alltypes
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
The output would always list the types in the same order (e.g. groups 2 and 3 get the same result) and there would be no repetition (e.g. group 1 is not "a, b, a").
I tried doing this using dplyr and summarize, but I can't work out how to get it to meet these two conditions - the code I tried was:
> df %>%
+ group_by(group) %>%
+ summarise(
+ alltypes = paste(type, collapse=", ")
+ )
# A tibble: 4 × 2
group alltypes
<dbl> <chr>
1 1 a, b, a
2 2 b, c
3 3 c, b
4 4 a
I also tried turning type into a set of individual counts, but not sure if that's actually useful:
> df %>%
+ group_by(group, type) %>%
+ tally %>%
+ spread(type, n, fill=0)
Source: local data frame [4 x 4]
Groups: group [4]
group a b c
* <dbl> <dbl> <dbl> <dbl>
1 1 2 1 0
2 2 0 1 1
3 3 0 1 1
4 4 1 0 0
Any suggestions would be greatly appreciated.
I think you were very close. You could call the sort and unique functions to make sure your result adheres to your conditions as follows:
df %>% group_by(group) %>%
summarize(type = paste(sort(unique(type)),collapse=", "))
returns:
# A tibble: 4 x 2
group type
<int> <chr>
1 1 a, b
2 2 b, c
3 3 b, c
4 4 a
To expand on Florian's answer this could be extended to generating an ordered list based on values in your data set. An example could be determining the order of dates:
library(lubridate)
library(tidyverse)
# Generate random dates
set.seed(123)
Date = ymd("2018-01-01") + sort(sample(1:200, 10))
A = ymd("2018-01-01") + sort(sample(1:200, 10))
B = ymd("2018-01-01") + sort(sample(1:200, 10))
C = ymd("2018-01-01") + sort(sample(1:200, 10))
# Combine to data set
data = bind_cols(as.data.frame(Date), as.data.frame(A), as.data.frame(B), as.data.frame(C))
# Get order of dates for each row
data %>%
mutate(D = Date) %>%
gather(key = Var, value = D, -Date) %>%
arrange(Date, D) %>%
group_by(Date) %>%
summarize(Ord = paste(Var, collapse=">"))
Somewhat tangential to the original question but hopefully helpful to someone.
I have a data set like this:
df <- data.frame(situation1=rnorm(30),
situation2=rnorm(30),
situation3=rnorm(30),
models=c(rep("A",10), rep("B",10), rep("C", 10)))
where I compare three models (A,B,C) in three situations. I have 10 measurements for each model.
I now want to summarise this into ranks, i.e. how often each model wins in each situation. A win is defined by the highest value.
A final output could be something like this:
model situation1 situation2 situation3
A 4 3 3
B 7 1 2
C 1 4 5
In base R:
table(df$models,colnames(df[-4])[max.col(df[-4])])
# situation1 situation2 situation3
# A 2 4 4
# B 4 5 1
# C 2 4 4
The results differ from those in your question, since you didn't set a seed.
Here is an option using data.table
library(data.table)
setDT(df)[, lapply(Map(`==`, .SD, list(do.call(pmax, .SD))), sum), models]
Here's a dplyr option:
df %>%
group_by(models) %>%
mutate_all(funs(. == pmax(situation1, situation2, situation3))) %>%
summarise_all(sum)
Or possibly a little more efficient:
df %>%
mutate_at(vars(-models), funs(. == pmax(situation1, situation2, situation3))) %>%
group_by(models) %>%
summarise_all(sum)
## A tibble: 3 × 4
# models situation1 situation2 situation3
# <chr> <int> <int> <int>
#1 A 3 3 3
#2 B 3 5 1
#3 C 6 1 2
If you're looking for the minimum, use pmin instead of pmax. And in case there may be NAs, use the na.rm argument of pmax/pmin.
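A small base R sketch of the NA case, on a made-up data frame with one missing score. Treating a missing score as "not a win" is an assumption made for this illustration.

```r
# Hypothetical toy data with one missing measurement.
df <- data.frame(
  situation1 = c(1, 5, NA, 2),
  situation2 = c(3, 4, 6, 1),
  situation3 = c(2, 9, 7, 0),
  models = c("A", "A", "B", "B")
)
scores <- as.matrix(df[c("situation1", "situation2", "situation3")])

# Row-wise maximum that ignores missing values.
row_max <- pmax(df$situation1, df$situation2, df$situation3, na.rm = TRUE)

# TRUE where a situation holds the row maximum; a missing score never wins.
wins <- scores == row_max
wins[is.na(wins)] <- FALSE

# Count wins per model and situation.
counts <- rowsum(wins + 0, df$models)
print(counts)
```

Without na.rm = TRUE, the third row's maximum would be NA and none of its situations would be counted as a win.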
Final note: the result doesn't match OP's because the sample data was generated without setting a seed.