I am using the tidyr package in R and am running into an issue when using the spread() command with duplicate identifiers.
Here is a mock example that illustrates the problem:
X = data.frame(name = c("Eric","Bob","Mark","Bob","Bob","Mark","Eric","Bob","Mark"),
               metric = c("height","height","height","weight","weight","weight","grade","grade","grade"),
               values = c(6, 5, 4, 120, 118, 180, "A", "B", "C"),
               stringsAsFactors = FALSE)
tidyr::spread(X,metric,values)
So when I run this command I get the following error:
Error: Duplicate identifiers for rows (4, 5)
which makes sense as an error, because Bob is recorded twice for weight. It's actually not a mistake, though, because Bob did have his weight recorded twice. What I would like to be able to do is run the command and have it give me back the following:
name height weight grade
Eric 6 NA A
Bob 5 120 B
Bob 5 118 B
Mark 4 180 C
Is spread not the command I should be using to accomplish this? And if there isn't an easy solution, is there a simple way to remove the record with the lowest weight for duplicates when running the spread() command?
After making unique identifiers, which can be done by adding a new variable representing the row index within each group, you can use fill to populate the second "Bob" row with duplicated values for "height" and "grade".
You can remove the index variable at the end via select.
library(dplyr)
library(tidyr)

X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%  # index within each name/metric group makes identifiers unique
  spread(metric, values) %>%
  fill(grade, height) %>%         # copies grade and height down into the second "Bob" row
  select(-row)
# A tibble: 4 x 4
# Groups: name [3]
name grade height weight
<chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Bob B 5 118
3 Eric A 6 <NA>
4 Mark C 4 180
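As an aside, spread() is superseded in current tidyr by pivot_wider(); here is a sketch of the same approach with it (assuming tidyr >= 1.0.0), regrouping explicitly so fill() stays within each name:

X %>%
  group_by(name, metric) %>%
  mutate(row = row_number()) %>%  # per-group index keeps the rows unique
  ungroup() %>%
  pivot_wider(names_from = metric, values_from = values) %>%
  group_by(name) %>%              # so fill() only copies within each person
  fill(grade, height) %>%
  ungroup() %>%
  select(-row)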
To filter to the maximum value of each name/metric group:
X %>%
  group_by(name, metric) %>%
  filter(values == max(values)) %>%
  spread(metric, values)
# A tibble: 3 x 4
# Groups: name [3]
name grade height weight
* <chr> <chr> <chr> <chr>
1 Bob B 5 120
2 Eric A 6 <NA>
3 Mark C 4 180
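One caveat worth noting: because the grade letters force values to be a character column, max() here compares strings lexicographically. For "118" vs "120" that happens to agree with numeric order, but it won't in general (for instance "95" > "120" as strings). A minimal sketch that compares numerically wherever a metric's values are numeric:

X %>%
  group_by(name, metric) %>%
  mutate(num = suppressWarnings(as.numeric(values))) %>%  # NA for non-numeric metrics like grade
  filter(is.na(num) | num == max(num)) %>%                # keep all non-numeric rows, numeric max otherwise
  select(-num) %>%
  spread(metric, values)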
I'm trying to calculate the share of a certain variable cost for each country relative to the total. However, when I try to create the "share" column through mutate, it yields every value as 1.
The code I'm using is as follows:
db %>%
group_by(country,group) %>%
summarize(cost=sum(cost)) %>%
mutate(share=cost/sum(cost))
This is the table it is generating:
# Groups: country [18]
country group cost share
<chr> <chr> <dbl> <dbl>
1 AT A 7810. 1
2 AU C 7786. 1
3 CA C 5920. 1
4 KO B 172702. 1
5 DE A 40894. 1
6 ES A 26357. 1
7 FR A 65735. 1
8 GB C 11240. 1
9 IT A 85045. 1
10 JP B 10069. 1
I've tried inverting the positions of group and country in group_by(), but the share column still returns the shares as a proportion of the group instead of the total sum. Why is this happening and how can I fix it?
It's because the default behavior of summarise is to return a grouped data frame when you group by more than one variable: it drops the last grouping variable and keeps the rest, so the result here is still grouped by country and mutate computes its sum(cost) within each country.
To solve it you can add an ungroup:
db %>%
  group_by(country, group) %>%
  summarize(cost = sum(cost)) %>%
  ungroup() %>%  # drop the remaining country grouping so sum(cost) is the grand total
  mutate(share = cost / sum(cost))
Or, from dplyr >= 1.0.0:
db %>%
  group_by(country, group) %>%
  summarize(cost = sum(cost), .groups = "drop") %>%  # drop all grouping on the way out
  mutate(share = cost / sum(cost))
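Since dplyr 1.1.0 there is also per-operation grouping with .by, which never returns a grouped result, so no ungrouping step is needed; a sketch of the same fix, assuming dplyr >= 1.1.0:

db %>%
  summarize(cost = sum(cost), .by = c(country, group)) %>%  # result comes back ungrouped
  mutate(share = cost / sum(cost))                          # sum(cost) is now the grand total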
Given the following data:
id datee      price quant discrete_x
1  2018-12-19     4 -3000 A
1  2018-12-04     4  3000 A
1  2018-12-21     4  3000 B
1  2018-12-20     3  2000 A
...
Desired output:
id datee      price quant discrete_x
1  2018-12-21     4  3000 B
1  2018-12-20     3  2000 A
...
In this case, it is quite clear that the quantity (quant) of 3000 is refunded, then bought again. I would like to remove the two rows because they cancel each other out. Given that id and quant match, and that the refund happens once and after a purchase of a matching quant, how can I remove all such pairs for each id value?
I've been considering (but stuck on) two ideas so far:
1) Within an arranged group_by group, check the later dates within a column to see if quant matches as an opposite value
2) A for loop within a for loop
I feel that a for loop within a for loop is better, but I'm not sure how I would match on discrete_x. What would your approach be? Would you use a for loop within a for loop?
Hope this solution will work for your problem, assuming your data frame is named df:
df$quant <- abs(df$quant)                      # compare magnitudes, ignoring the refund's minus sign
df1 <- df[!duplicated(df[c("id","quant")]), ]  # keep only the first occurrence of each id/quant pair
This is a very ugly implementation, but I think this might work. We can create a filtering column after grouping by id and arranging by date. (Note that it only cancels a refund against the purchase row that ends up directly above it after sorting; pairs separated by other rows would survive.)
library(dplyr)
library(tidyr)

df %>%
  group_by(id) %>%
  arrange(datee) %>%
  mutate(f = replace_na(lead(quant) + quant == 0, FALSE),  # TRUE where the next row cancels this one
         f = f | lag(f, default = FALSE)) %>%              # flag the cancelling row as well
  filter(!f) %>%
  select(-f)
#> # A tibble: 2 x 5
#> # Groups: id [1]
#> id datee price quant discrete_x
#> <dbl> <date> <dbl> <dbl> <chr>
#> 1 1 2018-12-20 3 2000 A
#> 2 1 2018-12-21 4 3000 B
I have the following data frame
Name Product Unit Class
2 sushil seeds
4 sanju Soap 46 C
5 rahul 5
7 sanju 4 E
9 sushil 20 B
10 rahul Soap A
and what I need is a data frame without duplicate rows, subject to the conditions below:
If a row has all column values filled, eliminate its duplicate row.
If a row has some column values empty, replace the empty cells with the corresponding values from its duplicate row.
The desired result should look like this.
Name Product Unit Class
1 sushil seeds 20 B
2 sanju Soap 46 C
3 rahul Soap 5 A
Thanks in advance for the help!
Here is the code for the data frame:
Name <- c("abbas","sushil","abbas","sanju","rahul","shweta","sanju","rajiv","sushil","rahul")
Unit <- c(18," ",18,46,5,67,4,3,20," ")  # a single space " " stands in for a missing value
Product <- c("Rice","seeds","Rice","Soap"," ","Towel"," "," "," ","Soap")
Class <- c("A"," ","A","C"," ","D","E","A","B","A")
Data <- data.frame(Name, Product, Unit, Class,
                   stringsAsFactors = FALSE)  # keep character columns (needed on R < 4.0)

duplicate <- which(duplicated(Data))
unique <- Data[!duplicated(Data), ]
NewData <- unique[unique$Name %in% unique$Name[duplicated(unique$Name)], ]
In the following I am assuming that the primary ID is the Name column.
First part (harder):
library(tidyverse)
df <- Data           # the question's data frame
df[df == " "] <- NA  # the blanks are single spaces in the sample data

df2 <- df %>%
  mutate(complete = complete.cases(df)) %>%
  group_by(Name) %>%
  mutate(any_complete = any(complete)) %>%
  filter(complete | (!complete & !any_complete)) %>%
  select(-complete, -any_complete)
Result:
# A tibble: 5 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <chr> <chr>
1 sushil seeds NA NA
2 sanju Soap 46 C
3 rahul NA 5 NA
4 sushil NA 20 B
5 rahul Soap NA A
Explanation: first we replace all the blank strings with actual NAs. Then we create a column, complete, which checks whether all of the columns are filled for a given row. Next we create another column that tells us whether, for any given Name, there is a complete observation. Finally, we keep only the rows which are either (i) complete, or (ii) not complete but belonging to a Name with no complete observation.
Second task is simpler, but boring:
df2 %>% arrange(Name, Product) %>% fill(Product) %>%  # NAs sort last, so fill() copies values down within each Name
  arrange(Name, Unit) %>% fill(Unit) %>%
  arrange(Name, Class) %>% fill(Class) %>%
  filter(!duplicated(Name))
Result:
# A tibble: 3 x 4
# Groups: Name [3]
Name Product Unit Class
<chr> <chr> <chr> <chr>
1 rahul Soap 5 A
2 sanju Soap 46 C
3 sushil seeds 20 B
This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
  mutate(newdescription = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                    TRUE ~ description)) %>%
  group_by(id, newdescription) %>%
  summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement. Try group_by(id, init_cont, family)
Because your id seems to be mapped to init_cont and family already, adding in these values won't change your summarization result. Then you have all the columns you want with no extra work.
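Applied to the question's pipeline, that suggestion looks like this (a sketch):

df %>%
  mutate(newdescription = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                    TRUE ~ description)) %>%
  group_by(id, init_cont, family, newdescription) %>%  # the extra columns ride along in the grouping
  summarise(value = mean(as.numeric(value)))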
If you have a lot of columns you could try something like the code below. Essentially, do a left_join of your summarised data onto your original data, using the . so you don't have to store off a new dataframe. Then, once joined (by id and the description which we modified in place) you'll have two value columns, suffixed with .x and .y; drop the original, and then use distinct to get rid of the now-duplicated 'impact' rows.
df %>%
  mutate(description = case_when(description %in% c("impacteach", "impactover") ~ "impact",
                                 TRUE ~ description)) %>%
  left_join(. %>%
              group_by(id, description) %>%
              summarise(value = mean(as.numeric(value))),
            by = c('id', 'description')) %>%
  select(-value.x) %>%
  rename(value = value.y) %>%  # the joined mean comes back suffixed with .y
  distinct()
gsub can be used to collapse any description starting with impact down to just impact, and then group_by from the dplyr package helps in summarising the value.
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
My ultimate goal is to do a series of chisq.test's on this data, comparing the values of 'dealer', 'store' and 'transport' by 'gender'. I'm using spread and gather to create a 'female' column and a 'male' column, then planned to use group_by and map to run the chisq.test by group of 'key', which is created in my gather call. I'm doing something wrong, because I'm getting grouped NAs back.
The code below produces my dilemma.
set.seed(123)
df_ <- data_frame(gender = sample(c('male','female'), 100, T),
                  dealer = sample(1:5, 100, T),
                  store = sample(1:5, 100, T),
                  transport = sample(1:5, 100, T))
df_ %>%
gather(key,value,-gender) %>%
mutate(id = 1:nrow(.)) %>%
spread(gender,value)
Here is a data_frame of my desired outcome.
data_frame(key = sample(c('dealer','store','transport'), 50, T),
           male = sample(1:5, 50, T),
           female = sample(1:5, 50, T))
You need to group_by(gender) before adding your id and spreading, i.e.
library(tidyverse)
df_ %>%
gather(key, value, -gender) %>%
group_by(gender) %>%
mutate(id = row_number()) %>%
spread(gender, value)
NOTE: substituting row_number() with 1:nrow(.) will fail because of the grouping: it builds a sequence over the whole data frame (rather than one per group) and tries to assign it within each group, hence the length error
Error in mutate_impl(.data, dots) :
Column id must be length 156 (the group size) or one, not 300
If you instead use ... %>% mutate(id = 1:length(key)) it will be fine, because length(key) is evaluated per group.
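Spelled out, that per-group variant is (a sketch equivalent to the row_number() version above):

df_ %>%
  gather(key, value, -gender) %>%
  group_by(gender) %>%
  mutate(id = 1:length(key)) %>%  # length(key) is the current group's size, so this is a per-group sequence
  spread(gender, value)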
The result in both cases (row_number() and 1:length(key)) is:
# A tibble: 168 x 4
key id female male
* <chr> <int> <int> <int>
1 dealer 1 3 4
2 dealer 2 3 2
3 dealer 3 1 4
4 dealer 4 5 3
5 dealer 5 4 4
6 dealer 6 5 2
7 dealer 7 3 3
8 dealer 8 1 2
9 dealer 9 2 5
10 dealer 10 2 2
# ... with 158 more rows
@elliot, while @Sotos has given a great answer to the challenge you were having with the tidyverse, I'm a bit confused about why you're going through all that extra effort. Your ultimate goal as stated was to run chisq.test for gender against each of the others (dealer, store & transport). Your original dataset doesn't need any modification to do that!
require(tidyverse)
set.seed(123)
yourdata <- data_frame(gender = sample(c('male','female'), 100, T),
                       dealer = sample(1:5, 100, T),
                       store = sample(1:5, 100, T),
                       transport = sample(1:5, 100, T))
yourdata
# A tibble: 100 x 4
gender dealer store transport
<chr> <int> <int> <int>
1 female 2 2 5
2 male 2 4 2
3 female 2 2 1
Your data can be used exactly as it stands! You may have other reasons to want to change it, but it is tidy as it is, representing one case (person) per row.
Edited (January 16th): To achieve your stated ultimate goal you just have to:
require(dplyr)
require(broom)

# run gender against each of the other columns and tidy each htest result
allofthem <- lapply(yourdata[-1], function(y) tidy(chisq.test(x = yourdata$gender, y = y)))
allofthem <- bind_rows(allofthem, .id = "dependentv")
allofthem
You may also want to look at the lsr package, which will do chi-square tests of independence (association tests) and provide much more informative output. Also note that from a statistical perspective you are running very many tests and should correct your confidence level appropriately; see for example http://rpubs.com/ibecav/290361
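A minimal sketch of that lsr route, assuming lsr's associationTest() formula interface (check ?associationTest for the details):

library(lsr)

# chi-square test of association between gender and one rating column;
# the output includes expected frequencies and an effect size
associationTest(~ gender + dealer, data = as.data.frame(yourdata))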