I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob#email.com', NA, 'joe#email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 joe#email.com
6 3 joe#email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 NA
6 3 NA
The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE
@aosmith pointed out this open issue on GitHub. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.
df %>% group_by(id) %>% fill(email)
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 NA
6 3 NA
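As a quick check, the sketch below runs the same grouped fill and compares it with the expected result. It assumes a tidyr release that already contains the fix described above; stringsAsFactors = FALSE, the explicit .direction = "down" (the default) and the name filled are my additions for illustration.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question; character column instead of factor
df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c('bob#email.com', NA, 'joe#email.com', NA, NA, NA),
                 stringsAsFactors = FALSE)

# Carry the last observation forward within each id only
filled <- df %>%
  group_by(id) %>%
  fill(email, .direction = "down") %>%
  ungroup()

# id 3 has no observed email, so it should stay NA
filled
```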
Luckily you can still use zoo::na.locf for this:
df %>%
group_by(id) %>%
mutate(email = zoo::na.locf(email, na.rm = FALSE))
# Source: local data frame [6 x 2]
# Groups: id [3]
#
# id email
# (dbl) (fctr)
# 1 1 bob#email.com
# 2 1 bob#email.com
# 3 2 joe#email.com
# 4 2 joe#email.com
# 5 3 NA
# 6 3 NA
Another option is to use do from dplyr:
df3 <- df %>% group_by(id) %>% do(fill(.,email))
Two questions: does it have to be duplicated, and do you have to use dplyr and tidyr?
Maybe this could be a solution?
(
bar <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob#email.com', NA, 'joe#email.com', NA, NA, NA))
)
#> id email
#> 1 bob#email.com
#> 1 <NA>
#> 2 joe#email.com
#> 2 <NA>
#> 3 <NA>
#> 3 <NA>
(
foo <- bar[!duplicated(bar$id),]
)
#> id email
#> 1 bob#email.com
#> 2 joe#email.com
#> 3 <NA>
This is kind of ugly, but it is another option that uses dplyr and works with your sample data. It assigns the first non-missing email in each group to every row of that group:
df %>%
group_by(id) %>%
mutate(email = email[ !is.na(email) ][1])
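Note that taking the first non-missing value per group is not the same thing as carrying the last observation forward once a group has a leading NA or more than one distinct value. The sketch below uses hypothetical data (df_multi is not from the question) and assumes a tidyr version in which grouped fill works, to show where the two approaches diverge.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one id with a leading NA and two distinct emails
df_multi <- data.frame(id = c(1, 1, 1, 1),
                       email = c(NA, "a#email.com", NA, "b#email.com"),
                       stringsAsFactors = FALSE)

# First non-NA value of the group, recycled to every row
first_non_na <- df_multi %>%
  group_by(id) %>%
  mutate(email = email[!is.na(email)][1]) %>%
  ungroup()

# Last observation carried forward within the group
locf <- df_multi %>%
  group_by(id) %>%
  fill(email) %>%
  ungroup()

first_non_na$email  # "a#email.com" in all four rows
locf$email          # NA, "a#email.com", "a#email.com", "b#email.com"
```

On the sample data from the question both give the same answer, because each id has at most one observed email and it comes first.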
I have come across this issue quite a few times. I do worry about using this
df2 <- df %>% group_by(id) %>% fill(email)
on large data sets, as I have had mixed results, and found the following workaround. split() combined with map_df() ensures that whatever you are doing is applied to a separate data frame for each id; map_df() then row-binds the individual data frames back together. It has also proved handy in lots of other circumstances. It is somewhat obsolete now that this issue has been fixed, but it is still a useful alternative that avoids group_by().
library(purrr)
df %>% split(.$id) %>% map_df(function(x) { x %>% fill(email) })
Hi and happy new year to all.
I have a tricky task (in my opinion) and I cannot find a way to solve it.
Please see the following toy data. The original dataset has hundreds of cols/rows.
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan"),
US=c(8,2,NA,7),
UK=c(5,4,1,7))
I want to create a new column, called "origin", which pastes the column name of each non-NA cell, separated by "|", ordered by the corresponding value. Higher values should be pasted first. For equal values (like Zlatan), the order isn't relevant: the output for Zlatan could be US|UK or UK|US.
This is the desired output:
I tried for some hours to solve it, but no approach worked. Maybe it makes sense to convert the values with as.factor...
Help is much appreciated. Thank you in advance!
Here's a dplyr approach. First, we can use rowwise to work on individual rows independently. Next, we can use c_across, which allows us to select values from that row only. order with na.last = NA returns the positions of the non-NA values in ascending order, rev reverses them so the highest value comes first, and we use the result to subset the vector c("US","UK").
paste with collapse = "|" allows us to put the values together with the separator. I added a row to see what would happen if they are both NA.
library(dplyr)
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK")[rev(order(c_across(US:UK), na.last = NA))], collapse = "|"))
# A tibble: 5 x 4
# Rowwise:
name US UK origin
<chr> <dbl> <dbl> <chr>
1 Amber 8 5 "US|UK"
2 Thomas 2 4 "UK|US"
3 Stefan NA 1 "UK"
4 Zlatan 7 7 "UK|US"
5 Bob NA NA ""
This is also trivially expanded to more columns:
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
rowwise() %>%
mutate(origin = paste(c("US","UK","AUS")[rev(order(c_across(US:AUS), na.last = NA))], collapse = "|"))
# A tibble: 5 x 5
# Rowwise:
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Thomas 2 4 2 UK|AUS|US
3 Stefan NA 1 NA UK
4 Zlatan 7 7 NA UK|US
5 Bob NA NA 1 AUS
Or, with tidyselect helpers, to use all columns except name:
test %>%
rowwise() %>%
mutate(origin = paste(names(across(-name))[rev(order(c_across(-name), na.last = NA))], collapse = "|"))
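To see what the rev(order(..., na.last = NA)) indexing in the snippets above does, here is a minimal base R sketch on a single hypothetical row (the named vector x and the names idx and origin are made up for illustration).

```r
# One hypothetical row: US = 2, UK = 4, AUS missing
x <- c(US = 2, UK = 4, AUS = NA)

# Positions of the non-NA values in ascending order; na.last = NA drops NAs
idx <- order(x, na.last = NA)

# Reverse so the highest value comes first, then look up the column names
origin <- paste(names(x)[rev(idx)], collapse = "|")
origin  # "UK|US"
```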
Another possibility with tidyverse. It is longer than the other two solutions, but it should work directly with a dataframe with as many columns as you need.
I changed the dataframe to long format, filtered out NAs, grouped by name, summarized using paste, and joined with the original dataframe to get the original columns (and rows with all NAs).
library(tidyverse)
test<-data.frame(name=c("Amber","Thomas","Stefan","Zlatan","Bob"),
US=c(8,2,NA,7,NA),
UK=c(5,4,1,7,NA),
AUS=c(1,2,NA,NA,1))
test %>%
# change to long format
tidyr::pivot_longer(cols=-name, names_to = "country", values_to = "value") %>%
# remove rows with NA
dplyr::filter(!is.na(value)) %>%
# group by name and sort
dplyr::group_by(name) %>% dplyr::arrange(-value) %>%
# create summary of countries for each name in column 'origin'
dplyr::summarise(origin=paste(country, collapse = "|")) %>%
# join with original data frame to include original columns (and names with only NA) and change NA to '' in origin
dplyr::right_join(test, by='name') %>% dplyr::mutate(origin=ifelse(is.na(origin), '', origin)) %>%
# move origin column to end
dplyr::relocate(origin, .after = last_col())
Result
name US UK AUS origin
<chr> <dbl> <dbl> <dbl> <chr>
1 Amber 8 5 1 US|UK|AUS
2 Bob NA NA 1 AUS
3 Stefan NA 1 NA UK
4 Thomas 2 4 2 UK|US|AUS
5 Zlatan 7 7 NA US|UK
Here's a different tidyverse solution using case_when:
library(tidyverse)
data <- data.frame(
"name" =c("Amber","Thomas","Stefan","Zlatan"),
"US" =c(8,2,NA,7),
"UK" =c(5,4,1,7))
data <- data %>% mutate(origin = case_when( US > UK ~ "US|UK",
UK >= US ~ "UK|US",
is.na(UK) & !is.na(US) ~ "US",
is.na(US) & !is.na(UK) ~ "UK"))
data
#> name US UK origin
#> 1 Amber 8 5 US|UK
#> 2 Thomas 2 4 UK|US
#> 3 Stefan NA 1 UK
#> 4 Zlatan 7 7 UK|US
Created on 2021-01-06 by the reprex package (v0.3.0)
I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>% mutate(newCol = sum(valueC)), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I count each unique value of idB only once? Is there anything else I can use instead of group_by in a dplyr %>% pipe?
I can't use summarise as I need to keep the values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
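We need unique with match. match(unique(idB), idB) returns the position of the first row for each distinct idB within the group, so each valueC is summed only once: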
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Another option is to take the distinct rows by 'grpA' and 'idB', group by 'grpA', sum 'valueC', and left_join the result back onto the original data:
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')
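For readers unfamiliar with the match(unique(idB), idB) idiom used in the first approach, here is a small base R sketch on the grpA == 1 rows of the sample data. The name first_pos is made up, and the !duplicated() line is an equivalent formulation I am adding, not part of the answer.

```r
# The rows of the sample data where grpA == 1
idB    <- c(1, 1, 2, 2, 3)
valueC <- c(10, 10, 20, 20, 10)

# Position of the first occurrence of each distinct idB
first_pos <- match(unique(idB), idB)
first_pos               # 1 3 5

# Sum valueC once per distinct idB
sum(valueC[first_pos])  # 40

# Equivalent: keep the rows whose idB has not been seen before
sum(valueC[!duplicated(idB)])  # 40
```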
I have a data set like this:
df <- data.frame(situation1=rnorm(30),
situation2=rnorm(30),
situation3=rnorm(30),
models=c(rep("A",10), rep("B",10), rep("C", 10)))
where I compare three models (A,B,C) in three situations. I have 10 measurements for each model.
I now want to summarise this into ranks, i.e. how often each model wins in each situation. A win is defined by the highest value.
A final output could be something like this:
model situation1 situation2 situation3
A 4 3 3
B 7 1 2
C 1 4 5
In base R:
table(df$models,colnames(df[-4])[max.col(df[-4])])
# situation1 situation2 situation3
# A 2 4 4
# B 4 5 1
# C 2 4 4
Results may differ from those in your post, since you didn't set a seed.
Here is an option using data.table
library(data.table)
setDT(df)[, lapply(Map(`==`, .SD, list(do.call(pmax, .SD))), sum), models]
Here's a dplyr option:
df %>%
group_by(models) %>%
mutate_all(funs(. == pmax(situation1, situation2, situation3))) %>%
summarise_all(sum)
Or possibly a little more efficient:
df %>%
mutate_at(vars(-models), funs(. == pmax(situation1, situation2, situation3))) %>%
group_by(models) %>%
summarise_all(sum)
## A tibble: 3 × 4
# models situation1 situation2 situation3
# <chr> <int> <int> <int>
#1 A 3 3 3
#2 B 3 5 1
#3 C 6 1 2
If you're looking for the minimum, use pmin instead of pmax. If there may be NAs, set na.rm = TRUE in pmax/pmin.
Final note: the result doesn't match OP's because the sample data was generated without setting a seed.
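funs() has been deprecated in newer dplyr releases, so the snippets above may warn or fail there. The following is a sketch of the same idea using across(), assuming dplyr 1.0 or later. The seed, the helper column row_max and the result name wins are my additions for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(situation1 = rnorm(30),
                 situation2 = rnorm(30),
                 situation3 = rnorm(30),
                 models = rep(c("A", "B", "C"), each = 10))

wins <- df %>%
  # row-wise maximum across the three situations
  mutate(row_max = pmax(situation1, situation2, situation3)) %>%
  group_by(models) %>%
  # count, per model, how often each situation holds the row maximum
  summarise(across(starts_with("situation"), ~ sum(.x == row_max)))

wins
```

Each row has exactly one winner here because the values are continuous, so the counts for each model add up to its 10 measurements.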
I am trying to rearrange rows into columns in a specific way (preferably using dplyr), but I don't really know where to start with this. I am trying to create one row for each person (Bill or Bob) and have all of that person's values on one row. So far I have
df<-data.frame(
Participant=c("bob1","bill1","bob2","bill2"),
No_Photos=c(1,4,5,6)
)
res<-df %>% group_by(Participant) %>% dplyr::summarise(phot_mean=mean(No_Photos))
which gives me:
Participant mean(No_Photos)
(fctr) (dbl)
1 bill1 4
2 bill2 6
3 bob1 1
4 bob2 5
GOAL:
mean_NO_Photos_1 mean_No_Photos_2
bob 1 5
bill 4 6
Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>% mutate(rep = extract_numeric(Participant),
Participant = gsub("[0-9]", "", Participant)) %>%
group_by(Participant, rep) %>%
summarise(mean = mean(No_Photos)) %>%
spread(rep, mean)
Source: local data frame [2 x 3]
Participant 1 2
(chr) (dbl) (dbl)
1 bill 4 6
2 bob 1 5
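extract_numeric() has since been deprecated in tidyr, and spread() has been superseded by pivot_wider(). The following is a sketch of the same pipeline for newer versions, assuming tidyr 1.0 or later and dplyr 1.0 or later. Pulling the digits out with gsub() and the names_prefix value are my choices, the latter made to match the column names in the stated goal.

```r
library(dplyr)
library(tidyr)

df <- data.frame(Participant = c("bob1", "bill1", "bob2", "bill2"),
                 No_Photos = c(1, 4, 5, 6))

res <- df %>%
  # split "bob1" into the repetition number (1) and the name ("bob")
  mutate(rep = as.integer(gsub("\\D", "", Participant)),
         Participant = gsub("[0-9]", "", Participant)) %>%
  group_by(Participant, rep) %>%
  summarise(mean = mean(No_Photos), .groups = "drop") %>%
  # one column per repetition
  pivot_wider(names_from = rep, values_from = mean,
              names_prefix = "mean_No_Photos_")

res
```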
My goal is to get the same number of rows for each split (based on the column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows needed for each initial, combine those extra initials with NA values, then rbind them to the data frame.
max(table(df$Initials)) gives the count for the initial with the most repeats, in this case 2 (for a). Subtracting table(df$Initials) from that maximum gives a vector with the number of rows to add for each initial. There's an added bonus to this method: by using table we automatically get a named vector.
We use the names of the new vector to know which initials to repeat, and its values to know how many times each should be repeated.
Because c() coerces the new row to character, the data column no longer stays numeric. To restore its class, assign the result (say, to newdf) and run newdf$data <- as.numeric(newdf$data).
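A variation on the same idea that avoids the type conversion is to build the padding rows as a data frame rather than a character vector. This is only a sketch: the names counts, to_add, padding and newdf are made up, and stringsAsFactors = FALSE is my assumption.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4),
                 stringsAsFactors = FALSE)

counts <- table(df$Initials)
to_add <- max(counts) - counts   # rows missing per initial: a = 0, b = 1

# Padding rows: each initial repeated as often as needed, data set to numeric NA
padding <- data.frame(
  Initials = rep(names(to_add), times = as.vector(to_add)),
  data = rep(NA_real_, sum(to_add)),
  stringsAsFactors = FALSE
)

newdf <- rbind(df, padding)
newdf <- newdf[order(newdf$Initials), ]  # keep each initial's rows together
rownames(newdf) <- NULL
newdf
```

Because both pieces are data frames with a numeric data column, rbind keeps data numeric and no as.numeric() step is needed afterwards.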