rearrange specific rows into columns using dplyr - r

I am trying to rearrange rows into columns in a specific way (preferably using dplyr), but I don't really know where to start. I want one row for each person (Bill or Bob), with all of that person's values on that row. So far I have:
df<-data.frame(
Participant=c("bob1","bill1","bob2","bill2"),
No_Photos=c(1,4,5,6)
)
res<-df %>% group_by(Participant) %>% dplyr::summarise(phot_mean=mean(No_Photos))
which gives me:
Participant mean(No_Photos)
(fctr) (dbl)
1 bill1 4
2 bill2 6
3 bob1 1
4 bob2 5
GOAL:
mean_NO_Photos_1 mean_No_Photos_2
bob 1 5
bill 4 6

Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>% mutate(rep = extract_numeric(Participant),
Participant = gsub("[0-9]", "", Participant)) %>%
group_by(Participant, rep) %>%
summarise(mean = mean(No_Photos)) %>%
spread(rep, mean)
Source: local data frame [2 x 3]
Participant 1 2
(chr) (dbl) (dbl)
1 bill 4 6
2 bob 1 5
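In later tidyr releases extract_numeric() is deprecated and spread() is superseded by pivot_wider(). The following is a minimal sketch of the same idea under those assumptions; it splits the name and the trailing digits with base sub() (my choice, not part of the answer above) and uses names_prefix to get the column names from the GOAL table.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Participant = c("bob1", "bill1", "bob2", "bill2"),
  No_Photos = c(1, 4, 5, 6)
)

res <- df %>%
  mutate(
    rep = sub("^[^0-9]*", "", Participant),        # trailing digits, e.g. "1"
    Participant = sub("[0-9]+$", "", Participant)  # name without the digits
  ) %>%
  group_by(Participant, rep) %>%
  summarise(mean = mean(No_Photos), .groups = "drop") %>%
  pivot_wider(names_from = rep, values_from = mean,
              names_prefix = "mean_No_Photos_")

res
```

The result should have one row per person with the columns mean_No_Photos_1 and mean_No_Photos_2.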

Related

is there an R code for the following data wrangling and transformation

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to create a new data set based on dat1. I need R code that automatically creates n s02 columns named s02__0, s02__1, s02__2, s02__3, s02__4 (so here n == 5). Then, based on the id in dat1, the code should allocate each s02 value to the respective column s02__0 to s02__4. Each row is uniquely identified within its id by a second ID, id_2, created from the row count. If an id has fewer s02 values than there are columns in the row, the remaining cells should be filled with ##N/A##. If an id has more than n s02 values, a new row with the next id_2 is created to hold the extra values, and every blank cell is again filled with ##N/A##.
From the data set above, I would like to have the following output:
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
# NA marks the cells that should read ##N/A## (a bare # would start an R comment)
s02__2<-c(3,3,3,NA,3,8,13,18,23,28)
s02__3<-c(4,4,4,NA,4,9,14,19,24,29)
s02__4<-c(NA,5,5,NA,5,10,15,20,25,NA)
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)
This can produce what you want:
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
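The ifelse(s02<=5,1,2) step above only distinguishes two rows per id, so it would not cover id 4 from the question, which has 29 values. A possible generalization, sketched here on the assumption that the values should be chunked by their position within each id, uses integer division and modulo on the row number. The helper names pos and col are illustrative only.

```r
library(dplyr)
library(tidyr)

# Full data from the question, including id 4
id  <- c(rep(1, 4), rep(2, 5), rep(3, 7), rep(4, 29))
s02 <- c(1:4, 1:5, 1:7, 1:29)
dat1 <- data.frame(id, s02)

n <- 5  # number of s02__ columns per output row

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(
    pos  = row_number() - 1,           # 0-based position within the id
    id_2 = pos %/% n + 1,              # which output row the value goes to
    col  = paste0("s02__", pos %% n)   # which output column it goes to
  ) %>%
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = col, values_from = s02)

dat2
```

Empty cells come out as NA. If the literal text ##N/A## is required, the value columns would have to be converted to character first, since a numeric column cannot hold that string.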

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join, because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. I also don't want it to link data the way VLOOKUP in Excel does, which finds the first match and returns it, so that with duplicate ID numbers it only ever returns the first match. I need it to return the first match, then the second, then the third, and so on (the dates are sorted so that the newest date always comes first for every ID number), but without adding rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-ids that follow the row order within each ID. Then you can use an inner_join on both columns to join everything together.
Since you didn't provide example data sets, I created two to show the principle (the data is listed below the output).
library(dplyr)
df1 <- df1 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
df2 <- df2 %>%
group_by(ID) %>%
mutate(sub_id = row_number())
outcome <- df1 %>% inner_join(df2, by = c("ID", "sub_id"))
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an autoincrement id for each group, then join as usual
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05

complete.cases for group instead of observation?

If I have tidied data:
df = expand.grid(Name=c("Sub1","Sub2","Sub3"),Vis=c("Yes","No")) %>%
mutate(KPR_mean=c(NA,1,3,2,3,2),KPR_range=c(NA,4,4,2,6,5)) %>%
filter(complete.cases(.))
I'd like to filter out incomplete factor combinations, to be left with a full factorial model. Right now, I'm doing so as follows:
df %>%
unite(KPR_mean_range,KPR_mean,KPR_range) %>%
spread(Vis,KPR_mean_range) %>%
filter(complete.cases(.)) %>%
gather(Vis,KPR_mean_range,-Name) %>%
separate(KPR_mean_range,c("KPR_mean","KPR_range"),sep="_")
But that seems really verbose, and also difficult to extend once there are multiple factors and more variables. Is there a way to filter on a grouping variable, instead of a row? I.e., for each level of Name, if filter(complete.cases(.)) would remove a row from that group, then remove the entire group instead?
For the new data, expand your data to all factor combinations with complete(), group by whichever variable you want the complete cases in, and filter out groups with NAs:
df %>% complete(Vis, Name) %>% group_by(Name) %>% filter(!any(is.na(KPR_mean)))
# Source: local data frame [4 x 4]
# Groups: Name [2]
#
# Vis Name KPR_mean KPR_range
# (fctr) (fctr) (dbl) (dbl)
# 1 Yes Sub2 1 4
# 2 Yes Sub3 3 4
# 3 No Sub2 3 6
# 4 No Sub3 2 5
Here is one option with data.table. We convert the 'data.frame' to a 'data.table', specifying the key columns (setDT(df, key = ...)), do a cross join and, grouped by 'Name', keep a group's rows only if there are no 'NA' values in 'KPR_mean'.
library(data.table)
setDT(df, key = c("Name", "Vis"))[CJ(Name, Vis, unique=TRUE)][,
if(all(!is.na(KPR_mean))) .SD , Name]
# Name Vis KPR_mean KPR_range
#1: Sub2 Yes 1 4
#2: Sub2 No 3 6
#3: Sub3 Yes 3 4
#4: Sub3 No 2 5
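Because the incomplete rows have already been dropped from df, another way to read the question is "keep only the groups that still contain every level of Vis". The following is a rough sketch of that reading, not taken from either answer; n_levels is a helper name introduced here for illustration.

```r
library(dplyr)

df <- expand.grid(Name = c("Sub1", "Sub2", "Sub3"), Vis = c("Yes", "No")) %>%
  mutate(KPR_mean = c(NA, 1, 3, 2, 3, 2), KPR_range = c(NA, 4, 4, 2, 6, 5)) %>%
  filter(complete.cases(.))

# Number of Vis levels a complete group must contain
n_levels <- n_distinct(df$Vis)

full <- df %>%
  group_by(Name) %>%
  filter(n_distinct(Vis) == n_levels) %>%  # drop groups missing a level
  ungroup()

full
```

With several crossed factors, the same check could presumably compare n() within each group against the product of the factor level counts, but that extension is untested here.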

dplyr summarize date by weekdays

I have multiple observations from different persons on different dates, e.g.
df <- data.frame(id= c(rep(1,5), rep(2,8), rep(3,7)),
dates = seq.Date(as.Date("2015-01-01"), by="month", length=20))
Here we have 3 people (id), each with a different number of observations.
I now want to count the Mondays, Tuesdays, etc. for each person.
This should be done using dplyr and summarize, because my real data set has many more columns, which I summarize with different statistics.
It should be something like this:
summa <- df %>% group_by(id) %>%
summarize(mondays = #numberof mondays,
tuesdays = #number of tuesdays,
.........)
How can this be achieved?
I would do the following:
summa <- count(df, id, day = weekdays(dates))
# or:
# summa <- df %>%
# mutate(day = weekdays(dates)) %>%
# count(id, day)
head(summa)
#Source: local data frame [6 x 3]
#Groups: id [2]
#
# id day n
# (dbl) (chr) (int)
#1 1 Donnerstag 1
#2 1 Freitag 1
#3 1 Mittwoch 1
#4 1 Sonntag 2
#5 2 Dienstag 2
#6 2 Donnerstag 1
But you can also reshape to wide format:
library(tidyr)
spread(summa, day, n, fill=0)
#Source: local data frame [3 x 8]
#Groups: id [3]
#
# id Dienstag Donnerstag Freitag Mittwoch Montag Samstag Sonntag
# (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 1 0 1 1 1 0 0 2
#2 2 2 1 1 1 1 1 1
#3 3 1 0 2 1 2 0 1
My results are in German, but yours would be in your own language of course. The column names are German weekdays.
If you want to use summarize explicitly you can achieve the same as above using:
summa <- df %>%
group_by(id, day = weekdays(dates)) %>%
summarize(n = n()) # or do something with summarise_each() for many columns
You could use the lubridate package:
library(lubridate)
summa <- df %>% group_by(id) %>%
summarize(mondays = sum(wday(dates) == 2),
....
Base Date functions:
summa <- df %>% group_by(id) %>%
summarise(monday = sum(weekdays(dates) == "Monday"),
tuesday = sum(weekdays(dates) == "Tuesday"))
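Since weekdays() returns names in the session's language (German in the output above), comparing against "Monday" only works in an English locale. A locale-independent sketch, assuming base R's as.POSIXlt(), whose wday component runs from 0 (Sunday) to 6 (Saturday); the column name wd is introduced here for illustration:

```r
library(dplyr)

df <- data.frame(id = c(rep(1, 5), rep(2, 8), rep(3, 7)),
                 dates = seq.Date(as.Date("2015-01-01"), by = "month",
                                  length.out = 20))

summa <- df %>%
  mutate(wd = as.POSIXlt(dates)$wday) %>%  # 0 = Sunday, 1 = Monday, ...
  group_by(id) %>%
  summarise(mondays  = sum(wd == 1),
            tuesdays = sum(wd == 2),
            n_obs    = n())                # other statistics can go here

summa
```

The Monday and Tuesday counts should agree with the Montag and Dienstag columns of the wide table above.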

group_by() into fill() not working as expected

I'm trying to do a Last Observation Carried Forward operation on some poorly formatted data using dplyr and tidyr. It isn't working as I'd expect.
library(dplyr)
library(tidyr)
df <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob#email.com', NA, 'joe#email.com', NA, NA, NA))
df2 <- df %>% group_by(id) %>% fill(email)
This results in:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 joe#email.com
6 3 joe#email.com
I expect it to be:
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 NA
6 3 NA
The reason I expect it to be the latter is because of group_by's documentation saying, "The group_by function takes an existing tbl and converts it into a grouped tbl where operations are performed "by group"." The group in this case is determined by the id variable, and the following operation is fill(email). However, it's pretty clearly NOT doing that.
And before anybody asks, it makes no difference if the fields are both character instead of numeric or factor.
UPDATE
@aosmith pointed out this open issue on GitHub. I'm going to say that there won't be a proper solution to this problem until that issue is resolved. Everything else would just be a workaround. So, if somebody makes a successful PR addressing that issue and posts it here, I'd be happy to mark it as the solution.
Looks like this has been fixed in the development version of tidyr. You now get the expected result per id using fill from tidyr_0.3.1.9000.
df %>% group_by(id) %>% fill(email)
Source: local data frame [6 x 2]
Groups: id [3]
id email
(dbl) (fctr)
1 1 bob#email.com
2 1 bob#email.com
3 2 joe#email.com
4 2 joe#email.com
5 3 NA
6 3 NA
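As a small self-contained check of the grouped behaviour, assuming a tidyr release that includes this fix, the sketch below builds the sample data with character columns, fills within each id, and drops the grouping afterwards. The explicit .direction = "down" only spells out the default.

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 1, 2, 2, 3, 3),
                 email = c("bob#email.com", NA, "joe#email.com", NA, NA, NA),
                 stringsAsFactors = FALSE)

filled <- df %>%
  group_by(id) %>%
  fill(email, .direction = "down") %>%  # carry values forward within each id
  ungroup()

filled
```

Rows for id 3 should stay NA, because there is no earlier value inside that group to carry forward.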
Luckily you can still use zoo::na.locf for this:
df %>%
group_by(id) %>%
mutate(email = zoo::na.locf(email, na.rm = FALSE))
# Source: local data frame [6 x 2]
# Groups: id [3]
#
# id email
# (dbl) (fctr)
# 1 1 bob#email.com
# 2 1 bob#email.com
# 3 2 joe#email.com
# 4 2 joe#email.com
# 5 3 NA
# 6 3 NA
Another option is to use do from dplyr:
df3 <- df %>% group_by(id) %>% do(fill(.,email))
Two questions: does the data have to stay duplicated, and do you have to use dplyr and tidyr?
Maybe this could be a solution?
(
bar <- data.frame(id=c(1,1,2,2,3,3),
email=c('bob#email.com', NA, 'joe#email.com', NA, NA, NA))
)
#> id email
#> 1 bob#email.com
#> 1 <NA>
#> 2 joe#email.com
#> 2 <NA>
#> 3 <NA>
#> 3 <NA>
(
foo <- bar[!duplicated(bar$id),]
)
#> id email
#> 1 bob#email.com
#> 2 joe#email.com
#> 3 <NA>
This is kind of ugly, but it is another option that uses dplyr and works with your sample data
df %>%
group_by(id) %>%
mutate(email = email[ !is.na(email) ][1])
I have come across this issue quite a few times, and I do worry about using this:
df2 <- df %>% group_by(id) %>% fill(email)
on large data sets, as I have had mixed results, so I found the following workaround. Using split with map_df ensures that whatever you are doing is applied to a separate data frame for each id; map_df then row-binds all the individual data frames again. It has also proved handy in lots of other circumstances. It is somewhat obsolete now that this issue has been fixed, but it is still a useful alternative that avoids group_by().
library(purrr) # provides map_df()
df %>% split(.$id) %>% map_df(function(x){ x %>% fill(email)})
