complete.cases for group instead of observation? - r

If I have tidied data:
df = expand.grid(Name=c("Sub1","Sub2","Sub3"),Vis=c("Yes","No")) %>%
mutate(KPR_mean=c(NA,1,3,2,3,2),KPR_range=c(NA,4,4,2,6,5)) %>%
filter(complete.cases(.))
I'd like to filter out incomplete factor combinations, to be left with a full factorial model. Right now, I'm doing so as follows:
df %>%
unite(KPR_mean_range,KPR_mean,KPR_range) %>%
spread(Vis,KPR_mean_range) %>%
filter(complete.cases(.)) %>%
gather(Vis,KPR_mean_range,-Name) %>%
separate(KPR_mean_range,c("KPR_mean","KPR_range"),sep="_")
But that seems really verbose, and also difficult to extend once there are multiple factors and more variables. Is there a way to filter on a grouping variable, instead of a row? I.e., for each level of Name, if filter(complete.cases(.)) would remove a row from that group, then remove the entire group instead?

For the new data, use complete() to expand the data to every Vis/Name combination (missing combinations are filled with NA), group by whichever variable you want complete cases within, and filter out any group that contains an NA:
df %>% complete(Vis, Name) %>% group_by(Name) %>% filter(!any(is.na(KPR_mean)))
# Source: local data frame [4 x 4]
# Groups: Name [2]
#
# Vis Name KPR_mean KPR_range
# (fctr) (fctr) (dbl) (dbl)
# 1 Yes Sub2 1 4
# 2 Yes Sub3 3 4
# 3 No Sub2 3 6
# 4 No Sub3 2 5
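
If there are several value columns, the same idea extends by checking each of them inside filter. A minimal sketch, assuming the example df from the question and that KPR_mean and KPR_range are the only columns that need checking (complete_groups is just an illustrative name):

```r
library(dplyr)
library(tidyr)

# Example data from the question; the Sub1/Yes row is dropped by complete.cases
df <- expand.grid(Name = c("Sub1", "Sub2", "Sub3"), Vis = c("Yes", "No")) %>%
  mutate(KPR_mean = c(NA, 1, 3, 2, 3, 2), KPR_range = c(NA, 4, 4, 2, 6, 5)) %>%
  filter(complete.cases(.))

complete_groups <- df %>%
  complete(Vis, Name) %>%                        # restore missing combinations as NA rows
  group_by(Name) %>%
  filter(!anyNA(KPR_mean), !anyNA(KPR_range)) %>% # keep a group only if both columns are complete
  ungroup()

complete_groups
```

Each condition evaluates to a single TRUE or FALSE per group, so a group is either kept whole or dropped whole.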

Here is one option with data.table. We convert the 'data.frame' to a 'data.table', specifying the key columns (setDT(df, ...)), and do a cross join (CJ) to restore the missing combinations. Then, grouped by 'Name', we return the group's rows (.SD) only if there are no 'NA' values in 'KPR_mean'.
library(data.table)
setDT(df, key = c("Name", "Vis"))[CJ(Name, Vis, unique=TRUE)][,
if(all(!is.na(KPR_mean))) .SD , Name]
# Name Vis KPR_mean KPR_range
#1: Sub2 Yes 1 4
#2: Sub2 No 3 6
#3: Sub3 Yes 3 4
#4: Sub3 No 2 5

Related

How to summarize_each with mixed column class

Consider the situation, where I want to summarize_each a data.frame with mixed column type.
> (temp=data.frame(ID=c(1,1,2,2),gender=c("M","M","F","F"),val1=rnorm(4),val2=rnorm(4)))
ID gender val1 val2
1 1 M -1.7944804 0.5232313
2 1 M 0.3938437 -0.8424086
3 2 F -0.3190777 0.3220580
4 2 F 1.3667340 -0.6031376
> temp%>%group_by(ID)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
ID gender val1 val2
(dbl) (lgl) (dbl) (dbl)
1 1 NA -0.7003184 -0.1595886
2 2 NA 0.5238282 -0.1405398
This doesn't work because mean(gender) doesn't make sense.
Question:
If all my non-numeric columns are characteristic of ID, thus are identical within each ID, can I somehow get summarize_each to return that 'unique' value?
> temp%>%group_by(ID,gender)%>%summarize_each(funs(mean))
Source: local data frame [2 x 4]
Groups: ID [?]
ID gender val1 val2
(dbl) (fctr) (dbl) (dbl)
1 1 M -0.7003184 -0.1595886
2 2 F 0.5238282 -0.1405398
is the output that I want, but I somehow feel like this is doing unnecessary nested group_by because there really is nothing to group within ID.
One option would be gather/spread from tidyr. Reshape to 'long' format with gather, group by 'ID' and 'var', take the first element of 'gender' and the mean of 'val', then spread back to 'wide' format.
library(tidyr)
library(dplyr)
gather(temp, var, val, val1:val2) %>%
group_by(ID, var) %>%
summarise(gender = first(gender), val = mean(val)) %>%
spread(var, val)
Another option is mutate_if with unique. After grouping by 'ID', we replace the numeric columns with their group means using mutate_if. As the other columns (i.e. 'gender') remain in the output, we can just call unique to drop the duplicated rows.
temp %>%
group_by(ID) %>%
mutate_if(is.numeric, mean) %>%
unique()
# ID gender val1 val2
# <int> <chr> <dbl> <dbl>
#1 1 M -0.7003184 -0.1595886
#2 2 F 0.5238281 -0.1405398
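
When there are only a few columns, it can also be spelled out directly in summarise. A minimal sketch, assuming gender really is constant within each ID (the seed is arbitrary and res is an illustrative name):

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the illustration is reproducible
temp <- data.frame(ID = c(1, 1, 2, 2),
                   gender = c("M", "M", "F", "F"),
                   val1 = rnorm(4), val2 = rnorm(4))

res <- temp %>%
  group_by(ID) %>%
  summarise(gender = first(gender),  # constant within ID, so the first value represents the group
            val1 = mean(val1),
            val2 = mean(val2))

res
```

This avoids the nested group_by, at the cost of naming every column explicitly.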

Average of Columns by Unique ID in R

I would like to average columns in a data set based on a unique identifier. I do not know ahead of time how many columns I will have for each unique identifier or what order they will come in. The unique IDs are all known beforehand and are lists of weeks. I have found solutions for regular patterns but not solutions that use the actual column headers to work out the average. Thanks for any and all help.
I present the original data and desired result. In the example there are only 2 unique IDs.
x = read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
sep = "", header = TRUE)
x
desired.outcome = read.table(text = "
site wk1avg wk2avg
1 5.3 4
2 26.7 20
3 3.3 NA
4 NA 100",
sep = "", header = TRUE)
If your original data file has duplicated column names, read.table will change them so all the columns have unique values (as you can see by checking x in your example after it's loaded). In fact, the code below depends on that happening, because melt will drop columns with duplicated names. Then we use mutate to remove the extra text added by read.table to de-duplicate the column names so that we can group properly by week.
library(reshape2)
library(dplyr)
x %>% melt(id.var="site") %>% # Convert to long format
mutate(variable = gsub("\\..*", "", variable)) %>% # "re-duplicate" original column names
group_by(site, variable) %>%
summarise(mn = mean(value)) %>%
dcast(site ~ variable, value.var = "mn") # name the value column explicitly instead of letting dcast guess
site wk1 wk2
1 1 5.333333 4
2 2 26.666667 20
3 3 3.333333 NA
4 4 NA 100
Here's a tidyr and dplyr approach:
library(dplyr)
library(tidyr)
x %>% gather(wk, val, -site) %>% # gather wk* columns into key-value pairs
extract(wk, 'wk', '(wk\\d+).*?') %>% # trim suffixes added by read.table
group_by(site, wk) %>%
summarise(mean_val = mean(val)) %>% # calculate grouped means
spread(wk, mean_val) # spread back into wk* columns
# Source: local data frame [4 x 3]
# Groups: site [4]
#
# site wk1 wk2
# (int) (dbl) (dbl)
# 1 1 5.333333 4
# 2 2 26.666667 20
# 3 3 3.333333 NA
# 4 4 NA 100
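
For comparison, a base R sketch of the same idea, assuming read.table has de-duplicated the headers to wk1, wk2, wk1.1, wk1.2 as described above (week and res are illustrative names):

```r
x <- read.table(text = "
site wk1 wk2 wk1 wk1
1 2 4 6 8
2 10 20 30 40
3 5 NA 2 3
4 100 100 NA NA",
header = TRUE)

# Strip the ".1", ".2" suffixes to recover the original week labels
week <- sub("\\..*", "", names(x)[-1])

# Split the value columns by week label and take row means within each piece
res <- data.frame(site = x$site,
                  sapply(split.default(x[-1], week), rowMeans))
res
```

As in the answers above, a row mean is NA whenever any of its inputs is NA; passing na.rm = TRUE to rowMeans would change that.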

rearrange specific rows into columns using dplyr

I am trying to rearrange rows into columns in a specific way (preferably using dplyr), but I don't really know where to start. I am trying to create one row for each person (Bill or Bob) and have all of that person's values on one row. So far I have
df<-data.frame(
Participant=c("bob1","bill1","bob2","bill2"),
No_Photos=c(1,4,5,6)
)
res<-df %>% group_by(Participant) %>% dplyr::summarise(phot_mean=mean(No_Photos))
which gives me:
Participant phot_mean
(fctr) (dbl)
1 bill1 4
2 bill2 6
3 bob1 1
4 bob2 5
GOAL:
mean_No_Photos_1 mean_No_Photos_2
bob 1 5
bill 4 6
Using tidyr and dplyr:
library(tidyr)
library(dplyr)
df %>% mutate(rep = as.numeric(gsub("[^0-9]", "", Participant)), # replaces the deprecated extract_numeric()
Participant = gsub("[0-9]", "", Participant)) %>%
group_by(Participant, rep) %>%
summarise(mean = mean(No_Photos)) %>%
spread(rep, mean)
Source: local data frame [2 x 3]
Participant 1 2
(chr) (dbl) (dbl)
1 bill 4 6
2 bob 1 5
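
To get the column names from the stated goal instead of 1 and 2, the prefix can be pasted on before spreading. A sketch under the assumption that the trailing digit is the repetition number (wide is an illustrative name):

```r
library(dplyr)
library(tidyr)

df <- data.frame(Participant = c("bob1", "bill1", "bob2", "bill2"),
                 No_Photos = c(1, 4, 5, 6),
                 stringsAsFactors = FALSE)

wide <- df %>%
  mutate(rep = paste0("mean_No_Photos_", gsub("[^0-9]", "", Participant)), # digits become the column suffix
         Participant = gsub("[0-9]", "", Participant)) %>%                  # remaining letters are the person
  group_by(Participant, rep) %>%
  summarise(mean = mean(No_Photos)) %>%
  spread(rep, mean)

wide
```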

Remove NAs from each variable (column) and combine cases

I have a dataset that I am cleaning up and have certain rows (observations) which I would like to combine. The best way to explain what I am trying to do is with the following example:
df<-data.frame(fruits=c("banana","banana","pineapple","kiwi"),cost=c(1,NA,2,3),weight=c(NA,1,2,3),stringsAsFactors = F)
df
cost<-df[,1:2]
weight<-df[,c(1,3)]
cost
weight
cost<-cost[complete.cases(cost),]
weight<-weight[complete.cases(weight),]
key<-data.frame(fruits=unique(df[,1]))
key
mydata<-merge(key,cost,by="fruits",all.x = T)
mydata<-merge(mydata,weight,by="fruits",all.x = T)
mydata
In the previous example I would like to keep the information from both variables (cost and weight) for bananas, but unfortunately it is in different records. I am able to accomplish this manually for one variable, but my actual dataset has a few dozen variables. I would like to know how I can accomplish the task above using dplyr, or apply over a set of columns.
We can also use the combo dplyr + tidyr:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -fruits) %>%
group_by(fruits) %>%
na.omit() %>%
spread(key, value)
## Source: local data frame [3 x 3]
## fruits cost weight
## (chr) (dbl) (dbl)
## 1 banana 1 1
## 2 kiwi 3 3
## 3 pineapple 2 2
EDIT
You might want to check @Frank's solution, which is shorter and uses dplyr only:
df %>%
group_by(fruits) %>%
summarise_each(funs(na.omit))
Using data.table, I would do something like
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)]), by = fruits]
# fruits cost weight
# 1: banana 1 1
# 2: pineapple 2 2
# 3: kiwi 3 3
A cleaner but probably slower option would be
setDT(df)[, lapply(.SD, na.omit), by = fruits]
# fruits cost weight
# 1: banana 1 1
# 2: pineapple 2 2
# 3: kiwi 3 3
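
The na.omit variants rely on each fruit having exactly one non-NA value per column. A base R sketch that tolerates a group with no value at all in some column, by taking the first non-NA value or NA if there is none (first_non_na is a hypothetical helper, not part of any package):

```r
df <- data.frame(fruits = c("banana", "banana", "pineapple", "kiwi"),
                 cost = c(1, NA, 2, 3),
                 weight = c(NA, 1, 2, 3),
                 stringsAsFactors = FALSE)

# Indexing an empty vector with [1] yields NA, so all-NA groups stay NA
first_non_na <- function(x) x[!is.na(x)][1]

# na.action = na.pass keeps rows containing NA so the helper can see them
combined <- aggregate(. ~ fruits, data = df,
                      FUN = first_non_na, na.action = na.pass)
combined
```

If a fruit has several non-NA values in a column, this silently keeps the first one, so it is worth checking that assumption on real data.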

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(table(df$Initials))) # table() also works when Initials is character rather than factor
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq.int(max(table(df$Initials))) # table() also works when Initials is character rather than factor
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
newdf <- rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
newdf
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows needed for each initial, combine those initials with NA values, then rbind the result to the data frame.
max(table(df$Initials)) gives the count for the initial with the most repeats, in this case 2 (for a). Subtracting each initial's own count, table(df$Initials), from that maximum gives a vector with the number of rows to add per initial. There's an added bonus to this method: by using table we automatically get a named vector.
We use the names of the new vector to know which initials to repeat, and its values to know how many times each should be repeated.
To preserve the class of the data, you can add newdf$data <- as.numeric(newdf$data).
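
Alternatively, a base R sketch that pads each group by indexing past its last row, which keeps data numeric instead of coercing it to character. It assumes the example df; n and padded are illustrative names:

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4),
                 stringsAsFactors = FALSE)

n <- max(table(df$Initials))  # size of the largest group

padded <- do.call(rbind, lapply(split(df, df$Initials), function(g) {
  out <- g[seq_len(n), ]         # row indices beyond nrow(g) produce all-NA rows
  out$Initials <- g$Initials[1]  # refill the group label on the padded rows
  out
}))
rownames(padded) <- NULL
padded
```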
