Remove NAs from each variable (column) and combine cases - r

I have a dataset that I am cleaning up and have certain rows (observations) which I would like to combine. The best way to explain what I am trying to do is with the following example:
df <- data.frame(fruits = c("banana", "banana", "pineapple", "kiwi"),
                 cost = c(1, NA, 2, 3),
                 weight = c(NA, 1, 2, 3),
                 stringsAsFactors = FALSE)
df
cost <- df[, 1:2]
weight <- df[, c(1, 3)]
cost
weight
cost <- cost[complete.cases(cost), ]
weight <- weight[complete.cases(weight), ]
key <- data.frame(fruits = unique(df[, 1]))
key
mydata <- merge(key, cost, by = "fruits", all.x = TRUE)
mydata <- merge(mydata, weight, by = "fruits", all.x = TRUE)
mydata
In the example above, I would like to keep the information from both variables (cost and weight) for bananas, but unfortunately it is spread across different records. I can accomplish this manually for one variable, but my actual dataset has a few dozen variables. How can I do the task accomplished above using dplyr or by applying over a set of columns?

We can also use the dplyr + tidyr combo:
library(dplyr)
library(tidyr)
df %>%
  gather(key, value, -fruits) %>%
  group_by(fruits) %>%
  na.omit() %>%
  spread(key, value)
## Source: local data frame [3 x 3]
## fruits cost weight
## (chr) (dbl) (dbl)
## 1 banana 1 1
## 2 kiwi 3 3
## 3 pineapple 2 2
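For reference, gather() and spread() are superseded in tidyr 1.0+ by pivot_longer() and pivot_wider(); a sketch of the same pipeline with the newer verbs:
df %>%
  pivot_longer(-fruits, names_to = "key", values_to = "value") %>%
  na.omit() %>%
  pivot_wider(names_from = key, values_from = value)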
EDIT
You might want to check @Frank's solution, which is shorter and uses dplyr only:
df %>%
  group_by(fruits) %>%
  summarise_each(funs(na.omit))
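Note that summarise_each() has since been deprecated; with dplyr 1.0+ the equivalent uses across(). A sketch, assuming (as in this example) each group/column pair has exactly one non-NA value:
df %>%
  group_by(fruits) %>%
  summarise(across(everything(), na.omit))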

Using data.table, I would do something like
library(data.table)
setDT(df)[, lapply(.SD, function(x) x[!is.na(x)]), by = fruits]
# fruits cost weight
# 1: banana 1 1
# 2: pineapple 2 2
# 3: kiwi 3 3
A cleaner but probably slower option would be
setDT(df)[, lapply(.SD, na.omit), by = fruits]
# fruits cost weight
# 1: banana 1 1
# 2: pineapple 2 2
# 3: kiwi 3 3
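The same reshape is also possible in pure data.table via melt() and dcast(); a sketch, where the anonymous aggregate simply keeps the first non-NA value per cell:
dcast(melt(setDT(df), id.vars = "fruits"), fruits ~ variable,
      fun.aggregate = function(x) x[!is.na(x)][1])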

Related

Clustering similar strings based on another column in R

I have a large data frame that shows the distance between strings and their counts.
For example, in row 1 you see the distance between apple and pple, as well as the number of times I counted apple (counts_col1 = 100) and pple (counts_col2 = 2).
library(tidyverse)
df <- tibble(col1 = c("apple", "apple", "pple", "banana", "banana", "bananna"),
             col2 = c("pple", "app", "app", "bananna", "banan", "banan"),
             distance = c(1, 2, 3, 1, 1, 2),
             counts_col1 = c(100, 100, 2, 200, 200, 2),
             counts_col2 = c(2, 50, 50, 2, 20, 20))
df
df
#> # A tibble: 6 × 5
#> col1 col2 distance counts_col1 counts_col2
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 apple pple 1 100 2
#> 2 apple app 2 100 50
#> 3 pple app 3 2 50
#> 4 banana bananna 1 200 2
#> 5 banana banan 1 200 20
#> 6 bananna banan 2 2 20
Created on 2022-03-15 by the reprex package (v2.0.1)
Now I want to cluster the apples and the bananas based on the string that has the maximum number of counts, which is the apple (100) and the banana (200).
I want my data to look somehow like this
cluster  elements  sum_counts
apple    apple     152
NA       pple      NA
NA       app       NA
banana   banana    222
NA       bananna   NA
NA       banan     NA
The format of the output does not have to be like this. I am really struggling to break down this problem and cluster the groups.
Any help or comments are really appreciated!
You can try using random walk clustering from igraph:
count_df <- data.table::melt(
  data.table::as.data.table(df),
  measure = list(c("col1", "col2"), c("counts_col1", "counts_col2")),
  value.name = c("col", "counts")
) %>%
  select(col, counts) %>%
  unique()

df %>%
  igraph::graph_from_data_frame(directed = FALSE) %>%
  igraph::walktrap.community(weights = igraph::E(.)$distance) %>%
  # igraph::components() %>%
  igraph::membership() %>%
  split(names(.), .) %>%
  map_dfr(
    ~ tibble(col = .x) %>%
      semi_join(count_df, ., by = "col") %>%
      arrange(desc(counts)) %>%
      summarise(cluster = first(col), elements = list(col), sum_count = sum(counts))
  )
cluster elements sum_count
1 apple apple, app, pple 152
2 banana banana, banan, bananna 222
This works on this toy example, but I think your example is too simple and probably does not reflect your main problem. It might be even easier if you are interested in finding connected components (if two words are connected, they are in the same cluster); then you would need to replace walktrap.community with components.
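A minimal sketch of that connected-components variant, for reference (membership is read straight off the components object):
g <- igraph::graph_from_data_frame(df, directed = FALSE)
comp <- igraph::components(g)
split(names(comp$membership), comp$membership)  # one character vector per cluster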
Here is one approach. I initially add a group identifier for the sets (I presume you have this in your actual data). After reshaping to a longer dataset, I group by this id and identify the "word" that has the largest count. I then use an inner join between the initial df and this resulting set of key rows holding the largest-count word, summarize, and rename, pushing all the variants into a list column.
df <- df %>% mutate(id = c(1, 1, 1, 2, 2, 2))

df %>%
  inner_join(
    rbind(
      df %>% select(id, distance, col = col1, counts = counts_col1),
      df %>% select(id, distance, col = col2, counts = counts_col2)
    ) %>%
      group_by(id) %>%
      slice_max(counts) %>%
      distinct(col),
    by = c("col1" = "col")
  ) %>%
  group_by(col1) %>%
  summarize(variants = list(c(col1, cur_group()$col1)),
            total = min(counts_col1) + sum(counts_col2)) %>%
  rename_all(~ c("cluster", "elements", "sum_counts"))
# A tibble: 2 x 3
cluster elements sum_counts
<chr> <list> <dbl>
1 apple <chr [3]> 152
2 banana <chr [3]> 222
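If you'd rather see one row per variant than a list column, you can unnest afterwards; a sketch, assuming the pipeline's result is stored as res:
res %>% tidyr::unnest(elements)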
A similar approach in data.table (also depends on having that id column)
setDT(df)
df[rbind(
  df[, .(id, col = col1, counts = counts_col1)],
  df[, .(id, col = col2, counts = counts_col2)]
)[order(-counts), .SD[1], by = id], on = .(col1 = col)][
  , .(elements = list(c(col2, .BY$cluster)),
      sum_counts = min(counts_col1) + sum(counts_col2)),
  by = .(cluster = col1)]
cluster elements sum_counts
<char> <list> <num>
1: banana bananna,banan,banana 222
2: apple pple,app,apple 152

dplyr: include all elements in filter list, even if not in data set

df1
Row Taste Quantity
#1 Vanilla 3
#2 Chocolate 1
#3 Strawberry 6
I would like to filter the data using a vector (via c()) that contains more flavors. If a flavor in the vector doesn't exist in the Taste column, I would like a new row to be added for it.
df1 %>% filter(Taste %in% c("Chocolate", "Strawberry", "Banana"))
but this only returns the chocolate and strawberry rows. I would like it to return:
Row Taste Quantity
#2 Chocolate 1
#3 Strawberry 6
#4 Banana 0 (or could be NA)
Is there a way to append the items in the list to the results even if the data doesn't exist in df1?
# example data frame
df = read.table(text = "
Row Taste Quantity
1 Vanilla 3
2 Chocolate 1
3 Strawberry 6
", header=T)
# vector of tastes to have in output
taste_vector = c("Chocolate", "Strawberry", "Banana")
library(dplyr)
data.frame(taste_vector) %>%  # start with the vector of tastes you want to have
  left_join(df, by = c("taste_vector" = "Taste")) %>%  # join original data to see what was found and what wasn't
  mutate(Row = ifelse(is.na(Row), max(Row, na.rm = TRUE) + cumsum(is.na(Row)), Row))  # update the Row column
# taste_vector Row Quantity
# 1 Chocolate 2 1
# 2 Strawberry 3 6
# 3 Banana 4 NA
You can add mutate(Quantity = coalesce(Quantity, 0L)) if you don't want NAs in your Quantity column.
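Folded into the pipeline, that looks like this (a sketch using the same objects as above):
data.frame(taste_vector) %>%
  left_join(df, by = c("taste_vector" = "Taste")) %>%
  mutate(Row = ifelse(is.na(Row), max(Row, na.rm = TRUE) + cumsum(is.na(Row)), Row),
         Quantity = coalesce(Quantity, 0L))  # replace NA quantities with 0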
Using tidyverse (dplyr, forcats and tidyr)
First create a filter object (filter_vals) of the values you want to filter on. In a mutate (assuming the variable is not already a factor), we convert Taste to a factor and expand its levels with the values from the filter object. Next we use complete to expand the data.frame with the missing levels from the factor, setting empty values to 0. Finally, we filter the data.frame with the filter object.
library(tidyverse)
filter_vals <- c("Chocolate", "Strawberry", "Banana")
df1 %>%
  mutate(Taste = as_factor(Taste),
         Taste = fct_expand(Taste, filter_vals)) %>%
  complete(Taste, fill = list(Quantity = 0)) %>%
  filter(Taste %in% filter_vals)
# A tibble: 3 x 2
Taste Quantity
<fct> <dbl>
1 Chocolate 1
2 Strawberry 6
3 Banana 0
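In current tidyr you can also skip the factor step entirely, since complete() accepts the full set of values directly; a sketch, assuming df1 holds Taste and Quantity as above:
df1 %>%
  complete(Taste = filter_vals, fill = list(Quantity = 0)) %>%
  filter(Taste %in% filter_vals)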

Getting a count of specific values in a data frame that appear in another

This question may sound similar to others, but I hope it is different enough.
I want to take a specific list of values and count how often they appear in another list of values, where non-occurring values are returned as '0'.
I have a Data Frame (df1) with the following values:
Items <- c('Carrots', 'Plums', 'Pineapple', 'Turkey')
df1 <- data.frame(Items)
> df1
Items
1 Carrots
2 Plums
3 Pineapple
4 Turkey
And a second Data Frame (df2) that contains a column called 'Thing':
> head(df2,n=10)
ID Date Thing
1 58150 2012-09-12 Potatoes
2 12357 2012-09-28 Turnips
3 50788 2012-10-04 Oranges
4 66038 2012-10-11 Potatoes
5 18119 2012-10-11 Oranges
6 48349 2012-10-14 Carrots
7 23328 2012-10-16 Peppers
8 66038 2012-10-26 Pineapple
9 32717 2012-10-28 Turnips
10 11345 2012-11-08 Oranges
I know the word 'Turkey' only appears in df1, not in df2. I want to return a frequency table or count of the items in df1 that appear in df2, returning '0' for the count of Turkey.
How can I summarize values of one data frame column using the values from another? The closest I got was:
df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)
But this returns only the items found in both df1 and df2, so 'Turkey' gets excluded. So close!
> df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)
# A tibble: 3 x 2
Thing n
<fctr> <int>
1 Carrots 30
2 Pineapple 30
3 Plums 38
I want my output to look like this:
1 Carrots 30
2 Pineapple 30
3 Plums 38
4 Turkey 0
I am newish to R and completely new to dplyr.
I use this sort of thing all the time. I'm sure there's a more savvy way to code it, but it's what I got:
item <- vector()
count <- vector()
items <- list(unique(df1$Items))
for (i in 1:length(items)) {
  item[i] <- items[i]
  count[i] <- sum(df2$Thing == item)
}
df3 <- data.frame(cbind(item, count))
Hope this helps!
Stephen's solution worked with a slight modification: adding [i] to item at the end of the count[i] line. See below:
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))) {
  item[i] <- Items[i]
  count[i] <- sum(df2$Thing == item[i])
}
df3 <- data.frame(cbind(item, count))
> df3
item count
1 Carrots 30
2 Plums 38
3 Pineapple 30
4 Turkey 0
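A compact base-R alternative to the loop (a sketch): coerce Thing to a factor whose levels are exactly the wanted items, and table() then reports zeros for the absent ones:
table(factor(df2$Thing, levels = df1$Items))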
dplyr drops zero-count rows, and you have the added complication that the possible categories of Thing differ between your two datasets.
If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add 0 count rows.
I'm adding the factor levels from df1 to df2 using a convenience function from package forcats called fct_expand.
library(dplyr)
library(tidyr)
library(forcats)
df2 %>%
  mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>%
  count(Thing) %>%
  complete(Thing, fill = list(n = 0)) %>%
  filter(Thing %in% df1$Items)
A different approach is to aggregate df2 first, then right join with df1 (to pick up all rows of df1) and replace NA by zero.
library(dplyr)
df2 %>%
  count(Thing) %>%
  right_join(unique(df1), by = c("Thing" = "Items")) %>%
  mutate(n = coalesce(n, 0L))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Plums 0
3 Pineapple 1
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same approach in data.table:
library(data.table)
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
Edit 2019-06-22:
With development version 1.12.3, data.table gained an fcoalesce() function, so the above statement can be written as:
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][, N := fcoalesce(N, 0L)][]
If df2 is large and df1 contains only a few Items, it might be more efficient to join first and then aggregate:
library(dplyr)
df2 %>%
  right_join(unique(df1), by = c("Thing" = "Items")) %>%
  group_by(Thing) %>%
  summarise(n = sum(!is.na(ID)))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Pineapple 1
3 Plums 0
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same in data.table syntax:
library(data.table)
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Edit 2019-06-22: The above can be written more concisely by aggregating in a join:
setDT(df2)[setDT(df1), on = .(Thing = Items), .N, by = .EACHI]

complete.cases for group instead of observation?

If I have tidied data:
df <- expand.grid(Name = c("Sub1", "Sub2", "Sub3"), Vis = c("Yes", "No")) %>%
  mutate(KPR_mean = c(NA, 1, 3, 2, 3, 2), KPR_range = c(NA, 4, 4, 2, 6, 5)) %>%
  filter(complete.cases(.))
I'd like to filter out incomplete factor combinations, to be left with a full factorial model. Right now, I'm doing so as follows:
df %>%
  unite(KPR_mean_range, KPR_mean, KPR_range) %>%
  spread(Vis, KPR_mean_range) %>%
  filter(complete.cases(.)) %>%
  gather(Win, KPR_mean_range, -Name) %>%
  separate(KPR_mean_range, c("KPR_mean", "KPR_range"), sep = "_")
But that seems really verbose, and also difficult to extend once there are multiple factors and more variables. Is there a way to filter on a grouping variable, instead of a row? I.e., for each level of Name, if filter(complete.cases(.)) would remove a row from that group, then remove the entire group instead?
Expand the data to all factor combinations with complete from tidyr, group by whichever variable you want the complete cases in, and filter out groups with NAs:
df %>% complete(Vis, Name) %>% group_by(Name) %>% filter(!any(is.na(KPR_mean)))
# Source: local data frame [4 x 4]
# Groups: Name [2]
#
# Vis Name KPR_mean KPR_range
# (fctr) (fctr) (dbl) (dbl)
# 1 Yes Sub2 1 4
# 2 Yes Sub3 3 4
# 3 No Sub2 3 6
# 4 No Sub3 2 5
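A dplyr-only alternative (a sketch) counts levels instead of checking for NAs, keeping just the Names observed at every level of Vis:
df %>%
  group_by(Name) %>%
  filter(n_distinct(Vis) == n_distinct(df$Vis)) %>%
  ungroup()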
Here is one option with data.table. We convert the 'data.frame' to a 'data.table', specifying the key columns (setDT(df, ...)), do a cross join (CJ), and then, grouped by 'Name', keep the group's rows only if it has no NA values in 'KPR_mean'.
library(data.table)
setDT(df, key = c("Name", "Vis"))[CJ(Name, Vis, unique = TRUE)][,
  if (all(!is.na(KPR_mean))) .SD, Name]
# Name Vis KPR_mean KPR_range
#1: Sub2 Yes 1 4
#2: Sub2 No 3 6
#3: Sub3 Yes 3 4
#4: Sub3 No 2 5

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on column Initials). I am basically trying to pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt below failed completely. Does anybody have suggestions?
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))
attach(df)
maxrows <- max(table(Initials)) + 1
arr <- split(df, Initials)
lapply(arr, function(x) {
  toadd <- maxrows - dim(x)[1]
  replicate(toadd, x <- rbind(x, rep(NA, 1)))  # col 1 should keep the same Initials
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq.int(max(tabulate(df$Initials)))
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame, which requires an additional comma: DF[row_numbers, ].
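A minimal illustration of that difference:
DF <- data.frame(x = 1:3)
DT <- data.table(x = 1:3)
DF[1:2, ]  # data.frame: the comma is required to subset rows
DT[1:2]    # data.table: a single index subsets rows directly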
The analogue in dplyr is
my_rows <- seq.int(max(tabulate(df$Initials)))
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by Initials, add row numbers, ungroup, complete the row-number/Initials combinations, then remove the row numbers:
library(dplyr)
library(tidyr)
df %>%
  group_by(Initials) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  complete(Initials, row) %>%
  select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df) - 1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra Initials needed, combine the extras with NA values, and then rbind to the data frame.
max(table(df$Initials)) finds the count of the most frequent initial, here 2 (for "a"). Subtracting each initial's count (table(df$Initials)) from that maximum gives a vector of the necessary additions. There's an added bonus to this method: by using table we automatically get a named vector.
We use the names of that vector to know (1) which initials to repeat and (2) how many times to repeat them.
To preserve the class of the data column, you can add newdf$data <- as.numeric(newdf$data), where newdf is the result of the rbind above.
