Suppose that I applied a treatment to some column values of a data frame like this:
id  animal  weight  height ...
1   dog     23.0
2   cat     NA
3   duck    1.2
4   fairy   0.2
5   snake   BAD
df <- data.frame(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"))
Suppose that the treatment has to be done in a separate table, and that it produces the following data frame, which is a subset of the original:
id animal weight
2 cat 2.2
5 snake 1.3
sub_df <- data.frame(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
Now I want to put it all back together again, so I use an operation like this:
> df %>%
anti_join(sub_df, by = c("id", "animal")) %>%
bind_rows(sub_df)
id animal weight
4 fairy 0.2
1 dog 23.0
3 duck 1.2
2 cat 2.2
5 snake 1.3
Is there some way to do this directly with join operations?
And in the case where the subset contains only the key columns and the treated variable (id, animal, weight), rather than all the variables of the original data frame (id, animal, weight, height), how could I assemble the subset back into the original data frame?
What you describe is a join operation in which you update some values in the original dataset. This is very easy to do with great performance using data.table because of its fast joins and update-by-reference concept (:=).
Here's an example for your toy data:
library(data.table)
setDT(df) # convert to data.table without copy
setDT(sub_df) # convert to data.table without copy
# join and update "df" by reference, i.e. without copy
df[sub_df, on = c("id", "animal"), weight := i.weight]
The data is now updated:
# id animal weight
#1: 1 dog 23.0
#2: 2 cat 2.2
#3: 3 duck 1.2
#4: 4 fairy 0.2
#5: 5 snake 1.3
You can use setDF to switch back to ordinary data.frame.
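For example, a quick sketch of that last step:
setDF(df)   # convert back to an ordinary data.frame in place (no copy)
class(df)
#[1] "data.frame"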
Remove the NAs first, then simply stack the tibbles:
bind_rows(filter(df, !is.na(weight)), sub_df)
Isn't dplyr::rows_update exactly what we need here? The following code should work:
df %>% dplyr::rows_update(sub_df, by = "id")
This should work as long as there is a unique identifier (one or multiple variables) for your datasets.
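If the key spans several columns, the same call works with a multi-column by (a quick sketch, not part of the original answer; rows_update() needs a reasonably recent dplyr):
df %>% dplyr::rows_update(sub_df, by = c("id", "animal"))  # update matching rows on the composite key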
For anyone looking for a solution to use in a tidyverse pipeline:
I run into this problem a lot, and have written a short function that uses mostly tidyverse verbs to get around this. It will account for the case when there are additional columns in the original df.
For example, if the OP's df had an additional 'height' column:
library(dplyr)
df <- tibble(id = seq(1:5),
animal = c("dog", "cat", "duck", "fairy", "snake"),
weight = c("23", NA, "1.2", "0.2", "BAD"),
height = c("54", "45", "21", "50", "42"))
And the subset of data we wanted to join in was the same:
sub_df <- tibble(id = c(2, 5),
animal = c("cat", "snake"),
weight = c("2.2", "1.3"))
If we use the OP's method alone (anti_join %>% bind_rows), it won't work because of the additional 'height' column in df. An extra step or two is needed.
In this case we could use the following function:
replace_subset <- function(df, df_subset, id_col_names = c()) {
  # work out which of the columns contain "new" data
  new_data_col_names <- colnames(df_subset)[!colnames(df_subset) %in% id_col_names]
  # complete df_subset with the extra columns from df
  df_sub_to_join <- df_subset %>%
    left_join(select(df, -all_of(new_data_col_names)), by = id_col_names)
  # drop the matching rows from df and bind the updated rows back on
  df_out <- df %>%
    anti_join(df_sub_to_join, by = id_col_names) %>%
    bind_rows(df_sub_to_join)
  return(df_out)
}
Now for the results:
replace_subset(df = df , df_subset = sub_df, id_col_names = c("id"))
## A tibble: 5 x 4
# id animal weight height
# <dbl> <chr> <chr> <chr>
#1 1 dog 23 54
#2 3 duck 1.2 21
#3 4 fairy 0.2 50
#4 2 cat 2.2 45
#5 5 snake 1.3 42
And here's an example using the function in a pipeline:
df %>%
replace_subset(df_subset = sub_df, id_col_names = c("id")) %>%
mutate_at(.vars = vars(c('weight', 'height')), .funs = ~as.numeric(.)) %>%
mutate(bmi = weight / (height^2))
## A tibble: 5 x 5
# id animal weight height bmi
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 1 dog 23 54 0.00789
#2 3 duck 1.2 21 0.00272
#3 4 fairy 0.2 50 0.00008
#4 2 cat 2.2 45 0.00109
#5 5 snake 1.3 42 0.000737
hope this is helpful :)
Related
I have a dataframe composed of 9 columns with more than 4000 observations. For this question I will present a simpler dataframe (I use the tidyverse library)
Let's say I have the following dataframe:
library(tidyverse)
df <- tibble(Product = c("Bread","Oranges","Eggs","Bananas","Whole Bread" ),
Weight = c(NA, 1, NA, NA, NA),
Units = c(2,6,1,2,1),
Price = c(1,3.5,0.5,0.75,1.5))
df
I want to replace the NA values in the Weight column with a number multiplied by the value in Units, depending on the word shown in the Product column. Basically, the rule is:
Replace NA in Weight with 2.5 * Units if Product contains the word "Bread", and with 1 * Units if Product contains the word "Eggs".
The thing is that I don't know how to code something like that in R. I tried the following code that a kind user gave me for a similar question:
df <- df %>%
mutate(Weight = case_when(Product == "bread" & is.na(Weight) ~ 0.25*Units))
But it doesn't work, and it doesn't take into account that the rule should also apply when the Product is "Whole Bread".
Does anyone have an idea?
Some of them are not exact matches, so use str_detect
library(dplyr)
library(stringr)
df %>%
mutate(Weight = case_when(is.na(Weight) &
str_detect(Product, regex("Bread", ignore_case = TRUE)) ~ 2.5 * Units,
is.na(Weight) & Product == "Eggs"~ Units, TRUE ~ Weight))
-output
# A tibble: 5 × 4
Product Weight Units Price
<chr> <dbl> <dbl> <dbl>
1 Bread 5 2 1
2 Oranges 1 6 3.5
3 Eggs 1 1 0.5
4 Bananas NA 2 0.75
5 Whole Bread 2.5 1 1.5
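For reference, roughly the same logic in base R with grepl (a sketch under the same assumptions, not from the original answer):
idx_bread <- is.na(df$Weight) & grepl("bread", df$Product, ignore.case = TRUE)
idx_eggs  <- is.na(df$Weight) & df$Product == "Eggs"
df$Weight[idx_bread] <- 2.5 * df$Units[idx_bread]  # bread rule: 2.5 per unit
df$Weight[idx_eggs]  <- 1 * df$Units[idx_eggs]     # eggs rule: 1 per unit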
I have two datasets: One of species in my study and how many times I observed them, and another larger dataset that is a broader database of observations.
I want to mutate a column in my short dataset for "lowest latitude observed" (or highest or mean) from values in the other dataset but I can't quite figure out how to match them in a mutate.
set.seed(1)
# my dataset. sightings isn't important for this, just important that the solution doesn't mess up existing columns.
fake_spp_df <- data.frame(
species = c("a","b","c","d",'e'),
sightings = c(5,1,2,6,3)
)
# broader occurrence dataset
fake_spp_occurrences <- data.frame(
species = rep(c("a","b","c","d",'f'),each=20), # notice spp "f" - not all species are the same between datasets
latitude = runif(100, min = 0, max = 80),
longitude = runif(100, min=-90, max = -55)
)
# so I know to find one species min, i could do this:
min(fake_spp_occurrences$latitude[fake_spp_occurrences$species == "a"])
# but I want to do that in a mutate()
# this was my failed attempt:
fake_spp_df %>%
mutate(lowest_lat = min(fake_spp_occurrences$latitude[fake_spp_occurrences$species == species])
)
desired result:
> fake_spp_df
species sightings lowest_lat max_lat median_lat
1 a 5 1.7 etc...
2 b 1 5.3
3 c 2 2.2
4 d 6 4.3
5 e 3 NA
thinking this could also be done with some kind of join or merge, but I'm not sure.
thanks!
Summarise the fake_spp_occurrences dataset and then perform the join.
library(dplyr)
fake_spp_occurrences %>%
group_by(species) %>%
summarise(lowest_lat = min(latitude),
max_lat = max(latitude),
median_lat = median(latitude)) %>%
right_join(fake_spp_df, by = 'species')
# species lowest_lat max_lat median_lat sightings
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 a 4.94 79.4 48.1 5
#2 b 1.07 74.8 35.7 1
#3 c 1.87 68.9 41.9 2
#4 d 6.74 76.8 38.2 6
#5 e NA NA NA 3
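The same result can be had keeping fake_spp_df on the left, if you prefer its column order (a small variation on the answer above):
fake_spp_df %>%
  left_join(fake_spp_occurrences %>%
              group_by(species) %>%
              summarise(lowest_lat = min(latitude),
                        max_lat = max(latitude),
                        median_lat = median(latitude)),
            by = "species")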
I'm new to R and have a complicated set of data, so I hope my explanation is correct. I have multiple data frames I need to use to perform a series of things. Here's one example. I have three data frames. One is a list of species names and corresponding codes:
>df.sp
Species Code
Picea PI
Pinus CA
Another is a list of sites with species abundance data for different locations (dir). Unfortunately, the order of the species is different.
>df.site
Site dir total t01 t02 t03 t04
2 Total PI CA AB T
2 N 9 1 5 na na
2 AB ZI PI CA
2 S 5 2 2 1 4
3 DD EE AB YT
3 N 6 1 1 5 3
3 AB YT EE DD
3 S 5 4 3 1 1
Then I also have a data frame of traits corresponding to the species:
>df.trait
Species leaft rootl
Picea 0.01 1.2
Pinus 0.02 3.5
An example of one thing I want to do is to get the average value of each trait (df.trait$leaft and df.trait$rootl) across all the species per site (df.site$Site) and per site direction (df.site$dir: N, S). So the result for the first row would be:
Site dir leaft rootl
2 N 0.015 2.35
I hope that makes sense. It is very complicated for me to think through how to go about it. I've attempted working from this post and this one (and many others) but got lost.
Thanks for the help. Really appreciated.
UPDATE: Here is a sample of the actual df.site (reduced) using dput:
> dput(head(df.site))
structure(list(Site = c(2L, 2L, 2L, 2L, 2L, 2L), dir = c("rep17316",
"N", "", "S", "", "SE"), total = c("Total", "9", "",
"10", "", "9"), t01 = c("PI", "4", "CA", "1", "SILLAC",
"3"), t02 = c("CXBLAN", "3", "ZIZAUR", "4", "OENPIL", "2"),
t03 = c("ZIZAPT", "1", "ECHPUR", "2", "ASCSYR", "2")), .Names = c("site", "dir", "total", "t01", "t02", "t03"), row.names = 2:7, class = "data.frame")
You're going to have to first wrangle your data into a much cleaner form. I'm assuming the structure that you dput above is consistent throughout your df.site dataframe; namely that rows are paired, the first of which specifies the species code, the second of which has a count (or some other collected data?).
Starting with df as the dataframe that you dput() above, I'll first simulate some data for the other two dataframes:
df.sp <- data.frame(Species = paste0("species",1:8),
Code = c("ECHPUR", "CXBLAN", "ZIZAPT",
"CAMROT", "SILLAC", "OENPIL",
"ASCSYR", "ZIZAUR"))
df.sp
#> Species Code
#> 1 species1 ECHPUR
#> 2 species2 CXBLAN
#> 3 species3 ZIZAPT
#> 4 species4 CAMROT
#> 5 species5 SILLAC
#> 6 species6 OENPIL
#> 7 species7 ASCSYR
#> 8 species8 ZIZAUR
df.trait <- data.frame(Species = paste0("species",1:8),
leaft = round(runif(8, max=.2), 2),
rootl = round(runif(8, min=1, max=4),1))
df.trait
#> Species leaft rootl
#> 1 species1 0.12 2.5
#> 2 species2 0.04 2.6
#> 3 species3 0.12 2.1
#> 4 species4 0.05 1.1
#> 5 species5 0.15 2.5
#> 6 species6 0.15 3.3
#> 7 species7 0.05 3.9
#> 8 species8 0.13 2.1
First, let's clean up df by taking the collected data from each second row and moving those values into a new set of columns:
library(dplyr)
df.clean <- df %>%
#for each row, copy the direction and total from the following row
mutate_at(vars(matches("dir|total")), lead) %>%
#create new columns for observed data and fill in values from following row
mutate_at(vars(matches("t\\d+$")),
.funs = funs(n = lead(.))) %>%
#filter to rows with species code in t01
filter(t01 %in% df.sp$Code) %>%
#drop "total" column (doesn't make sense after reshape)
select(-total)
df.clean
#> site dir t01 t02 t03 t01_n t02_n t03_n
#> 1 2 N ECHPUR CXBLAN ZIZAPT 4 3 1
#> 2 2 S CAMROT ZIZAUR ECHPUR 1 4 2
#> 3 2 SE SILLAC OENPIL ASCSYR 3 2 2
We now have two sets of corresponding columns that hold species codes and values respectively. To reshape the dataframe into long form we'll use melt() from the data.table package. See the responses to this question for other examples of how to do this.
library(data.table)
df.clean <- df.clean %>%
setDT() %>% #convert to data.table to use data.table::melt
melt(measure.vars = patterns("t\\d+$", "_n$"),
value.name = c("Code", "Count") ) %>%
#drop "variable" column, which isn't needed
select(-variable)
Finally, join your three dataframes:
#merge tables together
df.summaries <- df.clean %>%
left_join(df.sp) %>%
left_join(df.trait)
At this point you should be able to summarize your data by whatever groupings you are interested in using group_by and summarise.
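For instance, the per-site, per-direction trait means asked about in the question could then be computed like this (a sketch, assuming the joined table keeps the column names used above):
df.summaries %>%
  group_by(site, dir) %>%
  summarise(leaft = mean(leaft, na.rm = TRUE),
            rootl = mean(rootl, na.rm = TRUE))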
I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2') we can join with the distinct rows of the selected columns of the original data:
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R:
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species and they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then drop study and freq; if you need to keep them, you can again replace summarise with mutate and then use distinct with the .keep_all = TRUE argument.
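Spelled out, that variant looks like this (a short sketch of what is described above):
df1 %>% group_by(id, species) %>%
  mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
  distinct(id, species, .keep_all = TRUE)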
This question is similar to this one asked earlier but not quite. I would like to iterate through a large dataset (~500,000 rows) and for each unique value in one column, I would like to do some processing of all the values in another column.
Here is code that I have confirmed to work:
df = matrix(nrow=783,ncol=2)
counts = table(csvdata$value)
p = (as.vector(counts))/length(csvdata$value)
D = 1 - sum(p**2)
The only problem with it is that it returns the value D for the entire dataset, rather than returning a separate D value for each set of rows where ID is the same.
Say I had data like this:
How would I be able to do the same thing as the code above, but return a D value for each group of rows where ID is the same, rather than for the entire dataset? I imagine this requires a loop, and creating a matrix to store all the D values in with ID in one column and the value of D in the other, but not sure.
Ok, let's work with "In short, I would like whatever is in the for loop to be executed for each block of data with a unique value of "ID"".
In general you can group rows by values in one column (e.g. "ID") and then perform some transformation based on values/entries in other columns per group. In the tidyverse this would look like this
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(value.mean = mean(value))
## A tibble: 8 x 3
## Groups: ID [3]
# ID value value.mean
# <fct> <int> <dbl>
#1 a 13 12.6
#2 a 14 12.6
#3 a 12 12.6
#4 a 13 12.6
#5 a 11 12.6
#6 b 12 15.5
#7 b 19 15.5
#8 cc4 10 10.0
Here we calculate the mean of value per group, and add these values to every row. If instead you wanted to summarise values, i.e. keep only the summarised value(s) per group, you would use summarise instead of mutate.
library(tidyverse)
df %>%
group_by(ID) %>%
summarise(value.mean = mean(value))
## A tibble: 3 x 2
# ID value.mean
# <fct> <dbl>
#1 a 12.6
#2 b 15.5
#3 cc4 10.0
The same can be achieved in base R using one of tapply, ave, by. As far as I understand your problem statement there is no need for a for loop. Just apply a function (per group).
Sample data
df <- read.table(text =
"ID value
a 13
a 14
a 12
a 13
a 11
b 12
b 19
cc4 10", header = T)
Update
To conclude from the comments and chat, this should be what you're after.
# Sample data
set.seed(2017)
csvdata <- data.frame(
microsat = rep(c("A", "B", "C"), each = 8),
allele = sample(20, 3 * 8, replace = T))
csvdata %>%
group_by(microsat) %>%
summarise(D = 1 - sum(prop.table(table(allele))^2))
## A tibble: 3 x 2
# microsat D
# <fct> <dbl>
#1 A 0.844
#2 B 0.812
#3 C 0.812
Note that prop.table returns fractions and is shorter than your (as.vector(counts))/length(csvdata$value). Note also that you can reproduce your results for all values (irrespective of ID) if you omit the group_by line.
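As a tiny illustration of that shorthand (not from the original answer), prop.table(x) is simply x / sum(x):
x <- table(c("a", "a", "b"))
prop.table(x)   # same as x / sum(x)
#        a         b
#0.6666667 0.3333333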
A base R option would be
df1$value.mean <- with(df1, ave(value, ID))