I am examining conservation easement data from NCED. I have a data frame of parcels with some repeated IDs and owners. I want to collapse each repeated parcel ID into a single row with a count of the distinct owners, but my attempt below just returns a count of rows per ID/owner combination.
library(dplyr)

uniqueID <- 1:10
parcelID <- c('a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c')
owner <- c('owner1', 'owner1', 'owner1', 'owner2', 'owner3',
'owner2', 'owner2', 'owner2', 'owner3', 'owner1')
mydat1 <- data.frame(uniqueID, parcelID, owner)
numberOwners <- mydat1 %>% group_by(parcelID, owner) %>% tally()
# tally() here counts rows per parcelID/owner pair, not distinct owners per parcel
My desired output would be:
parcelID_grouped nOwners
1 a 3
2 b 1
3 c 2
Using dplyr, there are a couple of ways to do this:
library(dplyr)
mydat1 %>% distinct(parcelID, owner) %>% count(parcelID)
mydat1 %>% group_by(parcelID) %>% summarise(n = n_distinct(owner))
Both calls result in:
# parcelID n
# 1 a 3
# 2 b 1
# 3 c 2
Using data.table:
library(data.table)
setDT(mydat1)
mydat1[, uniqueID := NULL]               # drop the row identifier
mydat1 <- unique(mydat1)                 # keep distinct parcelID/owner pairs
mydat1[, nOwners := .N, by = parcelID]   # count those pairs per parcel
mydat1[, owner := NULL]
mydat1 <- unique(mydat1)                 # collapse to one row per parcel
setnames(mydat1, "parcelID", "parcelID_grouped")
You'll get the desired output:
parcelID_grouped nOwners
1: a 3
2: b 1
3: c 2
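If you prefer a one-step version, data.table also provides uniqueN(); a minimal sketch, assuming you start from the original mydat1 before the column deletions above:

library(data.table)

setDT(mydat1)
# count distinct owners per parcel in a single grouped call
mydat1[, .(nOwners = uniqueN(owner)), by = .(parcelID_grouped = parcelID)]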
I have a large database where there are rows that are partially duplicated. I'm trying to use a filter in dplyr with conditional statements to remove these partially duplicated rows.
Goal: I want to remove all rows where there are duplicate combinations of a1 and id1 with var1 equaling 0. To achieve this, I tried using a duplicated() call in my filter function along with conditional statements.
Issues: The code I've used below seems to ignore the last condition of var1 equaling zero. I tried two different ways of filtering, to no avail. Is there something wrong with my duplicated() call? Should I use distinct() instead?
library(dplyr)
a1 <- c('adam', 'adam', 'adam', 'megan', 'megan', 'megan', 'jen', 'jen', 'jen')
id1 <- c('a', 'a', 'b', 'a', 'b', 'b', 'a', 'b', 'c')
var1 <- as.numeric(c('0', '3.2', '3', '2.2', '1.1', '0', '1.2', '2.4','3.1'))
test_df <- data.frame(a1, id1, var1)
#code to get rid of duplicates
test_df2 <- test_df %>%
filter(!(duplicated(id1) & duplicated(a1) & var1 == 0))
#alternative code
test_df3 <- test_df
test_df3$new_id <- with(test_df3, paste0(a1, sep = "-", id1))
test_df3 <- test_df3 %>%
filter(!(duplicated(new_id) & var1 == 0))
With both attempts, only the (megan, b, 0) row is removed; the (adam, a, 0) row survives even though it is a duplicated a1/id1 combination with var1 equal to 0. The desired result keeps every row except (adam, a, 0) and (megan, b, 0).
We could use group_by and summarise; summing var1 within each a1/id1 pair drops the zero-valued duplicates:
library(dplyr)
test_df %>%
group_by(a1, id1) %>%
summarise(var1 = sum(var1))
a1 id1 var1
<chr> <chr> <dbl>
1 adam a 3.2
2 adam b 3
3 jen a 1.2
4 jen b 2.4
5 jen c 3.1
6 megan a 2.2
7 megan b 1.1
I was able to solve the question by using janitor's get_dupes() and then filtering that subset. I'm not sure why I can't achieve this using conditional statements in dplyr, but this is a hack that works well enough.
library(janitor)
library(dplyr)
a1 <- c('adam', 'adam', 'adam', 'megan', 'megan', 'jen', 'jen', 'jen')
id1 <- c('a', 'b', 'a','a', 'b', 'a', 'b', 'a')
var1 <- as.numeric(c('3.2', '2.7', '0','2', '1.1', '0', '2.2','3.1'))
var2 <- as.numeric(c('3.4', '3', '0','1.7', '1.2', '3', '0','3.3'))
test_df <- data.frame(a1, id1, var1, var2)
test_df$a1_id1 <- with(test_df, paste0(a1, sep = "-", id1))
#get all instances where there is a duplicated name and id
test_df2 <- test_df %>%
get_dupes(a1_id1)
#remove rows that have var1 as 0 and remove column called dupe_count
test_df3 <- test_df2 %>%
filter(var1 != 0) %>%
select(-dupe_count)
#Remove all instances of duplicate names
test_df4 <- test_df %>%
group_by(a1_id1) %>%
filter(n() == 1)
#combine the two df's created and bind together for the desired output.
test_df_updated <- dplyr::bind_rows(test_df3, test_df4)
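As an aside, the duplicated() attempts fail because duplicated(id1) and duplicated(a1) flag repeats of each column independently rather than repeats of the combination. A grouped filter expresses the intended condition directly; a minimal sketch on the original test_df, returning the same rows as the get_dupes approach (up to ordering):

library(dplyr)

test_df %>%
  group_by(a1, id1) %>%
  filter(!(n() > 1 & var1 == 0)) %>%  # drop var1 == 0 rows only when the a1/id1 pair repeats
  ungroup()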
I have a dataframe with dates and identifiers. I want to filter this dataframe down to rows that:

1) have dates consecutive with other dates in the original dataframe,
2) are not the first of a group of consecutive dates,
3) do not share an ID with a row that was first of a group of consecutive dates, unless their date lies before the date on which that ID became "first of a group", and
4) are deduplicated by ID.

For example
Date <- as.Date(c('2019.01.01', '2019.01.02', '2019.01.03', '2019.01.04',
                  '2019.01.10', '2019.01.11', '2019.01.12', '2019.01.13',
                  '2019.01.18', '2019.01.22', '2019.01.27', '2019.01.28',
                  '2019.01.30'), format = '%Y.%m.%d')
ID <- c('A', 'A', 'C', 'C', 'D', 'E', 'A', 'F', 'D', 'F', 'F', 'C', 'G')
df <- data.frame(Date, ID)
Date ID
2019.01.01 A
2019.01.02 A
2019.01.03 C
2019.01.04 C
2019.01.10 D
2019.01.11 E
2019.01.12 A
2019.01.13 F
2019.01.18 D
2019.01.22 F
2019.01.27 F
2019.01.28 C
2019.01.30 G
To end up with
Date ID
2019.01.03 C
2019.01.11 E
2019.01.13 F
Help would be greatly appreciated!
First we define the elements that are "first of a group":
library(dplyr)
library(lubridate)
first_of_groups <- df %>%
mutate(Date = as_date(Date)) %>%
group_by(grp_cond_3 = cumsum((Date - lag(Date, default = first(Date))) != 1)) %>%
filter(n() > 1) %>%
slice_min(Date)
We need this data.frame to remove these elements. Next we build up the main pipeline:
library(tidyr) # replace_na() below comes from tidyr

df %>%
mutate(Date = as_date(Date),
condition_1 = Date == lag(Date) + 1,
condition_2 = replace_na(condition_1, FALSE)) %>%
left_join(first_of_groups, by = "ID", suffix = c("", ".y")) %>%
mutate(condition_3 = is.na(Date.y) | Date < Date.y) %>%
ungroup() %>%
filter(condition_1, condition_2, condition_3) %>%
group_by(ID) %>%
slice(1) %>%
select(Date, ID) %>%
ungroup()
This returns
# A tibble: 3 x 2
Date ID
<date> <chr>
1 2019-01-03 C
2 2019-01-11 E
3 2019-01-13 F
Basically condition_1 and condition_2 encode the same check; condition_2 just replaces the NA produced in the first row with FALSE. We take each date and calculate the difference to the preceding date. If the difference equals 1 we know two things:
it's not the first element of a group of consecutive days
it has consecutive days
For the "first of groups" condition, we search for the rows in your dataset, whose difference with the preceeding row isn't 1. Assuming, that there aren't rows with the same date, we can build up a grouping number using cumsum from this information.
Next we filter for "real" groups, groups with more than one element. And we take the first element of those groups.
Using this first_of_groups data.frame we are able to build condition_3 by left joining it onto the original dataframe. condition_3 is met when either there is no match for the ID in first_of_groups (is.na(Date.y)) or the Date lies before the date on which that ID became "first of a group".
The final condition_4 is created by taking the first element per ID in the remaining data.frame.
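To see the grouping trick in isolation: consecutive dates produce gaps of exactly 1 day, so cumsum() over the gaps that aren't 1 assigns a run number to every row. A small sketch, assuming dplyr and lubridate are loaded:

d <- as_date(c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-10"))
cumsum((d - lag(d, default = first(d))) != 1)
# [1] 1 1 1 2  -> the first three dates form one run, the fourth starts a new one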
I have a data frame like this:
ID score
a 1
a 2
b 2
b 4
c 4
c 5
I want to reorder the rows so that the IDs cycle through "a, b, c", like this:
ID score
a 1
b 2
c 4
a 2
b 4
c 5
What I tried
> data <- read_csv(data)
> data <- factor(data$id, levels = c('a', 'b', 'c'))
This works for tables, so I tried it here, but it didn't work. Does anybody know if there is a way?
Instead of assigning the 'id' column to data (which replaces the whole data frame with just the 'id' values), the information should be used for ordering the rows. In base R, this can be done with:
data1 <- data[order(duplicated(data$ID)), ]  # first occurrences first, then repeats
row.names(data1) <- NULL
Or with dplyr (note that rowid() here comes from data.table):
library(dplyr)
library(data.table)
data %>%
arrange(rowid(ID))
library(dplyr)
data %>%
group_by(ID) %>%
mutate(r = row_number()) %>%
ungroup() %>%
arrange(r, ID, score) %>%
select(-r)
OR in base R
with(data, data[order(ave(seq(NROW(data)), ID, FUN = seq_along), ID, score), ])
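To unpack the base R one-liner: ave() with FUN = seq_along numbers each row within its ID, and that occurrence number becomes the primary sort key. A quick sketch:

ID <- c('a', 'a', 'b', 'b', 'c', 'c')
ave(seq_along(ID), ID, FUN = seq_along)
# [1] 1 2 1 2 1 2  -> occurrence number within each ID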
Consider the following example
data <- data_frame(name = c('A','B','C','C',NA,'D'))
> data
# A tibble: 6 × 1
name
<chr>
1 A
2 B
3 C
4 C
5 <NA>
6 D
Here, I know that the variable name actually maps to 'A' -> 'one' and 'B' -> 'two'. I would simply like to create a variable that gets the mapping value. Of course, in my original dataset I have many more cases to map.
Something that does not work is the following.
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one', name),
mapping = ifelse(name == 'B', 'two', name))
> data
# A tibble: 6 × 2
name mapping
<chr> <chr>
1 A A
2 B two
3 C C
4 C C
5 <NA> <NA>
6 D D
What is wrong here? What is the most efficient way to do so in dplyr?
Many thanks!
If you want to avoid nested ifelse, you should simply create a mapping data frame and join with it. (What actually goes wrong in your attempt: the second ifelse uses name rather than mapping as its fallback value, so it overwrites the 'one' produced by the first line.)
mapping_df <- data.frame(name = LETTERS, mapping = 1:26)  # LETTERS is R's built-in 'A' to 'Z'
left_join(data, mapping_df, by = "name")
data %>% mutate(mapping = recode(name, A="one", B="two"))
recode() may be handy when there aren't too many replacements.
For two values you could try something like:
data <- data %>%
mutate(mapping = ifelse(name == 'A', 'one',
ifelse(name == 'B', 'two', 'other')))
However you would be better off creating a separate data frame that contained the map and then using dplyr::left_join() to add it to your main df.
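A named lookup vector also scales well when there are many replacements; a minimal sketch (the map object here is illustrative), with coalesce() keeping unmapped names unchanged:

library(dplyr)

map <- c(A = "one", B = "two")
data %>% mutate(mapping = coalesce(unname(map[name]), name))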
Suppose I have a dataframe generated like this
dataframe <- data.frame(name = rep(c('A', 'B', 'C', 'D'), 25),
                        probe = rep(1:25, each = 4),
                        a = rnorm(100), b = rnorm(100) + 1, c = rnorm(100) + 5)
> head(dataframe)
name probe a b c
1 A 1 0.03394554 2.97384424 4.173368
2 B 1 1.64304498 2.67977648 5.027671
3 C 1 0.35266588 1.62455820 5.664635
4 D 1 -1.24197302 0.29907974 5.243112
5 A 2 -0.20330593 0.45405930 6.603498
6 B 2 -1.06909795 -0.02575508 4.318659
The samples are in the columns. Variables are in the rows.
I need to calculate the ratio (A+B)/(C+D) for every set of samples sharing the same probe, such as when probe == 1 or probe == 2.
I can group_by probe.
But it seems functions can only be applied to columns. How do I apply a function across the rows within each group?
Thanks for the help!
I'd reshape.
library(dplyr)
library(tidyr)
dataframe %>%
gather(variable, value, -name, -probe) %>%
spread(name, value) %>%
mutate(ratio = (A+B)/(C+D) )
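gather() and spread() still work but have been superseded in tidyr; the same reshape with the newer verbs, as a sketch:

dataframe %>%
  pivot_longer(c(a, b, c), names_to = "variable") %>%
  pivot_wider(names_from = name, values_from = value) %>%
  mutate(ratio = (A + B) / (C + D))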
Or we could use recast from reshape2. It is a convenient wrapper for melt/dcast. We add the new column 'ratio' after the reshape.
library(reshape2)
transform(recast(dataframe, measure.var = c('a', 'b', 'c'),
                 probe + variable ~ name, value.var = 'value'),
          ratio = (A + B) / (C + D))