Better subsetting and counting values in a dataframe [duplicate] - r

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data frame with two columns and 70,000 rows. One column serves as an identifier for a household (column b in the example below). The other column numbers the individuals in the household from 1 to n, with some error (it could be 1, 2, 3 or 1, 4, 5); this is column a in the example below.
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in each household and puts the result in the proper column and row, but it takes several minutes on my actual data set, presumably due to its size. Is there a better way to get this information?
# a = person number within household, b = household identifier
fake.data <- data.frame(a = c(1, 1, 5, 6, 7, 1, 2, 3, 1, 2, 4),
                        b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
# one row per household, then count the distinct individuals in each
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x)
  length(unique(subset(fake.data, fake.data$b == x)$a)))

Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.
library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)
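Note that count() names the tally column n by default; recent dplyr versions accept count(b, name = "members") to match the original. An equivalent single step, assuming only the distinct count is needed, uses n_distinct():
library(dplyr)

# count distinct individuals (a) per household (b) directly
fake.cluster <- fake.data %>%
  group_by(b) %>%
  summarise(members = n_distinct(a))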

Here is an option using data.table
library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
# b members
#1: a 4
#2: b 3
#3: c 3
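For completeness, the same distinct count in base R with no packages; a quick sketch using the formula interface of aggregate():
# distinct individuals per household, base R only
aggregate(a ~ b, data = fake.data, FUN = function(x) length(unique(x)))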

Related

Creating a column based on two criteria using max values from another column

I've got a dataset of species observations over time and I am trying to calculate observation dates based on the max value of criteria:
Df <- data.frame(Sp = c(1, 1, 2, 2, 3, 3),
                 Site = c("A", "B", "C", "D"),
                 date = c('2021-1-1', '2021-1-2', '2021-1-3', '2021-1-4',
                          '2021-1-5', '2021-1-6', '2021-03-01', '2021-03-05'),
                 N = c(2, 5, 9, 4, 14, 7, 3, 11))
I want to create a new column called Dmax showing the date on which the value of N for a Sp at a given Site was at its max, so the column would look something like this:
Dmax=c("2021-1-2", "2021-1-2", '2021-1-2', '2021-1-2', '2021-1-5', '2021-1-5', "2021-03-05","2021-03-05")
So Dmax would show that for Sp 1 in site A the date in which N was max was "2021-1-2" and so on.
I've tried grouping by Site, Sp, and date and using mutate together with which.max(N), but it didn't work. I'd like to keep all my rows.
Any help is welcome.
Thanks!
From your desired output, it seems like you want the max date regardless of Site, so just group by Sp. Also, your sample data only has 6 values for Sp instead of 8, so I just assumed a 4th Sp:
library(dplyr)

Df |>
  group_by(Sp) |>
  mutate(Dmax = date[which.max(N)])
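For reference, a fully runnable sketch. The 8-row Df below is a guess at the intended data; the 4th Sp and the repeated Site letters are assumptions, since the posted vectors have mismatched lengths:
library(dplyr)

# assumed repair of the posted data: 8 rows, Sp 4 added, sites paired
Df <- data.frame(Sp   = c(1, 1, 2, 2, 3, 3, 4, 4),
                 Site = c("A", "A", "B", "B", "C", "C", "D", "D"),
                 date = c("2021-1-1", "2021-1-2", "2021-1-3", "2021-1-4",
                          "2021-1-5", "2021-1-6", "2021-03-01", "2021-03-05"),
                 N    = c(2, 5, 9, 4, 14, 7, 3, 11))

Df |>
  group_by(Sp) |>
  mutate(Dmax = date[which.max(N)]) |>
  ungroup()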

Merging dataframes in R - resulting dataframe is too large

I am trying to merge two dataframes in R, joining them by the one column that they share.
Here are screenshots of the two dataframes, and I am merging on the column "INC_KEY".
This is the code I have written to merge the two dataframes:
dp <- inner_join(d,p,by="INC_KEY")
d has 177156 observations, and p has 1641137 observations, but the final merged dataframe has 8416113 observations, which does not make sense to me. I have also tried changing the inner_join function above to the merge function, but I still get the same result. I am wondering how to fix this code so that the merged dataframe has a realistic number of observations - thanks so much for any help!
You most probably have duplicates in either d or p or both of them. Try keeping only one row for each unique INC_KEY value before joining.
library(dplyr)
dp <- inner_join(d %>% distinct(INC_KEY, .keep_all = TRUE),
                 p %>% distinct(INC_KEY, .keep_all = TRUE),
                 by = "INC_KEY")
This can happen if your INC_KEY is not a unique identifier. Here is a simplified example:
library(dplyr)
df1 <- data.frame(key = c("A", "B", "C", "A"),
                  val1 = 1:4)
df2 <- data.frame(key = c("A", "B", "C", "C", "B"),
                  val2 = 1:5)
inner_join(df1, df2, by = "key")
Joining, by = "key"
key val1 val2
1 A 1 1
2 B 2 2
3 B 2 5
4 C 3 3
5 C 3 4
6 A 4 1
Because there are two values of "A" in the key column in df1, both rows match the one row of df2 with "A". The one row in df1 with a key of "C" matches both rows with the key of "C" in df2. This is the expected behavior of an inner join with duplicated key values. The join returns all rows in the second data.frame that match each row in the first data.frame. If there are multiple matches, they are all returned.
If you want one row per INC_KEY, then you need to do something to your original data before the join, especially if the rows are not complete duplicates.
The key column INC_KEY has duplicates in at least one of your tables. inner_join will then output a table with extra rows, one per duplicate match, minus the rows whose INC_KEY is missing from either d or p.
If you expect your new table to have the same number of rows as table d, then you need to aggregate the information in table p first, grouped by INC_KEY. Then you can perform the inner_join, as sketched below.
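A sketch of that pre-aggregation, assuming p carries a numeric column to be summed per key (value is a hypothetical column name, not from the question):
library(dplyr)

# collapse p to one row per INC_KEY before joining
p_agg <- p %>%
  group_by(INC_KEY) %>%
  summarise(value = sum(value, na.rm = TRUE))  # 'value' is a placeholder

dp <- inner_join(d, p_agg, by = "INC_KEY")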

Left Join with no NA's generated [duplicate]

This question already has answers here:
updating column values based on another data frame in R
(3 answers)
Update columns by joining more than one columns
(2 answers)
r Replace only some table values with values from alternate table
(4 answers)
Conditional merge/replacement in R
(8 answers)
Closed 4 years ago.
Say I have two dataframes:
a <- data.frame(id = 1:5, nom = c("a", "b", "c", "d", "e"))
b <- data.frame(id = 3:1, nom = c("C", "B", "A"))
and I would like to join the two data frames such that the result still has two columns (id and nom), but where id = 4 and id = 5 the nom column retains the d and e values respectively. How do I achieve this? The solution I'd like looks like this:
  id nom
1  1   A
2  2   B
3  3   C
4  4   d
5  5   e
Essentially, I'd like to avoid the NA output and retain the old rows where a left join match does not exist.
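One common pattern for this (a sketch along the lines of the linked duplicates, assuming R >= 4.0 so nom is character rather than factor) is a left join followed by coalesce(), so b's nom wins wherever it exists:
library(dplyr)

a %>%
  left_join(b, by = "id", suffix = c(".a", ".b")) %>%
  mutate(nom = coalesce(nom.b, nom.a)) %>%  # prefer b's value, fall back to a's
  select(id, nom)

In dplyr 1.0.0 and later, rows_update(a, b, by = "id") expresses the same intent even more directly.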

Simple column splitting and joining using dplyr [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I was wondering how I might simply split a numerical column by a second grouping variable in a dataset, then cbind the resulting numerical columns. This would most likely be a simple extension of the separate function from tidyr. For example, changing X below:
Y <- rbind(2,5,3,6,3,2)
Z <- rbind("A", "A", "A", "B", "B", "B")
X <- data.frame(Y,Z)
Into
A B
2 6
5 3
3 2
Then ideally I'd extract the rowMeans into a new vector. (An issue also arises here when there is only one unique value in Z, since rowMeans requires at least 2 columns.)
This would need to expand with the number of unique values in Z; e.g., if Z had A, B, and C, then the final data.frame would need 3 columns. That would let me capture the row means from any number of groups in Z.
Thanks in advance,
Conal
Looks like a job for tidyr::spread.
library(dplyr)
library(tidyr)
X2 <- X %>%
  group_by(Z) %>%
  mutate(ID = 1:n()) %>%
  spread(Z, Y) %>%
  select(-ID)
X2
# A tibble: 3 x 2
A B
* <dbl> <dbl>
1 2 6
2 5 3
3 3 2
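spread() still works but has been superseded by pivot_wider() in tidyr 1.0.0+. A sketch of the same reshape, plus the requested row means (the row_mean column is my addition):
library(dplyr)
library(tidyr)

X %>%
  group_by(Z) %>%
  mutate(ID = row_number()) %>%   # within-group row index
  ungroup() %>%
  pivot_wider(names_from = Z, values_from = Y) %>%
  select(-ID) %>%
  mutate(row_mean = rowMeans(across(everything())))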

Frequency of data points by two variables in R [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have what I know must be a simple answer but I can't seem to figure it out.
Suppose I have a dataset:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12,16, NA, 11, 15,NA, 0,12, 5)
df <- data.frame(id,visit,test)
And I want to know the number of data points per visit so that the final output looks something like this:
visit test
A 3
B 3
C 1
How would I go about doing this? I've tried using table
table(df$visit, df$test)
but I get a full grid counting each combination of visit and test values.
I can sum each row by doing this:
sum(table(df$visit, df$test)[1,])
sum(table(df$visit, df$test)[2,])
sum(table(df$visit, df$test)[3,])
But I feel like there is an easier way and I'm missing it! Any help would be greatly appreciated!
aggregate from base R would be ideal for this. Group id by visit and count the length, removing the rows where test is NA with !is.na() before taking the length:
aggregate(x = df$id[!is.na(df$test)],
          by = list(df$visit[!is.na(df$test)]),
          FUN = length)
# Group.1 x
#1 A 3
#2 B 3
#3 C 1
How about:
data.frame(rowSums(table(df$visit, df$test)))
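For comparison, a dplyr sketch of the same count: drop the NA tests first, then tally rows per visit (name = "test" just matches the desired header):
library(dplyr)

df %>%
  filter(!is.na(test)) %>%
  count(visit, name = "test")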
