Left Join with no NA's generated [duplicate] - r

This question already has answers here:
updating column values based on another data frame in R
(3 answers)
Update columns by joining more than one columns
(2 answers)
r Replace only some table values with values from alternate table
(4 answers)
Conditional merge/replacement in R
(8 answers)
Closed 4 years ago.
Say I have two dataframes:
a <- data.frame(id = 1:5, nom = c("a", "b", "c", "d", "e"))
b <- data.frame(id = 3:1, nom = c("C", "B", "A"))
and I would like to join the two dataframes such that the result still has two columns (id and nom), but where id = 4 and id = 5 the nom column retains the d and e values respectively. How do I achieve this? The result I'd like has nom values A, B, C, d, e for id = 1 to 5.
Essentially, I'd like to avoid the NA output and retain the old rows where a left join match does not exist.
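One way to get this result in base R, offered as a minimal sketch and not as the answer from the linked duplicates: look up each id of a in b with match(), then overwrite nom only where a match exists. The names idx and result are illustrative and do not come from the question.

```r
a <- data.frame(id = 1:5, nom = c("a", "b", "c", "d", "e"))
b <- data.frame(id = 3:1, nom = c("C", "B", "A"))

# Position of each a$id within b$id; NA where b has no matching id
idx <- match(a$id, b$id)

# Work on a copy; as.character() guards against factor columns (R < 4.0)
result <- a
result$nom <- as.character(result$nom)

# Overwrite nom only for matched rows; unmatched rows keep their old value
matched <- !is.na(idx)
result$nom[matched] <- as.character(b$nom[idx[matched]])
result
```

Because unmatched rows are never touched, no NA is introduced, which is the behaviour a plain left join lacks.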

Related

Find which element (row and column) that differ between two data frames in R? [duplicate]

This question already has answers here:
Identifying specific differences between two data sets in R
(4 answers)
Select rows from one data.frame that are not present in a second data.frame
(14 answers)
Closed 10 months ago.
Is there a simple way to find which elements (row and column) differ between two data frames in R? I know I can find which rows differ using setdiff() or dplyr::anti_join(). I also know it is possible using other software, but I’d like to know if I can do this inside R (or RStudio).
Have a look at the waldo package.
> library(waldo)
> a = data.frame(x=1:3, y=c("a", "b", "c"))
> b = data.frame(x=1:3, y=c("a", "B", "c"))
> compare(a, b)
old vs new
           y
  old[1, ] a
- old[2, ] b
+ new[2, ] B
  old[3, ] c

`old$y`: "a" "b" "c"
`new$y`: "a" "B" "c"
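If only the row and column indices are needed, a base R sketch is also possible. It assumes both data frames have the same dimensions and column order and contain no NA values (which() silently drops NA comparisons). The name diffs is illustrative.

```r
# Same example data; stringsAsFactors = FALSE keeps y as character so the
# comparison also works on R versions before 4.0
a <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
b <- data.frame(x = 1:3, y = c("a", "B", "c"), stringsAsFactors = FALSE)

# Element-wise comparison gives a logical matrix; arr.ind = TRUE returns
# the row and column index of every TRUE cell
diffs <- which(a != b, arr.ind = TRUE)
diffs

# Column names of the differing cells
names(a)[diffs[, "col"]]
```

For the data above this reports a single differing cell, in row 2 of column y.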

Remove duplicates and its intrinsic value [duplicate]

This question already has answers here:
R: Extracting non-duplicated values from vector (not keeping one value for duplicates) [duplicate]
(2 answers)
Keep only non-duplicate rows based on a Column Value [duplicate]
(1 answer)
R: Removing duplicate elements in a vector [duplicate]
(4 answers)
Closed 1 year ago.
Suppose the next vector:
just_a_random_vector <- c("A", "B", "B", "C", "C", "D")
The idea is that if a value has duplicates, all of the duplicates and the value itself are dropped, to get something that looks like this:
# A D
Is there any way to get the above output?
Use duplicated in the forward and reverse directions to get two logical vectors, combine them with OR (|) so that an element is TRUE when either one flags it, negate (!) the result, and subset the vector:
just_a_random_vector[!(duplicated(just_a_random_vector)|
duplicated(just_a_random_vector, fromLast = TRUE))]
[1] "A" "D"
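To see why the forward and reverse passes are both needed, the two logical vectors can be inspected separately. This is only an illustrative breakdown of the one-liner above; the names fwd, bwd and keep are not part of the original answer.

```r
just_a_random_vector <- c("A", "B", "B", "C", "C", "D")

# TRUE for every occurrence after the first one of a value
fwd <- duplicated(just_a_random_vector)

# TRUE for every occurrence before the last one of a value
bwd <- duplicated(just_a_random_vector, fromLast = TRUE)

# A value that occurs more than once is flagged by at least one pass,
# so only values that occur exactly once survive the negation
keep <- !(fwd | bwd)
result <- just_a_random_vector[keep]
result
```

Using only the forward pass would still keep the first "B" and the first "C", which is why the reverse pass is combined with it.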
Another option is to use table to build a logical vector from the frequency counts, keeping only the values whose count equals 1:
just_a_random_vector[just_a_random_vector %in%
names(which(table(just_a_random_vector) == 1))]
[1] "A" "D"

Frequency of data points by two variables in R [duplicate]

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have what I know must be a simple answer but I can't seem to figure it out.
Suppose I have a dataset:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12,16, NA, 11, 15,NA, 0,12, 5)
df <- data.frame(id,visit,test)
And I want to know the number of data points per visit so that the final output looks something like this:
visit test
A 3
B 3
C 1
How would I go about doing this? I've tried using table
table(df$visit, df$test)
but I get a full grid of counts for every combination of visit and test value.
I can sum each row by doing this:
sum(table(df$visit, df$test)[1,])
sum(table(df$visit, df$test)[2,])
sum(table(df$visit, df$test)[3,])
But I feel like there is an easier way and I'm missing it! Any help would be greatly appreciated!
aggregate from base R would be ideal for this. Group id by visit and count the length of each group, first removing the rows where test is NA using !is.na():
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
# Group.1 x
#1 A 3
#2 B 3
#3 C 1
How about:
data.frame(rowSums(table(df$visit, df$test)))
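A further base R variant, sketched here under the assumption that "data points" means non-missing test values: tabulate visit only for the rows where test is present. The names counts and out are illustrative.

```r
id <- c(1, 1, 1, 2, 2, 3, 3, 4, 4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12, 16, NA, 11, 15, NA, 0, 12, 5)
df <- data.frame(id, visit, test)

# Keep only the visits whose test value is present, then tabulate them
counts <- table(df$visit[!is.na(df$test)])

# Reshape into the two-column layout asked for in the question
out <- data.frame(visit = names(counts), test = as.vector(counts))
out
```

This avoids building the full visit-by-test grid and gives the same counts as the two answers above.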

Better subsetting and counting values in a dataframe [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data frame with two columns and 70,000 rows. One column serves as an identifier for a household (column b in the example below). The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5); this is column a in the example below.
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in each household and puts them in the proper column and row. However, it takes several minutes on my actual data set, I assume due to its size. Is there a better way of getting this information?
fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x) length(unique(subset(fake.data, fake.data$b == x)$a)))
Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.
library(dplyr)
fake.cluster <- fake.data %>%
  distinct() %>%
  count(b)
Here is an option using data.table
library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
# b members
#1: a 4
#2: b 3
#3: c 3
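For comparison, the same count can be sketched in base R without extra packages by splitting a by household with tapply(). Whether it is faster than the sapply/subset loop on 70,000 rows is an assumption that has not been measured here; the name members is illustrative.

```r
fake.data <- data.frame(a = c(1, 1, 5, 6, 7, 1, 2, 3, 1, 2, 4),
                        b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))

# Split a by household b once, then count the distinct individuals per group
members <- tapply(fake.data$a, fake.data$b, function(x) length(unique(x)))

# Same layout as fake.cluster in the question
fake.cluster <- data.frame(b = names(members), members = as.vector(members))
fake.cluster
```

The grouping is done in a single pass, instead of re-filtering the whole data frame once per household as the original loop does.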

Removing rows by reference using data.table? [duplicate]

This question already has answers here:
How to delete a row by reference in data.table?
(7 answers)
Closed 8 years ago.
I am trying to figure out how to remove a group of rows from a dataset by reference. For example, with this data set:
testset <- data.table(date=as.Date(c("2013-07-02","2013-08-03","2013-09-04","2013-10-05","2013-11-06")),
yr = c(2013,2013,2013,2013,2013),
mo = c(07,08,09,10,11),
da = c(02,03,04,05,06),
plant = LETTERS[1:5],
product = as.factor(letters[26:22]),
rating = runif(25))
I want to remove all rows where the product is "y". I have no idea how to go about this.
You can use either of the following commands:
testset_new <- subset(testset,product!="y")
or
testset_new <- testset[testset$product!="y",]
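Note that both commands build a filtered copy and do not modify testset in place. As far as I know, data.table has no operation that deletes rows by reference (:= adds, updates, or removes columns only), so the usual idiom is to filter in i and reassign. A minimal sketch, assuming the data.table package is installed and using a smaller five-row stand-in with made-up rating values in place of the question's table:

```r
library(data.table)

# Smaller stand-in for the question's table (rating values are illustrative)
testset <- data.table(plant = LETTERS[1:5],
                      product = as.factor(letters[26:22]),
                      rating = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Filter in i and reassign; this creates a new table without the "y" rows
testset <- testset[product != "y"]

# Optionally drop the now-unused factor level "y" (this step IS by reference)
testset[, product := droplevels(product)]
testset
```

Inside the brackets, product can be referenced directly, so the testset$ prefix is not needed.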
