Subtracting two datasets - r

I have 2 datasets. One is the parent dataset (A) and other one is a subset (B) of it. I want to create a dataset from A which does not contain rows from B. It should be something like
C=A-B
Both the datasets A and B have same number of columns and column names.

If B is an actual subset of A, you can use setdiff on rownames:
sset <- subset(mtcars,cyl==4)
mtcars[setdiff(rownames(mtcars),rownames(sset)),]

If you do not want to convert the rows to strings for the comparison, i.e. you want exact matches, you can try this:
a <- data.frame(t(matrix(1:12,3,4)))
b <- data.frame(t(matrix(7:21,3,5)))
a[!apply(a,1,FUN=function(y){any(apply(b,1,FUN=function(x){all(x==y)}))}),]

Something like the following might do the trick:
C <- A[!(apply(A, 1, toString) %in% apply(B, 1, toString)), ]
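As a small worked example of the string-key idea above, here is a minimal sketch on toy data. The data frames, the helper name row_key and the "\r" separator are assumptions made for illustration; the separator only has to be a character that never occurs in the data.

```r
# Toy parent data set A and a subset B of its rows (hypothetical data).
A <- data.frame(id = 1:5,
                grp = c("a", "b", "c", "d", "e"),
                stringsAsFactors = FALSE)
B <- A[c(2, 4), ]

# Hypothetical helper: build one string key per row by pasting the columns.
row_key <- function(df) do.call(paste, c(df, sep = "\r"))

# Keep the rows of A whose key does not appear among the keys of B.
C <- A[!(row_key(A) %in% row_key(B)), ]
C
```

Pasting the columns directly avoids the matrix conversion that apply() performs on a data frame, which may matter when the columns have mixed types.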

Related

Faster Alternative for looping in combination with If in R

I have a data frame with 2,000,000 + rows and 22 columns.
In three of the columns the entries are either 0, 1 or NA.
I want to have a column which has the sum of these three columns for every row, treating NA as 0.
Using a for loop is definitely way too slow.
Have you got any alternatives for me? Another idea was using mutate in a pipe, but I have problems selecting the columns that I want to add up by name.
First attempt:
for(i in 1:nrow(T12)){
if(is.na(T12$blue[i]) & is.na(T12$blue.y[i])) {
T12$blue[i] <- T12$blue.x[i]
}else if(is.na(T12$blue[i]) & is.na(T12$blue.x[i])){
T12$blue[i] <- T12$blue.y[i]
}else if(is.na(T12$blue[i]) & is.na(T12$blue.x[i]) & is.na(T12$blue.y[i]) )
T12[i,] <- NULL
}
Thank you!
I am going to assume that the columns you wish to add are the first three. If you need different columns, just change c(1,2,3) in the code below.
apply(T12[,c(1,2,3)], 1, sum, na.rm=TRUE)
Note: #27ϕ9 comments that a faster solution is
rowSums(T12[, c(1, 2, 3)], na.rm = TRUE)
You can first replace all the NAs with 0.
library(data.table)
df[is.na(df)] <- 0
setDT(df)[, newcol := a + b + c]
If your column names are a, b and c, you can try the code below:
within(T12, new <- rowSums(cbind(a,b,c),na.rm = TRUE))
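Since the question mentions trouble selecting the columns by name, here is a minimal sketch that passes a character vector of column names to rowSums. The toy values and the new column name blue_sum are assumptions; the column names blue, blue.x and blue.y are taken from the question's loop.

```r
# Toy stand-in for the real data frame (hypothetical values).
T12 <- data.frame(blue   = c(1, NA, 0),
                  blue.x = c(NA, NA, 1),
                  blue.y = c(1, 0, NA),
                  other  = c("p", "q", "r"))

# Select the three columns by name and sum each row, treating NA as 0.
cols <- c("blue", "blue.x", "blue.y")
T12$blue_sum <- rowSums(T12[, cols], na.rm = TRUE)
T12$blue_sum  # 2 0 1
```

rowSums is vectorised over the whole data frame, so it avoids the per-row overhead of both the for loop and apply.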

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two rows are the same as in my case, but the actual data frame looks a bit different - columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as 2-column matrix, I have no idea why it happens.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset referring to columns by number is impractical. How can I refer to columns by name in this function?
Using a tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
rowwise() %>%
mutate(Lev_dist=adist(x=A,y=B,ignore.case=TRUE))
You can overcome that problem by using mapply, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1
As per the adist function definition, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs "mad" with "cat" and "car" with "mug".
Just look at the main diagonal of the matrix.
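To make the diagonal remark concrete, here is a minimal sketch using the data from the question. The variable name full is an assumption; stringsAsFactors = FALSE is set so the columns are character, as in the asker's real data.

```r
A <- c("mad", "car")
B <- c("mug", "cat")
my_df <- data.frame(A, B, stringsAsFactors = FALSE)

# adist compares every element of x with every element of y,
# so this is a 2 x 2 matrix of all pairwise distances.
full <- adist(my_df$A, my_df$B, ignore.case = TRUE)

# The row-wise distances (A[i] versus B[i]) sit on the main diagonal.
my_df$dist <- diag(full)

# Sort by the new column, referring to it through the data frame.
my_df <- my_df[order(my_df$dist), ]
my_df
```

This computes all n x n pairs just to keep n of them, so for a large data frame the mapply version is likely the better choice.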

R get correct index using which() condition

I have to find an observation satisfying some criteria and then merge its index with another dataset. So I don't need the index within the subset of observations satisfying the condition, but the index that refers to all the observations.
For instance, I want to find the max(x1) given that x2>20 and then use this index in another dataset later. I need the right index, in other words:
dat <- data.frame(name= c("A","B","C","D"),
x1= c(1,2,3,4),
x2= c(10,20,30,40))
dat$name[which.max(dat$x1[dat$x2>20])]
[1] B
I want to get
[1] D
i.e. an index of 4, not 2.
Here's one way using data.table:
library(data.table)
dat <- as.data.table(dat)
which(dat[,name]==dat[x2>20,][which.max(x1),name])
You can do something similar using data frames, but it will be rather more verbose.
which (dat$name==dat$name[which(dat$x2>20)][which.max(dat$x1[which(dat$x2>20)])])
Note that this method depends on the assumption that name contains unique values for each row.
Just use max instead of which.max. Note that max returns a value of x1, not a row position, so indexing with it only works here because, after sorting by x1, the x1 values (1 to 4) coincide with the row positions. (Thanks #myk_raniu for clarifying)
dat <- dat[order(dat$x1),]
dat$name[max(dat$x1[dat$x2>20])]
#[1] D
The reason which.max doesn't give the right answer is that the filtered vector of x1 is shorter than dat$name, so there is no longer a 1:1 correspondence between their positions.
Try this instead
dat <- data.frame(name= c("A","B","C","D"),
x1= c(1,2,3,4),
x2= c(10,20,30,40))
dat$name[dat$x1==max(dat$x1[dat$x2>20])]
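If the row position itself is needed (4 rather than the name), one option is to keep the original positions of the filtered rows and map the result of which.max back through them. This is a minimal sketch; the variable names keep and idx are assumptions.

```r
dat <- data.frame(name = c("A", "B", "C", "D"),
                  x1 = c(1, 2, 3, 4),
                  x2 = c(10, 20, 30, 40))

# Positions, in the full data frame, of the rows that satisfy the condition.
keep <- which(dat$x2 > 20)

# which.max gives a position within the filtered vector;
# indexing keep with it translates that back to a row of dat.
idx <- keep[which.max(dat$x1[keep])]
idx            # 4
dat$name[idx]  # D
```

Unlike matching on the value of x1 or on name, this does not depend on either column being unique.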

Finding maximum value for column among a subset of a data frame

Given a data frame df with columns d, c, v, how do I find the value of d for the maximum value of v among the subset of records where c == "foo"?
I tried this:
df[df$v==max(df$v) & df$c == "foo","d"]
But I got:
character(0)
You can do as follows:
with(df, d[v== max(v[c=="foo"])])
EDITED:
If you want to get the value of d for all the levels of c:
library(plyr)
ddply(df, "c", subset, v==max(v))
While Manuel's answer will work most of the time, I believe a more correct version would be:
with(df, d[v== max(v[c=="foo"]) & c=="foo"])
Otherwise it's possible to match a row which has v==max but is not in fact a subset of c=="foo".
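The difference between these versions shows up on a small example. The data below are made up for illustration, and the names orig, loose and strict are assumptions.

```r
# Hypothetical data: the overall maximum of v (9) is in a "bar" row,
# and a "bar" row shares the "foo" maximum (5).
df <- data.frame(d = c("p", "q", "r", "s"),
                 c = c("foo", "bar", "foo", "bar"),
                 v = c(5, 5, 2, 9),
                 stringsAsFactors = FALSE)

# Attempt from the question: the global maximum is not a "foo" row.
orig <- df[df$v == max(df$v) & df$c == "foo", "d"]
orig    # character(0)

# Maximum taken within "foo", but rows matched across all groups.
loose <- with(df, d[v == max(v[c == "foo"])])
loose   # "p" "q"

# Maximum taken within "foo" and rows restricted to "foo".
strict <- with(df, d[v == max(v[c == "foo"]) & c == "foo"])
strict  # "p"
```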

How can I combine datasets in R?

I think my question is very simple.
dat1<-seq(1:100)
dat2<-seq(1:100)
how can I combine dat1 and dat2 and make it look like
dat3<-seq(1:200)
Thanks so much!
How do you want to combine dat1 and dat2? By rows or columns? I'd take a look at the help pages for rbind() (row bind), cbind() (column bind), or c(), which combines arguments to form a vector.
Let me start by a comment.
In order to create a sequence of numbers one can use the following syntax:
x <- seq(from=, to=, by=)
A shorthand for, e.g., x <- seq(from=1, to=10, by=1) is simply 1:10. So, your notation is a little bit weird...
On the other hand, you can combine two or more vectors using the c() function. Let us say, for example, that a <- c(1, 2) and b <- c(3, 4). Then c <- c(a, b) is the vector (1, 2, 3, 4).
There exist similar functions to combine data sets: rbind() and cbind().
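Applied to the vectors in the question, a minimal sketch could look like this. It assumes the goal is one vector of length 200 that holds dat1 followed by dat2; note that this is not the same as 1:200, since the values 1 to 100 appear twice.

```r
dat1 <- 1:100
dat2 <- 1:100

# c() concatenates its arguments into a single vector.
dat3 <- c(dat1, dat2)
length(dat3)  # 200

# rbind() and cbind() instead stack the vectors as rows or columns of a matrix.
dim(rbind(dat1, dat2))  # 2 100
dim(cbind(dat1, dat2))  # 100 2
```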
