How to compare a data frame with duplicates and a vector? - r

I have a data frame in which some ids appear more than once. I sampled these ids uniquely and now have a vector with the sampled ids. I need to create a logical vector that tells me which rows in the data frame have ids that also appear in my sample.
I have tried the match function, but it selects only the first appearance and I need all appearances.
I have also tried merge, but the dataset is too large, so there is not enough memory to do it.

You can use %in% to get a logical vector, and which together with %in% to get the row indices. Here is a reproducible example that contains duplicate IDs.
set.seed(1234)
df <- data.frame(id=sample(1:80, 100, replace=TRUE), b=rnorm(100))
mySample <- seq(1, 80, by=6)
# logical vector of length nrow(df)
myRows <- df$id %in% mySample
# row indices
myIndices <- which(df$id %in% mySample)

This is what you can do using match (as you were trying this function):
x <- match(df$id, mySample, nomatch = 0) > 0
This gives you a logical vector that is TRUE where df$id appears in mySample and FALSE otherwise.
To retrieve the respective indices:
which(x)
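The difference between the two directions of matching is easy to check on a tiny vector. The sketch below uses made-up values (ids and picked are illustrative names, not the data above) and shows that looking up the sample in the ids finds only first positions, while testing each id against the sample flags every duplicate.

```r
# Toy data: id 3 appears three times, id 5 twice
ids    <- c(3, 1, 3, 5, 2, 3, 5)
picked <- c(3, 5)

# match(picked, ids) gives only the first position of each sampled id
first_hits <- match(picked, ids)

# %in% tests every element of ids, so all duplicates are flagged
in_sample <- ids %in% picked

# match() with the arguments in this order is equivalent to %in%
via_match <- match(ids, picked, nomatch = 0) > 0

# row indices of all matching elements
rows <- which(in_sample)
```

Neither form builds a joined table the way merge does; both return a single vector with one element per row.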

Related

How do I replace specific cell values in dataframe using continuous (sequential) indexing?

I have two dataframes of equal dimensions.
One has a certain value in some cells (e.g. 'abc') whose positions I need to find. The other has entirely different values. I need to replace the values in the other data frame at the same positions as 'abc'.
Examples:
df1 <- data.frame('1'=c('abc','bbb','rweq','dsaf','cxc','rwer','anc','ewr','yuje','gda'),
'2'=c(NA,NA,'bbb','dsaf','rwer','dsaf','ewr','cxc','dsaf','cxc'),
'3'=c(NA,NA,'dsaf','abc','bbb','cxc','yuje',NA,'ewr','anc'),
'4'=c(NA,NA,'cxc',NA,'abc','anc',NA,NA,'yuje','rweq'),
'5'=c(NA,NA,'anc',NA,'abc',NA,NA,NA,'rwer','rwer'),
'6'=c(NA,NA,'rweq',NA,'dsaf',NA,NA,NA,'bbb','bbb'),
'7'=c(NA,NA,'abc',NA,'ewr',NA,NA,NA,'abc','abc'),
'8'=c(NA,NA,'abc',NA,'rweq',NA,NA,NA,'cxc','bbb'),
'9'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'anc',NA),
'10'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'rweq',NA))
df2 <- data.frame('1'=c('green','black','white','yelp','help','green','red','brown','green','crack'),
'2'=c(NA,NA,'black','yelp','green','yelp','brown','help','yelp','help'),
'3'=c(NA,NA,'yelp','green','black','help','green',NA,'brown','red'),
'4'=c(NA,NA,'help',NA,'green','red',NA,NA,'green','white'),
'5'=c(NA,NA,'red',NA,'green',NA,NA,NA,'green','green'),
'6'=c(NA,NA,'white',NA,'yelp',NA,NA,NA,'black','black'),
'7'=c(NA,NA,'green',NA,'brown',NA,NA,NA,'green','green'),
'8'=c(NA,NA,'green',NA,'white',NA,NA,NA,'help','black'),
'9'=c(NA,NA,NA,NA,'green',NA,NA,NA,'red',NA),
'10'=c(NA,NA,NA,NA,'green',NA,NA,NA,'white',NA))
I can find the sequential index of 'abc', but it returns a one-dimensional vector
which(df1 == 'abc')
#[1] 1 24 35 45 63 69 70 73 85 95
And I don't know how to replace values using this method.
The expected output is df2 with the values replaced by 'green' only at the same positions as the 'abc' values in df1.
But note that the 'green' values in df2 do not occur only at the same positions as 'abc' in df1.
I don't think your problem is best approached with the data in a data.frame, which introduces several complications. First, each variable (column) in the data frame is a factor with its own levels (this applies when the data frame is built with stringsAsFactors = TRUE, which was the default before R 4.0.0). Second, your code is making a comparison between a list (data.frame) and a factor (which is coerced into an atomic vector). The help page for the == operator states: "if the other is a list R attempts to coerce it to the type of the atomic vector". The help page also points out that factors get special handling in comparisons, where it first assumes you are comparing factor levels, which is what your code is doing.
I think you want to convert your data frames of identical dimensions to matrices first. If you need the result in a data.frame, convert it back afterwards, as shown here, but be aware that the factor levels may have changed.
# Starting with the values assigned to df1 and df2
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
index <- which(m1 == "abc")
m2[index] <- "abc"
df2 <- as.data.frame(m2)
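As a small illustration of why the matrix route works: which() on a matrix comparison returns column-major linear indices, and the same indices address the corresponding cells of any matrix with identical dimensions. The toy matrices and the replacement value "FOUND" are made up for this sketch.

```r
m1 <- matrix(c("abc", "x", NA, "abc", "y", "abc"), nrow = 3)
m2 <- matrix(c("green", "red", NA, "blue", "green", "black"), nrow = 3)

# NA == "abc" is NA, and which() drops NA, so missing cells are skipped
index <- which(m1 == "abc")

# The same linear indices select the corresponding cells of m2
m2[index] <- "FOUND"

# arr.ind = TRUE reports the same hits as (row, col) pairs
pos <- which(m1 == "abc", arr.ind = TRUE)
```

Cells of m2 that already held some other value, such as the "green" in row 2 of the second column, are left alone because they are not at a matching position.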
Here is another way. Learn about the *apply family in R: I think it is the most useful group of functions in this language, whatever you plan to do ;) Also note that a data.frame is of type 'list', so lapply visits it column by column; assigning to df1[] keeps the result a data.frame.
df1[] <- lapply(df1, function(frame, pattern, replace){ # for each frame = column:
frame <- as.character(frame) # work on plain strings rather than factor levels
matches <- which(frame %in% pattern) # positions in the column that equal the pattern
if(length(matches) > 0) # if at least one position matches,
frame[matches] <- replace # give it the value you want
return(frame) # return the modified column
}, pattern="abc", replace= "<whatYouWant>") # don't forget this part: the needed arguments!

Subset the remaining of a dataframe using another subset

I have a sample dataset. I've created a subset of the original data frame using some condition. Now I need to extract the remaining contents of the original sample data frame, except the subset created. How can I do this?
data("mtcars")
fulldf <- mtcars
subdf <- subset.data.frame(fulldf, subset = fulldf$disp < 100)
restdf <- subset.data.frame(fulldf, subset = <fulldf without subdf>)
There are a lot of questions on subsetting data frames in R, but I couldn't find one that satisfied my requirement.
Also the final solution need not necessarily be using subset.data.frame. Any method/package will do.
In base R, it is better to assign the logical condition to an object and then negate it with !
i1 <- fulldf$disp < 100
subdf <- subset.data.frame(fulldf, subset = i1)
restdf <- subset.data.frame(fulldf, subset = !i1)
Another option is to create a list of the two datasets with split
lst1 <- split(fulldf, i1)
If 'subdf' is created with multiple conditions (this is not clear from the question), one option is to add a sequence variable to the data and then subset with %in%
fulldf$ind <- seq_len(nrow(fulldf))
then, after the 'subdf' step,
restdf <- subset(fulldf, !ind %in% subdf$ind)
and remove the 'ind' column from both
restdf$ind <- NULL
subdf$ind <- NULL
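For the mtcars example specifically, the row names are unique, so they can stand in for the helper column. This is a sketch under that assumption; it would not work on data with duplicated row names.

```r
data("mtcars")
fulldf <- mtcars
subdf  <- subset(fulldf, disp < 100)

# Keep every row whose name does not occur in the subset
restdf <- fulldf[!rownames(fulldf) %in% rownames(subdf), ]

# split() returns both pieces at once, keyed by "FALSE" and "TRUE"
pieces <- split(fulldf, fulldf$disp < 100)
```

Here pieces[["TRUE"]] holds the same rows as subdf and pieces[["FALSE"]] the same rows as restdf.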

How to apply this code to all cells of dataframe in R

This line of code applies to the f_name column of my data frame and removes the rows whose f_name cell has more than 100 characters, but I want to apply it to all columns.
How do I do it?
subset(m, nchar(as.character(f_name)) <= 100)
If your data.frame is named dat, try the following.
It first creates a logical index inx with value TRUE for each column in which all elements have fewer than 100 characters. It then subsets the original data.frame, keeping only those columns.
inx <- sapply(dat, function(x) all(nchar(as.character(x)) < 100)) # as.character so factor columns work too
new_dat <- dat[which(inx)]
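The answer above drops whole columns. If the goal is instead to drop every row in which any cell exceeds the limit (one possible reading of the question), a sketch along the same lines could look like this. The data frame and the 5-character limit are made up to keep the example small.

```r
dat <- data.frame(
  f_name = c("ann", "bartholomew", "cy"),
  l_name = c("lee", "ng", "featherstonehaugh"),
  stringsAsFactors = FALSE
)
limit <- 5  # illustrative; the question uses 100

# One logical vector per column: is each cell within the limit?
within_limit <- lapply(dat, function(x) nchar(as.character(x)) <= limit)

# A row is kept only if it passes in every column
keep    <- Reduce(`&`, within_limit)
new_dat <- dat[keep, , drop = FALSE]
```

Only the first row survives here, because the second row fails on f_name and the third on l_name.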

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
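To see why this one-liner works: data == 1 yields a logical matrix, and rowSums() counts the TRUE values in each row. A small hand-made example (toy is not the question's data) makes the counts easy to verify by eye.

```r
toy <- data.frame(x = c(1, 1, 2, 1),
                  y = c(1, 3, 2, 1),
                  z = c(1, 2, 3, 3))

# Number of 1s in each row
ones_per_row <- rowSums(toy == 1)

all_three   <- toy[ones_per_row == 3, ]  # 1 in every column
exactly_two <- toy[ones_per_row == 2, ]  # 1 in any two columns
exactly_one <- toy[ones_per_row == 1, ]  # 1 in any single column
```

Because the count ignores which columns hold the 1s, no column has to be named in the condition.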
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the vector of row indices where a 1 is found in any variable. The tabulate function is then applied to these row indices and returns a vector in which each element represents a row and holds the number of 1s in that row; rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1])
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
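One detail worth checking with the tabulate approach: by default the result is only as long as the largest row index that was seen, so trailing rows without a 1 are not represented. Passing nbins pads the vector to the full number of rows. The sketch below uses a made-up four-row data frame to show the difference.

```r
toy <- data.frame(x = c(1, 2, 1, 3),
                  y = c(1, 2, 3, 2))

# Row index of every cell equal to 1
hits <- which(toy == 1, arr.ind = TRUE)[, "row"]

short  <- tabulate(hits)                     # stops at the last row with a 1
padded <- tabulate(hits, nbins = nrow(toy))  # one entry per row of toy

# With the padded counts, logical indexing lines up with the rows
two_ones <- toy[padded == 2, ]
```

With the padded vector, a comparison such as padded == 2 always has exactly one element per row, so no recycling can occur when it is used as a row index.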

Sort specific range of data frame by a column in R

I have a data.frame of size 8326x13. I would like to order it in parts by a specific column, e.g. order the range 1:1375 only by column A. Then I would like to put this ordered part back into the same data.frame at the correct place, 1:1375. Is it possible?
Thanks in advance.
Raúl.
Or, (using the dataset of useR)
indx <- rep(c(TRUE,FALSE), each=10) #create a logical index.
In this case the first 10 rows are ordered
data[indx,] <- data[order(data$A[indx]),]
Update
Or, instead of creating a logical index, extract the rows that need to be ordered and replace them with the ordered set
data[1:10,] <- data[order(data$A[1:10]),]
For your dataset, the index would be created as
indx <- rep(c(TRUE,FALSE), c(1375, 8326-1375))
As suggested by @JeremyS
A <- sample(1:100, 20)
B <- sample(letters[1:26],20)
data <- data.frame(A, B)
n <- 10 # you want range 1:n
lower <- data[(n+1):dim(data)[1], ] # split to two data.frame with lower and upper part
upper <- data[1:n,]
upper <- upper[order(upper$A),] # or order(upper[,m]), m is the column index
data.new <- rbind.data.frame(upper, lower)
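The snippets above rely on the range starting at row 1. For a range that starts elsewhere, the order has to be computed and applied within the same slice. This is a sketch with made-up data; rng is an arbitrary illustrative range.

```r
set.seed(42)
data <- data.frame(A = sample(1:100, 20), B = sample(letters, 20))
rng  <- 6:15   # rows to sort; everything else stays in place

before <- data  # copy kept only to compare afterwards

# order() returns positions relative to the slice, so index the slice with it
data[rng, ] <- data[rng, ][order(data$A[rng]), ]
```

Whole rows are moved, so each value of B stays paired with its original A, and rows outside rng are untouched.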
